# PHSpectra
Persistent homology for spectral line decomposition.
## The problem
Along any line of sight, radio-astronomical spectra -- HI 21-cm, CO, and other molecular tracers -- sample emission from multiple gas clouds at different radial velocities. In the optically thin limit each cloud contributes a Gaussian profile, and the observed spectrum is a linear superposition of these components. Recovering the individual amplitudes, centroid velocities, and widths is a blind decomposition problem: given a noisy 1-D signal, determine the number of components and fit their parameters without prior knowledge. For modern surveys that contain millions of spectra (GALFA-HI, THOR, GASKAP, GRS), the decomposition method must be both accurate and scalable.
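The superposition model can be made concrete with a short synthetic example. The velocity grid, component parameters, and noise level below are purely illustrative:

```python
import numpy as np

def gaussian(v, amp, v0, sigma):
    """One optically thin cloud: a Gaussian profile in radial velocity."""
    return amp * np.exp(-0.5 * ((v - v0) / sigma) ** 2)

rng = np.random.default_rng(0)
v = np.linspace(-50.0, 50.0, 500)        # velocity axis [km/s]
components = [(1.0, -10.0, 3.0),         # (amplitude, centroid, width)
              (0.6,  -4.0, 2.0),         # blended with the first
              (0.8,  15.0, 5.0)]
spectrum = sum(gaussian(v, *p) for p in components)
spectrum = spectrum + rng.normal(0.0, 0.05, v.size)   # additive noise
```

The blind decomposition task is to recover `components` from `spectrum` alone, without being told there are three of them.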
## Existing approaches
Several families of algorithms have been developed for this task:
| Approach | Examples | Key idea | Limitation |
|---|---|---|---|
| Derivative spectroscopy | GaussPy (Lindner et al. 2015), GaussPy+ (Riener et al. 2019), BTS (Clarke et al. 2018) | Smoothed derivatives locate zero-crossings that mark peak positions | Smoothing parameters (α₁, α₂) must be trained on synthetic spectra; smoothing can suppress closely spaced components |
| Semi-interactive | SCOUSEPY (Henshaw et al. 2016, 2019) | Human selects spectral averaging areas; automated fitting within regions | Requires user guidance; difficult to scale to millions of spectra |
| Bayesian model comparison | PySpecNest (Sokolov et al. 2020) | Full posterior sampling with evidence-ratio model selection | Rigorous but computationally expensive per spectrum |
| Brute-force optimisation | Pyspeckit (Ginsburg & Mirocha 2011), SPIF (Juvela et al. 2024) | Many random initialisations; keep best | Robustness depends on number of restarts; GPU hardware needed for survey scale |
| Hyperfine-specialised | MWYDIN (Rigby et al. 2024) | Shared excitation temperature and FWHM; BIC model selection | Limited to 3 velocity components; assumes shared physical conditions |
Common challenges across these methods include: choosing the number of components (model selection), sensitivity to initial guesses, parameter degeneracies in optically thick or blended lines, and the trade-off between computational cost and uncertainty quantification (Juvela et al. 2024).
## A topological alternative
phspectra takes a different approach. Instead of smoothing and differentiating, or sampling a posterior, it uses 0-dimensional persistent homology -- a tool from topological data analysis -- to identify peaks and rank them by their topological persistence: the height difference between a peak's birth and the level at which it merges with a taller neighbour. This persistence is a natural, scale-free measure of peak significance that requires no smoothing, no random initialisation, and no training data.
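The merge-based definition of persistence above can be sketched with a small union-find pass over the samples in decreasing order. This is an illustration of the idea, not phspectra's internal code:

```python
def persistence_1d(y):
    """0-dimensional persistence of a 1-D signal (superlevel filtration).

    Sweep a horizontal level down from the global maximum.  A local maximum
    is 'born' when the level first reaches it; when two growing components
    meet at a saddle, the peak with the lower birth 'dies' there.  Returns
    (index, birth, death) tuples sorted by persistence, largest first.
    """
    n = len(y)
    order = sorted(range(n), key=lambda i: -y[i])  # highest samples first
    parent = [-1] * n                              # -1 = not yet visited

    def find(i):                                   # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    death = {}                                     # peak index -> merge level
    for i in order:
        seen = [j for j in (i - 1, i + 1) if 0 <= j < n and parent[j] != -1]
        parent[i] = i
        death[i] = None                            # tentatively a new peak
        roots = {find(j) for j in seen} | {i}
        if len(roots) > 1:
            survivor = max(roots, key=lambda r: y[r])
            for r in roots - {survivor}:
                death[r] = y[i]                    # dies at the merge level
                parent[r] = survivor
    floor = min(y)                                 # the global maximum never merges
    pairs = [(p, y[p], floor if d is None else d)
             for p, d in death.items() if d is None or y[p] > d]
    pairs.sort(key=lambda t: t[1] - t[2], reverse=True)
    return pairs

y = [0.0, 1.0, 0.3, 2.0, 0.0]
print(persistence_1d(y))   # → [(3, 2.0, 0.0), (1, 1.0, 0.3)]
```

Note that no smoothing scale appears anywhere: every peak at every scale gets a persistence value in a single pass, and significance ranking falls out of the sort.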
The single tuning parameter is β, which sets the minimum persistence threshold in units of the estimated noise σ:

persistence > β · σ

A peak is retained only if its persistence exceeds βσ, i.e. only if it is at least β-sigma significant. The default β works across both real and synthetic data; sweeping β from 2.0 to 4.5 changes the result by less than 0.09 (see Beta sensitivity).
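A sketch of how such a threshold might be applied, assuming a MAD-based noise estimate on channel differences and `(index, birth, death)` peak tuples. The function names are illustrative, not phspectra's API:

```python
import numpy as np

def estimate_noise(spectrum):
    """Robust sigma from channel-to-channel differences.

    Differencing removes the smooth signal; 1.4826 converts the median
    absolute deviation to a Gaussian sigma, and 1/sqrt(2) undoes the
    variance doubling introduced by the difference.
    """
    d = np.diff(spectrum)
    return 1.4826 * np.median(np.abs(d - np.median(d))) / np.sqrt(2.0)

def significant_peaks(pairs, sigma, beta):
    """Keep (index, birth, death) peaks with persistence above beta * sigma."""
    return [p for p in pairs if p[1] - p[2] > beta * sigma]

# pure-noise sanity check: the estimator should recover sigma ~ 0.1
rng = np.random.default_rng(0)
sigma = estimate_noise(rng.normal(0.0, 0.1, 10_000))

# hypothetical persistence pairs: one strong peak, one marginal bump
pairs = [(120, 2.0, 0.0), (310, 0.4, 0.2)]
kept = significant_peaks(pairs, sigma, beta=3.0)   # marginal bump rejected
```

Estimating σ from differenced channels rather than the raw spectrum keeps the estimate robust even when real emission fills much of the band.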
## How phspectra compares
| | Derivative methods | Bayesian / brute-force | phspectra |
|---|---|---|---|
| Peak detection | Smoothed derivative zero-crossings | Random initialisation + optimisation | Persistent homology (all scales simultaneously) |
| Tuning parameters | α₁, α₂ (regularisation strengths) | Priors, number of restarts | β (persistence threshold in noise units) |
| Training | Supervised, on synthetic spectra | Not required, but slow per spectrum | Not required -- the default β generalises |
| Determinism | Deterministic | Stochastic (random seeds / MCMC) | Deterministic |
| Scalability | Survey-scale on CPU | Survey-scale with GPU (SPIF) | Survey-scale on serverless (AWS Lambda); ~2.3M spectra for ~$40 |
Persistent homology provides principled initial guesses -- peak locations, amplitudes, and widths read directly from the persistence diagram -- so the subsequent least-squares fit converges reliably without random restarts. The result is a method that is training-free, deterministic, and scalable without GPU infrastructure.
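The final fitting stage can be sketched as follows, assuming numpy/scipy; the initial-guess vector `p0` stands in for what the persistence diagram would supply (position and height directly, width from a rough scale estimate), and all parameter values are illustrative:

```python
import numpy as np
from scipy.optimize import curve_fit

def multi_gaussian(v, *params):
    """Sum of Gaussian components; params = (amp, v0, sigma) per component."""
    model = np.zeros_like(v)
    for amp, v0, sig in zip(params[0::3], params[1::3], params[2::3]):
        model = model + amp * np.exp(-0.5 * ((v - v0) / sig) ** 2)
    return model

rng = np.random.default_rng(1)
v = np.linspace(-30.0, 30.0, 400)
truth = (0.9, -5.0, 3.0, 0.5, 8.0, 4.0)       # two blended components
y = multi_gaussian(v, *truth) + rng.normal(0.0, 0.02, v.size)

# initial guesses of the kind persistent homology provides: close enough
# that a single deterministic least-squares run converges
p0 = [0.85, -5.5, 2.5, 0.45, 7.5, 3.5]
popt, pcov = curve_fit(multi_gaussian, v, y, p0=p0)
```

Because `p0` already sits near the true optimum, no random restarts are needed and the whole pipeline stays deterministic.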
## Quick links
- Motivation -- Why topology instead of derivatives?
- Persistent Homology -- Mathematics, algorithm, and integration in phspectra
- Data Sources -- Surveys and catalogs used for benchmarking
- GRS survey -- Full-survey decomposition of the Galactic Ring Survey