Motivation
The problem: spectral line decomposition
Along any line of sight, radio-astronomical spectra -- HI 21-cm, CO, and other tracers -- sample emission from multiple gas clouds at different radial velocities. Within each cloud, random thermal motions (Maxwell-Boltzmann) and small-scale turbulent motions both produce Gaussian velocity distributions. The observed linewidth of a single cloud is , where is the thermal broadening and captures non-thermal contributions. In the optically thin limit the brightness temperature at each velocity is proportional to the column density of emitting gas, so each cloud contributes a Gaussian profile centred on its bulk velocity. The observed spectrum is therefore a linear superposition of Gaussians, one per kinematically distinct component along the line of sight.
Recovering these individual components -- their amplitudes, centroid velocities, and widths -- is essential for understanding the structure and kinematics of the interstellar medium (ISM). This is a blind decomposition problem: given a noisy 1D signal, determine the number of components and fit their parameters without prior knowledge.
Persistent homology for peak detection
Persistent homology detects peaks by analysing the topology of the function's upper-level sets as a threshold descends from the maximum.
The persistence of a peak -- the difference between its birth (height) and death (merge level) -- provides a natural measure of significance. Small-persistence features correspond to noise; large-persistence features are real peaks.
The beta parameter
Raw persistence is measured in the same units as the signal, so a fixed threshold cannot distinguish noise from signal across datasets with different noise levels. The main tuning parameter is , which sets the persistence threshold in units of noise:
where is estimated robustly from the data itself using signal-masked RMS estimation (following Riener et al. 2019, Sect. 3.1.1): positive-channel runs are masked, a MAD-based clip removes outliers, and the final is computed as the RMS of surviving noise channels.
Component validation
After persistence selects candidate peaks and Gaussians are fitted, each component must pass three additional quality checks before it is accepted:
| Check | Criterion | Default |
|---|---|---|
| SNR floor | Amplitude snr_min | 1.5 |
| Matched-filter SNR | mf_snr_min | 5.0 |
| Minimum width | FWHM fwhm_min_channels | 1.0 channel |
The matched-filter SNR is the optimal detection signal-to-noise for a Gaussian component in white noise. Because it scales as , narrow peaks must have proportionally higher amplitude to survive -- this rejects noise spikes without an ad-hoc width threshold.
Components that survive validation are kept only if they improve the AICc of the overall model.
These thresholds have sensible physical defaults that work across the datasets we have tested. In practice, is the only parameter that meaningfully affects decomposition results -- the validation thresholds act as safety nets rather than tuning knobs.
Benchmark: GaussPy
Several families of decomposition algorithms exist (see Introduction for a full overview). We benchmark phspectra against GaussPy / GaussPy+ because both are open-source, fully automated, and target the same use case -- blind Gaussian decomposition of radio spectral cubes. Their derivative-based approach also provides a useful contrast with topology-based peak detection.
GaussPy uses derivative spectroscopy: it computes the second and fourth derivatives of the spectrum and locates their zero-crossings, which mark candidate peak positions and boundaries. Because finite-difference derivatives amplify noise, GaussPy regularizes them via Total Variation (TV) regularization, solving an optimization problem that balances data fidelity against a smoothness penalty:
where is the regularized derivative, is the anti-derivative operator, is the observed spectrum, and controls the regularization strength. A two-phase decomposition uses for broad features and for narrower residual structure. Both parameters must be trained on synthetic spectra via supervised gradient descent that maximizes . GaussPy+ adds spatial coherence constraints for spectral cubes, but the dependence on trained regularization parameters remains.
The key difference is in how peaks are found: GaussPy smooths the signal and differentiates; phspectra reads the peak structure directly from the topology of the data. The table below summarizes the comparison:
| GaussPy | phspectra | |
|---|---|---|
| Peak detection | TV-regularized 2nd/4th derivative zero-crossings | Persistent homology (all scales simultaneously) |
| Tuning parameters | (regularization strength) + SNR cutoff | (persistence threshold in noise units) |
| Training | Supervised gradient descent on synthetic spectra | Not required -- default generalizes |
| Peak significance | Implicit (via regularization strength) | Explicit (topological persistence) |
| Validation | SNR threshold (trained) | SNR floor + matched-filter SNR + AICc (fixed defaults) |
Why this matters
For large-scale surveys (e.g. GALFA-HI, THOR, GASKAP), millions of spectra must be decomposed. A method that:
- Has a single tuning parameter () with sensible fixed defaults for everything else,
- Provides a principled significance measure grounded in topology,
- Naturally handles multi-scale features without smoothing,
- Is deterministic and reproducible,
could significantly improve both the reliability and scalability of spectral line decomposition.