Motivation

The problem: spectral line decomposition

Along any line of sight, radio-astronomical spectra -- HI 21-cm, $^{13}$CO, and other tracers -- sample emission from multiple gas clouds at different radial velocities. Within each cloud, random thermal motions (Maxwell-Boltzmann) and small-scale turbulent motions both produce Gaussian velocity distributions. The observed linewidth of a single cloud is $\sigma = \sqrt{\sigma_\mathrm{th}^2 + \sigma_\mathrm{turb}^2}$, where $\sigma_\mathrm{th} = \sqrt{k_B T / m}$ is the thermal broadening and $\sigma_\mathrm{turb}$ captures non-thermal contributions. In the optically thin limit the brightness temperature at each velocity is proportional to the column density of emitting gas, so each cloud contributes a Gaussian profile centred on its bulk velocity. The observed spectrum is therefore a linear superposition of Gaussians, one per kinematically distinct component along the line of sight.

Recovering these individual components -- their amplitudes, centroid velocities, and widths -- is essential for understanding the structure and kinematics of the interstellar medium (ISM). This is a blind decomposition problem: given a noisy 1D signal, determine the number of components and fit their parameters without prior knowledge.
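Such a spectrum is easy to simulate, which is also how decomposition methods are usually tested. The sketch below builds a synthetic three-component spectrum as a sum of Gaussians plus white noise; the component values and the velocity grid are made up for illustration:

```python
import numpy as np

def gaussian(v, amp, v0, sigma):
    """Single Gaussian component: amplitude, centroid velocity, width."""
    return amp * np.exp(-0.5 * ((v - v0) / sigma) ** 2)

v = np.linspace(-50.0, 50.0, 512)            # velocity axis (km/s), illustrative
components = [(2.0, -12.0, 3.0),             # (amp, v0, sigma) -- made-up values
              (1.2, 0.0, 8.0),
              (0.8, 15.0, 2.0)]

rng = np.random.default_rng(42)
spectrum = sum(gaussian(v, *p) for p in components)
spectrum += rng.normal(0.0, 0.1, v.size)     # white noise, sigma_rms = 0.1
```

The blind-decomposition task is to recover the three `(amp, v0, sigma)` triples, and the number 3 itself, from `spectrum` alone.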

Persistent homology for peak detection

Persistent homology detects peaks by analysing the topology of the function's upper-level sets as a threshold descends from the maximum.

The persistence of a peak -- the difference between its birth (height) and death (merge level) -- provides a natural measure of significance. Small-persistence features correspond to noise; large-persistence features are real peaks.
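A minimal sketch of this sweep for a 1D array (a standard union-find formulation, not the phspectra implementation): samples are processed from highest to lowest, each local maximum is born at its height, and a component dies when it merges into one born higher.

```python
import numpy as np

def persistence_peaks(y):
    """Return (index, persistence) pairs, sorted by descending persistence.

    Sweeps a threshold downward through the samples; each local maximum is
    'born' at its height and 'dies' when its connected component merges with
    one born at a greater height. Non-maximum samples get persistence 0.
    """
    n = len(y)
    order = np.argsort(-y, kind="stable")   # indices from highest to lowest
    parent = [-1] * n                       # -1 = not yet swept
    birth = {}                              # index -> height at birth
    pers = {}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    for i in order:
        parent[i] = i
        birth[i] = y[i]
        for j in (i - 1, i + 1):            # 1D: only left/right neighbours
            if 0 <= j < n and parent[j] != -1:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # the component with the lower birth dies at this level
                    lo, hi = (ri, rj) if birth[ri] < birth[rj] else (rj, ri)
                    pers[lo] = birth[lo] - y[i]
                    parent[lo] = hi
    # the surviving component belongs to the global maximum
    root = find(order[0])
    pers[root] = y[order[0]] - y[order[-1]]
    return sorted(pers.items(), key=lambda kv: -kv[1])
```

In practice one keeps only pairs whose persistence exceeds the noise-scaled threshold described in the next section.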

The beta parameter

Raw persistence is measured in the same units as the signal, so a fixed threshold cannot distinguish noise from signal across datasets with different noise levels. The main tuning parameter is $\beta$, which sets the persistence threshold in units of noise:

$$\pi_{\min} = \beta \times \sigma_{\mathrm{rms}}$$

where $\sigma_\mathrm{rms}$ is estimated robustly from the data itself using signal-masked RMS estimation (following Riener et al. 2019, Sect. 3.1.1): positive-channel runs are masked, a MAD-based clip removes outliers, and the final $\sigma$ is computed as the RMS of the surviving noise channels.
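A simplified sketch of such a noise estimate (MAD-based clipping only, omitting the positive-run masking step of the full Riener et al. 2019 procedure; function name is illustrative):

```python
import numpy as np

def estimate_rms(spectrum, clip=3.0, max_iter=5):
    """Robust noise estimate: iteratively clip channels beyond `clip`
    MAD-based sigmas, then take the RMS of the surviving channels."""
    x = np.asarray(spectrum, dtype=float)
    mask = np.ones(x.size, dtype=bool)
    for _ in range(max_iter):
        med = np.median(x[mask])
        mad = np.median(np.abs(x[mask] - med))
        sigma = 1.4826 * mad                 # MAD -> Gaussian sigma
        new_mask = np.abs(x) < clip * sigma  # drop likely signal channels
        if new_mask.sum() == mask.sum():     # converged
            break
        mask = new_mask
    return float(np.sqrt(np.mean(x[mask] ** 2)))
```

The persistence threshold is then simply `pi_min = beta * estimate_rms(spectrum)`.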

Component validation

After persistence selects candidate peaks and Gaussians are fitted, each component must pass three additional quality checks before it is accepted:

| Check | Criterion | Default |
| --- | --- | --- |
| SNR floor | Amplitude $\geq$ `snr_min` $\times\,\sigma_\mathrm{rms}$ | 1.5 |
| Matched-filter SNR | $(A_i / \sigma_\mathrm{rms})\,\sqrt{\sigma_i}\,\pi^{1/4} \geq$ `mf_snr_min` | 5.0 |
| Minimum width | FWHM $\geq$ `fwhm_min_channels` | 1.0 channel |

The matched-filter SNR is the optimal detection signal-to-noise for a Gaussian component in white noise. Because it scales as $\sqrt{\sigma}$, narrow peaks must have proportionally higher amplitude to survive -- this rejects noise spikes without an ad-hoc width threshold.
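The three checks can be sketched as follows (illustrative names, not the phspectra API; `sigma_chan` is assumed to be the component width in channels, matching $\sigma_i$ in the table):

```python
import numpy as np

FWHM_PER_SIGMA = 2.0 * np.sqrt(2.0 * np.log(2.0))   # ~2.3548

def passes_validation(amp, sigma_chan, sigma_rms,
                      snr_min=1.5, mf_snr_min=5.0, fwhm_min_channels=1.0):
    """Apply the three quality checks to one fitted Gaussian component."""
    if amp < snr_min * sigma_rms:                    # 1. SNR floor
        return False
    # 2. matched-filter SNR for a Gaussian in white noise
    mf_snr = (amp / sigma_rms) * np.sqrt(sigma_chan) * np.pi ** 0.25
    if mf_snr < mf_snr_min:
        return False
    return FWHM_PER_SIGMA * sigma_chan >= fwhm_min_channels   # 3. min width
```

Note how a bright but very narrow component fails check 2 even when it clears the plain SNR floor.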

Components that survive validation are kept only if they improve the AICc of the overall model.
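The acceptance test can be sketched with the standard small-sample AICc under a Gaussian-noise likelihood (a sketch, not the exact phspectra bookkeeping); each Gaussian contributes three free parameters, and a candidate is kept only if it lowers the model's AICc:

```python
import numpy as np

def aicc(residuals, n_components):
    """Corrected Akaike information criterion, Gaussian-noise likelihood.

    Each component has 3 free parameters (amplitude, centroid, width).
    Lower is better; the correction term penalises small samples.
    """
    n = residuals.size
    k = 3 * n_components
    rss = float(np.sum(residuals ** 2))
    return n * np.log(rss / n) + 2 * k + 2 * k * (k + 1) / (n - k - 1)
```

Because AICc charges 3 extra parameters per component, a candidate must reduce the residuals by more than chance alone to be accepted.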

These thresholds have sensible physical defaults that work across the datasets we have tested. In practice, $\beta$ is the only parameter that meaningfully affects decomposition results -- the validation thresholds act as safety nets rather than tuning knobs.

Benchmark: GaussPy

Several families of decomposition algorithms exist (see Introduction for a full overview). We benchmark phspectra against GaussPy / GaussPy+ because both are open-source, fully automated, and target the same use case -- blind Gaussian decomposition of radio spectral cubes. Their derivative-based approach also provides a useful contrast with topology-based peak detection.

GaussPy uses derivative spectroscopy: it computes the second and fourth derivatives of the spectrum and locates their zero-crossings, which mark candidate peak positions and boundaries. Because finite-difference derivatives amplify noise, GaussPy regularizes them with a Total Variation (TV) penalty, solving an optimization problem that balances data fidelity against smoothness:

$$R[u] = \alpha \int \sqrt{(D_x u)^2 + \beta_{\mathrm{TV}}^2} \;+\; \int |A_x u - f|^2$$

where $u$ is the regularized derivative, $A_x$ is the anti-derivative operator, $f$ is the observed spectrum, and $\alpha$ controls the regularization strength. A two-phase decomposition uses $\alpha_1$ for broad features and $\alpha_2$ for narrower residual structure. Both parameters must be trained on synthetic spectra via supervised gradient descent that maximizes $F_1$. GaussPy+ adds spatial coherence constraints for spectral cubes, but the dependence on trained regularization parameters remains.
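To make the contrast concrete, here is the derivative-spectroscopy idea in miniature, with plain Gaussian smoothing standing in for GaussPy's TV solver (so this is a conceptual illustration, not GaussPy itself): candidate peaks sit at negative local minima of the second derivative of the smoothed spectrum.

```python
import numpy as np

def derivative_peak_candidates(y, smooth_sigma=3.0):
    """Candidate peak channels via a smoothed second derivative.

    Gaussian smoothing is a stand-in for TV regularization; the key point
    is that the result depends on the chosen smoothing scale, whereas
    persistence examines all scales at once.
    """
    k = np.arange(-int(4 * smooth_sigma), int(4 * smooth_sigma) + 1)
    kernel = np.exp(-0.5 * (k / smooth_sigma) ** 2)
    kernel /= kernel.sum()
    ys = np.convolve(y, kernel, mode="same")
    d2 = np.gradient(np.gradient(ys))
    tiny = 1e-6 * np.abs(d2).max()          # ignore numerical tail ripple
    return [i for i in range(1, y.size - 1)
            if d2[i] < -tiny and d2[i] <= d2[i - 1] and d2[i] <= d2[i + 1]]
```

Choosing `smooth_sigma` here plays the same role as training $\alpha_1, \alpha_2$ in GaussPy: too small and noise produces spurious zero-crossings, too large and narrow components are smoothed away.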

The key difference is in how peaks are found: GaussPy smooths the signal and differentiates; phspectra reads the peak structure directly from the topology of the data. The table below summarizes the comparison:

| | GaussPy | phspectra |
| --- | --- | --- |
| Peak detection | TV-regularized 2nd/4th-derivative zero-crossings | Persistent homology (all scales simultaneously) |
| Tuning parameters | $\alpha_1, \alpha_2$ (regularization strength) + SNR cutoff | $\beta$ (persistence threshold in noise units) |
| Training | Supervised gradient descent on synthetic spectra | Not required -- default $\beta = 3.5$ generalizes |
| Peak significance | Implicit (via regularization strength) | Explicit (topological persistence) |
| Validation | SNR threshold (trained) | SNR floor + matched-filter SNR + AICc (fixed defaults) |

Why this matters

For large-scale surveys (e.g. GALFA-HI, THOR, GASKAP), millions of spectra must be decomposed. A method that:

  1. Has a single tuning parameter ($\beta$) with sensible fixed defaults for everything else,
  2. Provides a principled significance measure grounded in topology,
  3. Naturally handles multi-scale features without smoothing,
  4. Is deterministic and reproducible,

could significantly improve both the reliability and scalability of spectral line decomposition.