# PHSpectra

Persistent homology for spectral line decomposition.

## The problem

Along any line of sight, radio-astronomical spectra -- HI 21-cm, $^{13}$CO, and other molecular tracers -- sample emission from multiple gas clouds at different radial velocities. In the optically thin limit each cloud contributes a Gaussian profile, and the observed spectrum is a linear superposition of these components. Recovering the individual amplitudes, centroid velocities, and widths is a blind decomposition problem: given a noisy 1-D signal, determine the number of components and fit their parameters without prior knowledge. For modern surveys that contain millions of spectra (GALFA-HI, THOR, GASKAP, GRS), the decomposition method must be both accurate and scalable.
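The forward model above can be sketched in a few lines. The component values, velocity grid, and noise level below are illustrative, not taken from any survey:

```python
import numpy as np

def gaussian(v, amp, v0, fwhm):
    """One optically thin component: amplitude, centroid velocity, FWHM."""
    sigma = fwhm / (2.0 * np.sqrt(2.0 * np.log(2.0)))
    return amp * np.exp(-0.5 * ((v - v0) / sigma) ** 2)

# Velocity axis (km/s) and three blended components -- illustrative values.
v = np.linspace(-50.0, 50.0, 512)
components = [(1.0, -10.0, 8.0), (0.6, 2.0, 15.0), (0.8, 12.0, 5.0)]

# Observed spectrum: linear superposition of Gaussians plus Gaussian noise.
rng = np.random.default_rng(0)
spectrum = sum(gaussian(v, *p) for p in components)
noisy = spectrum + rng.normal(0.0, 0.05, size=v.size)  # sigma_rms = 0.05
```

The blind decomposition problem is to recover `components` from `noisy` alone, without knowing that three Gaussians were injected.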

## Existing approaches

Several families of algorithms have been developed for this task:

| Approach | Examples | Key idea | Limitation |
| --- | --- | --- | --- |
| Derivative spectroscopy | GaussPy (Lindner et al. 2015), GaussPy+ (Riener et al. 2019), BTS (Clarke et al. 2018) | Smoothed derivatives locate zero-crossings that mark peak positions | Smoothing parameters ($\alpha_1$, $\alpha_2$) must be trained on synthetic spectra; smoothing can suppress closely spaced components |
| Semi-interactive | SCOUSEPY (Henshaw et al. 2016, 2019) | Human selects spectral averaging areas; automated fitting within regions | Requires user guidance; difficult to scale to millions of spectra |
| Bayesian model comparison | PySpecNest (Sokolov et al. 2020) | Full posterior sampling with evidence-ratio model selection | Rigorous but computationally expensive per spectrum |
| Brute-force optimisation | Pyspeckit (Ginsburg & Mirocha 2011), SPIF (Juvela et al. 2024) | Many random initialisations; keep the best $\chi^2$ | Robustness depends on the number of restarts; GPU hardware needed at survey scale |
| Hyperfine-specialised | MWYDIN (Rigby et al. 2024) | Shared excitation temperature and FWHM; BIC model selection | Limited to 3 velocity components; assumes shared physical conditions |

Common challenges across these methods include choosing the number of components (model selection), sensitivity to initial guesses, parameter degeneracies in optically thick or blended lines, and the trade-off between computational cost and uncertainty quantification (Juvela et al. 2024).

## A topological alternative

phspectra takes a different approach. Instead of smoothing and differentiating, or sampling a posterior, it uses 0-dimensional persistent homology -- a tool from topological data analysis -- to identify peaks and rank them by their topological persistence: the height difference between a peak's birth and the level at which it merges with a taller neighbour. This persistence is a natural, scale-free measure of peak significance that requires no smoothing, no random initialisation, and no training data.

The single tuning parameter is $\beta$, which sets the minimum persistence threshold in units of the estimated noise $\sigma_\mathrm{rms}$:

$$\pi_{\min} = \beta \times \sigma_{\mathrm{rms}}$$

A peak is retained only if its persistence exceeds $\beta \cdot \sigma_\mathrm{rms}$, i.e. only if it is at least $\beta\sigma$ significant. The default $\beta = 3.5$ works across both real and synthetic data; sweeping $\beta$ from 2.0 to 4.5 changes $F_1$ by less than 0.09 (see Beta sensitivity).
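For intuition, here is a minimal sketch of the standard descending-sweep (union-find) algorithm for 0-dimensional persistence of a 1-D signal, followed by the $\beta\sigma_\mathrm{rms}$ cut. This is an illustration of the technique, not phspectra's actual implementation, and the toy signal and noise level are invented:

```python
import numpy as np

def persistent_peaks(y):
    """0-D persistence of a 1-D signal via a descending sweep.

    Samples are visited from highest to lowest; each local maximum births a
    connected component, and when two components meet, the one born lower
    dies (elder rule). Returns (peak_index, persistence) pairs sorted by
    decreasing persistence; the global maximum gets max(y) - min(y)."""
    y = np.asarray(y, dtype=float)
    n = y.size
    order = np.argsort(-y, kind="stable")
    parent = np.full(n, -1)      # -1: sample not yet swept in
    birth = np.empty(n)          # birth height of each component root
    pairs = []

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in order:
        active = [j for j in (i - 1, i + 1) if 0 <= j < n and parent[j] >= 0]
        parent[i] = i
        birth[i] = y[i]
        for j in active:
            a, b = find(i), find(j)
            if a == b:
                continue
            lo, hi = (a, b) if birth[a] < birth[b] else (b, a)
            pairs.append((int(lo), birth[lo] - y[i]))  # lower peak dies here
            parent[lo] = hi
    root = find(int(np.argmax(y)))
    pairs.append((int(root), y.max() - y.min()))
    return sorted(pairs, key=lambda t: -t[1])

# Keep only peaks at least beta-sigma significant (illustrative values).
beta, sigma_rms = 3.5, 0.05
signal = np.array([0.0, 0.1, 0.9, 0.2, 0.05, 0.6, 0.1, 0.0])
kept = [(i, p) for i, p in persistent_peaks(signal) if p >= beta * sigma_rms]
```

On this toy signal the sweep keeps the peaks at indices 2 and 5; every other sample is paired with persistence 0 and falls below the threshold. Note that no smoothing is applied at any point: the sweep sees all scales of the signal at once.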

## How phspectra compares

|  | Derivative methods | Bayesian / brute-force | phspectra |
| --- | --- | --- | --- |
| Peak detection | Smoothed derivative zero-crossings | Random initialisation + optimisation | Persistent homology (all scales simultaneously) |
| Tuning parameters | $\alpha_1$, $\alpha_2$ (regularisation strength) | Priors, number of restarts | $\beta$ (persistence threshold in noise units) |
| Training | Supervised, on synthetic spectra | Not required, but slow per spectrum | Not required -- default $\beta = 3.5$ generalises |
| Determinism | Deterministic | Stochastic (random seeds / MCMC) | Deterministic |
| Scalability | Survey-scale on CPU | Survey-scale with GPU (SPIF) | Survey-scale on serverless (AWS Lambda); ~2.3M spectra for ~$40 |

Persistent homology provides principled initial guesses -- peak locations, amplitudes, and widths read directly from the persistence diagram -- so the subsequent least-squares fit converges reliably without random restarts. The result is a method that is training-free, deterministic, and scalable without GPU infrastructure.
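The seeding step can be sketched as follows. `fit_from_peaks` and `sigma_guess` are hypothetical names for illustration -- in particular, the fixed starting width stands in for the width estimate that phspectra reads from the persistence diagram:

```python
import numpy as np
from scipy.optimize import curve_fit

def multi_gaussian(v, *params):
    """Sum of Gaussians; params are flattened (amp, v0, sigma) triples."""
    y = np.zeros_like(v)
    for a, v0, s in zip(params[0::3], params[1::3], params[2::3]):
        y = y + a * np.exp(-0.5 * ((v - v0) / s) ** 2)
    return y

def fit_from_peaks(v, y, peak_indices, sigma_guess=3.0):
    """Least-squares refinement seeded by retained peaks -- no restarts.

    Amplitude and centroid are read off the spectrum at each peak index;
    sigma_guess is a placeholder starting width (hypothetical)."""
    p0 = []
    for i in peak_indices:
        p0 += [y[i], v[i], sigma_guess]
    popt, _ = curve_fit(multi_gaussian, v, y, p0=p0)
    return popt.reshape(-1, 3)   # one (amp, v0, sigma) row per component
```

Because the starting point is already close to the true parameters, a single deterministic `curve_fit` call typically suffices where brute-force methods would need many random restarts.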

  • Motivation -- Why topology instead of derivatives?
  • Persistent Homology -- Mathematics, algorithm, and integration in phspectra
  • Data Sources -- Surveys and catalogs used for benchmarking
  • GRS survey -- Full-survey decomposition of the Galactic Ring Survey