Beta parameter sensitivity
The main tuning parameter in phspectra is β, the persistence threshold in units of the noise level σ_rms. A peak must have topological persistence exceeding β·σ_rms to be retained as a candidate component. In practice, β controls the trade-off between detecting faint features (low β) and rejecting noise artifacts (high β).
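In code, the retention rule amounts to a one-line filter over the persistence diagram. A minimal sketch, assuming a list of (channel, persistence) pairs; the function name and data layout are illustrative, not phspectra's actual API:

```python
def select_candidates(pairs, sigma_rms, beta=3.5):
    """Keep peaks whose topological persistence exceeds beta * sigma_rms.

    `pairs` is a list of (peak_channel, persistence) tuples; both the
    name and the layout are illustrative, not phspectra's actual API.
    """
    return [(ch, pers) for ch, pers in pairs if pers > beta * sigma_rms]

# Three candidate peaks with persistences of 1.2, 0.31, and 0.05 K:
pairs = [(50, 1.2), (120, 0.31), (200, 0.05)]
print(select_candidates(pairs, sigma_rms=0.13))            # keeps only (50, 1.2)
print(select_candidates(pairs, sigma_rms=0.13, beta=2.0))  # lower beta keeps two
```

Lowering β admits more candidates, which is exactly the speed/sensitivity trade-off discussed below.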
Insensitivity across a wide β range
We evaluated β sensitivity on two independent benchmarks:
- Synthetic spectra (350 spectra with known ground-truth components across 7 difficulty categories)
- Real GRS spectra (52 hand-curated spectra from the Galactic Ring Survey, scored against human-labeled decompositions)
Synthetic data: controlled benchmark
The real-data benchmark above compares phspectra against GaussPy+ (another algorithm), not against ground truth. To isolate β sensitivity from algorithmic disagreement, we constructed a synthetic benchmark with known true components.
Test design. We generate 350 spectra across seven categories of increasing difficulty:
| Category | Label | Components | Amplitudes (K) | Widths (ch) | Constraint |
|---|---|---|---|---|---|
| Single Bright | SB | 1 | 1.0–5.0 | 3–10 | SNR > 7 |
| Single Faint | SF | 1 | 0.3–0.8 | 3–10 | SNR 2–6 |
| Single Narrow | SN | 1 | 1.0–5.0 | 1–2.5 | Sub-resolution widths |
| Single Broad | SBd | 1 | 0.5–3.0 | 10–20 | Extended features |
| Multi Separated | MS | 2–3 | 0.5–4.0 | 2–8 | Separation > |
| Multi Blended | MB | 2–3 | 0.5–4.0 | 3–8 | Separation – |
| Crowded | C | 4–5 | 0.3–3.0 | 2–6 | Mixed separations |
All spectra use GRS-realistic parameters: 424 channels with additive Gaussian noise at a fixed σ_rms (in K). Because the true components are known exactly, F1 measures true accuracy rather than agreement with another algorithm.
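A sketch of how such a spectrum can be generated. The helper names are mine, and the default noise level is an illustrative GRS-like value, not the benchmark's exact setting:

```python
import numpy as np

def gaussian(x, amp, center, width):
    """One Gaussian component; width is the standard deviation in channels."""
    return amp * np.exp(-0.5 * ((x - center) / width) ** 2)

def make_spectrum(components, n_channels=424, sigma_rms=0.13, seed=0):
    """Sum of Gaussian components plus additive Gaussian noise.

    `components` holds (amplitude_K, center_channel, width_channels) tuples.
    sigma_rms=0.13 K is an illustrative noise level, not the benchmark's
    exact value.
    """
    rng = np.random.default_rng(seed)
    x = np.arange(n_channels, dtype=float)
    clean = np.zeros(n_channels)
    for comp in components:
        clean += gaussian(x, *comp)
    return x, clean + rng.normal(0.0, sigma_rms, n_channels)

# A Multi Blended (MB)-style spectrum: two overlapping components.
x, y = make_spectrum([(2.0, 200.0, 5.0), (1.0, 208.0, 4.0)])
```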
For each spectrum we sweep β from 2.0 to 4.5, decompose with PHSpectra, and score F1 using Hungarian matching with the Lindner et al. (2015) criteria.
```shell
uv run benchmarks train-synthetic
```
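The matching-and-scoring step can be sketched with SciPy's Hungarian solver. The position-based cost and the matching tolerance below are illustrative stand-ins for the full Lindner et al. (2015) criteria:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_f1(true_comps, found_comps, max_sep=2.0):
    """Hungarian matching of found components to truth on |position difference|.

    Components are (amp, pos, width) tuples; max_sep (channels) is an
    illustrative tolerance, not the exact Lindner et al. (2015) criteria.
    """
    if not true_comps or not found_comps:
        return 0.0
    cost = np.abs(np.subtract.outer(
        np.array([t[1] for t in true_comps]),
        np.array([f[1] for f in found_comps])))
    rows, cols = linear_sum_assignment(cost)           # optimal 1-to-1 matching
    n_match = int(sum(cost[r, c] <= max_sep for r, c in zip(rows, cols)))
    if n_match == 0:
        return 0.0
    precision = n_match / len(found_comps)
    recall = n_match / len(true_comps)
    return 2 * precision * recall / (precision + recall)

truth = [(1.0, 10.0, 3.0), (1.0, 50.0, 3.0)]
found = [(1.0, 10.5, 3.0), (0.9, 49.0, 3.0), (0.2, 90.0, 3.0)]
score = match_f1(truth, found)  # two matches: precision 2/3, recall 1, F1 0.8
```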
Results. Figure 1 shows F1 vs β for each category and overall:

Figure 1. F1 score as a function of the persistence threshold β for seven synthetic spectrum categories (350 spectra). Each panel groups categories by difficulty type; the solid line shows the overall F1. Performance varies by only 0.040 across the full sweep from β = 2.0 to β = 4.5. Generated by `uv run benchmarks train-synthetic`.
The key observations:
- F1 varies by only 0.040 across the full sweep (0.918 at β = 2.0 to 0.885 at β = 4.5, peaking at 0.925 at β = 2.8). This confirms that β sensitivity is low on ground-truth data.
- The difficulty gradient follows expectations. Multi-component separated spectra score highest (MS, F1 = 0.975), while blended multi-component spectra are the hardest (MB, F1 = 0.820). This validates that the benchmark categories genuinely span the difficulty spectrum.
- Parameter recovery is accurate. The box plots in Figure 2 show log-ratio error distributions for amplitude, position, and width at the optimal β. A value of zero indicates perfect recovery; the log-ratio is symmetric around zero and comparable across all three quantities.

Figure 2. Log-ratio error distributions for matched component parameters at the optimal β. From left to right: amplitude, position, and width. Box plots show the median and interquartile range for each category; the dashed line marks perfect recovery. Generated by `uv run benchmarks train-synthetic`.
In Figure 2, all three panels are tightly centred on zero for most categories. Position recovery is particularly precise, with log-ratio errors far smaller than those for amplitude or width. The Multi Blended (MB) category shows the widest spread, consistent with its higher intrinsic difficulty.
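The log-ratio metric itself is one line. A minimal sketch with hypothetical matched amplitudes (not benchmark data):

```python
import numpy as np

# Log-ratio error for a matched parameter: log(fitted / true).
# Zero means perfect recovery, and over- and under-estimation by the
# same factor give values of equal magnitude and opposite sign.
true_amp = np.array([1.0, 2.0, 0.5])   # hypothetical ground-truth amplitudes (K)
fit_amp = np.array([1.1, 2.0, 0.25])   # hypothetical fitted amplitudes (K)
log_ratio = np.log(fit_amp / true_amp)
# log_ratio[1] is exactly 0; log_ratio[2] equals -log(2), a factor-2 underestimate
```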
The per-category F1 at the optimal β:
| Category | Label | F1 |
|---|---|---|
| Multi Separated | MS | 0.975 |
| Single Bright | SB | 0.962 |
| Single Broad | SBd | 0.952 |
| Single Narrow | SN | 0.943 |
| Crowded | C | 0.936 |
| Single Faint | SF | 0.907 |
| Multi Blended | MB | 0.820 |
Limitation: tightly blended components
The Multi Blended (MB) category consistently scores lowest across all β values. The root cause is structural:
Persistence merges close peaks. Persistent homology identifies peaks by their prominence: a feature must rise above its surrounding valley to register as a separate birth–death pair. When two Gaussians are close enough that their sum has no intervening dip, they look like a single broad peak to the filtration; the weaker component appears as a shoulder rather than a distinct local maximum. In these cases the persistence diagram contains only one high-persistence feature where the ground truth has two, and the algorithm never gets a chance to fit the missing component.
This is a known structural limitation of persistence-based peak detection for closely blended lines, shared with any prominence-based method. For spectra whose components are well separated, the algorithm performs well (MS category F1 = 0.975); once components blend into a single local maximum, the merged persistence feature is the binding constraint.
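The merging behaviour is easy to reproduce with a small superlevel-set persistence computation. This is an illustrative re-implementation, not phspectra's code: two well-separated Gaussians yield two high-persistence features, while the same pair moved close enough to blend yields only one:

```python
import numpy as np

def peak_persistence(y):
    """(peak_index, persistence) pairs for a 1D signal, via union-find
    on the superlevel-set filtration (highest samples processed first)."""
    order = np.argsort(-y)
    parent, birth, out = {}, {}, []

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in order:
        roots = {find(j) for j in (i - 1, i + 1) if j in parent}
        if not roots:                       # a new local maximum is born
            parent[i], birth[i] = i, y[i]
        else:                               # join the highest-born component;
            best = max(roots, key=lambda r: birth[r])
            parent[i] = best                # any other component dies here
            for r in roots - {best}:
                out.append((r, birth[r] - y[i]))
                parent[r] = best
    top = find(order[0])                    # global max persists to the minimum
    out.append((top, birth[top] - y[order[-1]]))
    return out

x = np.arange(200, dtype=float)
g = lambda amp, mu, sig: amp * np.exp(-0.5 * ((x - mu) / sig) ** 2)
separated = g(1.0, 60, 3) + g(1.0, 140, 3)  # deep valley between the peaks
blended = g(1.0, 96, 5) + g(1.0, 104, 5)    # separation < 2 sigma: one maximum

n_sep = sum(p > 0.5 for _, p in peak_persistence(separated))
n_blend = sum(p > 0.5 for _, p in peak_persistence(blended))
# n_sep == 2 but n_blend == 1: the filtration sees the blend as one peak
```

No threshold choice rescues the blended case: the second component simply never appears in the diagram.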
Real data: beta training
The training set used here was built with the train-gui package, an interactive Matplotlib tool for manually curating Gaussian component labels on real GRS spectra. Starting from comparison data produced by `benchmarks compare`, the GUI lets you navigate pixels, toggle individual components, and manually fit new ones. The curated labels are saved to a JSON file that serves as ground truth for the β sweep below.
```shell
uv run benchmarks download
uv run benchmarks train --training-set packages/train-gui/data/training_set.json
```
Sweeping β from 2.0 to 4.5 on 52 hand-curated GRS spectra shows a gently peaked F1 curve (Figure 3):

Figure 3. F1, precision, and recall as a function of β on 52 real GRS spectra, scored against hand-curated training-set decompositions. The large precision–recall gap reflects unlabeled components in the curated set rather than false detections (see text). Generated by `uv run benchmarks train --training-set packages/train-gui/data/training_set.json`.
The optimal β on real data is 3.67 (F1 ≈ 0.58, precision ≈ 0.43, recall ≈ 0.89), with an F1 variation of only 0.084 across the sweep.
Why the default is β = 3.5, not β = 2.8
The synthetic benchmark peaks at β = 2.8 and the real-data benchmark at β = 3.67, but the difference between 2.8 and 3.5 is negligible: 0.925 vs 0.919 on synthetic data (a drop of 0.006). Both values sit on the flat plateau of the F1 curve, so accuracy is not the deciding factor.
The practical reason for preferring 3.5 is speed. A lower β admits more candidate peaks from the persistence filtration, many of which sit just above the noise floor. These marginal peaks generate initial Gaussian guesses that must be fitted by the Levenberg–Marquardt solver, the most expensive step in the pipeline, only to be discarded during component validation (SNR floor, matched-filter SNR). The fitting cost scales with the number of components, so a large number of doomed candidates slows the algorithm without improving the final decomposition.
At β = 3.5, the persistence threshold is high enough that most noise peaks are rejected before fitting, while genuine features (whose persistence sits well above the threshold) are retained. This provides a good balance between not missing real peaks near the detection limit and not wasting computation on candidates that will be removed downstream.
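SciPy's find_peaks with a prominence threshold is the direct 1D analogue of persistence filtering, and makes the cost argument concrete: on pure noise, the number of candidates that would reach the fitting stage shrinks as β rises. A sketch with an illustrative noise level:

```python
import numpy as np
from scipy.signal import find_peaks

rng = np.random.default_rng(42)
sigma_rms = 0.13                         # illustrative noise level in K
noise = rng.normal(0.0, sigma_rms, 424)  # a 424-channel pure-noise spectrum

# Peak prominence is the 1D analogue of topological persistence, so a
# prominence cut at beta * sigma_rms mimics the candidate selection.
counts = {beta: len(find_peaks(noise, prominence=beta * sigma_rms)[0])
          for beta in (2.0, 2.8, 3.5, 4.5)}
# counts is non-increasing in beta: fewer doomed candidates survive to
# the expensive Levenberg-Marquardt fitting step at higher thresholds.
```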
Interpreting precision and recall
Figure 3 shows a large gap between recall (≈ 0.89) and precision (≈ 0.43). The training set here consists of hand-curated decompositions from real GRS spectra, so these metrics are measured against human-labeled ground truth.
Recall (good). PHSpectra recovers about 89% of the components in the curated training set. When a human labels a feature as a real component, PHSpectra almost always finds it. Very few labeled components are missed.
Precision (needs context). Only about 43% of PHSpectra's detected components have a matching component in the curated set. This means PHSpectra consistently finds more components than the human labeler marked. These extra detections fall into two categories:
- Real features the labeler did not annotate. Hand-curated sets are rarely exhaustive — faint or partially blended lines may be left unlabeled, particularly in crowded regions. PHSpectra's persistence-based detection can resolve features that are easy to overlook during manual inspection.
- Possible over-decomposition. Some of the extra components may split a single broad feature into multiple narrower ones, producing a valid but different decomposition.
The key observation is that this precision/recall split is stable across the full β sweep: raising β does not meaningfully improve precision, because the extra components are not noise artifacts (those would vanish at higher β). They reflect genuine detections that fall outside the scope of the curated labels.
This interpretation is supported by the synthetic benchmark, where ground truth is known exactly: there, precision and recall are both high (overall F1 up to 0.925), confirming that PHSpectra does not systematically hallucinate components.
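The effect of the gap on the headline number follows from the harmonic-mean definition of F1, which is dominated by the lower of its two inputs:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Real-data operating point quoted above: precision ~ 0.43, recall ~ 0.89.
print(round(f1(0.43, 0.89), 2))  # 0.58: pulled toward the lower input
```

This is why the real-data F1 sits near 0.58 even though recall alone is close to 0.9.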
Why this matters
GaussPy requires a trained smoothing parameter α that is sensitive to the noise properties and spectral structure of each survey. The training procedure (Lindner et al. 2015) requires labeled decompositions and can produce different optimal α values for different regions of the same survey.
In contrast, PHSpectra's β parameter is:
- Survey-agnostic: values across the range spanned by the two optima (β ≈ 2.8–3.7) work well across both real and synthetic data with fundamentally different noise structures.
- Robust to perturbation: performance degrades gracefully rather than collapsing at non-optimal values. There is no cliff; both curves vary by less than 0.09 across the full β = 2.0–4.5 sweep.
- Physically interpretable: β directly controls the minimum significance (in units of σ_rms) for a peak to be considered real. A value of β = 3.5 means "reject anything less significant than a 3.5σ fluctuation," which is a natural and intuitive threshold.
- Efficient: at β = 3.5, most noise peaks are filtered out before the expensive fitting step, avoiding wasted computation on candidates that would be rejected by validation anyway.