
Beta parameter sensitivity

The main tuning parameter in phspectra is $\beta$, the persistence threshold in units of the noise $\sigma$. A peak must have topological persistence exceeding $\beta \cdot \sigma_\mathrm{rms}$ to be retained as a candidate component. In practice, $\beta$ controls the trade-off between detecting faint features (low $\beta$) and rejecting noise artifacts (high $\beta$).
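The thresholding step can be illustrated with a minimal 0-dimensional persistence computation on a 1-D signal. This is a sketch of the general idea, not phspectra's implementation; `peak_persistences` and `candidate_peaks` are hypothetical helper names.

```python
import numpy as np

def peak_persistences(y):
    """0-dimensional persistence of local maxima in a 1-D signal.

    Sweep samples from highest to lowest; a peak is 'born' at a new
    local summit and 'dies' when its component merges with a higher
    one at a saddle. Returns a list of (birth_height, persistence).
    """
    order = np.argsort(-y)            # visit highest samples first
    parent, birth, out = {}, {}, []

    def find(i):                      # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in order:
        left = i - 1 if i - 1 in parent else None
        right = i + 1 if i + 1 in parent else None
        parent[i] = i
        if left is None and right is None:
            birth[i] = y[i]           # new peak born here
        elif left is not None and right is not None:
            rl, rr = find(left), find(right)
            hi, lo = (rl, rr) if birth[rl] >= birth[rr] else (rr, rl)
            out.append((birth[lo], birth[lo] - y[i]))   # lower peak dies
            parent[lo] = parent[i] = hi
        else:
            parent[i] = find(left if left is not None else right)

    root = find(order[0])             # the global maximum never merges
    out.append((birth[root], birth[root] - y.min()))
    return out

def candidate_peaks(y, beta=3.5, sigma_rms=0.25):
    """Keep only peaks whose persistence exceeds beta * sigma_rms."""
    return [bp for bp in peak_persistences(y) if bp[1] > beta * sigma_rms]
```

On a toy signal with a 1.0 K summit and a 0.8 K shoulder over a 0.2 K valley, the shoulder has persistence 0.6: it survives $\beta = 2$ (threshold 0.5) but is rejected at $\beta = 3.5$ (threshold 0.875).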

Insensitivity across a wide range

We evaluated $\beta$ on two independent benchmarks:

  1. Synthetic spectra (350 spectra with known ground-truth components across 7 difficulty categories)
  2. Real GRS spectra (52 hand-curated spectra from the Galactic Ring Survey, scored against human-labeled decompositions)

Synthetic data: controlled benchmark

The real-data benchmark above compares phspectra against GaussPy+ (another algorithm), not against ground truth. To isolate $\beta$ sensitivity from algorithmic disagreement, we constructed a synthetic benchmark with known true components.

Test design. We generate 350 spectra across seven categories of increasing difficulty:

| Category | Label | Components | Amplitudes (K) | Widths $\sigma$ (ch) | Constraint |
|---|---|---|---|---|---|
| Single Bright | SB | 1 | 1.0–5.0 | 3–10 | SNR > 7 |
| Single Faint | SF | 1 | 0.3–0.8 | 3–10 | SNR 2–6 |
| Single Narrow | SN | 1 | 1.0–5.0 | 1–2.5 | Sub-resolution widths |
| Single Broad | SBd | 1 | 0.5–3.0 | 10–20 | Extended features |
| Multi Separated | MS | 2–3 | 0.5–4.0 | 2–8 | Separation $> 4\sigma$ |
| Multi Blended | MB | 2–3 | 0.5–4.0 | 3–8 | Separation $1.5$–$3\sigma$ |
| Crowded | C | 4–5 | 0.3–3.0 | 2–6 | Mixed separations |

All spectra use GRS-realistic parameters: 424 channels with additive Gaussian noise at $\sigma = 0.25$ K. Because the true components are known exactly, $F_1$ measures true accuracy rather than agreement with another algorithm.
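A spectrum in this style can be generated in a few lines. This is a sketch using the parameters quoted above; `make_spectrum` is a hypothetical helper, not part of the benchmark code.

```python
import numpy as np

def gaussian(x, amp, mu, sigma):
    return amp * np.exp(-0.5 * ((x - mu) / sigma) ** 2)

def make_spectrum(components, n_channels=424, noise_sigma=0.25, seed=0):
    """GRS-like synthetic spectrum: a sum of Gaussians plus white noise.

    `components` is a list of (amplitude_K, center_channel, width_channels).
    """
    rng = np.random.default_rng(seed)
    x = np.arange(n_channels, dtype=float)
    y = sum(gaussian(x, *c) for c in components)
    return x, y + rng.normal(0.0, noise_sigma, n_channels)

# a "Multi Blended" style example: two components 2 sigma apart
x, y = make_spectrum([(2.0, 200.0, 5.0), (1.0, 210.0, 5.0)])
```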

For each spectrum we sweep $\beta$ from 2.0 to 4.5, decompose with PHSpectra, and score using Hungarian matching with the Lindner et al. (2015) criteria.
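Hungarian matching pairs fitted and true components one-to-one before counting hits and misses. A minimal version using SciPy's assignment solver; the single position tolerance here is illustrative only, not the full Lindner et al. (2015) criteria (which also constrain amplitude and width agreement).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def score_decomposition(true_mu, fit_mu, tol=2.0):
    """F1 from one-to-one Hungarian matching on component positions.

    `tol` (channels) is an illustrative acceptance cut.
    """
    cost = np.abs(np.subtract.outer(np.asarray(true_mu), np.asarray(fit_mu)))
    rows, cols = linear_sum_assignment(cost)     # optimal one-to-one pairing
    tp = int(np.sum(cost[rows, cols] <= tol))    # accepted matches
    fp = len(fit_mu) - tp                        # unmatched detections
    fn = len(true_mu) - tp                       # missed true components
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if tp else 0.0
```

With two true components and three detections, two of which match within tolerance, precision is 2/3, recall is 1, and $F_1 = 0.8$.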

```shell
uv run benchmarks train-synthetic
```

Results. Figure 1 shows $F_1$ vs $\beta$ for each category and overall:

Synthetic F1 vs beta

Figure 1. $F_1$ score as a function of the persistence threshold $\beta$ for seven synthetic spectrum categories (350 spectra, $\sigma = 0.25$ K). Each panel groups categories by difficulty type; the solid line shows the overall $F_1$. Performance varies by only 0.040 across the full sweep from $\beta = 2.0$ to $4.5$. Generated by `uv run benchmarks train-synthetic`.

The key observations:

  1. $F_1$ varies by only 0.040 across the full $\beta$ sweep (0.918 at $\beta = 2.0$ to 0.885 at $\beta = 4.5$, peaking at 0.925 at $\beta = 2.8$). This confirms that $\beta$ sensitivity is low on ground-truth data.

  2. The difficulty gradient follows expectations. Multi-component separated spectra score highest, while blended multi-component spectra are the hardest. This validates that the benchmark categories genuinely span the difficulty spectrum.

  3. Parameter recovery is accurate. The box plots in Figure 2 show $\ln(Q_\mathrm{fit} / Q_\mathrm{true})$ for amplitude, position, and width at the optimal $\beta$. A value of zero indicates perfect recovery; the log-ratio is symmetric around zero and comparable across all three quantities.

Synthetic error distributions

Figure 2. Log-ratio error distributions $\ln(Q_\mathrm{fit} / Q_\mathrm{true})$ for matched component parameters at the optimal $\beta = 2.8$. From left to right: amplitude, position, and width. Box plots show the median and interquartile range for each category; the dashed line marks perfect recovery. Generated by `uv run benchmarks train-synthetic`.

In Figure 2, all three panels are tightly centered on zero for most categories. Position recovery is particularly precise, with log-ratios of order $10^{-3}$. The Multi Blended (MB) category shows the widest spread, consistent with its higher intrinsic difficulty.
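The log-ratio is used because it treats over- and under-estimation symmetrically: a fit at twice the true value and a fit at half the true value sit at $\pm\ln 2$. A quick check of that symmetry (`log_ratio` is a hypothetical helper name):

```python
import numpy as np

def log_ratio(q_fit, q_true):
    """Symmetric relative error: 0 = perfect, +/- ln(2) = off by a factor of 2."""
    return np.log(np.asarray(q_fit) / np.asarray(q_true))

# overestimating by 2x and underestimating by 2x are mirror images
hi, lo = log_ratio(2.0, 1.0), log_ratio(0.5, 1.0)
```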

The per-category $F_1$ at the optimal $\beta = 2.8$:

| Category | Label | $F_1$ |
|---|---|---|
| Multi Separated | MS | 0.975 |
| Single Bright | SB | 0.962 |
| Single Broad | SBd | 0.952 |
| Single Narrow | SN | 0.943 |
| Crowded | C | 0.936 |
| Single Faint | SF | 0.907 |
| Multi Blended | MB | 0.820 |

Limitation: tightly blended components

The Multi Blended (MB) category consistently scores lowest across all $\beta$ values. The root cause is structural:

Persistence merges close peaks. Persistent homology identifies peaks by their prominence: a feature must rise above its surrounding valley to register as a separate birth–death pair. When two Gaussians are separated by less than $\sim 2\sigma$, their sum looks like a single broad peak to the filtration — the weaker component appears as a shoulder rather than a distinct local maximum. In these cases the persistence diagram contains only one high-persistence feature where the ground truth has two, and the algorithm never gets a chance to fit the missing component.
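This merging is easy to demonstrate numerically: the sum of two Gaussians closer than about $2\sigma$ has a single local maximum, so no prominence- or persistence-based detector can register two births. A small self-contained check, independent of phspectra:

```python
import numpy as np

x = np.linspace(-20.0, 20.0, 2001)

def g(amp, mu, sigma=1.0):
    return amp * np.exp(-0.5 * ((x - mu) / sigma) ** 2)

def n_local_maxima(y):
    # count strict interior local maxima on the sampled grid
    return int(np.sum((y[1:-1] > y[:-2]) & (y[1:-1] > y[2:])))

blended  = g(1.0, 0.0) + g(0.8, 1.5)   # 1.5 sigma apart: one merged summit
resolved = g(1.0, 0.0) + g(0.8, 4.0)   # 4 sigma apart: two distinct summits
```

The blended pair produces a single summit with a shoulder, exactly the failure mode described above; the well-separated pair keeps two maxima.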

This is a known structural limitation of persistence-based peak detection for closely blended lines, shared with any prominence-based method. For spectra where components are separated by $> 3\sigma$, the algorithm performs well (MS category $F_1 = 0.975$); for separations below $\sim 2\sigma$, the merged persistence feature is the binding constraint.

Real data: beta training

The training set used here was built with the `train-gui` package -- an interactive Matplotlib tool for manually curating Gaussian component labels on real GRS spectra. Starting from comparison data produced by `benchmarks compare`, the GUI lets you navigate pixels, toggle individual components, and manually fit new ones. The curated labels are saved to a JSON file that serves as ground truth for the $\beta$ sweep below.

```shell
uv run benchmarks download
uv run benchmarks train --training-set packages/train-gui/data/training_set.json
```

Sweeping $\beta$ from 2.0 to 4.5 on 52 hand-curated GRS spectra shows a gently peaked $F_1$ curve (Figure 3):

Beta training on GRS spectra

Figure 3. $F_1$, precision, and recall as a function of $\beta$ on 52 real GRS spectra, scored against hand-curated training-set decompositions. The large precision--recall gap reflects unlabeled components in the curated set rather than false detections (see text). Generated by `uv run benchmarks train --training-set packages/train-gui/data/training_set.json`.

The optimal $\beta$ on real data is 3.67 ($F_1 = 0.576$, $P = 0.426$, $R = 0.888$), with $F_1$ variation of 0.084 across the sweep.

Why the default is $\beta = 3.5$, not $2.8$

The synthetic benchmark peaks at $\beta \approx 2.8$ and the real-data benchmark at $\beta \approx 3.7$, but the $F_1$ difference between 2.8 and 3.5 is negligible: 0.925 vs 0.919 on synthetic data (a drop of 0.006). Both values sit on the flat plateau of the $F_1$ curve, so accuracy is not the deciding factor.

The practical reason for preferring 3.5 is speed. A lower $\beta$ admits more candidate peaks from the persistence filtration, many of which sit just above the noise floor. These marginal peaks generate initial Gaussian guesses that must be fitted by the Levenberg-Marquardt solver -- the most expensive step in the pipeline -- only to be discarded during component validation (SNR floor, matched-filter SNR). The fitting cost scales with the number of components, so a large number of doomed candidates slows the algorithm without improving the final decomposition.
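The effect on candidate counts can be approximated with SciPy's `find_peaks`, whose `prominence` argument is the 1-D analogue of persistence. On a noise-only spectrum, lower thresholds admit more doomed candidates (exact counts depend on the random seed):

```python
import numpy as np
from scipy.signal import find_peaks

rng = np.random.default_rng(42)
sigma_rms = 0.25
y = rng.normal(0.0, sigma_rms, 424)      # noise-only, GRS-length spectrum

counts = {}
for beta in (2.0, 2.8, 3.5):
    # prominence plays the role of persistence here
    peaks, _ = find_peaks(y, prominence=beta * sigma_rms)
    counts[beta] = len(peaks)            # candidates that would reach fitting
```

Every candidate surviving the threshold feeds the expensive fitter, so the monotone drop in `counts` as `beta` rises translates directly into saved computation.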

At $\beta = 3.5$, the persistence threshold is high enough that most noise peaks are rejected before fitting, while genuine features (which have persistence well above $3.5\sigma$) are retained. This provides a good balance between not missing real peaks near the detection limit and not wasting computation on candidates that will be removed downstream.

Interpreting precision and recall

Figure 3 shows a large gap between recall ($\approx 0.9$) and precision ($\approx 0.4$). The training set here consists of hand-curated decompositions from real GRS spectra, so these metrics are measured against human-labeled ground truth.

Recall $\approx 0.89$ (good). PHSpectra recovers about 89% of the components in the curated training set. When a human labels a feature as a real component, PHSpectra almost always finds it. Very few labeled components are missed.

Precision $\approx 0.43$ (needs context). Only about 43% of PHSpectra's detected components have a matching component in the curated set. This means PHSpectra consistently finds more components than the human labeler marked. These extra detections fall into two categories:

  • Real features the labeler did not annotate. Hand-curated sets are rarely exhaustive — faint or partially blended lines may be left unlabeled, particularly in crowded regions. PHSpectra's persistence-based detection can resolve features that are easy to overlook during manual inspection.
  • Possible over-decomposition. Some of the extra components may split a single broad feature into multiple narrower ones, producing a valid but different decomposition.

The key observation is that this precision/recall split is stable across the full $\beta$ sweep: raising $\beta$ does not meaningfully improve precision, because the extra components are not noise artifacts (those would vanish at higher $\beta$). They reflect genuine detections that fall outside the scope of the curated labels.

This interpretation is supported by the synthetic benchmark, where ground truth is known exactly: there, precision and recall are both high ($F_1 = 0.925$), confirming that PHSpectra does not systematically hallucinate components.

Why this matters

GaussPy requires a trained smoothing parameter $\alpha$ that is sensitive to the noise properties and spectral structure of each survey. The training procedure (Lindner et al. 2015) requires labeled decompositions and can produce different optimal values for different regions of the same survey.

In contrast, PHSpectra's $\beta$ parameter is:

  • Survey-agnostic: values in the range $\beta = 2.8$–$3.7$ work well across both real and synthetic data with fundamentally different noise structures.
  • Robust to perturbation: performance degrades gracefully rather than collapsing at non-optimal values. There is no cliff — both $F_1$ curves vary by less than 0.09 across the full $2.0$–$4.5$ sweep.
  • Physically interpretable: $\beta$ directly controls the minimum significance (in units of $\sigma$) for a peak to be considered real. A value of $\beta = 3.5$ means "reject anything less significant than a $3.5\sigma$ fluctuation," which is a natural and intuitive threshold.
  • Efficient: at $\beta = 3.5$, most noise peaks are filtered out before the expensive fitting step, avoiding wasted computation on candidates that would be rejected by validation anyway.
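For intuition on the "$3.5\sigma$" reading, a back-of-envelope check treats each sample as an independent Gaussian deviate (persistence is not exactly a per-channel amplitude test, so this is only a rough analogue):

```python
from math import erf, sqrt

def upper_tail(beta):
    """P(N(0,1) > beta): chance a pure-noise value exceeds beta sigma."""
    return 0.5 * (1.0 - erf(beta / sqrt(2.0)))

p = upper_tail(3.5)          # roughly 2.3e-4 per sample
expected_false = 424 * p     # roughly 0.1 per 424-channel spectrum
```

In other words, a $3.5\sigma$ cut leaves on the order of a tenth of a spurious exceedance per spectrum, consistent with most noise candidates being rejected before the fitting stage.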