Beta parameter sensitivity
The main tuning parameter in phspectra is β, the persistence threshold in units of the noise level σ_rms. A peak must have topological persistence exceeding β·σ_rms to be retained as a candidate component. In practice, β controls the trade-off between detecting faint features (low β) and rejecting noise artifacts (high β).
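In code, the retention rule amounts to a one-line filter over the persistence diagram. A minimal sketch, assuming a list of (channel, persistence) pairs; the function name and data layout are illustrative, not phspectra's actual API:

```python
def select_candidates(pairs, sigma_rms, beta=3.5):
    """Keep peaks whose topological persistence exceeds beta * sigma_rms.

    `pairs` is a list of (peak_channel, persistence) tuples; both the
    name and the layout are illustrative, not phspectra's actual API.
    """
    return [(ch, pers) for ch, pers in pairs if pers > beta * sigma_rms]

# Three candidate peaks with persistences of 1.2, 0.31, and 0.05 K:
pairs = [(50, 1.2), (120, 0.31), (200, 0.05)]
print(select_candidates(pairs, sigma_rms=0.13))            # keeps only (50, 1.2)
print(select_candidates(pairs, sigma_rms=0.13, beta=2.0))  # lower beta keeps two
```

Lowering β admits more candidates, which is exactly the speed/sensitivity trade-off discussed below.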
Insensitivity across a wide β range
We evaluated β sensitivity on two independent benchmarks:
- Synthetic spectra (350 spectra with known ground-truth components across 7 difficulty categories)
- Real GRS spectra (52 hand-curated spectra from the Galactic Ring Survey, scored against human-labeled decompositions)
Synthetic data: controlled benchmark
The real-data benchmark above compares phspectra against GaussPy+ (another algorithm), not against ground truth. To isolate β sensitivity from algorithmic disagreement, we constructed a synthetic benchmark with known true components.
Test design. We generate 350 spectra across seven categories of increasing difficulty:
| Category | Label | Components | Amplitudes (K) | Widths (ch) | Constraint |
|---|---|---|---|---|---|
| Single Bright | SB | 1 | 1.0–5.0 | 3–10 | SNR > 7 |
| Single Faint | SF | 1 | 0.3–0.8 | 3–10 | SNR 2–6 |
| Single Narrow | SN | 1 | 1.0–5.0 | 1–2.5 | Sub-resolution widths |
| Single Broad | SBd | 1 | 0.5–3.0 | 10–20 | Extended features |
| Multi Separated | MS | 2–3 | 0.5–4.0 | 2–8 | Separation > |
| Multi Blended | MB | 2–3 | 0.5–4.0 | 3–8 | Separation – |
| Crowded | C | 4–5 | 0.3–3.0 | 2–6 | Mixed separations |
All spectra use GRS-realistic parameters: 424 channels with additive Gaussian noise at a fixed σ_rms (in K). Because the true components are known exactly, F1 measures true accuracy rather than agreement with another algorithm.
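A sketch of how such a spectrum can be generated. The helper names are mine, and the default noise level is an illustrative GRS-like value, not the benchmark's exact setting:

```python
import numpy as np

def gaussian(x, amp, center, width):
    """One Gaussian component; width is the standard deviation in channels."""
    return amp * np.exp(-0.5 * ((x - center) / width) ** 2)

def make_spectrum(components, n_channels=424, sigma_rms=0.13, seed=0):
    """Sum of Gaussian components plus additive Gaussian noise.

    `components` holds (amplitude_K, center_channel, width_channels) tuples.
    sigma_rms=0.13 K is an illustrative noise level, not the benchmark's
    exact value.
    """
    rng = np.random.default_rng(seed)
    x = np.arange(n_channels, dtype=float)
    clean = np.zeros(n_channels)
    for comp in components:
        clean += gaussian(x, *comp)
    return x, clean + rng.normal(0.0, sigma_rms, n_channels)

# A Multi Blended (MB)-style spectrum: two overlapping components.
x, y = make_spectrum([(2.0, 200.0, 5.0), (1.0, 208.0, 4.0)])
```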
For each spectrum we sweep β from 2.0 to 4.5, decompose with PHSpectra, and score F1 using Hungarian matching with the Lindner et al. (2015) criteria.
```shell
uv run benchmarks train-synthetic
```
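The matching-and-scoring step can be sketched with SciPy's Hungarian solver. The position-based cost and the matching tolerance below are illustrative stand-ins for the full Lindner et al. (2015) criteria:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_f1(true_comps, found_comps, max_sep=2.0):
    """Hungarian matching of found components to truth on |position difference|.

    Components are (amp, pos, width) tuples; max_sep (channels) is an
    illustrative tolerance, not the exact Lindner et al. (2015) criteria.
    """
    if not true_comps or not found_comps:
        return 0.0
    cost = np.abs(np.subtract.outer(
        np.array([t[1] for t in true_comps]),
        np.array([f[1] for f in found_comps])))
    rows, cols = linear_sum_assignment(cost)           # optimal 1-to-1 matching
    n_match = int(sum(cost[r, c] <= max_sep for r, c in zip(rows, cols)))
    if n_match == 0:
        return 0.0
    precision = n_match / len(found_comps)
    recall = n_match / len(true_comps)
    return 2 * precision * recall / (precision + recall)

truth = [(1.0, 10.0, 3.0), (1.0, 50.0, 3.0)]
found = [(1.0, 10.5, 3.0), (0.9, 49.0, 3.0), (0.2, 90.0, 3.0)]
score = match_f1(truth, found)  # two matches: precision 2/3, recall 1, F1 0.8
```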
Results. Figure 1 shows F1 vs β for each category and overall:

Figure 1. F1 score as a function of the persistence threshold β for seven synthetic spectrum categories (350 spectra). Each panel groups categories by difficulty type; the solid line shows the overall F1. Performance varies by only 0.040 across the full sweep from β = 2.0 to β = 4.5. Generated by `uv run benchmarks train-synthetic`.
The key observations:
- F1 varies by only 0.040 across the full sweep (0.918 at β = 2.0 to 0.885 at β = 4.5, peaking at 0.925 at β = 2.8). This confirms that β sensitivity is low on ground-truth data.
- The difficulty gradient follows expectations. Multi-component separated spectra score highest (MS, F1 = 0.975), while blended multi-component spectra are the hardest (MB, F1 = 0.820). This validates that the benchmark categories genuinely span the difficulty spectrum.
- Parameter recovery is accurate. The box plots in Figure 2 show log-ratio error distributions for amplitude, position, and width at the optimal β. A value of zero indicates perfect recovery; the log-ratio is symmetric around zero and comparable across all three quantities.

Figure 2. Log-ratio error distributions for matched component parameters at the optimal β. From left to right: amplitude, position, and width. Box plots show the median and interquartile range for each category; the dashed line marks perfect recovery. Generated by `uv run benchmarks train-synthetic`.
In Figure 2, all three panels are tightly centred on zero for most categories. Position recovery is particularly precise, with log-ratio errors far smaller than those for amplitude or width. The Multi Blended (MB) category shows the widest spread, consistent with its higher intrinsic difficulty.
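The log-ratio metric itself is one line. A minimal sketch with hypothetical matched amplitudes (not benchmark data):

```python
import numpy as np

# Log-ratio error for a matched parameter: log(fitted / true).
# Zero means perfect recovery, and over- and under-estimation by the
# same factor give values of equal magnitude and opposite sign.
true_amp = np.array([1.0, 2.0, 0.5])   # hypothetical ground-truth amplitudes (K)
fit_amp = np.array([1.1, 2.0, 0.25])   # hypothetical fitted amplitudes (K)
log_ratio = np.log(fit_amp / true_amp)
# log_ratio[1] is exactly 0; log_ratio[2] equals -log(2), a factor-2 underestimate
```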
The per-category F1 at the optimal β:
| Category | Label | F1 |
|---|---|---|
| Multi Separated | MS | 0.975 |
| Single Bright | SB | 0.962 |
| Single Broad | SBd | 0.952 |
| Single Narrow | SN | 0.943 |
| Crowded | C | 0.936 |
| Single Faint | SF | 0.907 |
| Multi Blended | MB | 0.820 |
Limitation: tightly blended components
The Multi Blended (MB) category consistently scores lowest across all β values. The root cause is structural:
Persistence merges close peaks. Persistent homology identifies peaks by their prominence: a feature must rise above its surrounding valley to register as a separate birth–death pair. When two Gaussians are close enough that their sum has no intervening dip, they look like a single broad peak to the filtration; the weaker component appears as a shoulder rather than a distinct local maximum. In these cases the persistence diagram contains only one high-persistence feature where the ground truth has two, and the algorithm never gets a chance to fit the missing component.
This is a known structural limitation of persistence-based peak detection for closely blended lines, shared with any prominence-based method. For spectra whose components are well separated, the algorithm performs well (MS category F1 = 0.975); once components blend into a single local maximum, the merged persistence feature is the binding constraint.
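The merging behaviour is easy to reproduce with a small superlevel-set persistence computation. This is an illustrative re-implementation, not phspectra's code: two well-separated Gaussians yield two high-persistence features, while the same pair moved close enough to blend yields only one:

```python
import numpy as np

def peak_persistence(y):
    """(peak_index, persistence) pairs for a 1D signal, via union-find
    on the superlevel-set filtration (highest samples processed first)."""
    order = np.argsort(-y)
    parent, birth, out = {}, {}, []

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in order:
        roots = {find(j) for j in (i - 1, i + 1) if j in parent}
        if not roots:                       # a new local maximum is born
            parent[i], birth[i] = i, y[i]
        else:                               # join the highest-born component;
            best = max(roots, key=lambda r: birth[r])
            parent[i] = best                # any other component dies here
            for r in roots - {best}:
                out.append((r, birth[r] - y[i]))
                parent[r] = best
    top = find(order[0])                    # global max persists to the minimum
    out.append((top, birth[top] - y[order[-1]]))
    return out

x = np.arange(200, dtype=float)
g = lambda amp, mu, sig: amp * np.exp(-0.5 * ((x - mu) / sig) ** 2)
separated = g(1.0, 60, 3) + g(1.0, 140, 3)  # deep valley between the peaks
blended = g(1.0, 96, 5) + g(1.0, 104, 5)    # separation < 2 sigma: one maximum

n_sep = sum(p > 0.5 for _, p in peak_persistence(separated))
n_blend = sum(p > 0.5 for _, p in peak_persistence(blended))
# n_sep == 2 but n_blend == 1: the filtration sees the blend as one peak
```

No threshold choice rescues the blended case: the second component simply never appears in the diagram.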
Real data: beta training
The training set used here was built with the train-gui package, an interactive Matplotlib tool for manually curating Gaussian component labels on real GRS spectra. Starting from comparison data produced by `benchmarks compare`, the GUI lets you navigate pixels, toggle individual components, and manually fit new ones. The curated labels are saved to a JSON file that serves as ground truth for the β sweep below.
```shell
uv run benchmarks download
uv run benchmarks train --training-set packages/train-gui/data/training_set.json
```
Sweeping β from 2.0 to 4.5 on 52 hand-curated GRS spectra shows a gently peaked F1 curve (Figure 3):

Figure 3. F1, precision, and recall as a function of β on 52 real GRS spectra, scored against hand-curated training-set decompositions. The large precision–recall gap reflects unlabeled components in the curated set rather than false detections (see text). Generated by `uv run benchmarks train --training-set packages/train-gui/data/training_set.json`.
The optimal β on real data is 3.67 (F1 ≈ 0.58, precision ≈ 0.43, recall ≈ 0.89), with an F1 variation of only 0.084 across the sweep.
Why the default is β = 3.5, not β = 2.8
The synthetic benchmark peaks at β = 2.8 and the real-data benchmark at β = 3.67, but the difference between 2.8 and 3.5 is negligible: 0.925 vs 0.919 on synthetic data (a drop of 0.006). Both values sit on the flat plateau of the F1 curve, so accuracy is not the deciding factor.
The practical reason for preferring 3.5 is speed. A lower β admits more candidate peaks from the persistence filtration, many of which sit just above the noise floor. These marginal peaks generate initial Gaussian guesses that must be fitted by the Levenberg–Marquardt solver, the most expensive step in the pipeline, only to be discarded during component validation (SNR floor, matched-filter SNR). The fitting cost scales with the number of components, so a large number of doomed candidates slows the algorithm without improving the final decomposition.
At β = 3.5, the persistence threshold is high enough that most noise peaks are rejected before fitting, while genuine features (whose persistence sits well above the threshold) are retained. This provides a good balance between not missing real peaks near the detection limit and not wasting computation on candidates that will be removed downstream.
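SciPy's find_peaks with a prominence threshold is the direct 1D analogue of persistence filtering, and makes the cost argument concrete: on pure noise, the number of candidates that would reach the fitting stage shrinks as β rises. A sketch with an illustrative noise level:

```python
import numpy as np
from scipy.signal import find_peaks

rng = np.random.default_rng(42)
sigma_rms = 0.13                         # illustrative noise level in K
noise = rng.normal(0.0, sigma_rms, 424)  # a 424-channel pure-noise spectrum

# Peak prominence is the 1D analogue of topological persistence, so a
# prominence cut at beta * sigma_rms mimics the candidate selection.
counts = {beta: len(find_peaks(noise, prominence=beta * sigma_rms)[0])
          for beta in (2.0, 2.8, 3.5, 4.5)}
# counts is non-increasing in beta: fewer doomed candidates survive to
# the expensive Levenberg-Marquardt fitting step at higher thresholds.
```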
Interpreting precision and recall
Figure 3 shows a large gap between recall (≈ 0.89) and precision (≈ 0.43). The training set here consists of hand-curated decompositions from real GRS spectra, so these metrics are measured against human-labeled ground truth.
Recall (good). PHSpectra recovers about 89% of the components in the curated training set. When a human labels a feature as a real component, PHSpectra almost always finds it. Very few labeled components are missed.
Precision (needs context). Only about 43% of PHSpectra's detected components have a matching component in the curated set. This means PHSpectra consistently finds more components than the human labeler marked. These extra detections fall into two categories:
- Real features the labeler did not annotate. Hand-curated sets are rarely exhaustive — faint or partially blended lines may be left unlabeled, particularly in crowded regions. PHSpectra's persistence-based detection can resolve features that are easy to overlook during manual inspection.
- Possible over-decomposition. Some of the extra components may split a single broad feature into multiple narrower ones, producing a valid but different decomposition.
The key observation is that this precision/recall split is stable across the full β sweep: raising β does not meaningfully improve precision, because the extra components are not noise artifacts (those would vanish at higher β). They reflect genuine detections that fall outside the scope of the curated labels.
This interpretation is supported by the synthetic benchmark, where ground truth is known exactly: there, precision and recall are both high (overall F1 up to 0.925), confirming that PHSpectra does not systematically hallucinate components.
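The effect of the gap on the headline number follows from the harmonic-mean definition of F1, which is dominated by the lower of its two inputs:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Real-data operating point quoted above: precision ~ 0.43, recall ~ 0.89.
print(round(f1(0.43, 0.89), 2))  # 0.58: pulled toward the lower input
```

This is why the real-data F1 sits near 0.58 even though recall alone is close to 0.9.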
Why this matters
GaussPy requires a trained smoothing parameter α that is sensitive to the noise properties and spectral structure of each survey. The training procedure (Lindner et al. 2015) requires labeled decompositions and can produce different optimal α values for different regions of the same survey.
In contrast, PHSpectra's β parameter is:
- Survey-agnostic: values across the range spanned by the two optima (β ≈ 2.8–3.7) work well across both real and synthetic data with fundamentally different noise structures.
- Robust to perturbation: performance degrades gracefully rather than collapsing at non-optimal values. There is no cliff; both curves vary by less than 0.09 across the full β = 2.0–4.5 sweep.
- Physically interpretable: β directly controls the minimum significance (in units of σ_rms) for a peak to be considered real. A value of β = 3.5 means "reject anything less significant than a 3.5σ fluctuation," which is a natural and intuitive threshold.
- Efficient: at β = 3.5, most noise peaks are filtered out before the expensive fitting step, avoiding wasted computation on candidates that would be rejected by validation anyway.