Accuracy

The plots in this section can be reproduced with the two commands below; each runs in under a couple of seconds.

uv run benchmarks compare-plot
uv run benchmarks ncomp-rms-plot

True accuracy on synthetic data

When ground truth is known exactly (synthetic spectra with prescribed Gaussian components), PHSpectra achieves an overall $F_1 = 0.899$. The only challenging regime is heavily blended multi-component spectra ($F_1 = 0.749$), where any algorithm faces fundamental ambiguity. See the Beta parameter sensitivity section for the full breakdown.
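For reference, $F_1$ is the harmonic mean of precision and recall over matched components. The following is a minimal sketch of how such a score could be computed for one spectrum. The matching rule (greedy nearest-centre matching within a fixed tolerance `tol`) and the function name `f1_score` are assumptions made for illustration; the criterion used by the benchmark itself may differ.

```python
def f1_score(true_centres, fitted_centres, tol=1.0):
    """F1 for one spectrum under a hypothetical greedy centre-matching rule.

    A fitted component is a true positive if its centre lies within `tol`
    of a still-unmatched true centre (assumed rule, for illustration only).
    """
    unmatched = sorted(true_centres)
    tp = 0
    for c in sorted(fitted_centres):
        # Nearest true centre that has not been matched yet (None if exhausted).
        best = min(unmatched, key=lambda t: abs(t - c), default=None)
        if best is not None and abs(best - c) <= tol:
            unmatched.remove(best)
            tp += 1
    fp = len(fitted_centres) - tp  # fitted components with no true partner
    fn = len(true_centres) - tp    # true components that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

With three true components and three fitted ones, of which two fall within the tolerance, precision and recall are both 2/3 and so is $F_1$.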

Comparison with GaussPy+

We run both PHSpectra and GaussPy+ on all 4200 spectra in the GRS test field. GaussPy+ is run in Docker using GaussPyDecompose with the trained parameters from Riener et al. (2019): $\alpha_1 = 2.89$, $\alpha_2 = 6.65$, two-phase decomposition, SNR threshold = 3.0.

Fit quality (RMS)

| Metric | PHSpectra | GaussPy+ |
| --- | --- | --- |
| Mean RMS (K) | 0.1356 | 0.1345 |
| Lower RMS wins | 2592 / 4200 (62%) | 1488 / 4200 (35%) |

The two tools achieve nearly identical mean RMS, differing by about 0.001 K. In head-to-head comparisons, PHSpectra has the lower residual on 62% of spectra and GaussPy+ on 35%.
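The two rows of the table can be derived from per-spectrum residuals. The sketch below shows one way to do that; the function names and the dictionary layout are hypothetical and are not taken from the benchmark code. Because the two win counts in the table sum to 4080 rather than 4200, the sketch tracks ties separately; how the benchmark treats the remaining spectra is not stated here.

```python
import math


def residual_rms(data, model):
    """Root-mean-square of the residual between a spectrum and its fit."""
    return math.sqrt(sum((d - m) ** 2 for d, m in zip(data, model)) / len(data))


def head_to_head(rms_a, rms_b):
    """Summarise two equal-length lists of per-spectrum RMS values."""
    wins_a = sum(a < b for a, b in zip(rms_a, rms_b))
    wins_b = sum(b < a for a, b in zip(rms_a, rms_b))
    return {
        "mean_a": sum(rms_a) / len(rms_a),
        "mean_b": sum(rms_b) / len(rms_b),
        "wins_a": wins_a,
        "wins_b": wins_b,
        # Spectra where neither tool has a strictly lower RMS.
        "ties": len(rms_a) - wins_a - wins_b,
    }
```

A tool can win on more spectra while having a slightly higher mean, as in the table, when its losses are larger in magnitude than its wins.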

The RMS distributions overlap heavily and both exhibit a bimodal structure. This bimodality is a property of the input data, not the fitting: the GRS test field contains two spatial populations with different noise levels ($\sigma \approx 0.09$ K and $\sigma \approx 0.15$ K), likely due to varying integration time or field-edge effects. Both tools' residual RMS tracks the local noise floor.

**Figure 2.** Per-spectrum RMS scatter: each point is one of the 4200 spectra. Points below the 1:1 line (dashed) have lower GaussPy+ RMS; points above have lower PHSpectra RMS. Generated by `uv run benchmarks compare-plot`.

The scatter plot (Figure 2) reveals an asymmetry at high RMS: for spectra with residual RMS $\gtrsim 0.2$ K, a distinct population of points sits well below the 1:1 line (the side where, per the Figure 2 caption, GaussPy+ has the lower RMS), meaning GaussPy+ achieves lower residuals than PHSpectra on these noisy spectra. This is consistent with overfitting: as shown in Component count vs RMS below, GaussPy+ fits an average of 8.2 components for spectra with RMS $> 0.2$ K, compared to 1.6 for PHSpectra. More free parameters mechanically reduce the residual, but the resulting decomposition is not necessarily more physical. At low RMS ($\lesssim 0.2$ K), the two tools track each other closely along the diagonal.

Where decompositions differ

A systematic comparison reveals several recurring patterns of disagreement:

**Figure 3.** Six representative spectra where PHSpectra and GaussPy+ disagree, selected to cover different disagreement types. Each panel shows the raw spectrum (grey), PHSpectra fit (black, solid), and GaussPy+ fit (black, dashed) with individual components. Generated by `uv run benchmarks compare-plot`.

The six panels in Figure 3 show representative cases:

- PHS fewer components: GaussPy+ sometimes fits many components (up to 14) where PHSpectra finds fewer, better-constrained ones.
- PHS more components: PHSpectra resolves blended features that GaussPy+ misses entirely.
- PHS lower / GP+ lower RMS: each tool wins on different spectra, with different decomposition strategies.
- Same N, different positions: even with the same component count, the two algorithms place components differently.
- Different widths: the two algorithms sometimes assign different widths to the same feature.

Component count vs RMS

The scatter plot below shows the number of fitted components against residual RMS for both methods.

**Figure 4.** Number of fitted components vs residual RMS for PHSpectra (top) and GaussPy+ (bottom) on all 4200 GRS test-field spectra. The dashed line at RMS $= 0.2$ K separates two regimes. For RMS $\leq 0.2$ K the component counts are comparable ($\langle N \rangle = 2.6$ vs 1.9). For RMS $> 0.2$ K, GaussPy+ fits far more components ($\langle N \rangle = 8.2$) than PHSpectra ($\langle N \rangle = 1.6$). Generated by `uv run benchmarks ncomp-rms-plot`.

The two regimes in Figure 4 explain the RMS asymmetry seen in Figure 2. For low-noise spectra (RMS $\leq 0.2$ K), both tools detect comparable numbers of components. For noisy spectra (RMS $> 0.2$ K), GaussPy+ fits an average of 8.2 components compared to PHSpectra's 1.6. The extra components reduce the residual mechanically, because adding free parameters to a least-squares fit cannot raise the best-fit $\chi^2$ and in practice lowers it, which is why GaussPy+ achieves lower RMS on these spectra (Figure 2). However, fitting 8+ Gaussians to a noisy spectrum is more likely to trace noise structure than real emission.

PHSpectra's persistence threshold imposes a hard significance floor: no candidate peak survives unless its topological prominence exceeds $\beta \times \sigma_\mathrm{rms}$. On noisy spectra this means fewer (or zero) components are fitted, producing a higher RMS but a more physically defensible decomposition.
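The significance floor can be illustrated with a small, self-contained sketch of 0-dimensional persistence on a 1D signal, which is one standard way to compute topological prominence. This is not PHSpectra's own implementation: the function names, the convention of assigning the global maximum a persistence of (maximum minus minimum), and the values of `beta` and `sigma_rms` used below are all assumptions made for illustration.

```python
def peak_persistence(signal):
    """Return sorted (index, persistence) pairs for the peaks of a 1D signal.

    Samples are added from highest to lowest (superlevel-set filtration).
    A peak is born when a sample has no previously added neighbour and dies
    when its component merges into one with a higher peak (elder rule).
    """
    n = len(signal)
    order = sorted(range(n), key=lambda i: signal[i], reverse=True)
    parent = [None] * n  # union-find forest; None means "not added yet"

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    pairs = []
    for i in order:
        parent[i] = i
        for j in (i - 1, i + 1):
            if 0 <= j < n and parent[j] is not None:
                ri, rj = find(i), find(j)
                if ri == rj:
                    continue
                # The root of each component is its highest sample.
                older, younger = (rj, ri) if signal[rj] >= signal[ri] else (ri, rj)
                if younger != i:  # a real peak dies at the current level
                    pairs.append((younger, signal[younger] - signal[i]))
                parent[younger] = older
    # Assumed convention: the global maximum never merges, so use max - min.
    pairs.append((order[0], signal[order[0]] - min(signal)))
    return sorted(pairs)


def significant_peaks(signal, sigma_rms, beta):
    """Indices of peaks whose persistence exceeds beta * sigma_rms."""
    return [i for i, p in peak_persistence(signal) if p > beta * sigma_rms]
```

For the signal `[0.0, 5.0, 1.0, 3.0, 0.0, 0.4, 0.1]` the three local maxima have persistence 5.0, 2.0, and 0.4. With an illustrative `beta = 3.0` and `sigma_rms = 0.2` the threshold is 0.6, so only the first two peaks survive as candidates.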

Component widths

A population-level comparison of fitted widths shows no systematic bias between the two tools. Matching 6838 component pairs across 4200 spectra (Hungarian algorithm, position tolerance $< 2\sigma$), the median log-width ratio $\ln(\sigma_{\text{PHSpectra}} / \sigma_{\text{GaussPy+}})$ is near zero.

**Figure 5.** Distribution of $\ln(\sigma_{\rm PHSpectra} / \sigma_{\rm GaussPy+})$ for 6838 matched component pairs across 4200 spectra. The distribution is sharply peaked at zero, confirming no systematic width bias between the two tools. Generated by `uv run benchmarks compare-plot`.

The histogram in Figure 5 is sharply peaked at zero, confirming that neither tool systematically favours wider or narrower profiles. While individual spectra can show large width differences (the disagreement panel in Figure 3 includes such cases), these are isolated instances driven by different decomposition strategies, not a systematic effect.
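The matching and the log-width ratio described above can be sketched as follows. To stay dependency-free, this sketch finds the minimum-cost assignment by exhaustive search over permutations, which gives the same optimum as the Hungarian algorithm but is only practical for a handful of components; `scipy.optimize.linear_sum_assignment` would be the usual choice at scale. The `(centre, sigma)` tuple layout, the function names, and the reading of the tolerance as 2 times the width of the first tool's component are assumptions, since the text does not say which $\sigma$ the tolerance refers to.

```python
import math
import statistics
from itertools import permutations


def match_components(comps_a, comps_b, n_sigma=2.0):
    """Pair (centre, sigma) components of two fits for a single spectrum.

    Minimises the summed centre separation, then drops any pair whose
    separation is not below n_sigma * sigma of the comps_a member (assumed).
    """
    swap = len(comps_a) > len(comps_b)
    short, long_ = (comps_b, comps_a) if swap else (comps_a, comps_b)
    best_cost, best_perm = math.inf, ()
    # Exhaustive stand-in for the Hungarian algorithm (small inputs only).
    for perm in permutations(range(len(long_)), len(short)):
        cost = sum(abs(short[i][0] - long_[j][0]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    pairs = []
    for i, j in enumerate(best_perm):
        a, b = (long_[j], short[i]) if swap else (short[i], long_[j])
        if abs(a[0] - b[0]) < n_sigma * a[1]:
            pairs.append((a, b))
    return pairs


def median_log_width_ratio(pairs):
    """Median of ln(sigma_a / sigma_b) over matched pairs (nan if none)."""
    if not pairs:
        return math.nan
    return statistics.median(math.log(a[1] / b[1]) for a, b in pairs)
```

For example, matching `[(10.0, 2.0), (30.0, 1.5)]` against `[(10.5, 2.2), (29.0, 1.5), (60.0, 0.8)]` yields two pairs and leaves the component at 60.0 unmatched. A median log ratio near zero over many such pairs is what the text means by no systematic width bias.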
