Statistical Detection¶
Statistical detection methods identify anomalies without explicit rules by analyzing data distributions.
Outlier Detection¶
Z-Score Method¶
Concept: Measures how many standard deviations a value is from the mean.
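Formula: \(z = \dfrac{x - \mu}{\sigma}\), where \(\mu\) and \(\sigma\) are the mean and standard deviation of the values being compared.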
Flagged when: \(|z| > \mathrm{threshold}\) (default: 3.0)
Usage:
from fraud_detection.statistics.outliers import OutlierDetector

# Assumes an existing SparkSession (spark), a detector configuration (config),
# and a claims DataFrame (claims).
detector = OutlierDetector(spark, config)
claims = detector.detect_zscore_outliers(
    claims,
    column="charge_amount",
    output_column="charge_outlier",
    group_by=["procedure_code"],  # Optional grouping
)
Advantages:
- Simple and interpretable
- Works well for normally distributed data
Limitations:
- Sensitive to extreme outliers (which affect mean/std)
- Assumes normal distribution
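The sensitivity to extreme values can be seen in a small standalone sketch. This is plain Python, not the `OutlierDetector` implementation; the helper name `zscore_flags` and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def zscore_flags(values, threshold=3.0):
    """Return one boolean per value: True where |z| exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        # All values are identical, so nothing can be an outlier.
        return [False] * len(values)
    return [abs((v - mu) / sigma) > threshold for v in values]


# Twenty typical charges plus one extreme value: the extreme value is flagged.
charges = [100.0] * 10 + [110.0] * 10 + [1000.0]
flags = zscore_flags(charges)

# A small group with two extreme values: they inflate the mean and the
# standard deviation so much that neither reaches |z| > 3.
small_group = [100.0] * 5 + [1000.0, 1000.0]
masked = zscore_flags(small_group)
```

In the second case nothing is flagged, which is worth keeping in mind when grouping produces small groups.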
IQR Method¶
Concept: Uses quartiles to define "normal" range, robust to extreme values.
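Formula: \(\mathrm{IQR} = Q_3 - Q_1\), with lower bound \(Q_1 - k \cdot \mathrm{IQR}\) and upper bound \(Q_3 + k \cdot \mathrm{IQR}\), where \(Q_1\) and \(Q_3\) are the first and third quartiles.
Flagged when: the value falls outside these bounds (default \(k\): 1.5)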
Usage:
claims = detector.detect_iqr_outliers(
    claims,
    column="charge_amount",
    output_column="charge_iqr_outlier",
)
Advantages:
- Robust to extreme outliers
- No distribution assumption
Limitations:
- Less sensitive than Z-score
- May miss moderate anomalies
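As a rough illustration of the quartile bounds, here is a standalone sketch in plain Python. It is not the library's implementation: the helper names are hypothetical, and the quartile convention (`method="inclusive"` in `statistics.quantiles`) is an assumption, since different tools compute quartiles slightly differently.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) bounds: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def iqr_flags(values, k=1.5):
    """Return one boolean per value: True where the value is outside the bounds."""
    low, high = iqr_bounds(values, k)
    return [v < low or v > high for v in values]


# Eight typical charges and two extreme ones.
charges = [100.0] * 4 + [110.0] * 4 + [1000.0, 1000.0]
bounds = iqr_bounds(charges)
flags = iqr_flags(charges)
```

Both extreme charges fall outside the bounds here, because the quartiles are determined by the typical values and are not pulled upward by the extremes.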
Benford's Law Analysis¶
Concept¶
Benford's Law states that in many naturally occurring datasets, the leading digit follows a specific distribution:
| Digit | Expected Frequency |
|---|---|
| 1 | 30.1% |
| 2 | 17.6% |
| 3 | 12.5% |
| 4 | 9.7% |
| 5 | 7.9% |
| 6 | 6.7% |
| 7 | 5.8% |
| 8 | 5.1% |
| 9 | 4.6% |
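The percentages in this table come from the formula \(P(d) = \log_{10}(1 + 1/d)\). A short sketch that reproduces them (the function name `benford_expected` is chosen here for illustration):

```python
import math


def benford_expected(digit):
    """Expected frequency of a leading digit (1-9) under Benford's Law."""
    return math.log10(1 + 1 / digit)


expected = {d: benford_expected(d) for d in range(1, 10)}

# Print each digit with its expected share, rounded to one decimal place.
for d, p in expected.items():
    print(f"{d}: {p:.1%}")
```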
Fabricated amounts often violate this distribution because people inventing numbers tend to:
- Spread leading digits roughly evenly, which looks random but does not follow Benford's Law
- Avoid 1 as a leading digit
- Favor round numbers
Usage¶
from fraud_detection.statistics.benfords import BenfordsLawAnalyzer

analyzer = BenfordsLawAnalyzer(spark)

# Analyze by provider
claims = analyzer.analyze(
    claims,
    column="charge_amount",
    group_by="provider_id",
    threshold=0.15,  # Max deviation from expected
)

# Generate detailed report
report = analyzer.get_distribution_report(
    claims,
    column="charge_amount",
    group_by="provider_id",
)
report.show()
Output:
+-----------+-----+-----+------------------+------------------+---------+
|first_digit|count|total|observed_frequency|expected_frequency|deviation|
+-----------+-----+-----+------------------+------------------+---------+
|          1| 3012|10000|            0.3012|            0.3010|   0.0002|
|          2| 1821|10000|            0.1821|            0.1760|   0.0061|
...
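The deviation column can be approximated outside Spark with a few lines of plain Python. This is a sketch under assumptions, not the `BenfordsLawAnalyzer` code: the helper names are hypothetical, and comparing the largest absolute per-digit deviation against the threshold is one plausible reading of "max deviation from expected".

```python
import math
from collections import Counter


def first_digit(amount):
    """Leading non-zero digit of a positive amount, e.g. 0.042 -> 4."""
    # Scientific notation puts the leading digit first: '4.200000e-02'.
    return int(f"{amount:e}"[0])


def benford_deviations(amounts):
    """Observed minus expected frequency for each leading digit 1-9."""
    digits = [first_digit(a) for a in amounts if a > 0]
    if not digits:
        raise ValueError("need at least one positive amount")
    counts = Counter(digits)
    total = len(digits)
    return {
        d: counts.get(d, 0) / total - math.log10(1 + 1 / d)
        for d in range(1, 10)
    }


def max_abs_deviation(amounts):
    """Largest absolute per-digit deviation (assumed aggregation)."""
    return max(abs(dev) for dev in benford_deviations(amounts).values())


# Every amount starts with 5, so digit 5 is heavily over-represented.
skewed = [50.0, 55.0, 520.0, 5.25]
worst = max_abs_deviation(skewed)
exceeds_threshold = worst > 0.15
```

A sample this small is only for illustration; real comparisons need far more values per group.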
When Benford's Law Applies¶
✅ Applies to:
- Financial transactions
- Population data
- Physical measurements
- Geographic data
❌ Does not apply to:
- Assigned numbers (SSN, phone numbers)
- Constrained ranges (percentages, test scores)
- Small datasets
Provider Billing Analysis¶
Concept¶
Compare each provider's billing patterns against market norms.
claims = detector.detect_provider_outliers(
    claims,
    charge_column="charge_amount",
    provider_column="provider_id",
    procedure_column="procedure_code",
)
Metrics calculated:
- Provider's average charge per procedure
- Market average per procedure
- Deviation ratio (provider / market)
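Flagged when: the deviation ratio is > 2.0 or < 0.5 (the provider charges more than twice, or less than half, the market average for that procedure)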
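The three metrics can be sketched in plain Python as follows. This is an illustration only: the helper names are hypothetical, and it assumes the market average is taken over all claims for a procedure, including the provider's own.

```python
from collections import defaultdict
from statistics import mean


def provider_deviation_ratios(claims):
    """claims: iterable of (provider_id, procedure_code, charge_amount).

    Returns {(provider_id, procedure_code): provider_avg / market_avg}.
    """
    by_procedure = defaultdict(list)
    by_provider_procedure = defaultdict(list)
    for provider, procedure, charge in claims:
        by_procedure[procedure].append(charge)
        by_provider_procedure[(provider, procedure)].append(charge)
    # Assumption: the market average includes every provider's claims.
    market_avg = {proc: mean(vals) for proc, vals in by_procedure.items()}
    return {
        key: mean(vals) / market_avg[key[1]]
        for key, vals in by_provider_procedure.items()
    }


def is_flagged(ratio, high=2.0, low=0.5):
    """Apply the ratio thresholds described above."""
    return ratio > high or ratio < low


sample = [
    ("A", "99213", 100.0),
    ("B", "99213", 100.0),
    ("C", "99213", 100.0),
    ("D", "99213", 500.0),
]
ratios = provider_deviation_ratios(sample)
flagged = {key for key, ratio in ratios.items() if is_flagged(ratio)}
```

Only provider D is flagged in this sample. Note that D's own charges raise the market average, which pushes the other providers' ratios down to exactly 0.5, right at the lower threshold.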
Temporal Analysis¶
Detect sudden changes in billing patterns over time:
claims = detector.detect_temporal_outliers(
    claims,
    charge_column="charge_amount",
    date_column="service_date",
)
Logic: Compare each week's average charge against the 4-week rolling average.
Flagged when: the current weekly average is > 3 × the rolling average
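A minimal sketch of this comparison in plain Python, assuming the rolling average covers the four weeks before the current one and that weeks without a full history are not flagged. Both choices are assumptions, as is the helper name `temporal_flags`; the library may define the window differently.

```python
from statistics import mean


def temporal_flags(weekly_averages, window=4, factor=3.0):
    """weekly_averages: average charge per week, in chronological order.

    Returns one boolean per week: True where the week's average exceeds
    `factor` times the mean of the preceding `window` weeks.
    """
    flags = []
    for i, current in enumerate(weekly_averages):
        history = weekly_averages[max(0, i - window):i]
        if len(history) < window:
            flags.append(False)  # not enough history yet
            continue
        flags.append(current > factor * mean(history))
    return flags


weekly = [200.0, 210.0, 190.0, 200.0, 205.0, 900.0, 220.0]
flags = temporal_flags(weekly)
```

Only the sixth week (900.0) is flagged. The spike then enters the rolling window and raises the baseline for the following weeks, so a second spike shortly afterwards would be harder to detect.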
Combining Statistical Methods¶
Best practice is to combine multiple methods:
from pyspark.sql import functions as F

# Apply multiple outlier methods
claims = detector.detect_zscore_outliers(claims, "charge_amount", "zscore_flag")
claims = detector.detect_iqr_outliers(claims, "charge_amount", "iqr_flag")
claims = analyzer.analyze(claims, "charge_amount")  # expected to add benfords_anomaly

# Flag rows where at least two of the three methods agree
claims = claims.withColumn(
    "strong_statistical_flag",
    (F.col("zscore_flag") & F.col("iqr_flag"))
    | (F.col("benfords_anomaly") & (F.col("zscore_flag") | F.col("iqr_flag"))),
)
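The boolean expression above is equivalent to a two-out-of-three vote across the Z-score, IQR, and Benford flags. The following plain-Python check (the function name `strong_flag` is chosen here for illustration) confirms this over all eight combinations:

```python
from itertools import product


def strong_flag(zscore, iqr, benford):
    """Same logic as the strong_statistical_flag column expression."""
    return (zscore and iqr) or (benford and (zscore or iqr))


# Compare against "at least two of the three flags are set".
matches_majority = all(
    strong_flag(*combo) == (sum(combo) >= 2)
    for combo in product([False, True], repeat=3)
)
```

A consequence is that a Benford anomaly on its own never produces a strong flag; it only strengthens a row that one of the outlier methods has already flagged.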
Tuning Statistical Thresholds¶
| Scenario | Z-score threshold | IQR multiplier k | Benford threshold |
|---|---|---|---|
| High precision | 4.0 | 2.0 | 0.20 |
| Balanced | 3.0 | 1.5 | 0.15 |
| High recall | 2.0 | 1.0 | 0.10 |
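To get a feel for what the Z-score column means in practice, the share of values that would be flagged purely by chance can be computed for normally distributed data. This sketch uses only the standard normal tail probability; real charge data is rarely exactly normal, so treat the figures as a rough guide. The function name `normal_flag_rate` is chosen here for illustration.

```python
import math


def normal_flag_rate(threshold):
    """Share of normally distributed values with |z| > threshold."""
    return math.erfc(threshold / math.sqrt(2))


scenarios = [("High precision", 4.0), ("Balanced", 3.0), ("High recall", 2.0)]
for name, threshold in scenarios:
    rate = normal_flag_rate(threshold)
    print(f"{name}: |z| > {threshold} flags about {rate:.4%} of normal data")
```

The rates are roughly 0.006%, 0.27%, and 4.55% respectively, so moving from the balanced to the high-recall setting increases the volume of flags on well-behaved data by more than an order of magnitude.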
Limitations¶
- Requires sufficient data: statistical methods need enough volume per group to be meaningful
- Flags legitimate outliers: unusual but valid claims deviate from the norm too
- Context-blind: cannot account for the business reasons behind an anomaly
- Gaming risk: sophisticated fraudsters can keep their billing within normal statistical ranges
Always combine statistical methods with rule-based and duplicate detection for best results.