Insurance Fraud Detection¶
A PySpark-based fraud detection system for insurance claims, deployed on AWS using CDK.
Overview¶
This project demonstrates modern data engineering practices for detecting fraudulent insurance claims at scale:
- Batch processing of insurance claims using PySpark
- Rule-based anomaly detection for flagging unusual billing patterns
- Statistical outlier detection for identifying suspicious charges
- Duplicate claim detection using similarity matching
- AWS-native deployment using CDK infrastructure as code
Features¶
Rule-Based Detection¶
Identify fraud through configurable business rules:
- Provider billing pattern analysis
- Procedure code validation
- Geographic anomaly detection (state mismatches, impossible travel)
- Temporal pattern analysis (weekend billing, claim frequency)
- Round amount detection
Statistical Detection¶
Apply statistical methods to find anomalies:
- Z-score outlier detection
- Interquartile Range (IQR) analysis
- Benford's Law analysis for charge amounts
- Provider-level billing deviation analysis
Duplicate Detection¶
Find duplicate and near-duplicate claims:
- Exact match detection on key fields
- Fuzzy matching for near-duplicates
- Configurable similarity thresholds
- Time-window based detection
Quick Start¶
# Install all dependencies (Python + Node.js + pre-commit hooks)
make install
# Generate sample data
make sample-data
# Run fraud detection locally
make run-local
# Analyze results
make analyze
Architecture¶
flowchart TB
subgraph Input
A[S3 Raw Data<br/>Claims]
end
subgraph Processing
B[EMR Spark<br/>Processing]
end
subgraph Output
C[S3 Results<br/>Flagged]
end
subgraph Catalog
D[Glue Catalog<br/>Metadata]
end
subgraph Analytics
E[Athena<br/>Queries]
end
A --> B
B --> C
C --> D
D --> E
Project Structure¶
insurance-fraud/
├── packages/
│ ├── fraud_detection/ # PySpark fraud detection application
│ ├── infra/ # AWS CDK infrastructure
│ └── docs/ # This documentation site
├── pyproject.toml # Root project configuration
└── README.md
Data Sources¶
This project is designed to work with:
- CMS Medicare Provider Utilization and Payment Data
- Synthetic claims data for development and testing
License¶
MIT