Rules API¶
Billing Pattern Rules¶
fraud_detection.rules.billing_patterns¶
Billing pattern analysis for insurance fraud detection.
This module implements rule-based detection of suspicious billing patterns that commonly indicate fraudulent activity. It analyzes temporal patterns, claim frequencies, and amount characteristics to identify anomalies such as:
- Providers billing an implausible number of procedures per day
- Patients receiving an unusual number of services in a short period
- Suspicious weekend billing for non-emergency services
- Claims with suspiciously round dollar amounts
- Procedure unbundling (billing separately for bundled services)
Classes¶
BillingPatternRules¶
Detect suspicious billing patterns indicative of fraud.
This class implements a collection of rule-based checks that identify billing anomalies commonly associated with fraudulent claims. Each method adds one or more flag columns to the input DataFrame indicating whether the claim exhibits the suspicious pattern.
Parameters¶
spark : SparkSession
Active Spark session for distributed processing.
config : DetectionConfig
Configuration object containing detection thresholds:
- ``max_daily_procedures_per_provider``: Maximum procedures a provider can reasonably bill in one day.
- ``max_claims_per_patient_per_day``: Maximum claims expected for a single patient per day.
Examples¶
from fraud_detection.rules.billing_patterns import BillingPatternRules
rules = BillingPatternRules(spark, config)
claims = rules.check_daily_procedure_limits(claims)
claims = rules.check_round_amounts(claims)
suspicious = claims.filter(claims.daily_procedure_limit_exceeded)
Source code in packages/fraud_detection/src/fraud_detection/rules/billing_patterns.py
Functions¶
check_daily_procedure_limits(claims)¶
Flag providers exceeding daily procedure limits.
Identifies providers billing an implausibly high number of procedures on a single day, which may indicate phantom billing (billing for services not rendered) or upcoding schemes.
Billing 100 or more procedures in a single day is physically impossible for most service types, so such volumes warrant investigation.
Parameters¶
claims : DataFrame
Input claims with provider_id and service_date columns.
Returns¶
DataFrame
Claims with added columns:
- ``daily_procedure_count`` : int - Total procedures by this provider on this date.
- ``daily_procedure_limit_exceeded`` : bool - True if count exceeds configured limit.
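The counting logic described above can be sketched in plain Python. This is an illustration of the rule only, not the Spark implementation: the helper name `flag_daily_procedure_limits` is hypothetical, each claim row is assumed to count as one procedure, and "exceeds" is assumed to mean strictly greater than the configured limit.

```python
from collections import Counter


def flag_daily_procedure_limits(claims, max_daily_procedures_per_provider):
    """Return copies of the claim dicts with the count and flag columns added."""
    # Count rows per (provider, date); assumes one row equals one procedure.
    counts = Counter((c["provider_id"], c["service_date"]) for c in claims)
    flagged = []
    for c in claims:
        n = counts[(c["provider_id"], c["service_date"])]
        flagged.append({
            **c,
            "daily_procedure_count": n,
            # Assumption: the limit itself is still acceptable.
            "daily_procedure_limit_exceeded": n > max_daily_procedures_per_provider,
        })
    return flagged


claims = [
    {"provider_id": "P1", "service_date": "2024-03-04"},
    {"provider_id": "P1", "service_date": "2024-03-04"},
    {"provider_id": "P1", "service_date": "2024-03-04"},
    {"provider_id": "P2", "service_date": "2024-03-04"},
]
result = flag_daily_procedure_limits(claims, max_daily_procedures_per_provider=2)
```

The same group-and-count pattern would apply to a per-patient daily check by swapping the grouping key.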
check_patient_claim_frequency(claims)¶
Flag patients with abnormally high daily claim frequency.
Identifies patients receiving an unusual number of services on the same day, which may indicate claim splitting (dividing one service into multiple claims for higher reimbursement) or duplicate billing.
Parameters¶
claims : DataFrame
Input claims with patient_id and service_date columns.
Returns¶
DataFrame
Claims with added columns:
- ``patient_daily_claims`` : int - Number of claims for this patient on this date.
- ``patient_frequency_exceeded`` : bool - True if count exceeds configured limit.
check_procedure_unbundling(claims, bundled_procedures)¶
Detect procedure unbundling fraud.
Unbundling occurs when a provider bills separately for procedures that should be billed together as a single comprehensive service at a lower combined rate. This is a common fraud scheme to increase reimbursement.
For example, a complete blood panel should be billed as one procedure, not as individual tests for each blood component.
Parameters¶
claims : DataFrame
Input claims with patient_id, provider_id, service_date, and procedure_code columns.
bundled_procedures : DataFrame
Reference table defining procedure bundles with columns:
- ``bundled_code`` : str - The correct bundled procedure code.
- ``unbundled_code_1`` : str - First component code when unbundled.
- ``unbundled_code_2`` : str - Second component code when unbundled.
Returns¶
DataFrame
Claims with added columns:
- ``procedures_same_day`` : array<str> - All procedure codes for this patient/provider/date combination.
- ``unbundling_flag`` : bool - True if unbundled procedure pair detected.
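A minimal plain-Python sketch of the unbundling check, under stated assumptions: the helper name `flag_unbundling` and the procedure codes are hypothetical placeholders, and every claim in a patient/provider/date group is assumed to receive the flag when both component codes of any bundle appear in that group. The actual Spark logic may differ.

```python
from collections import defaultdict


def flag_unbundling(claims, bundled_procedures):
    """Return copies of the claim dicts with the two unbundling columns added."""
    # Collect every procedure code billed for the same patient/provider/date.
    codes_by_visit = defaultdict(set)
    for c in claims:
        key = (c["patient_id"], c["provider_id"], c["service_date"])
        codes_by_visit[key].add(c["procedure_code"])

    flagged = []
    for c in claims:
        same_day = codes_by_visit[(c["patient_id"], c["provider_id"], c["service_date"])]
        # Unbundled if both component codes of any bundle were billed together.
        unbundled = any(
            b["unbundled_code_1"] in same_day and b["unbundled_code_2"] in same_day
            for b in bundled_procedures
        )
        flagged.append({
            **c,
            "procedures_same_day": sorted(same_day),
            "unbundling_flag": unbundled,
        })
    return flagged


# Placeholder codes, not real procedure codes.
bundles = [{"bundled_code": "AB", "unbundled_code_1": "A1", "unbundled_code_2": "A2"}]
claims = [
    {"patient_id": "pt1", "provider_id": "P1", "service_date": "2024-03-04", "procedure_code": "A1"},
    {"patient_id": "pt1", "provider_id": "P1", "service_date": "2024-03-04", "procedure_code": "A2"},
    {"patient_id": "pt2", "provider_id": "P1", "service_date": "2024-03-04", "procedure_code": "A1"},
]
result = flag_unbundling(claims, bundles)
```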
check_round_amounts(claims)¶
Flag claims with suspiciously round charge amounts.
Legitimate medical charges typically result in non-round amounts due to fee schedules, adjustments, and itemized billing. A high proportion of perfectly round amounts (e.g., $100, $500, $1000) may indicate estimated or fabricated charges rather than actual services rendered.
A claim is flagged if its charge is a round-hundred amount AND more than 20% of the provider's claims have round amounts.
Parameters¶
claims : DataFrame
Input claims with provider_id and charge_amount columns.
Returns¶
DataFrame
Claims with added columns:
- ``is_round_hundred`` : bool - True if amount is divisible by 100.
- ``is_round_fifty`` : bool - True if amount is divisible by 50.
- ``provider_round_ratio`` : float - Proportion of provider's claims with round amounts.
- ``round_amount_flag`` : bool - True if round amount from high-round-ratio provider.
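The rule can be sketched in plain Python as follows. The helper name `flag_round_amounts` is hypothetical, and one detail is an assumption: the text does not say whether the provider ratio counts round-hundred or round-fifty amounts, so this sketch counts round-hundred amounts only.

```python
from collections import defaultdict

ROUND_RATIO_THRESHOLD = 0.20  # the 20% cutoff stated in the rule description


def flag_round_amounts(claims):
    """Return copies of the claim dicts with the four round-amount columns added."""
    totals = defaultdict(int)
    round_counts = defaultdict(int)
    rows = []
    for c in claims:
        is_hundred = c["charge_amount"] % 100 == 0
        is_fifty = c["charge_amount"] % 50 == 0
        totals[c["provider_id"]] += 1
        round_counts[c["provider_id"]] += is_hundred  # assumption: ratio uses round hundreds
        rows.append({**c, "is_round_hundred": is_hundred, "is_round_fifty": is_fifty})

    for r in rows:
        ratio = round_counts[r["provider_id"]] / totals[r["provider_id"]]
        r["provider_round_ratio"] = ratio
        # Both conditions must hold: round claim AND high-ratio provider.
        r["round_amount_flag"] = r["is_round_hundred"] and ratio > ROUND_RATIO_THRESHOLD
    return rows


claims = (
    [{"provider_id": "P1", "charge_amount": a} for a in (100.0, 237.5, 500.0, 82.1)]
    + [{"provider_id": "P2", "charge_amount": a} for a in [200.0] + [123.45] * 9]
)
result = flag_round_amounts(claims)
```

Provider P1 has half of its claims at round hundreds, so its two round claims are flagged. P2 has one round claim out of ten, which stays under the threshold.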
check_weekend_billing(claims)¶
Flag suspicious weekend billing patterns.
Identifies providers with unusually high weekend billing volumes. Most medical practices are closed on weekends, so high weekend billing for routine (non-emergency) procedures may indicate fraudulent backdating of claims or fabricated services.
A claim is flagged if it falls on a weekend AND more than 30% of the provider's total claims are weekend claims.
Parameters¶
claims : DataFrame
Input claims with provider_id and service_date columns.
Returns¶
DataFrame
Claims with added columns:
- ``day_of_week`` : int - Day of week (1=Sunday, 7=Saturday).
- ``is_weekend`` : bool - True if service date falls on weekend.
- ``provider_weekend_ratio`` : float - Proportion of provider's claims on weekends.
- ``weekend_billing_flag`` : bool - True if weekend claim from high-weekend provider.
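A plain-Python sketch of the weekend rule, assuming `service_date` values are `datetime.date` objects. The helper name `flag_weekend_billing` is hypothetical. Python's `isoweekday()` numbers Monday as 1, so the sketch shifts it to the 1=Sunday, 7=Saturday convention used by the ``day_of_week`` column.

```python
from collections import defaultdict
from datetime import date

WEEKEND_RATIO_THRESHOLD = 0.30  # the 30% cutoff stated in the rule description


def flag_weekend_billing(claims):
    """Return copies of the claim dicts with the four weekend columns added."""
    totals = defaultdict(int)
    weekend_counts = defaultdict(int)
    rows = []
    for c in claims:
        # isoweekday(): 1=Monday..7=Sunday; shift to 1=Sunday..7=Saturday.
        day_of_week = c["service_date"].isoweekday() % 7 + 1
        is_weekend = day_of_week in (1, 7)
        totals[c["provider_id"]] += 1
        weekend_counts[c["provider_id"]] += is_weekend
        rows.append({**c, "day_of_week": day_of_week, "is_weekend": is_weekend})

    for r in rows:
        ratio = weekend_counts[r["provider_id"]] / totals[r["provider_id"]]
        r["provider_weekend_ratio"] = ratio
        r["weekend_billing_flag"] = r["is_weekend"] and ratio > WEEKEND_RATIO_THRESHOLD
    return rows


claims = [
    {"provider_id": "P1", "service_date": date(2024, 3, 2)},  # Saturday
    {"provider_id": "P1", "service_date": date(2024, 3, 3)},  # Sunday
    {"provider_id": "P1", "service_date": date(2024, 3, 4)},  # Monday
    {"provider_id": "P2", "service_date": date(2024, 3, 2)},  # Saturday
    {"provider_id": "P2", "service_date": date(2024, 3, 4)},  # Monday
    {"provider_id": "P2", "service_date": date(2024, 3, 5)},  # Tuesday
    {"provider_id": "P2", "service_date": date(2024, 3, 6)},  # Wednesday
]
result = flag_weekend_billing(claims)
```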
Geographic Rules¶
fraud_detection.rules.geographic¶
Geographic anomaly detection for insurance fraud analysis.
This module identifies suspicious geographic patterns in claims data that may indicate fraudulent activity. Geographic analysis is particularly effective at detecting:
- Identity theft: Claims submitted using stolen patient information, where the victim lives far from the billing provider.
- Phantom billing: Services billed for patients who could not have reasonably traveled to the provider location.
- Fraud rings: Organized schemes where patients are recruited from specific geographic areas to submit false claims.
- Impossible travel: Patients appearing at multiple distant locations on the same day, indicating identity misuse.
Classes¶
GeographicRules¶
Detect geographic anomalies indicative of insurance fraud.
Analyzes spatial relationships between patients and providers to identify claims that are geographically implausible. Supports both precise coordinate-based distance calculations and state-level heuristics when coordinates are unavailable.
Parameters¶
spark : SparkSession
Active Spark session for distributed processing.
config : DetectionConfig
Configuration object containing:
- ``max_provider_patient_distance_miles``: Maximum reasonable distance between patient and provider before flagging.
Examples¶
from fraud_detection.rules.geographic import GeographicRules
geo_rules = GeographicRules(spark, config)
claims = geo_rules.check_state_mismatch(claims)
claims = geo_rules.check_provider_patient_distance(claims)
out_of_area = claims.filter(claims.distance_exceeded)
Source code in packages/fraud_detection/src/fraud_detection/rules/geographic.py
Functions¶
check_geographic_clustering(claims)¶
Detect suspicious geographic concentration of a provider's patients.
Legitimate providers typically serve patients from a natural geographic distribution reflecting their location and specialty. A provider with a high volume of patients concentrated in a single distant state may indicate an organized fraud ring recruiting patients from a specific area.
Flags providers with 100+ patients where all patients come from a single state different from the provider's state.
Parameters¶
claims : DataFrame
Input claims with provider_id, patient_id, patient_state, and provider_state columns.
Returns¶
DataFrame
Claims with added columns:
- ``unique_patient_states`` : int - Number of distinct states patients come from.
- ``total_patients`` : int - Total patient count for this provider.
- ``geographic_clustering_flag`` : bool - True if suspicious clustering detected.
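The clustering rule can be illustrated with a short plain-Python sketch. The helper name `flag_geographic_clustering` is hypothetical, `total_patients` is assumed to count distinct patient IDs, and each provider is assumed to have a single `provider_state`.

```python
from collections import defaultdict

MIN_PATIENTS = 100  # the 100+ patient cutoff stated in the rule description


def flag_geographic_clustering(claims):
    """Return copies of the claim dicts with the three clustering columns added."""
    patients = defaultdict(set)
    states = defaultdict(set)
    for c in claims:
        patients[c["provider_id"]].add(c["patient_id"])
        states[c["provider_id"]].add(c["patient_state"])

    flagged = []
    for c in claims:
        pid = c["provider_id"]
        # With exactly one patient state, this claim's patient_state is that state.
        single_other_state = (
            len(states[pid]) == 1 and c["patient_state"] != c["provider_state"]
        )
        flagged.append({
            **c,
            "unique_patient_states": len(states[pid]),
            "total_patients": len(patients[pid]),
            "geographic_clustering_flag": (
                len(patients[pid]) >= MIN_PATIENTS and single_other_state
            ),
        })
    return flagged


def make_claims(provider_id, provider_state, patient_state, n):
    """Build n synthetic claims, one per distinct patient."""
    return [
        {"provider_id": provider_id, "patient_id": f"{provider_id}-{i}",
         "patient_state": patient_state, "provider_state": provider_state}
        for i in range(n)
    ]


claims = (
    make_claims("P1", "NY", "FL", 100)    # large, all out of state: flagged
    + make_claims("P2", "NY", "NY", 100)  # large, all in state: not flagged
    + make_claims("P3", "NY", "FL", 5)    # out of state but too small: not flagged
)
result = flag_geographic_clustering(claims)
```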
check_impossible_travel(claims)¶
Detect physically impossible travel patterns for patients.
Identifies patients who have claims from multiple distant provider locations on the same day that would require impossible travel speeds. This is a strong indicator of identity theft (multiple people using the same patient identity) or systematic billing fraud.
Parameters¶
claims : DataFrame
Input claims with patient_id, service_date, and optionally provider_lat, provider_lon for precise detection.
Returns¶
DataFrame
Claims with added columns:
- ``provider_locations`` : array<struct> - All provider coordinates visited same day.
- ``num_providers_same_day`` : int - Count of distinct provider locations.
- ``impossible_travel_flag`` : bool - True if the patient has claims from more than 3 provider locations on the same day.
Notes¶
The current implementation uses a simple heuristic: more than 3 provider locations for the same patient on the same day. A more sophisticated approach would calculate pairwise distances and required travel speeds, but this requires a UDF for efficient computation.
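The threshold heuristic from the note can be sketched in plain Python. The helper name `flag_impossible_travel` and the coordinates are hypothetical, and a "provider location" is assumed to be a distinct (latitude, longitude) pair.

```python
from collections import defaultdict

MAX_LOCATIONS_PER_DAY = 3  # the >3 heuristic stated in the note


def flag_impossible_travel(claims):
    """Return copies of the claim dicts with the three travel columns added."""
    # Distinct provider coordinates per patient per day.
    locations = defaultdict(set)
    for c in claims:
        key = (c["patient_id"], c["service_date"])
        locations[key].add((c["provider_lat"], c["provider_lon"]))

    flagged = []
    for c in claims:
        locs = locations[(c["patient_id"], c["service_date"])]
        flagged.append({
            **c,
            "provider_locations": sorted(locs),
            "num_providers_same_day": len(locs),
            "impossible_travel_flag": len(locs) > MAX_LOCATIONS_PER_DAY,
        })
    return flagged


# Arbitrary illustrative coordinates.
spots = [(40.0, -74.0), (41.0, -87.0), (34.0, -118.0), (29.0, -95.0)]
claims = [
    {"patient_id": "A", "service_date": "2024-03-04", "provider_lat": lat, "provider_lon": lon}
    for lat, lon in spots
] + [
    {"patient_id": "B", "service_date": "2024-03-04", "provider_lat": 40.0, "provider_lon": -74.0},
    {"patient_id": "B", "service_date": "2024-03-04", "provider_lat": 40.0, "provider_lon": -74.0},
]
result = flag_impossible_travel(claims)
```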
check_provider_patient_distance(claims)¶
Flag claims where patient and provider locations are unusually distant.
Patients typically receive care from providers within a reasonable travel distance. Claims involving distant providers may indicate identity theft (someone using a stolen identity far from the victim's home) or phantom billing (billing for services never rendered).
If latitude/longitude coordinates are available, the method calculates the haversine (great-circle) distance between patient and provider. Otherwise, it falls back to state-level mismatch detection.
Parameters¶
claims : DataFrame
Input claims. Distance calculation requires the columns patient_lat, patient_lon, provider_lat, and provider_lon.
Returns¶
DataFrame
Claims with added columns:
- ``distance_miles`` : float - Calculated distance (if coordinates available).
- ``distance_exceeded`` : bool - True if distance exceeds configured maximum.
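For reference, the haversine distance can be computed as in the sketch below. The helper names `haversine_miles` and `flag_distance`, the mean Earth radius of 3958.8 miles, and the 100-mile limit in the usage example are assumptions for illustration. The module's own Spark expression may be written differently.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(min(1.0, a)))


def flag_distance(claim, max_distance_miles):
    """Return a copy of the claim dict with the two distance columns added."""
    d = haversine_miles(
        claim["patient_lat"], claim["patient_lon"],
        claim["provider_lat"], claim["provider_lon"],
    )
    return {**claim, "distance_miles": d, "distance_exceeded": d > max_distance_miles}


# Patient near New York City, provider near Los Angeles.
claim = {
    "patient_lat": 40.7128, "patient_lon": -74.0060,
    "provider_lat": 34.0522, "provider_lon": -118.2437,
}
result = flag_distance(claim, max_distance_miles=100)
```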
check_state_mismatch(claims)¶
Flag claims where patient and provider are in different states.
While cross-state healthcare is legitimate in border areas or for specialized care, out-of-state claims are statistically more likely to be fraudulent and warrant additional scrutiny, especially when combined with other risk indicators.
Parameters¶
claims : DataFrame
Input claims with patient_state and provider_state columns.
Returns¶
DataFrame
Claims with added column:
- ``state_mismatch`` : bool - True if patient and provider states differ.
Duplicate Detection¶
fraud_detection.rules.duplicates¶
Duplicate claim detection for insurance fraud analysis.
This module identifies both exact and near-duplicate claims, which are common indicators of fraudulent billing practices such as double-billing or claim resubmission with minor modifications to avoid detection.
Classes¶
DuplicateDetector¶
Detect duplicate and near-duplicate insurance claims.
This detector identifies two types of duplicates:
- Exact duplicates: Claims with identical key fields (patient, provider, procedure, date, and amount). These often indicate accidental or intentional double-billing.
- Near-duplicates: Claims that are highly similar but not identical, potentially indicating resubmission with minor changes to evade detection. Detection uses configurable similarity thresholds and time windows.
Parameters¶
spark : SparkSession
Active Spark session for distributed processing.
config : DetectionConfig
Configuration object containing detection thresholds:
- ``duplicate_similarity_threshold``: Minimum similarity score (0-1) for near-duplicate detection.
- ``duplicate_time_window_days``: Maximum days between service dates for claims to be considered potential near-duplicates.
Examples¶
from fraud_detection.rules.duplicates import DuplicateDetector
detector = DuplicateDetector(spark, config)
flagged_claims = detector.detect(claims_df)
duplicates = flagged_claims.filter(flagged_claims.is_duplicate)
Source code in packages/fraud_detection/src/fraud_detection/rules/duplicates.py
Functions¶
detect(claims)¶
Run full duplicate detection pipeline on claims data.
Sequentially applies exact and near-duplicate detection, then combines results into unified duplicate flags. A claim is marked as duplicate if it matches either criterion.
Parameters¶
claims : DataFrame
Input claims with required columns: claim_id, patient_id, provider_id, procedure_code, service_date, charge_amount.
Returns¶
DataFrame
Original claims with additional columns:
- ``is_duplicate`` : bool - True if claim is any type of duplicate.
- ``duplicate_of`` : str - claim_id of the original claim this duplicates.
- ``is_exact_duplicate`` : bool - True if exact field match.
- ``is_near_duplicate`` : bool - True if similarity-based match.
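The two-stage pipeline can be sketched in plain Python as below. This is an illustration under stated assumptions, not the detector's implementation: the names `detect_duplicates` and `similarity` are hypothetical, and the similarity score shown (three field matches averaged with charge closeness) is an invented stand-in, since the documentation does not specify the metric. The first claim seen is treated as the original.

```python
from datetime import date

KEY_FIELDS = ("patient_id", "provider_id", "procedure_code", "service_date", "charge_amount")


def similarity(a, b):
    """Hypothetical score in [0, 1]: three field matches averaged with amount closeness."""
    matches = sum(a[f] == b[f] for f in ("patient_id", "provider_id", "procedure_code"))
    high = max(a["charge_amount"], b["charge_amount"])
    closeness = min(a["charge_amount"], b["charge_amount"]) / high if high else 1.0
    return (matches + closeness) / 4


def detect_duplicates(claims, similarity_threshold, time_window_days):
    """Return copies of the claim dicts with the four duplicate columns added."""
    first_seen = {}  # exact key -> claim_id of the first claim with that key
    out = []
    for c in claims:
        row = {**c, "is_exact_duplicate": False, "is_near_duplicate": False, "duplicate_of": None}
        key = tuple(c[f] for f in KEY_FIELDS)
        if key in first_seen:
            row["is_exact_duplicate"] = True
            row["duplicate_of"] = first_seen[key]
        else:
            first_seen[key] = c["claim_id"]
            # Compare against earlier originals only (quadratic; fine for a sketch).
            for earlier in out:
                if earlier["is_duplicate"]:
                    continue
                gap = abs((c["service_date"] - earlier["service_date"]).days)
                if gap <= time_window_days and similarity(c, earlier) >= similarity_threshold:
                    row["is_near_duplicate"] = True
                    row["duplicate_of"] = earlier["claim_id"]
                    break
        row["is_duplicate"] = row["is_exact_duplicate"] or row["is_near_duplicate"]
        out.append(row)
    return out


base = {"patient_id": "A", "provider_id": "P", "procedure_code": "X",
        "service_date": date(2024, 3, 4), "charge_amount": 250.0}
claims = [
    {**base, "claim_id": "C1"},
    {**base, "claim_id": "C2"},  # identical key fields: exact duplicate of C1
    {**base, "claim_id": "C3", "service_date": date(2024, 3, 6), "charge_amount": 245.0},
    {"claim_id": "C4", "patient_id": "B", "provider_id": "Q", "procedure_code": "Y",
     "service_date": date(2024, 3, 4), "charge_amount": 80.0},
]
result = detect_duplicates(claims, similarity_threshold=0.9, time_window_days=7)
```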