AWS Infrastructure¶
Overview¶
The fraud detection system is deployed on AWS using CDK infrastructure as code.
Architecture Diagram¶
flowchart TB
Upload[File Upload] --> S3Data
subgraph EventDriven[Event-Driven Pipeline]
S3Data[(S3 Data<br/>Bucket)] --> EventBridge[EventBridge]
EventBridge --> StepFunctions[Step Functions]
end
subgraph VPC[VPC]
subgraph Private[Private Subnets]
subgraph EMR[EMR Cluster]
Master[Master<br/>m5.xlarge]
Core1[Core 1<br/>m5.xlarge]
Core2[Core 2<br/>m5.xlarge]
end
end
subgraph Public[Public Subnets]
NAT[NAT Gateway]
end
end
subgraph Storage[S3 Buckets]
S3Results[(S3 Results<br/>Bucket)]
S3Scripts[(S3 Scripts<br/>Bucket)]
end
subgraph Catalog[Data Catalog]
Glue[Glue Data Catalog]
end
subgraph Analytics[Analytics]
Athena[Athena]
end
StepFunctions --> EMR
Private --> NAT
NAT --> Internet((Internet))
S3Data --> Glue
S3Results --> Glue
S3Scripts --> EMR
EMR --> S3Results
Glue --> Athena
CDK Stacks¶
Data Lake Stack¶
Resources:
- S3 bucket for raw claims data (EventBridge notifications enabled)
- S3 bucket for processed results
- Glue database for metadata
- Glue crawlers for schema discovery
Key Configuration:
# Data bucket with lifecycle rules
self.data_bucket = s3.Bucket(
encryption=s3.BucketEncryption.S3_MANAGED,
versioned=True,
lifecycle_rules=[
s3.LifecycleRule(
transitions=[
s3.Transition(
storage_class=s3.StorageClass.INFREQUENT_ACCESS,
transition_after=cdk.Duration.days(90),
)
],
),
],
)
Processing Stack¶
Resources:
- VPC with public/private subnets
- EMR security groups
- IAM roles for EMR
- Step Functions state machine
- EventBridge rule (S3 trigger)
- Scripts bucket for Spark jobs
EMR Configuration:
instances=tasks.EmrCreateCluster.InstancesConfigProperty(
instance_fleets=[
tasks.EmrCreateCluster.InstanceFleetConfigProperty(
instance_fleet_type=tasks.EmrCreateCluster.InstanceRoleType.MASTER,
target_on_demand_capacity=1,
instance_type_configs=[
tasks.EmrCreateCluster.InstanceTypeConfigProperty(
instance_type="m5.xlarge",
)
],
),
tasks.EmrCreateCluster.InstanceFleetConfigProperty(
instance_fleet_type=tasks.EmrCreateCluster.InstanceRoleType.CORE,
target_on_demand_capacity=2,
instance_type_configs=[
tasks.EmrCreateCluster.InstanceTypeConfigProperty(
instance_type="m5.xlarge",
)
],
),
],
),
Analytics Stack¶
Resources:
- Athena workgroup
- Glue tables for claims and results
- Named queries for common analysis
- Query results bucket
Step Functions Pipeline¶
The pipeline is triggered automatically when files are uploaded to the claims/ prefix in the data bucket.
stateDiagram-v2
[*] --> TransformInput: S3 Event
TransformInput --> CreateCluster
CreateCluster --> RunFraudDetection
RunFraudDetection --> TerminateCluster
TerminateCluster --> [*]
RunFraudDetection --> TerminateCluster: on failure
Pipeline States:
- TransformInput - Extracts S3 bucket and key from EventBridge event
- CreateCluster - Spins up EMR cluster with Python 3.13 bootstrap
- RunFraudDetection - Submits Spark job with dynamic input path
- TerminateCluster - Cleans up EMR resources
EventBridge Rule:
events.Rule(
event_pattern=events.EventPattern(
source=["aws.s3"],
detail_type=["Object Created"],
detail={
"bucket": {"name": events.Match.exact_string(data_bucket.bucket_name)},
"object": {"key": events.Match.prefix("claims/")},
},
),
)
Deployment¶
Prerequisites¶
- AWS CLI configured
- CDK CLI installed
- Bootstrap CDK (first time)
Deploy All Stacks¶
Deploy Individual Stacks¶
# From packages/infra
cd packages/infra
yarn cdk deploy fraud-detection-data-lake
yarn cdk deploy fraud-detection-processing
yarn cdk deploy fraud-detection-analytics
Destroy Stacks¶
Troubleshooting Stack Deletion¶
Stack deletion may fail due to:
- Athena workgroup not empty - Contains saved queries or query history
- EMR security groups with circular dependencies - Security groups reference each other
- Running EMR clusters or Step Functions executions
If yarn destroy fails, run the cleanup script first:
# Run cleanup script to remove problematic resources
./packages/infra/scripts/cleanup.sh
# Then retry destroy
cd packages/infra && yarn cdk destroy --all
The cleanup script will:
- Delete the Athena workgroup with
--recursive-delete-option - Revoke circular security group rules and delete EMR security groups
- Terminate any running EMR clusters
- Stop any running Step Functions executions
Manual Cleanup (if script fails)¶
If the cleanup script doesn't resolve the issue:
# 1. Delete Athena workgroup
aws athena delete-work-group \
--work-group fraud-detection-workgroup \
--recursive-delete-option
# 2. Find and delete EMR security groups
# First, list them
aws ec2 describe-security-groups \
--filters "Name=group-name,Values=*fraud-detection*EMR*" \
--query "SecurityGroups[].{ID:GroupId,Name:GroupName}"
# Revoke all ingress rules (replace SG_ID with actual IDs)
aws ec2 revoke-security-group-ingress \
--group-id SG_ID \
--protocol all \
--source-group OTHER_SG_ID
# Delete security groups
aws ec2 delete-security-group --group-id SG_ID
# 3. Delete failed CloudFormation stacks from console
# Go to CloudFormation → Select stack → Delete → Retain failing resources
Cost Optimization¶
EMR Spot Instances¶
Use Spot instances for cost savings:
instance_fleets=[
tasks.EmrCreateCluster.InstanceFleetConfigProperty(
instance_fleet_type=tasks.EmrCreateCluster.InstanceRoleType.CORE,
target_spot_capacity=2,
target_on_demand_capacity=0, # All Spot
instance_type_configs=[
tasks.EmrCreateCluster.InstanceTypeConfigProperty(
instance_type="m5.xlarge",
bid_price_as_percentage_of_on_demand_price=50,
),
# Fallback instance types
tasks.EmrCreateCluster.InstanceTypeConfigProperty(
instance_type="m5a.xlarge",
bid_price_as_percentage_of_on_demand_price=50,
),
],
),
]
S3 Storage Classes¶
Automatic tiering configured:
- Hot data: S3 Standard
- After 90 days: S3 Infrequent Access
- Old versions: Expire after 30 days
EMR Managed Scaling¶
managed_scaling_policy=tasks.EmrCreateCluster.ManagedScalingPolicyProperty(
compute_limits=tasks.EmrCreateCluster.ComputeLimitsProperty(
unit_type=tasks.EmrCreateCluster.ComputeLimitsUnitType.INSTANCES,
minimum_capacity_units=2,
maximum_capacity_units=10,
),
)
Security¶
Encryption¶
- S3: Server-side encryption (SSE-S3)
- EMR: In-transit encryption enabled
- Athena: Query results encrypted
Network Security¶
- EMR in private subnets
- Security groups restrict traffic
- No public IPs on EMR instances
IAM Roles¶
- EMR service role: Managed policy
- EMR EC2 role: Limited S3 access
- Glue crawler role: Read-only S3
Monitoring¶
CloudWatch Dashboards¶
Key metrics:
- EMR cluster state
- S3 bucket sizes
- Athena query performance
- Step Functions executions
Alarms¶
cloudwatch.Alarm(
metric=state_machine.metric_failed(),
threshold=1,
evaluation_periods=1,
alarm_actions=[sns_topic],
)
Cost Alerts¶
budgets.CfnBudget(
budget=budgets.CfnBudget.BudgetDataProperty(
budget_type="COST",
time_unit="MONTHLY",
budget_limit=budgets.CfnBudget.SpendProperty(
amount=500,
unit="USD",
),
),
)
Outputs¶
After deployment, the following outputs are available:
| Output | Description |
|---|---|
| DataBucketName | Raw data bucket |
| ResultsBucketName | Results bucket |
| ScriptsBucketName | Spark scripts bucket |
| GlueDatabaseName | Glue catalog database |
| AthenaWorkgroupName | Athena workgroup |
| StateMachineArn | Step Functions ARN |