AWS Infrastructure¶

Overview¶

The fraud detection system is deployed on AWS using CDK infrastructure as code.

Architecture Diagram¶

flowchart TB
    Upload[File Upload] --> S3Data

    subgraph EventDriven[Event-Driven Pipeline]
        S3Data[(S3 Data<br/>Bucket)] --> EventBridge[EventBridge]
        EventBridge --> StepFunctions[Step Functions]
    end

    subgraph VPC[VPC]
        subgraph Private[Private Subnets]
            subgraph EMR[EMR Cluster]
                Master[Master<br/>m5.xlarge]
                Core1[Core 1<br/>m5.xlarge]
                Core2[Core 2<br/>m5.xlarge]
            end
        end
        subgraph Public[Public Subnets]
            NAT[NAT Gateway]
        end
    end

    subgraph Storage[S3 Buckets]
        S3Results[(S3 Results<br/>Bucket)]
        S3Scripts[(S3 Scripts<br/>Bucket)]
    end

    subgraph Catalog[Data Catalog]
        Glue[Glue Data Catalog]
    end

    subgraph Analytics[Analytics]
        Athena[Athena]
    end

    StepFunctions --> EMR
    Private --> NAT
    NAT --> Internet((Internet))

    S3Data --> Glue
    S3Results --> Glue
    S3Scripts --> EMR
    EMR --> S3Results
    Glue --> Athena

CDK Stacks¶

Data Lake Stack¶

Resources:

S3 bucket for raw claims data (EventBridge notifications enabled)
S3 bucket for processed results
Glue database for metadata
Glue crawlers for schema discovery

Key Configuration:

# Data bucket with lifecycle rules
self.data_bucket = s3.Bucket(
    encryption=s3.BucketEncryption.S3_MANAGED,
    versioned=True,
    lifecycle_rules=[
        s3.LifecycleRule(
            transitions=[
                s3.Transition(
                    storage_class=s3.StorageClass.INFREQUENT_ACCESS,
                    transition_after=cdk.Duration.days(90),
                )
            ],
        ),
    ],
)

Processing Stack¶

Resources:

VPC with public/private subnets
EMR security groups
IAM roles for EMR
Step Functions state machine
EventBridge rule (S3 trigger)
Scripts bucket for Spark jobs

EMR Configuration:

instances=tasks.EmrCreateCluster.InstancesConfigProperty(
    instance_fleets=[
        tasks.EmrCreateCluster.InstanceFleetConfigProperty(
            instance_fleet_type=tasks.EmrCreateCluster.InstanceRoleType.MASTER,
            target_on_demand_capacity=1,
            instance_type_configs=[
                tasks.EmrCreateCluster.InstanceTypeConfigProperty(
                    instance_type="m5.xlarge",
                )
            ],
        ),
        tasks.EmrCreateCluster.InstanceFleetConfigProperty(
            instance_fleet_type=tasks.EmrCreateCluster.InstanceRoleType.CORE,
            target_on_demand_capacity=2,
            instance_type_configs=[
                tasks.EmrCreateCluster.InstanceTypeConfigProperty(
                    instance_type="m5.xlarge",
                )
            ],
        ),
    ],
),

Analytics Stack¶

Resources:

Athena workgroup
Glue tables for claims and results
Named queries for common analysis
Query results bucket

Step Functions Pipeline¶

The pipeline is triggered automatically when files are uploaded to the claims/ prefix in the data bucket.

stateDiagram-v2
    [*] --> TransformInput: S3 Event
    TransformInput --> CreateCluster
    CreateCluster --> RunFraudDetection
    RunFraudDetection --> TerminateCluster
    TerminateCluster --> [*]

    RunFraudDetection --> TerminateCluster: on failure

Pipeline States:

TransformInput - Extracts S3 bucket and key from EventBridge event
CreateCluster - Spins up EMR cluster with Python 3.13 bootstrap
RunFraudDetection - Submits Spark job with dynamic input path
TerminateCluster - Cleans up EMR resources

EventBridge Rule:

events.Rule(
    event_pattern=events.EventPattern(
        source=["aws.s3"],
        detail_type=["Object Created"],
        detail={
            "bucket": {"name": events.Match.exact_string(data_bucket.bucket_name)},
            "object": {"key": events.Match.prefix("claims/")},
        },
    ),
)

Deployment¶

Prerequisites¶

AWS CLI configured
CDK CLI installed
Bootstrap CDK (first time)

cdk bootstrap aws://ACCOUNT_ID/REGION

Deploy All Stacks¶

# From project root
make cdk-deploy

# Or from packages/infra
cd packages/infra && yarn deploy

Deploy Individual Stacks¶

# From packages/infra
cd packages/infra
yarn cdk deploy fraud-detection-data-lake
yarn cdk deploy fraud-detection-processing
yarn cdk deploy fraud-detection-analytics

Destroy Stacks¶

# From project root
make cdk-destroy

# Or from packages/infra
cd packages/infra && yarn destroy

Troubleshooting Stack Deletion¶

Stack deletion may fail due to:

Athena workgroup not empty - Contains saved queries or query history
EMR security groups with circular dependencies - Security groups reference each other
Running EMR clusters or Step Functions executions

If yarn destroy fails, run the cleanup script first:

# Run cleanup script to remove problematic resources
./packages/infra/scripts/cleanup.sh

# Then retry destroy
cd packages/infra && yarn cdk destroy --all

The cleanup script will:

Delete the Athena workgroup with --recursive-delete-option
Revoke circular security group rules and delete EMR security groups
Terminate any running EMR clusters
Stop any running Step Functions executions

Manual Cleanup (if script fails)¶

If the cleanup script doesn't resolve the issue:

# 1. Delete Athena workgroup
aws athena delete-work-group \
    --work-group fraud-detection-workgroup \
    --recursive-delete-option

# 2. Find and delete EMR security groups
# First, list them
aws ec2 describe-security-groups \
    --filters "Name=group-name,Values=*fraud-detection*EMR*" \
    --query "SecurityGroups[].{ID:GroupId,Name:GroupName}"

# Revoke all ingress rules (replace SG_ID with actual IDs)
aws ec2 revoke-security-group-ingress \
    --group-id SG_ID \
    --protocol all \
    --source-group OTHER_SG_ID

# Delete security groups
aws ec2 delete-security-group --group-id SG_ID

# 3. Delete failed CloudFormation stacks from console
# Go to CloudFormation → Select stack → Delete → Retain failing resources

Cost Optimization¶

EMR Spot Instances¶

Use Spot instances for cost savings:

instance_fleets=[
    tasks.EmrCreateCluster.InstanceFleetConfigProperty(
        instance_fleet_type=tasks.EmrCreateCluster.InstanceRoleType.CORE,
        target_spot_capacity=2,
        target_on_demand_capacity=0,  # All Spot
        instance_type_configs=[
            tasks.EmrCreateCluster.InstanceTypeConfigProperty(
                instance_type="m5.xlarge",
                bid_price_as_percentage_of_on_demand_price=50,
            ),
            # Fallback instance types
            tasks.EmrCreateCluster.InstanceTypeConfigProperty(
                instance_type="m5a.xlarge",
                bid_price_as_percentage_of_on_demand_price=50,
            ),
        ],
    ),
]

S3 Storage Classes¶

Automatic tiering configured:

Hot data: S3 Standard
After 90 days: S3 Infrequent Access
Old versions: Expire after 30 days

EMR Managed Scaling¶

managed_scaling_policy=tasks.EmrCreateCluster.ManagedScalingPolicyProperty(
    compute_limits=tasks.EmrCreateCluster.ComputeLimitsProperty(
        unit_type=tasks.EmrCreateCluster.ComputeLimitsUnitType.INSTANCES,
        minimum_capacity_units=2,
        maximum_capacity_units=10,
    ),
)

Security¶

Encryption¶

S3: Server-side encryption (SSE-S3)
EMR: In-transit encryption enabled
Athena: Query results encrypted

Network Security¶

EMR in private subnets
Security groups restrict traffic
No public IPs on EMR instances

IAM Roles¶

EMR service role: Managed policy
EMR EC2 role: Limited S3 access
Glue crawler role: Read-only S3

Monitoring¶

CloudWatch Dashboards¶

Key metrics:

EMR cluster state
S3 bucket sizes
Athena query performance
Step Functions executions

Alarms¶

cloudwatch.Alarm(
    metric=state_machine.metric_failed(),
    threshold=1,
    evaluation_periods=1,
    alarm_actions=[sns_topic],
)

Cost Alerts¶

budgets.CfnBudget(
    budget=budgets.CfnBudget.BudgetDataProperty(
        budget_type="COST",
        time_unit="MONTHLY",
        budget_limit=budgets.CfnBudget.SpendProperty(
            amount=500,
            unit="USD",
        ),
    ),
)

Outputs¶

After deployment, the following outputs are available:

Output	Description
DataBucketName	Raw data bucket
ResultsBucketName	Results bucket
ScriptsBucketName	Spark scripts bucket
GlueDatabaseName	Glue catalog database
AthenaWorkgroupName	Athena workgroup
StateMachineArn	Step Functions ARN