Testing Guide¶
Overview¶
The fraud detection project uses pytest for testing, with shared fixtures for PySpark tests. The test suite must maintain 100% code coverage.
Test Structure¶
packages/
├── fraud_detection/
│   └── tests/
│       ├── conftest.py            # Shared fixtures (Spark session, schemas)
│       └── test_<module_name>.py  # Tests for each module
└── infra/
    └── tests/
        └── test_<stack_name>.py   # CDK stack tests
Running Tests¶
All test commands use uv run to execute within the project's virtual environment.
All Tests¶
uv run pytest
With Coverage¶
# Terminal report (default)
uv run pytest --cov=packages --cov-report=term-missing
# HTML report
uv run pytest --cov=packages --cov-report=html
open htmlcov/index.html
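Specific Package¶
# Fraud detection package
uv run pytest packages/fraud_detection/tests
# Infrastructure (CDK) package
uv run pytest packages/infra/tests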
Specific Test Class or Method¶
# Specific class
uv run pytest packages/fraud_detection/tests/test_outliers.py::TestOutlierDetector
# Specific method
uv run pytest packages/fraud_detection/tests/test_outliers.py::TestOutlierDetector::test_zscore_outliers_detected
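Filter by Marker¶
# Only tests marked as slow
uv run pytest -m slow
# Everything except slow tests
uv run pytest -m "not slow"
# Only integration tests
uv run pytest -m integration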
Coverage Requirements¶
The project enforces 100% code coverage: the --cov-fail-under=100 option makes the pytest run fail if total coverage drops below this threshold.
Configuration in pyproject.toml:
[tool.pytest.ini_options]
addopts = "-v --cov=packages --cov-report=term-missing --cov-fail-under=100"

[tool.coverage.run]
omit = [
    "*/tests/*",
    "*/cli.py",
    "*/jobs/*",
    "*/utils/*",
    # ... other exclusions
]

[tool.coverage.report]
exclude_lines = [
    "pragma: no cover",
    "if TYPE_CHECKING:",
    "if __name__ == .__main__.:",
    "raise ValueError",
]
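To illustrate how the exclude_lines patterns above apply to source code, here is a minimal sketch. The function names (parse_amount, debug_dump) are hypothetical and not taken from the project. coverage.py treats each pattern as a regular expression: a matching line is excluded from the report, and if that line introduces a block (such as an if or a def), the whole block is excluded.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:  # matches "if TYPE_CHECKING:", so this block is excluded
    from decimal import Decimal


def parse_amount(raw: str) -> float:
    """Hypothetical helper: parse a charge amount from text."""
    if not raw:
        raise ValueError("amount is empty")  # matches "raise ValueError"
    return float(raw)


def debug_dump(value: Decimal | float) -> str:  # pragma: no cover
    """Hypothetical debugging aid; the pragma on the def excludes the whole function."""
    return repr(value)


if __name__ == "__main__":  # matches the "if __name__ == .__main__.:" pattern
    print(parse_amount("100.00"))
```

Note that excluding every "raise ValueError" line means error branches of this kind do not need a dedicated test to reach 100%, so it may still be worth testing them deliberately.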
PySpark Testing¶
Session Fixture¶
A shared Spark session is created once per test session:
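# conftest.py
from collections.abc import Generator

import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark() -> Generator[SparkSession, None, None]:
    session = (
        SparkSession.builder
        .master("local[2]")  # run locally with two worker threads
        .appName("FraudDetectionTests")
        .config("spark.sql.shuffle.partitions", "2")  # keep shuffles small for tiny test data
        .config("spark.ui.enabled", "false")  # the web UI is not needed in tests
        .getOrCreate()
    )
    yield session
    session.stop()  # teardown runs once, after the last test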
Test Data Fixtures¶
Create reusable test data using fixtures:
from datetime import date
from decimal import Decimal

import pytest
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.types import StructType


@pytest.fixture
def sample_claims(request: pytest.FixtureRequest) -> DataFrame:
    """Create sample claims data for testing."""
    # Look up the shared fixtures by name rather than declaring them as parameters
    spark_session: SparkSession = request.getfixturevalue("spark")
    schema: StructType = request.getfixturevalue("claims_schema")
    data = [
        ("CLM001", "PAT001", "PRV001", "99213", date(2024, 1, 15), Decimal("100.00"), "CA", "CA"),
        ("CLM002", "PAT002", "PRV001", "99214", date(2024, 1, 16), Decimal("150.00"), "CA", "CA"),
    ]
    return spark_session.createDataFrame(data, schema)
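When many tests need slightly different rows, a small row factory can keep fixtures readable. The sketch below is one possible approach and not part of the project: make_claim is a hypothetical helper, and every field name other than claim_id and charge_amount is an assumption chosen to fit the eight-column tuples shown above.

```python
from datetime import date
from decimal import Decimal


def make_claim(
    claim_id: str,
    *,
    patient_id: str = "PAT001",        # assumed field name
    provider_id: str = "PRV001",       # assumed field name
    procedure_code: str = "99213",     # assumed field name
    service_date: date = date(2024, 1, 15),  # assumed field name
    charge_amount: Decimal = Decimal("100.00"),
    patient_state: str = "CA",         # assumed field name
    provider_state: str = "CA",        # assumed field name
) -> tuple:
    """Return one claim row as a tuple, in the column order used by sample_claims."""
    return (claim_id, patient_id, provider_id, procedure_code,
            service_date, charge_amount, patient_state, provider_state)


# Only the fields that differ from the defaults need to be spelled out.
rows = [
    make_claim("CLM001"),
    make_claim("CLM002", patient_id="PAT002", procedure_code="99214",
               service_date=date(2024, 1, 16), charge_amount=Decimal("150.00")),
]
```

The resulting rows list can be passed to createDataFrame together with the schema, exactly as the data list is in the fixture.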
DataFrame Assertions¶
def test_transform_preserves_rows(spark, sample_claims):
    # my_transform stands in for the function under test
    result = my_transform(sample_claims)
    assert result.count() == sample_claims.count()


def test_output_schema(spark, sample_claims):
    result = my_transform(sample_claims)
    assert "new_column" in result.columns


def test_flag_logic(detector, spark, claims_schema):
    # Test that specific values are flagged correctly
    rows = [...]  # placeholder: claim rows in which only CLM_OUTLIER has an extreme charge_amount
    claims = spark.createDataFrame(rows, claims_schema)
    result = detector.detect_outliers(claims, "charge_amount", "is_outlier")
    outliers = result.filter(result["is_outlier"]).collect()
    assert len(outliers) == 1
    assert outliers[0]["claim_id"] == "CLM_OUTLIER"
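Choosing data for a flag test deserves some care. The test name test_zscore_outliers_detected suggests a z-score method. Assuming the detector flags values more than 3 sample standard deviations from the mean (both the threshold and the formula are assumptions, not confirmed by this guide), a single value among n rows can reach a z-score of at most (n - 1) / sqrt(n), so a dataset with fewer than 11 rows could never produce a flag. The pure-Python sketch below, built around the hypothetical helper zscore_flags, illustrates this without Spark.

```python
import statistics


def zscore_flags(values: list[float], threshold: float = 3.0) -> list[bool]:
    """Flag values whose absolute z-score (sample standard deviation) exceeds threshold."""
    mean = statistics.fmean(values)
    stdev = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    if stdev == 0:
        return [False] * len(values)  # all values identical: nothing to flag
    return [abs(v - mean) / stdev > threshold for v in values]


# Five rows: the extreme value only reaches z = 4 / sqrt(5), about 1.79, so nothing is flagged.
too_few = zscore_flags([100.0] * 4 + [10_000.0])

# Eleven rows: the extreme value reaches z = 10 / sqrt(11), about 3.02, so it is flagged.
enough = zscore_flags([100.0] * 10 + [10_000.0])
```

If the real detector uses a different threshold or statistic, the minimum row count changes accordingly, but the same check can be repeated before writing the Spark test.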
CDK Testing¶
Template Assertions¶
from aws_cdk.assertions import Template


def test_data_lake_stack(app, env):
    stack = DataLakeStack(app, "Test", project_name="test", env=env)
    template = Template.from_stack(stack)

    # Assert resource properties
    template.has_resource_properties("AWS::S3::Bucket", {
        "BucketEncryption": {...}
    })

    # Assert resource count
    template.resource_count_is("AWS::Glue::Crawler", 2)
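By default, has_resource_properties performs a partial match: the expected properties must appear on the resource, and any extra properties on the resource are ignored. The stand-alone sketch below mimics that idea on a plain dictionary shaped like a synthesized CloudFormation template. It is a simplified illustration, not the library's implementation; is_subset, has_resource_like, and the sample template are all hypothetical.

```python
from typing import Any


def is_subset(expected: Any, actual: Any) -> bool:
    """Return True if every key in expected is present in actual with a matching value."""
    if isinstance(expected, dict):
        return isinstance(actual, dict) and all(
            key in actual and is_subset(value, actual[key])
            for key, value in expected.items()
        )
    return expected == actual  # non-dict values (strings, numbers, lists) must be equal


def has_resource_like(template: dict, resource_type: str, props: dict) -> bool:
    """Return True if any resource of resource_type partially matches props."""
    return any(
        res.get("Type") == resource_type and is_subset(props, res.get("Properties", {}))
        for res in template.get("Resources", {}).values()
    )


# Hypothetical template fragment, for illustration only.
template = {
    "Resources": {
        "RawBucket": {
            "Type": "AWS::S3::Bucket",
            "Properties": {
                "BucketName": "example-raw",
                "VersioningConfiguration": {"Status": "Enabled"},
            },
        }
    }
}

# BucketName is not mentioned in the expectation, yet the match still succeeds.
matches = has_resource_like(
    template, "AWS::S3::Bucket", {"VersioningConfiguration": {"Status": "Enabled"}}
)
```

This is why a test can assert only the properties it cares about, such as encryption settings, without restating the whole resource definition.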
Test Configuration¶
pyproject.toml¶
[tool.pytest.ini_options]
testpaths = ["packages/fraud_detection/tests", "packages/infra/tests"]
pythonpath = ["."]
addopts = "-v --cov=packages --cov-report=term-missing --cov-fail-under=100"
markers = [
    "slow: marks tests as slow",
    "integration: marks tests as integration tests",
]
conftest.py Best Practices¶
import pytest


# Session-scoped Spark (created once per test session)
@pytest.fixture(scope="session")
def spark():
    ...


# Function-scoped instances (created for each test)
@pytest.fixture
def detector(spark):
    return FraudDetector(spark)
CI/CD Integration¶
Tests run automatically on every PR and push to main via GitHub Actions. The CI pipeline includes:
- Format Check - Black and isort
- Lint - Flake8, Pylint, Pydocstyle
- Type Check - Mypy
- Tests - Pytest with coverage enforcement
- CDK Synth - Validates infrastructure code
- Docs Build - Validates documentation builds