FraudShield
A learning project for understanding fraud detection systems—delayed labels, drift, and what happens when things break.
Interactive Demo
This demo is a simulation: it only shows what the API responses look like. Clone the repo if you want to run the real thing locally.
Score a Transaction
Drift Simulation
Simulate a distribution shift and see how the monitoring system responds.
Quick Start
git clone https://github.com/croesus245/fraudshield-mlops-system && cd fraudshield-mlops-system
make setup && make serve # API on localhost:8000
make simulate-stream # Generate + stream synthetic transactions
make eval # Full evaluation suite
make drift-sim # Trigger drift simulation
Why I Built This
Most fraud detection tutorials show you a classifier and call it done. But real fraud systems have problems that tutorials skip:
- Labels arrive late: Chargebacks take 30-90 days, so recent transactions are not reliably labeled yet and a plain random train/test split does not work.
- Fraudsters adapt: Patterns change constantly. Your model from last month might be useless.
- Class imbalance is extreme: With a 0.1-1% fraud rate, a model that flags nothing is at least 99% accurate, so accuracy is a useless metric.
- You need to know when it's failing: Drift detection isn't optional.
What I learned: The model is the easy part. Label reconciliation, drift monitoring, and knowing when to retrain—that's the real work.
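The late-label point above changes how a train/test split has to be built. Here is a minimal sketch, assuming each record carries an event timestamp and that labels are treated as final after a fixed maturity window (the 90-day upper bound mentioned above). The function `temporal_split` and its field names are illustrative, not taken from the repo.

```python
from datetime import datetime, timedelta

def temporal_split(records, now, test_days=30, maturity_days=90):
    """Split by time, using only records whose labels have had time to mature.

    records: iterable of dicts with a "timestamp" (datetime) key.
    Records newer than `now - maturity_days` are set aside, because a
    missing chargeback does not yet mean "legitimate".
    """
    mature_cutoff = now - timedelta(days=maturity_days)
    test_start = mature_cutoff - timedelta(days=test_days)
    train, test, immature = [], [], []
    for rec in records:
        ts = rec["timestamp"]
        if ts > mature_cutoff:
            immature.append(rec)   # label not trustworthy yet
        elif ts > test_start:
            test.append(rec)       # most recent window with mature labels
        else:
            train.append(rec)      # everything older
    return train, test, immature

now = datetime(2024, 7, 1)
records = [{"id": i, "timestamp": now - timedelta(days=d)}
           for i, d in enumerate([10, 95, 100, 130, 200])]
train, test, immature = temporal_split(records, now)
print(len(train), len(test), len(immature))  # prints: 2 2 1
```

The test window always sits later in time than the training window, so evaluation never uses information from the future of the training data.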
Architecture
┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ Stream Ingest│───▶│ Streaming    │───▶│ Scoring      │───▶│ Decision     │
│ (Kafka or    │    │ Aggregates   │    │ Service      │    │ Engine       │
│  mock gen)   │    │ + Cache      │    │ (FastAPI)    │    │ (Rules+Model)│
└──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘
                           │                   │                   │
                           ▼                   │                   ▼
                    ┌──────────────┐           │            ┌──────────────┐
                    │ Online Cache │           │            │ Action       │
                    │ (Redis)      │           │            │ (Allow/Review│
                    └──────────────┘           │            │  /Block)     │
                                               │            └──────────────┘
                                               ▼
                                        ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
                                        │ Prediction   │───▶│ Label        │───▶│ Drift        │
                                        │ Log          │    │ Reconciler   │    │ Monitor      │
                                        │ (audit trail)│    │ (30–90d lag) │    │              │
                                        └──────────────┘    └──────────────┘    └──────────────┘
                                                                                       │
                                                                   ┌───────────────────┘
                                                                   ▼
                                                            ┌──────────────┐
                                                            │ Retraining   │
                                                            │ Trigger      │
                                                            └──────────────┘
Label Reconciliation Flow
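A minimal sketch of the reconciliation step, assuming the prediction log is a list of dicts keyed by `transaction_id` and that confirmed labels arrive as a separate mapping. The function name and the field names `scored_at` and `label_lag_days` are illustrative assumptions, not the repo's schema.

```python
from datetime import datetime

def reconcile_labels(prediction_log, confirmed_labels):
    """Attach late-arriving fraud labels to previously logged predictions.

    prediction_log: list of dicts with transaction_id, score, decision, scored_at
    confirmed_labels: dict of transaction_id -> (is_fraud, labeled_at)
    Returns (reconciled, pending); pending rows have no confirmed label yet.
    """
    reconciled, pending = [], []
    for pred in prediction_log:
        label = confirmed_labels.get(pred["transaction_id"])
        if label is None:
            pending.append(pred)
            continue
        is_fraud, labeled_at = label
        row = dict(pred)  # copy so the audit log itself is never mutated
        row["is_fraud"] = is_fraud
        row["label_lag_days"] = (labeled_at - pred["scored_at"]).days
        reconciled.append(row)
    return reconciled, pending

log = [
    {"transaction_id": "t1", "score": 0.91, "decision": "block",
     "scored_at": datetime(2024, 1, 1)},
    {"transaction_id": "t2", "score": 0.12, "decision": "allow",
     "scored_at": datetime(2024, 1, 2)},
]
labels = {"t1": (True, datetime(2024, 2, 15))}
reconciled, pending = reconcile_labels(log, labels)
```

Only the reconciled rows should feed evaluation or retraining. Pending rows stay out until their label arrives or the maturity window passes.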
Components
- Streaming aggregates: Velocity features per user/merchant/device (1m/10m/1h/24h windows)
- Online feature cache: Redis for fast lookups; offline Parquet store for training
- Entity embeddings: User/merchant vectors from transaction co-occurrence (SVD, 32-dim). Computed offline weekly, served via cache.
- Scoring API: FastAPI + XGBoost → risk score + decision + reason codes
- Decision engine: Model score + rule overrides (velocity limits, blocklists, amount thresholds)
- Prediction log: Store inputs/score/decision/model_version for audits and reconciliation
- Label reconciler: Joins confirmed labels back to prior predictions (30–90 day lag)
- Drift monitor: PSI on predictions, KS on 20 key features, base-rate tracking, null-rate alerts
- Retrain trigger: Automatic job when drift thresholds trip (+ optional human approval gate)
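To make the streaming aggregates concrete, here is a minimal in-memory sketch of per-entity velocity counts over the four windows listed above. It assumes timestamps are Unix seconds arriving in order per entity. The class name and the `txn_count_*` feature names are hypothetical, and a real deployment would keep this state in Redis rather than in process memory.

```python
from collections import defaultdict, deque

WINDOWS = {"1m": 60, "10m": 600, "1h": 3600, "24h": 86400}  # seconds

class VelocityCounter:
    """Per-entity transaction counts over trailing time windows."""

    def __init__(self, windows=WINDOWS):
        self.windows = windows
        self.max_window = max(windows.values())
        self.events = defaultdict(deque)  # entity_id -> timestamps, ascending

    def add(self, entity_id, ts):
        """Record one event; assumes in-order timestamps per entity."""
        q = self.events[entity_id]
        q.append(ts)
        # Evict events that have fallen out of the largest window.
        while q and q[0] <= ts - self.max_window:
            q.popleft()

    def features(self, entity_id, now):
        """Count events in (now - window, now] for each window."""
        q = self.events[entity_id]
        return {
            f"txn_count_{name}": sum(1 for t in q if now - secs < t <= now)
            for name, secs in self.windows.items()
        }

vc = VelocityCounter()
for ts in [0, 30, 3000, 3590, 3600]:
    vc.add("user_1", ts)
feats = vc.features("user_1", now=3600)
```

The same counter can be instantiated once per entity type (user, merchant, device) to produce the full set of velocity features.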
Benchmarks
Evaluation
Operational Metrics (What Fraud Teams Care About)
Default threshold: 0.70 (tuned to cap review queue ~3% while keeping recall >80%)
| Metric | Value | Interpretation |
|---|---|---|
| PR-AUC | 0.88 | Primary metric for imbalanced classification |
| Recall @ 1% FPR | 68% | Catch rate when only 1% false alarms tolerated |
| Precision @ top 0.5% risk | 71% | Hit rate in highest-risk transactions |
| Precision @ threshold | 67% | Of flagged transactions, 2/3 are actual fraud |
| Recall @ threshold | 83% | Catch 83% of all fraud |
| Manual review queue | ~2.8% of volume | Threshold tuned to cap queue at ~3% while keeping recall >80% |
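The operating-point rows in the table can be computed directly from raw scores and labels. This is a minimal sketch in plain Python, assuming binary 0/1 labels with both classes present. The function names are illustrative and the toy data is made up, not the benchmark data.

```python
def metrics_at_threshold(scores, labels, threshold=0.70):
    """Precision, recall, and flagged share for scores >= threshold."""
    flagged = [y for s, y in zip(scores, labels) if s >= threshold]
    tp = sum(flagged)
    total_fraud = sum(labels)
    return {
        "precision": tp / len(flagged) if flagged else 0.0,
        "recall": tp / total_fraud if total_fraud else 0.0,
        "review_rate": len(flagged) / len(scores),  # share of volume sent to review
    }

def recall_at_fpr(scores, labels, max_fpr=0.01):
    """Best recall over all thresholds whose false positive rate is <= max_fpr."""
    negatives = labels.count(0)
    positives = labels.count(1)
    best = 0.0
    for thr in sorted(set(scores)):
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        if fp / negatives <= max_fpr:
            best = max(best, tp / positives)
    return best

scores = [0.95, 0.85, 0.75, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0]
m = metrics_at_threshold(scores, labels)
r = recall_at_fpr(scores, labels, max_fpr=0.0)
```

The threshold scan is quadratic, which is fine for a sketch. On real volumes, a sorted single pass or a library routine would be the better choice.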
Model Comparison
| Model | PR-AUC | Notes |
|---|---|---|
| Baseline (XGBoost + manual features) | 0.82 | No embeddings |
| Final (XGBoost + entity embeddings) | 0.88 | Current model |
ROC-AUC: 0.98+ (high due to synthetic separability—not a reliable metric for imbalanced fraud)
Slice Performance
| Slice | PR-AUC | Status |
|---|---|---|
| High-value transactions (> $500) | 0.85 | Pass |
| New users (< 30 days) | 0.79 | Pass |
| International transactions | 0.84 | Pass |
| Card-not-present | 0.86 | Pass |
CI-ready regression gate: All slices must achieve PR-AUC ≥ 0.75. No slice regression > 5%.
(Shown as an example; run make eval locally to see real output.)
Workflow: eval-gate.yml · Eval Gate: PASS · Worst slice: New users PR-AUC = 0.79 · Regressions: 0
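The gate rule above can be expressed in a few lines. This is a sketch, not the contents of eval-gate.yml: it assumes per-slice PR-AUC values are available as dicts and interprets "regression > 5%" as a relative drop against the previous model, which is my reading rather than something the rule states. The slice keys are illustrative.

```python
MIN_PR_AUC = 0.75
MAX_REGRESSION = 0.05  # assumption: relative drop versus the previous model

def eval_gate(current, previous):
    """Return (passed, failures) for per-slice PR-AUC dicts."""
    failures = []
    for name, score in current.items():
        if score < MIN_PR_AUC:
            failures.append(f"{name}: PR-AUC {score:.2f} below floor {MIN_PR_AUC:.2f}")
        prev = previous.get(name)
        if prev and (prev - score) / prev > MAX_REGRESSION:
            failures.append(f"{name}: regressed from {prev:.2f} to {score:.2f}")
    return (not failures), failures

# Slice values from the table above; reusing them as "previous" means no regressions.
current = {
    "high_value": 0.85,
    "new_users": 0.79,
    "international": 0.84,
    "card_not_present": 0.86,
}
passed, failures = eval_gate(current, previous=current)
worst = min(current, key=current.get)
```

In CI, a non-empty `failures` list would translate into a non-zero exit code so the workflow blocks the merge.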
Monitoring & Drift
| Metric | Method | Alert Threshold | Response |
|---|---|---|---|
| Prediction score drift | PSI on score distribution | > 0.1 | Investigate feature drift |
| Feature drift | KS test on 20 key features | p < 0.01 on >3 features | Check upstream data |
| Base rate drift | 7-day rolling fraud rate | > 20% deviation from baseline | Trigger recalibration |
| Feature null-rate spike | Null % per feature | > 5% increase | Check data pipeline |
| Schema change | Contract validation | Any schema mismatch | Block pipeline, alert |
| Feature computation lag | Streaming lag metrics | > 5 min behind | Scale workers |
| Label freshness | Days since last label batch | > 45 days | Check label pipeline |
| Latency p95 | Service metrics | > 50ms | Scale or optimize |
See Drift Runbook for detailed response procedures.
Ownership: On-call owns drift/latency alerts. Data Eng owns schema/lag. ML Eng owns recalibration + model updates.
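For reference, the PSI check in the first row of the table can be sketched as follows. This version assumes scores lie in [0, 1] and uses ten equal-width bins. Quantile bins fitted on the baseline are a common alternative, and the bin count and smoothing constant here are my choices, not values from the repo.

```python
import math

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between two samples of scores in [0, 1]."""
    def bin_shares(values):
        counts = [0] * bins
        for v in values:
            idx = min(int(v * bins), bins - 1)  # a score of 1.0 goes in the last bin
            counts[idx] += 1
        # Floor empty bins at eps so the log ratio stays finite.
        return [max(c / len(values), eps) for c in counts]

    e, a = bin_shares(expected), bin_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [i / 100 for i in range(100)]           # roughly uniform scores
shifted = [min(v + 0.3, 0.99) for v in baseline]   # simulated upward shift
alert = psi(baseline, shifted) > 0.1               # threshold from the table
```

Identical distributions give a PSI of zero, and the value grows as mass moves between bins, which is why a fixed threshold such as 0.1 works as an alert trigger.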
Cost & Latency
- Latency: uvicorn workers=4, batch=1, 1KB payload, 1000-request benchmark on local machine
- TPS: Single c5.xlarge equivalent, no batching. Scales linearly with instances.
- Cost: Estimated from AWS pricing: 2× c5.xlarge ($248) + Redis cache.t3.medium ($25) + storage/egress (~$90), roughly $363 in total. Does not include Kafka (use managed or self-host).
Note: Production deployment would need load testing with realistic traffic patterns.
Failure Modes
| Failure Mode | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Model scores all low-risk | Medium | High | Minimum fraud rate alert |
| Latency spike | Medium | Medium | Circuit breaker + rule fallback |
| Label pipeline breaks | Low | High | Label freshness alert (>45 days) |
| Adversarial fraud pattern | High | High | Drift detection + rapid retrain |
| Reason-code drift / feature leakage | Low | Medium | Feature importance drift checks (SHAP optional) + top feature review |
| Upstream schema change | Medium | High | Data contract validation blocks pipeline |
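The "circuit breaker + rule fallback" mitigation for latency spikes can be sketched like this. It is an illustration under stated assumptions: the class, the `rule_fallback` rule (send amounts over $500 to review), and the failure and reset settings are all hypothetical, and the sketch maps scores at or above the 0.70 threshold to "review" only.

```python
import time

class CircuitBreaker:
    """Opens after max_failures consecutive failures; closes again after reset_after seconds."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def is_open(self):
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.reset_after:
            # Cool-down elapsed: close the breaker so the next call tries the model.
            self.opened_at = None
            self.failures = 0
            return False
        return True

    def record(self, ok):
        if ok:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()

def rule_fallback(txn):
    """Hypothetical rules-only decision used while the model is unavailable."""
    return "review" if txn["amount"] > 500 else "allow"

def decide(txn, score_fn, breaker, threshold=0.70):
    if breaker.is_open():
        return rule_fallback(txn)  # skip the model entirely while open
    try:
        score = score_fn(txn)
    except Exception:
        breaker.record(ok=False)
        return rule_fallback(txn)
    breaker.record(ok=True)
    return "review" if score >= threshold else "allow"
```

A call such as `decide(txn, model_score, CircuitBreaker())` keeps returning decisions even when `model_score` raises or times out, which is the point of the fallback.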
Visual Proof
Screenshots from benchmark runs (make benchmark), drift simulations (make drift-sim), and the CI workflows (.github/workflows/) are coming soon. Run make benchmark locally to generate your own.
Incident Simulation
What I Broke
Injected a synthetic fraud pattern shift mid-stream: new fraud type targeting a previously safe merchant category (simulated "travel" category attack).
What Happened
- Drift monitor detected PSI spike (0.15 > 0.1 threshold)
- Feature KS test flagged merchant_category distribution shift
- Slice analysis showed "travel" category PR-AUC dropped from 0.84 → 0.61
- Retraining trigger fired automatically
- New model trained and validated against regression suite
Postmortem
Full postmortem documented with timeline, root cause, and prevention measures. View postmortem →
Assumptions & Limitations
- Synthetic data: Demo uses generated transactions (no real PII). Patterns are realistic but not from production.
- Simulated delay: 30–90 day label lag is simulated by holding back labels, not real chargebacks.
- Local benchmarks: Latency/TPS measured on dev machine, not production infrastructure.
- Mock streaming: Demo uses mock stream for portability. Kafka consumer/producer stubs included; diagram shows target architecture.
Data Contract (excerpt)
TransactionEvent:
  required:
    - transaction_id: string (uuid)
    - amount: float (>0)
    - user_id: string
    - merchant_id: string
    - timestamp: datetime (ISO8601)
  optional:
    - device_id: string | null
    - ip_address: string | null
  null_policy: "device_id nullable, others reject on null"
(Each API request is validated and becomes a TransactionEvent internally.)
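A standard-library sketch of how a payload could be checked against this contract. The function name and error strings are illustrative. It treats both optional fields as nullable, following their type annotations, which is an assumption because the null_policy line names only device_id.

```python
import uuid
from datetime import datetime

REQUIRED = ("transaction_id", "amount", "user_id", "merchant_id", "timestamp")

def validate_event(payload):
    """Return a list of contract violations; an empty list means the payload is valid."""
    errors = [f"{f}: missing or null" for f in REQUIRED if payload.get(f) is None]
    if errors:
        return errors

    try:
        uuid.UUID(str(payload["transaction_id"]))
    except ValueError:
        errors.append("transaction_id: not a uuid")

    amount = payload["amount"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, (int, float)) or amount <= 0:
        errors.append("amount: must be a number > 0")

    for field in ("user_id", "merchant_id"):
        if not isinstance(payload[field], str):
            errors.append(f"{field}: must be a string")

    try:
        # Note: before Python 3.11, fromisoformat rejects a trailing "Z".
        datetime.fromisoformat(str(payload["timestamp"]))
    except ValueError:
        errors.append("timestamp: not ISO 8601")

    for field in ("device_id", "ip_address"):
        value = payload.get(field)
        if value is not None and not isinstance(value, str):
            errors.append(f"{field}: must be a string or null")
    return errors

good = {
    "transaction_id": "123e4567-e89b-12d3-a456-426614174000",
    "amount": 42.5,
    "user_id": "u1",
    "merchant_id": "m1",
    "timestamp": "2024-05-01T12:30:00+00:00",
    "device_id": None,
}
bad = dict(good, amount=-1, timestamp="yesterday")
```

Returning all violations at once, rather than failing on the first, makes rejected events easier to debug from logs.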
Reason Codes (sample)
Top reason codes for a high-risk decision:
- velocity_user_1h_high: User initiated >N transactions in the last hour (threshold varies by user history)
Reason codes are derived from SHAP values and rule triggers. They are used for analyst review and model debugging.
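One way to combine the two sources into a ranked list, sketched without the SHAP library itself: it assumes per-feature contributions have already been computed (for example as SHAP values) and that feature names double as reason codes. The function and every feature name other than velocity_user_1h_high are hypothetical.

```python
def reason_codes(contributions, rule_hits, top_k=3):
    """Merge triggered rules with the features that pushed the score up the most.

    contributions: dict of feature name -> signed contribution to the score
    rule_hits: list of rule names that fired for this transaction
    """
    positive = [(name, value) for name, value in contributions.items() if value > 0]
    positive.sort(key=lambda item: item[1], reverse=True)
    model_codes = [name for name, _ in positive[:top_k]]
    # Rules come first, then model-driven codes; dict.fromkeys drops duplicates in order.
    return list(dict.fromkeys(list(rule_hits) + model_codes))

contributions = {
    "velocity_user_1h_high": 0.42,
    "amount_zscore": 0.15,
    "account_age_days": -0.08,
    "merchant_risk": 0.30,
}
codes = reason_codes(contributions, rule_hits=["velocity_user_1h_high"])
```

Features with negative contributions are left out because they argue against the high-risk decision and would confuse an analyst reading the codes.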