FraudShield

A learning project for understanding fraud detection systems—delayed labels, drift, and what happens when things break.

Note: This is a personal project with synthetic data. The benchmarks are from my own load tests on a 4-core VM. Not production traffic. See methodology →

Interactive Demo

This demo is a simulation: it shows what the API responses look like without calling a live service. Clone the repo if you want to run the real thing locally.

Score a Transaction (simulated API)

Drift Simulation (monitoring)

Simulate a distribution shift and see how the monitoring system responds.

Quick Start

git clone https://github.com/croesus245/fraudshield-mlops-system && cd fraudshield-mlops-system
make setup && make serve       # API on localhost:8000
make simulate-stream           # Generate + stream synthetic transactions
make eval                      # Full evaluation suite
make drift-sim                 # Trigger drift simulation

Why I Built This

Most fraud detection tutorials show you a classifier and call it done. But real fraud systems have problems that tutorials skip:

  • Labels arrive late: Chargebacks take 30-90 days, so a random train/test split isn't enough; the most recent transactions don't have labels yet.
  • Fraudsters adapt: Patterns change constantly. Your model from last month might be useless.
  • Class imbalance is extreme: 0.1-1% fraud rate. Accuracy is a useless metric: a model that never flags anything is still 99%+ accurate.
  • You need to know when it's failing: Drift detection isn't optional.

What I learned: The model is the easy part. Label reconciliation, drift monitoring, and knowing when to retrain—that's the real work.

Architecture

┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ Stream Ingest│───▶│ Streaming    │───▶│ Scoring      │───▶│ Decision     │
│ (Kafka or    │    │ Aggregates   │    │ Service      │    │ Engine       │
│ mock gen)    │    │ + Cache      │    │ (FastAPI)    │    │ (Rules+Model)│
└──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘
                           │                   │                   │
                           ▼                   │                   ▼
                    ┌──────────────┐           │            ┌──────────────┐
                    │ Online Cache │           │            │ Action       │
                    │ (Redis)      │           │            │ (Allow/Review│
                    └──────────────┘           │            │ /Block)      │
                                               │            └──────────────┘
                                               ▼
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ Prediction   │───▶│ Label        │───▶│ Drift        │
│ Log          │    │ Reconciler   │    │ Monitor      │
│ (audit trail)│    │ (30–90d lag) │    │              │
└──────────────┘    └──────────────┘    └──────────────┘
                                               │
                           ┌───────────────────┘
                           ▼
                    ┌──────────────┐
                    │ Retraining   │
                    │ Trigger      │
                    └──────────────┘
                        

Label Reconciliation Flow

1. Predict
2. Log (txn_id, score, decision)
3. Wait 30–90d
4. Label batch arrives
5. Join & evaluate
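
The five steps above can be sketched as a join on txn_id. This is a minimal illustration, not the repo's reconciler: the Prediction record, the reconcile function, and the rule that unlabeled predictions older than 90 days count as legitimate are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Prediction:  # hypothetical shape of one prediction-log row
    txn_id: str
    score: float
    decision: str
    scored_at: datetime


def reconcile(predictions, labels, now, max_lag_days=90):
    """Join late-arriving labels to logged predictions by txn_id.

    labels maps txn_id -> True for confirmed fraud.
    Assumption of this sketch: a prediction with no label after
    max_lag_days is treated as legitimate; younger ones stay pending.
    """
    joined, pending = [], []
    for p in predictions:
        if p.txn_id in labels:
            joined.append((p, labels[p.txn_id]))
        elif now - p.scored_at >= timedelta(days=max_lag_days):
            joined.append((p, False))
        else:
            pending.append(p)
    return joined, pending


now = datetime(2024, 6, 1)
preds = [
    Prediction("t1", 0.91, "block", datetime(2024, 1, 10)),
    Prediction("t2", 0.12, "allow", datetime(2024, 1, 12)),
    Prediction("t3", 0.40, "review", datetime(2024, 5, 20)),
]
joined, pending = reconcile(preds, {"t1": True}, now)
```

Only the joined pairs would feed evaluation; t3 stays pending because its label window is still open.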

Components

  • Streaming aggregates: Velocity features per user/merchant/device (1m/10m/1h/24h windows)
  • Online feature cache: Redis for fast lookups; offline Parquet store for training
  • Entity embeddings: User/merchant vectors from transaction co-occurrence (SVD, 32-dim). Computed offline weekly, served via cache.
  • Scoring API: FastAPI + XGBoost → risk score + decision + reason codes
  • Decision engine: Model score + rule overrides (velocity limits, blocklists, amount thresholds)
  • Prediction log: Store inputs/score/decision/model_version for audits and reconciliation
  • Label reconciler: Joins confirmed labels back to prior predictions (30–90 day lag)
  • Drift monitor: PSI on predictions, KS on 20 key features, base-rate tracking, null-rate alerts
  • Retrain trigger: Automatic job when drift thresholds trip (+ optional human approval gate)
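
As a rough illustration of the streaming aggregates, here is an in-memory sketch of per-entity velocity counts over the four windows. The VelocityCounter class and the feature names are hypothetical, and it assumes events arrive in timestamp order; the real system keeps this state in Redis.

```python
from collections import defaultdict, deque

WINDOWS = {"1m": 60, "10m": 600, "1h": 3600, "24h": 86400}  # window sizes in seconds


class VelocityCounter:
    """Per-entity transaction counts over trailing time windows."""

    def __init__(self):
        self._events = defaultdict(deque)  # entity_id -> ascending timestamps

    def add(self, entity_id, ts):
        q = self._events[entity_id]
        q.append(ts)
        # Evict anything older than the largest window.
        horizon = ts - max(WINDOWS.values())
        while q and q[0] <= horizon:
            q.popleft()

    def features(self, entity_id, now):
        q = self._events.get(entity_id, ())
        # Count events in the half-open interval (now - window, now].
        return {
            f"velocity_{name}": sum(1 for t in q if now - seconds < t <= now)
            for name, seconds in WINDOWS.items()
        }


vc = VelocityCounter()
for ts in (0, 3000, 3590, 3599):
    vc.add("u1", ts)
feats = vc.features("u1", now=3600)
```

The same counter can be keyed by user, merchant, or device ID.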

Benchmarks

How I tested this: Locust load testing against the FastAPI service running on my 4-core VM. Synthetic transaction payloads. This tells you the code works and roughly how fast—not how it would perform in a real production environment with real data.
Metric | Value | Notes
Latency | Target < 50ms p95 | Achieved ~45ms in synthetic load test
Throughput | ~1,200 TPS | Locust load test, single instance
Label Delay | 30–90 days | Simulated in demo
Synthetic Base Rate | 2.3% fraud | Higher than prod for stable eval

Evaluation

How measured: Time-based train/test split (80/20). Test set labels reconciled after 45-day simulated delay. Synthetic fraud base rate: 2.3% (higher than production 0.1–1% for stable evaluation). At 0.1% base rate, small synthetic datasets produce unstable PR curves; the system supports lower base rates via scaling the generator.
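
A minimal sketch of the split described above, assuming each row is a dict with a "timestamp" key (a hypothetical shape). The helper names are mine, not the repo's.

```python
from datetime import datetime, timedelta


def time_based_split(rows, train_frac=0.8):
    """Chronological split: the earliest train_frac of rows train, the rest test."""
    ordered = sorted(rows, key=lambda r: r["timestamp"])
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]


def label_is_mature(row, as_of, delay_days=45):
    """True once the simulated label delay has elapsed for this row."""
    return as_of - row["timestamp"] >= timedelta(days=delay_days)


start = datetime(2024, 1, 1)
rows = [{"timestamp": start + timedelta(days=d)} for d in (9, 3, 7, 1, 5, 0, 8, 2, 6, 4)]
train, test = time_based_split(rows)
```

Because the split is chronological, every training row predates every test row, which is what prevents leakage from the future.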

Operational Metrics (What Fraud Teams Care About)

Default threshold: 0.70 (tuned to cap review queue ~3% while keeping recall >80%)

Metric | Value | Interpretation
PR-AUC | 0.88 | Primary metric for imbalanced classification
Recall @ 1% FPR | 68% | Catch rate when only 1% false alarms tolerated
Precision @ top 0.5% risk | 71% | Hit rate in highest-risk transactions
Precision @ threshold | 67% | Of flagged transactions, 2/3 are actual fraud
Recall @ threshold | 83% | Catch 83% of all fraud
Manual review queue | ~2.8% of volume | Threshold tuned to cap queue at ~3% while keeping recall >80%
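
The review-queue size follows from the other threshold metrics. True positives are recall × base rate of all traffic, and dividing by precision gives the flagged share. The function name is illustrative.

```python
def flagged_fraction(precision, recall, base_rate):
    """Share of all transactions flagged at a given threshold.

    true-positive share = recall * base_rate
    flagged share       = true-positive share / precision
    """
    return recall * base_rate / precision


# Values reported above: precision 67%, recall 83%, synthetic base rate 2.3%.
queue_share = flagged_fraction(precision=0.67, recall=0.83, base_rate=0.023)
print(f"{queue_share:.2%}")  # about 2.85% of volume
```

That lines up with the ~2.8% manual review queue reported above.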

Model Comparison

Model | PR-AUC | Notes
Baseline (XGBoost + manual features) | 0.82 | No embeddings
Final (XGBoost + entity embeddings) | 0.88 | Current model

ROC-AUC: 0.98+ (high due to synthetic separability—not a reliable metric for imbalanced fraud)

Slice Performance

Slice | PR-AUC | Status
High-value transactions (> $500) | 0.85 | Pass
New users (< 30 days) | 0.79 | Pass
International transactions | 0.84 | Pass
Card-not-present | 0.86 | Pass

CI-ready regression gate: All slices must achieve PR-AUC ≥ 0.75. No slice regression > 5%.

Example CI Gate Output
(Shown as example—run make eval locally to see real output)
Workflow: eval-gate.yml · Eval Gate: PASS · Worst slice: New users PR-AUC = 0.79 · Regressions: 0
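
A sketch of how the gate rule could be checked, not the contents of eval-gate.yml. The text doesn't say whether "regression > 5%" is relative or absolute; this example assumes a relative drop against the previous run, and the slice keys are hypothetical.

```python
MIN_PR_AUC = 0.75
MAX_REGRESSION = 0.05  # assumption: relative drop vs. the previous run


def eval_gate(current, previous):
    """Return (passed, failures) given per-slice PR-AUC dicts."""
    failures = []
    for slice_name, score in current.items():
        if score < MIN_PR_AUC:
            failures.append(f"{slice_name}: PR-AUC {score:.2f} below floor {MIN_PR_AUC:.2f}")
        prev = previous.get(slice_name)
        if prev is not None and prev > 0 and (prev - score) / prev > MAX_REGRESSION:
            failures.append(f"{slice_name}: regressed {(prev - score) / prev:.1%} from {prev:.2f}")
    return (not failures), failures


baseline = {"high_value": 0.85, "new_users": 0.79, "international": 0.84, "card_not_present": 0.86}
passed, failures = eval_gate(baseline, baseline)

# A slice that falls from 0.84 to 0.61 trips both checks.
degraded = dict(baseline, international=0.61)
passed_degraded, failures_degraded = eval_gate(degraded, baseline)
```

In CI, a non-empty failures list would translate to a non-zero exit code.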

Monitoring & Drift

Metric | Method | Alert Threshold | Response
Prediction score drift | PSI on score distribution | > 0.1 | Investigate feature drift
Feature drift | KS test on 20 key features | p < 0.01 on >3 features | Check upstream data
Base rate drift | 7-day rolling fraud rate | > 20% deviation from baseline | Trigger recalibration
Feature null-rate spike | Null % per feature | > 5% increase | Check data pipeline
Schema change | Contract validation | Any schema mismatch | Block pipeline, alert
Feature computation lag | Streaming lag metrics | > 5 min behind | Scale workers
Label freshness | Days since last label batch | > 45 days | Check label pipeline
Latency p95 | Service metrics | > 50ms | Scale or optimize
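
For reference, the PSI check in the first row can be computed as below. This sketch assumes scores lie in [0, 1], ten equal-width bins, and a small epsilon for empty bins; the repo's binning may differ.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between two non-empty samples of scores in [0, 1]."""

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = min(int(v * n_bins), n_bins - 1)  # put v == 1.0 in the last bin
            counts[idx] += 1
        # eps avoids log(0) and division by zero for empty bins
        return [max(c / len(values), eps) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [i / 100 for i in range(100)]
shifted = [min(v + 0.3, 1.0) for v in baseline]

psi_same = psi(baseline, baseline)
psi_shifted = psi(baseline, shifted)
alert = psi_shifted > 0.1  # alert threshold from the table above
```

PSI is zero for identical distributions and grows as probability mass moves between bins.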

See Drift Runbook for detailed response procedures.

Ownership: On-call owns drift/latency alerts. Data Eng owns schema/lag. ML Eng owns recalibration + model updates.

Cost & Latency

~15ms p50 Latency
~45ms p95 Latency
~1.2K TPS (single instance)
~$360 Monthly (estimated)
How measured:
  • Latency: uvicorn workers=4, batch=1, 1KB payload, 1000-request benchmark on local machine
  • TPS: Single instance (c5.xlarge equivalent, 4 vCPUs), no batching. Linear scaling with instance count is assumed, not measured.
  • Cost: AWS pricing estimate: 2× c5.xlarge ($248) + Redis cache.t3.medium ($25) + storage/egress (~$90). Does not include Kafka (use managed or self-host).
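
The cost estimate is a simple sum. The sketch below reuses the figures quoted above, which are this page's estimates rather than current AWS list prices, and assumes the Redis and storage costs do not change with instance count.

```python
# Figures are the estimates quoted above, not current AWS list prices.
INSTANCE_MONTHLY = 248 / 2      # per c5.xlarge, derived from "2x c5.xlarge ($248)"
REDIS_MONTHLY = 25              # cache.t3.medium
STORAGE_EGRESS_MONTHLY = 90     # approximate


def monthly_cost(instances=2):
    """Estimated monthly cost in USD, excluding Kafka."""
    return instances * INSTANCE_MONTHLY + REDIS_MONTHLY + STORAGE_EGRESS_MONTHLY


estimate = monthly_cost()  # 363.0, reported above as ~$360
```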

Note: Production deployment would need load testing with realistic traffic patterns.

Failure Modes

Failure Mode | Likelihood | Impact | Mitigation
Model scores all low-risk | Medium | High | Minimum fraud rate alert
Latency spike | Medium | Medium | Circuit breaker + rule fallback
Label pipeline breaks | Low | High | Label freshness alert (>45 days)
Adversarial fraud pattern | High | High | Drift detection + rapid retrain
Reason-code drift / feature leakage | Low | Medium | Feature importance drift checks (SHAP optional) + top feature review
Upstream schema change | Medium | High | Data contract validation blocks pipeline
Residual Risk: The system cannot detect zero-day fraud patterns until labeled examples arrive (30+ day lag). Mitigation: rule-based fallback for anomalous transactions + human review queue.
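
One way the rule fallback might look is sketched below. The function names, the amount limit of 5000, the reason strings, and the choice to allow when the model is unavailable and no rule fires are all assumptions for illustration; only the 0.70 threshold comes from this page.

```python
def rule_decision(txn, blocklist, amount_limit):
    """Rules-only decision; returns None when no rule fires."""
    if txn["merchant_id"] in blocklist:
        return "block", "blocklist_hit"
    if txn["amount"] > amount_limit:
        return "review", "amount_over_limit"
    return None


def decide(txn, score_fn, threshold=0.70, blocklist=frozenset(), amount_limit=5000.0):
    """Rule overrides first, then the model; rules-only result if scoring fails."""
    ruled = rule_decision(txn, blocklist, amount_limit)
    if ruled is not None:
        return ruled
    try:
        score = score_fn(txn)
    except Exception:
        # Model unavailable (timeout, open circuit breaker). No rule fired,
        # so this sketch allows the transaction and records the reason.
        return "allow", "model_unavailable_rules_only"
    if score >= threshold:
        return "review", "model_score_high"
    return "allow", "model_score_low"


def broken_model(txn):  # stand-in for a scoring call that times out
    raise TimeoutError("scoring timed out")


small = {"merchant_id": "m1", "amount": 10.0}
large = {"merchant_id": "m1", "amount": 6000.0}
```

Running the rules before the model means blocklists and amount limits still apply when the scoring service is down.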

Visual Proof

Screenshots from benchmark runs and monitoring simulations are coming soon. In the meantime, run make benchmark locally to generate your own.

Incident Simulation

What I Broke

Injected a synthetic fraud pattern shift mid-stream: new fraud type targeting a previously safe merchant category (simulated "travel" category attack).

What Happened

  • Drift monitor detected PSI spike (0.15 > 0.1 threshold)
  • Feature KS test flagged merchant_category distribution shift
  • Slice analysis showed "travel" category PR-AUC dropped from 0.84 to 0.61
  • Retraining trigger fired automatically
  • New model trained and validated against regression suite

Postmortem

Full postmortem documented with timeline, root cause, and prevention measures. View postmortem →

Assumptions & Limitations

  • Synthetic data: Demo uses generated transactions (no real PII). Patterns are realistic but not from production.
  • Simulated delay: 30–90 day label lag is simulated by holding back labels, not real chargebacks.
  • Local benchmarks: Latency/TPS measured on dev machine, not production infrastructure.
  • Mock streaming: Demo uses mock stream for portability. Kafka consumer/producer stubs included; diagram shows target architecture.

Data Contract (excerpt)

TransactionEvent:
  required:
    - transaction_id: string (uuid)
    - amount: float (>0)
    - user_id: string
    - merchant_id: string
    - timestamp: datetime (ISO8601)
  optional:
    - device_id: string | null
    - ip_address: string | null
  null_policy: "device_id and ip_address nullable, others reject on null"

(Each API request is validated and becomes a TransactionEvent internally.)
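
A standard-library sketch of how the excerpt could be enforced on an incoming payload. The validate_event function is hypothetical (the FastAPI service would more likely rely on its own request models). Note that datetime.fromisoformat accepts only a subset of ISO 8601 before Python 3.11.

```python
import uuid
from datetime import datetime

REQUIRED = ("transaction_id", "amount", "user_id", "merchant_id", "timestamp")


def validate_event(payload):
    """Return a list of contract violations; an empty list means valid."""
    errors = [f"{f}: required, must not be null" for f in REQUIRED if payload.get(f) is None]
    if errors:
        return errors
    try:
        uuid.UUID(str(payload["transaction_id"]))
    except ValueError:
        errors.append("transaction_id: not a valid UUID")
    amount = payload["amount"]
    # bool is a subclass of int, so exclude it explicitly
    if isinstance(amount, bool) or not isinstance(amount, (int, float)) or amount <= 0:
        errors.append("amount: must be a number > 0")
    for field in ("user_id", "merchant_id"):
        if not isinstance(payload[field], str):
            errors.append(f"{field}: must be a string")
    try:
        datetime.fromisoformat(str(payload["timestamp"]))
    except ValueError:
        errors.append("timestamp: not an ISO 8601 datetime")
    return errors


good = {
    "transaction_id": "123e4567-e89b-12d3-a456-426614174000",
    "amount": 42.5,
    "user_id": "u1",
    "merchant_id": "m1",
    "timestamp": "2024-01-01T12:00:00",
    "device_id": None,
}
bad = dict(good, amount=-1, transaction_id="not-a-uuid")
```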

Reason Codes (sample)

Top reason codes for a high-risk decision:

velocity_user_1h_high new_device_high_amount merchant_category_risk time_since_last_txn_low

velocity_user_1h_high — User initiated >N transactions in the last hour (threshold varies by user history)

Reason codes are derived from SHAP values plus rule triggers, and are used for analyst review and model debugging.
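
To illustrate, the sketch below turns per-feature contributions into ranked reason codes. The feature names and the feature-to-code mapping are hypothetical. The contributions are passed in as a plain dict, so it doesn't depend on any particular SHAP API.

```python
# Hypothetical mapping from model feature to analyst-facing reason code.
REASON_CODES = {
    "txn_count_user_1h": "velocity_user_1h_high",
    "new_device_x_amount": "new_device_high_amount",
    "merchant_category_risk": "merchant_category_risk",
    "secs_since_last_txn": "time_since_last_txn_low",
}


def top_reason_codes(contributions, rule_hits=(), k=4):
    """Rule triggers first, then features with the largest positive contribution."""
    ranked = sorted(
        ((f, v) for f, v in contributions.items() if v > 0 and f in REASON_CODES),
        key=lambda fv: fv[1],
        reverse=True,
    )
    codes = list(rule_hits)
    for feature, _ in ranked:
        code = REASON_CODES[feature]
        if code not in codes:  # avoid repeating a code a rule already produced
            codes.append(code)
    return codes[:k]


contributions = {
    "txn_count_user_1h": 0.9,
    "secs_since_last_txn": 0.2,
    "merchant_category_risk": 0.5,
    "amount_zscore": -0.3,  # negative contribution: lowers risk, so not a reason
}
codes = top_reason_codes(contributions, k=2)
codes_with_rule = top_reason_codes(contributions, rule_hits=("new_device_high_amount",), k=2)
```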

Artifacts