FraudShield

A learning project for understanding fraud detection systems—delayed labels, drift, and what happens when things break.

Note: This is a personal project with synthetic data. The benchmarks are from my own load tests on a 4-core VM, not production traffic. See methodology →

Interactive Demo

This is fake — it's a simulation showing what the API responses look like. Clone the repo if you want to run the real thing locally.

Score a Transaction (Simulated API)

Inputs: Amount ($), Channel, Merchant Category, User Velocity (txns/hr), Account Age (days), New Device?

API Response: Risk Score, Latency, Risk Factors

Drift Simulation (Monitoring)

Simulate a distribution shift and see how the monitoring system responds with a drift alert.

Quick Start

git clone https://github.com/croesus245/fraudshield-mlops-system && cd fraudshield-mlops-system
make setup && make serve       # API on localhost:8000
make simulate-stream           # Generate + stream synthetic transactions
make eval                      # Full evaluation suite
make drift-sim                 # Trigger drift simulation

Why I Built This

Most fraud detection tutorials show you a classifier and call it done. But real fraud systems have problems that tutorials skip:

Labels arrive late: Chargebacks take 30–90 days, so a plain random train/test split doesn't reflect what you know at prediction time.
Fraudsters adapt: Patterns change constantly. Your model from last month might be useless.
Class imbalance is extreme: 0.1–1% fraud rate. Accuracy is a useless metric.
You need to know when it's failing: Drift detection isn't optional.

What I learned: The model is the easy part. Label reconciliation, drift monitoring, and knowing when to retrain—that's the real work.

Architecture

┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ Stream Ingest│───▶│ Streaming    │───▶│ Scoring      │───▶│ Decision     │
│ (Kafka or    │    │ Aggregates   │    │ Service      │    │ Engine       │
│ mock gen)    │    │ + Cache      │    │ (FastAPI)    │    │ (Rules+Model)│
└──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘
                           │                   │                   │
                           ▼                   │                   ▼
                    ┌──────────────┐           │            ┌──────────────┐
                    │ Online Cache │           │            │ Action       │
                    │ (Redis)      │           │            │ (Allow/Review│
                    └──────────────┘           │            │ /Block)      │
                                               │            └──────────────┘
                                               ▼
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ Prediction   │───▶│ Label        │───▶│ Drift        │
│ Log          │    │ Reconciler   │    │ Monitor      │
│ (audit trail)│    │ (30–90d lag) │    │              │
└──────────────┘    └──────────────┘    └──────────────┘
                                               │
                           ┌───────────────────┘
                           ▼
                    ┌──────────────┐
                    │ Retraining   │
                    │ Trigger      │
                    └──────────────┘

Label Reconciliation Flow

1. Predict → 2. Log (txn_id, score, decision) → 3. Wait 30–90d → 4. Label batch arrives → 5. Join & evaluate
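
The join-and-evaluate step can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the repo's reconciler: the `is_fraud` field, the lowercase decision names, and both function names are hypothetical, and predictions without a label yet are simply kept as pending.

```python
def reconcile(prediction_log, label_batch):
    """Join a delayed label batch back onto logged predictions by txn_id."""
    labels_by_txn = {row["txn_id"]: row["is_fraud"] for row in label_batch}
    joined, pending = [], []
    for pred in prediction_log:
        if pred["txn_id"] in labels_by_txn:
            joined.append({**pred, "is_fraud": labels_by_txn[pred["txn_id"]]})
        else:
            pending.append(pred)  # label has not arrived yet
    return joined, pending


def precision_recall(joined, flagged_decisions=("review", "block")):
    """Precision and recall of the logged decisions on reconciled rows."""
    tp = sum(1 for r in joined if r["decision"] in flagged_decisions and r["is_fraud"])
    fp = sum(1 for r in joined if r["decision"] in flagged_decisions and not r["is_fraud"])
    fn = sum(1 for r in joined if r["decision"] not in flagged_decisions and r["is_fraud"])
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


prediction_log = [
    {"txn_id": "t1", "score": 0.91, "decision": "block"},
    {"txn_id": "t2", "score": 0.12, "decision": "allow"},
    {"txn_id": "t3", "score": 0.75, "decision": "review"},
]
label_batch = [
    {"txn_id": "t1", "is_fraud": True},
    {"txn_id": "t2", "is_fraud": True},
]
joined, pending = reconcile(prediction_log, label_batch)
precision, recall = precision_recall(joined)
```

Evaluating only on `joined` keeps unlabeled transactions out of the metrics until their labels arrive.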

Components

Streaming aggregates: Velocity features per user/merchant/device (1m/10m/1h/24h windows)
Online feature cache: Redis for fast lookups; offline Parquet store for training
Entity embeddings: User/merchant vectors from transaction co-occurrence (SVD, 32-dim). Computed offline weekly, served via cache.
Scoring API: FastAPI + XGBoost → risk score + decision + reason codes
Decision engine: Model score + rule overrides (velocity limits, blocklists, amount thresholds)
Prediction log: Store inputs/score/decision/model_version for audits and reconciliation
Label reconciler: Joins confirmed labels back to prior predictions (30–90 day lag)
Drift monitor: PSI on predictions, KS on 20 key features, base-rate tracking, null-rate alerts
Retrain trigger: Automatic job when drift thresholds trip (+ optional human approval gate)
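
As an illustration of the streaming aggregates, here is a minimal in-memory sliding-window counter for the 1m/10m/1h/24h windows listed above. It is a sketch, not the repo's implementation: the `VelocityTracker` class is hypothetical, timestamps are plain seconds, and events are assumed to arrive in time order per entity.

```python
from collections import deque

# Window sizes in seconds, matching the windows named in the component list.
WINDOWS_SECONDS = {"1m": 60, "10m": 600, "1h": 3600, "24h": 86400}


class VelocityTracker:
    """Counts events per entity (user, merchant, or device) over sliding windows."""

    def __init__(self):
        self._events = {}  # entity_id -> deque of event timestamps (seconds)

    def record(self, entity_id, ts):
        queue = self._events.setdefault(entity_id, deque())
        queue.append(ts)
        # Evict anything older than the largest window to bound memory.
        horizon = ts - max(WINDOWS_SECONDS.values())
        while queue and queue[0] <= horizon:
            queue.popleft()

    def counts(self, entity_id, now):
        queue = self._events.get(entity_id, ())
        return {
            name: sum(1 for t in queue if now - size < t <= now)
            for name, size in WINDOWS_SECONDS.items()
        }


tracker = VelocityTracker()
for ts in (0, 3500, 3970, 4000):
    tracker.record("user-1", ts)
velocity = tracker.counts("user-1", now=4000)
```

A Redis-backed version would keep the same interface and move the per-entity queue into the cache.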

Benchmarks

How I tested this: Locust load testing against the FastAPI service running on my 4-core VM. Synthetic transaction payloads. This tells you the code works and roughly how fast—not how it would perform in a real production environment with real data.

Metric	Value	Notes
Latency target	< 50ms p95	Achieved ~45ms in synthetic load test
Throughput	~1,200 TPS	Locust load test, single instance
Label delay	30–90 days	Simulated in demo
Synthetic base rate	2.3% fraud	Higher than prod for stable eval

Evaluation

How measured: Time-based train/test split (80/20). Test set labels reconciled after 45-day simulated delay. Synthetic fraud base rate: 2.3% (higher than production 0.1–1% for stable evaluation). At 0.1% base rate, small synthetic datasets produce unstable PR curves; the system supports lower base rates via scaling the generator.
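
A time-based 80/20 split amounts to sorting by event time and cutting once, so every test transaction is later than every training transaction. The sketch below assumes rows are dicts with a `timestamp` field (as in the data contract); the function name is hypothetical.

```python
from datetime import datetime


def time_based_split(rows, train_fraction=0.8, ts_key="timestamp"):
    """Split rows chronologically: the earliest fraction trains, the rest tests."""
    ordered = sorted(rows, key=lambda r: r[ts_key])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Ten transactions in shuffled order, one per day.
rows = [{"timestamp": datetime(2024, 1, day)} for day in (5, 1, 9, 3, 7, 2, 10, 4, 8, 6)]
train, test = time_based_split(rows)
```

Unlike a random split, this never lets the model train on transactions that happened after the ones it is tested on.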

Operational Metrics (What Fraud Teams Care About)

Default threshold: 0.70 (tuned to cap review queue ~3% while keeping recall >80%)

Metric	Value	Interpretation
PR-AUC	0.88	Primary metric for imbalanced classification
Recall @ 1% FPR	68%	Catch rate when only 1% false alarms tolerated
Precision @ top 0.5% risk	71%	Hit rate in highest-risk transactions
Precision @ threshold	67%	Of flagged transactions, 2/3 are actual fraud
Recall @ threshold	83%	Catch 83% of all fraud
Manual review queue	~2.8% of volume	Threshold tuned to cap queue at ~3% while keeping recall >80%
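
The review queue figure can be cross-checked from the other rows. Flagged volume equals true positives divided by precision, and true positives equal base rate times recall. A quick sketch using the numbers in the table (the function name is hypothetical):

```python
def review_queue_rate(base_rate, recall, precision):
    """Fraction of all transactions flagged at the operating threshold."""
    true_positive_rate_of_volume = base_rate * recall
    return true_positive_rate_of_volume / precision


rate = review_queue_rate(base_rate=0.023, recall=0.83, precision=0.67)
print(f"{rate:.2%}")  # 2.85%
```

That agrees with the ~2.8% queue reported above, so the threshold metrics are internally consistent.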

Model Comparison

Model	PR-AUC	Notes
Baseline (XGBoost + manual features)	0.82	No embeddings
Final (XGBoost + entity embeddings)	0.88	Current model

ROC-AUC: 0.98+ (high due to synthetic separability—not a reliable metric for imbalanced fraud)

Slice Performance

Slice	PR-AUC	Status
High-value transactions (> $500)	0.85	Pass
New users (< 30 days)	0.79	Pass
International transactions	0.84	Pass
Card-not-present	0.86	Pass

CI-ready regression gate: All slices must achieve PR-AUC ≥ 0.75. No slice regression > 5%.

Example CI Gate Output
(Shown as example—run make eval locally to see real output)
Workflow: eval-gate.yml · Eval Gate: PASS · Worst slice: New users PR-AUC = 0.79 · Regressions: 0
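
A gate like this can be sketched as a small function over per-slice scores. This is an assumption-laden illustration, not the contents of eval-gate.yml: it reads "regression > 5%" as a relative drop against a baseline run, and `eval_gate` and the baseline dict are hypothetical.

```python
MIN_SLICE_PR_AUC = 0.75
MAX_RELATIVE_REGRESSION = 0.05  # assumption: 5% relative to the baseline score


def eval_gate(current, baseline):
    """Check every slice against the PR-AUC floor and the regression limit."""
    failures = []
    for name, score in current.items():
        if score < MIN_SLICE_PR_AUC:
            failures.append(f"{name}: PR-AUC {score:.2f} is below {MIN_SLICE_PR_AUC:.2f}")
        previous = baseline.get(name)
        if previous is not None and score < previous * (1 - MAX_RELATIVE_REGRESSION):
            failures.append(f"{name}: regressed from {previous:.2f} to {score:.2f}")
    worst = min(current, key=current.get)
    return {"passed": not failures, "worst_slice": worst, "failures": failures}


# Slice scores from the table above; the baseline is assumed equal for illustration.
current = {
    "High-value transactions (> $500)": 0.85,
    "New users (< 30 days)": 0.79,
    "International transactions": 0.84,
    "Card-not-present": 0.86,
}
result = eval_gate(current, baseline=dict(current))
```

In CI, a failing result would exit non-zero so the workflow blocks the merge.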

Monitoring & Drift

Metric	Method	Alert Threshold	Response
Prediction score drift	PSI on score distribution	> 0.1	Investigate feature drift
Feature drift	KS test on 20 key features	p < 0.01 on >3 features	Check upstream data
Base rate drift	7-day rolling fraud rate	> 20% deviation from baseline	Trigger recalibration
Feature null-rate spike	Null % per feature	> 5% increase	Check data pipeline
Schema change	Contract validation	Any schema mismatch	Block pipeline, alert
Feature computation lag	Streaming lag metrics	> 5 min behind	Scale workers
Label freshness	Days since last label batch	> 45 days	Check label pipeline
Latency p95	Service metrics	> 50ms	Scale or optimize

See Drift Runbook for detailed response procedures.

Ownership: On-call owns drift/latency alerts. Data Eng owns schema/lag. ML Eng owns recalibration + model updates.
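
For the prediction score drift row, PSI compares the binned score distribution of a baseline window against a recent window. The sketch below is a minimal pure-Python version, assuming scores lie in [0, 1] and using ten equal-width bins; the bin count, the `eps` floor for empty bins, and the function name are choices made for this example, not taken from the repo.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between two samples of scores in [0, 1]."""

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = min(max(int(v * bins), 0), bins - 1)  # clamp into a valid bin
            counts[idx] += 1
        # Floor empty bins at eps so the log term stays finite.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline_scores = [i / 100 for i in range(100)]
shifted_scores = [min(s + 0.2, 0.99) for s in baseline_scores]

no_drift = psi(baseline_scores, baseline_scores)
drift = psi(baseline_scores, shifted_scores)
alert = drift > 0.1  # alert threshold from the table above
```

Identical samples give a PSI of zero, and the shifted sample lands well above the 0.1 alert line.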

Cost & Latency

p50 latency: ~15ms
p95 latency: ~45ms
Throughput: ~1.2K TPS (single instance)
Monthly cost: ~$360 (estimated)

How measured:

Latency: uvicorn workers=4, batch=1, 1KB payload, 1000-request benchmark on local machine
TPS: Single c5.xlarge equivalent, no batching. Scales linearly with instances.
Cost: AWS pricing estimate: 2× c5.xlarge ($248) + Redis cache.t3.medium ($25) + storage/egress (~$90). Does not include Kafka (use managed or self-host).
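
The monthly estimate is just the sum of the line items quoted above. A quick check, using only those figures:

```python
# Monthly line items in USD, as quoted in the cost note above.
monthly_costs_usd = {
    "2x c5.xlarge": 248,
    "Redis cache.t3.medium": 25,
    "storage/egress (approx.)": 90,
}
total = sum(monthly_costs_usd.values())
print(total)  # 363
```

That rounds to the ~$360 headline figure; Kafka would come on top of it.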

Note: Production deployment would need load testing with realistic traffic patterns.

Failure Modes

Failure Mode	Likelihood	Impact	Mitigation
Model scores all low-risk	Medium	High	Minimum fraud rate alert
Latency spike	Medium	Medium	Circuit breaker + rule fallback
Label pipeline breaks	Low	High	Label freshness alert (>45 days)
Adversarial fraud pattern	High	High	Drift detection + rapid retrain
Reason-code drift / feature leakage	Low	Medium	Feature importance drift checks (SHAP optional) + top feature review
Upstream schema change	Medium	High	Data contract validation blocks pipeline

Residual Risk: The system cannot detect zero-day fraud patterns until labeled examples arrive (30+ day lag). Mitigation: rule-based fallback for anomalous transactions + human review queue.
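
The "circuit breaker + rule fallback" mitigation could look roughly like the sketch below. Everything here is hypothetical: the class, the failure threshold, the reset window, and the fallback rule are illustrative choices. The 0.70 threshold and the 50ms budget come from the numbers above, and mapping a score over the threshold to "review" is an assumption.

```python
import time


class CircuitBreaker:
    """Opens after consecutive failures, then allows a trial call after a cooldown."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def allow_request(self):
        if self._opened_at is None:
            return True
        # Half-open: after the cooldown, let a trial call through.
        return self._clock() - self._opened_at >= self.reset_after_s

    def record_success(self):
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()


def rule_fallback(txn):
    # Hypothetical rule: large amounts on a new device go to review.
    if txn["amount"] > 500 and txn.get("new_device"):
        return "review"
    return "allow"


def decide(txn, score_fn, breaker, budget_s=0.05, threshold=0.70):
    """Use the model while the breaker is closed; otherwise fall back to rules."""
    if breaker.allow_request():
        start = time.monotonic()
        try:
            score = score_fn(txn)
        except Exception:
            breaker.record_failure()
            return rule_fallback(txn), "rules"
        # A slow answer is still used, but it counts against the breaker.
        if time.monotonic() - start > budget_s:
            breaker.record_failure()
        else:
            breaker.record_success()
        return ("review" if score >= threshold else "allow"), "model"
    return rule_fallback(txn), "rules"
```

While the breaker is open, requests skip the model entirely, which keeps latency bounded during an outage at the cost of rule-only decisions.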

Visual Proof

Screenshots from benchmark runs and monitoring simulations are coming soon. Planned captures:

Load test chart (make benchmark): Locust dashboard, ~1.2K TPS sustained, p95 < 50ms
Drift alert (make drift-sim): PSI threshold breach triggering retraining workflow
CI eval gate (.github/workflows/): GitHub Actions eval gate with slice regression checks

Run make benchmark locally to generate your own.

Incident Simulation

What I Broke

Injected a synthetic fraud pattern shift mid-stream: new fraud type targeting a previously safe merchant category (simulated "travel" category attack).

What Happened

Drift monitor detected PSI spike (0.15 > 0.1 threshold)
Feature KS test flagged merchant_category distribution shift
Slice analysis showed "travel" category PR-AUC dropped from 0.84 → 0.61
Retraining trigger fired automatically
New model trained and validated against regression suite

Postmortem

Full postmortem documented with timeline, root cause, and prevention measures. View postmortem →

Assumptions & Limitations

Synthetic data: Demo uses generated transactions (no real PII). Patterns are realistic but not from production.
Simulated delay: 30–90 day label lag is simulated by holding back labels, not real chargebacks.
Local benchmarks: Latency/TPS measured on dev machine, not production infrastructure.
Mock streaming: Demo uses mock stream for portability. Kafka consumer/producer stubs included; diagram shows target architecture.

Data Contract (excerpt)

TransactionEvent:
  required:
    - transaction_id: string (uuid)
    - amount: float (>0)
    - user_id: string
    - merchant_id: string
    - timestamp: datetime (ISO8601)
  optional:
    - device_id: string | null
    - ip_address: string | null
  null_policy: "device_id nullable, others reject on null"

(Each API request is validated and becomes a TransactionEvent internally.)
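
A validator for this contract might look like the sketch below. It is an illustration, not the repo's validation code: the function name is hypothetical, and because the excerpt types `ip_address` as `string | null` while `null_policy` names only `device_id`, this version follows the type annotations and accepts null for both optional fields.

```python
from datetime import datetime
from uuid import UUID

REQUIRED = ("transaction_id", "amount", "user_id", "merchant_id", "timestamp")
NULLABLE_OPTIONAL = ("device_id", "ip_address")  # assumption: both may be null


def validate_transaction_event(payload):
    """Return a list of contract violations; an empty list means valid."""
    errors = [f"{f}: required, must not be null" for f in REQUIRED if payload.get(f) is None]
    if errors:
        return errors

    try:
        UUID(str(payload["transaction_id"]))
    except ValueError:
        errors.append("transaction_id: not a valid UUID")

    amount = payload["amount"]
    if isinstance(amount, bool) or not isinstance(amount, (int, float)) or amount <= 0:
        errors.append("amount: must be a number > 0")

    for field in ("user_id", "merchant_id"):
        if not isinstance(payload[field], str):
            errors.append(f"{field}: must be a string")

    try:
        # Note: older Python versions accept only a subset of ISO 8601 here.
        datetime.fromisoformat(payload["timestamp"])
    except (TypeError, ValueError):
        errors.append("timestamp: not a valid ISO 8601 datetime")

    for field in NULLABLE_OPTIONAL:
        value = payload.get(field)
        if value is not None and not isinstance(value, str):
            errors.append(f"{field}: must be a string or null")
    return errors


event = {
    "transaction_id": "3f2b8c1e-4d5a-4e6f-8a9b-0c1d2e3f4a5b",
    "amount": 42.5,
    "user_id": "u-1",
    "merchant_id": "m-9",
    "timestamp": "2024-01-15T10:30:00+00:00",
    "device_id": None,
}
violations = validate_transaction_event(event)
```

Returning all violations at once, rather than raising on the first, makes rejected events easier to debug.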

Reason Codes (sample)

Top reason codes for a high-risk decision:

velocity_user_1h_high, new_device_high_amount, merchant_category_risk, time_since_last_txn_low

velocity_user_1h_high — User initiated >N transactions in the last hour (threshold varies by user history)

Reason codes are derived from SHAP values and rule triggers. They are used for analyst review and model debugging.
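
One way to turn per-feature contributions and rule hits into a short list of codes is sketched below. The feature names, the mapping, and the ordering (rule hits first, then the largest positive contributions) are all assumptions made for this example; the input is a plain dict of per-feature SHAP values, so no SHAP library call is shown.

```python
# Hypothetical mapping from model feature names to the sample reason codes.
REASON_CODE_BY_FEATURE = {
    "velocity_user_1h": "velocity_user_1h_high",
    "new_device_x_amount": "new_device_high_amount",
    "merchant_category": "merchant_category_risk",
    "time_since_last_txn": "time_since_last_txn_low",
}


def top_reason_codes(contributions, rule_hits=(), k=3):
    """Return up to k codes: triggered rules first, then top risk-raising features."""
    ranked = sorted(
        (
            (value, feature)
            for feature, value in contributions.items()
            if value > 0 and feature in REASON_CODE_BY_FEATURE
        ),
        reverse=True,
    )
    codes = list(rule_hits)
    for _, feature in ranked:
        code = REASON_CODE_BY_FEATURE[feature]
        if code not in codes:
            codes.append(code)
    return codes[:k]


contributions = {
    "velocity_user_1h": 0.9,
    "merchant_category": 0.4,
    "time_since_last_txn": -0.2,  # lowers risk, so it is not reported
    "account_age": 0.1,           # no reason code mapped, so it is skipped
}
codes = top_reason_codes(contributions, rule_hits=("new_device_high_amount",))
```

Only features that push the score up are reported, since a reason code should explain why a transaction looks risky.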

Artifacts

Eval Report · Drift Runbook · Postmortem · Model Card · Cost Report · Repository