I build ML systems and try to make them not break in production.

Fraud detection, secure RAG, benchmarks. Each one taught me something the hard way.

These are learning projects with synthetic data. Benchmarks and methodology are documented so you can verify.

View Proof Artifacts →

MLOps

FraudShield

Fraud scoring API I built to learn how real systems handle delayed labels and drift. Got it to ~1.2K TPS on my VM. The interesting part is the label reconciliation—fraud labels arrive 30-90 days late.

CI-Eval Drift Cost Latency Postmortem

Case Study Repo

LLM/GenAI

SecureRAG

RAG system where I focused on security instead of just retrieval metrics. Wrote 120+ attack tests myself. Some injections still get through (2%)—I document what fails.

Attack Tests Faithfulness CI-Eval Cost

Case Study Repo

Research

ShiftBench

Skin lesion benchmark that doesn't hide distribution shift. Train on old data, test on new data. Results are worse than random splits suggest—that's the point.

Slice Metrics Reproducible Model Card CI-Eval

Case Study Repo

View All Projects → See Proof Artifacts →

What I include with each project:

Eval scripts that run in CI Drift monitoring setup Security tests (for LLM stuff) Cost estimates What went wrong docs

About

I'm learning ML engineering by building things and breaking them. These projects are how I teach myself what production systems actually need—stuff like handling delayed labels, defending against prompt injection, and not lying to myself with random train/test splits.

I try to document failures, not just successes. The postmortems are real.

Get in Touch → More About Me →