Abdul-Sobur Ayinde

I build ML systems and try to make them not break in production.

Fraud detection, secure RAG, benchmarks. Each one taught me something the hard way.

These are learning projects with synthetic data. Benchmarks and methodology are documented so you can verify.

MLOps

FraudShield

Fraud scoring API I built to learn how real systems handle delayed labels and drift. Got it to ~1.2K TPS on my VM. The interesting part is the label reconciliation—fraud labels arrive 30-90 days late.

CI-Eval Drift Cost Latency Postmortem
LLM/GenAI

SecureRAG

RAG system where I focused on security instead of just retrieval metrics. Wrote 120+ attack tests myself. Some injections still get through (2%)—I document what fails.

Attack Tests Faithfulness CI-Eval Cost
Research

ShiftBench

Skin lesion benchmark that doesn't hide distribution shift. Train on old data, test on new data. Results are worse than random splits suggest—that's the point.

Slice Metrics Reproducible Model Card CI-Eval

What I include with each project:

Eval scripts that run in CI Drift monitoring setup Security tests (for LLM stuff) Cost estimates What went wrong docs
Abdul-Sobur Ayinde

About

I'm learning ML engineering by building things and breaking them. These projects are how I teach myself what production systems actually need—stuff like handling delayed labels, defending against prompt injection, and not lying to myself with random train/test splits.

I try to document failures, not just successes. The postmortems are real.