ML Systems Engineer
December 2025 Evaluation 6 min read

Why Offline Metrics Lie

An experiment in deceptive validation

On a personal project, I got 0.92 AUC on a random holdout. Tested on newer data. Watched it drop to 0.76. This is what I learned about why random splits hide temporal drift, and how I think about evaluation now.

Main point: Split by time, or you're lying to yourself.

What I'd do differently: Use time-based splits from day one. Never trust a random holdout again. Add a "future simulation" test that holds out the most recent 20% of rows by timestamp.
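
A minimal sketch of that "future simulation" split, assuming each row is a dict with a `timestamp` key. The function name, the key name, and the toy rows are illustrative assumptions, not code from the original project.

```python
from datetime import datetime, timedelta


def time_based_split(rows, timestamp_key="timestamp", holdout_fraction=0.2):
    """Return (train, holdout), where holdout is the most recent slice by timestamp."""
    if not 0.0 < holdout_fraction < 1.0:
        raise ValueError("holdout_fraction must be strictly between 0 and 1")
    if len(rows) < 2:
        raise ValueError("need at least two rows to split")

    # Sort oldest to newest so the tail of the list is the "future".
    ordered = sorted(rows, key=lambda row: row[timestamp_key])

    # Keep at least one row on each side of the cutoff.
    n_holdout = round(len(ordered) * holdout_fraction)
    n_holdout = min(len(ordered) - 1, max(1, n_holdout))
    cutoff = len(ordered) - n_holdout
    return ordered[:cutoff], ordered[cutoff:]


# Toy data (hypothetical): ten daily rows in shuffled order.
start = datetime(2025, 1, 1)
example_rows = [
    {"timestamp": start + timedelta(days=d), "label": d % 2}
    for d in [3, 9, 0, 7, 1, 5, 2, 8, 4, 6]
]
example_train, example_holdout = time_based_split(example_rows)
```

Every training row is older than every holdout row, so the holdout score reflects how the model behaves on data it could not have seen at training time.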
November 2025 MLOps 8 min read

How I Design Eval Suites and CI Gates

A practical template you can steal

The eval framework I built for my projects runs four checks: subgroup regression, calibration, latency, and cost. I'm sharing the YAML config and my reasoning; steal whatever's useful.
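
As a rough illustration of how four such checks could gate a CI run, here is a sketch in Python rather than the post's YAML. The metric names and thresholds are hypothetical placeholders, not values from the original config.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    name: str         # human-readable check name
    metric: str       # key expected in the metrics dict
    max_value: float  # the gate fails when the metric exceeds this


# Hypothetical thresholds, chosen only for illustration.
GATES = [
    Gate("subgroup regression", "worst_subgroup_auc_drop", 0.02),
    Gate("calibration", "expected_calibration_error", 0.05),
    Gate("latency", "p95_latency_ms", 200.0),
    Gate("cost", "cost_per_1k_predictions_usd", 0.50),
]


def evaluate_gates(metrics, gates=GATES):
    """Return a list of failure messages; an empty list means every gate passed."""
    failures = []
    for gate in gates:
        if gate.metric not in metrics:
            # A missing metric fails closed rather than passing silently.
            failures.append(f"{gate.name}: metric '{gate.metric}' is missing")
        elif metrics[gate.metric] > gate.max_value:
            failures.append(
                f"{gate.name}: {gate.metric}={metrics[gate.metric]} "
                f"exceeds limit {gate.max_value}"
            )
    return failures
```

In a CI job, the script would exit non-zero whenever `evaluate_gates(...)` returns a non-empty list, which blocks the merge or deploy.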

Main point: CI gates are the only defense against silent model regression.

What I'd do differently: Add distribution shift simulation as a 5th gate. Synthetically perturb the test set and require graceful degradation.
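
One way such a shift gate might look, sketched with Gaussian feature noise as the perturbation. The noise model, the `max_drop` tolerance, and all function names are assumptions for illustration; a real gate would pick perturbations that match the shifts expected in production.

```python
import random


def accuracy(predict, features, labels):
    """Fraction of rows where predict(row) matches the label."""
    correct = sum(1 for x, y in zip(features, labels) if predict(x) == y)
    return correct / len(labels)


def perturb(features, noise_std, seed=0):
    """Return a copy of the features with Gaussian noise added to every value."""
    rng = random.Random(seed)  # seeded so the CI check is reproducible
    return [[value + rng.gauss(0.0, noise_std) for value in row] for row in features]


def degrades_gracefully(predict, features, labels,
                        noise_std=0.1, max_drop=0.05, seed=0):
    """True when accuracy on the perturbed set drops by at most max_drop."""
    baseline = accuracy(predict, features, labels)
    shifted = accuracy(predict, perturb(features, noise_std, seed), labels)
    return (baseline - shifted) <= max_drop
```

The gate compares the model against itself, so it catches brittleness even when the unperturbed score looks healthy.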
October 2025 LLM Security 10 min read

Security Failures in LLM Apps

What I learned from attacking my own RAG system

I built SecureRAG and then spent time trying to break it. Prompt injection, data exfiltration, tool abuse. This is what I learned about LLM security by thinking like an attacker.

Main point: Assume the LLM is compromised. Design permissions as if it's an untrusted user.

What I'd do differently: Build on that assumption from the start, so the permission system treats the LLM as an untrusted user rather than a trusted component from the first design pass.
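
To make the "untrusted user" idea concrete, here is a small sketch of a permission check that runs outside the model on every tool call it requests. The tool names, the `ToolCall` shape, and the policy table are hypothetical; this is not SecureRAG's actual code.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    tool: str   # tool name requested by the model
    args: dict  # arguments requested by the model


# Hypothetical policy: tool name -> argument names the model may supply.
ALLOWED_TOOLS = {
    "search_docs": {"query"},
    "get_document": {"doc_id"},
}


def authorize(call, user_doc_ids, allowed_tools=ALLOWED_TOOLS):
    """Return (allowed, reason), checked against the end user's permissions.

    The model's output is treated like input from an untrusted user:
    nothing it asks for is executed unless this check passes.
    """
    if call.tool not in allowed_tools:
        return False, f"tool '{call.tool}' is not on the allowlist"

    unexpected = set(call.args) - allowed_tools[call.tool]
    if unexpected:
        return False, f"unexpected arguments: {sorted(unexpected)}"

    # Document access is scoped to what the requesting user may read,
    # regardless of what the prompt or retrieved text told the model.
    if call.tool == "get_document" and call.args.get("doc_id") not in user_doc_ids:
        return False, "document is outside the requesting user's permissions"

    return True, "ok"
```

Because the check never consults the model's own claims about who it is acting for, an injected prompt can change what the model requests but not what it is allowed to do.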