ML Systems Engineer
LLM/GenAI

SecureRAG

A RAG system that I actually tried to break. Most tutorials skip security entirely.

Note: This runs locally on synthetic documents. I wrote the 120+ attack tests myself; they exercise my own code, not a production system. Some attacks still succeed.

Interactive Demo

This demo is simulated: it shows how the security logic responds, but it runs entirely in your browser, not against a real API. Clone the repo to run it for real.

Quick Start

git clone https://github.com/croesus245/securerag-defense-in-depth && cd securerag-defense-in-depth
pip install -r requirements.txt
python -m src.api                    # Start API server
pytest tests/attacks/ -v             # Run 120+ attack tests

Problem

RAG systems are often deployed without adversarial testing. Prompt injection can leak documents, exfiltrate data, or abuse tools. Most portfolios show a "RAG chatbot" with zero security.

Production RAG needs permission models, output validation, and attack test suites, not just retrieval metrics.

Architecture

┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ User Query   │───▶│ Input        │───▶│ Query        │
│              │    │ Sanitizer    │    │ Encoder      │
└──────────────┘    └──────────────┘    └──────────────┘
                                               │
                           ┌───────────────────┘
                           ▼
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ Document     │◀───│ Retriever    │───▶│ Permission   │
│ Store        │    │ (Vector DB)  │    │ Filter       │
│ (per-tenant) │    └──────────────┘    └──────────────┘
└──────────────┘                               │
                           ┌───────────────────┘
                           ▼
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ LLM          │◀───│ Prompt       │───▶│ Tool         │
│ (GPT-4/Local)│    │ Constructor  │    │ Executor     │
└──────────────┘    └──────────────┘    │ (sandboxed)  │
       │                                └──────────────┘
       ▼
┌──────────────┐    ┌──────────────┐
│ Output       │───▶│ Response     │
│ Validator    │    │ (to user)    │
└──────────────┘    └──────────────┘
                        

Security Layers

The idea: don't trust the LLM. Put checks around it.

  • Input check: Classifier tries to catch injection attempts before they hit the LLM. Doesn't catch everything.
  • Permission filter: Retrieved docs are filtered by tenant_id. The LLM never sees docs you shouldn't access.
  • Document trust: Instructions inside docs are ignored. Docs are data, not commands.
  • Tool sandbox: Tools are allowlisted. No shell access, strict schemas, rate limits.
  • Output validator: Scans for PII leakage and checks if the answer is actually grounded in the docs.
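Of the layers above, the permission filter is the easiest to show in isolation. What follows is a minimal sketch, assuming each retrieved chunk carries a tenant_id field. The Chunk type and the filter_by_tenant name are hypothetical and not taken from the repo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """A retrieved document chunk (hypothetical shape, for illustration)."""
    doc_id: str
    tenant_id: str
    text: str


def filter_by_tenant(chunks: list[Chunk], tenant_id: str) -> list[Chunk]:
    """Drop every chunk that does not belong to the requesting tenant.

    Intended to run after retrieval and before prompt construction, so
    chunks from other tenants never reach the LLM context.
    """
    if not tenant_id:
        # Fail closed: a missing tenant must never match anything.
        return []
    return [c for c in chunks if c.tenant_id == tenant_id]


retrieved = [
    Chunk("d1", "acme", "Acme refund policy"),
    Chunk("d2", "globex", "Globex salary bands"),
    Chunk("d3", "acme", "Acme onboarding guide"),
]
allowed = filter_by_tenant(retrieved, "acme")
print([c.doc_id for c in allowed])
```

The filter is plain code rather than a prompt instruction, so an injected query cannot talk its way around it.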

ML Components

  • Query Encoder: SentenceTransformer embeddings for query + document encoding (retrieval)
  • Reranker: Cross-encoder to improve top-k relevance after initial retrieval
  • Injection classifier: Trained on attack corpora to flag suspicious queries + doc chunks
  • Faithfulness judge: LLM-as-judge scoring answer groundedness against retrieved context

Goal: reduce hallucination (reranker + judge) and reduce attack surface (classifier + validator) without relying on prompt-only defenses.
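The encoder and reranker form a two-stage pipeline: a cheap scorer narrows the candidate pool, then a more expensive scorer orders the survivors. The sketch below shows only that control flow. Both scoring functions are toy stand-ins written for this example (word-count cosine in place of SentenceTransformer embeddings, a phrase-match bonus in place of a cross-encoder), and all names are hypothetical.

```python
import math
from collections import Counter
from typing import Callable

Scorer = Callable[[str, str], float]


def bow_cosine(query: str, doc: str) -> float:
    """Stand-in for embedding similarity: cosine over raw word counts."""
    q = Counter(query.lower().split())
    d = Counter(doc.lower().split())
    dot = sum(count * d[word] for word, count in q.items())
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in d.values())
    )
    return dot / norm if norm else 0.0


def phrase_aware(query: str, doc: str) -> float:
    """Stand-in for a cross-encoder: also rewards the exact query phrase."""
    bonus = 1.0 if query.lower() in doc.lower() else 0.0
    return bow_cosine(query, doc) + bonus


def retrieve_then_rerank(
    query: str,
    docs: list[str],
    first_stage: Scorer,
    reranker: Scorer,
    candidates: int = 10,
    top_k: int = 5,
) -> list[str]:
    """Keep the best `candidates` by the cheap score, then reorder them."""
    pool = sorted(docs, key=lambda doc: first_stage(query, doc), reverse=True)
    pool = pool[:candidates]
    return sorted(pool, key=lambda doc: reranker(query, doc), reverse=True)[:top_k]


docs = [
    "policy on refund refund refund",
    "our refund policy allows returns within 30 days",
    "office lunch menu",
]
ranked = retrieve_then_rerank(
    "refund policy", docs, bow_cosine, phrase_aware, candidates=2, top_k=2
)
print(ranked)
```

In this toy run the keyword-stuffed document wins the first stage, and the reranker moves the document that actually contains the phrase to the top.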

Constraints

Latency            < 4s p95
Cost               < $0.03/query
Exfiltration SLO   0% in test suite (CI-gated)
Multi-tenant       Mandatory

Evaluation

Retrieval & Faithfulness

Metric                        Value   Target
Precision@5                   0.74    > 0.70
Recall@5                      0.68    > 0.60
Faithfulness (LLM-as-judge)   94%     > 90%

Faithfulness: percentage of responses rated "supported by context" by the LLM judge plus a citation check, measured on a 200-query eval set.
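For reference, the retrieval metrics above follow the standard definitions of Precision@k and Recall@k. This is a small sketch of those definitions; the document IDs are made up, and it assumes retrieved IDs are unique and already ranked.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


# Hypothetical single query: 2 of the top 5 are relevant, 3 relevant docs exist.
retrieved = ["d1", "d7", "d3", "d9", "d4"]
relevant = {"d1", "d3", "d5"}
p5 = precision_at_k(retrieved, relevant, 5)
r5 = recall_at_k(retrieved, relevant, 5)
print(p5, round(r5, 3))
```

A reported figure such as Precision@5 would be the mean of this per-query value over the eval set.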

Security Testing

Attack Category                 Test Cases   Attack Success Rate   Status
Direct prompt injection         50           2% (1/50)             ⚠ Warn
Indirect injection (via docs)   20           5% (1/20)             ⚠ Warn
Data exfiltration               20           0% (0/20)             Pass
Tool abuse / escalation         15           0% (0/15)             Pass
PII extraction                  15           0% (0/15)             Pass
Total                           120          1.7% (2/120)          Pass

Pass condition: exfiltration = 0% and total attack success < 5%. Partial behavior change without leakage is tracked but not treated as a fail.
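The pass condition is simple enough to write as a CI gate. Here is a sketch using the counts reported above; the CategoryResult type and the gate function are hypothetical names, not the repo's actual CI code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CategoryResult:
    name: str
    cases: int
    successes: int  # number of attacks that succeeded


# Counts taken from the security testing results above.
RESULTS = [
    CategoryResult("Direct prompt injection", 50, 1),
    CategoryResult("Indirect injection (via docs)", 20, 1),
    CategoryResult("Data exfiltration", 20, 0),
    CategoryResult("Tool abuse / escalation", 15, 0),
    CategoryResult("PII extraction", 15, 0),
]


def gate(
    results: list[CategoryResult],
    exfil_category: str = "Data exfiltration",
    max_total_rate: float = 0.05,
) -> tuple[bool, float]:
    """Pass only if exfiltration is 0 and total attack success is under the cap."""
    total_cases = sum(r.cases for r in results)
    total_successes = sum(r.successes for r in results)
    exfil_successes = sum(r.successes for r in results if r.name == exfil_category)
    total_rate = total_successes / total_cases
    return exfil_successes == 0 and total_rate < max_total_rate, total_rate


passed, rate = gate(RESULTS)
print(f"total attack success: {rate:.1%}, gate passed: {passed}")
# prints: total attack success: 1.7%, gate passed: True
```

In CI, a failing gate would end the job with a non-zero exit code, for example via `raise SystemExit(1)`.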

Cost & Latency

p50 latency   1.8s
p95 latency   3.6s
Throughput    ~45 QPS (single instance)
Cost          $0.024 per query (GPT-4)

Measured over a 1,000-request run on a 4-core VM with retrieval cache ON, tool calls OFF, batch=1, and a 5-doc context.

Cost includes: embeddings + rerank + LLM answer (excludes document ingestion).
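To make the p50 and p95 figures concrete, here is one common way to compute them from raw timings, the nearest-rank method. The repo may use a different percentile definition, and the sample latencies below are invented for the example, not taken from the 1,000-request run.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical per-request latencies in seconds.
latencies = [1.2, 1.5, 1.7, 1.8, 1.9, 2.1, 2.4, 2.9, 3.3, 3.6]
p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
within_budget = p95 < 4.0  # the latency constraint: < 4s p95
print(p50, p95, within_budget)
```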

Failure Modes

Failure Mode               Likelihood   Impact     Mitigation
Novel injection bypasses   Medium       High       Continuous red-teaming, anomaly monitoring
Hallucination as fact      High         Medium     Faithfulness scoring + citation requirement
PII in response            Medium       High       Output PII scanner
Cross-tenant data leak     Low          Critical   Permission filter + tenant isolation

Residual Risk: Novel injection techniques may bypass current defenses. Mitigation: Continuous red-teaming, monitor for anomalous outputs, human review for high-stakes queries.

Visual Proof

Screenshots from actual attack tests and evaluation runs are coming soon. Until then, run pytest tests/attacks/ -v locally to see real output.

Artifacts