SecureRAG
A RAG system that I actually tried to break. Most tutorials skip security entirely.
Interactive Demo
This demo is simulated: it shows how the security logic responds, but it runs in your browser, not against a real API. Clone the repo to run it for real.
Quick Start
git clone https://github.com/croesus245/securerag-defense-in-depth && cd securerag-defense-in-depth
pip install -r requirements.txt
python -m src.api # Start API server
pytest tests/attacks/ -v # Run 120+ attack tests
Problem
RAG systems are often deployed without adversarial testing. Prompt injection can leak documents, exfiltrate data, or abuse tools. Most portfolios show a "RAG chatbot" with zero security.
Production RAG needs permission models, output validation, and attack test suites, not just retrieval metrics.
Architecture
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  User Query  │───▶│    Input     │───▶│    Query     │
│              │    │  Sanitizer   │    │   Encoder    │
└──────────────┘    └──────────────┘    └──────────────┘
                                               │
                           ┌───────────────────┘
                           ▼
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Document   │◀───│  Retriever   │───▶│  Permission  │
│    Store     │    │ (Vector DB)  │    │    Filter    │
│ (per-tenant) │    └──────────────┘    └──────────────┘
└──────────────┘                               │
                           ┌───────────────────┘
                           ▼
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│     LLM      │◀───│    Prompt    │───▶│     Tool     │
│ (GPT-4/Local)│    │ Constructor  │    │   Executor   │
└──────────────┘    └──────────────┘    │ (sandboxed)  │
       │                                └──────────────┘
       ▼
┌──────────────┐    ┌──────────────┐
│    Output    │───▶│   Response   │
│  Validator   │    │  (to user)   │
└──────────────┘    └──────────────┘
Security Layers
The idea: don't trust the LLM. Put checks around it.
- Input check: Classifier tries to catch injection attempts before they hit the LLM. Doesn't catch everything.
- Permission filter: Retrieved docs are filtered by tenant_id. The LLM never sees docs you shouldn't access.
- Document trust: Instructions inside docs are ignored. Docs are data, not commands.
- Tool sandbox: Tools are allowlisted. No shell access, strict schemas, rate limits.
- Output validator: Scans for PII leakage and checks if the answer is actually grounded in the docs.
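To make two of these layers concrete, here is a minimal sketch of a tenant filter and a tool allowlist. It is an illustration, not the repo's implementation: the `Chunk` shape, the `ALLOWED_TOOLS` table, and both function names are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """A retrieved document chunk (hypothetical shape, not the repo's model)."""
    doc_id: str
    tenant_id: str
    text: str


def filter_by_tenant(chunks: list[Chunk], tenant_id: str) -> list[Chunk]:
    """Keep only chunks owned by the caller's tenant.

    This runs before prompt construction, so chunks from other
    tenants never reach the LLM.
    """
    return [c for c in chunks if c.tenant_id == tenant_id]


# Hypothetical allowlist: tool name -> argument names the tool accepts.
ALLOWED_TOOLS: dict[str, set[str]] = {
    "search_docs": {"query"},
    "get_doc": {"doc_id"},
}


def is_tool_call_allowed(name: str, args: dict) -> bool:
    """Reject unknown tools and any call carrying unexpected arguments."""
    allowed_args = ALLOWED_TOOLS.get(name)
    return allowed_args is not None and set(args) <= allowed_args
```

Both checks are plain code that runs outside the model, so a successful prompt injection cannot talk its way past them. A real deployment would also validate argument types and apply rate limits, which this sketch leaves out.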
ML Components
- Query Encoder: SentenceTransformer embeddings for query + document encoding (retrieval)
- Reranker: Cross-encoder to improve top-k relevance after initial retrieval
- Injection classifier: Trained on attack corpora to flag suspicious queries + doc chunks
- Faithfulness judge: LLM-as-judge scoring answer groundedness against retrieved context
Goal: reduce hallucination (reranker + judge) and reduce attack surface (classifier + validator) without relying on prompt-only defenses.
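The encoder and reranker form a two-stage pipeline: a cheap score narrows the corpus, then a more expensive score reorders the survivors. The sketch below shows only that control flow. The scorers are passed in as plain callables, and the toy scorers in the usage note are stand-ins for the embedding similarity and cross-encoder score. The function name and default cutoffs are assumptions.

```python
from __future__ import annotations

from typing import Callable, Sequence

Scorer = Callable[[str, str], float]  # (query, doc) -> relevance score


def retrieve_then_rerank(
    query: str,
    docs: Sequence[str],
    first_stage_score: Scorer,
    rerank_score: Scorer,
    k_initial: int = 20,
    k_final: int = 5,
) -> list[str]:
    """Two-stage retrieval: broad cheap pass, then a narrow expensive pass."""
    # Stage 1: score every document with the cheap scorer, keep the top k_initial.
    candidates = sorted(
        docs, key=lambda d: first_stage_score(query, d), reverse=True
    )[:k_initial]
    # Stage 2: rescore only the survivors with the expensive scorer.
    return sorted(
        candidates, key=lambda d: rerank_score(query, d), reverse=True
    )[:k_final]


def word_overlap(query: str, doc: str) -> float:
    """Toy first-stage scorer: number of shared lowercase words."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def overlap_density(query: str, doc: str) -> float:
    """Toy reranker: shared words divided by document length in words."""
    words = doc.lower().split()
    return word_overlap(query, doc) / len(words) if words else 0.0
```

Usage: `retrieve_then_rerank("reset password", corpus, word_overlap, overlap_density, k_initial=2, k_final=1)` returns the single best document from the two that survived the first stage.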
Constraints
Evaluation
Retrieval & Faithfulness
| Metric | Value | Target |
|---|---|---|
| Precision@5 | 0.74 | > 0.70 |
| Recall@5 | 0.68 | > 0.60 |
| Faithfulness (LLM-as-judge) | 94% | > 90% |
Faithfulness: the percentage of responses rated "supported by context" by the LLM judge and confirmed by a citation check, measured on a 200-query eval set.
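For reference, the retrieval metrics in the table can be computed per query as sketched below, then averaged over the eval set. This assumes the common convention of dividing precision by k. The function names and document IDs are illustrative, not taken from the repo.

```python
from __future__ import annotations


def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved IDs that are relevant (denominator is k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of all relevant IDs that appear in the top-k retrieved IDs."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0  # convention chosen for this sketch: no relevant docs -> 0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


# Example: 2 of the top 5 are relevant, and 2 of the 3 relevant docs were found.
retrieved_ids = ["d1", "d2", "d3", "d4", "d5"]
relevant_ids = {"d1", "d3", "d9"}
p5 = precision_at_k(retrieved_ids, relevant_ids)  # 2 / 5
r5 = recall_at_k(retrieved_ids, relevant_ids)     # 2 / 3
```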
Security Testing
| Attack Category | Test Cases | Attack Success Rate | Status |
|---|---|---|---|
| Direct prompt injection | 50 | 2% (1/50) | ⚠ Warn |
| Indirect injection (via docs) | 20 | 5% (1/20) | ⚠ Warn |
| Data exfiltration | 20 | 0% (0/20) | Pass |
| Tool abuse / escalation | 15 | 0% (0/15) | Pass |
| PII extraction | 15 | 0% (0/15) | Pass |
| Total | 120 | 1.7% (2/120) | Pass |
Pass condition: exfiltration = 0% and total attack success < 5%. Partial behavior change without leakage is tracked but not treated as a fail.
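The pass condition is simple enough to encode as a gate, for example in CI. The sketch below applies it to the counts from the table above. `CategoryResult` and `suite_passes` are hypothetical names, not the repo's API.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CategoryResult:
    name: str
    cases: int
    successes: int  # number of attacks that succeeded


# Counts copied from the Security Testing table.
RESULTS = [
    CategoryResult("Direct prompt injection", 50, 1),
    CategoryResult("Indirect injection (via docs)", 20, 1),
    CategoryResult("Data exfiltration", 20, 0),
    CategoryResult("Tool abuse / escalation", 15, 0),
    CategoryResult("PII extraction", 15, 0),
]


def suite_passes(
    results: list[CategoryResult],
    exfil_name: str = "Data exfiltration",
    max_total_rate: float = 0.05,
) -> bool:
    """Pass only if exfiltration has zero successes and the total rate is under 5%."""
    total_cases = sum(r.cases for r in results)
    if total_cases == 0:
        raise ValueError("no test cases recorded")
    total_successes = sum(r.successes for r in results)
    exfil_successes = sum(r.successes for r in results if r.name == exfil_name)
    return exfil_successes == 0 and total_successes / total_cases < max_total_rate
```

With these counts the total is 2 successes in 120 cases, about 1.7%, and exfiltration has none, so the suite passes. A single successful exfiltration would fail it regardless of the total rate.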
Cost & Latency
Measured with retrieval cache ON, tool calls OFF, batch=1, and a 5-doc context, over a 1,000-request run on a 4-core VM.
Cost includes: embeddings + rerank + LLM answer (excludes document ingestion).
Failure Modes
| Failure Mode | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Novel injection bypasses | Medium | High | Continuous red-teaming, anomaly monitoring |
| Hallucination as fact | High | Medium | Faithfulness scoring + citation requirement |
| PII in response | Medium | High | Output PII scanner |
| Cross-tenant data leak | Low | Critical | Permission filter + tenant isolation |
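As an illustration of the output PII scanner mitigation, here is a minimal regex-based sketch. The two patterns (email address and US Social Security number format) are assumptions chosen for brevity. They will miss many PII formats and are not the repo's actual rules.

```python
from __future__ import annotations

import re

# Illustrative patterns only; a production scanner needs much broader coverage.
PII_PATTERNS: dict[str, re.Pattern[str]] = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_pii(text: str) -> dict[str, list[str]]:
    """Return every match per PII label; an empty dict means nothing was found."""
    found: dict[str, list[str]] = {}
    for label, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[label] = matches
    return found


def redact_pii(text: str) -> str:
    """Replace each match with a labeled placeholder before returning the answer."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED {label.upper()}]", text)
    return text
```

Usage: run `find_pii` on the LLM answer before it reaches the user, then either block the response or pass it through `redact_pii`, depending on policy.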
Visual Proof
Screenshots from actual attack tests and evaluation runs are planned for:
- pytest tests/attacks/
- docs/attack_report.md
- docs/eval_dashboard.md
Screenshots coming soon. Until then, run pytest tests/attacks/ -v locally to see real output.