LLM/GenAI

SecureRAG

A RAG system where I actually tried to break it. Most tutorials skip security entirely.

Attack Tests Faithfulness CI-Eval Cost Reproducible

Note: This runs locally with synthetic documents. I wrote 120+ attack tests myself—they test my own code, not a production system. Some attacks still work.

View Repository → Try Demo Attack Report Eval Dashboard

Interactive Demo

This is fake — it shows how the security logic responds, but it's running in your browser, not a real API. Clone the repo to run it for real.

SecureRAG Chat

Security: Active

[AI]

Hello! I'm SecureRAG. I can answer questions about your documents with built-in security. Try asking a question, or test my defenses with an injection attempt.

Try these examples:

Security Log SAFE

Quick Start

git clone https://github.com/croesus245/securerag-defense-in-depth && cd securerag-defense-in-depth
pip install -r requirements.txt
python -m src.api                    # Start API server
pytest tests/attacks/ -v             # Run 120+ attack tests

Problem

RAG systems are deployed without adversarial testing. Prompt injection can leak documents, exfiltrate data, or abuse tools. Most portfolios show "RAG chatbot" with zero security.

Production RAG needs permission models, output validation, and attack test suites—not just retrieval metrics.

Architecture

┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ User Query   │───▶│ Input        │───▶│ Query        │
│              │    │ Sanitizer    │    │ Encoder      │
└──────────────┘    └──────────────┘    └──────────────┘
                                               │
                           ┌───────────────────┘
                           ▼
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ Document     │◀───│ Retriever    │───▶│ Permission   │
│ Store        │    │ (Vector DB)  │    │ Filter       │
│ (per-tenant) │    └──────────────┘    └──────────────┘
└──────────────┘                               │
                           ┌───────────────────┘
                           ▼
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ LLM          │◀───│ Prompt       │───▶│ Tool         │
│ (GPT-4/Local)│    │ Constructor  │    │ Executor     │
└──────────────┘    └──────────────┘    │ (sandboxed)  │
       │                                └──────────────┘
       ▼
┌──────────────┐    ┌──────────────┐
│ Output       │───▶│ Response     │
│ Validator    │    │ (to user)    │
└──────────────┘    └──────────────┘

Security Layers

The idea: don't trust the LLM. Put checks around it.

Input check: Classifier tries to catch injection attempts before they hit the LLM. Doesn't catch everything.
Permission filter: Retrieved docs are filtered by tenant_id. The LLM never sees docs you shouldn't access.
Document trust: Instructions inside docs are ignored. Docs are data, not commands.
Tool sandbox: Tools are allowlisted. No shell access, strict schemas, rate limits.
Output validator: Scans for PII leakage and checks if the answer is actually grounded in the docs.

ML Components

Query Encoder: SentenceTransformer embeddings for query + document encoding (retrieval)
Reranker: Cross-encoder to improve top-k relevance after initial retrieval
Injection classifier: Trained on attack corpora to flag suspicious queries + doc chunks
Faithfulness judge: LLM-as-judge scoring answer groundedness against retrieved context

Goal: reduce hallucination (reranker + judge) and reduce attack surface (classifier + validator) without relying on prompt-only defenses.

Constraints

Latency < 4s p95

Cost < $0.03/query

Exfiltration SLO 0% in test suite (CI-gated)

Multi-tenant Mandatory

Evaluation

Retrieval & Faithfulness

Metric	Value	Target
Precision@5	0.74	> 0.70
Recall@5	0.68	> 0.60
Faithfulness (LLM-as-judge)	94%	> 90%

Faithfulness: % of responses rated "supported by context" by LLM-judge + citation check on 200-query eval set.

Security Testing

Attack Category	Test Cases	Success Rate	Status
Direct prompt injection	50	2% (1/50)	⚠ Warn
Indirect injection (via docs)	20	5% (1/20)	⚠ Warn
Data exfiltration	20	0% (0/20)	Pass
Tool abuse / escalation	15	0% (0/15)	Pass
PII extraction	15	0% (0/15)	Pass
Total	120	1.7%	Pass

Pass condition: exfiltration = 0% and total attack success < 5%. Partial behavior change without leakage is tracked but not treated as a fail.

Cost & Latency

1.8s p50 Latency

3.6s p95 Latency

~45 QPS (single instance)

$0.024 per query (GPT-4)

Measured with retrieval cache ON, tool calls OFF, batch=1, 5-doc context, 1,000 request run on 4-core VM.

Cost includes: embeddings + rerank + LLM answer (excludes document ingestion).

Failure Modes

Failure Mode	Likelihood	Impact	Mitigation
Novel injection bypasses	Medium	High	Continuous red-teaming, anomaly monitoring
Hallucination as fact	High	Medium	Faithfulness scoring + citation requirement
PII in response	Medium	High	Output PII scanner
Cross-tenant data leak	Low	Critical	Permission filter + tenant isolation

Residual Risk: Novel injection techniques may bypass current defenses. Mitigation: Continuous red-teaming, monitor for anomalous outputs, human review for high-stakes queries.

Visual Proof

Screenshots from actual attack tests and evaluation runs.

[demo] Attack Tests Running pytest tests/attacks/

GIF: Injection/exfil tests running with pass/fail output

[chart] Attack Report Summary docs/attack_report.md

Attack category breakdown: 120 tests, 0% exfil

[graph] Eval Dashboard docs/eval_dashboard.md

Faithfulness + retrieval metrics per query type

Run pytest tests/attacks/ -v locally to see real output. Screenshots coming soon.

Artifacts

[repo] Repository [sec] Attack Report [eval] Eval Dashboard [model] Security Model [ir] Incident Response [cost] Cost Report