Watch every prompt.
Route across every model.
Judge every request.
WARDEN is a self-hosted multi-tenant LLM gateway. Adaptive routing across providers, multi-agent prompt-injection defense, semantic caching, and a closed-loop cost optimizer — so your platform team owns the inference layer instead of renting it.
Problem
Production LLM stacks face three intersecting problems today.
Each one alone would justify a tool. WARDEN ships one self-hostable artifact that attacks all three — without renting another control plane.
Multi-provider cost waste
Production LLM stacks run 3–8 providers for cost, latency, capability, and redundancy. Static gateways like LiteLLM leave 30–50% of the savings on the table; managed SaaS (Portkey, OpenRouter) violates data-residency rules in regulated industries.
OWASP LLM Top 10 + LiteLLM benchmarks
Prompt injection is OWASP LLM-01
The #1 threat in the OWASP LLM Top-10. Available defenses are single-model classifiers with high false-positive rates or monolithic guardrails with prohibitive latency budgets. Defense-in-depth via cascades has been proposed in research but not packaged as production middleware.
OWASP LLM Top-10 #1
No closed control loop
No open-source gateway learns from its own observability data. Operators hand-tune routing rules; rules drift; nobody re-tunes. WARDEN runs a nightly LangGraph agent that mines traffic and opens routing-policy PRs for human review.
Closed-loop optimization, human-in-the-loop
How it works
The request lifecycle.
Every request runs through the same five stages. Defense lands before cache so blocked prompts never get a cached response; the cache lands before routing so hits never burn an LLM call.
Receive
OpenAI-compatible /v1/chat/completions. Bearer-token auth, W3C traceparent, OTel.
Defense cascade
Classifier → critic → judge. ≥80% of traffic terminates at the classifier in <50ms.
Semantic cache
pgvector HNSW + tenant-scoped Caffeine fast path. p99 lookup <30ms.
Adaptive router
YAML policy DSL + observed health. Resilience4j circuit breakers per provider.
Provider
Anthropic / OpenAI / Bedrock / Ollama / Mistral. Adapter swap-out is config-only.
Telemetry exports to OpenTelemetry + Prometheus + LangSmith. A nightly LangGraph agent ingests the last 24h of traces and opens a routing-policy diff PR — closing the control loop without putting an agent in the synchronous path.
Benchmarks
Three numbers the test suite reproduces.
Every figure on this page corresponds to a named test you can run from a fresh clone. No marketing-grade hand-waving.
0.833 recall · 0.909 F1 · p99 1ms
M5 PromptInjectionBench eval (30 inj + 30 benign)
Synthetic 24h workload · 4.5× over the 20% bar
M8 spec done-bullet test
10k-prompt corpus · 15× under the 30ms budget
M4 SemanticCacheMicrobench
make verify && make bench-defense && make cost-optimizer-testQuickstart
Two entry points. Both done in under five minutes.
Either path takes four bash lines from a fresh clone. The local profile uses Ollama for critic + judge so there's no cloud bill at all.
Fully offline via Ollama. No API key required. Recommended for a 5-minute first look.
# 1) One-time prerequisites
brew install openjdk@25 maven docker ollama k6
# 2) Bring up the stack (~3 min on M-series Mac)
make demo-local
# 3) Fire mixed traffic + watch the dashboard
make load
open http://localhost:3000Need more flags? Run make help for 35+ operator-facing targets.
Paper
WARDEN: Adaptive Routing and Multi-Agent Prompt-Injection Defense for Self-Hosted Multi-Tenant LLM Inference.
A tech report — not a novel-algorithm paper. The contribution is the integrated system design: how to compose a defense cascade, a semantic cache, an adaptive router, and a closed-loop optimizer into one self-hostable artifact, with honest precision/recall/latency numbers across four cost-tier profiles.
- 10–12 pages, ACM sigconf template, arXiv (cs.CR or cs.DC)
- Reports cascade vs single-model baseline on PromptInjectionBench across the local, staging, demo profiles
- Cost-optimizer evaluation: replay savings on synthetic + traced production-shape workloads
- Honest limitations section: no formal proof, human-in-the-loop policy changes
6.3 Defense efficacy (heuristic baseline)
On a curated 60-prompt PromptInjectionBench subset spanning OWASP LLM-01 attack families, the heuristic classifier achieves P=1.000, R=0.833, F1=0.909 at p99 1ms — sufficient on its own for ≥83% of traffic. The five missed injections (subtle persona-hijack variants) are picked up by the critic + judge tiers when the cascade is wired end-to-end…
Architecture
Ten modules. Each one independently testable.
Spec §6 breaks WARDEN into ten modules with explicit "Done means" criteria. Each one ships green CI with a named bench or contract test.
Core Gateway API
M1OpenAI-compatible /v1/chat/completions, SSE streaming, OTel traceparent.
Provider adapters
M2Anthropic · OpenAI · Ollama · Mistral · Bedrock — config-only swap.
Routing engine
M3YAML DSL, live-reload, Resilience4j circuit breakers per provider.
Semantic cache
M4pgvector HNSW + Caffeine fast path, tenant-scoped.
Defense cascade
M5Classifier → critic → judge. Runs BEFORE cache.
Rate limit + tenants
M6Redis token bucket (Lua-atomic) + Postgres tenants, admin REST CRUD.
Observability
M7Prometheus + OTel + LangSmith + Grafana dashboard JSON.
Cost optimizer
M8LangGraph sidecar that opens routing-policy PRs.
Admin dashboard
M9Next.js 15 + shadcn/ui. 5 spec pages. Pure light theme.
Demo harness
M10docker-compose + k6 + Toxiproxy + scripted scenarios.