WARDEN
Open source · Self-hosted · OWASP LLM-aware

Watch every prompt. Route across every model. Judge every request.

WARDEN is a self-hosted multi-tenant LLM gateway. Adaptive routing across providers, multi-agent prompt-injection defense, semantic caching, and a closed-loop cost optimizer — so your platform team owns the inference layer instead of renting it.

Java 25 · Spring AI · LangGraph · pgvector · MIT License

Problem

Production LLM stacks face three intersecting problems today.

Each one alone would justify a tool. WARDEN ships one self-hostable artifact that attacks all three — without renting another control plane.

01 / 03

Multi-provider cost waste

Production LLM stacks run 3–8 providers for cost, latency, capability, and redundancy. Static gateways like LiteLLM leave 30–50% of the savings on the table; managed SaaS (Portkey, OpenRouter) can violate data-residency rules in regulated industries.

OWASP LLM Top 10 + LiteLLM benchmarks

02 / 03

Prompt injection is OWASP LLM-01

Prompt injection is the #1 threat in the OWASP LLM Top-10. Available defenses are either single-model classifiers with high false-positive rates or monolithic guardrails that add prohibitive latency. Defense-in-depth via cascades has been proposed in research but not packaged as production middleware.

OWASP LLM Top-10 #1

03 / 03

No closed control loop

No open-source gateway learns from its own observability data. Operators hand-tune routing rules; rules drift; nobody re-tunes. WARDEN runs a nightly LangGraph agent that mines traffic and opens routing-policy PRs for human review.

Closed-loop optimization, human-in-the-loop

How it works

The request lifecycle.

Every request runs through the same five stages. Defense lands before cache so blocked prompts never get a cached response; the cache lands before routing so hits never burn an LLM call.
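The ordering argument above can be sketched in a few lines of Java. This is a minimal illustration, not WARDEN's code: the class name LifecycleSketch, the exact-match HashMap standing in for the semantic cache, and the single provider function standing in for the router and provider stages are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;
import java.util.function.Predicate;

/** Hypothetical sketch of the stage ordering only; not WARDEN's actual classes. */
final class LifecycleSketch {
    private final Predicate<String> defense;          // returns true when the prompt is allowed
    private final Function<String, String> provider;  // stand-in for router + provider call
    // Exact-match map used as a stand-in for the semantic cache.
    private final Map<String, String> cache = new HashMap<>();
    int providerCalls = 0;

    LifecycleSketch(Predicate<String> defense, Function<String, String> provider) {
        this.defense = defense;
        this.provider = provider;
    }

    /** Returns the response, or empty when the defense stage blocks the prompt. */
    Optional<String> handle(String prompt) {
        // Defense runs first, so a blocked prompt never reaches the cache.
        if (!defense.test(prompt)) {
            return Optional.empty();
        }
        // Cache runs before routing, so a hit costs no provider call.
        String cached = cache.get(prompt);
        if (cached != null) {
            return Optional.of(cached);
        }
        // Only a cache miss reaches the router and the provider.
        providerCalls++;
        String answer = provider.apply(prompt);
        cache.put(prompt, answer);
        return Optional.of(answer);
    }
}
```

In this sketch, sending the same allowed prompt twice increments providerCalls once, and a blocked prompt leaves both the cache and the counter untouched.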

01

Receive

OpenAI-compatible /v1/chat/completions. Bearer-token auth, W3C traceparent, OTel.

02

Defense cascade

Classifier → critic → judge. ≥80% of traffic terminates at the classifier in <50ms.

03

Semantic cache

pgvector HNSW + tenant-scoped Caffeine fast path. p99 lookup <30ms.

04

Adaptive router

YAML policy DSL + observed health. Resilience4j circuit breakers per provider.

05

Provider

Anthropic / OpenAI / Bedrock / Ollama / Mistral. Adapter swap-out is config-only.
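The classifier → critic → judge escalation in stage 02 might look roughly like the sketch below. Everything here is an assumption for illustration: the Tier record, the score thresholds, and the fail-closed behaviour of the final tier are not taken from WARDEN's source. The idea is that each tier decides when it is confident and escalates only when its score falls in an uncertain band, which is how most traffic can terminate at the cheapest tier.

```java
import java.util.List;
import java.util.function.ToDoubleFunction;

/** Hypothetical cascade sketch; names and thresholds are illustrative assumptions. */
final class CascadeSketch {
    enum Verdict { ALLOW, BLOCK }

    /** A tier scores a prompt in [0, 1] (higher = more likely injection). */
    record Tier(String name, ToDoubleFunction<String> scorer,
                double allowBelow, double blockAbove) {}

    record Decision(Verdict verdict, String decidedBy) {}

    static Decision evaluate(List<Tier> tiers, String prompt) {
        for (int i = 0; i < tiers.size(); i++) {
            Tier tier = tiers.get(i);
            double score = tier.scorer().applyAsDouble(prompt);
            boolean lastTier = i == tiers.size() - 1;
            if (score < tier.allowBelow()) {
                return new Decision(Verdict.ALLOW, tier.name());
            }
            // Assumption: the final tier fails closed when it is still uncertain.
            if (score >= tier.blockAbove() || lastTier) {
                return new Decision(Verdict.BLOCK, tier.name());
            }
            // Score is in the uncertain band: escalate to the next, costlier tier.
        }
        throw new IllegalArgumentException("at least one tier is required");
    }
}
```

A cheap heuristic would sit first with a wide uncertain band, and slower model-backed tiers would follow with narrower ones; the decidedBy field makes it easy to measure what share of traffic each tier terminates.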

Telemetry exports to OpenTelemetry + Prometheus + LangSmith. A nightly LangGraph agent ingests the last 24h of traces and opens a routing-policy diff PR — closing the control loop without putting an agent in the synchronous path.

Benchmarks

Three numbers the test suite reproduces.

Every figure on this page corresponds to a named test you can run from a fresh clone. No marketing-grade hand-waving.

Defense classifier
1.000 precision

0.833 recall · 0.909 F1 · p99 1ms

M5 PromptInjectionBench eval (30 inj + 30 benign)

Cost optimizer (replay)
89.7% savings

Synthetic 24h workload · 4.5× over the 20% bar

M8 spec done-bullet test

Semantic cache p99
2ms

10k-prompt corpus · 15× under the 30ms budget

M4 SemanticCacheMicrobench

make verify && make bench-defense && make cost-optimizer-test

Quickstart

Two entry points. Both done in under five minutes.

Either path takes four bash lines from a fresh clone. The local profile uses Ollama for critic + judge so there's no cloud bill at all.

Fully offline via Ollama. No API key required. Recommended for a 5-minute first look.

$0 / month
# 1) One-time prerequisites
brew install openjdk@25 maven docker ollama k6

# 2) Bring up the stack (~3 min on M-series Mac)
make demo-local

# 3) Fire mixed traffic + watch the dashboard
make load
open http://localhost:3000

Need more flags? Run make help for 35+ operator-facing targets.

Paper

WARDEN: Adaptive Routing and Multi-Agent Prompt-Injection Defense for Self-Hosted Multi-Tenant LLM Inference.

A tech report — not a novel-algorithm paper. The contribution is the integrated system design: how to compose a defense cascade, a semantic cache, an adaptive router, and a closed-loop optimizer into one self-hostable artifact, with honest precision/recall/latency numbers across four cost-tier profiles.

  • 10–12 pages, ACM sigconf template, arXiv (cs.CR or cs.DC)
  • Reports cascade vs single-model baseline on PromptInjectionBench across the local, staging, and demo profiles
  • Cost-optimizer evaluation: replay savings on synthetic + traced production-shape workloads
  • Honest limitations section: no formal proof, human-in-the-loop policy changes
Paper.pdf · draft · §6.3

6.3 Defense efficacy (heuristic baseline)

On a curated 60-prompt PromptInjectionBench subset spanning OWASP LLM-01 attack families, the heuristic classifier achieves P=1.000, R=0.833, F1=0.909 at p99 1ms, catching 25 of the 30 injections (83%) on its own with no false positives. The five missed injections (subtle persona-hijack variants) are picked up by the critic + judge tiers when the cascade is wired end-to-end…

TP 25 · FP 0 · FN 5
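The headline precision, recall, and F1 figures follow directly from the TP, FP, and FN counts above. The short check below recomputes them; the class name ConfusionMetrics is a hypothetical helper for this page, not part of the WARDEN test suite.

```java
import java.util.Locale;

/** Hypothetical helper: recomputes the reported metrics from the confusion counts. */
final class ConfusionMetrics {
    /** Precision = TP / (TP + FP). */
    static double precision(int tp, int fp) {
        return tp / (double) (tp + fp);
    }

    /** Recall = TP / (TP + FN). */
    static double recall(int tp, int fn) {
        return tp / (double) (tp + fn);
    }

    /** F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall. */
    static double f1(int tp, int fp, int fn) {
        return 2.0 * tp / (2.0 * tp + fp + fn);
    }

    public static void main(String[] args) {
        int tp = 25, fp = 0, fn = 5;
        // prints "P=1.000 R=0.833 F1=0.909"
        System.out.println(String.format(Locale.ROOT, "P=%.3f R=%.3f F1=%.3f",
                precision(tp, fp), recall(tp, fn), f1(tp, fp, fn)));
    }
}
```

With 30 injections and 30 benign prompts, zero false positives gives precision 25/25, five misses give recall 25/30, and F1 is 50/55.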

Architecture

Ten modules. Each one independently testable.

Spec §6 breaks WARDEN into ten modules with explicit "Done means" criteria. Each one ships green CI with a named bench or contract test.

Core Gateway API

M1

OpenAI-compatible /v1/chat/completions, SSE streaming, OTel traceparent.

Provider adapters

M2

Anthropic · OpenAI · Ollama · Mistral · Bedrock — config-only swap.

Routing engine

M3

YAML DSL, live-reload, Resilience4j circuit breakers per provider.

Semantic cache

M4

pgvector HNSW + Caffeine fast path, tenant-scoped.

Defense cascade

M5

Classifier → critic → judge. Runs BEFORE cache.

Rate limit + tenants

M6

Redis token bucket (Lua-atomic) + Postgres tenants, admin REST CRUD.

Observability

M7

Prometheus + OTel + LangSmith + Grafana dashboard JSON.

Cost optimizer

M8

LangGraph sidecar that opens routing-policy PRs.

Admin dashboard

M9

Next.js 15 + shadcn/ui. 5 spec pages. Pure light theme.

Demo harness

M10

docker-compose + k6 + Toxiproxy + scripted scenarios.
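Of the modules above, M6's token bucket is the easiest to show in isolation. The sketch below is a single-process, in-memory version of the general token-bucket algorithm, written only to illustrate the idea; WARDEN's version is described as a Lua-atomic Redis script, which this code does not reproduce. The class name TokenBucketSketch, its constructor parameters, and the injectable clock are assumptions for the example.

```java
import java.util.function.LongSupplier;

/** Hypothetical in-memory token bucket; a stand-in for a Redis-backed one. */
final class TokenBucketSketch {
    private final double capacity;          // maximum burst size, in tokens
    private final double refillPerSecond;   // sustained rate, in tokens per second
    private final LongSupplier nanoClock;   // injectable clock, e.g. System::nanoTime
    private double tokens;
    private long lastRefillNanos;

    TokenBucketSketch(double capacity, double refillPerSecond, LongSupplier nanoClock) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
        this.nanoClock = nanoClock;
        this.tokens = capacity;              // start full so an idle tenant can burst
        this.lastRefillNanos = nanoClock.getAsLong();
    }

    /** Takes one token if available; synchronized so refill and take happen atomically. */
    synchronized boolean tryAcquire() {
        long now = nanoClock.getAsLong();
        double elapsedSeconds = (now - lastRefillNanos) / 1_000_000_000.0;
        // Refill lazily based on elapsed time, capped at the bucket capacity.
        tokens = Math.min(capacity, tokens + elapsedSeconds * refillPerSecond);
        lastRefillNanos = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

In a multi-tenant gateway one bucket would be kept per tenant, and the refill-then-take step has to be atomic across gateway instances, which is presumably why the module description calls out Lua-atomic execution in Redis.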