Open source · Self-hosted · OWASP LLM-aware

Watch every prompt.
Route across every model.
Judge every request.

WARDEN is a self-hosted multi-tenant LLM gateway. Adaptive routing across providers, multi-agent prompt-injection defense, semantic caching, and a closed-loop cost optimizer — so your platform team owns the inference layer instead of renting it.

Java 25Spring AILangGraphpgvectorMIT License

Problem

Three intersecting failures of today's LLM stacks.

01 / 03

Multi-provider cost waste

Static gateways leave 30–50% of cost savings on the table.

02 / 03

Prompt injection is OWASP LLM-01

Single-model classifiers miss; monolithic guardrails are too slow.

03 / 03

No closed control loop

Routing rules drift. Nobody re-tunes. Until now.

How it works

One request, six stages.

Every request runs the same path. Defense lands before cache so blocked prompts never serve a cached response. Cache lands before routing so hits never burn an LLM call.

Benchmarks

Three numbers the test suite reproduces.

Every figure on this page corresponds to a named test you can run from a fresh clone. No marketing-grade hand-waving.

Defense classifier

1.000precision

0.833 recall · 0.909 F1 · p99 1ms

PromptInjectionBench · 30 injection + 30 benign prompts

Cost optimizer (replay)

89.7%savings

Synthetic 24h workload · 4.5× over the 20% bar

End-to-end replay test against a seeded workload

Semantic cache p99

3ms

10k-prompt corpus · 10× under the 30ms budget

SemanticCacheMicrobench · 10k-row corpus

Quickstart

Two entry points. Both done in under five minutes.

One interactive script for local exploration, one for production multi-instance deploy. No long config files either way.

Interactive setup. Defaults to Ollama for everything — zero API keys, zero cloud bill.

$0 / month

# 1) Clone and enter the repo
git clone https://github.com/Abhishek-Aditya-bs/Warden.git
cd Warden

# 2) One command — interactive Q&A then it starts everything
./start.sh

# 3) In another terminal, drive traffic so dashboards populate
./load.sh all

Stuck? Both scripts have a ./start.sh stop equivalent — clean teardown in one command.

Paper

WARDEN: Adaptive routing and multi-agent prompt-injection defense for self-hosted LLM inference.

A tech report — not a novel-algorithm paper. The contribution is the integrated system design: how to compose a defense cascade, a semantic cache, an adaptive router, and a closed-loop optimizer into one self-hostable artifact, with honest precision / recall / latency numbers across cost-tier profiles.

Technical report · clean vector diagrams · arXiv-ready (cs.CR / cs.DC)
Multi-model sweep on a 100-prompt corpus — claude-haiku-4.5, gpt-5.4-mini, gemini-3.5-flash, claude-sonnet-4.6
Cost-optimizer replay: 89.7% projected savings on a synthetic 24h workload
Honest limitations: bimodal classifier, the precision/recall trade, replay upper bound, human in the loop

Paper.pdf · excerptdraft

Defense efficacy (May 2026 model sweep)

On a 100-prompt corpus seeded with adversarial-benign and obfuscated prompts, the regex classifier alone falls to P=0.909, R=0.600. Routing the residual through current LLMs (claude-haiku-4.5, gpt-5.4-mini, gemini-3.5-flash, claude-sonnet-4.6) recovers recall to 1.000 for cents; full LLM adjudication lifts precision to 1.000. The whole sweep cost ~$1.3.

Recall recovered

1.000

Precision (full adjud.)

1.000

Sweep cost

~$1.3

What's inside

Ten capabilities. One self-hosted artifact.

Each piece is independently testable and independently optional. Bring up only what you need — the rest stays off until you turn it on.

OpenAI-compatible API

Drop-in /v1/chat/completions. SSE streaming, OTel traceparent.

Provider adapters

Anthropic, OpenAI, Ollama, Mistral, Bedrock — config-only swap.

Routing engine

YAML policy DSL, live-reload, per-provider circuit breakers.

Semantic cache

pgvector HNSW with a Caffeine exact-match fast path.

Defense cascade

Classifier → critic → judge. Runs before cache.

Rate limit + tenants

Lua-atomic Redis token bucket, Postgres-backed tenants, admin CRUD.

Observability

Prometheus, OpenTelemetry, LangSmith, Grafana dashboards.

Cost optimizer

LangGraph sidecar that opens routing-policy PRs nightly.

Operator dashboard

Next.js 15. Polls /actuator/prometheus and the admin API.

Chaos + load harness

Toxiproxy provider outages, k6 mixed-workload scenarios.

Browse the source on GitHub

Watch every prompt. Route across every model. Judge every request.