Built for Speed: ~10ms Latency, Even Under Load

Prefix caching lets vLLM and SGLang skip recomputing tokens the model has already seen — but only if the next request lands on the same GPU replica. Standard round-robin routing can significantly reduce cache locality in multi-replica deployments, limiting the benefits of prefix caching unless routing is made cache-aware. Prefix-aware sticky routing fixes it. This post explains the mechanism, the cost of ignoring it, and how to implement it in three levels of increasing sophistication.

Inference accounts for a growing share of AI compute demand. Some industry estimates project inference could represent the majority of AI compute consumption as generative AI workloads mature, and a significant fraction of that is redundant work — your model recomputing attention states for tokens it processed seconds ago. The culprit isn't the GPU or the model. It's the load balancer sitting in front of your inference cluster, routing each request without any awareness of what's already cached where.

For a broader overview of LLM inference engines and optimization levers, see LLM Inferencing: Optimize Speed, Cost & Scale AI.

Why Prefix Caching Matters

Every LLM inference request runs in two distinct phases. During prefill, the model processes the entire input sequence and builds a key-value (KV) cache — a matrix of attention states, one entry per token, stored in GPU VRAM. During decode, the model generates output tokens one at a time, attending back to that cached state without re-reading the input.

Prefill is expensive. The attention portion of its compute cost scales quadratically with input length: doubling the prompt roughly quadruples the attention computation. For a 2,000-token system prompt processed once per request at 1,000 requests per hour, that's 2 million tokens of prefill work per hour, every hour, every day.
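To make that arithmetic concrete, here is a small back-of-the-envelope sketch. The function names are made up for illustration, and the cached case is idealized: it assumes every request reaches a replica that still holds the prefix and that nothing is ever evicted.

```python
def prefill_tokens_per_hour(prefix_tokens: int, suffix_tokens: int,
                            requests_per_hour: int, cached: bool) -> int:
    """Tokens that must go through prefill in one hour (idealized model)."""
    if cached:
        # The shared prefix is computed once; each request pays only for its suffix.
        return prefix_tokens + suffix_tokens * requests_per_hour
    # Without caching, every request re-processes the full prompt.
    return (prefix_tokens + suffix_tokens) * requests_per_hour


def relative_attention_cost(tokens: int, baseline_tokens: int) -> float:
    """Quadratic scaling of the attention term relative to a baseline prompt length."""
    return (tokens / baseline_tokens) ** 2


# The figures from the paragraph above; the suffix is set to 0 to isolate the system prompt.
uncached = prefill_tokens_per_hour(2_000, 0, 1_000, cached=False)
cached = prefill_tokens_per_hour(2_000, 0, 1_000, cached=True)

print(uncached)                               # 2000000
print(cached)                                 # 2000
print(relative_attention_cost(4_000, 2_000))  # 4.0
```

Real deployments land somewhere between the two numbers, depending on how often a request actually finds its prefix warm.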

Prefix caching eliminates the redundant portion of that work. If Request B shares a prefix with a previously processed Request A, the engine reuses the cached KV blocks for those shared tokens and computes only the suffix that's new. The savings compound across three common workload patterns:

RAG pipelines: retrieved document chunks can often overlap across queries, particularly when many users access the same knowledge base or frequently referenced documents.
Multi-turn chat: each follow-up message appends to conversation history the model has already encoded.
Few-shot prompting: a fixed set of examples prepended to every request costs prefill compute once, not N times.

vLLM implements prefix caching via hash-based block matching: each KV block is identified by a hash of its own tokens together with the prefix that precedes them. SGLang uses RadixAttention, a radix tree that finds the longest matching prefix across cached requests and reuses those blocks directly. On prefix-heavy workloads, SGLang's RadixAttention has been reported to deliver multi-fold throughput improvements over cache-blind serving, with LMSYS reporting gains of up to roughly 5× in representative benchmarks. With automatic prefix caching enabled, vLLM can skip most of the redundant prefill on multi-turn workloads, provided the relevant blocks are still resident in the cache.
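The block-matching idea can be sketched in a few lines. This is a simplified illustration of chained block hashing, not vLLM's actual code: the block size, the hash function, and the helper names are all assumptions chosen for readability.

```python
import hashlib
from typing import List, Set

BLOCK_SIZE = 16  # tokens per KV block; an assumed value for illustration


def block_hashes(token_ids: List[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Hash each full block together with the hash of the block before it."""
    hashes: List[str] = []
    parent = ""
    full = len(token_ids) - len(token_ids) % block_size  # ignore the partial tail block
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


def cached_prefix_blocks(token_ids: List[int], cache: Set[str],
                         block_size: int = BLOCK_SIZE) -> int:
    """Count leading blocks already in the cache; stop at the first miss."""
    count = 0
    for h in block_hashes(token_ids, block_size):
        if h not in cache:
            break
        count += 1
    return count


system_prompt = list(range(64))            # stand-in token ids for a shared prefix
request_a = system_prompt + [1001, 1002, 1003]
request_b = system_prompt + [2001, 2002]

cache = set(block_hashes(request_a))       # blocks left behind by Request A
print(cached_prefix_blocks(request_b, cache))  # 4
```

Because each hash folds in its parent, a block only matches when everything before it matches too, which is why a single changed token near the start of a prompt invalidates every block after it.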

This works best on a single instance when requests have stable shared prefixes and enough KV cache capacity to retain reusable blocks. The moment you scale to a multi-replica fleet, the problem begins.

How Round-Robin Routing Destroys Cache Hit Rate

Consider a deployment with eight vLLM replicas behind a standard Kubernetes service. Replica 1 processes a request with a 2,000-token system prompt, computes the KV cache, and stores it in its VRAM. The next request with the identical prompt arrives. Round-robin sends it to Replica 2 — which has no cache for that prefix, so it recomputes the full prefill from scratch. The third request goes to Replica 3. Same miss.

In workloads where a useful prefix cache exists on only one replica (for example, growing chat histories or tenant-specific contexts), round-robin routing makes the problem worse as the fleet grows. If requests are spread evenly across N replicas, the chance that a request lands on the one replica holding its prefix is roughly 1/N: about 50% with two replicas, 12.5% with eight. In prefix-heavy workloads, naive scale-out therefore reduces cache locality and increases redundant prefill unless routing is cache-aware.
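A toy simulation shows the effect. It is a sketch under strong assumptions: conversations arrive in random order, each conversation's cache lives only on the replica that served its previous turn, and eviction is ignored. The function name and parameters are illustrative.

```python
import random


def simulate_hit_rate(num_replicas: int, num_conversations: int = 50,
                      num_requests: int = 20_000, sticky: bool = False,
                      seed: int = 0) -> float:
    """Fraction of requests that land on the replica holding their conversation's cache."""
    rng = random.Random(seed)
    last_replica = {}  # conversation id -> replica that served its latest turn
    hits = 0
    for i in range(num_requests):
        conv = rng.randrange(num_conversations)
        # Sticky: the conversation id decides the replica. Round-robin: arrival order does.
        replica = conv % num_replicas if sticky else i % num_replicas
        if last_replica.get(conv) == replica:
            hits += 1
        last_replica[conv] = replica
    return hits / num_requests


for n in (2, 4, 8):
    rr = simulate_hit_rate(n)
    st = simulate_hit_rate(n, sticky=True)
    print(f"replicas={n}  round_robin={rr:.3f}  sticky={st:.3f}")
```

In this model the round-robin hit rate falls to roughly 1/N as replicas are added, while the sticky variant stays close to 1 because only the first turn of each conversation misses.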

The reported benchmark numbers are stark. The llm-d project, Red Hat's Kubernetes-native distributed inference framework, benchmarked prefix-cache-aware routing against round-robin on 8 pods / 16 H100 GPUs and measured 57× faster time-to-first-token and 2× throughput on the same hardware. A separate benchmark on Llama 3.1 70B across 4× AMD MI300X GPUs showed 3× output tokens/sec and a 2× reduction in TTFT after switching routing strategies. DigitalOcean's inference gateway, built on the same primitives, measured up to 108% throughput improvement for cache-aware versus random routing at identical hardware configurations.

The economic impact depends on request volume, model size, GPU type, and workload characteristics, but higher cache hit rates generally translate into lower infrastructure costs and better utilization of existing hardware.

The challenge is that pure cache locality and pure load balancing pull in opposite directions. The question then becomes: how do you preserve cache locality without sacrificing load balancing?

Three Levels of Prefix-Aware Routing

These approaches are ordered from simplest to most precise.

Level 1 — Session Affinity

The quickest fix: configure your load balancer to hash on a session or user identifier and always route requests from the same session to the same replica. This is available natively in most load balancers as a sticky-session or consistent-hash policy — no custom routing logic required.

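Session affinity works well for multi-turn chat where the same user generates sequential requests with growing conversation history. It has two structural limits. First, it keys on who sent the request rather than what the request contains, so different users who share the same system prompt are spread across replicas and each replica recomputes that prefix for itself. Second, changing the replica count remaps sessions: a simple modulo hash moves most of them, and even a consistent-hash ring moves a share, breaking existing affinity and triggering cache misses until the fleet re-warms.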

Use this as a fast initial improvement for single-tenant or low-concurrency deployments, or anywhere you want a quick win before investing in a more precise solution.
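How much affinity breaks when a replica is added depends on the hashing scheme. The sketch below compares a plain modulo hash with a small consistent-hash ring; the ring implementation, the virtual-node count, and the replica names are illustrative assumptions, not any particular load balancer's behavior.

```python
import bisect
import hashlib
from typing import Callable, List


def _h(key: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


def modulo_route(session_id: str, replicas: List[str]) -> str:
    return replicas[_h(session_id) % len(replicas)]


class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, replicas: List[str], vnodes: int = 100) -> None:
        self._points = sorted((_h(f"{r}#{v}"), r) for r in replicas for v in range(vnodes))
        self._keys = [p for p, _ in self._points]

    def route(self, session_id: str) -> str:
        # First ring point clockwise from the session's hash, wrapping around at the end.
        idx = bisect.bisect(self._keys, _h(session_id)) % len(self._keys)
        return self._points[idx][1]


def moved_fraction(before: Callable[[str], str], after: Callable[[str], str],
                   sessions: List[str]) -> float:
    return sum(1 for s in sessions if before(s) != after(s)) / len(sessions)


sessions = [f"session-{i}" for i in range(5_000)]
old = [f"replica-{i}" for i in range(8)]
new = old + ["replica-8"]

modulo_moved = moved_fraction(lambda s: modulo_route(s, old),
                              lambda s: modulo_route(s, new), sessions)
ring_old, ring_new = HashRing(old), HashRing(new)
ring_moved = moved_fraction(ring_old.route, ring_new.route, sessions)

print(f"modulo hash moved {modulo_moved:.0%} of sessions")
print(f"consistent ring moved {ring_moved:.0%} of sessions")
```

Going from eight to nine replicas, the modulo scheme should reassign most sessions, while the ring should move only about the new replica's fair share. Either way, the moved sessions start cold.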

Level 2 — Prefix-Hash Routing

Rather than hashing on who sent the request, hash on what the request contains. Specifically, hash the first N tokens of the prompt — the stable prefix that's identical across users — and route all requests with the same prefix hash to the same replica.

The key difference from session affinity: this handles shared system prompts correctly. If the routing key is computed from the stable shared prefix, users with the same system prompt can be routed to the same replica. The replica computes the prefix KV cache once and serves every subsequent matching request from cache.

Choosing the right prefix boundary matters more than the hashing mechanism itself. For a RAG pipeline, the boundary should cover the longest stable portion of the prompt, which may be only the system prompt or may include commonly reused retrieved context. For few-shot prompting, it's the end of the example block. For multi-turn chat, it's the end of the previous turn. Setting the boundary too short ignores stable content beyond it, so prompts that only diverge later collapse onto one routing key and can overload a single replica; setting it too long includes content that varies across requests, which fragments routing across too many replicas and generates spurious misses.
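A minimal sketch of prefix-hash routing, assuming the router already has token ids for the prompt. The function names are hypothetical, and rendezvous hashing is just one reasonable way to map a key to a replica; a consistent-hash ring would work as well.

```python
import hashlib
from typing import List, Sequence


def prefix_key(token_ids: Sequence[int], boundary: int) -> str:
    """Routing key derived from the first `boundary` tokens only."""
    prefix = token_ids[:boundary]
    return hashlib.sha256(",".join(map(str, prefix)).encode()).hexdigest()


def pick_replica(key: str, replicas: List[str]) -> str:
    """Rendezvous (highest-random-weight) hashing: each key prefers one replica."""
    return max(replicas, key=lambda r: hashlib.sha256(f"{key}|{r}".encode()).digest())


def route(token_ids: Sequence[int], boundary: int, replicas: List[str]) -> str:
    return pick_replica(prefix_key(token_ids, boundary), replicas)


replicas = [f"replica-{i}" for i in range(8)]
system_prompt = list(range(100))               # stand-in ids for a shared system prompt
user_a = system_prompt + [7001, 7002, 7003]
user_b = system_prompt + [8001, 8002]
boundary = len(system_prompt)

# Same stable prefix, different users: both requests go to the same replica.
print(route(user_a, boundary, replicas) == route(user_b, boundary, replicas))  # True

# Boundary too long: user-specific tokens leak into the key, so the keys diverge.
print(prefix_key(user_a, boundary + 2) == prefix_key(user_b, boundary + 2))    # False

# Boundary too short: a different system prompt that shares its opening gets the same key.
other_prompt = list(range(10)) + [555] * 90
print(prefix_key(system_prompt, 10) == prefix_key(other_prompt, 10))           # True
```

The last two checks are the failure modes to watch for when picking the boundary: one fragments routing, the other piles unrelated prompts onto a single replica.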

Level 3 — KV-Event-Aware Routing

The most precise approach: the router subscribes to KV block allocation and eviction events emitted by each inference engine and maintains a real-time map of which blocks — identified by their content hashes — are resident on which replica. On each incoming request, it scores every eligible replica by how much of the request's prefix is already cached there, weighted against each replica's current load, then routes to the highest-scoring option.

The scoring function balances two signals. The cache overlap score measures what fraction of the incoming request's prefix blocks are already resident on a given replica. The load score measures how saturated that replica currently is. A replica with a warm cache but a deep queue can therefore lose to a colder replica with spare capacity.

This is broadly the architecture used by llm-d's Endpoint Picker component, which uses cache-state and load information to select a target replica before forwarding the request. The routing overhead is typically small relative to the prefill savings it unlocks.
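The scoring idea can be sketched as follows. This is not llm-d's implementation: the event kinds ("stored", "removed"), the normalized load signal, the weights, and the saturation threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class ReplicaState:
    name: str
    cached_blocks: Set[str]  # block hashes the router believes are resident
    load: float              # 0.0 idle .. 1.0 saturated (assumed normalized signal)


def apply_event(replica: ReplicaState, kind: str, block_hash: str) -> None:
    """Update the router's view from a KV event (hypothetical event kinds)."""
    if kind == "stored":
        replica.cached_blocks.add(block_hash)
    elif kind == "removed":
        replica.cached_blocks.discard(block_hash)


def overlap_score(request_blocks: List[str], cached: Set[str]) -> float:
    """Fraction of the request's leading blocks already resident (contiguous prefix only)."""
    if not request_blocks:
        return 0.0
    matched = 0
    for block in request_blocks:
        if block not in cached:
            break
        matched += 1
    return matched / len(request_blocks)


def pick(request_blocks: List[str], replicas: List[ReplicaState],
         locality_weight: float = 0.7, max_load: float = 0.95) -> ReplicaState:
    """Route to the best blend of cache overlap and spare capacity."""
    # Skip saturated replicas unless every replica is saturated.
    eligible = [r for r in replicas if r.load < max_load] or replicas

    def score(r: ReplicaState) -> float:
        return (locality_weight * overlap_score(request_blocks, r.cached_blocks)
                + (1.0 - locality_weight) * (1.0 - r.load))

    return max(eligible, key=score)


warm = ReplicaState("warm", set(), load=0.6)
for block in ("b0", "b1", "b2"):
    apply_event(warm, "stored", block)
cold = ReplicaState("cold", set(), load=0.1)
hot = ReplicaState("hot", {"b0", "b1", "b2", "b3"}, load=0.98)

request = ["b0", "b1", "b2", "b3"]
print(pick(request, [warm, cold, hot]).name)                       # warm
print(pick(request, [warm, cold, hot], locality_weight=0.0).name)  # cold
```

The replica with the fullest cache is passed over here because it is saturated, and setting the locality weight to zero reduces the router to plain least-load balancing.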

When to Skip Prefix-Aware Routing

Prefix-aware routing earns its complexity only when workloads have genuine prefix overlap. It adds a routing decision and a block-state lookup on every request with no return in three scenarios: when all prompts are unique (document summarization, creative generation with varied inputs), when prompts are short (under roughly 200 tokens, where prefill cost is trivial), or when batch workloads process fully distinct inputs with no structural repetition.

The practical diagnostic: measure your cache hit rate directly from vLLM's cache event metrics or SGLang's RadixAttention stats. If the hit rate remains low even after cache-aware routing, inspect prompt structure first.
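However the counters are collected, the calculation itself is simple. The sketch below assumes two cumulative counters per replica, tokens looked up and tokens served from cache; the field names are placeholders rather than actual vLLM or SGLang metric names, and the tolerance is an arbitrary example value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheCounters:
    queried_tokens: int  # cumulative tokens looked up in the prefix cache
    hit_tokens: int      # cumulative tokens served from the prefix cache


def interval_hit_rate(prev: CacheCounters, curr: CacheCounters) -> Optional[float]:
    """Hit rate over the window between two scrapes; None if there was no traffic."""
    queried = curr.queried_tokens - prev.queried_tokens
    hits = curr.hit_tokens - prev.hit_tokens
    if queried <= 0:
        return None
    return hits / queried


def below_baseline(rate: Optional[float], baseline: float, tolerance: float = 0.10) -> bool:
    """True when the measured rate has dropped more than `tolerance` under the baseline."""
    return rate is not None and rate < baseline - tolerance


previous = CacheCounters(queried_tokens=1_000, hit_tokens=400)
current = CacheCounters(queried_tokens=3_000, hit_tokens=1_000)

rate = interval_hit_rate(previous, current)
print(rate)                                 # 0.3
print(below_baseline(rate, baseline=0.6))   # True
```

Computing the rate over a window, rather than from lifetime totals, makes a sudden regression visible quickly instead of being averaged away by older traffic.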

How TrueFoundry Implements KV Cache Sticky Routing

TrueFoundry's AI Gateway and model deployment layer implement sticky routing for KV cache optimization natively. When a model is deployed via TrueFoundry, requests with matching prefixes can be routed to the same vLLM or SGLang replica, allowing those requests to reuse existing KV cache state instead of repeatedly recomputing the same prefill tokens.

The routing layer is designed to balance two competing goals: preserving cache locality and maintaining healthy load distribution across replicas. When a replica already contains the relevant prefix cache, routing can favor that replica to maximize reuse. When a replica becomes saturated, the gateway can fall back to alternative replicas to avoid turning cache locality into a bottleneck.

Because routing is integrated with model deployment, teams do not need to build separate infrastructure for sticky sessions, prefix hashing, or custom load-balancing policies. The same deployment workflow used to scale models horizontally also preserves cache-aware routing behavior as replicas are added or removed.

For teams running long system prompts, multi-turn chat applications, RAG workloads, or few-shot prompting pipelines, this improves the likelihood that repeated prefixes remain warm on the replicas most likely to serve subsequent requests. The result is higher cache utilization, lower redundant prefill compute, and better GPU efficiency without requiring application-level routing logic.

Best Practices

Fix prompt structure before touching routing. Place all static content — system prompt, few-shot examples, retrieved chunks — at the start of every prompt. Variable content (user message, request ID, timestamp) must follow the stable prefix. A single misplaced dynamic field before the system prompt collapses cache-ability entirely, regardless of routing strategy.
Set your prefix boundary based on actual token counts, not guesses. Tokenize your typical system prompt and example block, count the tokens, and use that boundary for prefix hashing. An arbitrary fixed length often cuts mid-sentence inside variable content or misses the full stable region.
Start at Level 2, measure, then decide whether Level 3 is worth it. Moving from round-robin to prefix-hash routing captures the majority of available gains without the event-tracking infrastructure that Level 3 requires. Level 3 is worth evaluating when you run multiple replicas at sustained concurrency and prefix-cache misses are a measured bottleneck.
Monitor cache hit rate as a production alert, not just a dashboard metric. A developer quietly adding a dynamic field to the system prompt will tank hit rate immediately. The latency regression shows up before anyone checks logs. Alert on hit rate dropping below your established baseline.
Tune the locality weight per workload type, not globally. Multi-turn chat and RAG pipelines benefit from strong cache-locality preference. Batch inference over diverse inputs benefits from pure load balancing. If your gateway serves mixed traffic, run separate routing configurations per workload class rather than compromising both.
Watch VRAM utilization alongside hit rate. Prefix caching trades GPU memory for prefill savings. At high concurrency, cached blocks compete with active requests for VRAM, triggering LRU eviction and collapsing hit rates. If hit rate degrades under load despite correct routing, memory pressure is the likely cause — adjust the memory utilization ceiling on the inference engine before assuming a routing problem.
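The first practice is the easiest to check in code. The sketch below builds the same prompt two ways and measures how much of it two requests share; it compares characters as a stand-in for tokens, and the prompt layout and helper names are illustrative assumptions.

```python
from typing import List


def build_prompt(system_prompt: str, few_shot: List[str],
                 user_message: str, request_id: str) -> str:
    """Static content first, per-request fields last."""
    return "\n\n".join([system_prompt, *few_shot,
                        f"User: {user_message}", f"[request_id={request_id}]"])


def build_prompt_dynamic_first(system_prompt: str, few_shot: List[str],
                               user_message: str, request_id: str) -> str:
    """Anti-pattern: a per-request field placed ahead of the stable prefix."""
    return "\n\n".join([f"[request_id={request_id}]", system_prompt, *few_shot,
                        f"User: {user_message}"])


def shared_prefix_len(a: str, b: str) -> int:
    """Length of the common leading run of characters (a rough proxy for shared tokens)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


system = "You are a support assistant for an example product. Answer concisely."
examples = ["Q: How do I reset my password?\nA: Use the reset link on the login page.",
            "Q: Where are invoices?\nA: Under Billing in account settings."]
static_len = len("\n\n".join([system, *examples]))

good = shared_prefix_len(build_prompt(system, examples, "My login fails", "req-001"),
                         build_prompt(system, examples, "Cancel my plan", "req-002"))
bad = shared_prefix_len(
    build_prompt_dynamic_first(system, examples, "My login fails", "req-001"),
    build_prompt_dynamic_first(system, examples, "Cancel my plan", "req-002"))

print(good >= static_len)  # True: the whole static region is reusable
print(bad)                 # 18: the prompts diverge inside the request id
```

With the request id in front, the two prompts stop matching inside the very first field, so no amount of routing can recover the shared system prompt and examples behind it.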

Conclusion

For most teams, moving from round-robin to prefix-hash routing is the highest-leverage optimization. More sophisticated cache-aware routing becomes increasingly valuable as fleet size, concurrency, and cache reuse grow.

Learn how TrueFoundry's AI Gateway handles prefix-aware routing alongside model deployment in a single control plane → truefoundry.com/ai-gateway
