Semantic Caching for LLMs: Cutting Cost and Latency Beyond Prefix Caching

By Boyu Wang

Updated: June 7, 2026

Prefix caching reuses identical prompt prefixes. Semantic caching reuses similar requests: embed the incoming request, and if a near-identical question was answered recently, serve the stored answer instead of calling the model. It's one of the highest-leverage cost and latency levers a gateway can pull, and one where "it works in the demo" and "it's safe in production" are very different claims. This post covers how it works, the single knob that governs it, the cases where it quietly serves the wrong answer, and where the cache should live.

Key Takeaways
  • There are three layers of LLM caching, in increasing reach and risk: provider prefix caching (exact prefix), exact-match response caching (hash the whole request), and semantic caching (embed the request, serve a similar prior response). This post covers the third.
  • Semantic caching embeds the request, runs a vector similarity search, and serves a cached response when similarity clears a threshold — turning a several-hundred-millisecond model call into a tens-of-milliseconds lookup on a hit.
  • The similarity threshold is the whole game: set it too low and you serve wrong answers (false hits); too high and the hit rate collapses. It's a precision/recall tradeoff to tune per route, not a global constant.
  • Embedding-close is not meaning-equal. "What is the capital of France?" and "What is the capital of Germany?" sit near each other in embedding space. Conservative thresholds, entity guards, and per-namespace caches mitigate this; nothing eliminates it.
  • Never semantically cache personalized, time-sensitive, stateful, or high-stakes responses. A mis-scoped cache can serve one user's data to another — a privacy failure, not just a wrong answer.
  • Cache keys must carry scope (tenant/user) and version (model, system-prompt version, tools), and entries must be invalidated on TTL and whenever the prompt or tools change.
  • The cache belongs at the gateway: shared across services, scoped per tenant, with hit-rate, cost-saved, and latency-saved observability. TrueFoundry's AI Gateway is the natural place to measure cacheable traffic and attribute the savings.
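The threshold tradeoff and the France/Germany problem from the takeaways can be seen in a minimal sketch. Everything here is an illustrative assumption: `toy_embed` is a bag-of-words stand-in for a real embedding model, the linear scan stands in for a vector index, and the two threshold values are arbitrary. It is not how any particular gateway implements the lookup.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class SemanticCache:
    threshold: float
    entries: list = field(default_factory=list)  # (embedding, response) pairs

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((toy_embed(prompt), response))

    def get(self, prompt: str):
        """Return the most similar cached response, or None below the threshold."""
        query = toy_embed(prompt)
        best_score, best_response = 0.0, None
        for embedding, response in self.entries:
            score = cosine(query, embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None


# Same cached entry, two thresholds (both values are arbitrary for illustration).
loose = SemanticCache(threshold=0.80)
strict = SemanticCache(threshold=0.90)
for cache in (loose, strict):
    cache.put("What is the capital of France?", "Paris")

rephrased = "what is the capital of france"
different_entity = "What is the capital of Germany?"

loose_result = loose.get(different_entity)    # false hit: five of six words match
strict_result = strict.get(different_entity)  # miss: similarity is below 0.90
paraphrase_result = strict.get(rephrased)     # legitimate hit
```

In this toy setup the Germany question scores about 0.83 against the France entry, so the looser cache serves "Paris" and the stricter one correctly misses. A real embedding model produces different numbers, but the shape of the failure is the same.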

Kabir, a backend engineer, had a good week and then a bad one. The good week: he'd put a semantic cache in front of Northwind's support assistant — embed each incoming question, and if a near-identical question had been answered recently, return the stored answer instead of calling the model. Model-call volume dropped 35%. Latency on cache hits fell from roughly 900 ms to under 40 ms. The bad week: a customer asked "where's my delivery?" and got a confident, detailed answer — about someone else's order. Two customers had asked semantically identical questions minutes apart; the cache embedded both to nearly the same vector, scored a hit, and served the first customer's answer to the second.

The cache was working exactly as designed. It just had no idea that "where's my delivery?" means something different depending on who is asking. Semantic caching trades a model call for a similarity match, and the entire safety of that trade lives in two places: the threshold you match on, and what you allow into the cache in the first place. This post covers both.
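The second half of that trade, deciding what is allowed into the cache at all, can be sketched as an admission check that runs before any lookup or write. The patterns and the `Request` fields below are hypothetical and deliberately crude; a real deployment would more likely mark routes as cacheable or not than guess from the text.

```python
import re
from dataclasses import dataclass

# Hypothetical keyword heuristics, for illustration only.
PERSONAL = re.compile(r"\b(my|mine|our|ours)\b", re.IGNORECASE)
TIME_SENSITIVE = re.compile(
    r"\b(today|now|current|currently|latest|tonight|tomorrow)\b", re.IGNORECASE
)


@dataclass
class Request:
    text: str
    has_history: bool = False  # stateful: part of a multi-turn conversation
    high_stakes: bool = False  # flagged by the route owner


def cacheable(req: Request) -> bool:
    """Return True only if the request looks safe to serve from a semantic cache."""
    if req.has_history or req.high_stakes:
        return False
    if PERSONAL.search(req.text) or TIME_SENSITIVE.search(req.text):
        return False
    return True
```

Under this check, "where's my delivery?" is rejected because of "my", while a generic question such as "What is your return policy?" is admitted.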

What TrueFoundry's AI Gateway Provides Here

Everything in this post is something TrueFoundry's AI Gateway caching expresses as gateway configuration: exact-match and semantic caching, the similarity threshold as a per-route knob, per-tenant scoping so Kabir's bug from the cold open never happens, and the hit-rate and cost-saved telemetry that tells you whether the cache is actually paying off. A single header on the request turns it on; the gateway hashes the request (or, for semantic caching, embeds the last message), compares against a Redis-backed store, and returns the cached response on a hit. On a miss, the request goes to the provider, and the new response and its embedding are cached for next time.
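The hit-or-miss flow described above can be sketched for the exact-match case as a read-through cache with a TTL. This is an assumption-laden illustration, not TrueFoundry's implementation: the in-memory dict stands in for the Redis-backed store, and `ExactMatchCache`, `complete`, and `call_provider` are hypothetical names.

```python
import hashlib
import json
import time
from typing import Callable


class ExactMatchCache:
    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self.store: dict = {}  # key -> (expires_at, response); stand-in for Redis
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(request: dict) -> str:
        """Hash a canonical JSON form so key order in the request does not matter."""
        canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def complete(self, request: dict, call_provider: Callable[[dict], str]) -> str:
        k = self.key(request)
        entry = self.store.get(k)
        if entry is not None and entry[0] > self.clock():
            self.hits += 1  # hit: the provider is never called
            return entry[1]
        self.misses += 1  # miss or expired: call the provider, then cache
        response = call_provider(request)
        self.store[k] = (self.clock() + self.ttl, response)
        return response
```

The `hits` and `misses` counters are the raw material for the hit-rate telemetry; cost saved and latency saved would be derived from them together with per-call cost and timing.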

The correctness story — making sure two semantically similar requests from different users never return each other's answers — is built in via two-level namespacing. Level 1 is automatic: every cache entry is scoped to the calling user or virtual account, so User A's request can never hit User B's entry, full stop. Level 2 is optional: a namespace field in the cache config partitions further (per tenant, per environment, per system-prompt version), which is what the post's namespace argument needs in practice. Together, they're what makes a gateway-level semantic cache safe to share across services without each one re-implementing isolation.
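One way to picture two-level namespacing, combined with the versioning requirement from the takeaways, is a partition key computed before any similarity search runs. The function and field names below are hypothetical and do not reflect the gateway's actual key format; the point is only that similarity is never compared across partitions.

```python
import hashlib
import json
from typing import Optional, Sequence


def cache_partition(
    user_id: str,                 # level 1: calling user or virtual account
    namespace: Optional[str],     # level 2: optional tenant/environment partition
    model: str,
    system_prompt_version: str,
    tools: Sequence[str],
) -> str:
    """Derive a partition ID; entries in different partitions never match."""
    payload = {
        "user": user_id,
        "namespace": namespace or "",
        "model": model,
        "system_prompt_version": system_prompt_version,
        "tools": sorted(tools),  # tool order should not change the partition
    }
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

With a scheme like this, two customers asking "where's my delivery?" land in different partitions, and bumping the system-prompt version or changing the tool list moves traffic to a fresh partition, which effectively invalidates the old entries.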

TrueFoundry AI Gateway request flow: where caching sits on the request path
Fig 1: Where the cache sits in the broader request path: the cache lookup happens before the provider call, so a hit short-circuits the model entirely. Source: TrueFoundry — Gateway Plane Architecture
