Semantic Caching for LLMs: Beyond Prefix Caching

Conçu pour la vitesse : latence d'environ 10 ms, même en cas de charge

Une méthode incroyablement rapide pour créer, suivre et déployer vos modèles !

Gère plus de 350 RPS sur un seul processeur virtuel, aucun réglage n'est nécessaire
Prêt pour la production avec un support complet pour les entreprises

Commencez à utiliser Truefoundry dès maintenant Parlez à l'expert

Prefix caching reuses identical prompts. Semantic caching reuses similar ones — embed the incoming request, and if a near-identical question was answered recently, serve the stored answer instead of calling the model. It's one of the highest-leverage cost and latency levers a gateway can pull, and one where "it works in the demo" and "it's safe in production" are very different claims. This post is how it works, the single knob that governs it, the cases where it quietly serves the wrong answer, and where the cache should live.

Key Takeaways

There are three layers of LLM caching, in increasing reach and risk: provider prefix caching (exact prefix), exact-match response caching (hash the whole request), and semantic caching (embed the request, serve a similar prior response). This post is the third.
Semantic caching embeds the request, runs a vector similarity search, and serves a cached response when similarity clears a threshold — turning a several-hundred-millisecond model call into a tens-of-milliseconds lookup on a hit.
The similarity threshold is the whole game: set it too low and you serve wrong answers (false hits); too high and the hit rate collapses. It's a precision/recall tradeoff to tune per route, not a global constant.
Embedding-close is not meaning-equal. "What is the capital of France?" and "What is the capital of Germany?" sit near each other in embedding space. Conservative thresholds, entity guards, and per-namespace caches mitigate this; nothing eliminates it.
Never semantically cache personalized, time-sensitive, stateful, or high-stakes responses. A mis-scoped cache can serve one user's data to another — a privacy failure, not just a wrong answer.
Cache keys must carry scope (tenant/user) and version (model, system-prompt version, tools), and entries must be invalidated on TTL and whenever the prompt or tools change.
The cache belongs at the gateway: shared across services, scoped per tenant, with hit-rate, cost-saved, and latency-saved observability. TrueFoundry's AI Gateway is the natural place to measure cacheable traffic and attribute the savings.

Kabir, a backend engineer, had a good week and then a bad one. The good week: he'd put a semantic cache in front of Northwind's support assistant — embed each incoming question, and if a near-identical question had been answered recently, return the stored answer instead of calling the model. Model-call volume dropped 35%. Latency on cache hits fell from roughly 900 ms to under 40 ms. The bad week: a customer asked "where's my delivery?" and got a confident, detailed answer — about someone else's order. Two customers had asked semantically identical questions minutes apart; the cache embedded both to nearly the same vector, scored a hit, and served the first customer's answer to the second.

The cache was working exactly as designed. It just had no idea that "where's my delivery?" means something different depending on who is asking. Semantic caching trades a model call for a similarity match, and the entire safety of that trade lives in two places: the threshold you match on, and what you allow into the cache in the first place. This post is both.

What TrueFoundry's AI Gateway Provides Here

Everything in this post — exact-match and semantic caching, the similarity threshold as a per-route knob, per-tenant scoping so Kabir's bug from the cold open never happens, and the hit-rate / cost-saved telemetry that tells you whether the cache is actually paying off — is something TrueFoundry's AI Gateway caching expresses as gateway configuration. A single header on the request turns it on; the gateway hashes the request (or embeds the last message for semantic), compares against a Redis-backed store, and returns the cached response on a hit. On a miss, the request goes to the provider and the new response and embedding are cached for next time.

The correctness story — making sure two semantically similar requests from different users never return each other's answers — is built in via two-level namespacing. Level 1 is automatic: every cache entry is scoped to the calling user or virtual account, so User A's request can never hit User B's entry, full stop. Level 2 is optional: a namespace field in the cache config partitions further (per tenant, per environment, per system-prompt version), which is what the post's namespace argument needs in practice. Together, they're what makes a gateway-level semantic cache safe to share across services without each one re-implementing isolation.

TrueFoundry AI Gateway request flow: where caching sits on the request path — Fig 1: *Where the cache sits in the broader request path: the cache lookup happens before the provider call, so a hit short-circuits the model entirely. Source:* *TrueFoundry — Gateway Plane Architecture*

‍

‍

TrueFoundry AI Gateway offre une latence d'environ 3 à 4 ms, gère plus de 350 RPS sur 1 processeur virtuel, évolue horizontalement facilement et est prête pour la production, tandis que LiteLM souffre d'une latence élevée, peine à dépasser un RPS modéré, ne dispose pas d'une mise à l'échelle intégrée et convient parfaitement aux charges de travail légères ou aux prototypes.

Conçu pour la vitesse : latence d'environ 10 ms, même en cas de charge

Planifiez votre démo dès maintenant

Le moyen le plus rapide de créer, de gérer et de faire évoluer votre IA

INSCRIVEZ-VOUS

Comment pouvez-vous empêcher les coûts de GenAI de grimper en flèche à grande échelle ?

Gartner report on best practices for optimizing generative and agentic AI costs and projected statistics.

Accédez au rapport complet de 2026

Gartner Hype Cycle for Platform Engineering 2026

Access Full 2026 Report

One Layer of Control for All AI

Route and govern model and tool traffic with a centralized AI Gateway

Book Demo

Table des matières

Lien textuel

Gouvernez, déployez et suivez l'IA dans votre propre infrastructure

Réservez un séjour de 30 minutes avec notre Expert en IA

Réservez une démo

Summarize with

Blurry red snowflake on white background, symmetrical frosty design with soft edges and abstract shape.

Semantic Caching for LLMs: Cutting Cost and Latency Beyond Prefix Caching

Conçu pour la vitesse : latence d'environ 10 ms, même en cas de charge

What TrueFoundry's AI Gateway Provides Here

Le moyen le plus rapide de créer, de gérer et de faire évoluer votre IA

One Layer of Control for All AI

Gouvernez, déployez et suivez l'IA dans votre propre infrastructure

Le moyen le plus rapide de créer, de gérer et de faire évoluer votre IA

Introducing Ask TFY: A New Way to Understand and Control Your AI in Production

TrueFoundry vs MintMCP: MCP Gateway Comparison

TrueFoundry + Seldon: Unified Control Plane for Enterprise AI

Agent Economics, No. 2: Mapping Firm-Scale AI Controls to Agent-Economy Institutions

Blogs récents

Agent Economics, No. 2: Mapping Firm-Scale AI Controls to Agent-Economy Institutions

Agent Economics, No. 1: What Is the Agent Economy — and Who Gets to Design It?

Introducing Ask TFY: A New Way to Understand and Control Your AI in Production

Best MCP Gateway for Production AI Systems in 2026

Best AI Gateways for LLM Inference Optimization in 2026

TrueFoundry vs MintMCP: MCP Gateway Comparison

Graph Engineering for Multi-Agent Systems: Architecture, Governance, and Observability

Designing for Model Deprecations with Virtual Models and Staged Cutovers

Unified AI Gateway as Enterprise's New Foundational Primitive

The Path to the Championship: Enterprise AI's Knockout Rounds Run Through the Gateway

AI Safety vs AI Security: What the Difference Means for Enterprise Teams

What Is Responsible AI? Principles, Practice, and What It Means for Enterprise Teams

AI Audit Checklist 2026: What to Review, When, and Why It Matters

BCG Says Strategy Matters More Than Tools — Part 2: From Agent Adoption to Governed Tools and Runtimes

BCG Says Strategy Matters More Than Tools — Part 1: From Strategic Clarity to Gateway Controls

Resources

Why TrueFoundry?

Semantic Caching for LLMs: Cutting Cost and Latency Beyond Prefix Caching

Conçu pour la vitesse : latence d'environ 10 ms, même en cas de charge

What TrueFoundry's AI Gateway Provides Here

Le moyen le plus rapide de créer, de gérer et de faire évoluer votre IA

One Layer of Control for All AI

Gouvernez, déployez et suivez l'IA dans votre propre infrastructure

Le moyen le plus rapide de créer, de gérer et de faire évoluer votre IA

Découvrez-en plus

Introducing Ask TFY: A New Way to Understand and Control Your AI in Production

TrueFoundry vs MintMCP: MCP Gateway Comparison

TrueFoundry + Seldon: Unified Control Plane for Enterprise AI

Agent Economics, No. 2: Mapping Firm-Scale AI Controls to Agent-Economy Institutions

Blogs récents

Agent Economics, No. 2: Mapping Firm-Scale AI Controls to Agent-Economy Institutions

Agent Economics, No. 1: What Is the Agent Economy — and Who Gets to Design It?

Introducing Ask TFY: A New Way to Understand and Control Your AI in Production

Best MCP Gateway for Production AI Systems in 2026

Best AI Gateways for LLM Inference Optimization in 2026

TrueFoundry vs MintMCP: MCP Gateway Comparison

Graph Engineering for Multi-Agent Systems: Architecture, Governance, and Observability

Designing for Model Deprecations with Virtual Models and Staged Cutovers

Unified AI Gateway as Enterprise's New Foundational Primitive

The Path to the Championship: Enterprise AI's Knockout Rounds Run Through the Gateway

AI Safety vs AI Security: What the Difference Means for Enterprise Teams

What Is Responsible AI? Principles, Practice, and What It Means for Enterprise Teams

AI Audit Checklist 2026: What to Review, When, and Why It Matters

BCG Says Strategy Matters More Than Tools — Part 2: From Agent Adoption to Governed Tools and Runtimes

BCG Says Strategy Matters More Than Tools — Part 1: From Strategic Clarity to Gateway Controls

Resources

Why TrueFoundry?

Abonnez-vous à notre newsletter