Semantic Caching for LLMs: Beyond Prefix Caching

Auf Geschwindigkeit ausgelegt: ~ 10 ms Latenz, auch unter Last

Unglaublich schnelle Methode zum Erstellen, Verfolgen und Bereitstellen Ihrer Modelle!

Verarbeitet mehr als 350 RPS auf nur 1 vCPU — kein Tuning erforderlich
Produktionsbereit mit vollem Unternehmenssupport

Beginnen Sie jetzt mit Truefoundry Sprechen Sie mit dem Experten

Prefix caching reuses identical prompts. Semantic caching reuses similar ones — embed the incoming request, and if a near-identical question was answered recently, serve the stored answer instead of calling the model. It's one of the highest-leverage cost and latency levers a gateway can pull, and one where "it works in the demo" and "it's safe in production" are very different claims. This post is how it works, the single knob that governs it, the cases where it quietly serves the wrong answer, and where the cache should live.

Key Takeaways

There are three layers of LLM caching, in increasing reach and risk: provider prefix caching (exact prefix), exact-match response caching (hash the whole request), and semantic caching (embed the request, serve a similar prior response). This post is the third.
Semantic caching embeds the request, runs a vector similarity search, and serves a cached response when similarity clears a threshold — turning a several-hundred-millisecond model call into a tens-of-milliseconds lookup on a hit.
The similarity threshold is the whole game: set it too low and you serve wrong answers (false hits); too high and the hit rate collapses. It's a precision/recall tradeoff to tune per route, not a global constant.
Embedding-close is not meaning-equal. "What is the capital of France?" and "What is the capital of Germany?" sit near each other in embedding space. Conservative thresholds, entity guards, and per-namespace caches mitigate this; nothing eliminates it.
Never semantically cache personalized, time-sensitive, stateful, or high-stakes responses. A mis-scoped cache can serve one user's data to another — a privacy failure, not just a wrong answer.
Cache keys must carry scope (tenant/user) and version (model, system-prompt version, tools), and entries must be invalidated on TTL and whenever the prompt or tools change.
The cache belongs at the gateway: shared across services, scoped per tenant, with hit-rate, cost-saved, and latency-saved observability. TrueFoundry's AI Gateway is the natural place to measure cacheable traffic and attribute the savings.

Kabir, a backend engineer, had a good week and then a bad one. The good week: he'd put a semantic cache in front of Northwind's support assistant — embed each incoming question, and if a near-identical question had been answered recently, return the stored answer instead of calling the model. Model-call volume dropped 35%. Latency on cache hits fell from roughly 900 ms to under 40 ms. The bad week: a customer asked "where's my delivery?" and got a confident, detailed answer — about someone else's order. Two customers had asked semantically identical questions minutes apart; the cache embedded both to nearly the same vector, scored a hit, and served the first customer's answer to the second.

The cache was working exactly as designed. It just had no idea that "where's my delivery?" means something different depending on who is asking. Semantic caching trades a model call for a similarity match, and the entire safety of that trade lives in two places: the threshold you match on, and what you allow into the cache in the first place. This post is both.

What TrueFoundry's AI Gateway Provides Here

Everything in this post — exact-match and semantic caching, the similarity threshold as a per-route knob, per-tenant scoping so Kabir's bug from the cold open never happens, and the hit-rate / cost-saved telemetry that tells you whether the cache is actually paying off — is something TrueFoundry's AI Gateway caching expresses as gateway configuration. A single header on the request turns it on; the gateway hashes the request (or embeds the last message for semantic), compares against a Redis-backed store, and returns the cached response on a hit. On a miss, the request goes to the provider and the new response and embedding are cached for next time.

The correctness story — making sure two semantically similar requests from different users never return each other's answers — is built in via two-level namespacing. Level 1 is automatic: every cache entry is scoped to the calling user or virtual account, so User A's request can never hit User B's entry, full stop. Level 2 is optional: a namespace field in the cache config partitions further (per tenant, per environment, per system-prompt version), which is what the post's namespace argument needs in practice. Together, they're what makes a gateway-level semantic cache safe to share across services without each one re-implementing isolation.

TrueFoundry AI Gateway request flow: where caching sits on the request path — Fig 1: *Where the cache sits in the broader request path: the cache lookup happens before the provider call, so a hit short-circuits the model entirely. Source:* *TrueFoundry — Gateway Plane Architecture*

‍

‍

TrueFoundry AI Gateway bietet eine Latenz von ~3—4 ms, verarbeitet mehr als 350 RPS auf einer vCPU, skaliert problemlos horizontal und ist produktionsbereit, während LiteLM unter einer hohen Latenz leidet, mit moderaten RPS zu kämpfen hat, keine integrierte Skalierung hat und sich am besten für leichte Workloads oder Prototyp-Workloads eignet.

Auf Geschwindigkeit ausgelegt: ~ 10 ms Latenz, auch unter Last

Vereinbaren Sie jetzt Ihre Demo