Provider-Agnostic Prompt Caching: How an LLM Gateway Normalizes Anthropic, OpenAI, and Bedrock

Built for Speed: ~10ms Latency, Even Under Load

Blazingly fast way to build, track and deploy your models!

Handles 350+ RPS on just 1 vCPU — no tuning needed
Production-ready with full enterprise support

Get Started with Truefoundry Now Talk to the Expert

Every major LLM provider implements prompt caching differently. Here's how the TrueFoundry AI Gateway translates cache directives across providers, handles fallback when a target doesn't support caching, and exposes unified hit metrics — with token savings benchmarks.

The first time we saw a 10,000-token system prompt billed at full price 950 times in a single hour — on a deployment that was supposed to have caching turned on — the on-call engineer's Slack message was four words: "where did it go?" The cache_control directive was set. The prefix was identical byte-for-byte across every request. Anthropic's API was happily caching the traffic that reached it. The catch was that half the requests had been routed to a different provider during a brief failover, and the gateway translating between them had quietly dropped the cache directive on the floor — no error, no warning, just a $30 line item where there should have been a $3 one. The bill recovered. The trust took longer. This post is about everything that has to go right for that not to happen to you.

Prompt caching is the single highest-leverage cost optimization for any LLM workload built on a long, stable prefix — a 10K-token system prompt, a tool registry, a retrieval bundle. The discount is roughly 90% off cached input tokens, large enough to change the unit economics of an agent product.

**Figure 1 —** A canonical request enters the TrueFoundry AI Gateway, which translates it into three native dialects on the way out. Each provider returns a different usage shape; the gateway normalizes back to a single object downstream consumers can rely on.

The catch is that the three major LLM provider APIs — Anthropic native, OpenAI, AWS Bedrock Converse — each model caching differently in ways that aren't cosmetic: directive shape, response shape, TTL options, and even when caching is invoked at all.

The Core Idea

Three native cache APIs, one unified directive at the gateway

Anthropic uses explicit cache_control breakpoints. OpenAI caches automatically on long prefixes. Bedrock Converse uses cachePoint markers. A gateway that doesn't normalize all three forces provider-specific branches into routing, cost attribution, and observability — and silently misprices every request that takes a path the code didn't test.

1 Anthropic Prompt Caching: Explicit Breakpoints and the 5-Minute TTL

Anthropic's prompt caching is the most expressive of the three. You explicitly mark which portion of the request should be cached using the cache_control attribute on a system or message block:

anthropic request · json

{
  "model": "claude-sonnet-4-6",
  "system": [{
    "type": "text",
    "text": "<10,000 token system prompt>",
    "cache_control": {"type": "ephemeral"}
  }],
  "messages": [{"role": "user", "content": "<query>"}]
}

Three rules govern when this actually caches. The prefix must clear a model-specific minimum: 1,024 tokens for Sonnet-class and 2,048 tokens for Opus-class. Below the threshold the directive is silently dropped. The default TTL is 5 minutes, refreshed on each hit. And the prefix must be byte-identical — a stray timestamp before the cached block kills the hit (we return to this in §08).

Pricing has two non-obvious components. Cache reads are billed at 10% of the base input rate — the 90% discount you came for. Cache writes carry a 25% premium over base input pricing. The cache must pay for itself across enough subsequent hits to recover the write surcharge; for a 10K-token prefix on Sonnet 4.6, breakeven is two follow-up hits.

The response surfaces caching in three usage fields: cache_read_input_tokens (served from cache), cache_creation_input_tokens (written this turn), and input_tokens (everything else). Since extended-TTL caching now lets you opt into a 1-hour retention at a 2× write premium, Anthropic also surfaces a per-TTL breakdown under cache_creation.ephemeral_5m_input_tokens and cache_creation.ephemeral_1h_input_tokens. That breakdown matters more than it looks — we'll come back to it in §04.

2 OpenAI Automatic Caching: Implicit Prefixes and the 1-Hour Window

OpenAI's prompt caching takes the opposite philosophy. There is no directive. The cache turns on automatically once the prompt prefix exceeds 1,024 tokens, and OpenAI's infrastructure caches the KV activations server-side. Subsequent requests with the same prefix get the cached state automatically.

The developer-facing surface is two optional parameters that influence — but don't trigger — caching. prompt_cache_key is a string combined with the prefix hash to influence backend routing: same key, same backend, higher hit rate. prompt_cache_retention accepts "in_memory" (default, 5–10 minutes) or "24h" for extended retention. Both are documented in OpenAI's prompt caching guide.

The asymmetry with Anthropic is sharp: you cannot choose which prefix is cached. OpenAI's caching layer picks the longest repeated prefix it observes on a given backend. If your system prompt is followed by a frequently-changing tool registry, the cache may land on the system prompt alone or on a different boundary, with no visibility into which. The response surfaces only one cache field — cached_tokens nested under usage.prompt_tokens_details — and no write-side counter, because OpenAI doesn't bill for cache writes.

The practical implication: OpenAI caching is well-suited to workloads where the stable prefix is one block. Anthropic's explicit breakpoints give you fine-grained control over multi-region prefixes; OpenAI doesn't.

3 AWS Bedrock Prompt Caching: The Converse API Approach

Bedrock's prompt caching sits in the Converse API and uses a cachePoint marker block placed after the content it caches:

bedrock converse · json

{
  "modelId": "anthropic.claude-sonnet-4-6-v1:0",
  "system": [
    {"text": "<long system prompt>"},
    {"cachePoint": {"type": "default", "ttl": "1h"}}
  ],
  "messages": [/* ... */]
}

Three things make this trickier than it looks. First, the marker placement is the opposite of Anthropic's — the directive comes after the content it caches, not on it. Implementations that lift Anthropic's pattern directly silently disable caching. Second, this is the Converse API specifically; Bedrock's native InvokeModel endpoint accepts the Anthropic-style format verbatim for Claude models, but Converse is the path most newer integrations use.

Third, TTL support shifted recently. Before January 26, 2026, Bedrock Converse only supported a 5-minute TTL. On that date AWS shipped 1-hour TTL support for the Claude 4.5+ family and Amazon Nova; the new ttl: "1h" field on the cachePoint opts in. Older models reject the field — no graceful downgrade.

The response uses camelCase: cacheReadInputTokens, cacheWriteInputTokens, and inputTokens. Unlike Anthropic, Bedrock Converse does not expose a per-TTL breakdown on writes. The gateway that translated the request knows which TTL was requested; the response alone doesn't tell you.

Sharp Edge

Bedrock's 1-hour TTL is Claude 4.5+ only. Sending ttl: "1h" to an older model returns an API error, not a downgrade. TrueFoundry's router resolves model fallbacks before the cache translation step, not after, so the TTL it commits to always matches a model that can honor it.

4 Gateway Normalization: What Happens When the Target Doesn't Support Caching

The translation problem the gateway has to solve is: a client sends a single canonical request with Anthropic-style cache_control, and the gateway routes to whichever provider the policy selects. Three strategies are possible — strip the directive when the target doesn't understand it, translate to the target's native cache primitive, or error and refuse to route.

TrueFoundry's gateway picks translate-where-possible, strip-and-log otherwise. Anthropic and Bedrock Converse get full translation — cache_control with a TTL becomes a cachePoint with the equivalent ttl field. OpenAI gets a translated intent: the directive itself is dropped, but prompt_cache_retention is set to "24h" for a 1-hour canonical TTL, and prompt_cache_key is forwarded from the canonical cache_key. When a downgrade is forced — a 1-hour TTL routed to a model that doesn't support extended TTL — the gateway emits a warning span attribute (gen_ai.tfy.cache_ttl_downgrade=5m) so the regression is observable.

One subtler trap on the response side is the mixed-TTL cost accounting bug.

**Figure 2 —** An Anthropic response can contain both 5-minute and 1-hour write tokens in the same call. Collapsing them into a single TTL label for cost calculation produces a 19–29% error on every cache-write turn. The unified usage shape has to preserve the breakdown.

The unified usage object the gateway exposes downstream therefore has five token fields, not four: cache_hit_tokens, cache_miss_tokens, cache_write_5m_tokens, cache_write_1h_tokens, and .

5 Cache Behavior During Fallback: Cold Starts and Cost Implications

The interesting failure mode for any multi-provider gateway is what happens when the primary target is rate-limited or down and the request falls back to a secondary. For caching, caches don't follow traffic across providers. A request whose 10K-token system prompt has been hot in Anthropic's cache for the last fifteen minutes will, on fallback to Azure OpenAI, hit a completely cold prefix cache. The fallback succeeds — but it pays full input-token cost on the prefix, and the trace correctly reports a cache miss.

For a 10K-token prefix on Sonnet 4.6, that's the difference between $0.003 (cached read) and $0.030 (full input) per request — a 10× cost spike during the failover window. Even after the secondary warms up, the warm-up isn't free: every fresh request on a cold target pays the write premium.

The mitigation has two parts. Keep the secondary warm — send 1% of traffic to the fallback provider continuously, not just during incidents. Monitor fallback cache hit rate separately. A useful alert isn't "fallback rate exceeded 5%"; it's "fallback rate exceeded 5% AND fallback cache hit rate below 50%." The first is a routing event; the second is a routing event with a measurable cost penalty attached.

6 Unified Cache Metrics: How the Gateway Surfaces Hit Rate Across Providers

The payoff of the normalization layer is that every trace span carries the same cache attributes regardless of which provider served the request:

otel span attributes · gateway-emitted

gen_ai.provider.name              = anthropic
gen_ai.request.model              = claude-sonnet-4-6
gen_ai.usage.cache_hit_tokens     = 9800
gen_ai.usage.cache_miss_tokens    = 248
gen_ai.usage.cache_write_tokens   = 0
gen_ai.usage.output_tokens        = 503
gen_ai.cache.hit_rate             = 0.975
gen_ai.cache.ttl_tier             = 1h
gen_ai.tfy.virtual_account        = tenant-123

Any dashboard query, cost-attribution job, or router-tuning analysis can read these fields without branching on provider. The cache hit rate computed from them is comparable across Anthropic, OpenAI, and Bedrock traffic, which means one dashboard panel can answer "which provider is giving us the best cache economics on this workload" without three different SQL templates.

Two caveats. The cache-specific attributes follow OTel's gen_ai.* namespace pattern but aren't (yet) part of the official semantic conventions spec. And hit_rate is an operationally useful summary, not a precise efficiency metric; for forensic cost reconciliation, read the underlying token fields directly.

7 Token Savings Benchmark: 10K-Token System Prompt, 1,000 Requests

The workload most agent applications actually run: a stable 10,000-token system prompt, a 200-token user query that changes per request, a 500-token model response, and 1,000 requests across a session window. Assume a realistic 95% cache hit rate — 950 hits, 50 misses from TTL expirations during idle gaps. All numbers in $USD against May-2026 list pricing.

**Figure 3 —** 1,000 requests, identical workload, three pricing paths. Anthropic caching cuts total cost 66%; OpenAI auto-caching cuts it 69%. The bulk of the residual cost is output tokens, which neither provider's prefix cache affects.

Two things stand out. Both providers deliver large, roughly comparable savings — the gap between them is small relative to the gap from no-caching. And with caching enabled, output tokens become the dominant cost line (~58% of the Anthropic total, ~65% of the OpenAI total). That changes which optimizations matter next: once caching is on, the next dollar comes from output-length control, not from a deeper input optimization.

8 Prompt Caching Anti-Patterns That Kill Your Hit Rate

The cache key is exact-match on bytes, before any of the cleverness above kicks in. A handful of common patterns silently destroy hit rates, and they all share the same shape: something volatile gets concatenated before or inside the supposedly stable prefix.

Timestamps inside the system prompt

"Today is {{date}}." produces a new cache key every day. If the date matters, put it in the user message, not the system prompt.

Request IDs / correlation tokens in the prefix

Injecting a UUID for downstream logging defeats caching entirely. Pass these as message metadata, not prompt content.

Per-user system prompts

"You are talking to {{user_name}}." produces N caches for N users. Personalize in the first user-turn message instead.

Randomized tool order

Some agent frameworks shuffle tools to combat positional bias. The shuffle changes the byte sequence and kills the cache.

Model swaps between turns

Caches are model-scoped. Routing turn N to Sonnet and turn N+1 to Haiku means turn N+1 starts cold.

A/B-testing prompt edits

Half-and-half experiments fragment the cache. Run A/B tests at the user level so each cohort builds a coherent cache.

The gateway can detect most of these passively by watching cache hit rate per virtual account. A 10K-token system prompt consistently running below a 40% hit rate is almost always a content-volatility problem, and the trace timeline usually shows which segment is changing. The TrueFoundry dashboard surfaces this as a "low cache hit rate" alert with prefix-diff context, so the fix is usually a five-minute prompt edit.

9 FAQs

Do I have to use Anthropic's cache_control syntax even if I'm only calling OpenAI? −

No — but it costs you nothing and gives you portability. If you write client code against cache_control and later route a subset of traffic through Anthropic or Bedrock, no client changes are needed. If you write it against OpenAI's parameters directly, every future provider addition becomes a client-side refactor.

What's the minimum prefix length that actually caches? −

1,024 tokens for Anthropic Sonnet-class and OpenAI; 2,048 for Anthropic Opus-class; 1,024 for Bedrock Converse. Below the threshold the directive is accepted and silently dropped. If you're not seeing hits on what you thought was a long prefix, count the tokens before assuming the cache is broken.

How does the gateway handle a 1-hour TTL request when the routed model doesn't support it? −

Model resolution happens before cache translation. If the router lands on a model that doesn't support extended TTL, the translator emits the 5-minute default and tags the span with gen_ai.tfy.cache_ttl_downgrade=5m — the downgrade is visible in the trace.

Can I share a cache across providers — Anthropic and Bedrock both serve Claude? −

No. Caches are infrastructure-scoped — even though Anthropic's API and Bedrock's Claude routes use the same model weights, the KV cache lives in different machines on different accounts. Sending the same request to both targets builds two independent caches and pays the write surcharge twice.

How does this compose with semantic caching at the gateway? −

The two layers are complementary. Provider prompt caching reduces input-token cost but still generates a fresh output; semantic caching at the gateway eliminates the model call entirely on a hit. Use semantic cache as the first lookup, provider cache as the fallback. We cover the gateway layer in Semantic Caching: When Text Stops Being the Right Cache Key.

Is there a way to verify cache-attribution math against the actual bill? −

Yes — pull each provider's billing export for a fixed window, sum the gateway's computed costs over the same window grouped by provider and TTL tier, and reconcile. If the gateway's number is more than ~2% off the bill, something in the normalization path is wrong; under 2% is usually rounding.

‍

TrueFoundry AI Gateway delivers ~3–4 ms latency, handles 350+ RPS on 1 vCPU, scales horizontally with ease, and is production-ready, while LiteLLM suffers from high latency, struggles beyond moderate RPS, lacks built-in scaling, and is best for light or prototype workloads.

Built for Speed: ~10ms Latency, Even Under Load

Schedule your Demo Now