Prompt Caching
Prompt caching reduces processing time and costs by reusing previously computed prefixes. When you send the same or similar prompts repeatedly (e.g. alarge system prompt, a shared context document, or a set of tool definitions), the provider can skip reprocessing the cached portion and only compute the new tokens. This leads to lower latency and reduced costs on subsequent requests.
You can send
cache_control for any model. The Gateway handles the per provider caching logic internally, forwarding it to providers that read cache_control and stripping it for providers that cache prefixes on their own. This means you can write one request and use it across providers without worrying about compatibility.How to Enable
Add"cache_control": {"type": "ephemeral"} to any system prompt, user message content block, or tool definition:
Provider Support
There are three caching styles. The first two are marker-based — you sendcache_control and the provider caches based on it (this uses the same explicit vs automatic distinction as the Messages API):
- Explicit — you place
cache_controlon individual blocks (asystemblock, a message content block, or atooldefinition). - Automatic — you place a single
cache_controlfield at the top level of the request and the provider manages the breakpoint for you. Supported on Anthropic, Claude Platform on AWS, and Azure AI Foundry. - Provider-managed — the provider caches repeated prefixes on its own with no markup, so the Gateway strips any
cache_controlyou send.
| Provider | Caching style | How cache_control is handled |
|---|---|---|
| Anthropic (direct), Azure AI Foundry (Claude) | Explicit + automatic | Forwarded to the provider unchanged as native Anthropic cache_control (optional ttl supported). |
| Google Vertex (Claude), Databricks (Claude) | Explicit | Forwarded to the provider unchanged as native Anthropic cache_control (optional ttl supported). |
| AWS Bedrock (Claude) | Explicit | Chat completions run through Bedrock’s Converse API, so each block-level cache_control is translated into a native cachePoint marker. A top-level cache_control is dropped (no automatic caching on Bedrock). |
| OpenAI / Azure OpenAI | Provider-managed | cache_control stripped before forwarding. Optionally pass prompt_cache_key. |
| Google Gemini / Google Vertex (Gemini) | Provider-managed | cache_control stripped before forwarding. |
| Groq, xAI, and others | Provider-managed | cache_control stripped before forwarding. |
The examples above use explicit caching (block-level
cache_control), which works on every Claude provider. For automatic caching (a single top-level cache_control) and its provider availability, see Messages API caching — the same rules apply here.Anthropic (and Bedrock / Vertex / Azure / Databricks Claude)
Anthropic (and Bedrock / Vertex / Azure / Databricks Claude)
For Anthropic (direct), Google Vertex (Claude), Azure AI Foundry (Claude), and Databricks (Claude), Anthropic enforces a minimum content length for caching to take effect; shorter prompts accept the
cache_control is forwarded to the provider unchanged. You can also include an optional ttl field (e.g. "5m" or "1h") to control cache duration.For AWS Bedrock, the gateway translates each cache_control block into Bedrock’s native Converse cachePoint format, so your request code stays the same. Converse cachePoint markers do not carry a duration, so any ttl you set is ignored on Bedrock chat completions.On the Messages API (
/messages), Bedrock Claude models are served through the InvokeModel API with the native Anthropic body instead of Converse. There, cache_control (and ttl) is forwarded unchanged rather than translated to cachePoint.cache_control hint but are not cached:| Minimum tokens | Models |
|---|---|
4096 | Claude Mythos Preview, Opus 4.7, Opus 4.6, Opus 4.5, Haiku 4.5 |
2048 | Sonnet 4.6, Haiku 3.5, Haiku 3 |
1024 | Sonnet 4.5, Opus 4.1, Opus 4, Sonnet 4, Sonnet 3.7 |
For Amazon Titan and Nova models on Bedrock,
cache_control on tool definitions is automatically skipped since these models do not support cache points on tools.OpenAI / Azure OpenAI
OpenAI / Azure OpenAI
OpenAI and Azure OpenAI are provider-managed — they cache matching prefixes on their own. No
cache_control markup is needed, and the gateway strips it before forwarding.You can optionally pass prompt_cache_key to group requests that share a common prefix, improving cache hit rates:prompt_cache_key is only supported for OpenAI and Azure OpenAI.Gemini / Groq / xAI
Gemini / Groq / xAI
These providers are provider-managed — they handle caching on their own end. No
cache_control markup is needed, and the gateway strips it before forwarding.Cached token counts are still reported in the response usage when the provider returns them.Cache Usage in Responses
When caching is active, the responseusage object includes cached token counts:
prompt_tokens_details.cached_tokens: tokens served from cache. Available across all providers that report cache usage.cache_read_input_tokens/cache_creation_input_tokens: Anthropic style fields, present for Anthropic, Bedrock, and Groq when values are non zero.
Reasoning Models
TrueFoundry AI Gateway provides access to model reasoning processes through thinking/reasoning tokens, available for models from multiple providers includingAnthropic,OpenAI,Azure OpenAI,Groq, xAI and Vertex.
These models expose their internal reasoning process, allowing you to see how they arrive at conclusions. The thinking/reasoning tokens provide step-by-step insights into the model’s cognitive process.
Supported Reasoning Models
OpenAI
OpenAI
Supported models:
o4-mini, o4-preview, o3 model family, o1 model family, gpt-5-mini, gpt-5-nano, gpt-5Azure OpenAI
Azure OpenAI
Supported models:
gpt-5, gpt-5-mini, gpt-5-nano, o3-pro, codex-mini, o4-mini, o3, o3-mini, o1, o1-miniAnthropic
Anthropic
Supported models:
viaUsing Direct API Calls with Native
For more precise control with Anthropic models, you can use the native
Claude Opus 4.1 (claude-opus-4-1-20250805), Claude Opus 4 (claude-opus-4-20250514), Claude Sonnet 4 (claude-sonnet-4-20250514), Claude Sonnet 3.7 (claude-3-7-sonnet-20250219) via
Anthropic, AWS Bedrock, and Google Vertex AIUsing OpenAI SDK
For Anthropic models (from Anthropic, Google Vertex AI, AWS Bedrock), TrueFoundry automatically translates the
reasoning_effort parameter into Anthropic’s native thinking parameter format since Anthropic doesn’t support the reasoning_effort parameter directly.The translation uses the max_tokens parameter with the following ratios:none: 0% of max_tokenslow: 30% of max_tokensmedium: 60% of max_tokenshigh: 90% of max_tokens
Using Direct API Calls with Native thinking Parameter
For more precise control with Anthropic models, you can use the native thinking parameter directly:Groq
Groq
Supported models:
OpenAI GPT-OSS 20B (openai/gpt-oss-20b), OpenAI GPT-OSS 120B (openai/gpt-oss-120b), Qwen 3 32B (qwen/qwen3-32b), DeepSeek R1 Distil Llama 70B (deepseek-r1-distill-llama-70b)xAI
xAI
Supported models:
grok-3-mini (with reasoning_effort parameter), grok-4-0709, grok-4-1-fast-reasoning, grok-4-fast-reasoning (reasoning built-in)For grok-3-mini, you can use the reasoning_effort parameter to control reasoning depth. Other Grok models like grok-4-0709 have reasoning capabilities built-in but do not support the reasoning_effort parameter.The
reasoning_effort parameter is only supported for grok-3-mini. For other Grok models like grok-4-0709 and grok-4-1-fast-reasoning, reasoning is built-in and the reasoning_effort parameter should not be used. Reasoning tokens are included in the usage metrics for all reasoning-capable models.Parameter Restrictions: Reasoning models (like grok-4-0709 and grok-4-1-fast-reasoning) do not support presence_penalty, frequency_penalty, or stop parameters. Using these parameters with reasoning models will result in an error.Gemini
Gemini
Supported models: All Using Direct API Calls with Native
For more precise control with Gemini models, you can use the native
Gemini 2.5 Series Models.These models can be accessed from Google Vertex or Google Gemini ProvidersFor Gemini models (from Anthropic, Google Vertex AI, AWS Bedrock), TrueFoundry automatically translates the
reasoning_effort parameter into Gemini’s native thinking parameter format since Gemini doesn’t support the reasoning_effort parameter directly.The translation uses the max_tokens parameter with the following ratios:none: 0% of max_tokenslow: 30% of max_tokensmedium: 60% of max_tokenshigh: 90% of max_tokens
Using Direct API Calls with Native thinking Parameter
For more precise control with Gemini models, you can use the native thinking parameter directly: