Prompt Caching

Prompt caching reduces processing time and costs by reusing previously computed prefixes. When you send the same or similar prompts repeatedly (e.g. a large system prompt, a shared context document, or a set of tool definitions), the provider can skip reprocessing the cached portion and only compute the new tokens. This leads to lower latency and reduced costs on subsequent requests.
You can send cache_control for any model. The gateway handles the per-provider caching logic internally: it forwards cache_control to providers that read it and strips it for providers that cache prefixes on their own. This means you can write one request and use it across providers without worrying about compatibility.

How to Enable

Add "cache_control": {"type": "ephemeral"} to any system prompt, user message content block, or tool definition:
from openai import OpenAI

client = OpenAI(
    api_key="TFY_API_KEY",
    base_url="{GATEWAY_BASE_URL}"
)

response = client.chat.completions.create(
    model="anthropic/claude-opus-4-6",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "Your large system prompt here...",
                    "cache_control": {"type": "ephemeral"},
                },
            ],
        },
        {"role": "user", "content": "Your question here"},
    ]
)

Provider Support

There are three caching styles. The first two are marker-based — you send cache_control and the provider caches based on it (this uses the same explicit vs automatic distinction as the Messages API):
  • Explicit — you place cache_control on individual blocks (a system block, a message content block, or a tool definition).
  • Automatic — you place a single cache_control field at the top level of the request and the provider manages the breakpoint for you. Supported on Anthropic, Claude Platform on AWS, and Azure AI Foundry.
  • Provider-managed — the provider caches repeated prefixes on its own with no markup, so the Gateway strips any cache_control you send.
| Provider | Caching style | How cache_control is handled |
| --- | --- | --- |
| Anthropic (direct), Azure AI Foundry (Claude) | Explicit + automatic | Forwarded to the provider unchanged as native Anthropic cache_control (optional ttl supported). |
| Google Vertex (Claude), Databricks (Claude) | Explicit | Forwarded to the provider unchanged as native Anthropic cache_control (optional ttl supported). |
| AWS Bedrock (Claude) | Explicit | Chat completions run through Bedrock’s Converse API, so each block-level cache_control is translated into a native cachePoint marker. A top-level cache_control is dropped (no automatic caching on Bedrock). |
| OpenAI / Azure OpenAI | Provider-managed | cache_control stripped before forwarding. Optionally pass prompt_cache_key. |
| Google Gemini / Google Vertex (Gemini) | Provider-managed | cache_control stripped before forwarding. |
| Groq, xAI, and others | Provider-managed | cache_control stripped before forwarding. |
The examples above use explicit caching (block-level cache_control), which works on every Claude provider. For automatic caching (a single top-level cache_control) and its provider availability, see Messages API caching — the same rules apply here.
For Anthropic (direct), Google Vertex (Claude), Azure AI Foundry (Claude), and Databricks (Claude), cache_control is forwarded to the provider unchanged. You can also include an optional ttl field (e.g. "5m" or "1h") to control cache duration.
For AWS Bedrock, the gateway translates each block-level cache_control into Bedrock’s native Converse cachePoint format, so your request code stays the same. Converse cachePoint markers do not carry a duration, so any ttl you set is ignored on Bedrock chat completions.
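The difference between forwarding and translating can be sketched as follows. This is only an illustration, not the gateway's implementation: `cached_text_block` and `to_converse_content` are hypothetical helpers, the ttl is assumed to sit inside the cache_control object, and the `{"cachePoint": {"type": "default"}}` shape is an assumption about the Converse marker format.

```python
from typing import Any, Optional


def cached_text_block(text: str, ttl: Optional[str] = None) -> dict[str, Any]:
    """Build a text content block with an explicit cache marker."""
    cache_control: dict[str, Any] = {"type": "ephemeral"}
    if ttl is not None:
        cache_control["ttl"] = ttl  # e.g. "5m" or "1h" (assumed placement)
    return {"type": "text", "text": text, "cache_control": cache_control}


def to_converse_content(blocks: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Illustrative translation of text blocks into Converse-style content.

    A block that carries cache_control is followed by a cachePoint marker.
    The ttl is dropped because the marker has no duration field.
    """
    converse: list[dict[str, Any]] = []
    for block in blocks:
        converse.append({"text": block["text"]})
        if "cache_control" in block:
            converse.append({"cachePoint": {"type": "default"}})
    return converse


block = cached_text_block("Your large system prompt here...", ttl="1h")
converse_content = to_converse_content([block])
```

On providers that forward cache_control unchanged, `block` would be sent as built; on Bedrock chat completions, only the marker position survives.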
On the Messages API (/messages), Bedrock Claude models are served through the InvokeModel API with the native Anthropic body instead of Converse. There, cache_control (and ttl) is forwarded unchanged rather than translated to cachePoint.
Anthropic enforces a minimum content length for caching to take effect; shorter prompts accept the cache_control hint but are not cached:
| Minimum tokens | Models |
| --- | --- |
| 4096 | Claude Mythos Preview, Opus 4.7, Opus 4.6, Opus 4.5, Haiku 4.5 |
| 2048 | Sonnet 4.6, Haiku 3.5, Haiku 3 |
| 1024 | Sonnet 4.5, Opus 4.1, Opus 4, Sonnet 4, Sonnet 3.7 |
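A quick pre-flight check against these thresholds might look like the sketch below. The lookup table copies the values listed above; `meets_cache_minimum` is a hypothetical helper, the model labels are display names rather than gateway model IDs, and the token count is assumed to come from your own tokenizer or a previous response's usage.

```python
# Minimum cacheable prompt length per model, copied from the table above.
MIN_CACHE_TOKENS = {
    "Claude Mythos Preview": 4096,
    "Opus 4.7": 4096,
    "Opus 4.6": 4096,
    "Opus 4.5": 4096,
    "Haiku 4.5": 4096,
    "Sonnet 4.6": 2048,
    "Haiku 3.5": 2048,
    "Haiku 3": 2048,
    "Sonnet 4.5": 1024,
    "Opus 4.1": 1024,
    "Opus 4": 1024,
    "Sonnet 4": 1024,
    "Sonnet 3.7": 1024,
}


def meets_cache_minimum(model: str, prompt_tokens: int) -> bool:
    """Return True if the prompt is long enough for caching to take effect."""
    return prompt_tokens >= MIN_CACHE_TOKENS[model]


# A 1500-token prefix is cacheable on Sonnet 4.5 but too short for Opus 4.6.
sonnet_ok = meets_cache_minimum("Sonnet 4.5", 1500)
opus_ok = meets_cache_minimum("Opus 4.6", 1500)
```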
For Amazon Titan and Nova models on Bedrock, cache_control on tool definitions is automatically skipped since these models do not support cache points on tools.
OpenAI and Azure OpenAI are provider-managed: they cache matching prefixes on their own. No cache_control markup is needed, and the gateway strips it before forwarding.
You can optionally pass prompt_cache_key to group requests that share a common prefix, improving cache hit rates:
response = client.chat.completions.create(
    model="openai-main/gpt-4o",
    messages=[{"role": "user", "content": "Your prompt here"}],
    prompt_cache_key="optional-custom-key"
)
prompt_cache_key is only supported for OpenAI and Azure OpenAI.
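One way to keep the key stable across requests that share a prefix is to derive it from the prefix itself. The sketch below is an assumption about how you might choose a key, not a documented requirement: `cache_key_for_prefix` and the `namespace` argument are hypothetical, and any stable string would work just as well.

```python
import hashlib


def cache_key_for_prefix(shared_prefix: str, namespace: str = "app") -> str:
    """Derive a short, deterministic key from the shared prompt prefix."""
    digest = hashlib.sha256(shared_prefix.encode("utf-8")).hexdigest()[:16]
    return f"{namespace}-{digest}"


system_prompt = "Your large system prompt here..."
key_a = cache_key_for_prefix(system_prompt)
key_b = cache_key_for_prefix(system_prompt)
key_other = cache_key_for_prefix("A different system prompt")
```

The resulting string can be passed as prompt_cache_key so that requests built on the same system prompt are grouped together.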
Google Gemini, Google Vertex (Gemini), Groq, xAI, and other providers are provider-managed: they handle caching on their own end. No cache_control markup is needed, and the gateway strips it before forwarding.
Cached token counts are still reported in the response usage when the provider returns them.
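To make "stripping" concrete, here is a minimal sketch of removing every cache_control key from a request body. It illustrates the effect only and is not the gateway's actual code; `strip_cache_control` is a hypothetical helper.

```python
from typing import Any


def strip_cache_control(payload: Any) -> Any:
    """Return a copy of payload with every cache_control key removed."""
    if isinstance(payload, dict):
        return {
            key: strip_cache_control(value)
            for key, value in payload.items()
            if key != "cache_control"
        }
    if isinstance(payload, list):
        return [strip_cache_control(item) for item in payload]
    return payload  # strings, numbers, None, etc. pass through unchanged


request = {
    "model": "openai-main/gpt-4o",
    "messages": [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "Your large system prompt here...",
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
        {"role": "user", "content": "Your question here"},
    ],
}
stripped = strip_cache_control(request)
```

The original request is left untouched, which is why the same request body can be reused across providers.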

Cache Usage in Responses

When caching is active, the response usage object includes cached token counts:
{
  "usage": {
    "prompt_tokens": 1500,
    "completion_tokens": 200,
    "total_tokens": 1700,
    "prompt_tokens_details": {
      "cached_tokens": 1200
    },
    "cache_read_input_tokens": 1200,
    "cache_creation_input_tokens": 300
  }
}
  • prompt_tokens_details.cached_tokens: tokens served from cache. Available across all providers that report cache usage.
  • cache_read_input_tokens / cache_creation_input_tokens: Anthropic-style fields, present for Anthropic, Bedrock, and Groq when the values are nonzero.
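These fields make it straightforward to track how much of each prompt was served from cache. The sketch below assumes that prompt_tokens includes the cached tokens, as in the sample above; `cache_hit_ratio` is a hypothetical helper.

```python
def cache_hit_ratio(usage: dict) -> float:
    """Fraction of prompt tokens served from cache (0.0 if unknown)."""
    prompt_tokens = usage.get("prompt_tokens", 0)
    details = usage.get("prompt_tokens_details") or {}
    cached_tokens = details.get("cached_tokens", 0)
    return cached_tokens / prompt_tokens if prompt_tokens else 0.0


usage = {
    "prompt_tokens": 1500,
    "completion_tokens": 200,
    "total_tokens": 1700,
    "prompt_tokens_details": {"cached_tokens": 1200},
    "cache_read_input_tokens": 1200,
    "cache_creation_input_tokens": 300,
}
ratio = cache_hit_ratio(usage)  # 1200 cached out of 1500 prompt tokens
```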

Reasoning Models

TrueFoundry AI Gateway provides access to model reasoning through thinking/reasoning tokens, available for models from multiple providers including Anthropic, OpenAI, Azure OpenAI, Groq, xAI, and Vertex. These models expose their internal reasoning, giving you step-by-step insight into how they arrive at conclusions.

Supported Reasoning Models

Supported models (OpenAI): o4-mini, o4-preview, o3 model family, o1 model family, gpt-5-mini, gpt-5-nano, gpt-5
from openai import OpenAI

client = OpenAI(
    api_key="TFY_API_KEY",
    base_url="{GATEWAY_BASE_URL}"
)

response = client.chat.completions.create(
    model="openai-main/o4-mini",
    messages=[
        {"role": "user", "content": "How to compute 3^3^3?"}
    ],
    reasoning_effort="high",  # Options: "high", "medium", "low"
    max_tokens=8000
)

print(response.choices[0].message.content)
Supported models (Azure OpenAI): gpt-5, gpt-5-mini, gpt-5-nano, o3-pro, codex-mini, o4-mini, o3, o3-mini, o1, o1-mini
from openai import OpenAI

client = OpenAI(
    api_key="TFY_API_KEY",
    base_url="{GATEWAY_BASE_URL}"
)

response = client.chat.completions.create(
    model="azure-openai-main/o3-mini",
    messages=[
        {"role": "user", "content": "How to compute 3^3^3?"}
    ],
    reasoning_effort="high",  # Options: "high", "medium", "low"
    max_tokens=8000
)

print(response.choices[0].message.content)
Supported models (Anthropic): Claude Opus 4.1 (claude-opus-4-1-20250805), Claude Opus 4 (claude-opus-4-20250514), Claude Sonnet 4 (claude-sonnet-4-20250514), Claude Sonnet 3.7 (claude-3-7-sonnet-20250219), available via Anthropic, AWS Bedrock, and Google Vertex AI.

Using OpenAI SDK

from openai import OpenAI

client = OpenAI(
    api_key="TFY_API_KEY",
    base_url="{GATEWAY_BASE_URL}"
)

response = client.chat.completions.create(
    model="anthropic-main/claude-3-7-sonnet",
    messages=[
        {"role": "user", "content": "How to compute 3^3^3?"}
    ],
    reasoning_effort="high",  # Options: "high", "medium", "low", "none"
    max_tokens=8000
)

print(response.choices[0].message.content)
For Anthropic models (from Anthropic, Google Vertex AI, AWS Bedrock), TrueFoundry automatically translates the reasoning_effort parameter into Anthropic’s native thinking parameter format, since Anthropic doesn’t support the reasoning_effort parameter directly.
The translation derives the thinking budget from max_tokens using the following ratios:
  • none: 0% of max_tokens
  • low: 30% of max_tokens
  • medium: 60% of max_tokens
  • high: 90% of max_tokens
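The ratios above amount to a simple calculation. The sketch below is an illustration only: `thinking_budget` is a hypothetical helper, and rounding down to a whole token is an assumption, since the rounding rule is not stated here.

```python
# Percentage of max_tokens allotted to thinking for each reasoning_effort value.
EFFORT_PERCENT = {"none": 0, "low": 30, "medium": 60, "high": 90}


def thinking_budget(reasoning_effort: str, max_tokens: int) -> int:
    """Thinking budget implied by reasoning_effort (rounded down, assumed)."""
    return max_tokens * EFFORT_PERCENT[reasoning_effort] // 100


# With max_tokens=8000 as in the example request:
budget_high = thinking_budget("high", 8000)      # 7200
budget_medium = thinking_budget("medium", 8000)  # 4800
budget_low = thinking_budget("low", 8000)        # 2400
```

Because the budget scales with max_tokens, raising max_tokens is the way to give the model more room to think at a given effort level.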

Using Direct API Calls with Native thinking Parameter

For more precise control with Anthropic models, you can use the native thinking parameter directly:
from openai import OpenAI

client = OpenAI(
    api_key="TFY_API_KEY",
    base_url="{GATEWAY_BASE_URL}"
)

response = client.chat.completions.create(
    model="anthropic-main/claude-3-7-sonnet",
    messages=[
        {"role": "user", "content": "How to compute 3^3^3?"}
    ],
    max_tokens=8000,
    extra_body={
        "thinking": {
            "type": "enabled",
            "budget_tokens": 1000
        }
    }
)

print(response.choices[0].message.content)
Supported models (Groq): OpenAI GPT-OSS 20B (openai/gpt-oss-20b), OpenAI GPT-OSS 120B (openai/gpt-oss-120b), Qwen 3 32B (qwen/qwen3-32b), DeepSeek R1 Distill Llama 70B (deepseek-r1-distill-llama-70b)
from openai import OpenAI

client = OpenAI(
    api_key="TFY_API_KEY",
    base_url="{GATEWAY_BASE_URL}"
)

response = client.chat.completions.create(
    model="groq-main/deepseek-r1-distill-llama-70b",
    messages=[
        {"role": "user", "content": "How to compute 3^3^3?"}
    ],
    reasoning_effort="high",  # Options: "high", "medium", "low"
    max_tokens=8000
)

print(response.choices[0].message.content)
Supported models (xAI): grok-3-mini (with reasoning_effort parameter); grok-4-0709, grok-4-1-fast-reasoning, and grok-4-fast-reasoning (reasoning built-in).
For grok-3-mini, you can use the reasoning_effort parameter to control reasoning depth. Other Grok models like grok-4-0709 have reasoning capabilities built-in but do not support the reasoning_effort parameter.
from openai import OpenAI

client = OpenAI(
    api_key="TFY_API_KEY",
    base_url="{GATEWAY_BASE_URL}"
)

# For grok-3-mini with reasoning_effort parameter
response = client.chat.completions.create(
    model="xai-main/grok-3-mini",
    messages=[
        {"role": "user", "content": "How to compute 3^3^3?"}
    ],
    reasoning_effort="high",  # Options: "high", "low" (only for grok-3-mini)
    max_tokens=8000
)

print(response.choices[0].message.content)
The reasoning_effort parameter is only supported for grok-3-mini. For other Grok models like grok-4-0709 and grok-4-1-fast-reasoning, reasoning is built-in and the reasoning_effort parameter should not be used. Reasoning tokens are included in the usage metrics for all reasoning-capable models.
Parameter Restrictions: Reasoning models (like grok-4-0709 and grok-4-1-fast-reasoning) do not support presence_penalty, frequency_penalty, or stop parameters. Using these parameters with reasoning models will result in an error.
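If the same request-building code targets several Grok models, a small guard can drop the unsupported parameters before sending. This is a sketch under stated assumptions: `sanitize_grok_params` is a hypothetical helper, the model set lists only the built-in reasoning models named above, and grok-3-mini is left untouched because the text does not say whether the penalty and stop restrictions apply to it.

```python
BUILT_IN_REASONING_MODELS = {
    "grok-4-0709",
    "grok-4-1-fast-reasoning",
    "grok-4-fast-reasoning",
}
RESTRICTED_PARAMS = ("presence_penalty", "frequency_penalty", "stop")


def sanitize_grok_params(model: str, params: dict) -> dict:
    """Drop parameters that built-in reasoning Grok models reject."""
    model_name = model.split("/")[-1]  # strip a provider prefix such as "xai-main/"
    if model_name not in BUILT_IN_REASONING_MODELS:
        return dict(params)
    blocked = set(RESTRICTED_PARAMS) | {"reasoning_effort"}
    return {key: value for key, value in params.items() if key not in blocked}


params = {"reasoning_effort": "high", "stop": ["\n"], "max_tokens": 8000}
grok4_params = sanitize_grok_params("xai-main/grok-4-0709", params)
mini_params = sanitize_grok_params("xai-main/grok-3-mini", params)
```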
Supported models (Gemini): all Gemini 2.5 series models. These models can be accessed through the Google Vertex or Google Gemini providers.
from openai import OpenAI

client = OpenAI(
    api_key="TFY_API_KEY",
    base_url="{GATEWAY_BASE_URL}"
)

response = client.chat.completions.create(
    model="vertex-main/gemini-2-5-pro",
    messages=[
        {"role": "user", "content": "How to compute 3^3^3?"}
    ],
    reasoning_effort="high",  # Options: "high", "medium", "low", "none"
    max_tokens=8000
)

print(response.choices[0].message.content)
For Gemini models (from Google Gemini or Google Vertex AI), TrueFoundry automatically translates the reasoning_effort parameter into Gemini’s native thinking parameter format, since Gemini doesn’t support the reasoning_effort parameter directly.
The translation derives the thinking budget from max_tokens using the following ratios:
  • none: 0% of max_tokens
  • low: 30% of max_tokens
  • medium: 60% of max_tokens
  • high: 90% of max_tokens
Note: Gemini 2.5 Pro and 2.5 Flash come with reasoning on by default.

Using Direct API Calls with Native thinking Parameter

For more precise control with Gemini models, you can use the native thinking parameter directly:
from openai import OpenAI

client = OpenAI(
    api_key="TFY_API_KEY",
    base_url="{GATEWAY_BASE_URL}"
)

response = client.chat.completions.create(
    model="vertex-main/gemini-2-5-pro",
    messages=[
        {"role": "user", "content": "How to compute 3^3^3?"}
    ],
    max_tokens=8000,
    extra_body={
        "thinking": {
            "type": "enabled",
            "budget_tokens": 1000
        }
    }
)

print(response.choices[0].message.content)

Response Format

When reasoning tokens are enabled, the response includes both thinking and content sections:
{
  "id": "1742890579083",
  "object": "chat.completion",
  "created": 1742890579,
  "model": "",
  "provider": "aws",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Therefore: 3^3^3 = 7,625,597,484,987",
        "reasoning_content": "Exponentiation is right-associative: compute 3^3 = 27, then 3^27..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 45,
    "completion_tokens": 180,
    "total_tokens": 225
  }
}
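Reading the two sections back out of a response body shaped like the sample above can be sketched as follows. `split_reasoning` is a hypothetical helper that works on the parsed JSON; the final line simply checks the sample answer using Python's right-associative `**` operator.

```python
from typing import Optional, Tuple


def split_reasoning(response: dict) -> Tuple[Optional[str], Optional[str]]:
    """Return (reasoning_content, content) from the first choice."""
    message = response["choices"][0]["message"]
    return message.get("reasoning_content"), message.get("content")


sample = {
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Therefore: 3^3^3 = 7,625,597,484,987",
                "reasoning_content": "Exponentiation is right-associative: compute 3^3 = 27, then 3^27...",
            },
            "finish_reason": "stop",
        }
    ]
}
reasoning, answer = split_reasoning(sample)

# 3 ** 3 ** 3 is evaluated as 3 ** (3 ** 3) = 3 ** 27.
expected_value = 3 ** 3 ** 3
```

Using `.get` keeps the helper safe for models that return no reasoning_content, in which case the first element is None.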

Streaming with Reasoning Tokens

For streaming responses, the thinking section is always sent before the content section.
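A consumer can therefore accumulate the two sections separately as chunks arrive. The sketch below works on plain delta dictionaries rather than a live stream, and it assumes that streamed deltas carry the thinking text under a `reasoning_content` key mirroring the non-streaming field; verify the field name against the actual chunks. `collect_stream` is a hypothetical helper.

```python
from typing import Iterable, Tuple


def collect_stream(deltas: Iterable[dict]) -> Tuple[str, str]:
    """Accumulate thinking and content text from a sequence of stream deltas."""
    reasoning_parts = []
    content_parts = []
    for delta in deltas:
        if delta.get("reasoning_content"):  # assumed field name
            reasoning_parts.append(delta["reasoning_content"])
        if delta.get("content"):
            content_parts.append(delta["content"])
    return "".join(reasoning_parts), "".join(content_parts)


# Simulated deltas: thinking chunks first, then content chunks.
fake_deltas = [
    {"role": "assistant"},
    {"reasoning_content": "compute 3^3 = 27, "},
    {"reasoning_content": "then 3^27"},
    {"content": "3^3^3 = "},
    {"content": "7,625,597,484,987"},
]
thinking_text, content_text = collect_stream(fake_deltas)
```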