Chat Completions: Caching & Reasoning

Provider support for response schema

You can use response_format with any provider. The Gateway either uses the provider’s native structured output or converts your schema into a tool that the model must call, then places the result in message.content.

Provider	Support
OpenAI	Native for `json_object` and for `json_schema` on supported models (e.g. gpt-4o, gpt-5, gpt-4.1, o3, o4). Other models use tool conversion.
Azure OpenAI	Same as OpenAI.
Anthropic	Native for Claude 4.5/4.6 with `json_schema`. Other models use tool conversion.
Google Gemini, Google Vertex	Native when the request has no tools; otherwise tool conversion.
All others (Bedrock, Cohere, Mistral, OpenRouter, Groq, xAI, vLLM, etc.)	Tool conversion only. The Gateway turns your schema into a required tool and extracts the result into `message.content`.
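
To make the tool-conversion path more concrete, here is a rough sketch of how a `json_schema` response format could be rewritten as a single required tool. It only illustrates the idea and is not the Gateway's actual implementation; the helper name `response_format_to_tool` and the fallback tool name are hypothetical.

```python
from typing import Any


def response_format_to_tool(response_format: dict[str, Any]) -> dict[str, Any]:
    """Rewrite an OpenAI-style json_schema response_format as a forced tool call.

    Hypothetical helper: it only illustrates the conversion described above.
    """
    if response_format.get("type") != "json_schema":
        raise ValueError("this sketch only handles json_schema response formats")

    spec = response_format["json_schema"]
    name = spec.get("name", "structured_output")  # fallback name is an assumption

    return {
        # The user's schema becomes the tool's parameters.
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": name,
                    "description": spec.get("description", "Return the structured result."),
                    "parameters": spec["schema"],
                },
            }
        ],
        # Forcing this tool means the model has to answer by calling it.
        "tool_choice": {"type": "function", "function": {"name": name}},
    }


person_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "person",
        "schema": {
            "type": "object",
            "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
            "required": ["name", "age"],
        },
    },
}
converted = response_format_to_tool(person_format)
```

With a conversion like this, the arguments of the resulting tool call would be what ends up in `message.content`.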

Anthropic and JSON schema constraints: The code examples in this doc use Pydantic’s ge=0 for fields such as age. Anthropic’s API does not support these constraint parameters in the schema. If you use structured output with Anthropic models, omit ge, le and similar numeric/string constraints from your schema (or use a schema without them). The code will work with Anthropic once those constraints are removed.
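
Pydantic renders ge=0 as the JSON Schema keyword `minimum`, so one way to reuse a constrained schema is to strip such keywords before sending it. The sketch below assumes you hold the schema as a plain dict; the helper name `strip_constraints` and the exact list of keywords are illustrative assumptions, not a list published by Anthropic or the Gateway.

```python
from typing import Any

# Illustrative set of numeric/string constraint keywords (an assumption).
CONSTRAINT_KEYWORDS = {
    "minimum", "maximum", "exclusiveMinimum", "exclusiveMaximum",
    "multipleOf", "minLength", "maxLength", "pattern",
}

# Under these keys, dict keys are field or definition names, not keywords.
NAME_MAPS = ("properties", "$defs", "definitions")


def strip_constraints(schema: Any) -> Any:
    """Return a copy of a JSON Schema with constraint keywords removed."""
    if isinstance(schema, list):
        return [strip_constraints(item) for item in schema]
    if not isinstance(schema, dict):
        return schema

    cleaned = {}
    for key, value in schema.items():
        if key in CONSTRAINT_KEYWORDS:
            continue  # drop the constraint itself
        if key in NAME_MAPS and isinstance(value, dict):
            # Keep every field name, even one that happens to be called "minimum".
            cleaned[key] = {name: strip_constraints(sub) for name, sub in value.items()}
        else:
            cleaned[key] = strip_constraints(value)
    return cleaned


schema = {
    "type": "object",
    "properties": {
        "age": {"type": "integer", "minimum": 0},
        "minimum": {"type": "string", "maxLength": 5},
    },
    "required": ["age"],
}
relaxed = strip_constraints(schema)
```

Because the constraints are no longer enforced by the provider, validating the returned JSON against the original schema on your side keeps the same guarantees.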

Prompt Caching

Prompt caching reduces processing time and costs by reusing previously computed prefixes. When you send the same or similar prompts repeatedly (e.g. a large system prompt, a shared context document, or a set of tool definitions), the provider can skip reprocessing the cached portion and only compute the new tokens. This leads to lower latency and reduced costs on subsequent requests.

You can send cache_control for any model. The Gateway handles the per-provider caching logic internally: it forwards cache_control to providers that support explicit caching and strips it for providers that cache automatically. This means you can write one request and use it across providers without worrying about compatibility.

How to Enable

Add "cache_control": {"type": "ephemeral"} to any system prompt, user message content block, or tool definition:

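from openai import OpenAI

client = OpenAI(
    api_key="TFY_API_KEY",
    base_url="{GATEWAY_BASE_URL}"
)

response = client.chat.completions.create(
    model="anthropic/claude-opus-4-6",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "Your large system prompt here...",
                    # Marks everything up to and including this block as cacheable
                    "cache_control": {"type": "ephemeral"},
                },
            ],
        },
        {"role": "user", "content": "Your question here"},
    ]
)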

Provider Support

Provider	Caching type	How `cache_control` is handled
Anthropic (direct, Vertex AI, Azure AI Foundry)	Explicit	Forwarded as is
AWS Bedrock	Explicit	Translated to native `cachePoint` format
OpenAI / Azure OpenAI	Automatic	Stripped by the gateway
Google Gemini / Vertex AI	Automatic	Stripped by the gateway
Groq, xAI, and others	Automatic	Stripped by the gateway
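
For providers in the "Stripped by the gateway" rows, the effect on your request is roughly what the sketch below does to message content blocks. This is only an illustration of the behaviour described in the table, not the Gateway's code; `strip_cache_control` is a hypothetical name, and the sketch covers message content blocks only, not tool definitions.

```python
import copy
from typing import Any


def strip_cache_control(messages: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return a deep copy of `messages` with `cache_control` removed from content blocks."""
    cleaned = copy.deepcopy(messages)
    for message in cleaned:
        content = message.get("content")
        if isinstance(content, list):  # plain-string content has no blocks to clean
            for block in content:
                if isinstance(block, dict):
                    block.pop("cache_control", None)
    return cleaned


messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "Your large system prompt here...",
                "cache_control": {"type": "ephemeral"},
            }
        ],
    },
    {"role": "user", "content": "Your question here"},
]
stripped = strip_cache_control(messages)
```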

Anthropic / Bedrock

For Anthropic, cache_control is forwarded to the provider as is. You can also include an optional ttl field (e.g. "5m") to control cache duration.

For Bedrock, the gateway automatically translates cache_control into Bedrock’s native cachePoint format, so your request code stays the same.

Anthropic enforces a minimum content length for caching to take effect:

Minimum tokens	Models
`4096`	Claude Mythos Preview, Opus 4.7, Opus 4.6, Opus 4.5, Haiku 4.5
`2048`	Sonnet 4.6, Haiku 3.5, Haiku 3
`1024`	Sonnet 4.5, Opus 4.1, Opus 4, Sonnet 4, Sonnet 3.7

For Amazon Titan models on Bedrock, cache_control on tool definitions is automatically skipped since these models do not support cache points on tools.
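
Since a prompt below the minimum length is not cached, it can be useful to attach cache_control only when it can take effect. The sketch below copies a few thresholds from the table above and builds a text block with an optional ttl. The short model labels, the helper name `text_block`, and the idea that you already have a token estimate are all assumptions made for illustration.

```python
from typing import Optional

# Minimum cacheable prompt length, copied from the table above.
# The short keys are hypothetical labels, not Gateway model IDs.
MIN_CACHE_TOKENS = {
    "opus-4.6": 4096,
    "haiku-4.5": 4096,
    "sonnet-4.6": 2048,
    "sonnet-4.5": 1024,
    "sonnet-3.7": 1024,
}


def text_block(
    text: str,
    estimated_tokens: int,
    model_label: str,
    ttl: Optional[str] = None,
) -> dict:
    """Build a text content block, adding cache_control only above the minimum length."""
    block = {"type": "text", "text": text}
    if estimated_tokens >= MIN_CACHE_TOKENS[model_label]:
        cache_control = {"type": "ephemeral"}
        if ttl is not None:
            cache_control["ttl"] = ttl  # e.g. "5m"
        block["cache_control"] = cache_control
    return block


short_block = text_block("small prompt", 3000, "opus-4.6")
cached_block = text_block("large prompt", 3000, "sonnet-4.6", ttl="5m")
```

Here `short_block` carries no cache_control because 3000 tokens is under the 4096 minimum, while `cached_block` does.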

OpenAI / Azure OpenAI

OpenAI and Azure OpenAI cache prompts automatically based on matching prefixes. No cache_control markup is needed; the gateway strips it before forwarding.

You can optionally pass prompt_cache_key to group requests that share a common prefix, improving cache hit rates:

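# Assumes `client` is an OpenAI client pointed at the Gateway.
response = client.chat.completions.create(
    model="openai-main/gpt-4o",
    messages=[{"role": "user", "content": "Your prompt here"}],
    prompt_cache_key="optional-custom-key"  # groups requests that share a prefix
)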

prompt_cache_key is only supported for OpenAI and Azure OpenAI.
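
One possible way to pick a prompt_cache_key is to derive it from the shared prefix itself, so that every request reusing the same system prompt lands in the same group. This is a sketch of that idea, not a Gateway or OpenAI requirement; the helper name `cache_key_for_prefix` and the `docs-example` namespace are hypothetical.

```python
import hashlib


def cache_key_for_prefix(shared_prefix: str, namespace: str = "docs-example") -> str:
    """Derive a short, stable key from the text that requests have in common."""
    digest = hashlib.sha256(shared_prefix.encode("utf-8")).hexdigest()
    return f"{namespace}-{digest[:16]}"


system_prompt = "You are a support assistant for the billing team."
key_a = cache_key_for_prefix(system_prompt)
key_b = cache_key_for_prefix(system_prompt)
key_other = cache_key_for_prefix("You are a support assistant for the sales team.")
```

The same prefix always yields the same key, so the value can be passed as prompt_cache_key without storing it anywhere.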

Gemini / Groq / xAI

These providers handle caching automatically on their end. No cache_control markup is needed; the gateway strips it before forwarding.

Cached token counts are still reported in the response usage when the provider returns them.

Cache Usage in Responses

When caching is active, the response usage object includes cached token counts:

{
  "usage": {
    "prompt_tokens": 1500,
    "completion_tokens": 200,
    "total_tokens": 1700,
    "prompt_tokens_details": {
      "cached_tokens": 1200
    },
    "cache_read_input_tokens": 1200,
    "cache_creation_input_tokens": 300
  }
}

prompt_tokens_details.cached_tokens: tokens served from cache. Available across all providers that report cache usage.
cache_read_input_tokens / cache_creation_input_tokens: Anthropic-style fields, present for Anthropic, Bedrock, and Groq when values are non-zero.
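
To check whether caching is paying off, you can compute a hit ratio from these fields. The sketch below works on the usage object as a plain dict and assumes prompt_tokens includes the cached tokens, as in the sample above; the helper name `cache_summary` is hypothetical.

```python
def cache_summary(usage: dict) -> dict:
    """Summarise cache usage from a chat completion `usage` object."""
    prompt_tokens = usage.get("prompt_tokens", 0)
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    return {
        "cached_tokens": cached,
        # Anthropic-style field; absent for providers that do not report it.
        "cache_write_tokens": usage.get("cache_creation_input_tokens", 0),
        "hit_ratio": cached / prompt_tokens if prompt_tokens else 0.0,
    }


usage = {
    "prompt_tokens": 1500,
    "completion_tokens": 200,
    "total_tokens": 1700,
    "prompt_tokens_details": {"cached_tokens": 1200},
    "cache_read_input_tokens": 1200,
    "cache_creation_input_tokens": 300,
}
summary = cache_summary(usage)
```

For the sample usage above, 1200 of the 1500 prompt tokens were served from cache.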

Reasoning Models

TrueFoundry AI Gateway exposes thinking/reasoning tokens for models from multiple providers, including Anthropic, OpenAI, Azure OpenAI, Groq, xAI, and Vertex. These tokens show the model’s step-by-step reasoning, so you can see how it arrives at a conclusion.

Supported Reasoning Models

OpenAI

Supported models: o4-mini, o4-preview, o3 model family, o1 model family, gpt-5-mini, gpt-5-nano, gpt-5

from openai import OpenAI

client = OpenAI(
    api_key="TFY_API_KEY",
    base_url="{GATEWAY_BASE_URL}"
)

response = client.chat.completions.create(
    model="openai-main/o4-mini",
    messages=[
        {"role": "user", "content": "How to compute 3^3^3?"}
    ],
    reasoning_effort="high",  # Options: "high", "medium", "low"
    max_tokens=8000
)

print(response.choices[0].message.content)

Azure OpenAI

Supported models: gpt-5, gpt-5-mini, gpt-5-nano, o3-pro, codex-mini, o4-mini, o3, o3-mini, o1, o1-mini

from openai import OpenAI

client = OpenAI(
    api_key="TFY_API_KEY",
    base_url="{GATEWAY_BASE_URL}"
)

response = client.chat.completions.create(
    model="azure-openai-main/o3-mini",
    messages=[
        {"role": "user", "content": "How to compute 3^3^3?"}
    ],
    reasoning_effort="high",  # Options: "high", "medium", "low"
    max_tokens=8000
)

print(response.choices[0].message.content)

Anthropic

Supported models: Claude Opus 4.1 (claude-opus-4-1-20250805), Claude Opus 4 (claude-opus-4-20250514), Claude Sonnet 4 (claude-sonnet-4-20250514), Claude Sonnet 3.7 (claude-3-7-sonnet-20250219), available via Anthropic, AWS Bedrock, and Google Vertex AI

Using OpenAI SDK

from openai import OpenAI

client = OpenAI(
    api_key="TFY_API_KEY",
    base_url="{GATEWAY_BASE_URL}"
)

response = client.chat.completions.create(
    model="anthropic-main/claude-3-7-sonnet",
    messages=[
        {"role": "user", "content": "How to compute 3^3^3?"}
    ],
    reasoning_effort="high",  # Options: "high", "medium", "low", "none"
    max_tokens=8000
)

print(response.choices[0].message.content)

For Anthropic models (from Anthropic, Google Vertex AI, AWS Bedrock), TrueFoundry automatically translates the reasoning_effort parameter into Anthropic’s native thinking parameter format, since Anthropic doesn’t support the reasoning_effort parameter directly.

The translation uses the max_tokens parameter with the following ratios:

none: 0% of max_tokens
low: 30% of max_tokens
medium: 60% of max_tokens
high: 90% of max_tokens
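
The ratios above can be turned into a quick calculation of the thinking budget a request would get. This is a sketch of the arithmetic only; the helper name `thinking_budget` is hypothetical and rounding down is an assumption, since the rounding rule is not stated here.

```python
# Percent of max_tokens given to thinking, from the list above.
EFFORT_PERCENT = {"none": 0, "low": 30, "medium": 60, "high": 90}


def thinking_budget(reasoning_effort: str, max_tokens: int) -> int:
    """Return the thinking budget implied by reasoning_effort and max_tokens."""
    percent = EFFORT_PERCENT[reasoning_effort]
    return max_tokens * percent // 100  # rounding down is an assumption


budget_high = thinking_budget("high", 8000)
budget_low = thinking_budget("low", 8000)
```

With max_tokens=8000 as in the example request, "high" leaves 7200 tokens for thinking and "low" leaves 2400.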

Using Direct API Calls with Native `thinking` Parameter

For more precise control with Anthropic models, you can use the native thinking parameter directly:

from openai import OpenAI

client = OpenAI(
    api_key="TFY_API_KEY",
    base_url="{GATEWAY_BASE_URL}"
)

response = client.chat.completions.create(
    model="anthropic-main/claude-3-7-sonnet",
    messages=[
        {"role": "user", "content": "How to compute 3^3^3?"}
    ],
    max_tokens=8000,
    extra_body={
        "thinking": {
            "type": "enabled",
            "budget_tokens": 1000
        }
    }
)

print(response.choices[0].message.content)

Groq

Supported models: OpenAI GPT-OSS 20B (openai/gpt-oss-20b), OpenAI GPT-OSS 120B (openai/gpt-oss-120b), Qwen 3 32B (qwen/qwen3-32b), DeepSeek R1 Distill Llama 70B (deepseek-r1-distill-llama-70b)

from openai import OpenAI

client = OpenAI(
    api_key="TFY_API_KEY",
    base_url="{GATEWAY_BASE_URL}"
)

response = client.chat.completions.create(
    model="groq-main/deepseek-r1-distill-llama-70b",
    messages=[
        {"role": "user", "content": "How to compute 3^3^3?"}
    ],
    reasoning_effort="high",  # Options: "high", "medium", "low"
    max_tokens=8000
)

print(response.choices[0].message.content)

xAI

Supported models: grok-3-mini (with reasoning_effort parameter), grok-4-0709, grok-4-1-fast-reasoning, grok-4-fast-reasoning (reasoning built-in)

For grok-3-mini, you can use the reasoning_effort parameter to control reasoning depth. Other Grok models like grok-4-0709 have reasoning capabilities built-in but do not support the reasoning_effort parameter.

from openai import OpenAI

client = OpenAI(
    api_key="TFY_API_KEY",
    base_url="{GATEWAY_BASE_URL}"
)

# For grok-3-mini with reasoning_effort parameter
response = client.chat.completions.create(
    model="xai-main/grok-3-mini",
    messages=[
        {"role": "user", "content": "How to compute 3^3^3?"}
    ],
    reasoning_effort="high",  # Options: "high", "low" (only for grok-3-mini)
    max_tokens=8000
)

print(response.choices[0].message.content)

The reasoning_effort parameter is only supported for grok-3-mini. For other Grok models like grok-4-0709 and grok-4-1-fast-reasoning, reasoning is built-in and the reasoning_effort parameter should not be used. Reasoning tokens are included in the usage metrics for all reasoning-capable models.

Parameter Restrictions: Reasoning models (like grok-4-0709 and grok-4-1-fast-reasoning) do not support presence_penalty, frequency_penalty, or stop parameters. Using these parameters with reasoning models will result in an error.
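
If the same request parameters are shared across several Grok models, a small client-side filter can avoid the errors described above. The sketch below assumes every model passed in is one of the reasoning models listed in this section; the helper name `sanitize_xai_params` is hypothetical and this filtering is not something the Gateway is documented to do for you.

```python
# Only this model accepts reasoning_effort, per the text above.
EFFORT_MODELS = {"grok-3-mini"}
# Parameters that reasoning models reject, per the text above.
UNSUPPORTED_FOR_REASONING = ("presence_penalty", "frequency_penalty", "stop")


def sanitize_xai_params(model: str, params: dict) -> dict:
    """Return a copy of `params` without the options the given Grok model rejects."""
    cleaned = {k: v for k, v in params.items() if k not in UNSUPPORTED_FOR_REASONING}
    model_name = model.split("/")[-1]  # drop a provider prefix such as "xai-main/"
    if model_name not in EFFORT_MODELS:
        cleaned.pop("reasoning_effort", None)
    return cleaned


params = {"reasoning_effort": "high", "stop": ["\n"], "max_tokens": 8000}
mini_params = sanitize_xai_params("xai-main/grok-3-mini", params)
grok4_params = sanitize_xai_params("xai-main/grok-4-0709", params)
```

The cleaned dict can then be unpacked into `client.chat.completions.create(model=..., messages=..., **cleaned)`.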

Gemini

Supported models: all Gemini 2.5 series models. These models can be accessed through the Google Vertex or Google Gemini providers.

from openai import OpenAI

client = OpenAI(
    api_key="TFY_API_KEY",
    base_url="{GATEWAY_BASE_URL}"
)

response = client.chat.completions.create(
    model="vertex-main/gemini-2-5-pro",
    messages=[
        {"role": "user", "content": "How to compute 3^3^3?"}
    ],
    reasoning_effort="high",  # Options: "high", "medium", "low", "none"
    max_tokens=8000
)

print(response.choices[0].message.content)

For Gemini models (from Google Gemini or Google Vertex AI), TrueFoundry automatically translates the reasoning_effort parameter into Gemini’s native thinking parameter format, since Gemini doesn’t support the reasoning_effort parameter directly.

The translation uses the max_tokens parameter with the following ratios:

none: 0% of max_tokens
low: 30% of max_tokens
medium: 60% of max_tokens
high: 90% of max_tokens

Note: Gemini 2.5 Pro and 2.5 Flash come with reasoning on by default.

Using Direct API Calls with Native `thinking` Parameter

For more precise control with Gemini models, you can use the native thinking parameter directly:

from openai import OpenAI

client = OpenAI(
    api_key="TFY_API_KEY",
    base_url="{GATEWAY_BASE_URL}"
)

response = client.chat.completions.create(
    model="vertex-main/gemini-2-5-pro",
    messages=[
        {"role": "user", "content": "How to compute 3^3^3?"}
    ],
    max_tokens=8000,
    extra_body={
        "thinking": {
            "type": "enabled",
            "budget_tokens": 1000
        }
    }
)

print(response.choices[0].message.content)

Response Format

When reasoning tokens are enabled, the response includes both thinking and content sections:

{
  "id": "1742890579083",
  "object": "chat.completion",
  "created": 1742890579,
  "model": "",
  "provider": "aws",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Therefore: 3^3^3 = 7,625,597,484,987",
        "reasoning_content": "Exponentiation is right-associative: compute 3^3 = 27, then 3^27..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 45,
    "completion_tokens": 180,
    "total_tokens": 225
  }
}
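
A response in this shape can be split into its reasoning and its answer with a few lines. The sketch below works on the parsed JSON as a plain dict and treats reasoning_content as optional, since it is only present when reasoning tokens are enabled; the helper name `split_reasoning` is hypothetical.

```python
import json


def split_reasoning(response: dict) -> tuple[str, str]:
    """Return (reasoning, answer) from a chat completion response dict."""
    message = response["choices"][0]["message"]
    reasoning = message.get("reasoning_content") or ""  # absent without reasoning
    answer = message.get("content") or ""
    return reasoning, answer


raw = """
{
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Therefore: 3^3^3 = 7,625,597,484,987",
        "reasoning_content": "Exponentiation is right-associative: compute 3^3 = 27, then 3^27..."
      },
      "finish_reason": "stop"
    }
  ]
}
"""
reasoning, answer = split_reasoning(json.loads(raw))
```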

Streaming with Reasoning Tokens

For streaming responses, the thinking section is always sent before the content section.
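
Because thinking arrives first, a stream consumer can accumulate the two sections separately. The sketch below works on chunks already parsed into dicts and assumes the streamed delta uses the same reasoning_content field name as the non-streaming message; that field name in deltas and the helper name `collect_stream` are assumptions.

```python
from typing import Iterable


def collect_stream(chunks: Iterable[dict]) -> tuple[str, str]:
    """Accumulate streamed deltas into (reasoning, content)."""
    reasoning_parts: list[str] = []
    content_parts: list[str] = []
    for chunk in chunks:
        choices = chunk.get("choices") or []
        if not choices:
            continue  # e.g. a trailing usage-only chunk
        delta = choices[0].get("delta") or {}
        if delta.get("reasoning_content"):
            reasoning_parts.append(delta["reasoning_content"])
        if delta.get("content"):
            content_parts.append(delta["content"])
    return "".join(reasoning_parts), "".join(content_parts)


# Hand-written sample chunks, thinking first and content after.
sample_chunks = [
    {"choices": [{"index": 0, "delta": {"role": "assistant"}}]},
    {"choices": [{"index": 0, "delta": {"reasoning_content": "3^3 = 27, "}}]},
    {"choices": [{"index": 0, "delta": {"reasoning_content": "then 3^27."}}]},
    {"choices": [{"index": 0, "delta": {"content": "3^3^3 = "}}]},
    {"choices": [{"index": 0, "delta": {"content": "7,625,597,484,987"}}]},
    {"choices": []},
]
streamed_reasoning, streamed_content = collect_stream(sample_chunks)
```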
