Documentation Index
Fetch the complete documentation index at: https://www.truefoundry.com/llms.txt
Use this file to discover all available pages before exploring further.
Provider support for response schema
You can useresponse_format with any provider. The Gateway either uses the provider’s native structured output or converts your schema into a tool the model must call and then puts the result in message.content.
| Provider | Support |
|---|---|
| OpenAI | Native for json_object and for json_schema on supported models (e.g. gpt-4o, gpt-5, gpt-4.1, o3, o4). Other models use tool conversion. |
| Azure OpenAI | Same as OpenAI. |
| Anthropic | Native for Claude 4.5/4.6 with json_schema. Other models use tool conversion. |
| Google Gemini, Google Vertex | Native when the request has no tools; otherwise tool conversion. |
| All others (Bedrock, Cohere, Mistral, OpenRouter, Groq, xAI, vLLM, etc.) | Tool conversion only. The Gateway turns your schema into a required tool and extracts the result into message.content. |
Anthropic and JSON schema constraints: The code examples in this doc use Pydantic’s
ge=0 for fields such as age. Anthropic’s API does not support these constraint parameters in the schema. If you use structured output with Anthropic models, omit ge, le and similar numeric/string constraints from your schema (or use a schema without them). The code will work with Anthropic once those constraints are removed.Prompt Caching
Prompt caching reduces processing time and costs by reusing previously computed prefixes. When you send the same or similar prompts repeatedly (e.g. alarge system prompt, a shared context document, or a set of tool definitions), the provider can skip reprocessing the cached portion and only compute the new tokens. This leads to lower latency and reduced costs on subsequent requests.
You can send
cache_control for any model. The Gateway handles the per provider caching logic internally, forwarding it to providers that support explicit caching and stripping it for providers that cache automatically. This means you can write one request and use it across providers without worrying about compatibility.How to Enable
Add"cache_control": {"type": "ephemeral"} to any system prompt, user message content block, or tool definition:
Provider Support
| Provider | Caching type | How cache_control is handled |
|---|---|---|
| Anthropic (direct, Vertex AI, Azure AI Foundry) | Explicit | Forwarded as is |
| AWS Bedrock | Explicit | Translated to native cachePoint format |
| OpenAI / Azure OpenAI | Automatic | Stripped by the gateway |
| Google Gemini / Vertex AI | Automatic | Stripped by the gateway |
| Groq, xAI, and others | Automatic | Stripped by the gateway |
Anthropic / Bedrock
Anthropic / Bedrock
For Anthropic,
cache_control is forwarded to the provider as is. You can also include an optional ttl field (e.g. "5m") to control cache duration.For Bedrock, the gateway automatically translates cache_control into Bedrock’s native cachePoint format, so your request code stays the same.Anthropic enforces a minimum content length for caching to take effect:| Minimum tokens | Models |
|---|---|
4096 | Claude Mythos Preview, Opus 4.7, Opus 4.6, Opus 4.5, Haiku 4.5 |
2048 | Sonnet 4.6, Haiku 3.5, Haiku 3 |
1024 | Sonnet 4.5, Opus 4.1, Opus 4, Sonnet 4, Sonnet 3.7 |
For Amazon Titan models on Bedrock,
cache_control on tool definitions is automatically skipped since these models do not support cache points on tools.OpenAI / Azure OpenAI
OpenAI / Azure OpenAI
OpenAI and Azure OpenAI cache prompts automatically based on matching prefixes. No
cache_control markup is needed  the gateway strips it before forwarding.You can optionally pass prompt_cache_key to group requests that share a common prefix, improving cache hit rates:prompt_cache_key is only supported for OpenAI and Azure OpenAI.Gemini / Groq / xAI
Gemini / Groq / xAI
These providers handle caching automatically on their end. No
cache_control markup is needed  the gateway strips it before forwarding.Cached token counts are still reported in the response usage when the provider returns them.Cache Usage in Responses
When caching is active, the responseusage object includes cached token counts:
prompt_tokens_details.cached_tokens: tokens served from cache. Available across all providers that report cache usage.cache_read_input_tokens/cache_creation_input_tokens: Anthropic style fields, present for Anthropic, Bedrock, and Groq when values are non zero.
Reasoning Models
TrueFoundry AI Gateway provides access to model reasoning processes through thinking/reasoning tokens, available for models from multiple providers includingAnthropic,OpenAI,Azure OpenAI,Groq, xAI and Vertex.
These models expose their internal reasoning process, allowing you to see how they arrive at conclusions. The thinking/reasoning tokens provide step-by-step insights into the model’s cognitive process.
Supported Reasoning Models
OpenAI
OpenAI
Supported models:
o4-mini, o4-preview, o3 model family, o1 model family, gpt-5-mini, gpt-5-nano, gpt-5Azure OpenAI
Azure OpenAI
Supported models:
gpt-5, gpt-5-mini, gpt-5-nano, o3-pro, codex-mini, o4-mini, o3, o3-mini, o1, o1-miniAnthropic
Anthropic
Supported models:
viaUsing Direct API Calls with Native
For more precise control with Anthropic models, you can use the native
Claude Opus 4.1 (claude-opus-4-1-20250805), Claude Opus 4 (claude-opus-4-20250514), Claude Sonnet 4 (claude-sonnet-4-20250514), Claude Sonnet 3.7 (claude-3-7-sonnet-20250219) via
Anthropic, AWS Bedrock, and Google Vertex AIUsing OpenAI SDK
For Anthropic models (from Anthropic, Google Vertex AI, AWS Bedrock), TrueFoundry automatically translates the
reasoning_effort parameter into Anthropic’s native thinking parameter format since Anthropic doesn’t support the reasoning_effort parameter directly.The translation uses the max_tokens parameter with the following ratios:none: 0% of max_tokenslow: 30% of max_tokensmedium: 60% of max_tokenshigh: 90% of max_tokens
Using Direct API Calls with Native thinking Parameter
For more precise control with Anthropic models, you can use the native thinking parameter directly:Groq
Groq
Supported models:
OpenAI GPT-OSS 20B (openai/gpt-oss-20b), OpenAI GPT-OSS 120B (openai/gpt-oss-120b), Qwen 3 32B (qwen/qwen3-32b), DeepSeek R1 Distil Llama 70B (deepseek-r1-distill-llama-70b)xAI
xAI
Supported models:
grok-3-mini (with reasoning_effort parameter), grok-4-0709, grok-4-1-fast-reasoning, grok-4-fast-reasoning (reasoning built-in)For grok-3-mini, you can use the reasoning_effort parameter to control reasoning depth. Other Grok models like grok-4-0709 have reasoning capabilities built-in but do not support the reasoning_effort parameter.The
reasoning_effort parameter is only supported for grok-3-mini. For other Grok models like grok-4-0709 and grok-4-1-fast-reasoning, reasoning is built-in and the reasoning_effort parameter should not be used. Reasoning tokens are included in the usage metrics for all reasoning-capable models.Parameter Restrictions: Reasoning models (like grok-4-0709 and grok-4-1-fast-reasoning) do not support presence_penalty, frequency_penalty, or stop parameters. Using these parameters with reasoning models will result in an error.Gemini
Gemini
Supported models: All Using Direct API Calls with Native
For more precise control with Gemini models, you can use the native
Gemini 2.5 Series Models.These models can be accessed from Google Vertex or Google Gemini ProvidersFor Gemini models (from Anthropic, Google Vertex AI, AWS Bedrock), TrueFoundry automatically translates the
reasoning_effort parameter into Gemini’s native thinking parameter format since Gemini doesn’t support the reasoning_effort parameter directly.The translation uses the max_tokens parameter with the following ratios:none: 0% of max_tokenslow: 30% of max_tokensmedium: 60% of max_tokenshigh: 90% of max_tokens
Using Direct API Calls with Native thinking Parameter
For more precise control with Gemini models, you can use the native thinking parameter directly: