Provider support for response schema

You can use response_format with any provider. The Gateway either uses the provider’s native structured output or converts your schema into a tool the model must call and then puts the result in message.content.
Provider | Support
OpenAI | Native for json_object and for json_schema on supported models (e.g. gpt-4o, gpt-5, gpt-4.1, o3, o4). Other models use tool conversion.
Azure OpenAI | Same as OpenAI.
Anthropic | Native for Claude 4.5/4.6 with json_schema. Other models use tool conversion.
Google Gemini, Google Vertex | Native when the request has no tools; otherwise tool conversion.
All others (Bedrock, Cohere, Mistral, OpenRouter, Groq, xAI, vLLM, etc.) | Tool conversion only. The Gateway turns your schema into a required tool and extracts the result into message.content.
Anthropic and JSON schema constraints: The code examples in this doc use Pydantic’s ge=0 for fields such as age. Anthropic’s API does not support these constraint parameters in the schema. If you use structured output with Anthropic models, omit ge, le and similar numeric/string constraints from your schema (or use a schema without them). The code will work with Anthropic once those constraints are removed.
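One way to remove those constraints is to post-process the schema before passing it in response_format. The following is a minimal sketch, assuming the schema is a plain dict (for example the output of Pydantic's model_json_schema()). The helper name strip_constraints and the exact set of keywords it drops are illustrative assumptions, not part of the Gateway.

```python
from typing import Any

# Constraint keywords to drop. This set is an assumption for illustration;
# Pydantic emits ge/le/gt/lt as minimum/maximum/exclusiveMinimum/exclusiveMaximum.
UNSUPPORTED_KEYS = {
    "minimum", "maximum", "exclusiveMinimum", "exclusiveMaximum",
    "multipleOf", "minLength", "maxLength", "pattern",
}


def strip_constraints(schema: Any) -> Any:
    """Return a copy of a JSON schema with constraint keywords removed."""
    if isinstance(schema, list):
        return [strip_constraints(item) for item in schema]
    if not isinstance(schema, dict):
        return schema
    cleaned = {}
    for key, value in schema.items():
        if key in UNSUPPORTED_KEYS:
            continue
        if key in ("properties", "$defs") and isinstance(value, dict):
            # Keys at this level are field or definition names, not keywords,
            # so keep every name and only clean the sub-schemas.
            cleaned[key] = {name: strip_constraints(sub) for name, sub in value.items()}
        else:
            cleaned[key] = strip_constraints(value)
    return cleaned


schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer", "minimum": 0},  # produced by ge=0
    },
    "required": ["name", "age"],
}
cleaned = strip_constraints(schema)
```

The original schema dict is left untouched, so the constrained version can still be used for local validation of the model's output.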

Prompt Caching

Prompt caching reduces processing time and costs by reusing previously computed prefixes. When you send the same or similar prompts repeatedly (e.g. a large system prompt, a shared context document, or a set of tool definitions), the provider can skip reprocessing the cached portion and only compute the new tokens. This leads to lower latency and reduced costs on subsequent requests.
You can send cache_control for any model. The Gateway handles the per-provider caching logic internally: it forwards cache_control to providers that support explicit caching and strips it for providers that cache automatically. This means you can write one request and use it across providers without worrying about compatibility.

How to Enable

Add "cache_control": {"type": "ephemeral"} to any system prompt, user message content block, or tool definition:
response = client.chat.completions.create(
    model="anthropic/claude-opus-4-6",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "Your large system prompt here...",
                    "cache_control": {"type": "ephemeral"},
                },
            ],
        },
        {"role": "user", "content": "Your question here"},
    ]
)

Provider Support

Provider | Caching type | How cache_control is handled
Anthropic (direct, Vertex AI, Azure AI Foundry) | Explicit | Forwarded as is
AWS Bedrock | Explicit | Translated to native cachePoint format
OpenAI / Azure OpenAI | Automatic | Stripped by the gateway
Google Gemini / Vertex AI | Automatic | Stripped by the gateway
Groq, xAI, and others | Automatic | Stripped by the gateway
For Anthropic, cache_control is forwarded to the provider as is. You can also include an optional ttl field (e.g. "5m") to control cache duration.
For Bedrock, the gateway automatically translates cache_control into Bedrock’s native cachePoint format, so your request code stays the same.
Anthropic enforces a minimum content length for caching to take effect:
Minimum tokens | Models
4096 | Claude Mythos Preview, Opus 4.7, Opus 4.6, Opus 4.5, Haiku 4.5
2048 | Sonnet 4.6, Haiku 3.5, Haiku 3
1024 | Sonnet 4.5, Opus 4.1, Opus 4, Sonnet 4, Sonnet 3.7
For Amazon Titan models on Bedrock, cache_control on tool definitions is automatically skipped since these models do not support cache points on tools.
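As a small illustration of the optional ttl field, the sketch below builds a cacheable text block with or without a ttl. The helper name cached_text_block is hypothetical; only the cache_control shape and the "5m" value come from the text above.

```python
from typing import Optional


def cached_text_block(text: str, ttl: Optional[str] = None) -> dict:
    """Build a text content block marked for caching, optionally with a ttl."""
    cache_control = {"type": "ephemeral"}
    if ttl is not None:
        # ttl is optional; omit the key entirely when no duration is requested.
        cache_control["ttl"] = ttl
    return {"type": "text", "text": text, "cache_control": cache_control}


# A system message whose prompt is cached with the example duration.
system_message = {
    "role": "system",
    "content": [cached_text_block("Your large system prompt here...", ttl="5m")],
}
```

The resulting system_message can be placed in the messages list of a chat completion request in the same way as the example under How to Enable.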
OpenAI and Azure OpenAI cache prompts automatically based on matching prefixes. No cache_control markup is needed; the gateway strips it before forwarding.
You can optionally pass prompt_cache_key to group requests that share a common prefix, improving cache hit rates:
response = client.chat.completions.create(
    model="openai-main/gpt-4o",
    messages=[{"role": "user", "content": "Your prompt here"}],
    prompt_cache_key="optional-custom-key"
)
prompt_cache_key is only supported for OpenAI and Azure OpenAI.
Google Gemini, Vertex AI, Groq, xAI, and the other providers listed as automatic handle caching on their end. No cache_control markup is needed; the gateway strips it before forwarding.
Cached token counts are still reported in the response usage when the provider returns them.

Cache Usage in Responses

When caching is active, the response usage object includes cached token counts:
{
  "usage": {
    "prompt_tokens": 1500,
    "completion_tokens": 200,
    "total_tokens": 1700,
    "prompt_tokens_details": {
      "cached_tokens": 1200
    },
    "cache_read_input_tokens": 1200,
    "cache_creation_input_tokens": 300
  }
}
  • prompt_tokens_details.cached_tokens: tokens served from cache. Available across all providers that report cache usage.
  • cache_read_input_tokens / cache_creation_input_tokens: Anthropic-style fields, present for Anthropic, Bedrock, and Groq when the values are nonzero.
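To check how much of a prompt was served from cache, the usage fields can be summarized as in the sketch below. It assumes, as in the sample response above, that prompt_tokens includes the cached tokens; the function name cache_stats and the returned keys are illustrative.

```python
def cache_stats(usage: dict) -> dict:
    """Summarize cache usage from a chat completion usage object (as a dict)."""
    prompt_tokens = usage.get("prompt_tokens", 0)
    # prompt_tokens_details may be missing or null when nothing was cached.
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    return {
        "cached_tokens": cached,
        "uncached_tokens": prompt_tokens - cached,
        "hit_ratio": cached / prompt_tokens if prompt_tokens else 0.0,
        # Anthropic-style field; absent for providers that do not report it.
        "cache_write_tokens": usage.get("cache_creation_input_tokens", 0),
    }


usage = {
    "prompt_tokens": 1500,
    "completion_tokens": 200,
    "total_tokens": 1700,
    "prompt_tokens_details": {"cached_tokens": 1200},
    "cache_read_input_tokens": 1200,
    "cache_creation_input_tokens": 300,
}
stats = cache_stats(usage)
```

For the sample usage object, 1200 of the 1500 prompt tokens come from cache, which gives a hit ratio of 0.8.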

Reasoning Models

TrueFoundry AI Gateway provides access to model reasoning processes through thinking/reasoning tokens, available for models from multiple providers including Anthropic, OpenAI, Azure OpenAI, Groq, xAI, and Vertex. These models expose their internal reasoning, so you can see step by step how they arrive at a conclusion.

Supported Reasoning Models

Supported models (OpenAI): o4-mini, o4-preview, o3 model family, o1 model family, gpt-5-mini, gpt-5-nano, gpt-5
from openai import OpenAI

client = OpenAI(
    api_key="TFY_API_KEY",
    base_url="{GATEWAY_BASE_URL}"
)

response = client.chat.completions.create(
    model="openai-main/o4-mini",
    messages=[
        {"role": "user", "content": "How to compute 3^3^3?"}
    ],
    reasoning_effort="high",  # Options: "high", "medium", "low"
    max_tokens=8000
)

print(response.choices[0].message.content)
Supported models (Azure OpenAI): gpt-5, gpt-5-mini, gpt-5-nano, o3-pro, codex-mini, o4-mini, o3, o3-mini, o1, o1-mini
from openai import OpenAI

client = OpenAI(
    api_key="TFY_API_KEY",
    base_url="{GATEWAY_BASE_URL}"
)

response = client.chat.completions.create(
    model="azure-openai-main/o3-mini",
    messages=[
        {"role": "user", "content": "How to compute 3^3^3?"}
    ],
    reasoning_effort="high",  # Options: "high", "medium", "low"
    max_tokens=8000
)

print(response.choices[0].message.content)
Supported models (Anthropic): Claude Opus 4.1 (claude-opus-4-1-20250805), Claude Opus 4 (claude-opus-4-20250514), Claude Sonnet 4 (claude-sonnet-4-20250514), Claude Sonnet 3.7 (claude-3-7-sonnet-20250219), available via Anthropic, AWS Bedrock, and Google Vertex AI

Using OpenAI SDK

from openai import OpenAI

client = OpenAI(
    api_key="TFY_API_KEY",
    base_url="{GATEWAY_BASE_URL}"
)

response = client.chat.completions.create(
    model="anthropic-main/claude-3-7-sonnet",
    messages=[
        {"role": "user", "content": "How to compute 3^3^3?"}
    ],
    reasoning_effort="high",  # Options: "high", "medium", "low", "none"
    max_tokens=8000
)

print(response.choices[0].message.content)
For Anthropic models (from Anthropic, Google Vertex AI, AWS Bedrock), TrueFoundry automatically translates the reasoning_effort parameter into Anthropic’s native thinking parameter format since Anthropic doesn’t support the reasoning_effort parameter directly.
The translation uses the max_tokens parameter with the following ratios:
  • none: 0% of max_tokens
  • low: 30% of max_tokens
  • medium: 60% of max_tokens
  • high: 90% of max_tokens
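The ratios above can be worked through with a short sketch. The percentages come from the list; the function name thinking_budget and the choice to round down to a whole token are assumptions, since the Gateway's exact rounding is not documented here.

```python
# Share of max_tokens given to thinking for each reasoning_effort value.
EFFORT_PERCENT = {"none": 0, "low": 30, "medium": 60, "high": 90}


def thinking_budget(reasoning_effort: str, max_tokens: int) -> int:
    """Estimate the thinking budget implied by reasoning_effort and max_tokens."""
    if reasoning_effort not in EFFORT_PERCENT:
        raise ValueError(f"unknown reasoning_effort: {reasoning_effort!r}")
    # Integer arithmetic avoids float rounding; rounding down is an assumption.
    return max_tokens * EFFORT_PERCENT[reasoning_effort] // 100


budget = thinking_budget("high", 8000)  # 90% of 8000 = 7200
```

With max_tokens=8000 as in the example request, "low" gives 2400, "medium" gives 4800, and "high" gives 7200 thinking tokens, leaving the remainder of max_tokens for the final answer.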

Using Direct API Calls with Native thinking Parameter

For more precise control with Anthropic models, you can use the native thinking parameter directly:
from openai import OpenAI

client = OpenAI(
    api_key="TFY_API_KEY",
    base_url="{GATEWAY_BASE_URL}"
)

response = client.chat.completions.create(
    model="anthropic-main/claude-3-7-sonnet",
    messages=[
        {"role": "user", "content": "How to compute 3^3^3?"}
    ],
    max_tokens=8000,
    extra_body={
        "thinking": {
            "type": "enabled",
            "budget_tokens": 1024  # Anthropic's minimum thinking budget is 1,024 tokens
        }
    }
)

print(response.choices[0].message.content)
Supported models (Groq): OpenAI GPT-OSS 20B (openai/gpt-oss-20b), OpenAI GPT-OSS 120B (openai/gpt-oss-120b), Qwen 3 32B (qwen/qwen3-32b), DeepSeek R1 Distill Llama 70B (deepseek-r1-distill-llama-70b)
from openai import OpenAI

client = OpenAI(
    api_key="TFY_API_KEY",
    base_url="{GATEWAY_BASE_URL}"
)

response = client.chat.completions.create(
    model="groq-main/deepseek-r1-distill-llama-70b",
    messages=[
        {"role": "user", "content": "How to compute 3^3^3?"}
    ],
    reasoning_effort="high",  # Options: "high", "medium", "low"
    max_tokens=8000
)

print(response.choices[0].message.content)
Supported models (xAI): grok-3-mini (with reasoning_effort parameter), grok-4-0709, grok-4-1-fast-reasoning, grok-4-fast-reasoning (reasoning built-in)
For grok-3-mini, you can use the reasoning_effort parameter to control reasoning depth. Other Grok models like grok-4-0709 have reasoning capabilities built-in but do not support the reasoning_effort parameter.
from openai import OpenAI

client = OpenAI(
    api_key="TFY_API_KEY",
    base_url="{GATEWAY_BASE_URL}"
)

# For grok-3-mini with reasoning_effort parameter
response = client.chat.completions.create(
    model="xai-main/grok-3-mini",
    messages=[
        {"role": "user", "content": "How to compute 3^3^3?"}
    ],
    reasoning_effort="high",  # Options: "high", "low" (only for grok-3-mini)
    max_tokens=8000
)

print(response.choices[0].message.content)
The reasoning_effort parameter is only supported for grok-3-mini. For other Grok models like grok-4-0709 and grok-4-1-fast-reasoning, reasoning is built-in and the reasoning_effort parameter should not be used. Reasoning tokens are included in the usage metrics for all reasoning-capable models.
Parameter Restrictions: Reasoning models (like grok-4-0709 and grok-4-1-fast-reasoning) do not support presence_penalty, frequency_penalty, or stop parameters. Using these parameters with reasoning models will result in an error.
Supported models (Gemini): all Gemini 2.5 series models. These models can be accessed through the Google Vertex or Google Gemini providers.
from openai import OpenAI

client = OpenAI(
    api_key="TFY_API_KEY",
    base_url="{GATEWAY_BASE_URL}"
)

response = client.chat.completions.create(
    model="vertex-main/gemini-2-5-pro",
    messages=[
        {"role": "user", "content": "How to compute 3^3^3?"}
    ],
    reasoning_effort="high",  # Options: "high", "medium", "low", "none"
    max_tokens=8000
)

print(response.choices[0].message.content)
For Gemini models (from Google Gemini or Google Vertex AI), TrueFoundry automatically translates the reasoning_effort parameter into Gemini’s native thinking parameter format since Gemini doesn’t support the reasoning_effort parameter directly.
The translation uses the max_tokens parameter with the following ratios:
  • none: 0% of max_tokens
  • low: 30% of max_tokens
  • medium: 60% of max_tokens
  • high: 90% of max_tokens
Note: Gemini 2.5 Pro and 2.5 Flash come with reasoning on by default.

Using Direct API Calls with Native thinking Parameter

For more precise control with Gemini models, you can use the native thinking parameter directly:
from openai import OpenAI

client = OpenAI(
    api_key="TFY_API_KEY",
    base_url="{GATEWAY_BASE_URL}"
)

response = client.chat.completions.create(
    model="vertex-main/gemini-2-5-pro",
    messages=[
        {"role": "user", "content": "How to compute 3^3^3?"}
    ],
    max_tokens=8000,
    extra_body={
        "thinking": {
            "type": "enabled",
            "budget_tokens": 1000
        }
    }
)

print(response.choices[0].message.content)

Response Format

When reasoning tokens are enabled, the response message includes both the model's reasoning (reasoning_content) and its final answer (content):
{
  "id": "1742890579083",
  "object": "chat.completion",
  "created": 1742890579,
  "model": "",
  "provider": "aws",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Therefore: 3^3^3 = 7,625,597,484,987",
        "reasoning_content": "Exponentiation is right-associative: compute 3^3 = 27, then 3^27..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 45,
    "completion_tokens": 180,
    "total_tokens": 225
  }
}
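A minimal sketch of reading both fields from a response like the one above, assuming it has been parsed into a plain dict (for example with json.loads). The helper name split_reasoning is hypothetical, and the sample below is trimmed to the fields the helper uses.

```python
from typing import Optional, Tuple


def split_reasoning(response: dict) -> Tuple[Optional[str], str]:
    """Return (reasoning, answer) from the first choice of a chat completion."""
    message = response["choices"][0]["message"]
    # reasoning_content is only present when the model returned reasoning tokens.
    reasoning = message.get("reasoning_content")
    answer = message.get("content") or ""
    return reasoning, answer


response = {
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Therefore: 3^3^3 = 7,625,597,484,987",
                "reasoning_content": "Exponentiation is right-associative: compute 3^3 = 27, then 3^27...",
            },
            "finish_reason": "stop",
        }
    ]
}
reasoning, answer = split_reasoning(response)
```

Returning None for a missing reasoning_content lets the same code handle models or requests where reasoning is disabled.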

Streaming with Reasoning Tokens

For streaming responses, the thinking section is always sent before the content section.
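Because the reasoning arrives first, a client can accumulate the two parts separately while consuming the stream. The sketch below works on chunks represented as plain dicts. It assumes each streamed delta carries reasoning under reasoning_content and answer text under content, mirroring the non-streaming response; this page does not specify the chunk shape, so treat the field names and the helper collect_stream as assumptions.

```python
from typing import Iterable, Tuple


def collect_stream(chunks: Iterable[dict]) -> Tuple[str, str]:
    """Accumulate (reasoning, answer) text from streamed chat completion chunks."""
    reasoning_parts = []
    content_parts = []
    for chunk in chunks:
        choices = chunk.get("choices") or []
        if not choices:
            continue  # e.g. a trailing usage-only chunk
        delta = choices[0].get("delta") or {}
        # Assumed field name for streamed reasoning text.
        if delta.get("reasoning_content"):
            reasoning_parts.append(delta["reasoning_content"])
        if delta.get("content"):
            content_parts.append(delta["content"])
    return "".join(reasoning_parts), "".join(content_parts)


# Simulated chunks: reasoning deltas first, then the answer.
simulated = [
    {"choices": [{"delta": {"reasoning_content": "3^3 = 27, "}}]},
    {"choices": [{"delta": {"reasoning_content": "then 3^27."}}]},
    {"choices": [{"delta": {"content": "7,625,597,484,987"}}]},
    {"choices": []},
]
streamed_reasoning, streamed_answer = collect_stream(simulated)
```

With the OpenAI SDK, each chunk is an object rather than a dict, so the same logic would read the delta's attributes instead of dict keys.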