
# Chat Completions: Caching & Reasoning

> Prompt caching and reasoning models for the Chat Completions API

## Prompt Caching

Prompt caching reduces latency and cost by reusing previously computed prompt prefixes. When successive requests begin with the same content (for example a large system prompt, a shared context document, or a set of tool definitions), the provider can skip reprocessing the cached prefix and compute only the new tokens.

<Info>
  You can send `cache_control` for any model. The Gateway handles the per-provider caching logic internally: it forwards `cache_control` to providers that read it and strips it for providers that cache prefixes on their own. This means you can write one request and use it across providers without worrying about compatibility.
</Info>

### How to Enable

Add `"cache_control": {"type": "ephemeral"}` to any system prompt block, user message content block, or tool definition:

<CodeGroup>
  ```python System Prompt theme={"dark"}
  response = client.chat.completions.create(
      model="anthropic/claude-opus-4-6",
      messages=[
          {
              "role": "system",
              "content": [
                  {
                      "type": "text",
                      "text": "Your large system prompt here...",
                      "cache_control": {"type": "ephemeral"},
                  },
              ],
          },
          {"role": "user", "content": "Your question here"},
      ]
  )
  ```

  ```python User Message theme={"dark"}
  response = client.chat.completions.create(
      model="anthropic/claude-opus-4-6",
      messages=[
          {
              "role": "user",
              "content": [
                  {
                      "type": "text",
                      "text": "Your large context or reference text here...",
                      "cache_control": {"type": "ephemeral"},
                  },
                  {
                      "type": "text",
                      "text": "Now answer my question about the above.",
                  },
              ],
          },
      ]
  )
  ```

  ```python Tool Definitions theme={"dark"}
  response = client.chat.completions.create(
      model="anthropic/claude-opus-4-6",
      messages=[{"role": "user", "content": "What's the weather?"}],
      tools=[
          {
              "type": "function",
              "function": {
                  "name": "get_weather",
                  "description": "Get current weather",
                  "parameters": {
                      "type": "object",
                      "properties": {"location": {"type": "string"}},
                      "required": ["location"]
                  }
              },
              "cache_control": {"type": "ephemeral"}
          }
      ]
  )
  ```
</CodeGroup>

### Provider Support

There are three caching styles. The first two are **marker-based** — you send `cache_control` and the provider caches based on it (this uses the same `explicit` vs `automatic` distinction as the [Messages API](/docs/ai-gateway/messages-overview#prompt-caching)):

* **Explicit** — you place `cache_control` on individual blocks (a `system` block, a message content block, or a `tool` definition).
* **Automatic** — you place a single `cache_control` field at the **top level** of the request and the provider manages the breakpoint for you. Supported on Anthropic, Claude Platform on AWS, and Azure AI Foundry.
* **Provider-managed** — the provider caches repeated prefixes on its own with no markup, so the Gateway strips any `cache_control` you send.

| Provider                                      | Caching style        | How `cache_control` is handled                                                                                                                                                                                          |
| --------------------------------------------- | -------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Anthropic (direct), Azure AI Foundry (Claude) | Explicit + automatic | Forwarded to the provider unchanged as native Anthropic `cache_control` (optional `ttl` supported).                                                                                                                     |
| Google Vertex (Claude), Databricks (Claude)   | Explicit             | Forwarded to the provider unchanged as native Anthropic `cache_control` (optional `ttl` supported).                                                                                                                     |
| AWS Bedrock (Claude)                          | Explicit             | Chat completions run through Bedrock's **Converse** API, so each block-level `cache_control` is translated into a native `cachePoint` marker. A top-level `cache_control` is dropped (no automatic caching on Bedrock). |
| OpenAI / Azure OpenAI                         | Provider-managed     | `cache_control` stripped before forwarding. Optionally pass `prompt_cache_key`.                                                                                                                                         |
| Google Gemini / Google Vertex (Gemini)        | Provider-managed     | `cache_control` stripped before forwarding.                                                                                                                                                                             |
| Groq, xAI, and others                         | Provider-managed     | `cache_control` stripped before forwarding.                                                                                                                                                                             |

<Note>
  The examples above use **explicit** caching (block-level `cache_control`), which works on every Claude provider. For **automatic** caching (a single top-level `cache_control`) and its provider availability, see [Messages API caching](/docs/ai-gateway/messages-overview#prompt-caching) — the same rules apply here.
</Note>
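
The forward-or-strip behavior in the table above can be pictured with a small sketch. This is not the Gateway's implementation: the `STYLE_BY_PROVIDER` mapping and the `prepare_body` helper are hypothetical names used only for illustration, and the Bedrock `cachePoint` translation is left out.

```python
import copy

# Hypothetical mapping that mirrors the provider table; not the Gateway's code.
STYLE_BY_PROVIDER = {
    "anthropic": "marker",
    "azure-ai-foundry": "marker",
    "openai": "provider-managed",
    "gemini": "provider-managed",
}


def strip_cache_control(node):
    """Return a copy of a request body with every cache_control key removed."""
    if isinstance(node, dict):
        return {
            key: strip_cache_control(value)
            for key, value in node.items()
            if key != "cache_control"
        }
    if isinstance(node, list):
        return [strip_cache_control(item) for item in node]
    return node


def prepare_body(provider, body):
    """Forward markers for marker-based providers, strip them for all others."""
    if STYLE_BY_PROVIDER.get(provider) == "marker":
        return copy.deepcopy(body)
    # Unknown providers are treated as provider-managed, as in the table.
    return strip_cache_control(body)


body = {
    "messages": [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "Your large system prompt here...",
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        }
    ]
}

forwarded = prepare_body("anthropic", body)
stripped = prepare_body("openai", body)
```

In this sketch the same `body` is reused for both providers, which is the point of the Gateway behavior described above: the caller never has to branch on the provider.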

<AccordionGroup>
  <Accordion title="Anthropic (and Bedrock / Vertex / Azure / Databricks Claude)">
    For Anthropic (direct), Google Vertex (Claude), Azure AI Foundry (Claude), and Databricks (Claude), `cache_control` is forwarded to the provider unchanged. You can also include an optional `ttl` field (e.g. `"5m"` or `"1h"`) to control cache duration.

    For **AWS Bedrock**, the gateway translates each `cache_control` block into Bedrock's native Converse `cachePoint` format, so your request code stays the same. Converse `cachePoint` markers do not carry a duration, so any `ttl` you set is ignored on Bedrock chat completions.

    <Note>
      On the [Messages API (`/messages`)](/docs/ai-gateway/messages-overview#prompt-caching), Bedrock Claude models are served through the **InvokeModel** API with the native Anthropic body instead of Converse. There, `cache_control` (and `ttl`) is forwarded unchanged rather than translated to `cachePoint`.
    </Note>

    Anthropic enforces a minimum content length for caching to take effect; shorter prompts accept the `cache_control` hint but are not cached:

    | Minimum tokens | Models                                                         |
    | -------------- | -------------------------------------------------------------- |
    | `4096`         | Claude Mythos Preview, Opus 4.7, Opus 4.6, Opus 4.5, Haiku 4.5 |
    | `2048`         | Sonnet 4.6, Haiku 3.5, Haiku 3                                 |
    | `1024`         | Sonnet 4.5, Opus 4.1, Opus 4, Sonnet 4, Sonnet 3.7             |

    <Note>
      For Amazon Titan and Nova models on Bedrock, `cache_control` on tool definitions is automatically skipped since these models do not support cache points on tools.
    </Note>
  </Accordion>

  <Accordion title="OpenAI / Azure OpenAI">
    OpenAI and Azure OpenAI are **provider-managed** — they cache matching prefixes on their own. No `cache_control` markup is needed, and the gateway strips it before forwarding.

    You can optionally pass `prompt_cache_key` to group requests that share a common prefix, improving cache hit rates:

    ```python theme={"dark"}
    response = client.chat.completions.create(
        model="openai-main/gpt-4o",
        messages=[{"role": "user", "content": "Your prompt here"}],
        prompt_cache_key="optional-custom-key"
    )
    ```

    <Info>
      `prompt_cache_key` is only supported for OpenAI and Azure OpenAI.
    </Info>
  </Accordion>

  <Accordion title="Gemini / Groq / xAI">
    These providers are **provider-managed** — they handle caching on their own end. No `cache_control` markup is needed, and the gateway strips it before forwarding.

    Cached token counts are still reported in the response `usage` when the provider returns them.
  </Accordion>
</AccordionGroup>
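
Because Anthropic silently skips caching for prompts below the minimum length, it can help to check a prompt's token count before relying on a cache hit. The sketch below copies the thresholds from the minimum-token table in the Anthropic accordion; the `will_cache` helper is a hypothetical name, and the token count is assumed to come from whatever tokenizer or usage data you already have.

```python
# Thresholds copied from the minimum-token table in the Anthropic accordion.
MIN_CACHE_TOKENS = {
    "Claude Mythos Preview": 4096,
    "Opus 4.7": 4096,
    "Opus 4.6": 4096,
    "Opus 4.5": 4096,
    "Haiku 4.5": 4096,
    "Sonnet 4.6": 2048,
    "Haiku 3.5": 2048,
    "Haiku 3": 2048,
    "Sonnet 4.5": 1024,
    "Opus 4.1": 1024,
    "Opus 4": 1024,
    "Sonnet 4": 1024,
    "Sonnet 3.7": 1024,
}


def will_cache(model_name, prefix_tokens):
    """Return True if the marked prefix is long enough to be cached.

    prefix_tokens is the caller's own estimate of the tokens up to and
    including the block that carries cache_control.
    """
    return prefix_tokens >= MIN_CACHE_TOKENS[model_name]


print(will_cache("Opus 4.6", 3000))   # below the 4096-token minimum
print(will_cache("Sonnet 4", 1024))   # exactly at the 1024-token minimum
```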

### Cache Usage in Responses

When caching is active, the response `usage` object includes cached token counts:

```json theme={"dark"}
{
  "usage": {
    "prompt_tokens": 1500,
    "completion_tokens": 200,
    "total_tokens": 1700,
    "prompt_tokens_details": {
      "cached_tokens": 1200
    },
    "cache_read_input_tokens": 1200,
    "cache_creation_input_tokens": 300
  }
}
```

* `prompt_tokens_details.cached_tokens`: tokens served from cache. Available across all providers that report cache usage.
* `cache_read_input_tokens` / `cache_creation_input_tokens`: Anthropic-style fields, present for Anthropic, Bedrock, and Groq when the values are nonzero.
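
As a rough illustration, the fields above can be turned into a cache hit ratio. The `cache_summary` helper is a hypothetical name, and the sketch assumes that `prompt_tokens` already includes the cached tokens, as it does in the example response (1200 cached + 300 new = 1500).

```python
def cache_summary(usage):
    """Summarize cache usage from a chat completion `usage` dict."""
    prompt_tokens = usage.get("prompt_tokens", 0)
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    return {
        "cached_tokens": cached,
        # Assumption: prompt_tokens counts cached and uncached tokens together.
        "uncached_tokens": prompt_tokens - cached,
        "hit_ratio": cached / prompt_tokens if prompt_tokens else 0.0,
        # Anthropic-style field; absent when the provider does not report it.
        "cache_write_tokens": usage.get("cache_creation_input_tokens", 0),
    }


usage = {
    "prompt_tokens": 1500,
    "completion_tokens": 200,
    "total_tokens": 1700,
    "prompt_tokens_details": {"cached_tokens": 1200},
    "cache_read_input_tokens": 1200,
    "cache_creation_input_tokens": 300,
}

summary = cache_summary(usage)
print(summary["hit_ratio"])  # 0.8
```

Reading only `prompt_tokens_details.cached_tokens` for the ratio keeps the helper portable, since that field is the one reported across providers.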

## Reasoning Models

TrueFoundry AI Gateway provides access to model reasoning processes through thinking/reasoning tokens, available for models from multiple providers, including `Anthropic`, `OpenAI`, `Azure OpenAI`, `Groq`, `xAI`, and `Vertex`.

These models expose their internal reasoning process, allowing you to see how they arrive at conclusions. The thinking/reasoning tokens provide step-by-step insights into the model's cognitive process.

### Supported Reasoning Models

<AccordionGroup>
  <Accordion title="OpenAI">
    Supported models: `o4-mini`, `o4-preview`, `o3` model family, `o1` model family, `gpt-5-mini`, `gpt-5-nano`, `gpt-5`

    ```python theme={"dark"}
    from openai import OpenAI

    client = OpenAI(
        api_key="TFY_API_KEY",
        base_url="{GATEWAY_BASE_URL}"
    )

    response = client.chat.completions.create(
        model="openai-main/o4-mini",
        messages=[
            {"role": "user", "content": "How to compute 3^3^3?"}
        ],
        reasoning_effort="high",  # Options: "high", "medium", "low"
        max_tokens=8000
    )

    print(response.choices[0].message.content)
    ```
  </Accordion>

  <Accordion title="Azure OpenAI">
    Supported models: `gpt-5`, `gpt-5-mini`, `gpt-5-nano`, `o3-pro`, `codex-mini`, `o4-mini`, `o3`, `o3-mini`, `o1`, `o1-mini`

    ```python theme={"dark"}
    from openai import OpenAI

    client = OpenAI(
        api_key="TFY_API_KEY",
        base_url="{GATEWAY_BASE_URL}"
    )

    response = client.chat.completions.create(
        model="azure-openai-main/o3-mini",
        messages=[
            {"role": "user", "content": "How to compute 3^3^3?"}
        ],
        reasoning_effort="high",  # Options: "high", "medium", "low"
        max_tokens=8000
    )

    print(response.choices[0].message.content)
    ```
  </Accordion>

  <Accordion title="Anthropic">
    Supported models: `Claude Opus 4.1` (claude-opus-4-1-20250805), `Claude Opus 4` (claude-opus-4-20250514), `Claude Sonnet 4` (claude-sonnet-4-20250514), `Claude Sonnet 3.7` (claude-3-7-sonnet-20250219) \
    via `Anthropic`, `AWS Bedrock`, and `Google Vertex AI`

    ### Using OpenAI SDK

    ```python theme={"dark"}
    from openai import OpenAI

    client = OpenAI(
        api_key="TFY_API_KEY",
        base_url="{GATEWAY_BASE_URL}"
    )

    response = client.chat.completions.create(
        model="anthropic-main/claude-3-7-sonnet",
        messages=[
            {"role": "user", "content": "How to compute 3^3^3?"}
        ],
        reasoning_effort="high",  # Options: "high", "medium", "low", "none"
        max_tokens=8000
    )

    print(response.choices[0].message.content)
    ```

    <Note>
      For Anthropic models (from Anthropic, Google Vertex AI, AWS Bedrock), TrueFoundry automatically translates the `reasoning_effort` parameter into Anthropic's native `thinking` parameter format since Anthropic doesn't support the `reasoning_effort` parameter directly.

      The translation uses the `max_tokens` parameter with the following ratios:

      * `none`: 0% of max\_tokens
      * `low`: 30% of max\_tokens
      * `medium`: 60% of max\_tokens
      * `high`: 90% of max\_tokens
    </Note>

    ### Using Direct API Calls with Native `thinking` Parameter

    For more precise control with Anthropic models, you can use the native `thinking` parameter directly:

    ```python theme={"dark"}
    from openai import OpenAI

    client = OpenAI(
        api_key="TFY_API_KEY",
        base_url="{GATEWAY_BASE_URL}"
    )

    response = client.chat.completions.create(
        model="anthropic-main/claude-3-7-sonnet",
        messages=[
            {"role": "user", "content": "How to compute 3^3^3?"}
        ],
        max_tokens=8000,
        extra_body={
            "thinking": {
                "type": "enabled",
                "budget_tokens": 1000
            }
        }
    )

    print(response.choices[0].message.content)
    ```
  </Accordion>

  <Accordion title="Groq">
    Supported models: `OpenAI GPT-OSS 20B` (openai/gpt-oss-20b), `OpenAI GPT-OSS 120B` (openai/gpt-oss-120b), `Qwen 3 32B` (qwen/qwen3-32b), `DeepSeek R1 Distill Llama 70B` (deepseek-r1-distill-llama-70b)

    ```python theme={"dark"}
    from openai import OpenAI

    client = OpenAI(
        api_key="TFY_API_KEY",
        base_url="{GATEWAY_BASE_URL}"
    )

    response = client.chat.completions.create(
        model="groq-main/deepseek-r1-distill-llama-70b",
        messages=[
            {"role": "user", "content": "How to compute 3^3^3?"}
        ],
        reasoning_effort="high",  # Options: "high", "medium", "low"
        max_tokens=8000
    )

    print(response.choices[0].message.content)
    ```
  </Accordion>

  <Accordion title="xAI">
    Supported models: `grok-3-mini` (with `reasoning_effort` parameter), `grok-4-0709`, `grok-4-1-fast-reasoning`, `grok-4-fast-reasoning` (reasoning built-in)

    For `grok-3-mini`, you can use the `reasoning_effort` parameter to control reasoning depth. Other Grok models like `grok-4-0709` have reasoning capabilities built-in but do not support the `reasoning_effort` parameter.

    ```python theme={"dark"}
    from openai import OpenAI

    client = OpenAI(
        api_key="TFY_API_KEY",
        base_url="{GATEWAY_BASE_URL}"
    )

    # For grok-3-mini with reasoning_effort parameter
    response = client.chat.completions.create(
        model="xai-main/grok-3-mini",
        messages=[
            {"role": "user", "content": "How to compute 3^3^3?"}
        ],
        reasoning_effort="high",  # Options: "high", "low" (only for grok-3-mini)
        max_tokens=8000
    )

    print(response.choices[0].message.content)
    ```

    <Note>
      The `reasoning_effort` parameter is only supported for `grok-3-mini`. For other Grok models like `grok-4-0709` and `grok-4-1-fast-reasoning`, reasoning is built-in and the `reasoning_effort` parameter should not be used. Reasoning tokens are included in the usage metrics for all reasoning-capable models.

      **Parameter Restrictions**: Reasoning models (like `grok-4-0709` and `grok-4-1-fast-reasoning`) do not support `presence_penalty`, `frequency_penalty`, or `stop` parameters. Using these parameters with reasoning models will result in an error.
    </Note>
  </Accordion>

  <Accordion title="Gemini">
    Supported models: All `Gemini 2.5 Series` Models.

    These models can be accessed through the `Google Vertex` or `Google Gemini` providers.

    ```python theme={"dark"}
    from openai import OpenAI

    client = OpenAI(
        api_key="TFY_API_KEY",
        base_url="{GATEWAY_BASE_URL}"
    )

    response = client.chat.completions.create(
        model="vertex-main/gemini-2-5-pro",
        messages=[
            {"role": "user", "content": "How to compute 3^3^3?"}
        ],
        reasoning_effort="high",  # Options: "high", "medium", "low", "none"
        max_tokens=8000
    )

    print(response.choices[0].message.content)
    ```

    <Note>
      For Gemini models (from Google Vertex AI or Google Gemini), TrueFoundry automatically translates the `reasoning_effort` parameter into Gemini's native `thinking` parameter format since Gemini doesn't support the `reasoning_effort` parameter directly.

      The translation uses the `max_tokens` parameter with the following ratios:

      * `none`: 0% of max\_tokens
      * `low`: 30% of max\_tokens
      * `medium`: 60% of max\_tokens
      * `high`: 90% of max\_tokens

      Note: Gemini 2.5 Pro and 2.5 Flash come with reasoning on by default.
    </Note>

    ### Using Direct API Calls with Native `thinking` Parameter

    For more precise control with Gemini models, you can use the native `thinking` parameter directly:

    ```python theme={"dark"}
    from openai import OpenAI

    client = OpenAI(
        api_key="TFY_API_KEY",
        base_url="{GATEWAY_BASE_URL}"
    )

    response = client.chat.completions.create(
        model="vertex-main/gemini-2-5-pro",
        messages=[
            {"role": "user", "content": "How to compute 3^3^3?"}
        ],
        max_tokens=8000,
        extra_body={
            "thinking": {
                "type": "enabled",
                "budget_tokens": 1000
            }
        }
    )

    print(response.choices[0].message.content)
    ```
  </Accordion>
</AccordionGroup>
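
The `reasoning_effort` ratios listed in the Anthropic and Gemini notes above amount to simple arithmetic on `max_tokens`. The sketch below reproduces that arithmetic; the `thinking_budget` helper is a hypothetical name, and rounding down to a whole token is an assumption, since the Gateway's exact rounding is not documented here.

```python
# Ratios from the Anthropic and Gemini notes, expressed as percentages.
EFFORT_PERCENT = {"none": 0, "low": 30, "medium": 60, "high": 90}


def thinking_budget(reasoning_effort, max_tokens):
    """Return the thinking token budget implied by reasoning_effort.

    Assumption: fractional tokens are rounded down.
    """
    percent = EFFORT_PERCENT[reasoning_effort]
    return max_tokens * percent // 100


for effort in ("none", "low", "medium", "high"):
    print(effort, thinking_budget(effort, 8000))
# none 0
# low 2400
# medium 4800
# high 7200
```

With `max_tokens=8000` and `reasoning_effort="high"`, as in the examples above, this leaves roughly 800 tokens for the visible answer, which is worth keeping in mind when choosing `max_tokens`.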

### Response Format

When reasoning tokens are enabled, the response includes both thinking and content sections:

```json theme={"dark"}
{
  "id": "1742890579083",
  "object": "chat.completion",
  "created": 1742890579,
  "model": "",
  "provider": "aws",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Therefore: 3^3^3 = 7,625,597,484,987",
        "reasoning_content": "Exponentiation is right-associative: compute 3^3 = 27, then 3^27..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 45,
    "completion_tokens": 180,
    "total_tokens": 225
  }
}
```
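
A minimal sketch of reading both sections from a response shaped like the one above. It works on the parsed JSON body; `split_reasoning` is a hypothetical helper name, not part of any SDK.

```python
import json


def split_reasoning(response):
    """Return (reasoning, answer) from a parsed chat completion response.

    reasoning is None when the model returned no reasoning_content.
    """
    message = response["choices"][0]["message"]
    return message.get("reasoning_content"), message.get("content")


raw = """
{
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Therefore: 3^3^3 = 7,625,597,484,987",
        "reasoning_content": "Exponentiation is right-associative: compute 3^3 = 27, then 3^27..."
      },
      "finish_reason": "stop"
    }
  ]
}
"""

reasoning, answer = split_reasoning(json.loads(raw))
print(answer)
```

When using the OpenAI Python SDK rather than raw JSON, `reasoning_content` is not a typed field on the message object, so reading it defensively, for example with `getattr(response.choices[0].message, "reasoning_content", None)`, is a reasonable approach.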

### Streaming with Reasoning Tokens

For streaming responses, the thinking section is always sent before the content section.
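
Because thinking arrives before content, a stream consumer can accumulate the two sections separately. The sketch below operates on already-parsed chunk dicts and assumes each streamed `delta` carries reasoning text under `reasoning_content`, mirroring the non-streaming field; that field name on deltas is an assumption, and `collect_stream` is a hypothetical helper.

```python
def collect_stream(chunks):
    """Accumulate reasoning and content text from streamed chunk dicts.

    Assumption: reasoning deltas use the key "reasoning_content".
    """
    reasoning_parts = []
    content_parts = []
    for chunk in chunks:
        choices = chunk.get("choices") or []
        if not choices:
            continue  # e.g. a trailing usage-only chunk
        delta = choices[0].get("delta") or {}
        if delta.get("reasoning_content"):
            reasoning_parts.append(delta["reasoning_content"])
        if delta.get("content"):
            content_parts.append(delta["content"])
    return "".join(reasoning_parts), "".join(content_parts)


# Hand-written sample chunks for illustration; not captured Gateway output.
sample_chunks = [
    {"choices": [{"delta": {"role": "assistant"}}]},
    {"choices": [{"delta": {"reasoning_content": "3^3 = 27, "}}]},
    {"choices": [{"delta": {"reasoning_content": "then 3^27."}}]},
    {"choices": [{"delta": {"content": "3^3^3 = "}}]},
    {"choices": [{"delta": {"content": "7,625,597,484,987"}}]},
    {"choices": []},
]

reasoning, content = collect_stream(sample_chunks)
print(reasoning)
print(content)
```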
