Compaction reduces context size while preserving state for the next turn, so you can balance quality, cost, and latency as conversations grow. The gateway supports standalone POST /responses/compact and server-side compaction via context_management in POST /responses.
Compaction is supported only for OpenAI models; use the gateway’s OpenAI inference base URL.

Standalone: POST /responses/compact

Send a full context window; the API returns a compacted window (including an opaque, encrypted compaction item) to pass as input on your next /responses call. The request body takes model and input, plus optional instructions and previous_response_id.
Do not prune the compact response. Pass the full output into your next /responses call as-is.

Code

from openai import OpenAI

client = OpenAI(
    api_key="your-tfy-api-key",
    base_url="https://{controlPlaneUrl}/api/llm",
)

# Compact the accumulated context. `long_input_items_array` is a placeholder
# for your full input-item history for this conversation.
compacted = client.responses.compact(
    model="openai-main/gpt-4o",
    input=long_input_items_array,
)

# Pass the compacted output through unmodified, then append the next user turn.
next_input = [
    *compacted.output,
    {"type": "message", "role": "user", "content": user_input_message()},
]

next_response = client.responses.create(
    model="openai-main/gpt-4o",
    input=next_input,
    store=False,
)

Response shape

{
  "id": "resp_compact_123",
  "object": "response.compaction",
  "created_at": 1234567890,
  "output": [
    { "type": "compaction", "encrypted_content": "..." }
  ],
  "usage": {
    "input_tokens": 15000,
    "output_tokens": 1200,
    "total_tokens": 16200
  }
}
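As a sanity check on the shape above, you can inspect the compact result before chaining it. A minimal sketch, assuming the response is available as a plain dict with the fields shown (the helper name `summarize_compaction` is hypothetical):

```python
def summarize_compaction(resp: dict) -> dict:
    """Count compaction items and compute the token reduction from usage."""
    # Compaction items carry the encrypted, opaque compacted state.
    items = [o for o in resp["output"] if o.get("type") == "compaction"]
    usage = resp["usage"]
    return {
        "compaction_items": len(items),
        # input_tokens is the pre-compaction size; output_tokens the compacted size.
        "tokens_saved": usage["input_tokens"] - usage["output_tokens"],
    }
```

Applied to the example response above, this reports one compaction item and a saving of 13,800 tokens.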

Server-side: POST /responses with context_management

Set context_management: [{"type": "compaction", "compact_threshold": 200000}] on create. When the rendered token count crosses the threshold, the server compacts the context and emits a compaction item in the stream; no separate /responses/compact call is needed.
  • Stateless: Append response output (including compaction items) to your input each turn.
  • Stateful: Use previous_response_id and send only the new user message; do not manually prune.
With stateless chaining, you can drop items that came before the most recent compaction item to keep requests smaller. With previous_response_id, do not prune.
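With stateful chaining, each turn's request reduces to a small payload. A minimal sketch of building that request body (the helper name `next_turn_request` is hypothetical; field names follow the parameters described above):

```python
def next_turn_request(model: str, previous_response_id: str, user_text: str) -> dict:
    """Build the request body for one stateful turn: only the new user
    message plus previous_response_id; the server holds the history and
    compacts it when the threshold is crossed."""
    return {
        "model": model,
        "previous_response_id": previous_response_id,
        "input": [{"type": "message", "role": "user", "content": user_text}],
        "context_management": [
            {"type": "compaction", "compact_threshold": 200000}
        ],
    }
```

The resulting dict can be splatted into `client.responses.create(**body)`.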

Code

conversation = [
    {"type": "message", "role": "user", "content": "Let's begin a long coding task."},
]

while keep_going:  # your own loop condition
    response = client.responses.create(
        model="openai-main/gpt-4o",
        input=conversation,
        store=False,
        context_management=[{"type": "compaction", "compact_threshold": 200000}],
    )
    # Append everything the server returned, including any compaction items,
    # then the next user turn.
    conversation.extend(response.output)
    conversation.append(
        {"type": "message", "role": "user", "content": get_next_user_input()},
    )
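In the stateless loop above, items that precede the most recent compaction item can be dropped to keep requests small. A minimal pruning helper (the name `prune_before_last_compaction` is hypothetical), operating on the conversation as a list of plain item dicts:

```python
def prune_before_last_compaction(conversation: list[dict]) -> list[dict]:
    """Keep only items from the most recent compaction item onward.

    If no compaction item is present yet, the conversation is returned
    unchanged -- pruning is only safe once a compaction item exists,
    because it encapsulates the earlier history.
    """
    # Scan from the end to find the latest compaction item.
    for i in range(len(conversation) - 1, -1, -1):
        if conversation[i].get("type") == "compaction":
            return conversation[i:]
    return conversation
```

Call this on `conversation` between turns; never apply it when chaining with previous_response_id.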

References