
Cartesia and TrueFoundry AI Gateway: Native Passthrough for Voice Inference


Cartesia's Sonic-3 text-to-speech model and Ink-Whisper streaming speech-to-text model integrate with TrueFoundry AI Gateway through a native passthrough surface. Requests flow to Cartesia's /tts/bytes HTTP endpoint, the /tts/sse server-sent-events stream, the /tts/websocket bidirectional WebSocket, and the Ink streaming WebSocket with their original protocol semantics intact. The gateway injects the Cartesia API key from its central credential store, enforces access control, and emits OpenTelemetry spans before forwarding the connection.

This post explains why voice inference providers cannot use the same OpenAI-compatible translation pattern that the gateway applies to chat completion providers, how the gateway plane handles native passthrough inside the existing Hono request pipeline, Cartesia's API surface across both TTS and STT, and the configuration shape and end-to-end data flow.

Why voice providers do not use the OpenAI translation path

Most TrueFoundry AI Gateway integrations operate on a translation principle. A request arrives in OpenAI compatible format on /chat/completions or /embeddings or /responses. The gateway resolves the model identifier to a provider endpoint and translates the request into that provider's native shape via an adapter. Anthropic gets translated to the Messages API. Google Vertex gets translated to the Generative Language API. Cohere gets translated to its native chat schema. The response comes back and gets translated in reverse so the caller sees a uniform OpenAI shape regardless of which physical provider served the request.

This pattern works because chat completion semantics are roughly equivalent across providers. There is a list of messages and a model identifier and sampling parameters and a streaming flag and a response with tool calls and finish reasons. The differences are real but narrow and can be reconciled inside an adapter.

Voice inference does not fit that mold. Cartesia's TTS API has parameters that have no equivalent in the OpenAI Audio API. The voice field accepts a Cartesia voice ID or a voice embedding. The output_format block specifies container and encoding and sample rate as a structured object. The language field selects between 42 supported languages. The __experimental_controls block carries speed and emotion parameters that map to Sonic-3's expressive controls. The WebSocket protocol introduces multiplexed contexts and flush_id boundaries and continuation semantics for streaming text input from an upstream LLM. None of this exists in the OpenAI /v1/audio/speech shape.
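The gap can be made concrete by building a /tts/bytes request payload. The sketch below is illustrative: the field names follow the Cartesia TTS API as described above, but the specific control values inside __experimental_controls are assumptions, and the voice ID is a placeholder. None of these fields has a slot in the OpenAI /v1/audio/speech shape.

```python
# Hypothetical helper showing the Cartesia-specific request shape.
# Field names mirror the Cartesia TTS API; values are placeholders.
def build_tts_payload(transcript: str, voice_id: str) -> dict:
    return {
        "model_id": "sonic-3",
        "transcript": transcript,
        # A Cartesia voice ID (or embedding) -- no OpenAI Audio equivalent.
        "voice": {"mode": "id", "id": voice_id},
        # Structured output format: container, encoding, and sample rate.
        "output_format": {
            "container": "wav",
            "encoding": "pcm_f32le",
            "sample_rate": 44100,
        },
        "language": "en",
        # Expressive controls for Sonic-3; exact values are illustrative.
        "__experimental_controls": {"speed": "normal"},
    }

payload = build_tts_payload("Hello there.", "<voice-id>")
```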

The Ink-Whisper STT path is similar. The streaming WebSocket protocol passes audio frames in real time and emits interim transcripts and final transcripts as the model performs dynamic chunking on semantically meaningful boundaries. The OpenAI /v1/audio/transcriptions endpoint is a request response file upload with no streaming counterpart in the official spec.

Translating this surface would either drop capability or introduce lossy mappings. The gateway therefore exposes Cartesia through native passthrough. The caller continues to use the official Cartesia Python SDK or any other Cartesia client with its full feature set. The gateway sits in the path as a credential and policy and observability boundary rather than as a protocol translator.

How native passthrough works inside the gateway plane

The TrueFoundry AI Gateway is built on the Hono framework. A single gateway pod with 1 vCPU and 1 GB RAM handles 250+ RPS while adding approximately 3 ms of latency. Pods are stateless and CPU-bound and scale horizontally to tens of thousands of RPS. The gateway plane and the control plane are split: the control plane manages configuration in PostgreSQL and ClickHouse and propagates updates over NATS, and gateway pods cache that configuration in memory.

When a Cartesia request hits a gateway pod, it runs through the same pre-forwarding pipeline as a chat completion. The JWT presented on the request is validated against cached IdP public keys with no external auth call. Authorization is checked against the in-memory map of users to models that NATS keeps synchronized. The routing layer resolves the model identifier (such as sonic-3 or ink-whisper) to the provider endpoint configured for that model and to the Cartesia account credentials stored in the control plane. The request body, path, and query parameters are not rewritten. Only the Authorization and X-API-Key headers are stripped from the inbound request and replaced with the Cartesia API key from the secure credential store. The forwarded URL becomes the Cartesia origin (https://api.cartesia.ai/...) with the matching path and method preserved. The body is streamed through unchanged.
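The credential-swap step can be sketched as a pure function. This is an illustration of the behaviour described above, not the gateway's actual implementation (which runs inside Hono); the header names match the text, and the key value stands in for whatever the control plane's credential store resolves.

```python
# Illustrative credential swap: strip the caller's auth headers and
# inject the provider key. Header matching is case-insensitive.
def swap_credentials(inbound_headers: dict, cartesia_key: str) -> dict:
    forwarded = {
        k: v
        for k, v in inbound_headers.items()
        if k.lower() not in ("authorization", "x-api-key")
    }
    # Inject the Cartesia API key resolved from the secure store.
    forwarded["X-API-Key"] = cartesia_key
    return forwarded

headers = swap_credentials(
    {"Authorization": "Bearer <tfy-jwt>", "Content-Type": "application/json"},
    "<cartesia-key-from-store>",
)
```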

For the WebSocket endpoints (wss://api.cartesia.ai/tts/websocket and the Ink streaming endpoint) the gateway performs an HTTP Upgrade handshake. After the upgrade succeeds the gateway holds two WebSocket connections (one with the client and one with Cartesia) and proxies frames in both directions. The multiplexed context model that Cartesia exposes is preserved because the gateway does not interpret the frame payloads. A client that opens a single WebSocket and runs dozens of concurrent generations against different context_id values sees the same behaviour through the gateway as it would talking to Cartesia directly.
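The frame-bridging behaviour can be sketched with asyncio queues standing in for the two WebSocket connections (client side and Cartesia side). This is a minimal model of the relay described above, not the gateway's code: frames pass through opaquely in both directions, with None as a stand-in close sentinel.

```python
import asyncio


async def pump(src: asyncio.Queue, dst: asyncio.Queue) -> None:
    # Copy frames opaquely until a close sentinel (None) passes through.
    while True:
        frame = await src.get()
        await dst.put(frame)
        if frame is None:
            return


async def bridge(client_rx, client_tx, upstream_rx, upstream_tx) -> None:
    # Relay both directions concurrently until each side closes.
    await asyncio.gather(
        pump(client_rx, upstream_tx),
        pump(upstream_rx, client_tx),
    )


async def demo() -> tuple:
    client_rx, client_tx, upstream_rx, upstream_tx = (
        asyncio.Queue() for _ in range(4)
    )
    # Client sends one frame then closes; upstream replies with one chunk.
    for frame in (b'{"transcript": "hi"}', None):
        await client_rx.put(frame)
    for frame in (b"audio-chunk", None):
        await upstream_rx.put(frame)
    await bridge(client_rx, client_tx, upstream_rx, upstream_tx)
    return await upstream_tx.get(), await client_tx.get()


forwarded, returned = asyncio.run(demo())
```

Because the relay never parses payloads, multiplexed context_id traffic inside the frames is invisible to it, which is exactly why the context model survives the gateway untouched.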

The asynchronous trace publication path that the gateway uses for chat completions also runs for Cartesia traffic. The gateway emits spans for the inbound HTTP handler and the credential resolution and the outbound provider call (or WebSocket session). For TTS requests these spans carry duration and status and the resolved model name and a hash of the transcript. For STT sessions the span captures the connection lifetime and the message count. Spans are published asynchronously to NATS after the request completes. The OpenTelemetry exporter reads from the async path and forwards traces to the configured backend (gRPC or HTTP). Export is additive and does not change the gateway's own trace storage. The gateway never fails a Cartesia request even if the external OTEL endpoint is unreachable.

The cost tracking pipeline also runs in passthrough mode. Cartesia bills on credits which translate to characters synthesized for TTS and seconds transcribed for STT. The gateway records the request size and response duration metadata and publishes these to the same NATS event bus that aggregates chat completion cost data. The aggregator service computes per user and per team and per model rollups that show up in the unified analytics view alongside chat traffic.
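The aggregation step can be illustrated with a toy rollup over usage events. The event shape and unit fields here are assumptions for the sketch, not the gateway's actual event schema; the unit basis (characters for TTS, seconds for STT) follows Cartesia's credit model as described above.

```python
from collections import defaultdict


def rollup(events: list) -> dict:
    # Aggregate usage per (user, model). Units are characters synthesized
    # for TTS models and seconds transcribed for STT models.
    totals: dict = defaultdict(float)
    for event in events:
        totals[(event["user"], event["model"])] += event["units"]
    return dict(totals)


events = [
    {"user": "ana", "model": "sonic-3", "units": 120},       # characters
    {"user": "ana", "model": "sonic-3", "units": 80},        # characters
    {"user": "ben", "model": "ink-whisper", "units": 12.5},  # seconds
]
totals = rollup(events)
```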

What Cartesia exposes

Cartesia builds voice models on a state space model architecture. The TTS family is named Sonic and the current production model is Sonic-3. The STT family is named Ink and the current production model is Ink-Whisper.

Sonic-3 is a streaming TTS model with a published time to first audio of approximately 90 ms. It supports 42 languages. It exposes fine grained controls on volume and speed and emotion through API parameters and SSML tags. It supports laughter through [laughter] inline tags. The model is exposed through three endpoint shapes that suit different use cases.

The first is POST /tts/bytes. This is a synchronous batch endpoint that returns the entire audio file in the response body. It accepts MP3 or WAV or raw PCM output formats and is suited to pre generating audio assets where the full latency of waiting for the complete output is acceptable.

The second is POST /tts/sse. This is a server sent events stream. The model emits audio chunks progressively as they are generated. This suits applications that play audio progressively and need the time to first byte advantage but do not need to stream input text into the model.
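A client consuming /tts/sse parses standard SSE framing and reassembles the audio. The sketch below uses the standard "data: " line framing; the JSON field names ("data" for a base64 audio chunk, "done" as an end-of-stream marker) are assumptions about the payload shape, not a verified Cartesia schema.

```python
import base64
import json


def parse_sse_audio(raw: str) -> bytes:
    # Standard SSE framing: payload lines start with "data: ".
    audio = bytearray()
    for line in raw.splitlines():
        if not line.startswith("data: "):
            continue
        event = json.loads(line[len("data: "):])
        if event.get("done"):  # assumed end-of-stream marker
            break
        audio.extend(base64.b64decode(event["data"]))  # assumed chunk field
    return bytes(audio)


stream = (
    'data: {"data": "' + base64.b64encode(b"chunk-1").decode() + '"}\n\n'
    'data: {"done": true}\n\n'
)
audio = parse_sse_audio(stream)
```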

The third is WSS /tts/websocket. This is the recommended endpoint for real time voice agents. The connection is bidirectional and supports multiplexed generations through the context_id field. A single open WebSocket can carry dozens of concurrent generations. The context_id allows continuation generation where additional text segments can be pushed into an existing context to maintain prosody across the joins. This matters when the upstream text source is an LLM streaming token by token and the TTS needs to follow the cadence of the text generation. The WebSocket protocol also supports manual flushing through flush_id markers which create discrete audio boundaries within a single context.
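The continuation mechanics can be sketched as the message shape a client sends on the multiplexed WebSocket. The field names follow the protocol described above, but treat the details ("continue" as the more-text-follows flag) as assumptions rather than a verified schema.

```python
# Hypothetical message builder for the multiplexed TTS WebSocket.
def continuation_message(context_id: str, text: str, final: bool) -> dict:
    return {
        "model_id": "sonic-3",
        "transcript": text,
        "context_id": context_id,  # ties this segment to an open generation
        "continue": not final,     # assumed flag: more text will follow
    }


# An LLM streaming text can push segments into one context so prosody
# is maintained across the joins.
first = continuation_message("ctx-1", "The road goes ", final=False)
last = continuation_message("ctx-1", "ever on and on.", final=True)
```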

Ink-Whisper is a streaming STT model derived from whisper-large-v3-turbo and re-engineered for real-time conversational use. The defining metric is time to complete transcript, which measures how quickly the final accurate transcript is ready after the user stops speaking. Ink-Whisper achieves this through dynamic chunking. Standard Whisper performs best on fixed 30-second audio buffers and so introduces a fundamental latency floor unsuited to live conversation. Ink-Whisper analyses the audio stream for semantically meaningful break points such as pauses and breaths and processes shorter chunks as they form. The endpoint is a streaming WebSocket that accepts PCM audio frames at 16 kHz and emits interim and final transcripts as the model commits to them. The default audio encoding is pcm_s16le at 16000 Hz.

Cartesia disconnects WebSocket connections after 3 minutes of inactivity. The timeout resets with each frame sent in either direction. Clients typically run silence based keepalives to hold the connection open across utterance gaps.
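A silence-based keepalive is simple to derive from the default encoding: pcm_s16le at 16 kHz means 2 bytes per sample and 16 samples per millisecond, so a silent frame is just zeros of the right length. The helper below is a sketch of generating such frames; the send cadence a client chooses is up to the application.

```python
def silent_frame(duration_ms: int, sample_rate: int = 16000) -> bytes:
    # pcm_s16le: 16-bit little-endian samples, 2 bytes each; zeros = silence.
    samples = sample_rate * duration_ms // 1000
    return b"\x00\x00" * samples


# 100 ms of silence at 16 kHz: 1600 samples, 3200 bytes.
frame = silent_frame(100)
```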

The integration surface

Adding Cartesia to TrueFoundry AI Gateway takes three steps in the dashboard. Navigate to AI Gateway, then Models, and select Cartesia. Add a Cartesia account by entering a unique account name and the Cartesia API key; the key is stored encrypted in the control plane and is never exposed to the gateway pods directly. Optionally add collaborators, which controls which users and teams can route traffic through this account. Then register one or more models by clicking Add Model and providing a Display name, a Model ID, and a Model type. For Cartesia the Model ID and Display name must be identical and must match the Cartesia model identifier exactly (sonic-3, sonic-3-2026-01-12, ink-whisper, and so on).

The configuration surface for a Cartesia account is small.

Field           Value
Account name    Unique identifier scoped to the workspace
API Key         Cartesia API key from the Cartesia dashboard
Collaborators   Users and teams permitted to route through this account

The configuration surface for a Cartesia model is similarly small.

Field           Value
Display name    Must equal Model ID
Model ID        Cartesia model identifier (for example sonic-3 or ink-whisper)
Model type      Selected from the supported voice model types

Inference uses the Cartesia native SDK with the gateway URL substituted as the base URL. A Python client looks like the following.

import os
from cartesia import Cartesia

# Point the official Cartesia SDK at the gateway; the TrueFoundry-issued
# JWT takes the place of the Cartesia API key.
client = Cartesia(
    api_key=os.environ["TFY_API_KEY"],
    base_url="https://<your-gateway-host>/cartesia",
)

# Synchronous batch synthesis via POST /tts/bytes.
response = client.tts.bytes(
    model_id="sonic-3",
    transcript="The road goes ever on and on.",
    voice={"mode": "id", "id": "6ccbfb76-1fc6-48f7-b71d-91ac6298247b"},
    output_format={"container": "wav", "encoding": "pcm_f32le", "sample_rate": 44100},
)

The same SDK calls work for the WebSocket endpoint and for the Ink-Whisper STT WebSocket. The TrueFoundry issued JWT replaces the Cartesia API key in the SDK configuration. The SDK believes it is talking to Cartesia directly because the gateway preserves the URL paths and the response shapes. Cost and access control and tracing all happen invisibly in the request path.

Architecture summary

The end to end data flow is straightforward. A client opens an HTTP request or a WebSocket against the gateway URL using the Cartesia SDK. The gateway pod authenticates the JWT against cached IdP public keys and resolves the model identifier to the configured Cartesia account. It strips the inbound auth header and substitutes the Cartesia API key from the credential store. It forwards the request or upgrades the WebSocket to https://api.cartesia.ai. For WebSocket sessions it bridges frames in both directions until either side closes the connection. After the request completes the gateway publishes a span to NATS which feeds the OTEL exporter and the cost aggregator.

What is not required is significant. There is no Cartesia SDK fork. There is no shadow translation layer that flattens TTS parameters into OpenAI Audio shape and loses the voice ID and the streaming context model in the process. There is no separate trace pipeline for voice traffic and a different one for chat traffic. There is no per service API key distributed across application code. There is no client side WebSocket terminator that has to be deployed separately to apply access control to the streaming endpoints.

The architectural principle that makes this work is the separation between protocol semantics and governance semantics. The Cartesia protocol carries voice domain meaning that does not generalize cleanly to other providers. The governance layer (authentication and authorization and credential injection and observability and cost tracking) is provider agnostic and can run in front of any HTTP or WebSocket origin without inspecting the payload. Native passthrough preserves the first while applying the second. The result is that Cartesia's full feature surface (Sonic-3's contexts and continuations and emotion controls and Ink-Whisper's streaming transcript flow) is available to clients while the operational guarantees that the rest of the AI Gateway provides for chat traffic apply to voice traffic on the same gateway pods with the same control plane and the same trace and cost backends.
