Exporting TrueFoundry AI Gateway Traces to Middleware via OpenTelemetry

TrueFoundry AI Gateway exports OpenTelemetry spans to Middleware over OTLP/HTTP, posting protobuf-encoded trace payloads to your tenant's ingest endpoint at https://<your-domain>.middleware.io:443/v1/traces with a raw API key in the Authorization header. Every inference request that passes through the gateway generates a structured set of spans, which Middleware ingests alongside the rest of your infrastructure and application telemetry. No code changes are required in the applications calling the gateway, and no sidecar is deployed alongside the gateway pod.
This post covers the trace generation path inside the TrueFoundry AI Gateway, how spans are published asynchronously without affecting request latency, what Middleware does with the incoming OTLP data and how it correlates gateway spans with broader system telemetry, and the configuration surface where the two systems connect.
How the TrueFoundry AI Gateway Handles Observability
The TrueFoundry AI Gateway is built on the Hono framework and runs as stateless gateway pods. A single pod on 1 vCPU and 1 GB RAM handles 250+ RPS and adds approximately 3 ms of latency to each inference request. The gateway uses a split architecture: a control plane manages configuration, backed by PostgreSQL, ClickHouse, and NATS, while the gateway plane processes actual inference requests using configuration synced into memory via NATS.
The gateway is OpenTelemetry compliant and generates spans across the full lifecycle of every request: the inbound HTTP handler, the authentication step, the model resolution step, the outbound provider call, and the streaming response. Each span carries a standard set of attributes: tfy.input, tfy.output, and tfy.input_short_hand for prompt and completion content, alongside the gen_ai.* attribute family covering token counts, model name, and finish reason. These conform to the OpenTelemetry semantic conventions for generative AI workloads and land in Middleware as structured span data that can be queried and correlated against other service spans.
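As an illustration of the attribute families above, a provider-call span might carry metadata like the following. The tfy.* and gen_ai.* key names come from the text; the values are hypothetical examples, not captured data.

```python
# Illustrative attribute set for a gateway provider-call span.
# Values are made up for the example; only the key names reflect
# the attribute families described above.
span_attributes = {
    "tfy.input": "Summarize the following document: ...",
    "tfy.output": "The document describes ...",
    "gen_ai.request.model": "gpt-4o",
    "gen_ai.usage.input_tokens": 412,
    "gen_ai.usage.output_tokens": 96,
    "gen_ai.response.finish_reasons": ["stop"],
}

# Middleware can filter traces on any of these keys, e.g. spans
# whose completion exceeded a token threshold:
high_output = span_attributes["gen_ai.usage.output_tokens"] > 50
print(high_output)  # True
```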
The gateway's core design principle is that no external calls happen in the request path. JWT tokens are validated against cached public keys. Authorization is checked against an in-memory map of users to models synced via NATS. Model routing decisions run entirely in memory. The only external call in the critical path is the LLM provider call itself. This is what makes the OTEL integration non-invasive: after the response is returned to the client the gateway publishes the trace event to NATS asynchronously. A dedicated OTEL exporter process reads from the NATS async path and forwards the serialized spans over OTLP/HTTP to the configured endpoint. The gateway never blocks or fails a request because the external OTEL endpoint is unreachable or slow. Export failure is isolated entirely outside the request path.
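The request-path / export-path split can be sketched in a few lines. This is a minimal illustration, not TrueFoundry's actual code: an in-memory queue stands in for the NATS subject, and appending to a list stands in for the OTLP/HTTP POST.

```python
import queue
import threading

# Stand-in for the NATS subject the gateway publishes trace events to.
trace_bus = queue.Queue()
exported = []

def handle_request(prompt):
    # Request path: no external telemetry calls. The trace event is
    # enqueued after the response is ready, so a slow or unreachable
    # OTLP endpoint can never block or fail the caller.
    response = f"completion for {prompt!r}"
    trace_bus.put({"tfy.input": prompt, "tfy.output": response})
    return response

def exporter_loop():
    # Export path: a separate consumer drains the bus and forwards
    # spans over OTLP/HTTP. Failures here stay outside the request path.
    while True:
        event = trace_bus.get()
        if event is None:  # shutdown sentinel
            break
        exported.append(event)  # stand-in for the OTLP/HTTP POST

t = threading.Thread(target=exporter_loop)
t.start()
resp = handle_request("hello")
trace_bus.put(None)
t.join()
print(len(exported))  # 1
```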
The exporter is additive: it runs in parallel with TrueFoundry's own internal trace storage and does not replace it. Teams that need to suppress prompt and completion content before it leaves their environment can enable the Exclude Request Data toggle, which strips tfy.input, tfy.output, and tfy.input_short_hand from span attributes before the exporter serializes the payload. Token counts, latency, and model metadata are retained regardless of the toggle state.
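The toggle's effect amounts to dropping the content-bearing keys while leaving everything else intact. A hypothetical sketch of that filtering (the field list comes from the text; the function itself is illustrative, not the gateway's implementation):

```python
# Content-bearing attributes named in the text; stripped when the
# Exclude Request Data toggle is on.
CONTENT_KEYS = {"tfy.input", "tfy.output", "tfy.input_short_hand"}

def scrub(attributes, exclude_request_data):
    """Return a copy of the span attributes, minus prompt/completion
    content when exclude_request_data is True."""
    if not exclude_request_data:
        return dict(attributes)
    return {k: v for k, v in attributes.items() if k not in CONTENT_KEYS}

span = {
    "tfy.input": "secret prompt",
    "tfy.output": "secret completion",
    "gen_ai.usage.input_tokens": 12,
    "gen_ai.request.model": "gpt-4o",
}
# Token counts and model metadata survive; content does not.
print(scrub(span, exclude_request_data=True))
```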
What Middleware Does with the Traces
Middleware is a full-stack observability platform, founded in 2021, built on OpenTelemetry as its core instrumentation standard. It accepts telemetry through standard OTLP endpoints: OTLP/HTTP at https://<uid>.middleware.io:443/v1/traces and OTLP/gRPC on the same host at port 443. The TrueFoundry docs page for this integration uses the HTTP path with protobuf encoding. Both paths accept the same span data model and land in the same Middleware backend.
Middleware's backend stores traces alongside logs, infrastructure metrics, and real user monitoring data in a single correlated data layer. The key architectural property here is that a span arriving from the TrueFoundry gateway can be linked directly to the infrastructure signals from the host or cluster where the gateway pod runs. An engineer investigating a latency spike in a gateway span can navigate from the trace view to the infrastructure metrics for that pod without switching dashboards. This cross-signal correlation is what distinguishes Middleware from a standalone trace collector: the OTLP endpoint is the entry point, and the APM trace view is one surface in a broader correlated observability environment.
Within the trace view Middleware renders spans in a timeline with parent and child relationships intact. For gateway spans this means the full request hierarchy is visible: inbound handler at the root and authentication and model resolution and provider call as child spans below it. The gen_ai.* attributes surface as queryable metadata on each span allowing teams to filter traces by model name or token count ranges or finish reason across the entire trace history. Middleware applies no mandatory sampling at ingest on its cloud tier: all spans are stored and queryable.
Middleware also exposes a unified APM dashboard where service-level metrics derived from trace data are displayed alongside infrastructure signals. Gateway traces arriving via OTLP contribute to the service topology map that Middleware builds from incoming span data. If the gateway is instrumented with a service.name resource attribute the gateway will appear as a node in the service map with latency and error rate computed from its spans. This makes the gateway visible as a first-class service in Middleware's topology alongside the application services that call it.
For teams that route traffic through the Middleware OpenTelemetry Collector before reaching the Middleware backend the gateway's OTLP output can also be directed at a self-managed collector. The collector can apply tail sampling and attribute enrichment and batching before forwarding to the Middleware OTLP endpoint. The TrueFoundry gateway exporter speaks standard OTLP so the intermediate collector sees the same payload it would receive from any OTel-instrumented service.
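Under those assumptions, a minimal collector pipeline might look like the sketch below. The endpoint prefix and API key are placeholders, and tail sampling or attribute enrichment would be added as extra processors in a contrib build of the collector.

```yaml
# Hypothetical OpenTelemetry Collector config: receive OTLP from the
# gateway, batch, and forward to Middleware's OTLP/HTTP endpoint.
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch: {}
exporters:
  otlphttp:
    endpoint: https://<your-domain>.middleware.io:443
    headers:
      Authorization: <YOUR_MIDDLEWARE_API_KEY>
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
```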
The Integration Surface
The TrueFoundry OTEL exporter is configured in the dashboard under AI Engineering → Settings → OTEL Config in the Organisation section. The form exposes an HTTP Configuration mode and a gRPC mode. The Middleware integration uses the HTTP path with protobuf encoding. The configuration fields are:
- Traces Endpoint: https://<your-domain>.middleware.io:443/v1/traces
- Protocol: HTTP
- Encoding: Proto
- Authorization Header: <YOUR_MIDDLEWARE_API_KEY>
The <your-domain> prefix in the endpoint URL is the tenant identifier Middleware assigns to your account. It is the same prefix used across all Middleware ingest endpoints for your organization. The Authorization header takes the raw API key with no Bearer prefix: Middleware's OTLP ingest endpoint validates the key directly as the header value without a scheme prefix.
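The header convention is easy to get wrong, so here is a stdlib-only sketch of the export request shape, assuming a protobuf-encoded OTLP payload is already in hand. The domain ("acme") and key are hypothetical placeholders.

```python
import urllib.request

def build_export_request(payload: bytes, domain: str, api_key: str):
    """Build the OTLP/HTTP POST described above. Note the Authorization
    header carries the raw key, with no "Bearer " scheme prefix."""
    url = f"https://{domain}.middleware.io:443/v1/traces"
    return urllib.request.Request(
        url,
        data=payload,
        method="POST",
        headers={
            "Content-Type": "application/x-protobuf",  # proto encoding
            "Authorization": api_key,  # raw key, no scheme prefix
        },
    )

# Placeholder payload/tenant/key for illustration only.
req = build_export_request(b"\x00", "acme", "mw_example_key")
print(req.full_url)                      # https://acme.middleware.io:443/v1/traces
print(req.get_header("Authorization"))   # mw_example_key
```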
The Middleware API key used for trace ingestion is generated in the Middleware project or organization settings under the API keys or Ingestion Keys section. Once generated, the full secret value should be stored securely, as Middleware does not expose it again after initial creation. The key is scoped to your Middleware account, and all traces exported with it appear in the project associated with that key.
Resource attributes set on the TrueFoundry gateway deployment flow through to Middleware as span-level metadata. The service.name attribute in particular controls how the gateway appears in Middleware's service map and APM views. If multiple gateway deployments export to the same Middleware project each should carry a distinct service.name so traces from different deployments are distinguishable in the topology view.
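As a small illustration of that naming rule, here are resource attributes for two hypothetical deployments exporting to the same project; distinct service.name values keep their traces separable in the service map.

```python
# Hypothetical resource attributes for two gateway deployments.
# Names are illustrative; only service.name's role comes from the text.
deployments = [
    {"service.name": "ai-gateway-prod", "deployment.environment": "production"},
    {"service.name": "ai-gateway-staging", "deployment.environment": "staging"},
]

# A collision here would merge both deployments into one topology node.
names = {d["service.name"] for d in deployments}
print(len(names) == len(deployments))  # True: each deployment is distinct
```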
For organizations operating in network-restricted environments where direct OTLP egress to middleware.io is not permitted the gateway exporter can be pointed at a local OpenTelemetry Collector instance instead. The collector then forwards to Middleware's endpoint after applying any network egress rules or attribute processing the environment requires. This indirect path uses the same OTLP wire format throughout and requires no changes to the gateway configuration other than the endpoint URL.
Architecture Summary
When a request passes through the TrueFoundry AI Gateway the gateway produces an OTEL trace covering authentication and model resolution and the provider call. After the response is returned to the client the gateway publishes the trace event to NATS. The OTEL exporter process reads from NATS and serializes the spans as a protobuf-encoded OTLP/HTTP payload and sends it to https://<your-domain>.middleware.io:443/v1/traces with the Middleware API key in the Authorization header. Middleware receives the payload at its OTLP ingest layer and stores the spans in its correlated telemetry backend where they become queryable alongside logs and infrastructure metrics and APM data for the rest of the stack.
This path requires no application-side instrumentation changes. No sidecar runs alongside the gateway pod. No SDK is installed on the calling application. The gateway generates and exports spans entirely from within its own process using the async NATS publish path so the OTEL export adds zero latency to inference requests.
The architectural principle that makes this integration clean is the strict separation between the request path and the telemetry export path inside the gateway. The async NATS bus decouples span production from span delivery. The OTEL exporter is a subscriber on that bus and can be pointed at any OTLP-compatible endpoint by changing a single configuration field. Middleware's OTLP endpoint accepts the gateway's standard span output without any transformation. The result is that gateway traffic becomes visible inside Middleware's full-stack observability environment from the moment the exporter configuration is saved.