Intelligent LLM Routing: Cost-, Latency-, and Quality-Aware Model Selection at the Gateway

By Boyu Wang

Updated: June 8, 2026

A 2026 application doesn't talk to one model; it talks to a menu of them, spanning frontier, mid-tier, cheap, and self-hosted. Routing is the policy that picks one per request, navigating three goals that pull against each other: cost, latency, and quality. This post walks through the routing strategies, from static rules to semantic routing and model cascades; the hard problem of measuring the quality you want to route on; why routing is not failover; and the instrumentation that keeps a router from quietly betraying you.

Key Takeaways

  • A modern app routes across a menu of models — frontier (Opus 4.8, GPT-5.5), mid (Sonnet 4.6, GPT-5.4), cheap (Haiku 4.5), and self-hosted — along three competing axes: cost, latency, and quality/task-fit. Sending everything to the best model is the most expensive option and often overkill.
  • Routing strategies form a ladder of increasing complexity: static rules, weighted splits, latency-aware, cost-aware, semantic routing, and model cascades. Each rung should be justified by measured benefit, not adopted for its own sake.
  • Semantic routing embeds the request and routes by inferred intent. The embedding/classifier step is small in our examples (treated as ~5–20 ms; measure it in your own path), and it only beats a static task tag when the caller can't cheaply label the request itself.
  • Model cascades (cheap-first, escalate-on-failure) can cut blended cost substantially when most traffic is resolved by the cheap tier — but the escalation rate is a live cost variable. A drifting verifier can silently escalate everything.
  • Routing "on quality" requires a way to measure quality: offline eval sets, online LLM-as-judge, or A/B against business metrics. Routing on vibes is how a quality regression ships unnoticed.
  • Routing (optimization) is not failover (availability). They share machinery — a candidate list and a policy — but conflating "pick the cheapest" with "survive an outage" causes incidents.
  • The gateway is the natural decision point: it normalizes provider APIs and already holds the cost/latency telemetry routing depends on. TrueFoundry's AI Gateway exposes routing rules, weighted load balancing, and fallback chains across hosted and self-hosted models, with per-route observability.
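
The cascade takeaway above can be made concrete with a little arithmetic. The sketch below assumes the simplest cost model: every request pays for the cheap attempt, and the escalated fraction additionally pays for the frontier call. The per-request prices are made-up illustrative numbers, not real provider pricing, and `blended_cost` is a hypothetical helper name.

```python
def blended_cost(cheap_cost: float, frontier_cost: float, escalation_rate: float) -> float:
    """Expected cost per request for a cheap-first cascade.

    Assumption: every request pays for the cheap attempt, and a fraction
    `escalation_rate` of requests also pays for the frontier retry.
    """
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be in [0, 1]")
    return cheap_cost + escalation_rate * frontier_cost


# Hypothetical per-request costs in dollars (illustrative only).
CHEAP, FRONTIER = 0.001, 0.020

for rate in (0.10, 0.50, 0.90):
    cost = blended_cost(CHEAP, FRONTIER, rate)
    print(f"escalation={rate:.0%}  blended=${cost:.4f}  frontier-only=${FRONTIER:.4f}")
```

Under these assumed prices, a low escalation rate keeps the blended cost far below the frontier-only price, while a rate near 100% makes the cascade more expensive than skipping the cheap tier entirely, because the cheap attempt is paid for and then discarded.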

Tuesday at Northwind. Omar, a platform engineer, had spent the quarter proud of one number: a 41% drop in the company's LLM bill. He'd built a router. Simple classification and intent-detection calls went to a cheap model; only the genuinely hard requests — multi-step reasoning, code generation — reached the frontier model. It worked. Finance noticed.

Then the second week's bill came in at three times the first week's, with no traffic increase. Omar traced it. His cascade had a verifier — the cheap model's output was schema-checked, and on a failed check the request escalated to the frontier model. A provider-side update had subtly changed the cheap model's output formatting, the schema check started failing on most responses, and the router had quietly escalated about 90% of traffic to the most expensive model. Nothing errored. Nothing alerted. The router did exactly what it was told; it just stopped doing what Omar meant. The escalation rate had been climbing for nine days, and nobody was watching it.
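
One way to catch this kind of silent drift is to treat the escalation rate as a first-class metric and alert when it leaves its expected band. The following is a minimal sketch, not a description of what Northwind ran: the `EscalationMonitor` class, the window size, and the 25% threshold are all hypothetical choices for illustration.

```python
from collections import deque


class EscalationMonitor:
    """Tracks the share of recent requests that escalated to the expensive tier."""

    def __init__(self, window: int = 1000, max_rate: float = 0.25, min_samples: int = 100):
        self._events = deque(maxlen=window)  # True means the request escalated
        self._max_rate = max_rate
        self._min_samples = min_samples

    def record(self, escalated: bool) -> None:
        self._events.append(escalated)

    @property
    def rate(self) -> float:
        # Fraction of requests in the window that escalated.
        return sum(self._events) / len(self._events) if self._events else 0.0

    def should_alert(self) -> bool:
        # Require a minimum sample count so a handful of early failures
        # does not page anyone.
        return len(self._events) >= self._min_samples and self.rate > self._max_rate


monitor = EscalationMonitor(window=200, max_rate=0.25, min_samples=50)

# Healthy traffic: one request in ten fails the verifier and escalates.
for i in range(200):
    monitor.record(i % 10 == 0)
healthy_alert = monitor.should_alert()

# Verifier drift: every response now fails the schema check.
for _ in range(200):
    monitor.record(True)
drifted_alert = monitor.should_alert()

print(healthy_alert, drifted_alert)
```

A sliding window keeps the signal responsive: once the window fills with post-drift traffic the alert fires, rather than being diluted by weeks of healthy history.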

Routing is often one of the highest-leverage cost levers in an LLM stack and one of the easiest to get quietly wrong. This post covers the strategies, their tradeoffs, and the instrumentation that keeps a router honest.

What TrueFoundry's AI Gateway Provides Here

The routing strategies in this post aren't abstractions — they're how TrueFoundry's AI Gateway is configured. Its routing configuration matches each request by model, by subject (user, team, or virtual account), or by an X-TFY-METADATA header, evaluates rules top-to-bottom with first-match-wins, and sends the request to a target model — all as YAML applied at the gateway rather than branching logic in the app.
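
To illustrate the first-match-wins evaluation described above, here is a small Python sketch. It is not TrueFoundry's schema or implementation (the gateway is configured in YAML): the `Request` and `Rule` classes, their field names, and the model and subject names are hypothetical stand-ins for matching on model, subject, and metadata.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Request:
    model: str                      # model name the caller asked for
    subject: str                    # e.g. a user, team, or virtual account
    metadata: dict[str, str] = field(default_factory=dict)


@dataclass
class Rule:
    target: str                                 # model to route to on a match
    model: Optional[str] = None                 # None means "any model"
    subjects: Optional[set[str]] = None         # None means "any subject"
    metadata: dict[str, str] = field(default_factory=dict)

    def matches(self, req: Request) -> bool:
        if self.model is not None and req.model != self.model:
            return False
        if self.subjects is not None and req.subject not in self.subjects:
            return False
        # Every metadata key the rule names must be present with the same value.
        return all(req.metadata.get(k) == v for k, v in self.metadata.items())


def route(req: Request, rules: list[Rule], default: str) -> str:
    """Evaluate rules top-to-bottom and return the first matching target."""
    for rule in rules:
        if rule.matches(req):
            return rule.target
    return default


rules = [
    Rule(target="frontier-model", model="chat", metadata={"task": "codegen"}),
    Rule(target="cheap-model", model="chat"),
]

print(route(Request("chat", "team:search", {"task": "codegen"}), rules, "default-model"))
print(route(Request("chat", "team:search"), rules, "default-model"))
print(route(Request("embed", "user:a"), rules, "default-model"))
```

Because the first match wins, rule order matters: the specific `codegen` rule has to sit above the catch-all `chat` rule, or it would never fire.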

The gateway's three routing strategies map onto the ladder in section 2: weight-based for splits and canaries, latency-based to favor the lowest-latency healthy target, and priority-based for ordered preference with fallback. Per-target overrides also cover the model-specific-prompt problem this post raises: you can attach a different prompt_version_fqn per target so each model gets a prompt tuned for it, alongside per-target retries and fallback. (For new setups the docs recommend Virtual Models, which package the same strategies, retries, and fallbacks with clearer per-model ownership and access control.)
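
The difference between weight-based, latency-based, and priority-based selection can be sketched in a few lines of Python. This is an illustration of the general techniques, not the gateway's implementation: the `Target` fields, the use of a median latency figure, and the convention that a lower priority number means "more preferred" are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class Target:
    name: str
    weight: int = 0               # share of traffic for weight-based routing
    priority: int = 0             # assumption: lower number = more preferred
    p50_latency_ms: float = 0.0   # assumption: a recently observed median latency
    healthy: bool = True


def pick_weighted(targets, rng=random):
    """Weight-based: random choice proportional to weight (splits, canaries)."""
    live = [t for t in targets if t.healthy and t.weight > 0]
    return rng.choices(live, weights=[t.weight for t in live], k=1)[0]


def pick_lowest_latency(targets):
    """Latency-based: the healthy target with the lowest observed latency."""
    return min((t for t in targets if t.healthy), key=lambda t: t.p50_latency_ms)


def pick_by_priority(targets):
    """Priority-based: the most preferred healthy target, so an unhealthy
    first choice falls back to the next one in order."""
    return min((t for t in targets if t.healthy), key=lambda t: t.priority)


targets = [
    Target("stable", weight=90, priority=1, p50_latency_ms=800.0),
    Target("canary", weight=10, priority=2, p50_latency_ms=450.0),
]

print(pick_lowest_latency(targets).name)
print(pick_by_priority(targets).name)

targets[0].healthy = False   # simulate the preferred target going unhealthy
print(pick_by_priority(targets).name)
```

Each function raises if no healthy target remains; a real router would need an explicit policy for that case, which is where routing ends and failover begins.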

Fig 1: A request is matched against routing rules and assigned to a target model — the weight-, latency-, or priority-based selection this post walks through. Source: TrueFoundry AI Gateway docs — Routing Config.
