Intelligent LLM Routing: Cost & Quality-Aware Selection


A 2026 application doesn't talk to one model; it talks to a menu of them: frontier, mid-tier, cheap, and self-hosted. Routing is the policy that picks one per request while balancing three goals that pull against each other: cost, latency, and quality. This post walks through the routing strategies, from static rules to semantic routing and model cascades; the hard problem of measuring the quality you want to route on; why routing is not failover; and the instrumentation that keeps a router from quietly betraying you.

Key Takeaways

A modern app routes across a menu of models — frontier (Opus 4.8, GPT-5.5), mid (Sonnet 4.6, GPT-5.4), cheap (Haiku 4.5), and self-hosted — along three competing axes: cost, latency, and quality/task-fit. Sending everything to the best model is the most expensive option and often overkill.
Routing strategies form a ladder of increasing complexity: static rules, weighted splits, latency-aware, cost-aware, semantic routing, and model cascades. Each rung should be justified by measured benefit, not adopted for its own sake.
Semantic routing embeds the request and routes by inferred intent. The embedding/classifier step is small in our examples (treated as ~5–20 ms; measure it in your own path), and it only beats a static task tag when the caller can't cheaply label the request itself.
Model cascades (cheap-first, escalate-on-failure) can cut blended cost substantially when most traffic is resolved by the cheap tier — but the escalation rate is a live cost variable. A drifting verifier can silently escalate everything.
Routing "on quality" requires a way to measure quality: offline eval sets, online LLM-as-judge, or A/B against business metrics. Routing on vibes is how a quality regression ships unnoticed.
Routing (optimization) is not failover (availability). They share machinery — a candidate list and a policy — but conflating "pick the cheapest" with "survive an outage" causes incidents.
The gateway is the natural decision point: it normalizes provider APIs and already holds the cost/latency telemetry routing depends on. TrueFoundry's AI Gateway exposes routing rules, weighted load balancing, and fallback chains across hosted and self-hosted models, with per-route observability.
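To make the cascade takeaway concrete, here is a minimal sketch of the blended-cost arithmetic. It assumes every request pays for the cheap call and escalated requests additionally pay for the frontier call. The function name and the per-request prices are hypothetical, chosen only to make the numbers easy to follow; substitute your own measured costs.

```python
def blended_cost_per_request(
    cheap_cost: float, frontier_cost: float, escalation_rate: float
) -> float:
    """Expected cost per request for a cheap-first cascade.

    Every request pays for the cheap call; the escalated fraction
    also pays for the frontier call.
    """
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    return cheap_cost + escalation_rate * frontier_cost


# Hypothetical per-request prices (assumptions, not real list prices).
CHEAP, FRONTIER = 0.001, 0.020

healthy = blended_cost_per_request(CHEAP, FRONTIER, 0.10)  # about 0.003
drifted = blended_cost_per_request(CHEAP, FRONTIER, 0.90)  # about 0.019
worst = blended_cost_per_request(CHEAP, FRONTIER, 1.00)    # about 0.021

print(f"healthy={healthy:.4f} drifted={drifted:.4f} worst={worst:.4f}")
```

Under these assumed prices, a cascade that escalates everything costs more than sending all traffic straight to the frontier model, because the cheap call is paid and then thrown away. That is why the escalation rate deserves its own dashboard and alert.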

Tuesday at Northwind. Omar, a platform engineer, had spent the quarter proud of one number: a 41% drop in the company's LLM bill. He'd built a router. Simple classification and intent-detection calls went to a cheap model; only the genuinely hard requests — multi-step reasoning, code generation — reached the frontier model. It worked. Finance noticed.

Then the second week's bill came in at three times the first week's, with no traffic increase. Omar traced it. His cascade had a verifier — the cheap model's output was schema-checked, and on a failed check the request escalated to the frontier model. A provider-side update had subtly changed the cheap model's output formatting, the schema check started failing on most responses, and the router had quietly escalated about 90% of traffic to the most expensive model. Nothing errored. Nothing alerted. The router did exactly what it was told; it just stopped doing what Omar meant. The escalation rate had been climbing for nine days, and nobody was watching it.

Routing is often one of the highest-leverage cost levers in an LLM stack, and one of the easiest to get quietly wrong. This post covers the strategies, their tradeoffs, and the instrumentation that keeps a router honest.
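The failure in Omar's story can be sketched in a few lines. What follows is a minimal illustration, not Northwind's actual router: a cheap-first cascade whose verifier is a JSON schema check, plus a rolling escalation-rate monitor that would have flagged the drift. The names `EscalationMonitor`, `passes_schema`, and `cascade`, the window size, and the 30% alert threshold are all assumptions made for the example; the model callables stand in for real provider calls.

```python
import json
from collections import deque
from typing import Callable


class EscalationMonitor:
    """Tracks the share of recent requests that escalated to the expensive tier."""

    def __init__(self, window: int = 1000, alert_threshold: float = 0.30) -> None:
        self._events: deque = deque(maxlen=window)  # True means "escalated"
        self.alert_threshold = alert_threshold

    def record(self, escalated: bool) -> None:
        self._events.append(escalated)

    @property
    def rate(self) -> float:
        return sum(self._events) / len(self._events) if self._events else 0.0

    def should_alert(self) -> bool:
        return self.rate > self.alert_threshold


def passes_schema(raw: str, required_keys: set) -> bool:
    """Verifier: the output must be a JSON object containing the required keys."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)


def cascade(
    prompt: str,
    cheap_model: Callable[[str], str],
    frontier_model: Callable[[str], str],
    monitor: EscalationMonitor,
    required_keys: set,
) -> str:
    """Try the cheap model first; escalate only when the verifier rejects its output."""
    draft = cheap_model(prompt)
    if passes_schema(draft, required_keys):
        monitor.record(False)
        return draft
    monitor.record(True)
    return frontier_model(prompt)
```

A formatting change as small as the cheap model wrapping its JSON in a sentence of prose makes `passes_schema` fail on every response. The cascade still returns valid answers, so nothing errors; only `monitor.rate` reveals that the expensive tier is now doing all the work.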

What TrueFoundry's AI Gateway Provides Here

The routing strategies in this post aren't abstractions — they're how TrueFoundry's AI Gateway is configured. Its routing configuration matches each request by model, by subject (user, team, or virtual account), or by an X-TFY-METADATA header, evaluates rules top-to-bottom with first-match-wins, and sends the request to a target model — all as YAML applied at the gateway rather than branching logic in the app.
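The first-match-wins evaluation described above can be sketched in Python. This is an illustration of the matching semantics only, not the gateway's YAML schema or implementation: the `Request` and `Rule` types, their field names, and the choice to AND a rule's conditions together are assumptions, and the metadata is assumed to have already been parsed from the X-TFY-METADATA header into a dictionary.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Request:
    model: str
    subject: str  # user, team, or virtual account
    metadata: dict = field(default_factory=dict)  # parsed from X-TFY-METADATA


@dataclass(frozen=True)
class Rule:
    name: str
    target: str
    models: Optional[frozenset] = None    # None means "any model"
    subjects: Optional[frozenset] = None  # None means "any subject"
    metadata: Optional[dict] = None       # None means "any metadata"

    def matches(self, req: Request) -> bool:
        if self.models is not None and req.model not in self.models:
            return False
        if self.subjects is not None and req.subject not in self.subjects:
            return False
        if self.metadata is not None:
            return all(req.metadata.get(k) == v for k, v in self.metadata.items())
        return True


def route(req: Request, rules: list) -> Optional[str]:
    """Evaluate rules top-to-bottom and return the first matching rule's target."""
    for rule in rules:
        if rule.matches(req):
            return rule.target
    return None  # no rule matched; the caller decides the default


rules = [
    Rule("batch-jobs", target="cheap-model", metadata={"workload": "batch"}),
    Rule("search-team", target="mid-model", subjects=frozenset({"team:search"})),
    Rule("catch-all", target="frontier-model"),
]
```

Because the first match wins, rule order is part of the policy: moving `catch-all` to the top of this hypothetical list would send every request to the frontier model without any other rule being consulted.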

The gateway's three routing strategies map onto the ladder in section 2: weight-based for splits and canaries, latency-based to favor the lowest-latency healthy target, and priority-based for ordered preference with fallback. Per-target overrides also address the model-specific-prompt problem this post raises: you can attach a different prompt_version_fqn per target so each model gets a prompt tuned for it, alongside per-target retries and fallback. (For new setups the docs recommend Virtual Models, which package the same strategies, retries, and fallbacks with clearer per-model ownership and access control.)
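As a rough illustration of how the three selection strategies differ, the sketch below implements each one over a small list of targets. It is not the gateway's code: the `Target` fields, the use of a p50 latency figure, and the convention that a lower priority number means "try first" are assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    name: str
    weight: int = 0              # share of traffic for weight-based routing
    priority: int = 0            # lower number is tried first (assumed convention)
    p50_latency_ms: float = 0.0  # recent observed latency (assumed metric)
    healthy: bool = True


def pick_weighted(targets: list, rng: random.Random) -> Target:
    """Weight-based: sample one healthy target in proportion to its weight."""
    candidates = [t for t in targets if t.healthy and t.weight > 0]
    return rng.choices(candidates, weights=[t.weight for t in candidates], k=1)[0]


def pick_lowest_latency(targets: list) -> Target:
    """Latency-based: choose the healthy target with the lowest observed latency."""
    return min((t for t in targets if t.healthy), key=lambda t: t.p50_latency_ms)


def priority_order(targets: list) -> list:
    """Priority-based: healthy targets in the order they should be attempted."""
    return sorted((t for t in targets if t.healthy), key=lambda t: t.priority)


targets = [
    Target("frontier", weight=10, priority=2, p50_latency_ms=900.0),
    Target("mid", weight=90, priority=1, p50_latency_ms=400.0),
    Target("self-hosted", weight=0, priority=0, p50_latency_ms=150.0, healthy=False),
]

print(pick_weighted(targets, random.Random(0)).name)
print(pick_lowest_latency(targets).name)
print([t.name for t in priority_order(targets)])
```

In this sketch the unhealthy self-hosted target is excluded by all three strategies, which is the point where routing (pick the best candidate) and failover (skip the broken one) share machinery while still answering different questions.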

Fig 1: A request is matched against routing rules and assigned to a target model: the weight-, latency-, or priority-based selection this post walks through. Source: TrueFoundry AI Gateway docs, Routing Config.


