LLM Failover & Load Balancing for Provider Outages

Built for Speed: ~10ms Latency, Even Under Load

Blazingly fast way to build, track and deploy your models!

Handles 350+ RPS on just 1 vCPU — no tuning needed
Production-ready with full enterprise support

Get Started with Truefoundry Now Talk to the Expert

Every model provider has bad days — regional outages, rate-limit storms under load, latency that degrades without quite failing. If your application calls one provider directly, that provider's worst day is your worst day. This post is the reliability layer that prevents it: a taxonomy of how LLM calls fail, retries that don't make things worse, fallback chains across providers, health-aware load balancing, circuit breakers that fail fast, and the genuinely hard case of failing over mid-stream.

Key Takeaways

Production LLM apps depend on third-party providers with real outages, 429 rate-limit storms, and p99 latency spikes. Calling one provider directly makes it a single point of failure — its incident becomes your incident.
Failure modes differ and need different responses: hard 5xx errors, 429 rate limits (honor provider-specific headers — Retry-After, or x-ratelimit-* remaining/reset), latency degradation (slow but not failed), partial/streaming failures, and content-filter rejections (not a transient failover case — route them through a policy remediation path instead).
Retries need exponential backoff with jitter and must honor the provider's rate-limit headers (Retry-After, or x-ratelimit-* remaining/reset). Naive immediate retries turn a 429 into a thundering-herd storm that sustains the overload you're trying to escape.
Fallback chains — across providers and within a provider — trade availability for some output variance. The same prompt can behave differently on the fallback model, so failover is not free; validate the fallback's output too.
Load balancing across providers, deployments, and independent quota pools spreads load and adds real rate-limit headroom — but extra API keys inside one organization usually share a quota, so they don't multiply it. Health checks (active probes or passive circuit breaking) keep traffic off sick backends.
Circuit breakers fail fast — open, half-open, closed — so a down provider doesn't cascade into your own latency and queues. Hedged requests cut the latency tail at a cost multiplier. Streaming failover is the hard case, because partial output may already have been sent.
Retries, fallback, load balancing, and circuit breaking belong at the gateway because it sees every provider and key and normalizes the API. TrueFoundry's AI Gateway provides fallback chains, load balancing across providers and keys, and health-aware routing, with failover events on the request trace.

2:14 a.m. at Northwind. Nadia, an SRE, woke to a page: the customer-facing support agent was returning errors to every user. Not some users — every user. Northwind's own services were healthy; CPU, memory, and queues were all nominal. The errors were identical: 503 from the model provider. A regional incident on the provider's side had taken its inference endpoint down, and Northwind's agent called that endpoint directly. Every request hit the dead endpoint, failed, and returned an error to the customer. For forty minutes, until the provider recovered, there was nothing to do but wait — the agent had exactly one way to get a completion, and it was down.

The postmortem's action item wasn't "pick a more reliable provider." Every provider has incidents. It was "never depend on a single provider for a request that has to succeed." That is a gateway problem, and this post is how to solve it: retries that don't make things worse, fallback chains across providers, health-aware load balancing, and circuit breakers that fail fast instead of dragging your whole system down with the provider.

What TrueFoundry's AI Gateway Provides Here

Everything in this post — retries, fallback chains, health-aware load balancing — is something TrueFoundry's AI Gateway expresses as configuration rather than per-service code. Its routing configuration defines load-balancing, fallback, and retry rules in YAML, evaluates them in order so the first matching rule wins, and applies them centrally to every request instead of being reimplemented in each app.

The pieces map onto the failure taxonomy below. Each target carries its own retry_config — attempts, delay, and the status codes worth retrying (429/500/502/503 by default) — and a separate fallback_status_codes list moves the request to the next target when retries won't help. Priority-based routing gives an ordered failover chain; latency-based routing favors the lowest-latency healthy target; and an unhealthy target is detected from its requests-, tokens-, and failures-per-minute and sidelined for a cooldown. The docs even cover the hard streaming case from section 8 — provider-specific stream-overload handling so a fall-through can happen before any user-visible tokens are emitted.

TrueFoundry AI Gateway routing config: a request flows through routing rules and is assigned to a target model — *How a request flows through the gateway's routing rules to a target model, with load balancing and fallback applied along the way. Source:* *TrueFoundry AI Gateway docs — Routing Config*.

TrueFoundry AI Gateway Configs tab showing the YAML editor for routing configuration

‍

TrueFoundry AI Gateway delivers ~3–4 ms latency, handles 350+ RPS on 1 vCPU, scales horizontally with ease, and is production-ready, while LiteLLM suffers from high latency, struggles beyond moderate RPS, lacks built-in scaling, and is best for light or prototype workloads.

Built for Speed: ~10ms Latency, Even Under Load

Schedule your Demo Now