Load Balancing in AI Gateway: Optimizing Performance

June 2, 2025

Load balancing between multiple large language models in an AI gateway means routing incoming inference requests across a set of model endpoints (whether from different providers or different versions of the same model) so that no single model becomes a bottleneck or single point of failure. The gateway continuously monitors each endpoint’s health by tracking metrics such as requests per minute, tokens per minute, and error rates. When a model exceeds configured usage limits, returns errors, or experiences a lag in response time, it is marked unhealthy and excluded from routing. You can choose weight-based routing to assign fixed traffic proportions to each model or latency-based routing to dynamically prefer the fastest model based on recent performance data. All behavior is defined declaratively in a YAML configuration that specifies global usage limits, failure tolerances, and routing rules. This approach ensures high availability, consistent performance, and seamless failover without any changes to application code.

This blog explains what load balancing entails and why it is essential, shows how TrueFoundry AI Gateway implements it under the hood, walks through YAML configuration steps, reviews common setup patterns, and concludes with practical best practices for production deployments.

Why Do We Need Load Balancing In The AI Gateway?

Enterprises rely on uninterrupted access to language models for critical workflows. Yet individual providers can suffer service outages or planned maintenance windows that leave applications offline. By configuring load balancing across multiple model endpoints, TrueFoundry ensures that when one provider’s service becomes unavailable, traffic automatically shifts to healthy alternatives. This seamless failover prevents downtime for end users and maintains consistent application availability.

Latency fluctuations present another challenge. Response times vary by model architecture, geographic region, and provider capacity. A static routing setup risks sending traffic to a slower endpoint, degrading user experience. TrueFoundry’s latency-based routing continuously measures per-token response times over recent requests and dynamically routes each inference call to the fastest available model, keeping latency consistently low even as network conditions or provider load change.

API rate limits impose hard caps on requests or token throughput per minute. If a single provider’s quota is exhausted, subsequent calls fail, causing application errors. With weight-based routing in TrueFoundry, you can distribute traffic according to defined proportions so that no single endpoint exceeds its limits. Combined with global usage limits in the model_configs section, the gateway automatically keeps each model within its quota and reroutes calls when thresholds are reached, preventing unexpected failures.

Canary testing new model versions in production carries inherent risks. A flawed update could introduce errors or degrade performance. TrueFoundry makes canary deployments straightforward by letting you assign a small weight percentage to a new model in a weight-based rule. Traffic is routed incrementally, perhaps ten percent to the canary and ninety percent to the stable model, so you can monitor error rates and latency metrics before shifting the full load. If any issues arise, the gateway simply maintains the original traffic mix, safeguarding the user experience.

Together, these capabilities (automatic failover, dynamic latency optimization, rate limit management, and controlled canary rollouts) make load balancing an essential practice for robust, high-performance LLM deployments on the TrueFoundry AI Gateway.

How Load Balancing Works in TrueFoundry AI Gateway

TrueFoundry’s AI Gateway orchestrates traffic distribution by continuously monitoring three core metrics for each configured model endpoint: requests per minute, tokens processed per minute, and failures per minute. These metrics feed into the health evaluation engine and determine which models are “healthy” at any given time.

  1. Health Assessment
    • Usage Limits: If a model exceeds its configured request or token throughput limits (defined under model_configs), it is marked unhealthy.
    • Failure Tolerance: Models that accrue more errors than allowed, based on allowed_failures_per_minute and scoped by specific HTTP status codes, are similarly sidelined for the duration of their cooldown period.
  2. Rule Evaluation
    The gateway evaluates routing rules in the order they appear in your YAML configuration. Each rule’s when block filters incoming requests by model name, by user or team subjects, or by custom metadata. Only the first matching rule is applied, ensuring deterministic routing behavior.
  3. Weight-Based Routing
    Under a weight-based rule, you specify a list of target models along with integer weights that sum to 100. For example, you might route 90 percent of traffic to azure/gpt-4o and 10 percent to openai/gpt-4o. The gateway randomly distributes each request in proportion to these weights among the currently healthy targets. You can also include override_params to tweak settings like temperature or max tokens on a per-model basis.
  4. Latency-Based Routing
    When using latency-based rules, no manual weights are needed. The gateway calculates each model’s average per-token latency over recent traffic, considering either the last twenty minutes of requests or the most recent one hundred calls, whichever yields fewer data points. Models with fewer than three data points are treated as “fast” so that more statistics can be gathered. Any endpoint whose latency falls within 1.2 times the fastest model’s latency is considered equally eligible, preventing rapid switching due to minor performance fluctuations. Incoming requests are then directed to the fastest healthy model.

All routing decisions occur in real time within the gateway. Unhealthy models are automatically excluded, and traffic seamlessly flows to the best available endpoints—all without requiring changes to application code.
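
The sketch below ties these mechanisms together: a weight-based rule scoped to one subject is evaluated first, and a latency-based rule acts as the fallback for all other traffic. It reuses the azure/gpt-4o and openai/gpt-4o identifiers mentioned above; the rule ids and the team subject are hypothetical, so treat this as a minimal illustration rather than a drop-in policy.

name: loadbalancing-config
type: gateway-load-balancing-config
rules:
  # Evaluated first: fixed 90/10 split for requests from a hypothetical research team
  - id: "research-team-split"
    type: "weight-based-routing"
    when:
      models:
        - "gpt-4o"
      subjects:
        - "team:research"
    load_balance_targets:
      - target: "azure/gpt-4o"
        weight: 90
      - target: "openai/gpt-4o"
        weight: 10
  # Reached only when the rule above does not match: pick the fastest healthy model
  - id: "default-latency"
    type: "latency-based-routing"
    when:
      models:
        - "gpt-4o"
    load_balance_targets:
      - target: "azure/gpt-4o"
      - target: "openai/gpt-4o"

Because rules are evaluated top to bottom and only the first match applies, research-team traffic always takes the weighted split, while everything else is routed by latency.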

TrueFoundry Load Balancing: The Best AI Gateway Solution

Tired of single-model bottlenecks and unpredictable downtime? TrueFoundry’s load balancing lets you distribute traffic across multiple LLMs, ensuring low latency, high availability, and seamless scaling.

Experience rock-solid performance with these capabilities:

  • Intelligent request distribution: Evenly route queries across multiple models to optimize throughput and prevent overload.
  • Health-aware routing: Automatically detects unhealthy endpoints and reroutes traffic to available models, avoiding downtime.
  • Weighted and latency-based strategies: Assign weights or route to the lowest-latency models for cost-effective performance.
  • Declarative YAML configuration: Manage all load-balancing rules in a simple gateway-load-balancing-config file—no code changes needed.
  • Near-zero overhead and auto-scaling: Add only ~3 ms latency at 250 RPS, and scale to tens of thousands of requests per second with more CPU or replicas.

How to Configure Load Balancing in TrueFoundry?

TrueFoundry’s AI Gateway supports two primary methods for applying load-balancing configurations via YAML: directly through the Gateway UI or programmatically with GitOps and the tfy CLI.

To update load balancing in the Gateway UI, navigate to your project’s AI Gateway and select the Config tab under “Load Balancing”. The YAML editor displays your current gateway-load-balancing-config manifest, including top-level fields like name and type, optional model_configs for rate limits, and the core rules array for routing strategies. 

Edit the YAML inline to modify model identifiers, adjust usage_limits or failure_tolerance, or redefine load_balance_targets with weights or latency strategies, then click Save to validate and deploy the change immediately, with no downtime. Under the hood, TrueFoundry validates the syntax, applies the new rules in order, and instantly routes traffic according to your updated policy.

Alternatively, for teams practicing GitOps, store your load-balancing manifest (e.g., loadbalancer-config.yaml) in a version-controlled repository alongside your infrastructure code. After committing and pushing changes, run the TrueFoundry CLI:

  • pip install truefoundry and tfy login --host https://app.truefoundry.com to authenticate
  • tfy apply -f loadbalancer-config.yaml to push the YAML into the Gateway

This workflow enforces pull-request reviews, CI/CD validations, and full auditability before any policy change reaches production. Whether you prefer direct UI edits for rapid iterations or GitOps for robust governance, TrueFoundry’s declarative YAML approach ensures your load-balancing policies are transparent, versioned, and applied consistently without touching application code. 
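
For reference, the committed loadbalancer-config.yaml can start as small as the sketch below; the single catch-all rule is a placeholder, and the azure/gpt4 target is borrowed from the examples later in this post.

name: loadbalancing-config
type: gateway-load-balancing-config
rules:
  # Placeholder rule: send all gpt-4 traffic to a single provider until real policies are added
  - id: "gpt4-default"
    type: "weight-based-routing"
    when:
      models:
        - "gpt-4"
    load_balance_targets:
      - target: "azure/gpt4"
        weight: 100

Running tfy apply -f loadbalancer-config.yaml against a baseline like this gives you a working policy that later pull requests can refine.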

Understanding TrueFoundry’s Load Balancing Config

TrueFoundry’s load balancing configuration is defined entirely in a declarative YAML manifest that consists of two primary sections: model_configs and rules. At the top level, you specify name, a human-readable identifier used for logging, and type, which must be gateway-load-balancing-config so the platform recognizes this file as a load-balancing specification.

The optional model_configs block lets you enforce global constraints on each model endpoint. For every entry you include:

  • model: the gateway identifier (for example, azure/gpt4)
  • usage_limits: caps on tokens_per_minute and requests_per_minute to prevent any model from exceeding its allocated throughput
  • failure_tolerance: parameters that dictate when a model is deemed unhealthy, including allowed_failures_per_minute, cooldown_period_minutes, and a list of HTTP status codes that count as failures

When a model breaches any usage or failure threshold, the gateway marks it unhealthy for the specified cooldown period and excludes it from routing until it recovers.
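
As an illustrative sketch of how the two blocks combine for a single endpoint (the values mirror the examples later in this post and are not recommendations):

model_configs:
  - model: "azure/gpt4"
    usage_limits:
      tokens_per_minute: 50000      # cap on tokens processed per minute
      requests_per_minute: 100      # cap on requests per minute
    failure_tolerance:
      allowed_failures_per_minute: 3
      cooldown_period_minutes: 5    # how long the model stays excluded once marked unhealthy
      failure_status_codes: [429, 500, 502, 503, 504]

With this entry, if azure/gpt4 crosses either usage cap or returns more than three of the listed status codes in a minute, it is removed from rotation for five minutes before being re-evaluated.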

The core of the configuration is the rules array. Each rule must declare:

  • id: a unique name used for metrics and logs
  • type: either weight-based-routing or latency-based-routing
  • when: conditions that scope the rule to specific requests by models and optionally by subjects or metadata

Rules are evaluated in the order they appear, and only the first matching rule takes effect. This ensures predictable, deterministic traffic routing.

Under load_balance_targets, list one or more target models. For weight-based routing, each target needs an integer weight between 0 and 100, with all weights summing to 100. For latency-based routing, no weights are needed; the gateway measures recent per-token latency and routes each request to the fastest healthy model. Both strategies support optional override_params per target, allowing customization of runtime parameters such as temperature or max_tokens.

By centralizing traffic distribution policies in a single YAML file, TrueFoundry enables version control, pull-request reviews, and rapid iteration of load-balancing strategies without any changes to application code.

Commonly Used Load Balancing Configurations

Enterprises often adopt distinct load-balancing patterns to meet different operational goals. Below are four widely used setups on the TrueFoundry AI Gateway, each tailored to a specific use case.

1. Canary Deployment

Gradual roll-outs let teams safely introduce new model versions. You assign a small percentage of traffic to the canary model and the remainder to the stable version. Monitoring error rates and latency on the canary ensures any regressions are caught before full cutover.

name: loadbalancing-config
type: gateway-load-balancing-config
rules:
  - id: "gpt4-canary"
    type: "weight-based-routing"
    when:
      models:
        - "gpt-4"
    load_balance_targets:
      - target: "azure/gpt4-v1"
        weight: 90
      - target: "azure/gpt4-v2"
        weight: 10

2. Health-Aware Weight-Based Routing

Premium users or high-priority workflows can be steered toward best-performing models. By defining failure tolerances in model_configs, any model that exceeds error thresholds is automatically removed until it recovers. Traffic proportions then continue among the remaining healthy endpoints.

name: loadbalancing-config
type: gateway-load-balancing-config
model_configs:
  - model: "azure/gpt4"
    failure_tolerance:
      allowed_failures_per_minute: 3
      cooldown_period_minutes: 5
      failure_status_codes: [429, 500, 502, 503, 504]
  - model: "openai/gpt4"
    failure_tolerance:
      allowed_failures_per_minute: 5
      cooldown_period_minutes: 10
      failure_status_codes: [429, 500, 502, 503, 504]
rules:
  - id: "premium-users"
    type: "weight-based-routing"
    when:
      subjects:
        - "virtualaccount:premium"
      models:
        - "gpt-4"
    load_balance_targets:
      - target: "azure/gpt4"
        weight: 80
        override_params:
          temperature: 0.7
      - target: "openai/gpt4"
        weight: 20

3. Token-Aware Latency-Based Routing

To balance cost and performance, you may cap token usage on one model while allowing an alternative endpoint to handle overflow. Latency-based routing then ensures each request goes to the fastest model among those still within quota.

name: loadbalancing-config
type: gateway-load-balancing-config
model_configs:
  - model: "azure/gpt4"
    usage_limits:
      tokens_per_minute: 50000
      requests_per_minute: 100
rules:
  - id: "cost-effective"
    type: "latency-based-routing"
    when:
      models:
        - "gpt-4"
    load_balance_targets:
      - target: "azure/gpt4"
        override_params:
          max_tokens: 500
      - target: "openai/gpt4"
        override_params:
          max_tokens: 1000

4. Environment-Based Routing

Different environments like development, staging, or production often require distinct routing policies. Environment metadata lets you enforce weight-based or latency-based rules conditional on the request context.

name: loadbalancing-config
type: gateway-load-balancing-config
rules:
  - id: "dev-environment"
    type: "weight-based-routing"
    when:
      models:
        - "gpt-4"
      metadata:
        environment: "development"
    load_balance_targets:
      - target: "openai/gpt4"
        weight: 100
        override_params:
          temperature: 0.8
  - id: "prod-environment"
    type: "latency-based-routing"
    when:
      models:
        - "gpt-4"
      metadata:
        environment: "production"
    load_balance_targets:
      - target: "azure/gpt4"
      - target: "openai/gpt4"

Each of these configurations illustrates how TrueFoundry’s declarative YAML lets teams quickly implement sophisticated routing logic, whether for gradual roll-outs, health-aware traffic splitting, cost-sensitive performance optimization, or environment-driven policies, all without touching application code.

Conclusion

Load balancing transforms AI gateways from simple routers into intelligent traffic managers, ensuring high availability, consistent performance, and seamless failover across multiple LLM endpoints. By defining global usage limits and failure tolerances, you prevent overloaded or error-prone models from disrupting service. Weight-based routing lets you control traffic proportions precisely, ideal for canary releases or premium workflows, while latency-based routing dynamically steers requests to the fastest healthy models. Declarative YAML configuration makes these policies transparent, version-controlled, and easy to review. With TrueFoundry’s load balancing features, teams can deploy LLMs confidently, knowing that traffic distribution adapts automatically to real-time conditions without any changes to application code.
