Load Balancing in AI Gateway: Optimizing Performance

June 2, 2025

Load balancing between multiple large language models in an AI gateway means routing incoming inference requests across a set of model endpoints (whether from different providers or different versions of the same model) so that no single model becomes a bottleneck or single point of failure. The gateway continuously monitors each endpoint’s health by tracking metrics such as requests per minute, tokens per minute, and error rates. When a model exceeds configured usage limits, returns errors, or experiences a lag in response time, it is marked unhealthy and excluded from routing. You can choose weight-based routing to assign fixed traffic proportions to each model or latency-based routing to dynamically prefer the fastest model based on recent performance data. All behavior is defined declaratively in a YAML configuration that specifies global usage limits, failure tolerances, and routing rules. This approach ensures high availability, consistent performance, and seamless failover without any changes to application code.

This blog explains what load balancing entails and why it is essential, shows how TrueFoundry AI Gateway implements it under the hood, walks through YAML configuration steps, reviews common setup patterns, and concludes with practical best practices for production deployments.

Why Do We Need Load Balancing In The AI Gateway?

Enterprises rely on uninterrupted access to language models for critical workflows. Yet individual providers can suffer service outages or planned maintenance windows that leave applications offline. By configuring load balancing across multiple model endpoints, TrueFoundry ensures that when one provider’s service becomes unavailable, traffic automatically shifts to healthy alternatives. This seamless failover prevents downtime for end users and maintains consistent application availability.

Latency fluctuations present another challenge. Response times vary by model architecture, geographic region, and provider capacity. A static routing setup risks sending traffic to a slower endpoint, degrading user experience. TrueFoundry’s latency-based routing continuously measures per-token response times over recent requests and dynamically routes each inference call to the fastest available model, keeping latency consistently low even as network conditions or provider load change.

API rate limits impose hard caps on requests or token throughput per minute. If a single provider’s quota is exhausted, subsequent calls fail, causing application errors. With weight-based routing in TrueFoundry, you can distribute traffic according to defined proportions so that no single endpoint exceeds its limits. Combined with global usage limits in the model_configs section, the gateway automatically keeps each model within its quota and reroutes calls when thresholds are reached, preventing unexpected failures.

Canary testing new model versions in production carries inherent risks. A flawed update could introduce errors or degrade performance. TrueFoundry makes canary deployments straightforward by letting you assign a small weight percentage to a new model in a weight-based rule. Traffic is routed incrementally, perhaps ten percent to the canary and ninety percent to the stable model, so you can monitor error rates and latency metrics before shifting the full load. If any issues arise, the gateway simply maintains the original traffic mix, safeguarding the user experience.

Together, these capabilities (automatic failover, dynamic latency optimization, rate limit management, and controlled canary rollouts) make load balancing an essential practice for robust, high-performance LLM deployments on the TrueFoundry AI Gateway.

How Load Balancing Works in TrueFoundry AI Gateway

TrueFoundry’s AI Gateway orchestrates traffic distribution by continuously monitoring three core metrics for each configured model endpoint: requests per minute, tokens processed per minute, and failures per minute. These metrics feed into the health evaluation engine and determine which models are “healthy” at any given time.

  1. Health Assessment
    • Usage Limits: If a model exceeds its configured request or token throughput limits (defined under model_configs), it is marked unhealthy.
    • Failure Tolerance: Models that accrue more errors than allowed, based on allowed_failures_per_minute and scoped by specific HTTP status codes, are similarly sidelined for the duration of their cooldown period.
  2. Rule Evaluation
    The gateway evaluates routing rules in the order they appear in your YAML configuration. Each rule’s when block filters incoming requests by model name, by user or team subjects, or by custom metadata. Only the first matching rule is applied, ensuring deterministic routing behavior.
  3. Weight-Based Routing
    Under a weight-based rule, you specify a list of target models along with integer weights that sum to 100. For example, you might route 90 percent of traffic to azure/gpt-4o and 10 percent to openai/gpt-4o. The gateway randomly distributes each request in proportion to these weights among the currently healthy targets. You can also include override_params to tweak settings like temperature or max tokens on a per-model basis.
  4. Latency-Based Routing
    When using latency-based rules, no manual weights are needed. The gateway calculates each model’s average per-token latency over recent traffic, considering either the last twenty minutes of requests or the most recent one hundred calls, whichever yields fewer data points. Models with fewer than three data points are treated as “fast” so that more statistics can be gathered. Any endpoint whose latency falls within 1.2 times the fastest model’s latency is considered equally eligible, preventing rapid switching due to minor performance fluctuations. Incoming requests are then directed to the fastest healthy model.

All routing decisions occur in real time within the gateway. Unhealthy models are automatically excluded, and traffic seamlessly flows to the best available endpoints—all without requiring changes to application code.
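
The sketch below ties these mechanisms together: a weight-based rule scoped to one subject is evaluated first, and a latency-based rule acts as the fallback for all other traffic. It reuses the azure/gpt-4o and openai/gpt-4o identifiers mentioned above; the rule ids and the team subject are hypothetical, so treat this as a minimal illustration rather than a drop-in policy.

name: loadbalancing-config
type: gateway-load-balancing-config
rules:
  # Evaluated first: fixed 90/10 split for requests from a hypothetical research team
  - id: "research-team-split"
    type: "weight-based-routing"
    when:
      models:
        - "gpt-4o"
      subjects:
        - "team:research"
    load_balance_targets:
      - target: "azure/gpt-4o"
        weight: 90
      - target: "openai/gpt-4o"
        weight: 10
  # Reached only when the rule above does not match: pick the fastest healthy model
  - id: "default-latency"
    type: "latency-based-routing"
    when:
      models:
        - "gpt-4o"
    load_balance_targets:
      - target: "azure/gpt-4o"
      - target: "openai/gpt-4o"

Because rules are evaluated top to bottom and only the first match applies, research-team traffic always takes the weighted split, while everything else is routed by latency.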

TrueFoundry Load Balancing: The Best AI Gateway Solution

Tired of single-model bottlenecks and unpredictable downtime? TrueFoundry’s load balancing lets you distribute traffic across multiple LLMs, ensuring low latency, high availability, and seamless scaling.

Experience rock-solid performance with these capabilities:

  • Intelligent request distribution: Evenly route queries across multiple models to optimize throughput and prevent overload.
  • Health-aware routing: Automatically detects unhealthy endpoints and reroutes traffic to available models, avoiding downtime.
  • Weighted and latency-based strategies: Assign weights or route to the lowest-latency models for cost-effective performance.
  • Declarative YAML configuration: Manage all load-balancing rules in a simple gateway-load-balancing-config file—no code changes needed.
  • Near-zero overhead and auto-scaling: Add only ~3 ms latency at 250 RPS, and scale to tens of thousands of requests per second with more CPU or replicas.

How to Configure Load Balancing in TrueFoundry?

TrueFoundry’s AI Gateway supports two primary methods for applying load-balancing configurations via YAML: directly through the Gateway UI or programmatically with GitOps and the tfy CLI.

To update load balancing in the Gateway UI, navigate to your project’s AI Gateway and select the Config tab under “Load Balancing”. The YAML editor displays your current gateway-load-balancing-config manifest, including top-level fields like name and type, optional model_configs for rate limits, and the core rules array for routing strategies. 

Edit the YAML inline to modify model identifiers, adjust usage_limits or failure_tolerance, or redefine load_balance_targets with weights or latency strategies, then click Save to validate and deploy the change immediately, with no downtime. Under the hood, TrueFoundry validates the syntax, applies the new rules in order, and instantly routes traffic according to your updated policy.

Alternatively, for teams practicing GitOps, store your load-balancing manifest (e.g., loadbalancer-config.yaml) in a version-controlled repository alongside your infrastructure code. After committing and pushing changes, run the TrueFoundry CLI:

  • pip install truefoundry and tfy login --host https://app.truefoundry.com to authenticate
  • tfy apply -f loadbalancer-config.yaml to push the YAML into the Gateway

This workflow enforces pull-request reviews, CI/CD validations, and full auditability before any policy change reaches production. Whether you prefer direct UI edits for rapid iterations or GitOps for robust governance, TrueFoundry’s declarative YAML approach ensures your load-balancing policies are transparent, versioned, and applied consistently without touching application code. 
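
For reference, the committed loadbalancer-config.yaml can start as small as the sketch below; the single catch-all rule is a placeholder, and the azure/gpt4 target is borrowed from the examples later in this post.

name: loadbalancing-config
type: gateway-load-balancing-config
rules:
  # Placeholder rule: send all gpt-4 traffic to a single provider until real policies are added
  - id: "gpt4-default"
    type: "weight-based-routing"
    when:
      models:
        - "gpt-4"
    load_balance_targets:
      - target: "azure/gpt4"
        weight: 100

Running tfy apply -f loadbalancer-config.yaml against a baseline like this gives you a working policy that later pull requests can refine.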

Understanding TrueFoundry’s Load Balancing Config

TrueFoundry’s load balancing configuration is defined entirely in a declarative YAML manifest that consists of two primary sections: model_configs and rules. At the top level, you specify name, a human-readable identifier used for logging, and type, which must be gateway-load-balancing-config so the platform recognizes this file as a load-balancing specification.

The optional model_configs block lets you enforce global constraints on each model endpoint. For every entry you include:

  • model: the gateway identifier (for example, azure/gpt4)
  • usage_limits: caps on tokens_per_minute and requests_per_minute to prevent any model from exceeding its allocated throughput
  • failure_tolerance: parameters that dictate when a model is deemed unhealthy, including allowed_failures_per_minute, cooldown_period_minutes, and a list of HTTP status codes that count as failures

When a model breaches any usage or failure threshold, the gateway marks it unhealthy for the specified cooldown period and excludes it from routing until it recovers.
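
As an illustrative sketch of how the two blocks combine for a single endpoint (the values mirror the examples later in this post and are not recommendations):

model_configs:
  - model: "azure/gpt4"
    usage_limits:
      tokens_per_minute: 50000      # cap on tokens processed per minute
      requests_per_minute: 100      # cap on requests per minute
    failure_tolerance:
      allowed_failures_per_minute: 3
      cooldown_period_minutes: 5    # how long the model stays excluded once marked unhealthy
      failure_status_codes: [429, 500, 502, 503, 504]

With this entry, if azure/gpt4 crosses either usage cap or returns more than three of the listed status codes in a minute, it is removed from rotation for five minutes before being re-evaluated.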

The core of the configuration is the rules array. Each rule must declare:

  • id: a unique name used for metrics and logs
  • type: either weight-based-routing or latency-based-routing
  • when: conditions that scope the rule to specific requests by models and optionally by subjects or metadata

Rules are evaluated in the order they appear, and only the first matching rule takes effect. This ensures predictable, deterministic traffic routing.

Under load_balance_targets, list one or more target models. For weight-based routing, each target needs an integer weight between 0 and 100, with all weights summing to 100. For latency-based routing, no weights are needed; the gateway measures recent per-token latency and routes each request to the fastest healthy model. Both strategies support optional override_params per target, allowing customization of runtime parameters such as temperature or max_tokens.

By centralizing traffic distribution policies in a single YAML file, TrueFoundry enables version control, pull-request reviews, and rapid iteration of load-balancing strategies without any changes to application code.

Commonly Used Load Balancing Configurations

Enterprises often adopt distinct load-balancing patterns to meet different operational goals. Below are four widely used setups on the TrueFoundry AI Gateway, each tailored to a specific use case.

1. Canary Deployment

Gradual roll-outs let teams safely introduce new model versions. You assign a small percentage of traffic to the canary model and the remainder to the stable version. Monitoring error rates and latency on the canary ensures any regressions are caught before full cutover.

name: loadbalancing-config
type: gateway-load-balancing-config
rules:
  - id: "gpt4-canary"
    type: "weight-based-routing"
    when:
      models:
        - "gpt-4"
    load_balance_targets:
      - target: "azure/gpt4-v1"
        weight: 90
      - target: "azure/gpt4-v2"
        weight: 10

2. Health-Aware Weight-Based Routing

Premium users or high-priority workflows can be steered toward best-performing models. By defining failure tolerances in model_configs, any model that exceeds error thresholds is automatically removed until it recovers. Traffic proportions then continue among the remaining healthy endpoints.

name: loadbalancing-config
type: gateway-load-balancing-config
model_configs:
  - model: "azure/gpt4"
    failure_tolerance:
      allowed_failures_per_minute: 3
      cooldown_period_minutes: 5
      failure_status_codes: [429, 500, 502, 503, 504]
  - model: "openai/gpt4"
    failure_tolerance:
      allowed_failures_per_minute: 5
      cooldown_period_minutes: 10
      failure_status_codes: [429, 500, 502, 503, 504]
rules:
  - id: "premium-users"
    type: "weight-based-routing"
    when:
      subjects:
        - "virtualaccount:premium"
      models:
        - "gpt-4"
    load_balance_targets:
      - target: "azure/gpt4"
        weight: 80
        override_params:
          temperature: 0.7
      - target: "openai/gpt4"
        weight: 20

3. Token-Aware Latency-Based Routing

To balance cost and performance, you may cap token usage on one model while allowing an alternative endpoint to handle overflow. Latency-based routing then ensures each request goes to the fastest model among those still within quota.

name: loadbalancing-config
type: gateway-load-balancing-config
model_configs:
  - model: "azure/gpt4"
    usage_limits:
      tokens_per_minute: 50000
      requests_per_minute: 100
rules:
  - id: "cost-effective"
    type: "latency-based-routing"
    when:
      models:
        - "gpt-4"
    load_balance_targets:
      - target: "azure/gpt4"
        override_params:
          max_tokens: 500
      - target: "openai/gpt4"
        override_params:
          max_tokens: 1000

4. Environment-Based Routing

Different environments like development, staging, or production often require distinct routing policies. Environment metadata lets you enforce weight-based or latency-based rules conditional on the request context.

name: loadbalancing-config
type: gateway-load-balancing-config
rules:
  - id: "dev-environment"
    type: "weight-based-routing"
    when:
      models:
        - "gpt-4"
      metadata:
        environment: "development"
    load_balance_targets:
      - target: "openai/gpt4"
        weight: 100
        override_params:
          temperature: 0.8
  - id: "prod-environment"
    type: "latency-based-routing"
    when:
      models:
        - "gpt-4"
      metadata:
        environment: "production"
    load_balance_targets:
      - target: "azure/gpt4"
      - target: "openai/gpt4"

Each of these configurations illustrates how TrueFoundry’s declarative YAML lets teams quickly implement sophisticated routing logic, whether for gradual roll-outs, health-aware traffic splitting, cost-sensitive performance optimization, or environment-driven policies, all without touching application code.

Conclusion

Load balancing transforms AI gateways from simple routers into intelligent traffic managers, ensuring high availability, consistent performance, and seamless failover across multiple LLM endpoints. By defining global usage limits and failure tolerances, you prevent overloaded or error-prone models from disrupting service. Weight-based routing lets you control traffic proportions precisely, ideal for canary releases or premium workflows, while latency-based routing dynamically steers requests to the fastest healthy models. Declarative YAML configuration makes these policies transparent, version-controlled, and easy to review. With TrueFoundry’s load balancing features, teams can deploy LLMs confidently, knowing that traffic distribution adapts automatically to real-time conditions without any changes to application code.
