Open-Weight Routing at Scale: GLM-5.1 vs Claude Opus 4.7 on TrueFoundry AI Gateway

Diseñado para la velocidad: ~ 10 ms de latencia, incluso bajo carga
¡Una forma increíblemente rápida de crear, rastrear e implementar sus modelos!
- Gestiona más de 350 RPS en solo 1 vCPU, sin necesidad de ajustes
- Listo para la producción con soporte empresarial completo
We ran 20 fixed prompts through TrueFoundry AI Gateway comparing four strategies: all Claude Opus 4.7, all Z.AI GLM-5.1, a Haiku classifier router (easy → open, hard → frontier), and an 80/20 virtual model. On this mix, classifier routing cut blended cost ~31% versus all- Opus ($15.72 vs $22.72 per 1M tokens) while scoring higher on our Sonnet judge (4.94 vs 4.85). All-open was cheapest ($3.00 / 1M) but slower and slightly lower quality. The takeaway: you do not need a single model string for every request — Gateway routing plus a cheap classifier can preserve frontier quality on hard tasks without paying frontier prices on easy ones.
Why this matters now
The open-weight wave is no longer theoretical. Models like GLM-5.1 ship with agentic coding positioning, 200K-token context, and list prices an order of magnitude below frontier APIs. At the same time, Claude Opus 4.7 remains the reference for hard reasoning — with $5 / $25 per 1M input/output tokens on Anthropic's current public rate card.
Platform teams face a familiar tradeoff:
- Route everything to frontier → predictable quality, painful unit economics at volume.
- Route everything to open-weight → attractive cost, uneven quality and latency tails on hard prompts.
- Build custom routers → flexible, but you own classification logic, failover, billing reconciliation, and cache semantics across providers.
TrueFoundry AI Gateway sits in the middle: 1000+ LLMs through a unified OpenAI compatible API, virtual models with weight-based routing, x-tfy-cache-config semantic cache headers, and Metrics → Download Raw Data for billing truth. We wanted to measure whether a simple EASY/HARD classifier — one Haiku call per request — could beat both extremes on cost and quality for a realistic 20-prompt workload.
What we compared (technical tour)
Open-weight baseline: GLM-5.1
GLM-5.1 is Z.AI's April 2026 flagship, aimed at long-horizon agentic work — planning, tool use, and multi-step coding loops. We accessed it via Gateway as open-router/z-ai-glm-5.1(OpenRouter-routed). Public list pricing on OpenRouter is $0.98 / $3.08 per 1M input/output tokens.
Frontier baseline: Claude Opus 4.7
Opus 4.7 is Anthropic's top-tier model for complex reasoning. We used anthropic/claudeopus-4-7 through Gateway. Public list pricing is $5 / $25 per 1M input/output. Note: Opus 4.7uses a new tokenizer that can emit more tokens than older Claude models for the same text —cost comparisons should use measured token counts, not character counts.
App-level classifier router
Our router (open-router/anthropic-claude-haiku-4.5 ) classifies each prompt as EASY orHARD in a single call (~8 output tokens). EASY → GLM-5.1; HARD → Opus 4.7. Quality scoring uses Claude Sonnet 4.6 as an LLM judge (1–5 against per-prompt rubrics).
Gateway virtual model (80/20)
We also tested a virtual-model in Gateway configured for weight-based routing (80%open / 20% frontier in the UI). This measures provider-side load balancing without app-level classification — a different knob than the Haiku router.
About our benchmark
Prompts: 20 tasks — 10 labeled easy (summarize, format JSON, translate) and 10 hard(distributed systems tradeoffs, SQL injection review, contract ambiguity, K8s OOM debug, etc.).
Metrics per strategy:
What we did not claim: vendor SWE-bench scores, production traffic shapes.
Vendor pricing context (May 2026)
GLM-5.1 is roughly 5× cheaper on input and ~8× cheaper on output than Opus 4.7 at list price — before routing, caching, or enterprise discounts. The interesting question is how much of that gap you keep after sending hard prompts to frontier.
Our analysis (20-prompt run)
Cost per 1M tokens (this run's token mix)
Router split (classifier)
The Haiku router sent 10/20 prompts to GLM-5.1 and 10/20 to Opus 4.7 — a 50/50 split on this prompt set (10 easy + 10 hard by design). Token volume followed suit: 7,774 tokens on GLM vs 10,072 on Opus for completion traffic.
Latency tails matter
Open-weight-only had the slowest p50 (20.1s) and an extreme p95 (~115s) — one long GLM completion on a hard prompt dominated the tail. Opus-only was fastest at p50 (9.1s) with a moderate p95 (~21s). The classifier landed in between on p50 (14.9s) with p95 ~26s.
Quality vs cost: the classifier sweet spot
- Router vs all-Opus: ~31% lower blended $/1M ($15.72 vs $22.72) with higher mean judge score (4.94 vs 4.85). Total dollar cost for 20 prompts was essentially the same (~$0.28) because judge + router overhead offset GLM savings — at higher volume, the per-token gap compounds.
- Router vs all-open: ~5.2× higher $/1M but +0.19 quality points. Cheapest is not best if hard prompts matter.
- Virtual 80/20: $7.19 / 1M on a list-price blend estimate, but quality (4.50) trailed both baselines. Weight-based routing without task awareness is not a substitute for classification on this workload — validate the actual backend mix in Gateway Metrics, not just the virtual model id.
Why these results matter
- Classification is cheap relative to frontier completions. One Haiku call per request is noise compared to a 1,024-token Opus completion on hard tasks. The router's economics work when easy traffic is a large share of volume — and when misroutes are rare.
- List price ≠ your bill. Gateway may route through different providers, apply caching, or negotiate rates. We applied public list prices to measured tokens from our run; you should reconcile with Gateway Metrics → Download Raw Data before setting FinOps guardrails.
- Latency and quality are coupled. Saving 31% on tokens does not help if p95 latency breaches SLOs. Our open-weight baseline showed that a single bad routing decision (sending a hard prompt only to GLM) can explode tail latency.
- Two routing patterns, two stories. App-level EASY/HARD routing optimized quality-cost on this set. UI-level 80/20 virtual models optimized for operational simplicity but underperformed on quality here — useful for gradual rollouts, not a full replacement for task-aware routing.
Practical takeaways for platform teams
- Start with a frontier + open-weight pair wired through one Gateway base URL. Swap models by changing the model string — no SDK fork per provider.
- Add a cheap classifier (Haiku or similar) before you add complexity to virtual-model weights. Measure misroute rate on a gold subset of prompts.
- Publish a prompt tier list (easy / hard) aligned with your rubrics — our 20-prompt set is a template, not your production distribution.
- Reconcile cost in Gateway Metrics, not in notebook estimates. Export raw billing CSV and join on trace metadata
- Layer semantic cache after routing stabilizes — x-tfy-cache-config on easy, paraphrased prompts is where cache ROI usually appears (not measured in this baseline run).
How TrueFoundry AI Gateway made this possible
- Unified OpenAI-compatible API — one client, base_url pointed at Gateway; same codepath for GLM, Opus, Haiku, and Sonnet.
- Virtual models — weight-based 80/20 routing without application changes (docs).
- Semantic cache — x-tfy-cache-config header for similarity-based reuse (docs).
- Observability — token usage, latency, and cost headers for reconciliation; ~3–4 ms latency and 350+ RPS on 1 vCPU at the gateway layer for high-throughput proxy scenarios.
Conclusion
Open-weight models like GLM-5.1 are priced to win easy traffic. Claude Opus 4.7 still earns its keep on hard prompts. The gap between them is large enough that routing matters more than model marketing.
On our 20-prompt harness through TrueFoundry AI Gateway, a Haiku classifier router delivered the best combined story: ~31% lower blended cost per million tokens than all- Opus, with a higher mean judge score (4.94 vs 4.85). All-open remained the cost floor; all- Opus the quality-and-speed ceiling for p50 latency.
TrueFoundry AI Gateway ofrece una latencia de entre 3 y 4 ms, gestiona más de 350 RPS en una vCPU, se escala horizontalmente con facilidad y está listo para la producción, mientras que LitellM presenta una latencia alta, tiene dificultades para superar un RPS moderado, carece de escalado integrado y es ideal para cargas de trabajo ligeras o de prototipos.
La forma más rápida de crear, gobernar y escalar su IA













.png)




.png)






.webp)



