Online Evaluation and Quality Monitoring at the Gateway

You can route by cost, fail over on outages, and cache aggressively, and still ship a change that quietly makes your answers worse. Cost, latency, and error rate are the three signals every production system watches, and all three can stay green while the fourth, answer quality, regresses. This post covers how to measure that fourth signal in production: online evaluation, scoring with LLM-as-judge and its honest caveats, sampling, regression detection, and closing the loop back into routing.

Key Takeaways

  • Production systems instrument cost, latency, and errors — and usually miss the signal that matters most: answer quality. A model, prompt, or routing change can keep every operational dashboard green while quality silently regresses.
  • Offline evaluation (a fixed test set, pre-deploy) catches known cases; online evaluation (scoring real production traffic) catches the drift, edge cases, and regressions a static set never sees. Mature teams run both.
  • You score a response with LLM-as-judge, heuristic checks (format, grounding, length), and guardrail signals. LLM-as-judge is a noisy estimator, not ground truth: it is biased and inconsistent, so calibrate it against human labels and watch its trend rather than treating any single score as a verdict.
  • You can't score every response — scoring has cost and latency of its own — so sample: a small random fraction plus targeted sampling of high-risk routes, treating the result as a statistical estimate with uncertainty.
  • Quality has to be sliced like cost: by model, route, and prompt version, alongside latency and spend, so a regression is attributable to the specific change that caused it.
  • Regression detection, with uncertainty attached, is the payoff: when a change moves the quality metric down on a slice, you find out before your customers do. Missing that drop is the failure the opening story below describes.
  • The gateway is the natural place for the cross-cutting online-evaluation layer: it already sees the request/response envelope, model, route, latency, cost, errors, and metadata, so a sampled quality score attaches to the same slices and the loop back to routing closes there. Application-level outcome evaluation — did the ticket actually get resolved? — still belongs in the app, where the domain context lives. TrueFoundry's AI Gateway provides the observability substrate the gateway layer attaches to.
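
The sampling takeaway above (a small random fraction plus targeted sampling of high-risk routes) can be sketched roughly as follows. This is a minimal illustration, not TrueFoundry's implementation: the rates, the route name `support-answers`, and the `should_score` helper are all hypothetical. It hashes the request ID so the decision is deterministic and repeatable across retries and replicas.

```python
import hashlib

# Hypothetical sampling rates; tune them to your scoring budget.
BASE_RATE = 0.02                               # small random fraction of all traffic
HIGH_RISK_RATES = {"support-answers": 0.20}    # hypothetical high-risk route

def should_score(request_id: str, route: str) -> bool:
    """Decide deterministically whether a response is sent to the scorer."""
    rate = HIGH_RISK_RATES.get(route, BASE_RATE)
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

Because the bucket depends only on the request ID, raising a route's rate only adds requests to the sample and never drops ones that were already selected. The scores that come out of the sample are still an estimate, so they should be reported with their uncertainty.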

Leena, an ML engineer, made a change everyone wanted. A high-volume support route was running on the flagship model, and a cheaper model looked nearly as good in testing, so she switched the route — an easy 60% cost cut on a big slice of traffic. Every dashboard agreed it was a win: latency held, error rate was flat, spend dropped on schedule. The change shipped, the savings landed, and the team moved on. Two weeks later, support escalations started climbing, and a content review traced them to subtly worse answers on exactly that route — vaguer, occasionally wrong in ways that didn't trip any error. The quality had dropped the day she shipped. Nothing measured it, so nothing caught it for two weeks.

This is the blind spot at the center of LLM operations. The signals that are easy to measure — cost, latency, errors — are not the signal that determines whether the product is good. Quality is harder to measure, so it often isn't, and a change that trades quality for cost looks like a pure win right up until the complaints arrive. Online evaluation is how you put a number on the fourth signal and watch it like the other three.

1. The Signal You're Missing: Quality in Production

Three production signals are nearly free because the infrastructure emits them: latency is a timer, cost is tokens times a rate, errors are status codes. Quality is none of these. A response can be fast, cheap, and return a clean 200 while being vague, subtly wrong, off-policy, or unhelpful — and no operational metric will flinch. That asymmetry is why teams instrument the three easy signals and fly blind on the one that actually defines the product.
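
To make the asymmetry concrete, here is a small sketch of how the three easy signals fall out of a request record. The `RequestRecord` fields and the per-token prices are hypothetical, chosen only for illustration. Note that nothing in the record says whether the answer was any good.

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    # Hypothetical fields a gateway might log for each call.
    started_at: float        # seconds
    finished_at: float       # seconds
    prompt_tokens: int
    completion_tokens: int
    status_code: int

# Hypothetical prices in USD per 1,000 tokens.
PRICE_PER_1K = {"prompt": 0.003, "completion": 0.015}

def operational_signals(r: RequestRecord) -> dict:
    """Derive latency, cost, and error flag; none of them reflects answer quality."""
    latency_s = r.finished_at - r.started_at                 # latency is a timer
    cost_usd = (r.prompt_tokens / 1000 * PRICE_PER_1K["prompt"]
                + r.completion_tokens / 1000 * PRICE_PER_1K["completion"])  # tokens times a rate
    is_error = r.status_code >= 400                          # errors are status codes
    return {"latency_s": latency_s, "cost_usd": cost_usd, "is_error": is_error}
```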

Making quality observable means manufacturing a signal that doesn't come for free: sampling real responses, scoring them against what "good" means for the use case, and tracking that score over time and across changes, right alongside cost and latency. The rest of this post covers how to produce that signal credibly, including being honest about how noisy it is, and where to run it so that it is connected to the decisions, such as routing, that move it.
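
One way to track a score "across changes" while staying honest about noise is to compare sampled scores on a slice before and after a change and attach an interval to the difference. The sketch below is one simple option, not a method the post prescribes: it assumes scores on a numeric scale (here 0 to 1), at least two samples per side, and a normal approximation for the 95% interval. The function names are hypothetical.

```python
import math
from statistics import mean, variance

def quality_delta(baseline, candidate, z=1.96):
    """Return (delta, low, high): change in mean score with an approximate 95% interval."""
    delta = mean(candidate) - mean(baseline)
    # Standard error of a difference of two independent sample means.
    se = math.sqrt(variance(baseline) / len(baseline)
                   + variance(candidate) / len(candidate))
    return delta, delta - z * se, delta + z * se

def is_regression(baseline, candidate):
    """Flag a regression only when the whole interval sits below zero."""
    _, _, high = quality_delta(baseline, candidate)
    return high < 0
```

With small samples or a noisy judge the interval is wide and this check stays quiet. That is the intended behavior: it is a prompt to sample more on that slice, not evidence that quality held.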
