Online LLM Evaluation: Quality Monitoring at the Gateway

Built for Speed: ~10ms Latency, Even Under Load

Blazingly fast way to build, track and deploy your models!

Handles 350+ RPS on just 1 vCPU — no tuning needed
Production-ready with full enterprise support

Get Started with Truefoundry Now Talk to the Expert

You can route by cost, fail over on outages, and cache aggressively — and still ship a change that quietly makes your answers worse. Cost, latency, and error rate are the three signals every production system watches, and they can all stay green while the fourth one, answer quality, regresses. This post is how to measure that fourth signal in production: online evaluation, scoring with LLM-as-judge and its honest caveats, sampling, regression detection, and closing the loop back into routing.

Key Takeaways

Production systems instrument cost, latency, and errors — and usually miss the signal that matters most: answer quality. A model, prompt, or routing change can keep every operational dashboard green while quality silently regresses.
Offline evaluation (a fixed test set, pre-deploy) catches known cases; online evaluation (scoring real production traffic) catches the drift, edge cases, and regressions a static set never sees. Mature teams run both.
You score a response with LLM-as-judge, heuristic checks (format, grounding, length), and guardrail signals — but LLM-as-judge is a noisy estimator, not ground truth: it has biases and inconsistency, so calibrate it against human labels and trend it rather than treating it as a verdict.
You can't score every response — scoring has cost and latency of its own — so sample: a small random fraction plus targeted sampling of high-risk routes, treating the result as a statistical estimate with uncertainty.
Quality has to be sliced like cost: by model, route, and prompt version, alongside latency and spend, so a regression is attributable to the specific change that caused it.
Regression detection is the payoff — with uncertainty attached — when a change moves the quality metric down on a slice, you find out before your customers do, which is the failure the cold open is about.
The gateway is the natural place for the cross-cutting online-evaluation layer: it already sees the request/response envelope, model, route, latency, cost, errors, and metadata, so a sampled quality score attaches to the same slices and the loop back to routing closes there. Application-level outcome evaluation — did the ticket actually get resolved? — still belongs in the app, where the domain context lives. TrueFoundry's AI Gateway provides the observability substrate the gateway layer attaches to.

Leena, an ML engineer, made a change everyone wanted. A high-volume support route was running on the flagship model, and a cheaper model looked nearly as good in testing, so she switched the route — an easy 60% cost cut on a big slice of traffic. Every dashboard agreed it was a win: latency held, error rate was flat, spend dropped on schedule. The change shipped, the savings landed, and the team moved on. Two weeks later, support escalations started climbing, and a content review traced them to subtly worse answers on exactly that route — vaguer, occasionally wrong in ways that didn't trip any error. The quality had dropped the day she shipped. Nothing measured it, so nothing caught it for two weeks.

This is the blind spot at the center of LLM operations. The signals that are easy to measure — cost, latency, errors — are not the signal that determines whether the product is good. Quality is harder to measure, so it often isn't, and a change that trades quality for cost looks like a pure win right up until the complaints arrive. Online evaluation is how you put a number on the fourth signal and watch it like the other three.

1. The Signal You're Missing: Quality in Production

Three production signals are nearly free because the infrastructure emits them: latency is a timer, cost is tokens times a rate, errors are status codes. Quality is none of these. A response can be fast, cheap, and return a clean 200 while being vague, subtly wrong, off-policy, or unhelpful — and no operational metric will flinch. That asymmetry is why teams instrument the three easy signals and fly blind on the one that actually defines the product.

Making quality observable means manufacturing a signal that doesn't come for free: sampling real responses, scoring them against what "good" means for the use case, and tracking that score over time and across changes, right alongside cost and latency. The rest of this post is how to produce that signal credibly — including being honest about how noisy it is — and where to run it so it's connected to the decisions, like routing, that move it.

‍

TrueFoundry AI Gateway delivers ~3–4 ms latency, handles 350+ RPS on 1 vCPU, scales horizontally with ease, and is production-ready, while LiteLLM suffers from high latency, struggles beyond moderate RPS, lacks built-in scaling, and is best for light or prototype workloads.

Built for Speed: ~10ms Latency, Even Under Load

Schedule your Demo Now