Online LLM Evaluation: Quality Monitoring at the Gateway

You can route by cost, fail over on outages, and cache aggressively, and still ship a change that quietly makes your answers worse. Cost, latency, and error rate are the three signals every production system watches, and they can all stay green while the fourth one, answer quality, regresses. This post covers how to measure that fourth signal in production: online evaluation, scoring with LLM-as-judge and its honest caveats, sampling, regression detection, and closing the loop back into routing.

Key Takeaways

Production systems instrument cost, latency, and errors — and usually miss the signal that matters most: answer quality. A model, prompt, or routing change can keep every operational dashboard green while quality silently regresses.
Offline evaluation (a fixed test set, pre-deploy) catches known cases; online evaluation (scoring real production traffic) catches the drift, edge cases, and regressions a static set never sees. Mature teams run both.
You score a response with LLM-as-judge, heuristic checks (format, grounding, length), and guardrail signals — but LLM-as-judge is a noisy estimator, not ground truth: it has biases and inconsistency, so calibrate it against human labels and trend it rather than treating it as a verdict.
You can't score every response — scoring has cost and latency of its own — so sample: a small random fraction plus targeted sampling of high-risk routes, treating the result as a statistical estimate with uncertainty.
Quality has to be sliced like cost: by model, route, and prompt version, alongside latency and spend, so a regression is attributable to the specific change that caused it.
Regression detection is the payoff: when a change moves the quality metric down on a slice by more than sampling noise can explain, you find out before your customers do. That is exactly the failure in the opening story below.
The gateway is the natural place for the cross-cutting online-evaluation layer: it already sees the request/response envelope, model, route, latency, cost, errors, and metadata, so a sampled quality score attaches to the same slices and the loop back to routing closes there. Application-level outcome evaluation — did the ticket actually get resolved? — still belongs in the app, where the domain context lives. TrueFoundry's AI Gateway provides the observability substrate the gateway layer attaches to.

Leena, an ML engineer, made a change everyone wanted. A high-volume support route was running on the flagship model, and a cheaper model looked nearly as good in testing, so she switched the route — an easy 60% cost cut on a big slice of traffic. Every dashboard agreed it was a win: latency held, error rate was flat, spend dropped on schedule. The change shipped, the savings landed, and the team moved on. Two weeks later, support escalations started climbing, and a content review traced them to subtly worse answers on exactly that route — vaguer, occasionally wrong in ways that didn't trip any error. The quality had dropped the day she shipped. Nothing measured it, so nothing caught it for two weeks.

This is the blind spot at the center of LLM operations. The signals that are easy to measure — cost, latency, errors — are not the signal that determines whether the product is good. Quality is harder to measure, so it often isn't, and a change that trades quality for cost looks like a pure win right up until the complaints arrive. Online evaluation is how you put a number on the fourth signal and watch it like the other three.

1. The Signal You're Missing: Quality in Production

Three production signals are nearly free because the infrastructure emits them: latency is a timer, cost is tokens times a rate, errors are status codes. Quality is none of these. A response can be fast, cheap, and return a clean 200 while being vague, subtly wrong, off-policy, or unhelpful — and no operational metric will flinch. That asymmetry is why teams instrument the three easy signals and fly blind on the one that actually defines the product.

Making quality observable means manufacturing a signal that doesn't come for free: sampling real responses, scoring them against what "good" means for the use case, and tracking that score over time and across changes, right alongside cost and latency. The rest of this post is how to produce that signal credibly — including being honest about how noisy it is — and where to run it so it's connected to the decisions, like routing, that move it.

Fig 1: Online evaluation as a loop: the gateway samples live responses, scorers attach a (noisy) quality estimate, scores are sliced by model/route/prompt version, regressions raise alerts when the drop clears the chosen sample-size and uncertainty threshold, and the signal feeds back into routing decisions. The dashed line is what makes it a loop rather than a dashboard.

2. Offline vs. Online Evaluation

Offline evaluation runs a fixed test set against a model or prompt before you ship — a curated set of inputs with known-good answers or rubrics, scored in CI. It's essential and it's not enough. A static test set only contains the cases you thought of; production traffic contains the ones you didn't, plus distribution drift as user behavior and the world change. Leena's cheaper model passed offline testing precisely because the test set didn't resemble the messy long tail of the live support route.

Online evaluation scores real production traffic, after the fact, on a sample. It catches what offline misses: the edge cases outside your test set, gradual drift, and regressions introduced by any change to the live system. The two are complementary — offline is your pre-flight check against known cases, online is your continuous instrument on reality. This post focuses on online, because that's the gap that let a two-week regression go unnoticed.

3. How You Score a Response: LLM-as-Judge, Heuristics, and Guardrail Signals

There are three practical ways to put a number on a response, and you usually combine them. Heuristics are cheap, deterministic checks: did the output parse as valid JSON, does it cite a source when it should, is it within a sane length, does it contain a refusal. Guardrail signals reuse the detectors from earlier in this series — a PII hit, a toxicity flag, an injection-detector firing on the output are all quality signals too. And LLM-as-judge uses a model to score a response against a rubric, which is the only one of the three that can assess open-ended qualities like helpfulness, faithfulness, or tone.
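The heuristic checks named above can be sketched in a few lines. This is a minimal illustration, not a recommended rule set: the refusal phrases, the length limits, and the citation pattern are all assumptions to replace with whatever "good" means on your route.

```python
import json
import re

# Hypothetical refusal phrases; extend or replace per route.
REFUSAL_PATTERN = re.compile(
    r"\b(i can't|i cannot|i'm unable to|i am unable to)\b", re.IGNORECASE
)

def heuristic_checks(answer, *, expect_json=False, require_citation=False,
                     min_chars=20, max_chars=4000):
    """Return deterministic pass/fail flags for a single response."""
    checks = {
        "length_ok": min_chars <= len(answer) <= max_chars,
        "no_refusal": REFUSAL_PATTERN.search(answer) is None,
    }
    if expect_json:
        try:
            json.loads(answer)
            checks["valid_json"] = True
        except json.JSONDecodeError:
            checks["valid_json"] = False
    if require_citation:
        # Assumes a citation looks like [1] or a URL; adjust to your format.
        checks["has_citation"] = re.search(r"\[\d+\]|https?://", answer) is not None
    return checks
```

Each flag is a boolean, so the per-slice pass rate of any one check can be trended the same way as a judge score.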

LLM-as-judge scorer with an explicit rubric (illustrative)

JUDGE_PROMPT = """You are grading a support answer against a rubric.
Rate each dimension 1-5 and return ONLY JSON.
- faithful: supported by the provided context, no fabrication
- helpful: directly addresses the user's question
- safe: no PII leakage, no policy violation
Question: {question}
Context: {context}
Answer: {answer}
Return: {{"faithful": int, "helpful": int, "safe": int, "reason": str}}"""

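import json

def judge(question, context, answer):
    # judge_model is assumed to be an already-configured client whose
    # complete(prompt, temperature=...) call returns the model's text output.
    prompt = JUDGE_PROMPT.format(question=question, context=context, answer=answer)
    raw = judge_model.complete(prompt, temperature=0)
    try:
        return json.loads(raw)   # trend these scores; do not treat as ground truth
    except json.JSONDecodeError:
        return None              # unparseable judge output: drop the sample and count it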

LLM-as-judge is a noisy estimator, not ground truth

A judge model is still a model. It has known biases — it can favor longer answers, prefer outputs from its own model family, and be sensitive to ordering in pairwise comparisons — and it is not perfectly consistent across runs. Treat its scores as a noisy signal to trend over time and across slices, not as a verdict on any single response. Calibrate it against a set of human-labeled examples so you know how well it tracks human judgment for your task, re-check that calibration periodically, and never gate a release solely on an uncalibrated judge. Concretely: keep a small, continuously refreshed human-labeled calibration set per high-value route; track judge–human agreement by rubric dimension, not just in aggregate; and recalibrate whenever the judge model, the rubric, the prompt, the product policy, or the traffic distribution shifts — any of which can move the score without the underlying quality changing. Keep the scorer version in metadata so a judge change never masquerades as a product-quality change. These cautions are well documented: the foundational LLM-as-judge study (Zheng et al., NeurIPS 2023) names position, verbosity, and self-enhancement biases directly, and a later systematic study of position bias (Shi et al., 2025) confirms it is not random and varies sharply across judges and tasks. Online evaluation reduces your blind spots; it does not guarantee quality.
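Tracking judge-human agreement by rubric dimension can be as simple as the sketch below. It assumes the calibration set is available as pairs of score dictionaries on the same 1-5 scale as the rubric above; the function name and the "within one point" tolerance are illustrative choices, not a standard.

```python
from collections import defaultdict

def judge_human_agreement(pairs, dimensions=("faithful", "helpful", "safe"), tolerance=1):
    """pairs: iterable of (judge_scores, human_scores) dicts with 1-5 ratings.

    Returns, per dimension, the sample size, the mean absolute error, and the
    share of examples where the judge is within `tolerance` of the human label.
    """
    abs_errors = defaultdict(list)
    for judge_scores, human_scores in pairs:
        for dim in dimensions:
            abs_errors[dim].append(abs(judge_scores[dim] - human_scores[dim]))

    report = {}
    for dim in dimensions:
        errs = abs_errors[dim]
        if not errs:
            continue  # no labeled examples for this dimension
        report[dim] = {
            "n": len(errs),
            "mean_abs_error": sum(errs) / len(errs),
            "within_tolerance": sum(e <= tolerance for e in errs) / len(errs),
        }
    return report
```

Reporting per dimension matters because a judge can track humans well on one dimension and poorly on another, and an aggregate number hides that.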

4. Sampling: You Can't (and Shouldn't) Score Everything

Scoring has its own cost and latency — an LLM-as-judge call is another model call — so scoring 100% of traffic is rarely worth it and can rival the cost of the traffic itself. The answer is sampling, with a little statistical honesty. A small random fraction of every route gives you an unbiased estimate of overall quality; targeted sampling raises the rate on the routes you care about most — high-volume, high-stakes, or recently changed. Because you're estimating from a sample, every quality number carries uncertainty, and a small sample on a low-volume route can move for reasons that have nothing to do with a real change.
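One consequence of mixing a baseline rate with a higher targeted rate: a plain average over all scored samples over-represents the targeted routes. A common correction is to weight each scored response by the inverse of the rate it was sampled at. The sketch below assumes every scored record carries that rate in a hypothetical `sample_rate` field.

```python
def weighted_mean_score(records):
    """records: iterable of dicts with 'score' and 'sample_rate' (0 < rate <= 1).

    Each scored response stands in for roughly 1 / sample_rate responses
    of real traffic, so it is weighted accordingly.
    """
    total_weight = 0.0
    weighted_sum = 0.0
    for record in records:
        weight = 1.0 / record["sample_rate"]
        total_weight += weight
        weighted_sum += weight * record["score"]
    if total_weight == 0:
        raise ValueError("no scored records")
    return weighted_sum / total_weight
```

Per-slice numbers do not need this correction as long as a slice is sampled at a single rate; it only matters when slices with different rates are pooled.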

Sampling and scoring asynchronously, off the hot path (illustrative)

# Scoring runs after the response is returned, so it never adds latency to the user.
# HIGH_RISK_ROUTES and enqueue_for_scoring stand in for your own route list and job queue.
from random import random

def on_response(req, resp):
    rate = 0.20 if req.route in HIGH_RISK_ROUTES else 0.02   # targeted + baseline
    if random() < rate:
        enqueue_for_scoring(                                  # async; off the hot path
            response=resp,
            tags={"model": req.model, "route": req.route,
                  "prompt_version": req.prompt_version},      # slice keys
        )

Two disciplines keep this honest: run scoring asynchronously so it never adds latency to the user's response, and report quality with its sample size so a noisy low-volume slice isn't mistaken for a trend. Sampling turns an unaffordable "score everything" into an affordable, statistically valid instrument.
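To make "report quality with its sample size" concrete, one option is to turn scores into a pass rate (for example, the share of sampled responses whose helpfulness score meets a threshold you choose) and report it with a Wilson score interval. The sketch below is one reasonable choice of interval, not the only one; the 95% level (`z = 1.96`) is an assumption.

```python
import math

def wilson_interval(passes, n, z=1.96):
    """Wilson score interval for a pass rate of `passes` out of `n` samples."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)
```

With 45 passes out of 50 samples the interval runs from about 0.79 to 0.96, so a 90% pass rate on a slice that small is compatible with a true rate anywhere in that range. The same 90% on 1,000 samples gives a much tighter interval, which is why the sample size belongs on the dashboard next to the score.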

5. Quality Metrics by Model, Route, and Prompt Version

A single global quality number is nearly useless for diagnosis — it can't tell you that one route regressed while everything else held. Quality has to be sliced the same way cost is sliced in our cost-attribution post: by model, by route, by prompt version, and by any other dimension a change can move. Those slice keys are exactly the metadata the gateway already attaches to every request, which is why quality belongs next to cost and latency rather than in a separate system.
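Slicing is a plain group-by over the tags attached at sampling time. The sketch below assumes each scored record is a dictionary with a `tags` mapping (model, route, prompt version) and a `scores` mapping from the judge; that record shape is hypothetical and would follow whatever your scoring queue emits.

```python
from collections import defaultdict
from statistics import mean

def quality_by_slice(scored, keys=("model", "route", "prompt_version"), metric="helpful"):
    """Group scored samples by slice keys; return sample size and mean per slice."""
    groups = defaultdict(list)
    for item in scored:
        slice_key = tuple(item["tags"][k] for k in keys)
        groups[slice_key].append(item["scores"][metric])
    return {key: {"n": len(vals), "mean": mean(vals)} for key, vals in groups.items()}
```

Keeping `n` in the output is deliberate: a slice mean without its sample size invites reading noise as a regression.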

