Online LLM Evaluation: Quality Monitoring at the Gateway


This post describes an architectural pattern you can implement using TrueFoundry’s AI Gateway. Note that online evaluation is not a native built-in feature — it is something you build on top of the gateway’s observability and routing capabilities.

You can route by cost, fail over on outages, and cache aggressively — and still ship a change that quietly makes your answers worse. Cost, latency, and error rate are the three signals every production system watches, and they can all stay green while the fourth one, answer quality, regresses. This post is how to measure that fourth signal in production: online evaluation, scoring with LLM-as-judge and its honest caveats, sampling, regression detection, and closing the loop back into routing.

Key Takeaways

Production systems instrument cost, latency, and errors — and usually miss the signal that matters most: answer quality. A model, prompt, or routing change can keep every operational dashboard green while quality silently regresses.
Offline evaluation (a fixed test set, pre-deploy) catches known cases; online evaluation (scoring real production traffic) catches the drift, edge cases, and regressions a static set never sees. Mature teams run both.
You score a response with LLM-as-judge, heuristic checks (format, grounding, length), and guardrail signals — but LLM-as-judge is a noisy estimator, not ground truth: it has biases and inconsistency, so calibrate it against human labels and trend it rather than treating it as a verdict.
You can't score every response — scoring has cost and latency of its own — so sample: a small random fraction plus targeted sampling of high-risk routes, treating the result as a statistical estimate with uncertainty.
Quality has to be sliced like cost: by model, route, and prompt version, alongside latency and spend, so a regression is attributable to the specific change that caused it.
Regression detection, with uncertainty attached, is the payoff: when a change moves the quality metric down on a slice, you find out before your customers do, which is exactly what failed to happen in the cold open.
The gateway is the natural place for the cross-cutting online-evaluation layer: it already sees the request/response envelope, model, route, latency, cost, errors, and metadata, so a sampled quality score attaches to the same slices and the loop back to routing closes there. Application-level outcome evaluation — did the ticket actually get resolved? — still belongs in the app, where the domain context lives. TrueFoundry's AI Gateway provides the observability substrate the gateway layer attaches to.

Leena, an ML engineer, made a change everyone wanted. A high-volume support route was running on the flagship model, and a cheaper model looked nearly as good in testing, so she switched the route — an easy 60% cost cut on a big slice of traffic. Every dashboard agreed it was a win: latency held, error rate was flat, spend dropped on schedule. The change shipped, the savings landed, and the team moved on. Two weeks later, support escalations started climbing, and a content review traced them to subtly worse answers on exactly that route — vaguer, occasionally wrong in ways that didn't trip any error. The quality had dropped the day she shipped. Nothing measured it, so nothing caught it for two weeks.

This is the blind spot at the center of LLM operations. The signals that are easy to measure — cost, latency, errors — are not the signal that determines whether the product is good. Quality is harder to measure, so it often isn't, and a change that trades quality for cost looks like a pure win right up until the complaints arrive. Online evaluation is how you put a number on the fourth signal and watch it like the other three.

1. The Signal You're Missing: Quality in Production

Three production signals are nearly free because the infrastructure emits them: latency is a timer, cost is tokens times a rate, errors are status codes. Quality is none of these. A response can be fast, cheap, and return a clean 200 while being vague, subtly wrong, off-policy, or unhelpful — and no operational metric will flinch. That asymmetry is why teams instrument the three easy signals and fly blind on the one that actually defines the product.

Making quality observable means manufacturing a signal that doesn't come for free: sampling real responses, scoring them against what "good" means for the use case, and tracking that score over time and across changes, right alongside cost and latency. The rest of this post is how to produce that signal credibly — including being honest about how noisy it is — and where to run it so it's connected to the decisions, like routing, that move it.

Fig 1: Online evaluation as a loop: the gateway samples live responses, scorers attach a (noisy) quality estimate, scores are sliced by model/route/prompt version, regressions raise alerts when the drop clears the chosen sample-size and uncertainty threshold, and the signal feeds back into routing decisions. The dashed line is what makes it a loop rather than a dashboard.

2. Offline vs. Online Evaluation

Offline evaluation runs a fixed test set against a model or prompt before you ship — a curated set of inputs with known-good answers or rubrics, scored in CI. It's essential and it's not enough. A static test set only contains the cases you thought of; production traffic contains the ones you didn't, plus distribution drift as user behavior and the world change. Leena's cheaper model passed offline testing precisely because the test set didn't resemble the messy long tail of the live support route.

Online evaluation scores real production traffic, after the fact, on a sample. It catches what offline misses: the edge cases outside your test set, gradual drift, and regressions introduced by any change to the live system. The two are complementary — offline is your pre-flight check against known cases, online is your continuous instrument on reality. This post focuses on online, because that's the gap that let a two-week regression go unnoticed.

3. How You Score a Response: LLM-as-Judge, Heuristics, and Guardrail Signals

There are three practical ways to put a number on a response, and you usually combine them. Heuristics are cheap, deterministic checks: did the output parse as valid JSON, does it cite a source when it should, is it within a sane length, does it contain a refusal. Guardrail signals reuse the detectors from earlier in this series — a PII hit, a toxicity flag, an injection-detector firing on the output are all quality signals too. And LLM-as-judge uses a model to score a response against a rubric, which is the only one of the three that can assess open-ended qualities like helpfulness, faithfulness, or tone.
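The heuristic checks are the easiest of the three to start with. Below is a minimal sketch of what such a check might look like, assuming plain-text answers. The function name `heuristic_checks`, the length bounds, and the refusal phrases are all hypothetical placeholders, not part of any gateway API, and should be tuned to your own product.

```python
import json
import re

# Hypothetical refusal phrases; replace with wording your own models actually produce.
REFUSAL_PATTERNS = re.compile(
    r"\b(i can't help|i cannot help|i'm unable to|i am unable to)\b", re.IGNORECASE
)

def heuristic_checks(answer: str, expect_json: bool = False,
                     min_chars: int = 20, max_chars: int = 4000) -> dict:
    """Return cheap, deterministic pass/fail signals for one response."""
    checks = {
        "length_ok": min_chars <= len(answer) <= max_chars,
        "is_refusal": bool(REFUSAL_PATTERNS.search(answer)),
    }
    if expect_json:
        try:
            json.loads(answer)
            checks["valid_json"] = True
        except json.JSONDecodeError:
            checks["valid_json"] = False
    return checks
```

Each check is a boolean, so the results can be trended as pass rates per slice without any model call.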

LLM-as-judge scorer with an explicit rubric (illustrative)

JUDGE_PROMPT = """You are grading a support answer against a rubric.
Rate each dimension 1-5 and return ONLY JSON.
- faithful: supported by the provided context, no fabrication
- helpful: directly addresses the user's question
- safe: no PII leakage, no policy violation
Question: {question}
Context: {context}
Answer: {answer}
Return: {{"faithful": int, "helpful": int, "safe": int, "reason": str}}"""

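import json

# judge_model is assumed to be your own client object exposing
# complete(prompt, temperature=...) -> str; it is not a specific library API.
def judge(question, context, answer):
    prompt = JUDGE_PROMPT.format(question=question, context=context, answer=answer)
    raw = judge_model.complete(prompt, temperature=0)
    return json.loads(raw)   # trend these scores; do not treat as ground truth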

LLM-as-judge is a noisy estimator, not ground truth

A judge model is still a model. It has known biases (it can favor longer answers, prefer outputs from its own model family, and be sensitive to ordering in pairwise comparisons) and it is not perfectly consistent across runs. Treat its scores as a noisy signal to trend over time and across slices, not as a verdict on any single response. Calibrate it against a set of human-labeled examples so you know how well it tracks human judgment for your task, re-check that calibration periodically, and never gate a release solely on an uncalibrated judge.

Concretely: keep a small, continuously refreshed human-labeled calibration set per high-value route; track judge–human agreement by rubric dimension, not just in aggregate; and recalibrate whenever the judge model, the rubric, the prompt, the product policy, or the traffic distribution shifts, since any of these can move the score without the underlying quality changing. Keep the scorer version in metadata so a judge change never masquerades as a product-quality change.

These cautions are well documented: the foundational LLM-as-judge study (Zheng et al., NeurIPS 2023) names position, verbosity, and self-enhancement biases directly, and a later systematic study of position bias (Shi et al., 2025) confirms it is not random and varies sharply across judges and tasks. Online evaluation reduces your blind spots; it does not guarantee quality.
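Tracking judge–human agreement by rubric dimension can be sketched in a few lines. This is one possible way to do it, assuming the calibration set is a non-empty list of (judge scores, human scores) pairs on the same 1-5 rubric. The function name `agreement_by_dimension` and the one-point tolerance are illustrative assumptions.

```python
def agreement_by_dimension(pairs, dimensions=("faithful", "helpful", "safe"), tolerance=1):
    """pairs: non-empty list of (judge_scores, human_scores) dicts on the same rubric.

    Returns, per dimension, the share of examples where judge and human are
    within `tolerance` points, and the mean signed gap (judge minus human).
    """
    report = {}
    for dim in dimensions:
        gaps = [judge[dim] - human[dim] for judge, human in pairs]
        report[dim] = {
            "n": len(gaps),
            "within_tolerance": sum(abs(g) <= tolerance for g in gaps) / len(gaps),
            "mean_gap": sum(gaps) / len(gaps),
        }
    return report
```

A positive `mean_gap` on one dimension suggests the judge is systematically more generous than your human labelers there, which is the kind of drift worth re-checking after any judge, rubric, or prompt change.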

4. Sampling: You Can't (and Shouldn't) Score Everything

Scoring has its own cost and latency — an LLM-as-judge call is another model call — so scoring 100% of traffic is rarely worth it and can rival the cost of the traffic itself. The answer is sampling, with a little statistical honesty. A small random fraction of every route gives you an unbiased estimate of overall quality; targeted sampling raises the rate on the routes you care about most — high-volume, high-stakes, or recently changed. Because you're estimating from a sample, every quality number carries uncertainty, and a small sample on a low-volume route can move for reasons that have nothing to do with a real change.

Sampling and scoring asynchronously, off the hot path (illustrative)

from random import random

# Scoring runs after the response is returned, so it never adds latency to the user.
# HIGH_RISK_ROUTES and enqueue_for_scoring are placeholders for your own
# route list and queue client.
def on_response(req, resp):
    rate = 0.20 if req.route in HIGH_RISK_ROUTES else 0.02   # targeted + baseline
    if random() < rate:
        enqueue_for_scoring(                                  # async; off the hot path
            response=resp,
            tags={"model": req.model, "route": req.route,
                  "prompt_version": req.prompt_version},      # slice keys
        )

Two disciplines keep this honest: run scoring asynchronously so it never adds latency to the user's response, and report quality with its sample size so a noisy low-volume slice isn't mistaken for a trend. Sampling turns an unaffordable "score everything" into an affordable, statistically valid instrument.
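Reporting quality with its sample size might look like the sketch below. It assumes the scores for one slice arrive as a plain list of numbers and uses a normal approximation for the interval, which is only reasonable once a slice has a few dozen scores. The name `quality_estimate` is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def quality_estimate(scores, z=1.96):
    """Mean score with an approximate 95% confidence interval (normal approximation)."""
    n = len(scores)
    if n < 2:
        # Too few scores to estimate noise: report the count, but no interval.
        return {"n": n, "mean": mean(scores) if scores else None, "ci": None}
    m = mean(scores)
    half_width = z * stdev(scores) / sqrt(n)
    return {"n": n, "mean": m, "ci": (m - half_width, m + half_width)}
```

Showing `n` and the interval next to every mean makes it obvious when a low-volume slice has moved less than its own margin of error.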

5. Quality Metrics by Model, Route, and Prompt Version

A single global quality number is nearly useless for diagnosis — it can't tell you that one route regressed while everything else held. Quality has to be sliced the same way cost is sliced in our cost-attribution post: by model, by route, by prompt version, and by any other dimension a change can move. Those slice keys are exactly the metadata the gateway already attaches to every request, which is why quality belongs next to cost and latency rather than in a separate system.
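As a rough illustration of slicing, the sketch below groups scoring events by the same keys and reports a mean with its sample size. It assumes each event is a dict carrying `model`, `route`, `prompt_version`, and `quality_score`; the function name `slice_scores` is hypothetical, and a real deployment would more likely run this as a query in its metrics store.

```python
from collections import defaultdict

def slice_scores(events, keys=("model", "route", "prompt_version")):
    """Group scoring events by slice keys and report mean score with sample size."""
    buckets = defaultdict(list)
    for event in events:
        buckets[tuple(event[k] for k in keys)].append(event["quality_score"])
    return {
        slice_key: {"n": len(scores), "mean": sum(scores) / len(scores)}
        for slice_key, scores in buckets.items()
    }
```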

Fig 2: TrueFoundry's AI Gateway observability already breaks cost, tokens, and latency down by model, team, and metadata. Online evaluation adds the fourth signal — a sampled quality score — onto this same surface and the same slice keys, so quality sits beside cost and latency rather than in a separate tool. Source: *TrueFoundry AI Gateway*.

Putting quality on the same axes as cost and latency is what makes the tradeoff visible instead of hidden. Leena's change would have shown up immediately as a quality drop on one route, on the day she shipped, sitting right next to the cost drop she was celebrating — the two numbers that should always be read together. TrueFoundry's AI Gateway provides the observability substrate — request/response logs, metadata tagging, tracing, cost, latency, and routing context, sliced by model, team, and metadata — that this scoring attaches to. The judge-and-score loop described here is an architectural pattern you build on top of that telemetry unless wired in through a specific evaluation integration; online evaluation is what adds the quality estimate to signals the gateway is already collecting.

Concretely, the unit online evaluation emits is a scoring event joined back to the original response. A workable minimum schema looks like this — the fields are what separate an actionable signal from a misleading one:

Field	Why it matters
request_id / trace_id	Joins the score back to the exact response it grades — and to its cost, latency, and trace.
route	Detects route-specific regressions instead of drowning them in a global average.
model	Lets you compare a model substitution like Leena's directly, quality against cost.
prompt_version	Attributes a regression to a prompt edit rather than a model or traffic change.
scorer_version / judge_model	Keeps a judge change from masquerading as a product-quality change, and tracks evaluator drift and cost.
quality_score / rubric_dimensions	The numeric trend signal, plus the per-dimension breakdown that tells you what degraded.
sample_policy	Records how this response was selected, so you can reason about selection bias.
n / confidence_interval	Stops a noisy low-volume slice from being mistaken for a real regression.
human_label_available	Marks the calibration set — the rows where judge and human judgment can be compared.
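One way to carry this schema in code is a small dataclass, sketched below. The field names follow the table; the types and defaults are assumptions. This sketch also treats `n` and `confidence_interval` as properties of an aggregate over many events rather than of a single event, so it leaves them to the aggregation step.

```python
from dataclasses import dataclass, field, asdict
from typing import Optional

@dataclass
class ScoringEvent:
    """One score joined back to the response it grades."""
    request_id: str
    route: str
    model: str
    prompt_version: str
    scorer_version: str
    judge_model: str
    quality_score: float
    rubric_dimensions: dict = field(default_factory=dict)  # e.g. {"faithful": 4}
    sample_policy: str = "random_baseline"                 # hypothetical label
    human_label_available: bool = False
    trace_id: Optional[str] = None
```

Calling `asdict(event)` gives a plain dict that can be written to whatever log or metrics pipeline already holds the cost and latency records.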

6. Detecting Regression When a Change Degrades Quality

Slicing quality is what makes regression detection possible: you compare the quality estimate on a slice before and after a change — a new model on a route, a prompt edit, a routing-policy update — and alert when it drops by more than the noise. Because the scores are sampled and noisy, the comparison has to respect uncertainty: a drop within the sample's margin isn't a regression, and a small slice needs a larger or longer sample before you trust the move.

Comparing a slice across a change, accounting for sample noise (illustrative)

# quality_scores, significant, THRESHOLD, and alert are placeholders for your own helpers.
before = quality_scores(route="support", prompt_version="v3")   # baseline window
after  = quality_scores(route="support", prompt_version="v4")   # after the change

drop = before.mean() - after.mean()
if drop > THRESHOLD and significant(before, after):             # beyond sample noise
    alert(f"quality regression on support: {before.mean():.2f} -> {after.mean():.2f}")
    # optionally: auto-roll back the route to the prior version/model

The win is timing. A regression check on the right slices turns Leena's two-week gap into a same-day alert: the moment the support route's quality estimate drops below its baseline by more than the noise, someone is paged — long before the escalations would have surfaced it. Whether you auto-roll-back or just alert is a judgment call that depends on how much you trust the signal on that slice, which is exactly why the calibration from section 3 matters.
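The "more than the noise" check can be made concrete in several ways. The sketch below is one of them: a one-sided Welch-style test with a normal approximation, using only the standard library. It assumes both windows are plain lists of numeric scores with at least a few dozen entries each; the name `significant` and the `alpha` default are illustrative choices, and small slices would need a more careful test.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def significant(before, after, alpha=0.05):
    """One-sided check: is the mean of `after` lower than `before` beyond sample noise?

    Uses Welch's statistic with a normal approximation, which is reasonable
    only when both samples are fairly large.
    """
    if len(before) < 2 or len(after) < 2:
        return False  # not enough data to estimate noise
    standard_error = sqrt(variance(before) / len(before) + variance(after) / len(after))
    if standard_error == 0:
        return mean(after) < mean(before)  # no spread at all: any drop is real
    z = (mean(before) - mean(after)) / standard_error
    p_value = 1 - NormalDist().cdf(z)
    return p_value < alpha
```

Pairing this with a minimum absolute drop keeps a statistically detectable but practically trivial change from paging anyone.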

7. Closing the Loop: Feeding Quality Back into Routing

The reason to measure quality at the gateway, rather than in a separate analytics pipeline, is that the gateway is also where routing decisions are made — so the signal can feed the decision. Our routing post described quality-aware routing as an aspiration that needs a quality signal to be real; online evaluation is that signal. With per-slice quality scores in hand, routing stops being a static guess and becomes a feedback loop: promote a cheaper model on a route only while its measured quality holds, and alert or roll back when it doesn't — which of the two depends on the route's risk and how much you trust the signal on that slice.

That closes the loop the cold open left open. Leena's cost-saving change is exactly the kind of decision that should be gated on a live quality signal: ship the cheaper model, watch the quality estimate on that route, and keep the savings only as long as the quality stays within tolerance. The gateway is the one place that sees the responses to score and makes the routing decision to adjust, which is what makes it the right home for the loop rather than just the measurement.
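A gate of this kind could be as simple as the sketch below. It assumes you already have a baseline mean for the route and the lower bound of a confidence interval for the cheaper model's quality on that route, measured on a share of live traffic. The function name `choose_model`, the model labels, and the tolerance are hypothetical, and this is not a built-in gateway routing rule.

```python
def choose_model(candidate_ci_low, baseline_mean, tolerance,
                 candidate="cheap-model", fallback="flagship-model"):
    """Keep the cheaper candidate only while the lower bound of its quality
    confidence interval stays within `tolerance` of the baseline mean."""
    if candidate_ci_low is None:
        return fallback          # no trustworthy estimate yet: stay conservative
    if candidate_ci_low >= baseline_mean - tolerance:
        return candidate
    return fallback
```

Gating on the lower bound rather than the mean means a thin sample cannot justify the switch on its own: the candidate has to earn the route with enough scored traffic.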

8. Where Evaluation Lives: Gateway vs. Application vs. Offline

Not all evaluation belongs in one place, and it's worth being precise about the division. Offline evaluation lives in CI, against fixed test sets, gating deploys on known cases. Application-level evaluation lives in the app when scoring needs context the gateway doesn't have — domain ground truth, business outcomes, whether the user's task actually succeeded. Gateway-level online evaluation lives at the gateway for the cross-cutting signal: a sampled, sliced quality estimate on live traffic, attached to the cost and latency telemetry, feeding routing.

Layer	What it measures	Why there
Offline (CI)	Known cases, pre-deploy, against a fixed set	Gate releases on regressions you can anticipate
Gateway (online)	Sampled quality on live traffic, sliced by model/route/version	Sees every response and the routing decision; cross-cutting and consistent
Application	Task success, business outcomes, domain ground truth	Needs context only the app has

The gateway doesn't replace the other two; it fills the gap they leave — continuous, consistent quality monitoring across all traffic, in the one place that can both observe responses and act on routing. That's the role this whole series has argued the gateway plays: the cross-cutting control plane, here applied to the signal that's hardest to measure and matters most.

9. FAQs

Why isn't offline evaluation enough?

Because a fixed test set only contains the cases you anticipated. Production has the long tail you didn't, plus drift over time, plus regressions from any live change. Leena's cheaper model passed offline testing and still regressed in production, because the test set didn't resemble the real support traffic. Offline is your pre-flight check; online is your continuous instrument on reality. You want both.

Can I trust an LLM to grade another LLM?

As a trend signal, with calibration — not as ground truth. Judge models have biases (length, self-preference, position) and aren't perfectly consistent, so calibrate the judge against human-labeled examples to learn how well it tracks human judgment for your task, trend the scores over time and across slices rather than acting on any single one, and don't gate a release solely on an uncalibrated judge. It's a useful, imperfect instrument.

What would have caught the cold open?

A sampled quality score on the support route, sliced by model and prompt version, with a regression check against the pre-change baseline. The day Leena switched models, the route's quality estimate would have dropped next to the cost drop, and the regression alert would have fired — turning a two-week blind spot into a same-day signal. The cost saving wasn't the mistake; shipping it without a quality signal was.

How much traffic do I need to score?

Enough for the slice you care about to be statistically meaningful, which depends on volume and how large a change you need to detect. A small random baseline across all routes plus a higher targeted rate on high-stakes or recently changed routes is a sensible default. Always report quality with its sample size, and be skeptical of moves on low-volume slices until the sample is large enough to trust.
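For a rough sense of scale, the standard two-sample z-test approximation gives the number of scores needed in each window. The sketch below assumes scores are roughly independent with a known standard deviation `sigma`; the function name and defaults are illustrative, and judge noise or correlated traffic would push the real requirement higher.

```python
from math import ceil
from statistics import NormalDist

def samples_per_window(sigma, min_detectable_drop, alpha=0.05, power=0.80):
    """Approximate scores needed in each window (before and after) to detect a
    drop in mean of `min_detectable_drop`, given score standard deviation `sigma`.
    Two-sided, two-sample z-test approximation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * sigma / min_detectable_drop) ** 2)
```

Because the requirement scales with the inverse square of the drop, halving the change you want to detect roughly quadruples the scores you need, which is why low-volume routes take longer to trust.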

Gateway or application for online evaluation?

Both, for different signals. The gateway owns the cross-cutting one — sampled quality on live traffic, sliced and attached to cost and latency, feeding routing — because it sees every response and makes the routing decision. The application owns evaluation that needs context the gateway lacks, like whether the user's actual task succeeded. They're complementary, not competing.

The three easy signals will always be the ones you instrument first, because the infrastructure hands them to you. Quality is the one you have to build a signal for — by sampling responses, scoring them honestly, slicing the scores, and watching for regressions. Build that signal at the gateway, where the responses and the routing decisions already are, and the next cost-saving change that quietly hurts quality becomes a same-day alert instead of a two-week mystery.

References

TrueFoundry — AI Gateway (request/response logs, metadata, observability)
TrueFoundry — Tracing
TrueFoundry — OpenTelemetry for LLM gateway instrumentation
TrueFoundry — LLM cost attribution by team and budget
TrueFoundry — Intelligent LLM routing (the loop this post closes)
Zheng et al., “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena” (NeurIPS 2023) — the foundational LLM-as-judge method, and its own account of position, verbosity, and self-enhancement biases.
Shi et al., “Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge” (IJCNLP-AACL 2025) — position bias is systematic, not random, and varies sharply across judges and tasks.

Northwind and Leena are illustrative, as are the quality figures and thresholds shown. LLM-as-judge is a noisy estimator with known biases and is not ground truth; the scores described should be calibrated against human labels and trended rather than treated as verdicts, and online evaluation reduces blind spots without guaranteeing quality. TrueFoundry capabilities are summarized from public product documentation as of June 2026 and will evolve. Code samples are illustrative of the patterns described, not copied from a reference implementation.

