What a Production Eval System Actually Looks Like

One of the stranger failures I have seen was a reporting agent confidently giving product-support advice for features it had no business talking about.

The system was built to answer questions about reporting, but users saw a chat box and, reasonably enough, treated it like a help bot. When they asked how to use unrelated product features, the agent sometimes improvised.

We had not built an eval for that.

We were measuring whether the agent could answer the reporting questions we expected. Production users were testing whether the product surface made sense to them. That is a fundamental misalignment.

The first lesson: production evals are not just about grading model answers. They are about discovering the task users have decided your system performs.

A production eval system is a loop

Most teams start with evals as a pre-launch artifact:

a notebook
a spreadsheet
a prompt playground
a small set of golden examples

These are fine at the beginning, but they are not enough for production.

In production, an eval system becomes a loop:

capture real behavior
surface suspicious sessions
run deterministic checks and model judges
send hard cases to humans
turn those human decisions into labels
use the labels to improve evaluators, prompts, retrieval, tools, and product boundaries

The loop matters more than any single judge.

The first job is finding failures

Users do not always click thumbs down.

Sometimes they abandon the session. Sometimes they rephrase the same request four times. Sometimes they correct the agent. Sometimes they ask a reporting agent for help-center advice because the product made the boundary unclear.

Good failure surfacing combines explicit and implicit signals:

user reports
repeated rephrasing
abandoned workflows
tool errors
policy violations
low-confidence retrieval
unusual follow-up patterns

The output is not a score. It is a queue of sessions worth looking at.

That queue is where the system starts learning from reality.
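
As a rough illustration of combining those signals into a queue, here is a minimal sketch. The `Session` fields, the token-overlap heuristic for rephrasing, and the thresholds are all assumptions made up for this example, not a description of any particular system.

```python
from dataclasses import dataclass


@dataclass
class Session:
    # Hypothetical session record; field names are illustrative only.
    session_id: str
    user_messages: list[str]
    thumbs_down: bool = False
    tool_errors: int = 0
    abandoned: bool = False


def count_rephrasings(messages, threshold=0.6):
    """Count consecutive user messages with high word overlap (Jaccard)."""
    count = 0
    for prev, curr in zip(messages, messages[1:]):
        a, b = set(prev.lower().split()), set(curr.lower().split())
        if a and b and len(a & b) / len(a | b) >= threshold:
            count += 1
    return count


def surface_reasons(session):
    """Return the list of signals that make this session worth a look."""
    reasons = []
    if session.thumbs_down:
        reasons.append("user_report")
    if count_rephrasings(session.user_messages) >= 2:
        reasons.append("repeated_rephrasing")
    if session.abandoned:
        reasons.append("abandoned_workflow")
    if session.tool_errors > 0:
        reasons.append("tool_error")
    return reasons


def build_review_queue(sessions):
    """Keep only flagged sessions, most signals first. No score, just a queue."""
    flagged = [(s.session_id, surface_reasons(s)) for s in sessions]
    flagged = [(sid, reasons) for sid, reasons in flagged if reasons]
    return sorted(flagged, key=lambda item: len(item[1]), reverse=True)
```

The reasons travel with each queued session so a reviewer can see why it was surfaced, which is usually more useful than a single blended number.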

Most evaluators should be plain code

If an eval can be deterministic, make it deterministic.

Use code for:

schema validation
SQL parse checks
required fields
chart spec validity
permission boundaries
tool-call constraints
output formatting

Deterministic checks are cheaper, faster, more stable, and easier to debug than another model call.
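
To make that concrete, here is a small sketch of two checks from the list above: required fields and tool-call constraints. The field names and the tool allowlist are hypothetical placeholders; a real system would load them from its own spec.

```python
import json

# Hypothetical spec for a chart-producing agent; adjust to your own schema.
REQUIRED_FIELDS = {"chart_type", "metric", "time_range"}
ALLOWED_TOOLS = {"run_report_query", "render_chart"}


def check_output(raw_output: str) -> list[str]:
    """Return a list of failure codes; an empty list means the output passed."""
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["invalid_json"]
    if not isinstance(payload, dict):
        return ["not_an_object"]
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        return ["missing_fields:" + ",".join(sorted(missing))]
    return []


def check_tool_calls(tool_calls: list[str]) -> list[str]:
    """Flag any tool the agent called that is not on the allowlist."""
    return [f"disallowed_tool:{name}" for name in tool_calls if name not in ALLOWED_TOOLS]
```

Each check returns named failure codes rather than a boolean, so a failing trace says what broke without anyone rereading the transcript.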

LLM judges are useful when the question is genuinely subjective: relevance, faithfulness, tone, helpfulness, completeness. Even then, I prefer narrow judges. One judge, one job, explicit criteria, binary or near-binary output.

When a judge tries to grade five things at once, a bad score stops telling you what broke.
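
A narrow judge can be sketched in a few lines. This one checks faithfulness only and returns a binary verdict. The prompt wording is illustrative, and `call_model` is a stand-in for whatever client you use to reach a model; it is an assumption of this sketch, not a real library function.

```python
from typing import Callable, Optional

# Illustrative prompt: one job, explicit criterion, binary output.
FAITHFULNESS_PROMPT = (
    "You are checking one thing only: is every claim in the ANSWER "
    "supported by the CONTEXT?\n"
    "Reply with exactly PASS or FAIL.\n\n"
    "CONTEXT:\n{context}\n\nANSWER:\n{answer}\n"
)


def judge_faithfulness(
    answer: str,
    context: str,
    call_model: Callable[[str], str],  # hypothetical: takes a prompt, returns text
) -> Optional[bool]:
    """Return True for PASS, False for FAIL, None if the verdict is unparseable."""
    prompt = FAITHFULNESS_PROMPT.format(context=context, answer=answer)
    verdict = call_model(prompt).strip().upper()
    if verdict == "PASS":
        return True
    if verdict == "FAIL":
        return False
    return None  # treat as a judge error, not as a pass or a fail
```

Returning `None` for anything other than a clean verdict keeps judge malfunctions out of the pass rate, where they would otherwise look like product changes.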

OpenAI's eval guidance makes a similar point in a different vocabulary: evaluations are most useful when the task, criteria, and grading signal are clear. Vague evals produce vague confidence.

The review tool is the labeling system

The hardest part is usually not generating evaluations. It's maintaining labels that stay connected to production.

A common mistake is treating annotation as a separate project:

production failures over here
labeling sprint over there

That split creates drift. Examples go stale. The taxonomy stops matching real failures. Reviewers and evaluators stop speaking the same language.

The better pattern is to make review and labeling the same workflow.

When a human reviews a trace, they should be able to mark:

did the agent answer correctly?
did the evaluator catch the issue?
was the route/tool/policy correct?
what failure category applies?
is this a new failure mode?

Those labels become one of the most valuable assets in the system. They let you measure judges, compare model versions, detect drift, and decide whether a product change actually helped.
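
One way to picture the output of that review step is a single record per reviewed trace. The sketch below mirrors the questions above; every field name is an assumption chosen for illustration.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class ReviewLabel:
    # Hypothetical label record; one per human-reviewed trace.
    trace_id: str
    answer_correct: bool
    evaluator_caught_issue: Optional[bool]  # None when there was no issue to catch
    routing_correct: bool                   # route / tool / policy was right
    failure_category: Optional[str]         # None when nothing failed
    is_new_failure_mode: bool
    reviewer: str


def to_record(label: ReviewLabel) -> dict:
    """Flatten the label into a plain dict for storage alongside the trace."""
    return asdict(label)
```

Because the record is keyed by trace, the same label can later be joined against judge outputs or against a different model version replayed on that trace.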

Without labels, evals are mostly intuition with dashboards.

Closed taxonomies age better

Open-ended failure labels feel flexible at first. Then they decay.

One person writes bad_filtering. Another writes filter_issue. Someone else writes query_too_restrictive. Three months later, the dashboard has twenty categories and no one trusts the rollup.

Use a closed taxonomy by default.

Make adding a category possible, but intentional. The friction is useful. It forces the team to decide whether a failure mode is truly new or just a different spelling of an old one.
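
In code, that friction can be as simple as an enum plus an explicit alias table. The categories and aliases below are made-up examples built from the labels mentioned above, not a recommended taxonomy.

```python
from enum import Enum


class FailureCategory(Enum):
    # Hypothetical closed taxonomy; adding a member is a deliberate code change.
    FILTERING = "filtering"
    OUT_OF_SCOPE = "out_of_scope"
    WRONG_TOOL = "wrong_tool"
    UNFAITHFUL = "unfaithful"


# Known alternate spellings are mapped on purpose, not guessed at.
ALIASES = {
    "bad_filtering": FailureCategory.FILTERING,
    "filter_issue": FailureCategory.FILTERING,
    "query_too_restrictive": FailureCategory.FILTERING,
}


def parse_category(raw: str) -> FailureCategory:
    """Map a raw label to the taxonomy, or fail loudly if it is unknown."""
    key = raw.strip().lower()
    if key in ALIASES:
        return ALIASES[key]
    try:
        return FailureCategory(key)
    except ValueError:
        raise ValueError(
            f"unknown failure category {raw!r}; propose it for the taxonomy "
            "instead of inventing a label"
        ) from None
```

The error path is the point: an unknown label stops the write and starts a conversation, instead of quietly becoming the twenty-first category.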

This sounds bureaucratic until you are trying to compare regressions across model versions and realize your labels are not usable.

Evaluators need evals too

Judges and classifiers are production models once they affect routing, alerts, launch decisions, or prioritization.

They need their own measurement:

true positive rate
true negative rate
per-category performance
stability across repeated runs
performance by product surface

They also need versioning. If you change the judge prompt, model, rubric, or examples, that version should travel with every result.

Otherwise, a trend line can move and nobody knows whether the product changed, the model changed, or the ruler changed.
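
Both ideas fit in a short sketch: measure the judge against human labels, and stamp every result with a fingerprint of the judge configuration. The function names and the choice of a truncated SHA-256 hash are assumptions for illustration.

```python
import hashlib
import json


def judge_version(model: str, prompt: str, rubric: str) -> str:
    """Fingerprint the judge config so any change yields a new version id."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "rubric": rubric}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def agreement_rates(pairs):
    """Compute TPR and TNR from (human_says_failure, judge_says_failure) pairs.

    A rate is None when there are no human labels of that class to measure.
    """
    tp = sum(1 for human, judge in pairs if human and judge)
    fn = sum(1 for human, judge in pairs if human and not judge)
    tn = sum(1 for human, judge in pairs if not human and not judge)
    fp = sum(1 for human, judge in pairs if not human and judge)
    return {
        "tpr": tp / (tp + fn) if (tp + fn) else None,
        "tnr": tn / (tn + fp) if (tn + fp) else None,
    }
```

Reporting the two rates separately matters because failures are usually rare; a judge that passes everything can look accurate overall while catching nothing.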

The product boundary is part of the eval

The help-bot failure taught me this the annoying way.

The agent was not only being evaluated on answer quality. It was being evaluated on whether it understood what it should refuse, redirect, or hand off. That is a product-boundary problem, not just a model-quality problem.

For AI systems, evals should cover:

what the system should answer
what it should not answer
when it should ask a clarifying question
when it should route to another surface
when it should refuse to act

If you do not evaluate the boundary, users will.
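
A boundary eval can start as a handful of cases that pair a prompt with the action the system should take. In the sketch below the prompts, the action names, and `observe_action` (a stand-in for running the agent and classifying what it did) are all hypothetical.

```python
from typing import Callable

# Hypothetical boundary cases for a reporting agent; one per expected action.
BOUNDARY_CASES = [
    {"prompt": "Show weekly revenue by region", "expected": "answer"},
    {"prompt": "What stock should I buy?", "expected": "decline"},
    {"prompt": "Show me the numbers", "expected": "clarify"},
    {"prompt": "How do I reset my password?", "expected": "route"},
    {"prompt": "Delete last quarter's report data", "expected": "refuse_action"},
]


def run_boundary_eval(cases, observe_action: Callable[[str], str]) -> dict:
    """Compare the observed action with the expected one for each case."""
    misses = []
    for case in cases:
        actual = observe_action(case["prompt"])
        if actual != case["expected"]:
            misses.append(
                {"prompt": case["prompt"], "expected": case["expected"], "actual": actual}
            )
    total = len(cases)
    return {"total": total, "passed": total - len(misses), "misses": misses}
```

An agent that answers everything would pass exactly one of these five cases, which is roughly what the help-bot failure looked like before anyone measured it.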

Where this lands

A production eval system is not a launch checklist. It is the feedback infrastructure that tells you whether the product still behaves the way you think it does.

The judges matter. The dashboards matter. The deterministic checks matter.

But the labels matter most.

The teams that age well keep human judgment close to real production behavior and turn that judgment into measurement. Everything else is just a nicer way to be surprised.