
LLM Evaluation

The practice of grading model outputs against a target — using a stack of synthetic data, deterministic heuristics, calibrated LLM judges, and shadow evaluation against production traffic.

LLM evaluation is the discipline of producing a defensible answer to "is this model better, worse, or equivalent for this workflow than the alternative?" The naive approach, three LLM judges and a majority vote, fails often enough that teams stop trusting its results. A real evaluation stack uses four layers, weighted differently per workflow.

**Synthetic data** generates test inputs that target the edge cases that rarely appear in live traffic. A platform that owns observability can mine real traffic for low-coverage clusters and seed synthetic inputs from them, which is more useful than hand-authored test sets that drift away from production.
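As a rough illustration of mining for low-coverage clusters, the sketch below assumes each logged request already carries a cluster label (how those labels are produced is out of scope here). The function names, the 5% share threshold, and the example labels are all hypothetical, not part of any particular platform.

```python
from collections import Counter


def low_coverage_clusters(cluster_labels: list[str], max_share: float = 0.05) -> list[str]:
    """Return cluster labels whose share of traffic is at or below max_share."""
    total = len(cluster_labels)
    if total == 0:
        return []
    counts = Counter(cluster_labels)
    return sorted(label for label, n in counts.items() if n / total <= max_share)


def pick_seeds(requests: list[tuple[str, str]], targets: list[str],
               per_cluster: int = 2) -> dict[str, list[str]]:
    """Collect up to per_cluster real request texts from each target cluster.

    requests is a list of (cluster_label, request_text) pairs.
    """
    seeds: dict[str, list[str]] = {label: [] for label in targets}
    for label, text in requests:
        if label in seeds and len(seeds[label]) < per_cluster:
            seeds[label].append(text)
    return seeds


# Hypothetical traffic: two clusters dominate, two are rare.
labels = ["billing"] * 60 + ["search"] * 37 + ["date_math"] * 2 + ["nested_tool_call"]
rare = low_coverage_clusters(labels)
```

The seeds returned by `pick_seeds` would then be handed to whatever generator produces synthetic variations; that step is deliberately left out.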

**Deterministic heuristics** are cheap, fast checks that involve no ML. Is the JSON valid? Does the tool call match the schema? Did the model refuse a request that should not be refused? Did the response stay under the latency budget? Heuristics catch the regressions that judges are bad at, and they cost almost nothing to run.
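A minimal sketch of three of these checks, using only the standard library. The tool schema, the 2000 ms budget, and the function names are assumptions chosen for illustration; a real stack would more likely validate against a full JSON Schema.

```python
import json

# Hypothetical tool schema: required argument names mapped to expected types.
WEATHER_TOOL_SCHEMA = {"city": str, "days": int}


def is_valid_json(text: str) -> bool:
    """Return True if text parses as JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


def matches_schema(arguments: dict, schema: dict) -> bool:
    """Check that arguments has exactly the schema's keys with the expected types."""
    if set(arguments) != set(schema):
        return False
    return all(isinstance(arguments[name], expected) for name, expected in schema.items())


def run_checks(raw_output: str, latency_ms: float, budget_ms: float = 2000.0) -> dict[str, bool]:
    """Run every check and return a mapping of check name to pass/fail."""
    valid = is_valid_json(raw_output)
    parsed = json.loads(raw_output) if valid else None
    return {
        "valid_json": valid,
        "schema_match": isinstance(parsed, dict) and matches_schema(parsed, WEATHER_TOOL_SCHEMA),
        "within_latency_budget": latency_ms <= budget_ms,
    }


report = run_checks('{"city": "Oslo", "days": 3}', latency_ms=850.0)
```

Returning one boolean per check, rather than a single pass/fail, keeps it visible which heuristic a regression tripped.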

**LLM judges with explicit rubrics** are the next layer. The prompt is not "is this answer good" but "score 1 to 5 on (a) factual accuracy against the cited source, (b) instruction following, (c) tone match to the example set." Use multiple judges, flag divergence between them for human review, and calibrate each judge against a small set of human labels before trusting it.
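The aggregation side of this layer can be sketched without calling any model. The example below assumes each judge has already returned a 1 to 5 score per rubric criterion. The judge names, the spread threshold of 1, and the within-one-point agreement measure are illustrative choices, not a prescribed calibration method.

```python
from statistics import mean


def aggregate_judges(scores_by_judge: dict[str, dict[str, int]], max_spread: int = 1) -> dict:
    """Average each criterion across judges and flag criteria where judges diverge."""
    criteria = next(iter(scores_by_judge.values())).keys()
    result = {}
    for criterion in criteria:
        scores = [judge_scores[criterion] for judge_scores in scores_by_judge.values()]
        spread = max(scores) - min(scores)
        result[criterion] = {
            "mean": mean(scores),
            "needs_human_review": spread > max_spread,
        }
    return result


def agreement_with_humans(judge_scores: list[int], human_scores: list[int],
                          tolerance: int = 1) -> float:
    """Fraction of items where the judge lands within tolerance of the human label."""
    if not judge_scores or len(judge_scores) != len(human_scores):
        raise ValueError("need two non-empty score lists of equal length")
    hits = sum(abs(j - h) <= tolerance for j, h in zip(judge_scores, human_scores))
    return hits / len(judge_scores)


# Hypothetical scores from three judges on one response.
scores = {
    "judge_a": {"factual_accuracy": 5, "instruction_following": 4, "tone_match": 3},
    "judge_b": {"factual_accuracy": 5, "instruction_following": 4, "tone_match": 4},
    "judge_c": {"factual_accuracy": 2, "instruction_following": 4, "tone_match": 3},
}
summary = aggregate_judges(scores)
```

Here the judges disagree by three points on `factual_accuracy`, so that criterion is routed to a human instead of being averaged away. A judge whose `agreement_with_humans` on the labeled set falls below a chosen bar would not be trusted yet.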

**Shadow evaluation against production traffic** is the final layer and the most important. The candidate model runs in parallel with the production one on real requests. The production response is served to the user; the candidate's response is logged and graded by the stack above. After enough volume, the dashboard shows whether the candidate is safe to promote, with confidence intervals.
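One way to put a confidence interval on a shadow run is to treat each graded request as a win or loss for the candidate and compute a Wilson score interval on the win rate. This is a sketch under assumptions: ties are dropped before counting, and the 500-request minimum and 0.5 threshold are hypothetical values, not recommendations.

```python
import math


def wilson_interval(wins: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = wins / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


def promotion_decision(wins: int, trials: int, min_trials: int = 500,
                       threshold: float = 0.5) -> str:
    """Decide based on whether the whole interval clears the threshold."""
    if trials < min_trials:
        return "keep collecting"
    low, high = wilson_interval(wins, trials)
    if low > threshold:
        return "promote"
    if high < threshold:
        return "reject"
    return "inconclusive"


# Hypothetical shadow run: candidate graded better on 600 of 1000 non-tied requests.
decision = promotion_decision(wins=600, trials=1000)
```

Requiring the lower bound, not the point estimate, to clear the threshold is what keeps a lucky streak on low volume from triggering a promotion.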

Cumulus' Evaluation subsystem implements all four layers and integrates with the Router so promotions and demotions are configuration changes, not deploys.