Shadow Evaluation
Running a candidate model in parallel with the production model on real traffic, serving the production response to users, and grading the candidate's response asynchronously. The cleanest way to evaluate a model swap.
Shadow evaluation runs two model calls for each shadowed request: one to the production model, whose response is returned to the user, and one to a candidate model, whose response is logged and graded but never returned. Over enough requests, the grades aggregate into a dataset that shows whether the candidate is better than, equivalent to, or worse than production for this workflow.
The reason shadow evaluation wins over offline benchmarks is that it tests on the workload that actually matters. Hand-authored test sets drift from production. Public benchmarks may have nothing to do with the workflow. Synthetic data is useful but biased toward the generator's idea of what hard inputs look like. Live traffic is the only source of truth.
The mechanical requirement is that the inference platform can fan a request out to two models, return the production response to the user without blocking on the candidate, and score the candidate's response asynchronously. The cost overhead is bounded by the fraction of traffic being shadowed (typically 1 to 10%) and the per-request cost of the candidate model. The complexity overhead is concentrated in the scoring stack: rubrics derived from synthetic data, deterministic heuristics, and LLM judges, the same stack used for offline evaluation.
The output of a shadow evaluation campaign is a recommendation: promote the candidate, keep it in shadow, or reject it. Inference platforms that integrate shadow evaluation with routing make the promotion itself a configuration change: flip the rule and the candidate becomes production. Cumulus is one such platform.
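One way to turn aggregated grades into that recommendation is sketched below. It assumes each graded request has been reduced to a pairwise outcome for the candidate ("win", "tie", or "loss"), and it uses a Wilson score interval on the win rate among decisive comparisons. The function names, the 0.5 threshold, and the minimum sample size are illustrative assumptions, not a prescribed policy.

```python
import math
from collections import Counter


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    # Wilson score interval for a binomial proportion (z = 1.96 is about 95%).
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (center - half, center + half)


def recommend(outcomes: list[str], min_decisive: int = 100) -> str:
    # outcomes: one of "win", "tie", "loss" per graded request, from the
    # candidate's point of view. Ties are ignored for the win-rate estimate.
    counts = Counter(outcomes)
    wins, losses = counts["win"], counts["loss"]
    decisive = wins + losses
    if decisive < min_decisive:
        return "keep_in_shadow"  # not enough evidence yet (assumed cutoff)
    low, high = wilson_interval(wins, decisive)
    if low > 0.5:
        return "promote"         # candidate wins more often than it loses
    if high < 0.5:
        return "reject"          # candidate loses more often than it wins
    return "keep_in_shadow"      # interval still straddles 0.5
```

A rule like this deliberately ignores cost and latency. If the candidate is mostly tied with production but cheaper or faster, a team might reasonably promote it anyway, so the output is best read as one input to the decision.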
Related Terms
LLM Evaluation
The practice of grading model outputs against a target — using a stack of synthetic data, deterministic heuristics, calibrated LLM judges, and shadow evaluation against production traffic.
LLM Observability
A replayable audit log of every model call — input, output, model, provider, latency, cost, and quality signals — plus real-time dashboards over the same data.
LLM Router
A component that decides, per request, which model and which provider should serve it — based on declared rules, health checks, latency budget, and cost constraints.
Fine-Tuning
Adapting a pre-trained model to a specific task by continuing training on a smaller task-specific dataset. Often done with parameter-efficient methods like LoRA, which freeze the original weights and train a small set of added low-rank parameters, typically under 1% of the model's parameter count.