Inference Glossary

LLM Observability

A replayable audit log of every model call — input, output, model, provider, latency, cost, and quality signals — plus real-time dashboards over the same data.

LLM observability is the practice of capturing every model call in a structured, queryable, replayable log and surfacing aggregate views over the captured data. It is the foundation that every other operational discipline — debugging, evaluation, fine-tune candidate selection, cost attribution, incident response, compliance review — depends on.

The minimum useful schema for an observability record is: input (prompt, messages, tool definitions), output (response, tool calls, finish reason), model name, provider, latency components (TTFT, ITL, total), token counts (input, output, cached), cost in dollars, and trace identifiers tying the call back to the upstream request. Quality signals — judge scores, heuristic results, user thumbs-up-or-down — go in the same record when available.

The "replayable" part matters. The first time an audit reviewer asks "what did the model say to user X on Tuesday at 3:14 PM," a system without observability cannot answer. The first time a regression appears and the team needs to bisect, a system without observability cannot bisect. The first time a fine-tune candidate needs hard examples, a system without observability cannot mine them.

The platform that owns the gateway is the only place observability can be cleanly captured across providers. A per-provider observability tool sees only its provider's slice. A platform-level one sees everything, attributes cost in the same units, and can correlate quality signals across models. Cumulus' Observability subsystem is the data substrate the rest of the platform reads.

Related Terms

LLM Evaluation

The practice of grading model outputs against a target — using a stack of synthetic data, deterministic heuristics, calibrated LLM judges, and shadow evaluation against production traffic.

Shadow Evaluation

Running a candidate model in parallel with the production model on real traffic, serving the production response to users, and grading the candidate's response asynchronously. The cleanest way to evaluate a model swap.

LLM Gateway

An HTTP layer that speaks one normalized protocol — usually OpenAI-compatible — and translates to whatever each downstream provider expects. The seam between application code and the rest of the inference stack.

LLM Router

A component that decides, per request, which model and which provider should serve it — based on declared rules, health checks, latency budget, and cost constraints.