Inference Glossary

LLM Router

A component that decides, per request, which model and which provider should serve it — based on declared rules, health checks, latency budget, and cost constraints.

An LLM router is the dispatch layer of an inference platform. Given an incoming request, the router decides which model and which provider should serve it. The decision is bounded by **declared rules** (specified per workflow), **health** (continuous checks against every provider), **latency budgets**, and **cost constraints**. The router is the place where reliability and economics meet.
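As a rough illustration of how those four bounds might combine, the sketch below walks a declared chain in order and returns the first route that passes every constraint. The `Route` fields, the `pick_route` function, and the model and provider names are all hypothetical and not taken from any particular router.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Route:
    """One (model, provider) pair plus the signals the router checks. Hypothetical shape."""
    model: str
    provider: str
    healthy: bool               # result of the most recent health check
    p95_latency_ms: float       # recently observed latency for this route
    cost_per_1k_tokens: float   # price in arbitrary currency units


def pick_route(chain: Sequence[Route],
               latency_budget_ms: float,
               max_cost_per_1k: float) -> Optional[Route]:
    """Return the first route in declared order that satisfies every constraint."""
    for route in chain:
        if not route.healthy:
            continue  # health check failed
        if route.p95_latency_ms > latency_budget_ms:
            continue  # would breach the latency budget
        if route.cost_per_1k_tokens > max_cost_per_1k:
            continue  # would breach the cost constraint
        return route
    return None  # no route qualifies; the caller decides how to fail


# Declared rule for one workflow: a primary followed by two fallbacks.
chain = [
    Route("model-a", "provider-x", healthy=False, p95_latency_ms=400, cost_per_1k_tokens=0.5),
    Route("model-a", "provider-y", healthy=True, p95_latency_ms=900, cost_per_1k_tokens=0.4),
    Route("model-b", "provider-z", healthy=True, p95_latency_ms=350, cost_per_1k_tokens=0.2),
]
chosen = pick_route(chain, latency_budget_ms=500, max_cost_per_1k=1.0)
```

Here the primary is skipped because it is unhealthy and the first fallback is skipped because it exceeds the latency budget, so `chosen` is the `provider-z` route.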

The alternative to a router is hand-written if-else logic scattered across application code. That alternative breaks in three predictable ways: failover paths are never tested, so they fail when an incident actually happens; routing logic is decentralized, so different services hold different definitions of "healthy"; and the system does not learn, so the hardcoded rules go stale as provider behavior shifts.

A good router is **deterministic** (the same request with the same rule version always picks the same path), **traceable** (every dispatch decision is in the audit log), and **fast** (the router itself adds less than a millisecond to request latency). It supports primary-plus-fallback chains, weighted splits for canary deployments, and per-workflow overrides.
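Determinism and weighted splits can coexist if the split is driven by a hash rather than a random number. The sketch below is one possible approach, assuming each request carries a stable identifier: it hashes the rule version together with the request ID and maps the result onto the cumulative weights. The function name, the target names, and the choice of SHA-256 are illustrative assumptions.

```python
import hashlib


def weighted_pick(request_id: str, rule_version: str, weights: dict[str, float]) -> str:
    """Pick a target by weight; the same inputs always yield the same target."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")

    # Hash the rule version with the request ID so a new rule version reshuffles traffic.
    digest = hashlib.sha256(f"{rule_version}:{request_id}".encode()).digest()
    # Map the first 8 bytes of the digest to a point in [0, total).
    point = int.from_bytes(digest[:8], "big") / 2**64 * total

    cumulative = 0.0
    for target, weight in sorted(weights.items()):  # sorted for a stable order
        cumulative += weight
        if point < cumulative:
            return target
    return sorted(weights)[-1]  # guard against floating-point rounding at the edge


# Hypothetical canary: send roughly 5% of traffic to the candidate.
weights = {"canary": 0.05, "stable": 0.95}
picks = [weighted_pick(f"req-{i}", "v1", weights) for i in range(10_000)]
canary_share = picks.count("canary") / len(picks)
```

Because the pick depends only on the request ID, the rule version, and the weights, replaying a request from the audit log reproduces the original dispatch decision.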

The most underrated benefit of a router is latency. Once routing is centralized, the platform can cut tail latency with hedged requests: if the primary breaches its latency budget, the secondary fires speculatively, and the first response wins. P99 latency drops in ways that hand-written failover cannot match. Cumulus' Router is one of the eight subsystems behind the platform.
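The speculative secondary described above can be sketched with `asyncio`. This is a minimal illustration rather than a production implementation: `hedged_call` and `fake_provider` are hypothetical names, the providers are simulated with sleeps, and error handling (for example, falling back when the winner raises) is deliberately left out.

```python
import asyncio


async def hedged_call(primary, secondary, budget_s: float):
    """Start primary; if it is still running after budget_s, also start secondary.

    Both arguments are zero-argument callables returning coroutines.
    The first call to finish wins and the slower one is cancelled.
    """
    first = asyncio.create_task(primary())
    done, _ = await asyncio.wait({first}, timeout=budget_s)
    if done:
        return first.result()  # primary finished inside the budget

    # Budget breached: fire the secondary and race the two calls.
    second = asyncio.create_task(secondary())
    done, pending = await asyncio.wait(
        {first, second}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # stop paying for the slower call
    winner = first if first in done else second
    return winner.result()


async def fake_provider(name: str, delay_s: float) -> str:
    """Stand-in for a real provider call; sleeps, then returns its own name."""
    await asyncio.sleep(delay_s)
    return name


async def demo() -> str:
    return await hedged_call(
        lambda: fake_provider("primary", 0.5),     # slow primary
        lambda: fake_provider("secondary", 0.05),  # fast secondary
        budget_s=0.1,
    )


result = asyncio.run(demo())
```

In this run the primary is still pending when the 100 ms budget expires, so the secondary starts and returns first. The trade-off is extra provider spend on the requests that get hedged, which is why the budget is usually set near a high latency percentile rather than the median.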

Related Terms

LLM Gateway

An HTTP layer that speaks one normalized protocol — usually OpenAI-compatible — and translates to whatever each downstream provider expects. The seam between application code and the rest of the inference stack.

LLM Observability

A replayable audit log of every model call — input, output, model, provider, latency, cost, and quality signals — plus real-time dashboards over the same data.

Shadow Evaluation

Running a candidate model in parallel with the production model on real traffic, serving the production response to users, and grading the candidate's response asynchronously. The cleanest way to evaluate a model swap.

Inference

Running a trained model on new inputs to produce outputs — predictions, classifications, generations, embeddings. The production half of machine learning.