Inference Glossary

LLM Router

A component that decides, per request, which model and which provider should serve it — based on declared rules, health checks, latency budget, and cost constraints.

An LLM router is the dispatch layer of an inference platform. Given an incoming request, the router decides which model and which provider should serve it. The decision is bounded by **declared rules** (specified per workflow), **health** (continuous checks against every provider), **latency budgets**, and **cost constraints**. The router is the place where reliability and economics meet.
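As a rough illustration of how those four bounds can combine, the sketch below filters candidates by rule, health, latency budget, and cost, then picks the cheapest survivor. Every name here (`Candidate`, `Rule`, `route`, the field names) is hypothetical and not taken from any particular platform, and the "cheapest eligible wins" policy is only one possible choice.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    """A (model, provider) pair plus the signals the router looks at (hypothetical shape)."""
    model: str
    provider: str
    healthy: bool               # outcome of the most recent health check
    p95_latency_ms: float       # recently observed latency for this pair
    cost_per_1k_tokens: float


@dataclass(frozen=True)
class Rule:
    """Declared per-workflow rule (hypothetical shape)."""
    allowed_models: tuple[str, ...]
    latency_budget_ms: float
    max_cost_per_1k_tokens: float


def route(rule: Rule, candidates: list[Candidate]) -> Optional[Candidate]:
    """Return the cheapest candidate that satisfies every bound, or None."""
    eligible = [
        c for c in candidates
        if c.model in rule.allowed_models
        and c.healthy
        and c.p95_latency_ms <= rule.latency_budget_ms
        and c.cost_per_1k_tokens <= rule.max_cost_per_1k_tokens
    ]
    if not eligible:
        return None
    # Tie-break on latency, then on names, so the choice does not depend on input order.
    return min(
        eligible,
        key=lambda c: (c.cost_per_1k_tokens, c.p95_latency_ms, c.provider, c.model),
    )
```

Returning `None` rather than raising leaves the "nothing eligible" policy (reject, queue, or relax a bound) to the caller.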

The alternative to a router is hand-written if-else logic scattered across application code. That alternative breaks in three predictable ways. First, failover paths are rarely exercised, so they tend to fail when an incident actually happens. Second, routing logic is decentralized, so different services end up with different definitions of "healthy." Third, the system does not learn: provider behavior shifts and the hardcoded rules go stale.

A good router is **deterministic** (the same request with the same rule version always picks the same path), **traceable** (every dispatch decision is in the audit log), and **fast** (the router itself adds less than a millisecond to request latency). It supports primary-plus-fallback chains, weighted splits for canary deployments, and per-workflow overrides.
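One way to get determinism together with weighted splits is to derive the split from a hash of the request and the rule version instead of a random draw. The sketch below does that, walks a primary-plus-fallback chain, and records each decision. The names (`Route`, `pick_split`, `dispatch`, the log fields) are hypothetical, and it assumes determinism is defined relative to a given health snapshot.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    name: str
    weight: int               # relative traffic share; must be positive
    chain: tuple[str, ...]    # providers in order: primary first, then fallbacks


def pick_split(request_id: str, rule_version: str, routes: list[Route]) -> Route:
    """Map (request_id, rule_version) to a route, proportionally to the weights."""
    digest = hashlib.sha256(f"{rule_version}:{request_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % sum(r.weight for r in routes)
    for r in routes:
        if bucket < r.weight:
            return r
        bucket -= r.weight
    raise AssertionError("unreachable when all weights are positive")


def dispatch(
    request_id: str,
    rule_version: str,
    routes: list[Route],
    healthy: set[str],
    audit_log: list[dict],
) -> Optional[str]:
    """Pick a route, take the first healthy provider in its chain, and log the decision."""
    chosen = pick_split(request_id, rule_version, routes)
    target = next((p for p in chosen.chain if p in healthy), None)
    audit_log.append({
        "request_id": request_id,
        "rule_version": rule_version,
        "route": chosen.name,
        "target": target,     # None means the whole chain was unhealthy
    })
    return target
```

With a 95/5 weighting between a stable route and a canary route, the same `request_id` and rule version always land on the same side of the split, so a logged decision can be replayed later.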

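The most underrated benefit of a router is latency. Once routing is centralized, the platform can cut tail latency with speculative dispatch: if the primary breaches its latency budget, the router fires the secondary without cancelling the primary, and the first response wins. P99 then tracks the faster of the two paths rather than the primary's worst case, which hand-written failover, typically triggered only by an error or a hard timeout, cannot match. Cumulus' Router is one of the eight subsystems behind the platform.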
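The speculative pattern described above can be sketched with `asyncio`. This is a minimal illustration, not Cumulus' implementation: `hedged_call` and `make_provider` are hypothetical names, and the providers are simulated with `asyncio.sleep` rather than real network calls.

```python
import asyncio
from typing import Awaitable, Callable


def make_provider(name: str, delay_s: float) -> Callable[[], Awaitable[str]]:
    """Build a fake provider call that answers with its name after delay_s seconds."""
    async def call() -> str:
        await asyncio.sleep(delay_s)
        return name
    return call


async def hedged_call(
    primary: Callable[[], Awaitable[str]],
    secondary: Callable[[], Awaitable[str]],
    budget_s: float,
) -> str:
    """Start the primary; if it misses the budget, race it against the secondary."""
    primary_task = asyncio.create_task(primary())
    done, _ = await asyncio.wait({primary_task}, timeout=budget_s)
    if done:
        return primary_task.result()

    # Budget breached: fire the secondary but keep the primary running.
    secondary_task = asyncio.create_task(secondary())
    done, pending = await asyncio.wait(
        {primary_task, secondary_task},
        return_when=asyncio.FIRST_COMPLETED,
    )
    for task in pending:
        task.cancel()  # the loser's result is no longer needed

    # Prefer the primary if both finished together. Simplification: an exception
    # from the winner propagates instead of falling back to the other task.
    winner = primary_task if primary_task in done else secondary_task
    return winner.result()
```

For example, `asyncio.run(hedged_call(make_provider("primary", 0.5), make_provider("secondary", 0.01), 0.05))` returns the secondary's answer after roughly 60 ms instead of waiting 500 ms for the primary. The trade-off is extra load: each hedged request costs a second provider call, so the budget is usually set high enough that only the slow tail triggers it.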