Inference
Running a trained model on new inputs to produce outputs — predictions, classifications, generations, embeddings. The production half of machine learning.
Inference is the act of executing a trained model on new data to produce outputs. It is what happens every time a user sends a prompt to a language model, asks an image generator to render a scene, or calls a classifier on a document. While training updates a model's parameters, inference holds them fixed and uses them to compute predictions for new inputs.
Production inference has different characteristics from training. It is latency-sensitive rather than purely throughput-oriented, it processes single inputs or small batches rather than large minibatches, and it must handle variable, unpredictable traffic. Language-model inference is also autoregressive: each generated token depends on all the tokens that came before it, so the tokens of a single response must be produced one at a time, which constrains how the work can be parallelized.
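The sequential dependency in autoregressive generation can be sketched as a loop. This is a minimal illustration, not a real model: `toy_next_token` is a hypothetical stand-in for a forward pass, chosen only so that the next token depends on the entire context.

```python
from typing import Callable, List


def toy_next_token(context: List[int]) -> int:
    """Hypothetical stand-in for a model forward pass.

    The result depends on every token in the context, which is the
    property that makes generation sequential.
    """
    return (sum(context) * 3 + len(context)) % 10


def generate(
    prompt: List[int],
    max_new_tokens: int,
    next_token: Callable[[List[int]], int] = toy_next_token,
) -> List[int]:
    """Greedy autoregressive decoding: append one token per step."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        # Step t cannot begin until step t-1 has produced its token,
        # so the steps of a single response cannot run in parallel.
        tokens.append(next_token(tokens))
    return tokens
```

Different requests can still be batched together at each step, but within a single response the steps run strictly in order.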
The economics of inference dominate the cost of running an AI product. A model that takes weeks of GPU time to train can go on to serve billions of inference requests, and over the lifetime of a deployment the cumulative inference bill can grow far larger than the one-time training bill. This is why engineering effort spent on inference optimization (quantization, KV caching, custom attention kernels, routing) pays back so quickly.
An **inference platform** consolidates the work of serving inference at scale: the gateway that accepts requests, the router that picks a model and provider, the cache that avoids redundant work, the runtime that executes the forward pass on a GPU, the observability layer that logs every call, and the evaluation system that grades quality. Owning all of those pieces in one place is what allows a platform to optimize them against each other.
Related Terms
vLLM
An open-source LLM serving engine that introduced PagedAttention. The de facto baseline for high-throughput open-weight model serving.
KV Cache
A buffer that stores previously computed key and value tensors during autoregressive generation, so each new token only computes its own key, value, and query and attends over the cached entries, instead of recomputing keys and values for the whole sequence.
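The saving from a KV cache can be illustrated with a toy single-head attention in plain Python. This is a sketch under simplifying assumptions: `project_kv` is a hypothetical fixed function standing in for learned key/value projections, and the raw input vector doubles as the query.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def project_kv(x: Vector, stats: Dict[str, int]) -> Tuple[Vector, Vector]:
    """Toy key/value projection (assumption: real models use learned matrices)."""
    stats["kv_projections"] += 1
    return [2.0 * xi for xi in x], [xi + 1.0 for xi in x]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Scaled dot-product attention of one query over all keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) * scale for key in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


def decode_without_cache(inputs: List[Vector], stats: Dict[str, int]) -> List[Vector]:
    outputs = []
    for t in range(len(inputs)):
        # Recompute keys and values for positions 0..t at every step.
        kv = [project_kv(x, stats) for x in inputs[: t + 1]]
        outputs.append(attend(inputs[t], [k for k, _ in kv], [v for _, v in kv]))
    return outputs


def decode_with_cache(inputs: List[Vector], stats: Dict[str, int]) -> List[Vector]:
    keys: List[Vector] = []
    values: List[Vector] = []
    outputs = []
    for x in inputs:
        # Project only the new position and append it to the cache.
        k, v = project_kv(x, stats)
        keys.append(k)
        values.append(v)
        outputs.append(attend(x, keys, values))
    return outputs
```

Both functions produce the same outputs. For a sequence of n positions, the uncached version performs n(n+1)/2 key/value projections while the cached version performs n, at the cost of memory that grows with sequence length.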
Inference Latency
The end-to-end time from request arrival to response delivery. For LLMs, decomposed into time-to-first-token (TTFT) and inter-token latency (ITL).
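As a rough sketch of this decomposition, TTFT and ITL can be derived from per-token delivery timestamps. The `LatencyBreakdown` type and `latency_breakdown` function are hypothetical names for illustration; timestamps are assumed to be in seconds on a single monotonic clock.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LatencyBreakdown:
    ttft: float      # time from request arrival to the first token
    mean_itl: float  # average gap between consecutive tokens
    total: float     # time from request arrival to the last token


def latency_breakdown(request_arrival: float, token_times: List[float]) -> LatencyBreakdown:
    """Split end-to-end latency into TTFT and mean inter-token latency."""
    if not token_times:
        raise ValueError("at least one token timestamp is required")
    ttft = token_times[0] - request_arrival
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    total = token_times[-1] - request_arrival
    return LatencyBreakdown(ttft=ttft, mean_itl=mean_itl, total=total)
```

Under these definitions the end-to-end latency equals TTFT plus the sum of the inter-token gaps, so a long response can feel responsive as long as TTFT is short and ITL stays steady.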
LLM Router
A component that decides, per request, which model and which provider should serve it — based on declared rules, health checks, latency budget, and cost constraints.
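A per-request routing decision of this kind might be sketched as a filter followed by a cost-based choice. The `Route` and `Request` types, their field names, and the tie-breaking policy are all assumptions made for illustration, not the behavior of any particular router.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Route:
    model: str
    provider: str
    healthy: bool               # result of the latest health check
    p95_latency_ms: float       # observed latency for this route
    cost_per_1k_tokens: float


@dataclass(frozen=True)
class Request:
    allowed_models: Tuple[str, ...]  # declared rule: models this caller may use
    latency_budget_ms: float
    max_cost_per_1k_tokens: float


def pick_route(request: Request, routes: List[Route]) -> Optional[Route]:
    """Return the cheapest healthy route that satisfies every constraint."""
    candidates = [
        r for r in routes
        if r.model in request.allowed_models
        and r.healthy
        and r.p95_latency_ms <= request.latency_budget_ms
        and r.cost_per_1k_tokens <= request.max_cost_per_1k_tokens
    ]
    if not candidates:
        return None  # caller decides whether to reject or relax constraints
    # Assumed policy: minimize cost, break ties on latency.
    return min(candidates, key=lambda r: (r.cost_per_1k_tokens, r.p95_latency_ms))
```

Filtering before ranking keeps the hard constraints (rules, health, latency budget) separate from the preference (cost), so the preference can change without weakening the guarantees.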
Ion
Cumulus' inference engine. A custom runtime with proprietary attention kernels that serves 30 to 50% more tokens per second than vLLM and SGLang on NVIDIA Grace and Blackwell.