Inference Glossary

Inference

Running a trained model on new inputs to produce outputs — predictions, classifications, generations, embeddings. The production half of machine learning.

Inference is the act of executing a trained model on new data to produce outputs. It is what happens every time a user sends a prompt to a language model, asks an image generator to render a scene, or calls a classifier on a document. While training updates a model's parameters, inference holds them fixed and uses them to compute predictions for new inputs.
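The distinction between updating parameters and holding them fixed can be sketched with a toy one-feature linear model. Everything here (`LinearModel`, `train_step`, the data points, the learning rate) is a hypothetical illustration, not any particular framework's API: training mutates the weights, while inference only reads them.

```python
from dataclasses import dataclass


@dataclass
class LinearModel:
    weight: float
    bias: float

    def predict(self, x: float) -> float:
        # Inference: a forward pass that reads the parameters and never writes them.
        return self.weight * x + self.bias


def train_step(model: LinearModel, x: float, y: float, lr: float = 0.1) -> None:
    # Training: one gradient-descent step on squared error, which mutates the parameters.
    error = model.predict(x) - y
    model.weight -= lr * 2 * error * x
    model.bias -= lr * 2 * error


# Fit the toy model to two points that lie on y = 2x + 1 (illustrative data).
model = LinearModel(weight=0.0, bias=0.0)
for _ in range(500):
    train_step(model, x=1.0, y=3.0)
    train_step(model, x=2.0, y=5.0)

# Serve new inputs: the parameters are identical before and after.
params_before = (model.weight, model.bias)
outputs = [model.predict(x) for x in (3.0, 4.0)]
params_after = (model.weight, model.bias)
```

In a real framework the same separation usually shows up as a dedicated evaluation or no-gradient mode rather than a separate class.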

Production inference differs from training in several ways. It is typically latency-sensitive rather than throughput-oriented, it processes small batches or single inputs rather than huge minibatches, and it must handle variable, unpredictable traffic patterns. Language-model inference is also autoregressive: each generated token depends on all the tokens that came before, so the tokens of a single response must be generated one at a time, which constrains how the work can be parallelized.
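The sequential dependency in autoregressive generation can be seen in a minimal decoding loop. This is a sketch under loud assumptions: `next_token` and the `BIGRAMS` table are hypothetical stand-ins for a model forward pass, and the toy only looks at the last token, whereas a real language model conditions on the whole sequence.

```python
from typing import Callable, List

# Hypothetical lookup table standing in for a trained model.
BIGRAMS = {"<s>": "the", "the": "cat", "cat": "sat", "sat": "</s>"}


def next_token(tokens: List[str]) -> str:
    # Toy "forward pass": receives the full sequence so far, uses only its last token.
    return BIGRAMS.get(tokens[-1], "</s>")


def generate(
    prompt: List[str],
    step: Callable[[List[str]], str],
    max_new_tokens: int = 16,
) -> List[str]:
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tok = step(tokens)  # iteration t cannot start until iteration t-1 has produced its token
        tokens.append(tok)
        if tok == "</s>":
            break
    return tokens


result = generate(["<s>"], next_token)
# result == ["<s>", "the", "cat", "sat", "</s>"]
```

Because each loop iteration consumes the previous iteration's output, the loop itself cannot be parallelized across steps; serving systems instead parallelize across independent requests or within a single step.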

Inference often dominates the cost of running an AI product. A model that takes weeks of GPU time to train can go on to serve billions of inference requests, and across the lifetime of a heavily used deployment the inference bill can far exceed the training bill. This is why engineering effort spent on inference optimization (quantization, KV caching, custom attention kernels, routing) pays back so quickly.
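The comparison between training and inference spend comes down to simple arithmetic, sketched below. Every figure is a made-up placeholder chosen only to show the calculation, and the function names are hypothetical; the actual ratio depends entirely on traffic, per-request cost, and how long the deployment lives.

```python
def breakeven_requests(training_cost: float, cost_per_request: float) -> float:
    """Requests at which cumulative inference spend equals the one-off training spend."""
    if cost_per_request <= 0:
        raise ValueError("cost_per_request must be positive")
    return training_cost / cost_per_request


def lifetime_inference_cost(
    requests_per_day: float, cost_per_request: float, days: int
) -> float:
    """Total inference spend over a deployment, assuming constant traffic and unit cost."""
    return requests_per_day * cost_per_request * days


# Hypothetical figures for illustration only; none come from a real deployment.
TRAINING_COST = 1_000_000.0    # dollars, one-off
COST_PER_REQUEST = 0.002       # dollars
REQUESTS_PER_DAY = 5_000_000
DAYS = 365

breakeven = breakeven_requests(TRAINING_COST, COST_PER_REQUEST)
lifetime = lifetime_inference_cost(REQUESTS_PER_DAY, COST_PER_REQUEST, DAYS)
ratio = lifetime / TRAINING_COST
```

The same functions also show why per-request optimizations compound: halving `COST_PER_REQUEST` halves `lifetime`, and that saving recurs on every request for as long as the deployment runs.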

An **inference platform** consolidates the work of serving inference at scale: the gateway that accepts requests, the router that picks a model and provider, the cache that avoids redundant work, the runtime that executes the forward pass on a GPU, the observability layer that logs every call, and the evaluation system that grades quality. Owning all of those pieces in one place is what allows a platform to optimize them against each other.
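How those pieces fit together on a single request path can be sketched as follows. This is a toy under stated assumptions: the `Platform` class, its length-based routing rule, the in-memory dictionary cache, and the string-returning "runtimes" are all hypothetical, and the evaluation system is left out for brevity.

```python
import hashlib
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Platform:
    # Hypothetical backends: model name -> callable standing in for a GPU runtime.
    runtimes: Dict[str, Callable[[str], str]]
    cache: Dict[str, str] = field(default_factory=dict)
    log: List[dict] = field(default_factory=list)

    def route(self, prompt: str) -> str:
        # Router: toy policy that sends short prompts to the small model.
        return "small" if len(prompt) < 40 else "large"

    def handle(self, prompt: str) -> str:
        # Gateway: the single entry point for a request.
        start = time.perf_counter()
        model = self.route(prompt)
        key = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
        cache_hit = key in self.cache
        if not cache_hit:
            # Runtime: only invoked when the cache has no answer.
            self.cache[key] = self.runtimes[model](prompt)
        output = self.cache[key]
        # Observability: record one log entry per call.
        self.log.append(
            {"model": model, "cache_hit": cache_hit,
             "latency_s": time.perf_counter() - start}
        )
        return output


calls = {"small": 0, "large": 0}


def make_runtime(name: str) -> Callable[[str], str]:
    def run(prompt: str) -> str:
        calls[name] += 1  # count real executions to show the cache working
        return f"[{name}] {prompt.upper()}"
    return run


platform = Platform(runtimes={"small": make_runtime("small"),
                              "large": make_runtime("large")})
first = platform.handle("hello")
second = platform.handle("hello")  # served from cache; the runtime is not called again
```

Because the router, cache, and log live in one object, the log can report cache hits per model, which is the kind of cross-component signal that is harder to get when the pieces are owned separately.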