Batch Inference
The technique of grouping multiple inference requests together and processing them simultaneously on a GPU to maximize throughput and hardware utilization.
Batch inference is the practice of collecting multiple input requests and processing them together in a single forward pass through a neural network. Instead of running one input at a time through the model, the GPU processes a batch of inputs simultaneously, leveraging its parallel architecture to achieve much higher throughput than sequential processing.
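The idea can be sketched with a toy one-layer model: the same forward function handles a single input or a whole batch, and the batched call produces identical results in one pass. The model, dimensions, and weights here are illustrative, not from any particular framework.

```python
import numpy as np

# Toy "model": one linear layer plus ReLU (illustrative dimensions).
rng = np.random.default_rng(0)
weights = rng.standard_normal((512, 256))

def forward(x):
    # Works for a single input of shape (512,) or a batch of shape (B, 512).
    return np.maximum(x @ weights, 0.0)

inputs = rng.standard_normal((32, 512))

# Sequential: 32 separate forward passes, one input at a time.
sequential = np.stack([forward(x) for x in inputs])

# Batched: the same 32 inputs in a single forward pass.
batched = forward(inputs)

# Same outputs either way; the batched call lets the GPU (or here, the
# BLAS library) parallelize across the whole batch at once.
assert np.allclose(sequential, batched)
```

On a GPU the batched call maps to one large matrix multiply, which is exactly the shape of work the hardware is built to parallelize.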
The efficiency gains from batching stem from GPU architecture. Modern GPUs have thousands of cores that operate in parallel, but many inference operations do not fully utilize all available cores when processing a single input. By batching multiple inputs, the GPU can spread work across more cores and amortize fixed overhead costs like memory transfers and kernel launch latency. A batch of 32 inputs might complete in only 2-3x the time of a single input, yielding a 10-16x improvement in throughput.
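The throughput arithmetic above is worth making explicit. With an assumed (illustrative) single-input latency, a batch of 32 that finishes in 2.5x the single-input time yields a 12.8x throughput gain:

```python
batch_size = 32
single_latency_ms = 10.0                     # illustrative single-input latency
batch_latency_ms = 2.5 * single_latency_ms   # batch of 32 finishes in ~2.5x

# Requests served per second in each mode.
sequential_throughput = 1000 / single_latency_ms            # 100 req/s
batched_throughput = batch_size * 1000 / batch_latency_ms   # 1280 req/s

speedup = batched_throughput / sequential_throughput
print(speedup)  # 32 / 2.5 = 12.8
```

The same formula with a 2x or 3x batch latency gives the 16x and ~10.7x endpoints of the range quoted above.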
There are two primary approaches to batching for serving workloads: static batching and continuous (dynamic) batching. Static batching waits until a fixed number of requests accumulate or a timeout expires, then processes them all at once. Continuous batching, used by engines like vLLM, dynamically adds new requests to running batches as slots become available, which avoids head-of-line blocking and maintains consistently high GPU utilization.
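The static side of this split is simple enough to sketch directly. The following is an illustrative collector, not any engine's actual API: it waits until a fixed batch size accumulates or a timeout expires, then returns whatever has arrived as one batch.

```python
import queue
import time

def collect_static_batch(request_queue, max_batch=8, timeout_s=0.05):
    """Static batching sketch: block until `max_batch` requests arrive
    or `timeout_s` elapses, then return the accumulated batch."""
    batch = []
    deadline = time.monotonic() + timeout_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # timeout expired; ship a partial batch
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break  # queue drained and timeout hit while waiting
    return batch

# Usage: five requests arrive before the timeout, so they form one batch.
q = queue.Queue()
for i in range(5):
    q.put(f"request-{i}")
batch = collect_static_batch(q)
print(batch)  # all five requests, batched together
```

A continuous batcher replaces this all-or-nothing wait with per-iteration scheduling: after each model step, finished sequences leave the batch and queued requests immediately take their slots.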
The optimal batch size depends on the model, GPU, and latency requirements. Larger batches generally improve throughput but increase latency for individual requests, since each request must wait for the entire batch to complete. For latency-sensitive applications, smaller batches or continuous batching strategies offer a better balance. Memory constraints also limit batch size, as each additional request in a batch requires additional GPU memory for activations, intermediate results, and (for language models) KV cache.
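The KV-cache constraint can be estimated with simple arithmetic: keys and values (a factor of 2) are stored per layer, per KV head, per token. The model figures below are assumed for illustration, roughly matching a 7B-parameter transformer, not measured from any specific model.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    # 2 = one key tensor and one value tensor per layer and position.
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

# Illustrative figures: 32 layers, 32 KV heads of dim 128,
# 2048-token context, fp16 (2 bytes per element).
per_request_gib = kv_cache_bytes(32, 32, 128, 2048, 2) / 2**30
print(per_request_gib)  # 1.0 GiB of KV cache per request
```

At roughly 1 GiB of cache per request on top of the model weights, even a large GPU fits only a modest number of long-context requests in a batch, which is why memory, not compute, often caps batch size for language models.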
Batch inference is particularly valuable for offline processing tasks like embedding generation, document classification, and bulk image analysis where latency is less critical. For these workloads, maximizing throughput directly reduces cost. Serverless GPU platforms can automatically batch incoming requests to optimize utilization, combining the cost efficiency of batch processing with the simplicity of a request-response API.
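For offline workloads like the ones above, batching often reduces to chunking a dataset and running each chunk through the model. A minimal chunking helper (illustrative, with a toy batch size):

```python
def batches(items, batch_size):
    # Yield fixed-size chunks for bulk processing;
    # the final batch may be smaller than batch_size.
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# Usage: 10 documents at batch size 4 -> batches of 4, 4, and 2.
docs = [f"doc-{i}" for i in range(10)]
sizes = [len(b) for b in batches(docs, 4)]
print(sizes)  # [4, 4, 2]
```

Each yielded chunk would then be passed to the model in a single forward call; since latency is not a concern offline, the batch size is chosen to just fit GPU memory.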
Related Terms
GPU Inference
The process of running a trained machine learning model on a GPU to generate predictions or outputs from new input data.
GPU Utilization
A metric measuring the percentage of time a GPU's compute cores are actively processing work, indicating how efficiently the hardware is being used.
vLLM
An open-source, high-throughput serving engine for large language models that uses PagedAttention to efficiently manage GPU memory and maximize inference performance.
Inference Latency
The total time elapsed from when an inference request is received to when the response is returned, including preprocessing, model execution, and postprocessing.
Model Serving
The infrastructure and process of deploying trained machine learning models as accessible endpoints that can receive inputs and return predictions in real time.