Batch Inference
The technique of grouping multiple inference requests together and processing them simultaneously on a GPU to maximize throughput and hardware utilization.
Batch inference is the practice of collecting multiple input requests and processing them together in a single forward pass through a neural network. Instead of running one input at a time through the model, the GPU processes a batch of inputs simultaneously, leveraging its parallel architecture to achieve much higher throughput than sequential processing.
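The idea can be sketched with a toy one-layer model: the same forward function handles a single input or a whole batch, and the batched call produces identical results in one pass. The model, dimensions, and weights here are illustrative, not from any particular framework.

```python
import numpy as np

# Toy "model": one linear layer plus ReLU (illustrative dimensions).
rng = np.random.default_rng(0)
weights = rng.standard_normal((512, 256))

def forward(x):
    # Works for a single input of shape (512,) or a batch of shape (B, 512).
    return np.maximum(x @ weights, 0.0)

inputs = rng.standard_normal((32, 512))

# Sequential: 32 separate forward passes, one input at a time.
sequential = np.stack([forward(x) for x in inputs])

# Batched: the same 32 inputs in a single forward pass.
batched = forward(inputs)

# Same outputs either way; the batched call lets the GPU (or here, the
# BLAS library) parallelize across the whole batch at once.
assert np.allclose(sequential, batched)
```

On a GPU the batched call maps to one large matrix multiply, which is exactly the shape of work the hardware is built to parallelize.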
The efficiency gains from batching stem from GPU architecture. Modern GPUs have thousands of cores that operate in parallel, but many inference operations do not fully utilize all available cores when processing a single input. By batching multiple inputs, the GPU can spread work across more cores and amortize fixed overhead costs like memory transfers and kernel launch latency. A batch of 32 inputs might complete in only 2-3x the time of a single input, yielding a 10-16x improvement in throughput.
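The throughput arithmetic above is worth making explicit. With an assumed (illustrative) single-input latency, a batch of 32 that finishes in 2.5x the single-input time yields a 12.8x throughput gain:

```python
batch_size = 32
single_latency_ms = 10.0                     # illustrative single-input latency
batch_latency_ms = 2.5 * single_latency_ms   # batch of 32 finishes in ~2.5x

# Requests served per second in each mode.
sequential_throughput = 1000 / single_latency_ms            # 100 req/s
batched_throughput = batch_size * 1000 / batch_latency_ms   # 1280 req/s

speedup = batched_throughput / sequential_throughput
print(speedup)  # 32 / 2.5 = 12.8
```

The same formula with a 2x or 3x batch latency gives the 16x and ~10.7x endpoints of the range quoted above.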
There are two primary approaches to batching for serving workloads: static batching and continuous (dynamic) batching. Static batching waits until a fixed number of requests accumulate or a timeout expires, then processes them all at once. Continuous batching, used by engines like vLLM, dynamically adds new requests to running batches as slots become available, which avoids head-of-line blocking and maintains consistently high GPU utilization.
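The static side of this split is simple enough to sketch directly. The following is an illustrative collector, not any engine's actual API: it waits until a fixed batch size accumulates or a timeout expires, then returns whatever has arrived as one batch.

```python
import queue
import time

def collect_static_batch(request_queue, max_batch=8, timeout_s=0.05):
    """Static batching sketch: block until `max_batch` requests arrive
    or `timeout_s` elapses, then return the accumulated batch."""
    batch = []
    deadline = time.monotonic() + timeout_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # timeout expired; ship a partial batch
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break  # queue drained and timeout hit while waiting
    return batch

# Usage: five requests arrive before the timeout, so they form one batch.
q = queue.Queue()
for i in range(5):
    q.put(f"request-{i}")
batch = collect_static_batch(q)
print(batch)  # all five requests, batched together
```

A continuous batcher replaces this all-or-nothing wait with per-iteration scheduling: after each model step, finished sequences leave the batch and queued requests immediately take their slots.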
The optimal batch size depends on the model, GPU, and latency requirements. Larger batches generally improve throughput but increase latency for individual requests, since each request must wait for the entire batch to complete. For latency-sensitive applications, smaller batches or continuous batching strategies offer a better balance. Memory constraints also limit batch size, as each additional request in a batch requires additional GPU memory for activations, intermediate results, and (for language models) KV cache.
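The KV-cache constraint can be estimated with simple arithmetic: keys and values (a factor of 2) are stored per layer, per KV head, per token. The model figures below are assumed for illustration, roughly matching a 7B-parameter transformer, not measured from any specific model.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    # 2 = one key tensor and one value tensor per layer and position.
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

# Illustrative figures: 32 layers, 32 KV heads of dim 128,
# 2048-token context, fp16 (2 bytes per element).
per_request_gib = kv_cache_bytes(32, 32, 128, 2048, 2) / 2**30
print(per_request_gib)  # 1.0 GiB of KV cache per request
```

At roughly 1 GiB of cache per request on top of the model weights, even a large GPU fits only a modest number of long-context requests in a batch, which is why memory, not compute, often caps batch size for language models.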
Batch inference is particularly valuable for offline processing tasks like embedding generation, document classification, and bulk image analysis where latency is less critical. For these workloads, maximizing throughput directly reduces cost. Serverless GPU platforms can automatically batch incoming requests to optimize utilization, combining the cost efficiency of batch processing with the simplicity of a request-response API.
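For offline workloads like the ones above, batching often reduces to chunking a dataset and running each chunk through the model. A minimal chunking helper (illustrative, with a toy batch size):

```python
def batches(items, batch_size):
    # Yield fixed-size chunks for bulk processing;
    # the final batch may be smaller than batch_size.
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# Usage: 10 documents at batch size 4 -> batches of 4, 4, and 2.
docs = [f"doc-{i}" for i in range(10)]
sizes = [len(b) for b in batches(docs, 4)]
print(sizes)  # [4, 4, 2]
```

Each yielded chunk would then be passed to the model in a single forward call; since latency is not a concern offline, the batch size is chosen to just fit GPU memory.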
Related Terms
GPU Inference
The process of running a trained machine learning model on a GPU to generate predictions or outputs from new input data.
GPU Utilization
A metric measuring the percentage of time a GPU's compute cores are actively processing work, indicating how efficiently the hardware is being used.
vLLM
An open-source, high-throughput serving engine for large language models that uses PagedAttention to efficiently manage GPU memory and maximize inference performance.
Inference Latency
The total time elapsed from when an inference request is received to when the response is returned, including preprocessing, model execution, and postprocessing.
Model Serving
The infrastructure and process of deploying trained machine learning models as accessible endpoints that can receive inputs and return predictions in real time.