vLLM
An open-source, high-throughput serving engine for large language models that uses PagedAttention to efficiently manage GPU memory and maximize inference performance.
vLLM is an open-source library for fast and efficient large language model (LLM) inference and serving. Originally developed at UC Berkeley, it has become one of the most widely adopted LLM serving engines thanks to its high throughput and ease of use. vLLM supports a broad range of model architectures, including LLaMA, Mistral, GPT-NeoX, Falcon, and many others.
The key innovation in vLLM is PagedAttention, an attention algorithm inspired by virtual memory paging in operating systems. Traditional LLM serving engines allocate contiguous blocks of GPU memory for each request's key-value cache, leading to significant memory fragmentation and waste. PagedAttention instead stores the KV cache in non-contiguous memory blocks (pages), which are dynamically allocated as needed. This eliminates fragmentation and allows near-optimal memory utilization.
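The bookkeeping behind this idea can be sketched in a few lines of plain Python. This is a toy model only; the `BlockAllocator` and `Sequence` classes and the block size are illustrative assumptions, and vLLM's real block manager is far more involved. The point is that a sequence's KV cache grows one fixed-size block at a time, drawn from a shared pool, with no contiguous reservation up front:

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative; small fixed blocks are the key idea)

class BlockAllocator:
    """Toy pool of fixed-size physical KV-cache blocks."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def alloc(self) -> int:
        if not self.free:
            raise MemoryError("out of KV-cache blocks")
        return self.free.pop()

class Sequence:
    """Block table mapping a sequence's logical token positions to physical blocks."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new physical block is grabbed only when the previous one fills up,
        # so memory is allocated on demand and need not be contiguous.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

alloc = BlockAllocator(num_blocks=8)
seq = Sequence(alloc)
for _ in range(20):        # 20 tokens fit in ceil(20 / 16) = 2 blocks
    seq.append_token()
print(len(seq.block_table))  # 2
```

Because every request consumes whole blocks only as it actually generates tokens, the waste per sequence is at most one partially filled block, rather than an entire over-provisioned contiguous region.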
PagedAttention's memory efficiency enables vLLM to serve 2-4x more concurrent requests on the same GPU hardware than systems that reserve contiguous KV-cache memory per request. This translates directly into higher throughput and lower per-request cost. The technique also enables efficient memory sharing between requests, which is particularly beneficial for workloads using beam search, parallel sampling, or shared prompt prefixes.
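The sharing works because block tables hold indirection: several sequences can point at the same physical blocks, tracked with reference counts. The sketch below is a hypothetical simplification (the `RefCountedBlocks` class is invented for illustration, and real sharing also needs copy-on-write when a shared block is modified), showing two parallel samples forked from one prompt reusing its prefix blocks:

```python
class RefCountedBlocks:
    """Toy ref-counted block pool: sequences sharing a prompt prefix
    map their first logical blocks to the same physical blocks."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.refcount: dict[int, int] = {}

    def alloc(self) -> int:
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> int:
        # Another sequence maps the same physical block: no copy needed.
        self.refcount[block] += 1
        return block

    def release(self, block: int) -> None:
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free.append(block)

pool = RefCountedBlocks(num_blocks=8)
prefix = [pool.alloc(), pool.alloc()]       # prompt KV cache: 2 blocks
sample_a = [pool.share(b) for b in prefix]  # parallel sample 1 reuses them
sample_b = [pool.share(b) for b in prefix]  # parallel sample 2 reuses them
print(len(pool.free))  # 6 -- two physical blocks back three logical views
```

Naively, the three views of the prompt would consume six blocks; with sharing, they consume two, which is why prompt-heavy workloads like parallel sampling benefit so much.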
Beyond PagedAttention, vLLM includes continuous batching, which dynamically adds new requests to running batches without waiting for all requests in a batch to complete. This eliminates the head-of-line blocking problem that affects static batching approaches, where a single long-running request holds up an entire batch of shorter requests. Continuous batching keeps the GPU continuously utilized regardless of variation in output sequence lengths.
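A minimal scheduler loop illustrates the difference. In this toy simulation (the function and request lengths are invented for illustration, and real schedulers also weigh memory budgets and preemption), a finished request frees its batch slot at the very step it completes, so a queued request joins immediately rather than waiting for the longest request in the batch:

```python
from collections import deque

def continuous_batching(requests: list[tuple[str, int]], max_batch: int) -> int:
    """Simulate decoding: each step generates one token for every running
    request; finished requests leave and waiting ones join mid-flight.
    Returns the number of decode steps to drain all requests."""
    waiting = deque(requests)        # (request_id, tokens_to_generate)
    running: dict[str, int] = {}
    steps = 0
    while waiting or running:
        # Admit new requests into any free batch slots (the "continuous" part).
        while waiting and len(running) < max_batch:
            request_id, length = waiting.popleft()
            running[request_id] = length
        # One decode step: every running request emits one token.
        for request_id in list(running):
            running[request_id] -= 1
            if running[request_id] == 0:
                del running[request_id]  # slot frees now, not at batch end
        steps += 1
    return steps

requests = [("short-1", 2), ("long", 10), ("short-2", 2), ("short-3", 2)]
print(continuous_batching(requests, max_batch=2))  # 10
```

With static batching on the same workload, the batch of ("short-1", "long") would run for 10 steps before ("short-2", "short-3") could even start, for 12 steps total; continuous batching finishes in 10 because the short requests slip into the slot "short-1" vacates while "long" keeps running.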
vLLM integrates seamlessly with serverless GPU platforms like Cumulus, where it serves as the inference runtime for many LLM deployments. Its support for quantized models, speculative decoding, tensor parallelism across multiple GPUs, and OpenAI-compatible API endpoints makes it a production-ready choice for teams deploying large language models at scale.
Related Terms
KV Cache
A memory buffer that stores previously computed key and value tensors during autoregressive language model inference to avoid redundant recalculation for each new token.
GPU Inference
The process of running a trained machine learning model on a GPU to generate predictions or outputs from new input data.
Model Serving
The infrastructure and process of deploying trained machine learning models as accessible endpoints that can receive inputs and return predictions in real time.
Batch Inference
The technique of grouping multiple inference requests together and processing them simultaneously on a GPU to maximize throughput and hardware utilization.
Model Quantization
The process of reducing the numerical precision of a model's weights and activations (e.g., from 32-bit to 8-bit or 4-bit) to decrease memory usage and increase inference speed.