Inference Glossary

SGLang

An open-source LLM serving engine that uses RadixAttention to share KV cache across requests with overlapping prefixes — especially strong for agentic and tool-use workloads.

SGLang is an open-source structured generation language and serving engine for large language models. It is often compared to vLLM but takes a different approach to KV cache management. Where vLLM's PagedAttention focuses on memory efficiency, storing the KV cache in fixed-size blocks to reduce fragmentation, SGLang's **RadixAttention** focuses on KV cache reuse across requests that share prefixes.

The core observation behind RadixAttention is that many production workloads send requests with substantial overlap: the same system prompt, the same few-shot examples, the same tool definitions. Computing the KV cache for these shared prefixes once and reusing it across requests removes most of the redundant prefill work. SGLang organizes cached prefixes in a radix tree and matches each incoming request against it on entry, serving the longest matching prefix from cache and computing only the remaining suffix.
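The matching step can be illustrated with a minimal sketch. This is not SGLang's implementation: it uses a per-token trie rather than a compressed radix tree, stores no actual KV tensors, and never evicts anything. The names `PrefixCache`, `match_prefix`, and `serve` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Children keyed by token id. A real radix tree would compress
    # single-child chains into one edge holding a span of tokens.
    children: dict = field(default_factory=dict)


class PrefixCache:
    """Toy prefix index: records which token prefixes have been seen."""

    def __init__(self):
        self.root = Node()

    def match_prefix(self, tokens):
        """Return how many leading tokens are already in the tree."""
        node, matched = self.root, 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens):
        """Record `tokens` (and therefore every prefix of it) as cached."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, Node())


def serve(cache, tokens):
    """Return (reused, computed) token counts for one request."""
    reused = cache.match_prefix(tokens)
    cache.insert(tokens)
    return reused, len(tokens) - reused


cache = PrefixCache()
system_prompt = [1, 2, 3, 4]  # stand-in token ids for a shared system prompt
first = serve(cache, system_prompt + [10, 11])
second = serve(cache, system_prompt + [20, 21, 22])
print(first, second)  # (0, 6) (4, 3)
```

The first request finds nothing in the tree and computes all six tokens. The second reuses the four shared tokens and computes only its three-token suffix.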

SGLang is particularly strong on three workload shapes: agentic workflows with shared tool prompts, batch evaluation where many test cases share a system prompt, and RAG pipelines where retrieved context dominates the prefix. On these workloads SGLang can outperform vLLM by a wide margin, because most of each request's prefill is served from cache; on workloads without prefix overlap there is little to reuse, and the two are roughly comparable.
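A back-of-the-envelope estimate shows why prefix overlap matters so much. The sketch below counts prefill tokens only, and assumes every request hits the cached prefix and nothing is evicted; real speedups also depend on decode time, batching, and hardware. The function name and the example numbers are illustrative, not measurements.

```python
def prefill_savings(shared_prefix_tokens, unique_suffix_tokens, num_requests):
    """Fraction of prefill tokens avoided when a shared prefix is computed once."""
    per_request = shared_prefix_tokens + unique_suffix_tokens
    without_reuse = num_requests * per_request
    # With reuse, the prefix is computed once and only suffixes are per-request.
    with_reuse = shared_prefix_tokens + num_requests * unique_suffix_tokens
    return 1 - with_reuse / without_reuse


# 100 requests sharing a 2,000-token prompt, each with a 200-token suffix.
heavy_overlap = prefill_savings(2000, 200, 100)
# The same requests with no shared prefix at all.
no_overlap = prefill_savings(0, 200, 100)
print(round(heavy_overlap, 3), round(no_overlap, 3))  # 0.9 0.0
```

Under these assumptions, heavy overlap removes about 90% of the prefill tokens, while a workload with no shared prefix saves nothing.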

Inference platforms typically evaluate vLLM, SGLang, and their own custom runtimes per model and per workload, dispatching the right engine for each. Cumulus uses SGLang and vLLM where they win and Ion where Ion wins.

Related Terms

vLLM

An open-source LLM serving engine that introduced PagedAttention. The de facto baseline for high-throughput open-weight model serving.

KV Cache

A buffer that stores previously computed key and value tensors during autoregressive generation, so each new token only requires computing one step of attention instead of recomputing keys and values for the whole sequence.
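As a rough illustration, the sketch below runs single-head attention one token at a time, appending each token's key and value to a cache. To stay short it assumes the key, value, and query are all the raw token embedding (no learned projections); the `KVCache` class and `attend` function are hypothetical names, not any engine's API.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention for one query over all given keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


class KVCache:
    """Keeps the keys and values of every token processed so far."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, key, value, query):
        # Only the new token's key and value are added; older ones are reused.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)


embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# With the cache: one key/value append per generated token.
cache = KVCache()
cached = [cache.step(x, x, x) for x in embeddings]

# Without the cache: the whole prefix is rebuilt at every step.
recomputed = [attend(x, embeddings[:t + 1], embeddings[:t + 1])
              for t, x in enumerate(embeddings)]
```

Both paths produce the same attention outputs. The cached path just avoids rebuilding the keys and values for earlier tokens at each step.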

Prompt Cache

A response cache keyed on the request — exact-match for identical requests, prefix for shared system prompts, and optionally semantic for paraphrased questions. Cuts input tokens dramatically on real traffic.

Ion

Cumulus' inference engine. A custom runtime with proprietary attention kernels that serves 30 to 50% more tokens per second than vLLM and SGLang on NVIDIA Grace and Blackwell.

Inference

Running a trained model on new inputs to produce outputs — predictions, classifications, generations, embeddings. The production half of machine learning.