
SGLang

An open-source LLM serving engine that uses RadixAttention to share KV cache across requests with overlapping prefixes — especially strong for agentic and tool-use workloads.

SGLang is an open-source structured generation language and serving engine for large language models. It is often compared to vLLM but takes a different approach to KV cache management. Where vLLM's PagedAttention focuses on memory efficiency, storing the KV cache in fixed-size blocks to reduce fragmentation, SGLang's **RadixAttention** focuses on reusing the KV cache across requests that share prefixes.

The core observation behind RadixAttention is that many production workloads send requests with substantial overlap: the same system prompt, the same few-shot examples, the same tool definitions. Computing the KV cache for these shared prefixes once and reusing it across requests dramatically reduces redundant prefill computation. SGLang organizes cached prefixes in a radix tree and matches each incoming request against it on arrival, serving the longest matching prefix from cache and computing only the remaining suffix.
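The prefix-matching idea can be illustrated with a small sketch. This is not SGLang's implementation: it assumes a plain token-level trie (one token per edge) rather than a compressed radix tree, it stores no actual KV tensors, and it ignores eviction. The names `TrieNode`, `PrefixCache`, `match_prefix`, and `serve` are hypothetical and chosen only for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TrieNode:
    # Maps the next token id to a child node. A real radix tree would
    # compress runs of tokens onto a single edge; this toy does not.
    children: dict[int, TrieNode] = field(default_factory=dict)


class PrefixCache:
    """Toy index of previously seen token prefixes (hypothetical, illustrative)."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def match_prefix(self, tokens: list[int]) -> int:
        """Return the length of the longest cached prefix of `tokens`."""
        node = self.root
        matched = 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node = child
            matched += 1
        return matched

    def insert(self, tokens: list[int]) -> None:
        """Record `tokens` so later requests can reuse any of its prefixes."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())

    def serve(self, tokens: list[int]) -> tuple[int, int]:
        """Return (tokens served from cache, tokens that must be computed)."""
        hit = self.match_prefix(tokens)
        self.insert(tokens)
        return hit, len(tokens) - hit


cache = PrefixCache()
system_prompt = [1, 2, 3, 4]                     # stand-in for a shared system prompt
first = cache.serve(system_prompt + [10, 11])    # cold cache: nothing is reused
second = cache.serve(system_prompt + [20])       # the shared prefix is reused
print(first, second)
```

In this sketch the second request reuses the four shared tokens and computes only its one-token suffix. A production engine would additionally attach KV cache blocks to the tree nodes and evict entries under memory pressure.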

SGLang is particularly strong on three workload shapes: agentic workflows with shared tool prompts, batch evaluation where many test cases share a system prompt, and RAG pipelines where retrieved context dominates the prefix. On these workloads SGLang can outperform vLLM by a wide margin; on workloads without prefix overlap, the two are roughly comparable.
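To see why prefix overlap matters so much, a rough back-of-the-envelope calculation helps. The sketch below is an idealized model, not a benchmark: it assumes every request shares exactly the same prefix, that the cached prefix is never evicted, and that only prefill tokens are counted (decode cost is ignored). The function name `prefill_savings` and the token counts are illustrative assumptions, not measurements.

```python
def prefill_savings(shared_prefix_tokens: int,
                    unique_suffix_tokens: int,
                    num_requests: int) -> float:
    """Fraction of prefill tokens skipped when the shared prefix is computed once."""
    # Without reuse, every request recomputes the prefix and its own suffix.
    without_cache = num_requests * (shared_prefix_tokens + unique_suffix_tokens)
    # With reuse, the prefix is computed once; each request computes only its suffix.
    with_cache = shared_prefix_tokens + num_requests * unique_suffix_tokens
    return 1 - with_cache / without_cache


# Hypothetical agentic workload: long shared tool prompt, short per-request suffix.
print(round(prefill_savings(2000, 100, 50), 2))  # about 0.93

# No shared prefix: caching saves nothing in this model.
print(prefill_savings(0, 100, 50))
```

Under these assumptions, the savings grow with the ratio of shared prefix to unique suffix and vanish when there is no overlap, which matches the qualitative picture above.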

Inference platforms typically evaluate vLLM, SGLang, and their own custom runtimes per model and per workload, dispatching the right engine for each. Cumulus uses SGLang and vLLM where they win and Ion where Ion wins.