Inference Glossary

KV Cache

A buffer that stores previously computed key and value tensors during autoregressive generation, so each new token only requires computing one step of attention instead of replaying the whole sequence.

The KV cache (key-value cache) is the central optimization for autoregressive transformer models such as GPT, LLaMA, Claude, and Qwen. During text generation, each new token depends on every prior token through self-attention. Without caching, generating each new token would require recomputing the key and value projections for every position in the sequence, so the work per generated token would grow quadratically with sequence length.

The KV cache stores the key and value tensors for all previously processed tokens. When generating the next token, only the new token's query, key, and value need to be computed; the cached keys and values from prior positions are reused. This reduces the attention cost of each generation step from quadratic to linear in sequence length and is the reason long-context generation is tractable at all.
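The reuse described above can be sketched for a single attention head in plain Python. This is a minimal illustration rather than how any particular engine implements it: the names `KVCache`, `decode_step`, and `decode_without_cache` are hypothetical, and real implementations operate on batched multi-head tensors on the GPU.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [dot(row, x) for row in W]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Scaled dot-product attention for one query over all given positions."""
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, k) * scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

class KVCache:
    """Append-only store of per-position key and value vectors (hypothetical)."""
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

def decode_step(x, Wq, Wk, Wv, cache):
    """Process one new token embedding x, reusing cached keys and values."""
    q = matvec(Wq, x)
    cache.append(matvec(Wk, x), matvec(Wv, x))  # only the new token is projected
    return attend(q, cache.keys, cache.values)

def decode_without_cache(xs, Wq, Wk, Wv):
    """Reference path: re-project keys and values for the whole prefix."""
    q = matvec(Wq, xs[-1])
    keys = [matvec(Wk, x) for x in xs]
    values = [matvec(Wv, x) for x in xs]
    return attend(q, keys, values)
```

Both paths produce the same attention output for the newest token. The cached path performs one key projection and one value projection per step, while the reference path repeats them for every earlier position.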

The memory cost of the KV cache is substantial and grows linearly with both sequence length and batch size. For LLaMA-70B, the KV cache for a single 4096-token sequence requires roughly 1.25 GiB of GPU memory in FP16, taking the Llama 2 70B configuration of 80 layers, 8 key-value heads (grouped-query attention), and a head dimension of 128. When serving many concurrent requests, the KV cache often consumes more GPU memory than the model weights themselves. This is why KV cache management is one of the most important factors in LLM serving efficiency.
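The linear growth can be made concrete with a small sizing helper. This is a back-of-the-envelope sketch: `kv_cache_bytes` is a hypothetical name, and the layer count, head count, and head dimension below are assumed values for a 70B-class model, not taken from any specific deployment. Figures quoted for a given model depend heavily on which attention layout and precision are assumed.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Approximate KV cache size in bytes (FP16 by default)."""
    # The leading 2 accounts for one key and one value tensor per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

GIB = 2 ** 30

# Assumed 70B-class configuration: 80 layers, head dimension 128.
gqa_gib = kv_cache_bytes(80, 8, 128, 4096) / GIB    # 8 KV heads (grouped-query): 1.25
mha_gib = kv_cache_bytes(80, 64, 128, 4096) / GIB   # 64 KV heads (full multi-head): 10.0

# Doubling either the sequence length or the batch size doubles the footprint.
batch_gib = kv_cache_bytes(80, 8, 128, 4096, batch_size=2) / GIB  # 2.5
```

The eightfold gap between the two layouts shows why the number of key-value heads matters as much as the parameter count when estimating cache size.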

Major innovations in KV cache management include vLLM's PagedAttention, which allocates cache in small pages to eliminate fragmentation; SGLang's RadixAttention, which shares cache across requests with overlapping prefixes; and KV cache quantization, which stores cached values at reduced precision to fit more requests in memory. Ion's eager KV writeback uses Grace Hopper's TMA to stage pages out to LPDDR over NVLink-C2C, keeping the kernel's working set bounded regardless of context length.
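The paging idea can be illustrated with a toy block-table allocator. This is a simplified sketch of the general technique, not vLLM's actual implementation. The class `PagedKVAllocator` and its methods are hypothetical names, and the sketch tracks only which page and offset each token would occupy, not the tensors themselves.

```python
class PagedKVAllocator:
    """Toy allocator that maps each sequence to fixed-size cache pages."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.block_tables = {}  # seq_id -> list of physical page ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; returns (page_id, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.page_size:  # all current pages are full
            if not self.free_pages:
                raise MemoryError("no free KV pages")
            table.append(self.free_pages.pop())
        self.lengths[seq_id] = length + 1
        return table[length // self.page_size], length % self.page_size

    def release(self, seq_id):
        """Return all pages of a finished sequence to the free list."""
        self.free_pages.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because a sequence claims a page only when its previous page is full, unused space is limited to the tail of the last page per sequence, and pages freed by a finished request are immediately reusable by any other request.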

Related Terms

vLLM

An open-source LLM serving engine that introduced PagedAttention. The de facto baseline for high-throughput open-weight model serving.

SGLang

An open-source LLM serving engine that uses RadixAttention to share KV cache across requests with overlapping prefixes — especially strong for agentic and tool-use workloads.

Prompt Cache

A response cache keyed on the request — exact-match for identical requests, prefix for shared system prompts, and optionally semantic for paraphrased questions. Cuts input tokens dramatically on real traffic.

Ion

Cumulus' inference engine. A custom runtime with proprietary attention kernels that serves 30 to 50% more tokens per second than vLLM and SGLang on NVIDIA Grace and Blackwell.

Inference Latency

The end-to-end time from request arrival to response delivery. For LLMs, decomposed into time-to-first-token (TTFT) and inter-token latency (ITL).
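The decomposition can be sketched from per-token timestamps. This is a minimal illustration: `latency_breakdown` is a hypothetical helper, and it assumes the caller records the request arrival time and the time each token is delivered, all in seconds.

```python
def latency_breakdown(request_arrival, token_times):
    """Split end-to-end latency into TTFT and mean ITL, in seconds."""
    if not token_times:
        raise ValueError("need at least one token timestamp")
    ttft = token_times[0] - request_arrival
    # Gaps between consecutive token deliveries.
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    return {
        "ttft": ttft,
        "mean_itl": mean_itl,
        "end_to_end": token_times[-1] - request_arrival,
    }
```

End-to-end latency equals TTFT plus the sum of the inter-token gaps, so a long prompt mostly shows up in TTFT and a long response mostly shows up in the accumulated ITL.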