Inference Glossary

KV Cache

A buffer that stores previously computed key and value tensors during autoregressive generation, so that each decoding step computes projections only for the newest token instead of reprocessing the whole sequence.

The KV cache (key-value cache) is the central optimization for autoregressive transformer models such as GPT, LLaMA, Claude, and Qwen. During text generation, each new token depends on every prior token through self-attention. Without caching, every decoding step would recompute the key and value projections for all positions in the sequence, so the work per token would grow with the length of the sequence and the total cost of a generation would grow at least quadratically with sequence length.

The KV cache stores the key and value tensors for all previously processed tokens. When generating the next token, only the new token's query, key, and value need to be computed; the cached keys and values from prior positions are reused. The projection work per step therefore stays constant instead of growing with the sequence. Attention itself still reads every cached position, so each step remains linear in context length, but the redundant recomputation is gone. This is what makes long-context generation practical.
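The mechanism can be sketched in a few lines of plain Python. This is a toy, single-head, single-layer illustration under stated assumptions, not any library's implementation: the names `KVCache`, `decode_step_cached`, and `decode_step_uncached` are hypothetical, and the weights and token vectors are made-up values. It shows that appending one key and one value per step gives the same attention output as recomputing every projection from scratch.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    # W is a list of rows; returns W @ x.
    return [dot(row, x) for row in W]

def attend(q, keys, values):
    """Scaled dot-product attention for one query over the given positions."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    m = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

class KVCache:
    """Hypothetical minimal cache: one key and one value vector per position."""
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

def decode_step_cached(x, Wq, Wk, Wv, cache):
    # Project only the newest token, then attend over everything cached so far.
    q = matvec(Wq, x)
    cache.append(matvec(Wk, x), matvec(Wv, x))
    return attend(q, cache.keys, cache.values)

def decode_step_uncached(xs, Wq, Wk, Wv):
    # Recompute keys and values for every position on every step.
    keys = [matvec(Wk, x) for x in xs]
    values = [matvec(Wv, x) for x in xs]
    q = matvec(Wq, xs[-1])
    return attend(q, keys, values)

# Made-up 2-dimensional weights and token embeddings, for illustration only.
Wq = [[1.0, 0.0], [0.0, 1.0]]
Wk = [[0.5, 0.1], [-0.2, 0.3]]
Wv = [[0.2, -0.4], [0.7, 0.1]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

cache = KVCache()
cached_outputs = [decode_step_cached(x, Wq, Wk, Wv, cache) for x in tokens]
uncached_outputs = [decode_step_uncached(tokens[: t + 1], Wq, Wk, Wv)
                    for t in range(len(tokens))]
```

Both lists hold the same attention outputs, but the cached path performs one key and one value projection per step, while the uncached path performs one pair per position per step.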

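The memory cost of the KV cache is substantial and grows linearly with both sequence length and batch size. Per token, the cache holds one key vector and one value vector for every KV head in every layer. For Llama 2 70B (80 layers, 8 KV heads under grouped-query attention, head dimension 128), that is 320 KiB per token in FP16, or about 1.25 GiB for a single 4096-token sequence; a model of the same shape with full multi-head attention (64 KV heads) would need about 10 GiB. When serving many concurrent requests, the KV cache can consume more GPU memory than the model weights themselves. This is why KV cache management is one of the most important factors in LLM serving efficiency.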
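The size estimate follows from simple arithmetic, sketched below. The helper `kv_cache_bytes` is a hypothetical name, and the layer count, KV head count, and head dimension are assumptions taken from the published Llama 2 70B configuration. The result depends heavily on the number of KV heads, so figures quoted for a "70B" model vary with the attention variant assumed.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_element=2):
    """Estimate KV cache size; bytes_per_element=2 corresponds to FP16."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_element)

GIB = 1024 ** 3

# Assumed shape: 80 layers, head dimension 128, 4096-token sequence.
gqa_bytes = kv_cache_bytes(num_layers=80, num_kv_heads=8,
                           head_dim=128, seq_len=4096)   # grouped-query attention
mha_bytes = kv_cache_bytes(num_layers=80, num_kv_heads=64,
                           head_dim=128, seq_len=4096)   # full multi-head attention

print(gqa_bytes / GIB)  # 1.25
print(mha_bytes / GIB)  # 10.0
```

Because the formula is linear in `seq_len` and `batch_size`, serving 32 such sequences at once multiplies either figure by 32.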

Major innovations in KV cache management include vLLM's PagedAttention, which allocates cache in small fixed-size pages to largely eliminate fragmentation; SGLang's RadixAttention, which shares cache across requests with overlapping prefixes; and KV cache quantization, which stores cached values at reduced precision to fit more requests in memory. Ion's eager KV writeback uses Grace Hopper's TMA to stage pages out to LPDDR over NVLink-C2C, keeping the kernel's working set bounded regardless of context length.
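The idea behind paged allocation can be illustrated with a small bookkeeping sketch. This is not vLLM's actual code or API: `PagedKVAllocator` and its methods are hypothetical names, and the sketch tracks only which physical block each token slot lands in, not the tensors themselves. Each request gets a block table mapping its logical positions to fixed-size physical blocks, so memory is claimed one block at a time rather than reserved up front for the maximum sequence length.

```python
class PagedKVAllocator:
    """Hypothetical block-table bookkeeping for a paged KV cache."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # request id -> list of physical block ids
        self.lengths = {}       # request id -> number of tokens stored

    def append_token(self, request_id):
        """Reserve a slot for one more token; returns (physical block, offset)."""
        length = self.lengths.get(request_id, 0)
        if length % self.block_size == 0:
            # First token, or the last block is full: claim a new block.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            self.block_tables.setdefault(request_id, []).append(
                self.free_blocks.pop())
        self.lengths[request_id] = length + 1
        return self.block_tables[request_id][-1], length % self.block_size

    def release(self, request_id):
        """Return all of a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)

# Two requests share a pool of 4 blocks holding 2 tokens each.
alloc = PagedKVAllocator(num_blocks=4, block_size=2)
for _ in range(3):
    alloc.append_token("a")   # 3 tokens -> 2 blocks
for _ in range(2):
    alloc.append_token("b")   # 2 tokens -> 1 block
blocks_in_use = 4 - len(alloc.free_blocks)
alloc.release("a")            # a's blocks become reusable immediately
```

In this scheme the only wasted space is the unused tail of each request's last block, which is why small blocks keep fragmentation low.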