GPU Glossary

KV Cache

A memory buffer that stores previously computed key and value tensors during autoregressive language model inference to avoid redundant recalculation for each new token.

The KV cache (key-value cache) is a critical optimization for autoregressive transformer models such as GPT, LLaMA, and Mistral. During text generation, each new token depends on all previously generated tokens through the self-attention mechanism. Without caching, generating each new token would require recomputing the key and value projections for every prior token, making generation time grow quadratically with sequence length.

The KV cache stores the key and value tensors computed during the attention mechanism for all previous tokens. When generating the next token, only the new token's query, key, and value need to be computed; the cached keys and values from prior positions are reused. This drops the per-token cost from quadratic to linear in sequence length: each step still attends over all cached positions, but no longer recomputes their key and value projections.
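A minimal NumPy sketch of this decode loop makes the bookkeeping concrete (single attention head, random weights, all dimensions illustrative, not any particular model): each step projects only the new token and appends its key and value rows to the cache.

```python
import numpy as np

def attend(q, K, V):
    """Scaled dot-product attention for a single query vector."""
    scores = K @ q / np.sqrt(q.shape[-1])   # one score per cached position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

rng = np.random.default_rng(0)
d = 64                                      # head dimension (illustrative)
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

K_cache = np.empty((0, d))                  # one cached key row per past token
V_cache = np.empty((0, d))                  # one cached value row per past token

for step in range(8):                       # autoregressive decode loop
    x = rng.standard_normal(d)              # hidden state of the NEW token only
    q, k, v = x @ Wq, x @ Wk, x @ Wv        # project just this one token
    K_cache = np.vstack([K_cache, k])       # append instead of recomputing
    V_cache = np.vstack([V_cache, v])       # all past keys and values
    out = attend(q, K_cache, V_cache)       # attention still sees full history

print(K_cache.shape)  # → (8, 64): one cached key per generated token
```

Without the cache, the loop body would re-project every prior token's key and value at every step; with it, each step does a constant amount of projection work plus one attention pass over the cache.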

The memory cost of the KV cache is substantial and grows linearly with both sequence length and batch size. For a large language model like LLaMA 2 70B (80 layers, 8 key-value heads under grouped-query attention, head dimension 128), the KV cache for a single 4096-token sequence requires roughly 1.25 GiB of GPU memory in FP16. When serving many concurrent requests, the KV cache can consume more GPU memory than the model weights themselves. This is why KV cache management is one of the most important factors in LLM serving efficiency.
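The cost is easy to estimate from first principles: two tensors (keys and values) per layer, per key-value head, per token. A small helper shows the arithmetic, using the published LLaMA 2 70B shape parameters:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch,
                   bytes_per_elem=2):
    """KV cache size in bytes: 2x for keys and values, stored per layer,
    per KV head, per token, at the given element width (2 = FP16)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# LLaMA 2 70B: 80 layers, 8 KV heads (grouped-query attention), head dim 128
gib = kv_cache_bytes(80, 8, 128, seq_len=4096, batch=1) / 2**30
print(f"{gib:.2f} GiB")  # → 1.25 GiB for one 4096-token sequence in FP16
```

Note how strongly grouped-query attention helps here: with the full 64 query heads also carrying their own keys and values, the same sequence would need 8x more cache.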

Efficient KV cache management has become a major area of innovation. vLLM's PagedAttention allocates KV cache memory in fixed-size blocks rather than one contiguous region per sequence, dramatically reducing fragmentation. Other techniques include KV cache quantization, which stores cached values at reduced precision, and KV cache compression, which selectively evicts or merges cache entries for less important tokens to reduce memory consumption for very long sequences.
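To illustrate the quantization idea, here is a deliberately simple sketch of absmax INT8 quantization applied to a cached key tensor. This is per-tensor scaling for clarity; production systems typically quantize per-channel or per-token, and the shapes here are illustrative:

```python
import numpy as np

def quantize_kv(t):
    """Absmax INT8 quantization: map [-max, max] onto [-127, 127]."""
    scale = np.abs(t).max() / 127.0
    return np.round(t / scale).astype(np.int8), scale

def dequantize_kv(q, scale):
    """Recover an approximate FP32 tensor from INT8 values and the scale."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
K = rng.standard_normal((4096, 128)).astype(np.float32)  # cached keys
K_q, scale = quantize_kv(K)

print(K.nbytes // K_q.nbytes)  # → 4: INT8 uses 1/4 the memory of FP32
err = np.abs(dequantize_kv(K_q, scale) - K).max()
print(err < 0.05)              # rounding error stays small
```

The trade-off is the usual one for quantization: a 2-4x memory saving in exchange for a small, bounded approximation error in the attention inputs.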

Understanding KV cache behavior is essential for capacity planning when deploying language models. The maximum concurrent users a GPU can support depends not only on the model size but also on the expected sequence lengths and the efficiency of the KV cache management strategy. Platforms like Cumulus automatically handle KV cache optimization as part of the model serving infrastructure.
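A back-of-envelope capacity model follows directly from the numbers above. The sketch below is a rough upper bound, not a serving simulator; the weight size, per-sequence cache size, and overhead figure are hypothetical inputs you would replace with your own:

```python
def max_concurrent_sequences(gpu_mem_gib, weights_gib, per_seq_kv_gib,
                             overhead_gib=2.0):
    """Rough upper bound on how many full-length sequences fit in memory:
    whatever is left after weights and runtime overhead, divided by the
    KV cache footprint of one sequence."""
    free = gpu_mem_gib - weights_gib - overhead_gib
    return max(int(free // per_seq_kv_gib), 0)

# Hypothetical: 80 GiB GPU, ~13 GiB of FP16 weights for a 7B-class model,
# ~0.5 GiB of KV cache per 4096-token sequence
print(max_concurrent_sequences(80, 13, 0.5))  # → 130
```

Real capacity is lower: activation memory, fragmentation, and uneven sequence lengths all eat into the budget, which is exactly the gap that paged allocation and quantization aim to close.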