GPU Memory Bandwidth
The rate at which data can be read from or written to GPU memory (VRAM), often the primary bottleneck for AI inference performance.
GPU memory bandwidth measures how quickly data can be transferred between the GPU's processing cores and its high-bandwidth memory (HBM) or GDDR memory. It is typically expressed in gigabytes or terabytes per second and varies significantly across GPU models. For example, the NVIDIA A100 (80 GB) offers roughly 2 TB/s of memory bandwidth, while the H100 SXM delivers 3.35 TB/s.
For many AI inference workloads, particularly large language models, memory bandwidth is the primary performance bottleneck rather than raw computational throughput. During autoregressive text generation, producing each token requires streaming essentially all of the model's weights from memory. A 70-billion-parameter model in FP16 occupies 140 GB, so the speed at which those weights can be read directly bounds tokens-per-second throughput.
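This relationship yields a simple back-of-the-envelope bound: if every generated token streams the full weights once, peak decode throughput is roughly bandwidth divided by model size. A minimal sketch, using the 3.35 TB/s H100 figure cited above:

```python
def max_tokens_per_second(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on memory-bound decode throughput: each generated
    token requires one full pass over the model weights."""
    return bandwidth_gb_s / weights_gb

# 70B parameters * 2 bytes (FP16) = 140 GB of weights.
weights_gb = 70e9 * 2 / 1e9
print(max_tokens_per_second(3350, weights_gb))  # ~24 tokens/s at batch size 1
```

Real systems land below this ceiling (attention KV-cache reads and kernel overheads also consume bandwidth), but the bound explains why a 140 GB model cannot decode faster than a few dozen tokens per second on a single GPU.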
The ratio of computation to memory access, known as arithmetic intensity, determines whether a workload is compute-bound or memory-bound. Most inference workloads, especially at small batch sizes, have low arithmetic intensity and are therefore memory-bandwidth-bound. This is why increasing batch size improves GPU utilization: it amortizes the cost of reading the model weights across more inputs, raising arithmetic intensity.
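The compute-bound versus memory-bound decision can be made concrete by comparing a workload's arithmetic intensity to the GPU's "machine balance" (peak FLOPs per byte of bandwidth). A sketch using illustrative H100-class numbers (~989 TFLOPS dense FP16, 3.35 TB/s), which are assumptions here rather than values from the text:

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte transferred to or from memory."""
    return flops / bytes_moved

def is_memory_bound(intensity: float, peak_flops: float, peak_bw_bytes_s: float) -> bool:
    """A workload is memory-bound when its arithmetic intensity falls
    below the machine balance (peak FLOPs / peak bandwidth)."""
    return intensity < peak_flops / peak_bw_bytes_s

# Batch-1 FP16 decode: ~2 FLOPs per weight read, 2 bytes per weight,
# giving an intensity near 1 FLOP/byte; with batch size B it grows to ~B.
print(is_memory_bound(arithmetic_intensity(2, 2), 989e12, 3.35e12))  # True
```

With a machine balance near 300 FLOPs/byte under these assumptions, batch-1 decoding sits far on the memory-bound side, which is why batching helps so much.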
Memory bandwidth constraints have driven innovation in model optimization techniques. Quantization reduces the number of bytes that must be read per parameter, directly improving memory-bandwidth-limited throughput. A model quantized from FP16 to INT4 reads a quarter as many bytes per parameter, which translates to a roughly proportional speedup for memory-bound workloads.
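For a memory-bound workload, the ideal speedup is simply the ratio of bytes read before and after quantization, as in this sketch (real quantized kernels fall somewhat short of the ideal because of dequantization overhead):

```python
def ideal_memory_bound_speedup(bits_before: int, bits_after: int) -> float:
    """Best-case throughput gain when per-parameter weight reads shrink
    proportionally with numeric precision."""
    return bits_before / bits_after

print(ideal_memory_bound_speedup(16, 4))  # FP16 -> INT4: 4.0
```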
Understanding memory bandwidth is essential for capacity planning and GPU selection. When evaluating GPUs for inference, teams should consider not just the total VRAM capacity but also the bandwidth. A GPU with ample memory but insufficient bandwidth will leave compute resources idle waiting for data, while a GPU with high bandwidth but insufficient memory cannot fit the model at all.
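Both considerations from the paragraph above can be combined into a simple screening helper; the GPU specs passed in below are hypothetical placeholders, not vendor figures:

```python
def gpu_suitable(model_gb: float, vram_gb: float,
                 bandwidth_gb_s: float, target_tok_s: float) -> bool:
    """Screen a GPU for batch-1 inference: the model must fit in VRAM,
    and the bandwidth-implied token rate must meet the target."""
    if model_gb > vram_gb:
        return False  # insufficient capacity: model cannot be loaded at all
    return bandwidth_gb_s / model_gb >= target_tok_s

# Hypothetical 80 GB / 2,000 GB/s card serving a 35 GB quantized model:
print(gpu_suitable(35, 80, 2000, 20))   # True (~57 tokens/s upper bound)
print(gpu_suitable(140, 80, 2000, 1))   # False (does not fit in VRAM)
```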
Related Terms
Model Quantization
The process of reducing the numerical precision of a model's weights and activations (e.g., from 32-bit to 8-bit or 4-bit) to decrease memory usage and increase inference speed.
GPU Inference
The process of running a trained machine learning model on a GPU to generate predictions or outputs from new input data.
Tensor Cores
Specialized hardware units within NVIDIA GPUs designed to accelerate matrix multiplication and convolution operations used in deep learning.
GPU Utilization
A metric measuring the percentage of time a GPU's compute cores are actively processing work, indicating how efficiently the hardware is being used.
Model Weights
The learned numerical parameters of a neural network that encode the model's knowledge and capabilities, stored as large multi-dimensional arrays of floating-point numbers.