GPU Utilization
A metric measuring the percentage of time a GPU's compute cores are actively processing work, indicating how efficiently the hardware is being used.
GPU utilization measures the fraction of time the GPU is actively executing work. As reported by NVIDIA's nvidia-smi tool and its monitoring APIs (NVML), the figure is the percentage of the sampling period during which at least one kernel was running on the device: a GPU at 100% utilization had work in flight throughout the period, while a GPU at 10% utilization was idle 90% of the time. Note that 100% by this measure does not guarantee every streaming multiprocessor (SM) was busy, only that the GPU was never fully idle; even so, it is one of the most important headline metrics for evaluating infrastructure efficiency.
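The headline figure can be pulled programmatically by shelling out to nvidia-smi with its CSV query flags. A minimal sketch in Python (the helper names and the sample output below are illustrative; the nvidia-smi flags are real):

```python
import subprocess

def parse_utilization(csv_text):
    """Parse nvidia-smi CSV output into a list of per-GPU integer percentages."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]

def query_gpu_utilization():
    """Query utilization.gpu for every GPU on the node via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_utilization(out)

# Illustrative output for a 4-GPU node:
sample = "87\n12\n0\n95\n"
print(parse_utilization(sample))  # [87, 12, 0, 95]
```

For continuous monitoring, the NVML bindings (e.g. the pynvml package) expose the same counters without spawning a subprocess per sample.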
In practice, most GPU deployments suffer from low utilization. Industry surveys consistently report average GPU utilization rates between 15% and 35% in datacenter environments. The causes include over-provisioning for peak demand, idle time between requests, inefficient batching, memory-bound workloads that leave compute cores waiting for data, and workloads that do not fully leverage GPU parallelism.
Low GPU utilization directly translates to wasted money. If a GPU instance costs $3 per hour and runs at 20% utilization, 80% of that cost is paying for idle hardware. Serverless GPU platforms address this by sharing GPU hardware across many users and workloads. When one user's workload is idle, the GPU can serve another user's requests, achieving higher aggregate utilization and passing the savings on through pay-per-compute pricing.
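The waste described above is simple arithmetic, which can be worth encoding when comparing deployment options. A sketch (function name is illustrative):

```python
def idle_cost_per_hour(hourly_rate, utilization):
    """Dollars per hour spent on idle hardware, given utilization in [0, 1]."""
    return hourly_rate * (1.0 - utilization)

# The example from the text: a $3/hr instance running at 20% utilization
# burns $2.40 of every hour on idle silicon.
print(idle_cost_per_hour(3.0, 0.20))  # 2.4
```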
Improving GPU utilization requires attention at multiple levels. At the application level, batching requests increases the work done per GPU kernel launch. At the model level, quantization and operator fusion reduce memory bottlenecks that cause compute cores to stall. At the infrastructure level, intelligent scheduling and bin-packing place multiple workloads on shared GPUs, and autoscaling removes idle instances. Multi-instance GPU (MIG) technology on NVIDIA A100 and H100 GPUs enables hardware-level partitioning of a single GPU into isolated instances.
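The infrastructure-level bin-packing mentioned above can be sketched with a classic first-fit heuristic, modeling each workload as a fraction of one GPU's capacity (the function name and capacity model are illustrative; real schedulers also account for memory, bandwidth, and isolation constraints):

```python
def first_fit_pack(workloads, capacity=1.0):
    """First-fit bin packing: place each workload (a utilization fraction)
    on the first GPU with enough spare capacity, opening a new GPU if none fits.
    Returns (placement list, number of GPUs used)."""
    free = []       # remaining capacity per GPU
    placement = []  # GPU index assigned to each workload
    for w in workloads:
        for i, spare in enumerate(free):
            if w <= spare + 1e-9:
                free[i] = spare - w
                placement.append(i)
                break
        else:
            free.append(capacity - w)
            placement.append(len(free) - 1)
    return placement, len(free)

# Six workloads that would occupy six dedicated GPUs fit on three shared ones.
placement, n_gpus = first_fit_pack([0.5, 0.3, 0.6, 0.2, 0.4, 0.7])
print(n_gpus)  # 3
```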
Monitoring GPU utilization is essential for cost optimization and capacity planning. Teams should track not only the headline utilization percentage but also memory utilization, memory bandwidth utilization, and Tensor Core utilization to understand the full picture. A GPU might show 90% SM utilization but only 20% Tensor Core utilization, indicating that the workload is not effectively leveraging the GPU's most powerful compute units.
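A simple way to act on the multi-metric picture above is to flag any counter that lags far behind the headline SM utilization. A sketch, with an illustrative 50% threshold and hypothetical metric names:

```python
def diagnose(metrics, threshold=0.5):
    """Return the names of metrics falling below `threshold` times the
    headline SM utilization, e.g. high SM but low Tensor Core activity.
    The metric keys and threshold here are illustrative, not an NVML schema."""
    sm = metrics["sm_util"]
    return [name for name, value in metrics.items()
            if name != "sm_util" and value < sm * threshold]

# The scenario from the text: 90% SM utilization, 20% Tensor Core utilization.
print(diagnose({"sm_util": 0.90, "tensor_core_util": 0.20, "mem_bw_util": 0.70}))
# ['tensor_core_util']
```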
Related Terms
GPU Autoscaling
The automatic adjustment of the number of GPU instances serving a workload based on real-time demand, scaling up during traffic spikes and down during quiet periods.
Batch Inference
The technique of grouping multiple inference requests together and processing them simultaneously on a GPU to maximize throughput and hardware utilization.
GPU Memory Bandwidth
The rate at which data can be read from or written to GPU memory (VRAM), often the primary bottleneck for AI inference performance.
Pay-Per-Compute
A pricing model where users are billed only for the actual GPU compute time consumed during inference, rather than paying for reserved instances by the hour.
GPU Cluster
A group of interconnected servers equipped with GPUs that work together to provide scalable compute capacity for AI training, inference, and other GPU-accelerated workloads.