GPU Autoscaling
The automatic adjustment of the number of GPU instances serving a workload based on real-time demand, scaling up during traffic spikes and down during quiet periods.
GPU autoscaling dynamically adjusts the number of active GPU instances allocated to a workload based on incoming traffic, queue depth, or other signals. When demand increases, additional GPU instances are provisioned and added to the serving pool. When demand decreases, excess instances are drained and deallocated, reducing costs. The most aggressive form of autoscaling, scale-to-zero, removes all instances when there is no traffic at all.
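The sizing logic described above can be sketched in a few lines. This is a minimal, hypothetical example (the function name, target-per-replica heuristic, and cap are illustrative, not any particular platform's API): size the pool so each instance handles a bounded share of the queue, and return zero replicas when the queue is empty (scale-to-zero).

```python
import math

def desired_replicas(queue_depth: int, target_per_replica: int,
                     max_replicas: int) -> int:
    """Reactive sizing sketch: enough replicas so each handles at most
    target_per_replica queued requests; 0 when idle (scale-to-zero)."""
    if queue_depth == 0:
        return 0  # no traffic: deallocate everything
    needed = math.ceil(queue_depth / target_per_replica)
    return min(needed, max_replicas)  # never exceed the burst cap
```

For example, with 45 queued requests and a target of 10 per replica, the sketch asks for 5 instances; with an empty queue it asks for none.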
Autoscaling for GPU workloads is significantly more challenging than autoscaling for CPU-based web applications. GPU provisioning takes longer due to hardware allocation, driver initialization, and model loading. Cold start latency means that reactive scaling alone may not be fast enough for traffic spikes. Effective GPU autoscaling requires a combination of reactive scaling based on current load and predictive scaling based on anticipated traffic patterns.
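One common way to combine the two signals is to take the maximum of the reactive and predictive capacity estimates, floored by a warm pool that absorbs cold-start latency. A minimal sketch, with all names and the warm-floor policy assumed for illustration:

```python
def plan_capacity(reactive_need: int, predicted_need: int,
                  min_warm: int, max_replicas: int) -> int:
    """Blend reactive and predictive scaling: serve whichever signal
    demands more, keep a warm floor for cold starts, respect the cap."""
    target = max(reactive_need, predicted_need, min_warm)
    return min(target, max_replicas)
```

If current load suggests 3 instances but a traffic forecast for the next few minutes suggests 6, the planner provisions 6 ahead of the spike rather than waiting for queues to build.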
Key metrics used to drive GPU autoscaling decisions include request queue depth, GPU utilization percentage, inference latency percentiles, and requests per second. Each metric has tradeoffs. Queue-depth-based scaling responds quickly to demand changes but can oscillate if not dampened. Utilization-based scaling provides stable behavior but may not react fast enough to sudden traffic spikes. Production systems typically use a combination of metrics with configurable thresholds.
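The dampening mentioned above is often implemented as asymmetric hysteresis: scale up immediately on queue pressure, but scale down only after several consecutive quiet observations so the pool does not oscillate. A sketch under assumed thresholds (the class name, queue/utilization cutoffs, and sample counts are all illustrative):

```python
class DampedScaler:
    """Combine queue depth and GPU utilization with asymmetric hysteresis:
    react fast to backlog, release capacity only after sustained idleness."""

    def __init__(self, up_queue: int = 20, down_util: float = 0.30,
                 quiet_needed: int = 5):
        self.up_queue = up_queue          # queue depth that triggers scale-up
        self.down_util = down_util        # utilization below this counts as quiet
        self.quiet_needed = quiet_needed  # quiet samples required before scale-down
        self.quiet = 0                    # consecutive quiet observations seen

    def decide(self, queue_depth: int, gpu_util: float, replicas: int) -> int:
        if queue_depth > self.up_queue:
            self.quiet = 0
            return replicas + 1           # scale up immediately on backlog
        if gpu_util < self.down_util:
            self.quiet += 1
            if self.quiet >= self.quiet_needed:
                self.quiet = 0
                return max(replicas - 1, 0)  # sustained idle: shed one replica
        else:
            self.quiet = 0                # busy sample resets the quiet streak
        return replicas
```

With `quiet_needed=3`, two idle samples leave the pool unchanged and only the third triggers a scale-down, while a single deep-queue sample scales up at once.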
The scaling speed — how quickly new instances become ready to serve traffic — is a critical differentiator among serverless GPU platforms. A platform with 60-second cold starts needs to scale proactively with a longer lead time, often requiring persistent warm capacity. A platform like Cumulus with 12.5-second cold starts can scale reactively to traffic changes while maintaining acceptable latency, enabling more aggressive scale-down policies and greater cost savings.
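The relationship between cold-start time and reactive scaling can be stated as a simple feasibility check: a newly provisioned instance only helps if it becomes ready before queued requests breach the latency SLO. The numbers and parameter names below are illustrative assumptions, not platform guarantees:

```python
def can_scale_reactively(cold_start_s: float, detect_delay_s: float,
                         queue_slo_s: float) -> bool:
    """Purely reactive scaling is viable only if the cold start fits in
    the slack between detecting a spike and breaching the queueing SLO."""
    return cold_start_s <= queue_slo_s - detect_delay_s
```

Under an assumed 30-second queueing SLO and 5 seconds of detection delay, a 12.5-second cold start fits comfortably in the 25 seconds of slack, while a 60-second cold start forces the platform to hold persistent warm capacity instead.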
Advanced autoscaling strategies include scheduled scaling for predictable traffic patterns, burst scaling with cloud spillover for on-premises clusters, and per-model scaling policies that account for different models having different resource requirements and latency sensitivities. The goal is to maintain the target latency SLO while minimizing the total GPU-seconds consumed.
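Per-model policies are often expressed as declarative configuration keyed by model name. The sketch below is hypothetical throughout (field names, model names, and values are invented for illustration): a latency-sensitive chat model keeps a warm floor and a tight SLO, while a batch embedding model tolerates cold starts and scales to zero.

```python
from dataclasses import dataclass

@dataclass
class ModelScalingPolicy:
    """Illustrative per-model autoscaling policy."""
    min_replicas: int        # warm floor (0 permits scale-to-zero)
    max_replicas: int        # cap on burst capacity
    target_p95_ms: float     # latency SLO that drives scale-up
    scale_down_delay_s: int  # dampening delay before releasing GPUs

policies = {
    "chat-llm-7b":    ModelScalingPolicy(1, 16, 500.0, 120),  # latency-sensitive
    "batch-embedder": ModelScalingPolicy(0, 8, 5000.0, 30),   # scale-to-zero OK
}
```

Keeping the policy per model lets the controller trade GPU-seconds against latency differently for each workload instead of applying one global threshold.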
Related Terms
Cold Start
The initial delay when a serverless GPU instance must be provisioned, loaded, and initialized before it can serve its first request.
Scale to Zero
The ability of a serverless platform to completely deallocate all compute resources when there are no active requests, reducing cost to zero during idle periods.
Serverless GPU
A cloud computing model where GPU resources are automatically provisioned on demand and billed per second of actual use, with no server management required.
GPU Utilization
A metric measuring the percentage of time a GPU's compute cores are actively processing work, indicating how efficiently the hardware is being used.
Container Orchestration
The automated management of containerized application lifecycles, including deployment, scaling, networking, and health monitoring across a cluster of machines.