Scale to Zero
The ability of a serverless platform to completely deallocate all compute resources when there are no active requests, reducing cost to zero during idle periods.
Scale to zero is a capability of serverless computing platforms where all allocated resources — including GPU instances — are completely released when no requests are being processed. Unlike traditional cloud deployments that maintain at least one running instance, scale-to-zero deployments incur no cost during idle periods. The platform provisions resources on demand when the next request arrives, performing a cold start as needed.
The economic impact of scale to zero is significant for workloads with intermittent or variable traffic. A model that receives traffic only during business hours (roughly 10 hours per day) costs about 42% as much with scale to zero as with a 24/7 reserved instance (10/24 ≈ 42%). For development and staging environments that are used even less frequently, the savings can exceed 90%. This makes GPU compute accessible to teams and projects that cannot justify the cost of always-on instances.
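The arithmetic above can be sketched in a few lines. The hourly rate below is a hypothetical placeholder, not a real price; the point is the ratio of active hours to total hours, which drives the savings regardless of rate.

```python
# Sketch: monthly cost of scale to zero vs. an always-on reserved instance.
HOURLY_GPU_RATE = 2.50  # USD per GPU-hour (assumed, for illustration only)

def monthly_cost(active_hours_per_day: float, days: int = 30) -> float:
    """Cost when billed only for hours the instance is actually running."""
    return active_hours_per_day * days * HOURLY_GPU_RATE

always_on = monthly_cost(24)       # 24/7 reserved instance
business_hours = monthly_cost(10)  # ~10 active hours per day
staging = monthly_cost(2)          # lightly used staging environment

print(f"always-on:      ${always_on:,.2f}")
print(f"business hours: {business_hours / always_on:.0%} of always-on cost")
print(f"staging:        {1 - staging / always_on:.0%} savings")
```

With these inputs, business-hours traffic lands at roughly 42% of the always-on cost and the staging environment saves over 90%, matching the figures above.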
The key tradeoff with scale to zero is cold start latency. When all instances have been deallocated, the first request after an idle period must wait for a cold start before receiving a response. This makes the cold start time of the underlying platform critically important. A platform with 60-second cold starts makes scale to zero impractical for many real-time applications, while a platform like Cumulus with 12.5-second cold starts makes it viable for a much broader range of use cases.
Scale to zero requires careful tuning to avoid premature scale-down. Platforms typically use configurable idle timeouts: the duration of inactivity after which instances are deallocated. A 5-minute timeout means instances persist for 5 minutes after the last request, ready to serve follow-up traffic without a cold start. Shorter timeouts maximize cost savings but increase the frequency of cold starts.
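The timeout tradeoff can be made concrete by replaying a request trace against different idle timeouts and counting cold starts versus billed warm-idle time. The trace and the replay logic below are illustrative, not any platform's actual scheduler.

```python
# Sketch: replay request arrival times against an idle timeout to count
# cold starts and the warm-idle seconds billed between requests.
def replay(request_times: list[float], idle_timeout: float) -> tuple[int, float]:
    """Return (cold_starts, idle_seconds_billed) for a request trace."""
    cold_starts = 0
    idle_billed = 0.0
    last = None  # time of the previous request; None means scaled to zero
    for t in request_times:
        if last is None or t - last > idle_timeout:
            cold_starts += 1                 # instance had been deallocated
            if last is not None:
                idle_billed += idle_timeout  # warm time paid before scale-down
        else:
            idle_billed += t - last          # instance stayed warm between requests
        last = t
    return cold_starts, idle_billed

trace = [0, 60, 400, 2000, 2100]  # request arrival times in seconds

print(replay(trace, idle_timeout=300))  # 5-minute timeout -> (3, 760.0)
print(replay(trace, idle_timeout=60))   # 1-minute timeout -> (4, 240.0)
```

For this trace, cutting the timeout from 5 minutes to 1 minute reduces billed idle time from 760 to 240 seconds but adds an extra cold start, which is exactly the savings-versus-latency tradeoff described above.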
For production workloads, teams often use scale to zero in combination with minimum instance counts. Non-critical endpoints might scale to zero freely, while latency-sensitive production endpoints maintain a minimum of one warm instance at all times. This hybrid approach balances cost optimization with latency requirements, and serverless GPU platforms typically provide per-endpoint configuration for these policies.
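A hybrid policy like this is usually expressed as per-endpoint configuration. The sketch below shows one plausible shape for such a policy; the field and endpoint names are hypothetical, not any specific platform's API.

```python
# Sketch: per-endpoint scaling policies mixing scale to zero with warm minimums.
# All names here are illustrative assumptions, not a real platform's schema.
from dataclasses import dataclass

@dataclass
class ScalingPolicy:
    min_instances: int         # 0 enables scale to zero
    max_instances: int         # ceiling during traffic spikes
    idle_timeout_seconds: int  # inactivity before scale-down

policies = {
    # Latency-sensitive production endpoint: always keep one warm instance.
    "prod-chat":    ScalingPolicy(min_instances=1, max_instances=8,  idle_timeout_seconds=600),
    # Batch workload: tolerate cold starts, scale to zero aggressively.
    "batch-embed":  ScalingPolicy(min_instances=0, max_instances=16, idle_timeout_seconds=120),
    # Staging: rarely used, so minimize idle cost.
    "staging-chat": ScalingPolicy(min_instances=0, max_instances=2,  idle_timeout_seconds=60),
}

scale_to_zero_endpoints = [name for name, p in policies.items() if p.min_instances == 0]
print(scale_to_zero_endpoints)  # ['batch-embed', 'staging-chat']
```

Setting `min_instances=1` on the production endpoint eliminates cold starts for first requests at the cost of one always-on instance, while the other endpoints keep the full scale-to-zero savings.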
Related Terms
Cold Start
The initial delay when a serverless GPU instance must be provisioned, loaded, and initialized before it can serve its first request.
Serverless GPU
A cloud computing model where GPU resources are automatically provisioned on demand and billed per second of actual use, with no server management required.
Pay-Per-Compute
A pricing model where users are billed only for the actual GPU compute time consumed during inference, rather than paying for reserved instances by the hour.
GPU Autoscaling
The automatic adjustment of the number of GPU instances serving a workload based on real-time demand, scaling up during traffic spikes and down during quiet periods.