GPU Glossary

Scale to Zero

The ability of a serverless platform to completely deallocate all compute resources when there are no active requests, reducing cost to zero during idle periods.

Scale to zero is a capability of serverless computing platforms where all allocated resources — including GPU instances — are completely released when no requests are being processed. Unlike traditional cloud deployments that maintain at least one running instance, scale-to-zero deployments incur no cost during idle periods. The platform provisions resources on demand when the next request arrives, performing a cold start as needed.

The economic impact of scale to zero is significant for workloads with intermittent or variable traffic. A model that receives traffic only during business hours (roughly 10 hours per day) would cost about 42% as much under scale to zero as on a 24/7 reserved instance. For development and staging environments that are used even less frequently, the savings can exceed 90%. This makes GPU compute accessible to teams and projects that cannot justify the cost of always-on instances.
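The arithmetic is simple enough to sketch in a few lines. The hourly rate below is a placeholder, not a real price; only the ratio matters:

```python
# Back-of-the-envelope cost comparison: scale to zero vs. an always-on
# instance. HOURLY_RATE is a hypothetical placeholder rate.
HOURLY_RATE = 4.00  # hypothetical $/hour for one GPU instance

def monthly_cost(active_hours_per_day: float, days: int = 30) -> float:
    """Cost when billing only for hours with active traffic."""
    return active_hours_per_day * days * HOURLY_RATE

always_on = monthly_cost(24)       # 24/7 reserved instance
business_hours = monthly_cost(10)  # traffic ~10 hours/day

print(f"always-on:     ${always_on:,.2f}/month")
print(f"scale to zero: ${business_hours:,.2f}/month")
print(f"relative cost: {business_hours / always_on:.0%}")  # ~42%
```

The same function with one active hour per day gives the >90% savings cited for lightly used staging environments.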

The key tradeoff with scale to zero is cold start latency. When all instances have been deallocated, the first request after an idle period must wait for a cold start before receiving a response. This makes the cold start time of the underlying platform critically important. A platform with 60-second cold starts makes scale to zero impractical for many real-time applications, while a platform like Cumulus with 12.5-second cold starts makes it viable for a much broader range of use cases.
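How often that first-request penalty is actually paid depends on traffic. As a rough model (a simplification; real traffic is burstier than this), assume Poisson arrivals and an idle timeout of T seconds: a request hits a cold start when the gap since the previous request exceeds T, which happens with probability exp(-rate * T):

```python
# Rough estimate of the fraction of requests that hit a cold start,
# assuming Poisson arrivals. This is a modeling simplification, not a
# measurement from any real platform.
import math

def cold_start_fraction(requests_per_hour: float, idle_timeout_s: float) -> float:
    """P(inter-arrival gap > idle timeout) under Poisson arrivals."""
    rate_per_s = requests_per_hour / 3600.0
    return math.exp(-rate_per_s * idle_timeout_s)

# At 60 requests/hour with a 5-minute timeout:
# exp(-(60/3600) * 300) = exp(-5), under 1% of requests are cold.
```

Under this model, cold starts are rare for steady traffic but approach 100% of requests as traffic becomes sparse, which is exactly the regime where cold start duration dominates user experience.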

Scale to zero requires careful tuning to avoid deallocating instances prematurely. Platforms typically use configurable idle timeouts — the duration of inactivity before instances are deallocated. A 5-minute timeout means instances persist for 5 minutes after the last request, ready to serve any follow-up traffic without a cold start. Shorter timeouts maximize cost savings but increase the frequency of cold starts.
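The idle-timeout logic can be sketched as a small state machine. All names here are illustrative, not any platform's real API:

```python
# Minimal sketch of idle-timeout scale-to-zero logic. Hypothetical names;
# a real platform would manage fleets of instances, not a single counter.
import time

class IdleScaler:
    def __init__(self, idle_timeout_s: float = 300.0):  # 5-minute default
        self.idle_timeout_s = idle_timeout_s
        self.last_request_at = time.monotonic()
        self.instances = 1

    def on_request(self) -> None:
        # Record activity; provisioning from zero is where the cold start
        # penalty is paid on a real platform.
        if self.instances == 0:
            self.instances = 1
        self.last_request_at = time.monotonic()

    def tick(self) -> None:
        # Called periodically by the control plane: deallocate everything
        # once the endpoint has been idle past the timeout.
        idle = time.monotonic() - self.last_request_at
        if self.instances > 0 and idle >= self.idle_timeout_s:
            self.instances = 0
```

Lowering `idle_timeout_s` trades warm-instance cost for a higher chance that the next request arrives after `tick` has already scaled to zero.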

For production workloads, teams often use scale to zero in combination with minimum instance counts. Non-critical endpoints might scale to zero freely, while latency-sensitive production endpoints maintain a minimum of one warm instance at all times. This hybrid approach balances cost optimization with latency requirements, and serverless GPU platforms typically provide per-endpoint configuration for these policies.
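Such a hybrid policy might be expressed per endpoint roughly as follows. The field names, endpoint names, and the requests-per-instance heuristic are all hypothetical:

```python
# Sketch of per-endpoint scaling policies combining scale to zero with
# minimum warm instances. All names and values are illustrative.
from dataclasses import dataclass

@dataclass
class ScalingPolicy:
    min_instances: int   # 0 enables scale to zero
    max_instances: int
    idle_timeout_s: int

POLICIES = {
    # Non-critical endpoint: scales to zero freely, short timeout.
    "staging/chat-model": ScalingPolicy(min_instances=0, max_instances=4, idle_timeout_s=120),
    # Latency-sensitive endpoint: always keeps one warm instance.
    "prod/chat-model": ScalingPolicy(min_instances=1, max_instances=16, idle_timeout_s=600),
}

def target_instances(policy: ScalingPolicy, in_flight_requests: int,
                     requests_per_instance: int = 8) -> int:
    """Clamp a demand-driven instance count to the policy's bounds."""
    needed = -(-in_flight_requests // requests_per_instance)  # ceiling division
    return max(policy.min_instances, min(policy.max_instances, needed))
```

The production endpoint never drops below one instance even with zero traffic, while the staging endpoint scales all the way to zero and pays a cold start on its next request.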