GPU Glossary

Cold Start

The initial delay when a serverless GPU instance must be provisioned, loaded, and initialized before it can serve its first request.

A cold start occurs when a serverless GPU platform receives a request but has no active instance ready to handle it. The platform must allocate a GPU, pull the container image, load the model weights into GPU memory, and initialize the inference runtime before the first prediction can be served. This entire sequence contributes to the cold start latency.
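Because these phases run sequentially, the first-request delay is roughly their sum. The sketch below illustrates that, using entirely hypothetical phase durations (real values depend on the platform, image size, and model size):

```python
# Hypothetical phase durations in seconds, for illustration only;
# actual values vary widely across providers and workloads.
COLD_START_PHASES = {
    "allocate_gpu": 2.0,             # schedule the request onto a free GPU
    "pull_container_image": 6.0,     # fetch image layers not already cached
    "load_model_weights": 4.0,       # copy weights from disk into GPU memory
    "init_inference_runtime": 1.5,   # start the server, warm up kernels
}

def cold_start_latency(phases: dict[str, float]) -> float:
    """The phases run one after another, so total latency is their sum."""
    return sum(phases.values())

print(f"first request served after ~{cold_start_latency(COLD_START_PHASES):.1f} s")
```

Shrinking any single phase (a smaller image, cached layers, faster weight loading) directly shortens the total, which is why the factors listed below each matter independently.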

Cold start times vary dramatically across providers and workloads. Traditional cloud GPU provisioning can take several minutes, while modern serverless GPU platforms like Cumulus have reduced this to as little as 12.5 seconds. The primary factors affecting cold start duration include container image size, model weight size, GPU availability in the cluster, and the efficiency of the platform's orchestration layer.

Cold start latency is one of the most critical metrics for production AI applications. In real-time use cases such as chatbots, image generation, or code completion, a long cold start means that the first user to arrive after an idle period experiences unacceptable latency. Teams often resort to workarounds such as keeping instances warm with periodic pings, which defeats the cost benefit of serverless.
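The keep-warm workaround is typically just a loop that pings the endpoint more often than the platform's idle timeout. A minimal sketch, with the actual HTTP request replaced by a placeholder since the endpoint is hypothetical:

```python
import threading

ping_count = 0

def ping_endpoint() -> None:
    # Placeholder for a real HTTP health-check request to the inference
    # endpoint (the URL and route would be platform-specific).
    global ping_count
    ping_count += 1

def keep_warm(interval_s: float, stop: threading.Event) -> None:
    """Ping more often than the platform's idle timeout so the instance
    never scales to zero -- which also means it never stops billing."""
    while not stop.is_set():
        ping_endpoint()
        stop.wait(interval_s)

stop = threading.Event()
worker = threading.Thread(target=keep_warm, args=(0.05, stop), daemon=True)
worker.start()
stop.wait(0.18)  # stand-in for the application's lifetime
stop.set()
worker.join()
print(f"pings sent: {ping_count}")
```

The instance stays warm for as long as the loop runs, so the cost scales with wall-clock time rather than with actual traffic.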

Modern platforms address cold starts through techniques such as pre-warmed GPU pools, container layer caching, memory snapshot restoration, and predictive scaling. Pre-warmed pools keep GPUs provisioned and ready for assignment, eliminating hardware provisioning time. Snapshot restoration allows the platform to restore a previously captured GPU memory state rather than re-loading model weights from scratch.
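The pre-warmed pool idea can be sketched as a queue of already-provisioned instances: assignment pops from the queue on a fast path and falls back to slow provisioning only when the pool is exhausted. This is an illustrative toy, not any platform's actual scheduler:

```python
import itertools
from collections import deque

_instance_ids = itertools.count()

class PrewarmedPool:
    """Toy pre-warmed GPU pool: instances are provisioned ahead of demand,
    so assignment skips the hardware-provisioning phase of a cold start."""

    def __init__(self, target_size: int):
        self.target_size = target_size
        self.ready: deque[str] = deque()
        self.replenish()

    def provision_gpu(self) -> str:
        # Stand-in for the slow path: acquiring and booting a real GPU host.
        return f"gpu-{next(_instance_ids)}"

    def replenish(self) -> None:
        # In a real platform this would run asynchronously in the background.
        while len(self.ready) < self.target_size:
            self.ready.append(self.provision_gpu())

    def acquire(self) -> str:
        if self.ready:                       # fast path: no provisioning delay
            instance = self.ready.popleft()
        else:                                # pool exhausted: full cold start
            instance = self.provision_gpu()
        self.replenish()                     # top the pool back up
        return instance

pool = PrewarmedPool(target_size=2)
print(pool.acquire())  # served from the warm pool, not freshly provisioned
```

Sizing the pool is a cost/latency trade-off: a larger pool absorbs bigger traffic bursts without cold starts, but every idle pooled instance is provisioned hardware the platform is paying for.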

Minimizing cold starts is essential for achieving the promise of serverless GPU inference: pay only for what you use, scale to zero when idle, and still deliver fast responses when traffic arrives. The ability to cold start quickly is what makes true scale-to-zero economically viable for production workloads.