Cumulus — The Fastest Serverless GPU Cloud for AI Inference
Deploy Any Model. The Fastest GPU Cold Starts.
Serverless GPU cloud with 12.5s cold starts. Cloud GPU inference that scales to zero — pay only for compute.
Founded by alumni from [company logos]
GPU Cold Start to First Inference
Serverless GPU Inference — Flux 2 Diffusion Model
* Based on internal testing with memory snapshots and torch.compile() enabled on Modal.
How Serverless GPU Hosting Works
Deploy to GPU Cloud Instantly
Deploy your model to serverless GPUs with one function call. Get back a model_id and inference endpoint ready to use.
from cumulus import deploy
# Deploy your model
model = deploy("./flux-2-schnell")
# Returns model_id and endpoint
print(model.id) # "flux-2-abc123"
print(model.endpoint) # "api.cumulus.io/flux-2-abc123"

Run GPU Inference
Use the model_id to run GPU inference, or hit the endpoint directly from any language.
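Because the endpoint is plain HTTP, any language with an HTTP client can call it. Here is a standard-library-only Python sketch; the https:// scheme, JSON body shape, and Content-Type header are assumptions about the endpoint contract, not confirmed API details:

```python
import json
import urllib.request

# Stdlib-only sketch of a direct endpoint call. The https:// scheme,
# JSON body shape, and Content-Type header are assumptions -- check
# the endpoint contract for your deployment.
payload = json.dumps({"prompt": "a watch on marble"}).encode()
req = urllib.request.Request(
    "https://api.cumulus.io/flux-2-abc123",
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Sending is commented out so the sketch stays offline:
# with urllib.request.urlopen(req, timeout=60) as resp:
#     result = json.load(resp)
```

The Cumulus SDK wraps the same call, as shown below.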
from cumulus import run
# Call using model_id
result = run("flux-2-abc123", {
"prompt": "a watch on marble"
})
# Or use the endpoint directly (requires the requests package)
import requests
requests.post("https://api.cumulus.io/flux-2-abc123", ...)

Never Think About GPU Infra
We handle GPU selection, replicas, autoscaling, and failover. Fast GPU hosting without the ops.
Scale from 1 to 100+ replicas instantly
You write code. We handle the rest.
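From the client side, scaling out is nothing more than concurrent calls; the platform adds replicas to absorb the burst. A sketch of that fan-out, where `fake_run` is a hypothetical stand-in for `cumulus.run` (a real call needs a live deployment):

```python
from concurrent.futures import ThreadPoolExecutor

def fake_run(model_id: str, payload: dict) -> dict:
    # Hypothetical stand-in for cumulus.run; a real call would hit
    # the deployed model's endpoint.
    return {"model_id": model_id, "prompt": payload["prompt"]}

# Fire 100 concurrent requests; on a serverless GPU platform the
# replica count scales up to absorb the burst and back down after.
with ThreadPoolExecutor(max_workers=16) as pool:
    futures = [
        pool.submit(fake_run, "flux-2-abc123", {"prompt": f"image {i}"})
        for i in range(100)
    ]
    results = [f.result() for f in futures]

print(len(results))  # 100
```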
Pay Per GPU Compute Cycle
Billed by actual GPU compute used, not idle time. Scale to zero, pay nothing when idle. The cheapest way to run GPU inference.
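The math behind pay-per-compute is simple to sketch. The per-second rate below is a made-up example for illustration, not published Cumulus pricing:

```python
# Illustrative billing math -- the rate is a made-up example,
# not published Cumulus pricing.
RATE = 0.0011  # hypothetical USD per GPU-second

busy_seconds = 2 * 3600         # 2 hours of actual inference per month
month_seconds = 30 * 24 * 3600  # an always-on GPU bills every second

serverless_bill = busy_seconds * RATE  # idle time costs nothing
always_on_bill = month_seconds * RATE

print(f"serverless: ${serverless_bill:.2f}")
print(f"always-on:  ${always_on_bill:.2f}")
```

With scale-to-zero, the idle hours between requests simply never appear on the bill.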
Cumulus OS
On-premises GPU hosting for your own private cloud.
GPU cluster management powered by Ion.