GPU Glossary

Pay-Per-Compute

A pricing model where users are billed only for the actual GPU compute time consumed during inference, rather than paying for reserved instances by the hour.

Pay-per-compute is a billing model used by serverless GPU platforms where charges accrue only during active computation. Users pay for the actual GPU-seconds consumed while processing inference requests, and the meter stops the moment computation finishes. There is no charge for idle time, provisioning time, or time spent waiting for requests. This contrasts with reserved instance pricing, where users pay a fixed hourly or monthly rate regardless of actual utilization.
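The difference between the two models can be sketched with a small cost comparison. All rates and the 5% utilization figure below are hypothetical, chosen only to illustrate the mechanics:

```python
# Hypothetical rates for illustration -- not any provider's real pricing.
RESERVED_RATE_PER_HOUR = 2.50           # reserved instance: billed per hour, used or not
PER_SECOND_RATE = 2.50 / 3600           # pay-per-compute: billed per active GPU-second

def reserved_monthly_cost(hours_reserved: float) -> float:
    """Reserved pricing: pay for every reserved hour regardless of utilization."""
    return hours_reserved * RESERVED_RATE_PER_HOUR

def pay_per_compute_monthly_cost(busy_seconds: float) -> float:
    """Pay-per-compute: the meter runs only while computation is active."""
    return busy_seconds * PER_SECOND_RATE

# A service that is actively computing 5% of the time over a 30-day month:
hours_in_month = 30 * 24
busy_seconds = hours_in_month * 3600 * 0.05

reserved = reserved_monthly_cost(hours_in_month)       # pays for all 720 hours
metered = pay_per_compute_monthly_cost(busy_seconds)   # pays for 36 busy hours
```

At 5% utilization the metered bill is one-twentieth of the reserved one; as utilization rises toward 100%, the two converge.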

The granularity of pay-per-compute billing varies across providers. Some bill by the second, others by the millisecond, and some round up to the nearest minute. Finer granularity benefits workloads with short, frequent inference calls. A workload whose requests each take 200 milliseconds is billed for only 200 milliseconds per request under millisecond billing, versus a full second under per-second billing and a full minute under per-minute billing.
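The effect of granularity is just ceiling-rounding on each request's duration. A minimal sketch, using a hypothetical per-GPU-second rate:

```python
import math

PER_SECOND_RATE = 0.001  # hypothetical: $0.001 per GPU-second

def billed_cost(duration_ms: float, granularity_ms: float) -> float:
    """Round the request duration up to the provider's billing granularity,
    then charge the rounded time at the per-second rate."""
    billed_ms = math.ceil(duration_ms / granularity_ms) * granularity_ms
    return billed_ms / 1000 * PER_SECOND_RATE

# The same 200 ms inference request under three billing granularities:
ms_billing = billed_cost(200, 1)         # billed for 200 ms
sec_billing = billed_cost(200, 1_000)    # rounds up to a full second
min_billing = billed_cost(200, 60_000)   # rounds up to a full minute
```

Here per-second billing charges 5x more than per-millisecond billing for the same request, and per-minute billing charges 300x more, which is why granularity matters most for short, frequent calls.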

Pay-per-compute fundamentally changes the cost optimization calculus for AI teams. Instead of optimizing for high utilization of reserved instances, teams optimize for inference efficiency — reducing the GPU time required per request. Techniques like model quantization, efficient batching, and faster inference engines directly reduce the per-request cost. A model that runs 2x faster through quantization costs half as much per inference under pay-per-compute pricing.
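Because cost is proportional to GPU time, any latency reduction flows directly into the bill. A sketch of the 2x-speedup claim, again with a hypothetical rate:

```python
PER_SECOND_RATE = 0.001  # hypothetical $/GPU-second

def cost_per_request(gpu_seconds: float) -> float:
    """Under pay-per-compute, per-request cost is just GPU time x rate."""
    return gpu_seconds * PER_SECOND_RATE

baseline = cost_per_request(0.400)        # 400 ms per request before optimization
quantized = cost_per_request(0.400 / 2)   # hypothetical 2x speedup from quantization

# Halving GPU time per request halves the per-request cost:
assert quantized == baseline / 2
```

The same proportionality holds for batching and faster inference engines: whatever reduces GPU-seconds per request reduces cost per request by the same factor.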

This pricing model aligns provider and customer incentives in a way reserved pricing does not. The provider is incentivized to achieve fast cold starts (so users trust scale-to-zero), high GPU utilization across their fleet (to reduce their own costs), and efficient inference runtimes (so users process requests quickly and can serve more requests per dollar). The customer benefits from all of these improvements through lower bills, without needing to implement the optimizations themselves.

Pay-per-compute makes GPU access economically viable for a broader range of applications. Startups and small teams can deploy GPU-powered features without committing to expensive reserved instances. Experimental models can be deployed to production with minimal financial risk. And mature applications can let usage-based costs scale naturally with revenue, maintaining consistent unit economics as traffic grows.