GPU Glossary

Inference Latency

The total time elapsed from when an inference request is received to when the response is returned, including preprocessing, model execution, and postprocessing.

Inference latency measures the end-to-end time required to process a single inference request, from the moment the request arrives at the serving endpoint to the moment the response is sent back to the caller. It encompasses network transmission time, request queuing, input preprocessing, GPU computation, output postprocessing, and response serialization. For user-facing applications, inference latency directly determines how responsive the application feels.
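The stage-by-stage breakdown above can be sketched with simple wall-clock timing. This is a minimal illustration, not a production profiler; the stage functions (`preprocess`, `run_model`, `postprocess`) are hypothetical stand-ins for real request handling:

```python
import time

def timed_stage(fn, *args):
    """Run one stage and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Hypothetical stages; stand-ins for real preprocessing, model
# execution, and postprocessing.
def preprocess(raw):
    return raw.strip().lower()

def run_model(x):
    return f"label-for-{x}"

def postprocess(y):
    return {"prediction": y}

def handle_request(raw_input):
    """Handle one request and record per-stage timings."""
    timings = {}
    x, timings["preprocess"] = timed_stage(preprocess, raw_input)
    y, timings["model"] = timed_stage(run_model, x)
    out, timings["postprocess"] = timed_stage(postprocess, y)
    timings["total"] = sum(timings.values())  # end-to-end latency
    return out, timings

response, timings = handle_request("  Cat Photo  ")
```

In a real deployment, the same decomposition would also include network transmission and queue time, which are only visible from outside the handler.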

Latency is typically reported at multiple percentiles rather than as a simple average. The P50 (median) latency represents the typical user experience, while P95 and P99 latencies capture the tail experience — what the slowest 5% or 1% of requests experience. Tail latency is particularly important because even a small fraction of slow requests can dominate perceived quality, and in chained microservice architectures tail latency compounds: when a request fans out across several services, the probability that at least one of them hits its slow tail grows with each hop.
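A small sketch of percentile reporting, using the nearest-rank method (one of several common percentile definitions) on a toy sample of latencies. The numbers are illustrative only:

```python
def percentile(samples, p):
    """Nearest-rank percentile: the smallest value such that at least
    p% of samples are at or below it."""
    ordered = sorted(samples)
    # ceil(n * p / 100) - 1, clamped to a valid index
    k = max(0, -(-(len(ordered) * p) // 100) - 1)
    return ordered[k]

# Toy latency sample (ms) with a couple of slow tail requests.
latencies_ms = [12, 11, 13, 12, 14, 11, 95, 12, 13, 250]

p50 = percentile(latencies_ms, 50)  # typical experience
p95 = percentile(latencies_ms, 95)  # tail experience
p99 = percentile(latencies_ms, 99)
```

Note how the mean of this sample (44.3 ms) sits far above the median (12 ms): a handful of tail requests skews the average, which is exactly why percentiles are preferred.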

The components of inference latency vary by workload type. For image classification or embedding generation, the dominant component is typically GPU compute time for a single forward pass. For autoregressive language models, latency has two phases: time-to-first-token (TTFT), which includes prompt processing, and inter-token latency (ITL), the time between each subsequent generated token. Users perceive TTFT as the initial responsiveness and ITL as the generation speed.
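Given the request start time and the arrival time of each generated token, TTFT and ITL fall out directly. A minimal sketch with hypothetical timestamps (assuming prompt processing takes 0.5 s and tokens then arrive every 50 ms):

```python
def ttft_and_itl(request_start, token_times):
    """Return (TTFT, mean inter-token latency) from a request start
    time and the arrival timestamp of each generated token."""
    ttft = token_times[0] - request_start          # time-to-first-token
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0   # mean inter-token latency
    return ttft, itl

# Hypothetical timestamps in seconds, relative to request arrival.
start = 0.0
token_times = [0.50, 0.55, 0.60, 0.65, 0.70]

ttft, itl = ttft_and_itl(start, token_times)  # ~0.5 s TTFT, ~50 ms ITL
```

Total generation latency for a response of n tokens is then roughly TTFT + (n − 1) × ITL, which is why long responses are dominated by ITL while short ones are dominated by TTFT.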

Several factors influence inference latency. Model size and architecture determine the amount of computation required. GPU hardware capabilities set the ceiling for compute and memory throughput. Batch size trades per-request latency for throughput: larger batches increase the latency of individual requests in exchange for higher overall throughput. Quantization reduces both arithmetic and memory traffic by representing weights and activations in lower-precision formats. Inference engine optimizations like operator fusion, FlashAttention, and speculative decoding can significantly reduce latency.
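The batch-size trade-off can be made concrete with a toy linear cost model (assumed for illustration, not measured from any real GPU): each batch pays a fixed launch overhead plus a per-item cost, and every request in a batch waits for the whole batch to finish.

```python
def batch_latency_model(batch_size, fixed_overhead_ms=5.0, per_item_ms=2.0):
    """Toy linear cost model: batch time = fixed overhead + per-item cost.
    Returns (per-request latency in ms, throughput in requests/sec)."""
    batch_time_ms = fixed_overhead_ms + per_item_ms * batch_size
    per_request_latency_ms = batch_time_ms  # each request waits for the batch
    throughput_rps = batch_size / (batch_time_ms / 1000.0)
    return per_request_latency_ms, throughput_rps

lat1, thr1 = batch_latency_model(1)     # small batch: low latency
lat32, thr32 = batch_latency_model(32)  # large batch: high throughput
```

Under these assumed constants, batch size 32 roughly triples throughput while nearly 10x-ing per-request latency, which is the shape of the trade-off even though real kernels are not perfectly linear in batch size.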

For serverless GPU platforms, latency includes additional components not present in dedicated deployments. Cold start latency applies to the first request after scale-to-zero. Request routing adds a small overhead for load balancing across instances. Queue wait time occurs when all instances are busy and a request must wait for capacity. Understanding and monitoring these components helps teams optimize their deployment configuration and set realistic SLOs for their applications.
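The serverless components above are additive, so a simple decomposition shows why cold requests behave so differently from warm ones. The component names and magnitudes below are illustrative assumptions, not figures from any particular platform:

```python
def total_latency_ms(compute_ms, routing_ms=1.0, queue_ms=0.0, cold_start_ms=0.0):
    """Sum the latency components of one serverless request.
    Values are illustrative; real platforms report their own breakdowns."""
    return cold_start_ms + routing_ms + queue_ms + compute_ms

# A warm request pays only routing + compute; a cold request also pays
# an assumed 2.5 s container/model start-up cost.
warm = total_latency_ms(compute_ms=40.0)
cold = total_latency_ms(compute_ms=40.0, cold_start_ms=2500.0)
```

Monitoring each term separately tells you which lever to pull: cold starts call for keep-warm policies, queue time for more replicas, and compute time for model- or engine-level optimization.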