Inference Latency
The total time elapsed from when an inference request is received to when the response is returned, including preprocessing, model execution, and postprocessing.
Inference latency measures the end-to-end time required to process a single inference request, from the moment the request arrives at the serving endpoint to the moment the response is sent back to the caller. It encompasses network transmission time, request queuing, input preprocessing, GPU computation, output postprocessing, and response serialization. For user-facing applications, inference latency directly determines how responsive the application feels.
Latency is typically measured at multiple percentiles rather than as a simple average. The P50 (median) latency represents the typical user experience, while P95 and P99 latencies capture the tail experience — what the slowest 5% or 1% of requests experience. Tail latency is particularly important because even a small percentage of slow requests can affect overall user satisfaction, and in chained microservice architectures, tail latency compounds across services.
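To make the percentile idea concrete, here is a minimal sketch that computes P50/P95/P99 from a synthetic sample of request latencies using a simple nearest-rank percentile; the latency distribution (a fast majority plus a slow 5% tail) is invented purely for illustration.

```python
import random
import statistics

random.seed(0)
# Simulated request latencies in ms: 95% fast requests, 5% slow tail.
latencies_ms = [random.gauss(120, 15) for _ in range(950)] + \
               [random.gauss(600, 100) for _ in range(50)]

def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples."""
    ordered = sorted(samples)
    rank = int(round(pct / 100 * len(ordered))) - 1
    return ordered[max(0, min(len(ordered) - 1, rank))]

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
mean = statistics.mean(latencies_ms)
# The mean hides the tail: P99 sits far above both the mean and the median.
print(f"mean={mean:.0f}ms p50={p50:.0f}ms p95={p95:.0f}ms p99={p99:.0f}ms")
```

Running this shows why averages are misleading: the 5% tail barely moves the mean but dominates P99, which is exactly the behavior that compounds across chained services.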
The components of inference latency vary by workload type. For image classification or embedding generation, the dominant component is typically GPU compute time for a single forward pass. For autoregressive language models, latency has two phases: time-to-first-token (TTFT), which includes prompt processing, and inter-token latency (ITL), the time between each subsequent generated token. Users perceive TTFT as the initial responsiveness and ITL as the generation speed.
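The two-phase structure of autoregressive latency can be measured by timestamping a streaming response. The sketch below wraps a stand-in generator (`generate_tokens` is hypothetical, simulating prefill and per-token decode with sleeps) and reports TTFT and mean ITL.

```python
import time

def generate_tokens(prompt):
    """Hypothetical streaming backend: yields generated tokens one at a time."""
    time.sleep(0.05)          # stand-in for prompt processing (prefill)
    for token in ["Hello", ",", " world", "!"]:
        time.sleep(0.01)      # stand-in for one decode step
        yield token

def measure_streaming_latency(prompt):
    start = time.perf_counter()
    ttft = None
    token_times = []
    for _token in generate_tokens(prompt):
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start            # time-to-first-token
        token_times.append(now)
    # Inter-token latency: mean gap between consecutive tokens.
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, itl

ttft, itl = measure_streaming_latency("Hi")
print(f"TTFT={ttft * 1000:.0f}ms ITL={itl * 1000:.1f}ms")
```

Note that TTFT includes the whole prefill cost plus the first decode step, which is why long prompts raise TTFT without affecting ITL.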
Several factors influence inference latency. Model size and architecture determine the amount of computation required per request. GPU hardware capabilities set the ceiling for compute and memory throughput. Batch size trades latency for throughput: larger batches raise individual request latency in exchange for higher overall throughput. Quantization reduces both computation and memory access time. Inference engine optimizations like operator fusion, FlashAttention, and speculative decoding can significantly reduce latency.
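The batch-size tradeoff can be illustrated with a toy cost model: a fixed per-batch overhead plus a per-item compute cost. The numbers below are assumptions for demonstration, not measurements of any real GPU.

```python
def batch_latency_ms(batch_size, overhead_ms=8.0, per_item_ms=2.0):
    """Latency to process one batch under a toy linear cost model
    (assumed constants, not real hardware measurements)."""
    return overhead_ms + per_item_ms * batch_size

for batch_size in (1, 8, 32):
    latency = batch_latency_ms(batch_size)
    throughput = batch_size / (latency / 1000)  # requests per second
    print(f"batch={batch_size:2d} latency={latency:5.1f}ms "
          f"throughput={throughput:7.1f} req/s")
```

Under this model, going from batch 1 to batch 32 raises per-request latency from 10 ms to 72 ms while throughput climbs from 100 to over 400 req/s, which is the shape of the tradeoff serving systems tune around.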
For serverless GPU platforms, latency includes additional components not present in dedicated deployments. Cold start latency applies to the first request after scale-to-zero. Request routing adds a small overhead for load balancing across instances. Queue wait time occurs when all instances are busy and a request must wait for capacity. Understanding and monitoring these components helps teams optimize their deployment configuration and set realistic SLOs for their applications.
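A simple additive decomposition shows how these serverless components shape the latency distribution. All component values below are hypothetical, chosen only to illustrate how a rare cold start moves the mean slightly but dominates the tail.

```python
def request_latency_ms(compute_ms, routing_ms=1.0, queue_ms=0.0,
                       cold_start_ms=0.0):
    """Total latency = routing + queue wait + optional cold start + compute
    (hypothetical component values for illustration)."""
    return routing_ms + queue_ms + cold_start_ms + compute_ms

warm = request_latency_ms(compute_ms=45.0)
cold = request_latency_ms(compute_ms=45.0, cold_start_ms=4000.0)
# If 1% of requests hit a cold start, the mean rises modestly,
# but P99 is effectively the cold-start path.
mean = 0.99 * warm + 0.01 * cold
print(f"warm={warm:.0f}ms cold={cold:.0f}ms mean={mean:.0f}ms")
```

This is why cold-start frequency matters more for tail-latency SLOs than for average-latency dashboards: a P99 target can be blown by cold starts that barely register in the mean.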
Related Terms
Cold Start
The initial delay when a serverless GPU instance must be provisioned, loaded, and initialized before it can serve its first request.
GPU Inference
The process of running a trained machine learning model on a GPU to generate predictions or outputs from new input data.
Batch Inference
The technique of grouping multiple inference requests together and processing them simultaneously on a GPU to maximize throughput and hardware utilization.
KV Cache
A memory buffer that stores previously computed key and value tensors during autoregressive language model inference to avoid redundant recalculation for each new token.
GPU Memory Bandwidth
The rate at which data can be read from or written to GPU memory (VRAM), often the primary bottleneck for AI inference performance.