
Inference Latency

The end-to-end time from request arrival to response delivery. For LLMs, decomposed into time-to-first-token (TTFT) and inter-token latency (ITL).

Inference latency measures the total time required to process a single inference request, from the moment it arrives at the serving endpoint to the moment the response is returned. For user-facing applications it directly determines how responsive the system feels. It is typically tracked at multiple percentiles — P50 for the typical experience, P95 and P99 for the tail — because a small fraction of slow requests can dominate user perception.
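As a rough illustration of why the tail is tracked separately, the sketch below computes P50, P95, and P99 with the nearest-rank method over a made-up set of request latencies. The function names and the sample values are assumptions for illustration only, not measurements from any real system.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # p and len(ordered) are integers here, so the division by 100 is exact for whole ranks.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(samples_ms):
    """Return the P50, P95, and P99 latency for a list of per-request latencies."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}


# Hypothetical workload: 90 fast requests, 7 moderately slow ones, 3 outliers.
samples_ms = [120.0] * 90 + [400.0] * 7 + [900.0, 1500.0, 2400.0]
summary = latency_summary(samples_ms)
print(summary)
```

In this made-up sample the median stays at the fast-path value while P95 and P99 expose the slow requests, which is the gap a median-only dashboard would hide.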

For autoregressive language models, latency has two distinct components. **Time-to-first-token (TTFT)** is the time from request start to the first output token; it is dominated by prompt processing (prefill) and is what the user perceives as initial responsiveness. **Inter-token latency (ITL)** is the time between consecutive generated tokens; it sets the generation speed, and multiplied by output length it gives the total generation time. A good streaming experience needs both low TTFT and low ITL; neither compensates for the other.
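The two components can be derived from token arrival timestamps recorded on the client side. The following is a minimal sketch under that assumption; `StreamTiming` and `stream_timing` are hypothetical names, and the timestamps in the example are invented.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class StreamTiming:
    ttft_s: float      # time from request start to the first token
    mean_itl_s: float  # mean gap between consecutive tokens
    total_s: float     # time from request start to the last token


def stream_timing(request_start_s, token_times_s):
    """Split end-to-end latency of one streamed response into TTFT and mean ITL."""
    if not token_times_s:
        raise ValueError("response contained no tokens")
    ttft = token_times_s[0] - request_start_s
    gaps = [later - earlier for earlier, later in zip(token_times_s, token_times_s[1:])]
    mean_itl = mean(gaps) if gaps else 0.0
    total = token_times_s[-1] - request_start_s
    return StreamTiming(ttft, mean_itl, total)


# Invented timestamps (seconds): request sent at 0.0, four tokens streamed back.
timing = stream_timing(0.0, [0.5, 0.75, 1.0, 1.25])
print(timing)
```

For a response of n tokens, the total time equals TTFT plus (n - 1) times the mean ITL, so a long output makes ITL the dominant term even when TTFT is small.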

The two components scale with different things. TTFT grows with prompt length, because prefill compute increases with the number of prompt tokens. ITL grows with model size and shrinks with memory bandwidth, because each decode step is typically bound by reading the model weights from memory. Batch size trades latency against throughput: larger batches increase each request's response time in exchange for higher overall throughput. The right batch size is a per-workload tuning question.
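The batch-size trade-off can be illustrated with a back-of-the-envelope roofline estimate. This is a deliberately simplified sketch, not a performance model of any particular system: it assumes a decode step costs the larger of (bytes read from memory / memory bandwidth) and (about 2 FLOPs per parameter per sequence / peak compute), and every number in `HYPOTHETICAL` is made up for illustration. Real systems add kernel launch, scheduling, and communication overheads that are ignored here.

```python
def decode_step_seconds(batch_size, params, bytes_per_param,
                        kv_bytes_per_seq, mem_bandwidth_bytes_s, flops_s):
    """Estimate the time of one decode step (one new token per sequence in the batch)."""
    weight_bytes = params * bytes_per_param
    # Weights are read once per step; each sequence also reads its own KV cache.
    memory_time = (weight_bytes + batch_size * kv_bytes_per_seq) / mem_bandwidth_bytes_s
    # Roughly 2 FLOPs per parameter per generated token.
    compute_time = batch_size * 2 * params / flops_s
    return max(memory_time, compute_time)


def throughput_tokens_s(batch_size, step_seconds):
    """Tokens generated per second across the whole batch."""
    return batch_size / step_seconds


# All values are hypothetical: an 8B-parameter model in 16-bit weights,
# 0.5 GB of KV cache per sequence, 2 TB/s of memory bandwidth, 400 TFLOP/s.
HYPOTHETICAL = dict(params=8e9, bytes_per_param=2, kv_bytes_per_seq=0.5e9,
                    mem_bandwidth_bytes_s=2e12, flops_s=4e14)


def sweep(batch_sizes):
    """Return (batch size, estimated ITL in seconds, tokens per second) per batch size."""
    rows = []
    for b in batch_sizes:
        step = decode_step_seconds(b, **HYPOTHETICAL)
        rows.append((b, step, throughput_tokens_s(b, step)))
    return rows


for b, step, tput in sweep([1, 8, 32]):
    print(f"batch={b:>3}  ITL~{step * 1e3:.2f} ms  throughput~{tput:.0f} tok/s")
```

Under these assumptions the estimated ITL roughly doubles between batch size 1 and 32 while aggregate throughput rises by more than an order of magnitude, which is the exchange described above.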

Beyond raw hardware, the techniques that reduce latency are quantization (less data to move per token), KV caching (no recomputation of prior tokens), prompt caching (no prefill for shared prefixes), continuous batching (no head-of-line blocking), and custom attention kernels like IonAttention that overlap prefill and decode. The Cumulus Router additionally reduces tail latency by speculatively dispatching a request whose primary is slow to a faster secondary and returning whichever response finishes first.
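The speculative-dispatch idea follows the general "hedged request" pattern, which can be sketched with `asyncio`. This is not the Cumulus Router's implementation, whose internals are not described here; `hedged_call`, the `hedge_after_s` threshold, and the two simulated backends are hypothetical names chosen for illustration.

```python
import asyncio


async def hedged_call(primary, secondary, hedge_after_s):
    """Call primary; if it is still running after hedge_after_s, race it against secondary."""
    primary_task = asyncio.ensure_future(primary())
    done, _ = await asyncio.wait({primary_task}, timeout=hedge_after_s)
    if done:
        # Primary answered within the hedge window, so no second request is sent.
        return primary_task.result()

    secondary_task = asyncio.ensure_future(secondary())
    done, pending = await asyncio.wait(
        {primary_task, secondary_task}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # stop the loser so it does not keep consuming capacity
    winner = primary_task if primary_task in done else secondary_task
    return winner.result()


# Simulated backends (hypothetical): the primary is slow for this request.
async def slow_primary():
    await asyncio.sleep(0.5)
    return "primary"


async def fast_secondary():
    await asyncio.sleep(0.01)
    return "secondary"


def demo():
    return asyncio.run(hedged_call(slow_primary, fast_secondary, hedge_after_s=0.05))


print(demo())
```

The hedge threshold is the main design choice: set too low, most requests are duplicated and load rises; set too high, the hedge rarely helps the tail. This sketch also returns the first task to complete even if it raised an exception; a production version would likely fall back to the other task in that case.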