Inference Glossary

Model Quantization

Reducing the numerical precision of a model's weights and activations — from 32-bit to 16, 8, or 4 bits — to shrink memory footprint and speed up memory-bandwidth-bound inference.

Quantization converts a neural network's parameters from higher-precision floating-point formats to lower-precision representations. A model originally trained in FP32 (32 bits per parameter) might be quantized to FP16 (16 bits), INT8 (8 bits), or even INT4 (4 bits). The weight footprint shrinks in proportion to the bit width: a 70B-parameter model occupying 280 GB in FP32 shrinks to 140 GB in FP16, 70 GB in INT8, and about 35 GB in INT4, plus a small overhead for the scales and zero-points that quantized formats store alongside the weights.
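The footprint arithmetic is simply parameters times bits per parameter, divided by eight. A minimal sketch, assuming 1 GB = 10^9 bytes and ignoring scales, zero-points, activations, and the KV cache (the function name `weight_footprint_gb` is made up for illustration):

```python
def weight_footprint_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Counts raw weights only; quantization metadata and runtime
    buffers are deliberately left out of this estimate.
    """
    return num_params * bits_per_param / 8 / 1e9


# Compare the same 70B-parameter model at each precision.
footprints = {
    name: weight_footprint_gb(70e9, bits)
    for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]
}

for name, gb in footprints.items():
    print(f"{name}: {gb:.0f} GB")
```

Each halving of the bit width halves the weight footprint, so the numbers fall out as a simple ratio of the FP32 figure.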

The primary benefits of quantization are reduced memory usage, faster inference, and lower cost. Smaller models fit on fewer or smaller GPUs. Because LLM token generation is typically memory-bandwidth-bound, reading fewer bytes per parameter translates almost directly to higher tokens-per-second throughput. An INT4 model reads a quarter of the weight bytes of its FP16 counterpart, so on memory-bandwidth-limited hardware it can approach a 4x speedup; measured gains are usually somewhat lower because of dequantization overhead and KV cache traffic, which quantizing the weights does not reduce.
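The bandwidth argument can be turned into a rough upper bound: if every generated token requires reading all weights once, tokens per second cannot exceed memory bandwidth divided by weight bytes. The sketch below assumes single-stream decoding and a hypothetical 2,000 GB/s of memory bandwidth; it ignores KV cache reads, dequantization cost, and compute, so it gives a ceiling rather than a prediction.

```python
def decode_tokens_per_sec_bound(
    bandwidth_gb_s: float, num_params: float, bits_per_param: int
) -> float:
    """Upper bound on single-stream decode speed when weight reads dominate."""
    bytes_per_token = num_params * bits_per_param / 8  # all weights read once per token
    return bandwidth_gb_s * 1e9 / bytes_per_token


BANDWIDTH_GB_S = 2000.0  # hypothetical figure, not a specific GPU's spec
PARAMS = 70e9

fp16_bound = decode_tokens_per_sec_bound(BANDWIDTH_GB_S, PARAMS, 16)
int4_bound = decode_tokens_per_sec_bound(BANDWIDTH_GB_S, PARAMS, 4)
speedup = int4_bound / fp16_bound

print(f"FP16 bound: {fp16_bound:.1f} tok/s")
print(f"INT4 bound: {int4_bound:.1f} tok/s")
print(f"Ideal speedup: {speedup:.1f}x")
```

The ideal speedup depends only on the ratio of bit widths, not on the bandwidth figure chosen, which is why the estimate transfers across hardware as long as the workload stays bandwidth-bound.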

Production-grade quantization is more than rounding. **GPTQ**, **AWQ**, and **SqueezeLLM** use calibration data to choose quantization parameters, typically per channel or per group of weights, that preserve accuracy. Quantization-aware training (QAT) bakes quantization into training itself, which generally gives the best quality at low precision at the cost of extra training compute. The accuracy impact of any of these depends on the model: larger models generally tolerate aggressive quantization better, which is commonly attributed to greater parameter redundancy.
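For contrast, the plain rounding baseline that these methods improve on looks like the sketch below: symmetric round-to-nearest quantization of one group of weights with a single shared scale. This is not GPTQ, AWQ, or SqueezeLLM; it uses no calibration data, and the function names and sample weights are made up for illustration.

```python
def quantize_symmetric(weights, bits=4):
    """Round-to-nearest symmetric quantization of one group of weights."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # one scale shared by the group
    # Map each weight to the nearest integer level and clamp to the valid range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floating-point weights from integer levels."""
    return [v * scale for v in q]


group = [0.12, -0.53, 0.07, 0.91, -0.34, 0.02, -0.88, 0.45]  # illustrative values
q, scale = quantize_symmetric(group, bits=4)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(group, restored))

print("levels:", q)
print("scale:", round(scale, 4))
print("max abs error:", round(max_err, 4))
```

The worst-case error per weight is half the scale, and the scale is set by the largest weight in the group. A single outlier therefore coarsens the grid for every other weight, which is one reason calibration-based methods and smaller group sizes recover accuracy that plain rounding loses.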

Quantization is standard in production AI deployment. vLLM, SGLang, TensorRT-LLM, and Cumulus' Ion all ship with quantized-model support, and the open-source community publishes pre-quantized versions of most popular models within days of release. For a workload on the edge of fitting in GPU memory, quantization is often the difference between needing an 80 GB GPU and fitting on a cheaper, more readily available one.

Related Terms

KV Cache

A buffer that stores previously computed key and value tensors during autoregressive generation, so each new token only requires computing one step of attention instead of replaying the whole sequence.

Model Weights

The learned numerical parameters of a neural network, stored as large multi-dimensional arrays. The artifact that defines what a trained model does.

vLLM

An open-source LLM serving engine that introduced PagedAttention. The de facto baseline for high-throughput open-weight model serving.

Ion

Cumulus' inference engine. A custom runtime with proprietary attention kernels that serves 30 to 50% more tokens per second than vLLM and SGLang on NVIDIA Grace and Blackwell.

Inference

Running a trained model on new inputs to produce outputs — predictions, classifications, generations, embeddings. The production half of machine learning.