Model Quantization

Reducing the numerical precision of a model's weights and activations — from 32-bit to 16, 8, or 4 bits — to shrink memory footprint and speed up memory-bandwidth-bound inference.

Quantization converts a neural network's parameters from higher-precision floating-point formats to lower-precision representations. A model originally trained in FP32 (32 bits per parameter) might be quantized to FP16 (16 bits), INT8 (8 bits), or even INT4 (4 bits). Memory footprint shrinks in proportion to the bit width: a 70B-parameter model whose weights occupy 280 GB in FP32 needs about 140 GB in FP16, 70 GB in INT8, and 35 GB in INT4.
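The footprint arithmetic above can be checked in a few lines of Python. This is a minimal sketch that counts weight bytes only and assumes 1 GB = 10^9 bytes; the name `weight_footprint_gb` is illustrative, and a real deployment also needs memory for the KV cache, activations, and the quantization scales themselves.

```python
# Bits per parameter for the formats mentioned above.
BITS_PER_PARAM = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}


def weight_footprint_gb(num_params: float, fmt: str) -> float:
    """Weight-only memory in GB (1 GB = 1e9 bytes); ignores KV cache and scales."""
    return num_params * BITS_PER_PARAM[fmt] / 8 / 1e9


if __name__ == "__main__":
    # Print the weight footprint of a 70B-parameter model in each format.
    for fmt in BITS_PER_PARAM:
        print(f"{fmt}: {weight_footprint_gb(70e9, fmt):.0f} GB")
```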

The primary benefits of quantization are reduced memory usage, faster inference, and lower cost. Smaller models fit on fewer or smaller GPUs. Because LLM token generation is typically memory-bandwidth-bound, reading fewer bytes per parameter translates almost directly to higher tokens-per-second throughput. In the ideal bandwidth-bound case, a well-quantized INT4 model approaches 4x the decode throughput of its FP16 counterpart; measured gains are usually smaller because of dequantization overhead and KV-cache traffic, which weight quantization does not reduce.
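A rough roofline-style estimate shows why bytes per parameter map onto throughput. In single-stream decoding, every generated token requires reading all the weights once, so tokens per second is bounded by memory bandwidth divided by weight bytes. The sketch below assumes a hypothetical accelerator with 2 TB/s of memory bandwidth and ignores KV-cache traffic, dequantization cost, and batching, so the results are upper bounds rather than predictions.

```python
def max_decode_tokens_per_s(num_params: float, bits_per_param: int,
                            bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream decode rate when weight reads dominate."""
    weight_bytes = num_params * bits_per_param / 8
    return bandwidth_gb_s * 1e9 / weight_bytes


if __name__ == "__main__":
    BANDWIDTH_GB_S = 2000.0  # assumed figure for a hypothetical GPU
    fp16 = max_decode_tokens_per_s(70e9, 16, BANDWIDTH_GB_S)
    int4 = max_decode_tokens_per_s(70e9, 4, BANDWIDTH_GB_S)
    print(f"FP16: {fp16:.1f} tok/s, INT4: {int4:.1f} tok/s, "
          f"ratio: {int4 / fp16:.1f}x")
```

The ratio depends only on the bit widths, not on the assumed bandwidth, which is why the bound is a clean factor of four between FP16 and INT4.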

Production-grade quantization is more than rounding. **GPTQ**, **AWQ**, and **SqueezeLLM** use calibration data to choose quantization parameters, such as per-channel or per-group scales, that preserve accuracy. Quantization-aware training (QAT) bakes quantization into training itself, which typically gives the best quality at low precision at the cost of an additional training run. The accuracy impact of any of these depends on the model: larger models generally tolerate aggressive quantization better, which is commonly attributed to their greater parameter redundancy.
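For contrast, here is the plain "rounding" baseline that the calibrated methods improve on: symmetric round-to-nearest quantization with a single absmax scale per group of weights. This is a minimal sketch in pure Python, not an implementation of GPTQ, AWQ, or SqueezeLLM; the function names and sample weights are illustrative.

```python
def quantize_absmax(weights: list[float], bits: int = 4) -> tuple[list[int], float]:
    """Symmetric round-to-nearest quantization with one scale for the group."""
    qmax = 2 ** (bits - 1) - 1  # 7 for INT4, 127 for INT8
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round each weight to the nearest integer level and clamp to [-qmax, qmax].
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: list[int], scale: float) -> list[float]:
    """Map integer levels back to approximate floating-point weights."""
    return [v * scale for v in q]


if __name__ == "__main__":
    w = [0.12, -0.53, 0.98, -0.05, 0.31]  # made-up weights for illustration
    for bits in (8, 4):
        q, scale = quantize_absmax(w, bits)
        err = max(abs(a - b) for a, b in zip(w, dequantize(q, scale)))
        print(f"INT{bits}: levels={q}, max abs error={err:.4f}")
```

The rounding error per weight is at most half a quantization step, and the step doubles with each bit removed. A single outlier also stretches the scale for every weight in its group, which is one reason calibrated methods use finer-grained scales.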

Quantization is standard in production AI deployment. vLLM, SGLang, TensorRT-LLM, and Cumulus' Ion all ship with quantized-model support, and the open-source community publishes pre-quantized versions of most popular models within days of release. For a workload that runs on the edge of fitting in GPU memory, quantization is often the difference between needing an 80 GB GPU and fitting on a more available, cheaper one.