GPU Glossary

Model Quantization

The process of reducing the numerical precision of a model's weights and activations (e.g., from 32-bit floating point to 8-bit or 4-bit representations) to decrease memory usage and increase inference speed.

Model quantization converts a neural network's parameters from higher-precision floating-point formats to lower-precision representations. A model originally trained in FP32 (32 bits per parameter) might be quantized to FP16 (16 bits), INT8 (8 bits), or even INT4 (4 bits). This reduces the model's memory footprint proportionally — a 70B parameter model occupying 280 GB in FP32 shrinks to 70 GB in INT8 and 35 GB in INT4.
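The proportional relationship is easy to sketch. A minimal back-of-envelope calculation, counting weights only (KV cache and activations add more memory in practice):

```python
# Rough memory footprint of model weights at different precisions.
# Weights only; KV cache and activation memory are extra.
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    return n_params * bits_per_param / 8 / 1e9  # bits -> bytes -> GB

n = 70e9  # 70B-parameter model
for fmt, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{fmt}: {weight_memory_gb(n, bits):.0f} GB")
# FP32: 280 GB, FP16: 140 GB, INT8: 70 GB, INT4: 35 GB
```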

The primary benefits of quantization are reduced memory usage, faster inference, and lower cost. Smaller models fit on fewer or less expensive GPUs, reducing hardware requirements. Because inference in large language models is often memory-bandwidth-bound, reading fewer bytes per parameter directly translates to higher tokens-per-second throughput. A well-quantized INT4 model can run nearly 4x faster than its FP16 counterpart on memory-bandwidth-limited hardware.

Several quantization approaches exist with different tradeoffs. Post-training quantization (PTQ) applies quantization to an already-trained model without additional training, making it simple to apply but potentially less accurate. Quantization-aware training (QAT) incorporates quantization into the training process itself, producing models that maintain higher accuracy at low precision. Techniques like GPTQ, AWQ, and SqueezeLLM use calibration data to determine optimal quantization parameters for each layer.
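The simplest form of PTQ is symmetric round-to-nearest quantization: scale the tensor so its largest absolute value maps to the integer range, then round. A minimal NumPy sketch (not how GPTQ or AWQ work internally, which add calibration on top of this idea):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor absmax quantization to INT8."""
    scale = np.abs(w).max() / 127.0          # map absmax to the INT8 range
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(4, 8)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(np.abs(w - w_hat).max())  # bounded by half the scale step
```

Storing `q` (1 byte per weight) plus one `scale` per tensor is what delivers the 4x memory reduction relative to FP32.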

The accuracy impact of quantization depends on the model architecture, size, and quantization method. Larger models tend to be more robust to quantization because they have more redundancy in their parameters. State-of-the-art quantization methods like AWQ and GPTQ can compress large language models to 4-bit precision with negligible perplexity degradation, while naive round-to-nearest quantization at the same precision would cause significant quality loss.
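One reason naive round-to-nearest degrades quality at 4 bits is outlier weights: a single large value inflates the per-tensor scale, wasting precision on all the small weights. Per-group scales, as used by methods like GPTQ and AWQ, localize that damage. A sketch of the effect on synthetic data (the outlier pattern here is an assumption standing in for real LLM weight distributions):

```python
import numpy as np

def rtn_quantize(w: np.ndarray, bits: int = 4, group_size=None) -> np.ndarray:
    """Round-to-nearest symmetric quantization, optionally with per-group scales.
    Returns the dequantized weights so the error can be measured directly."""
    qmax = 2 ** (bits - 1) - 1                       # 7 for 4-bit
    gs = w.size if group_size is None else group_size
    groups = w.reshape(-1, gs)
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(groups / scales), -qmax, qmax)
    return (q * scales).reshape(w.shape)

rng = np.random.default_rng(0)
# Mostly small weights plus a few large outliers
w = rng.normal(0, 0.02, size=4096).astype(np.float32)
w[::512] = 0.5  # outliers inflate the single per-tensor scale
err_tensor = np.abs(w - rtn_quantize(w)).mean()
err_group = np.abs(w - rtn_quantize(w, group_size=128)).mean()
print(err_tensor, err_group)  # per-group scaling gives lower error
```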

Quantization has become a standard practice in production AI deployment. Serving frameworks like vLLM and TensorRT-LLM have built-in support for quantized models, and the open-source community regularly publishes pre-quantized versions of popular models. For teams deploying on serverless GPU platforms, quantization can mean the difference between needing an 80 GB A100 and fitting on a more available and affordable 24 GB GPU.