
Model Weights

The learned numerical parameters of a neural network, stored as large multi-dimensional arrays. The artifact that defines what a trained model does.

Model weights are the numerical parameters that define a trained neural network's behavior. During training, they are iteratively adjusted through backpropagation to minimize a loss function, effectively encoding the patterns learned from the training data. Once training completes, the weights are frozen and used as-is during inference.

The size of model weights varies enormously across architectures. A small image classifier might have a few million parameters occupying megabytes of storage; a frontier language model like LLaMA-70B has 70 billion parameters, which at 2 bytes per parameter in FP16 comes to roughly 140 GB. The trend toward larger models has made weight management a serious infrastructure problem: these files must be stored, transferred, and loaded into GPU memory efficiently.
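The size figures above follow from simple arithmetic: parameter count times bytes per parameter. The sketch below is illustrative only; the helper name `weight_size_gb` is hypothetical, and it ignores per-format overhead such as the scale factors that quantized formats store alongside the weights.

```python
# Bytes per parameter for common storage precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_size_gb(num_params: float, dtype: str) -> float:
    """Approximate raw weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Compare a 70-billion-parameter model across precisions.
for dtype in BYTES_PER_PARAM:
    print(f"70B parameters in {dtype}: {weight_size_gb(70e9, dtype):.0f} GB")
```

The same calculation shows why quantization matters for loading: halving the bytes per parameter halves the data that has to move from storage into GPU memory.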

Weights are typically stored in one of a few standard formats: PyTorch's `.pt` (pickle-based, so an untrusted file can execute code when loaded without restrictions), **SafeTensors** (a data-only format that cannot execute arbitrary code on load), GGUF (optimized for quantized models), or ONNX (cross-framework). SafeTensors has become the production default because of its memory-mapping support and its safety properties: a model weight file pulled from a public hub should not be able to run code on the host that loads it.
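To see why a data-only format is safe to load, it helps to look at the layout. A SafeTensors file starts with an 8-byte little-endian header length, followed by a JSON header that maps each tensor name to its dtype, shape, and byte offsets, followed by the raw tensor bytes. The sketch below writes and reads a file with that layout using only the standard library. The function names are hypothetical, this is not the `safetensors` library's API, and details of real files (such as header padding or a metadata entry) are omitted.

```python
import json
import os
import struct
import tempfile


def write_safetensors_like(path, tensors):
    """Write {name: (dtype, shape, raw_bytes)} as length + JSON header + data."""
    header, offset = {}, 0
    for name, (dtype, shape, raw) in tensors.items():
        header[name] = {
            "dtype": dtype,
            "shape": shape,
            "data_offsets": [offset, offset + len(raw)],
        }
        offset += len(raw)
    header_bytes = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))  # 8-byte header length
        f.write(header_bytes)
        for _dtype, _shape, raw in tensors.values():
            f.write(raw)


def read_header(path):
    """Parse only the header; nothing is executed, it is plain JSON."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        return header_len, json.loads(f.read(header_len))


def read_tensor_bytes(path, name):
    """Seek straight to one tensor's bytes using the offsets in the header."""
    header_len, header = read_header(path)
    begin, end = header[name]["data_offsets"]
    with open(path, "rb") as f:
        f.seek(8 + header_len + begin)
        return f.read(end - begin)


# Demo with two tiny float32 tensors.
path = os.path.join(tempfile.mkdtemp(), "demo.safetensors")
weight = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
bias = struct.pack("<2f", 0.5, -1.0)
write_safetensors_like(
    path, {"weight": ("F32", [2, 2], weight), "bias": ("F32", [2], bias)}
)
_, header = read_header(path)
print(header["bias"])
print(struct.unpack("<2f", read_tensor_bytes(path, "bias")))
```

Because the loader only parses JSON and copies bytes, there is no step at which file contents are interpreted as code, in contrast to unpickling.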

Loading large weights into GPU memory is often the dominant component of cold-start latency on platforms that scale to zero. Optimization strategies include staging weights on local NVMe, memory-mapped or streamed loading that overlaps reading with transfer to the GPU instead of waiting for the whole file, model parallelism that distributes weights across multiple GPUs so each loads only its shard, and quantization that reduces the number of bytes to transfer. Inference platforms that host custom models typically handle weight staging and caching as part of platform infrastructure.
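As a minimal sketch of the memory-mapping idea, the example below maps a file and reads a single region by offset, so the operating system only needs to page in the bytes that are actually touched. The file layout and the helper names `make_blob` and `read_chunk_mmap` are hypothetical; real loaders add device transfer and error handling, and the actual benefit depends on the storage and access pattern.

```python
import mmap
import os
import tempfile


def make_blob(path, chunks):
    """Write chunks back to back and return {name: (offset, length)}."""
    index, offset = {}, 0
    with open(path, "wb") as f:
        for name, raw in chunks.items():
            f.write(raw)
            index[name] = (offset, len(raw))
            offset += len(raw)
    return index


def read_chunk_mmap(path, offset, length):
    """Map the file read-only and copy out just one region."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return bytes(mm[offset:offset + length])


# Demo: two 4 KiB "layers"; only the second is read through the mapping.
blob_path = os.path.join(tempfile.mkdtemp(), "weights.bin")
index = make_blob(
    blob_path, {"layer0": b"\x00" * 4096, "layer1": b"\x01" * 4096}
)
offset, length = index["layer1"]
layer1 = read_chunk_mmap(blob_path, offset, length)
print(len(layer1), layer1[:4])
```

In a sharded deployment, each GPU worker could apply the same pattern to read only the regions belonging to its own shard.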