GPU Glossary

Model Weights

The learned numerical parameters of a neural network that encode the model's knowledge and capabilities, stored as large multi-dimensional arrays of floating-point numbers.

Model weights are the numerical parameters that define a trained neural network's behavior. During training, these weights are iteratively adjusted through backpropagation to minimize a loss function, effectively encoding the patterns and knowledge extracted from the training data. Once training is complete, the weights are frozen and used as-is during inference to produce predictions from new inputs.

The size of model weights varies enormously across architectures. A small image classification model might have a few million parameters occupying megabytes of storage, while a large language model like LLaMA 70B has 70 billion parameters requiring 140 GB of storage in FP16 precision. The trend toward larger models has made weight management a critical infrastructure challenge, as these files must be stored, transferred, and loaded into GPU memory efficiently.
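The arithmetic behind these figures is just parameter count times bytes per parameter. A minimal sketch (decimal gigabytes, with the standard byte widths for each precision):

```python
# Weight footprint = parameter count x bytes per parameter.
# Byte widths: FP32 = 4, FP16/BF16 = 2, INT8 = 1, INT4 = 0.5.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_size_gb(num_params: float, precision: str) -> float:
    """Total weight size in decimal gigabytes for a given precision."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

for p in ("fp32", "fp16", "int8", "int4"):
    print(f"70B parameters @ {p}: {weight_size_gb(70e9, p):.0f} GB")
# fp32: 280 GB, fp16: 140 GB, int8: 70 GB, int4: 35 GB
```

The FP16 row reproduces the 140 GB figure above; the INT8 and INT4 rows show why quantization is such an effective lever on storage and transfer costs.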

Model weights are typically stored in standardized formats such as PyTorch's .pt files, SafeTensors (a safe, efficient format that prevents code execution during loading), GGUF (optimized for quantized models), or ONNX (for cross-framework compatibility). The choice of format affects loading speed, safety, and compatibility with serving frameworks. SafeTensors has become the preferred format for production deployments due to its memory-mapping capability and resistance to arbitrary code execution vulnerabilities.
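SafeTensors' safety and memory-mapping properties follow directly from its on-disk layout: an 8-byte little-endian length prefix, a JSON header describing each tensor, then raw tensor bytes addressed by offset. A minimal sketch using only the standard library (the single-tensor file built here is a toy constructed for illustration):

```python
import json
import struct

def read_safetensors_header(blob: bytes) -> dict:
    """Parse the JSON header of a SafeTensors blob.

    The file starts with an 8-byte little-endian unsigned integer giving
    the header length, followed by the JSON header; tensor bytes follow
    and are located via each entry's "data_offsets". Because tensors are
    plain bytes found by offset, a loader can memory-map the file and
    never executes any code from it.
    """
    (header_len,) = struct.unpack("<Q", blob[:8])
    return json.loads(blob[8 : 8 + header_len])

# Build a toy one-tensor file by hand: an FP16 vector of four zeros (8 bytes).
header = {"w": {"dtype": "F16", "shape": [4], "data_offsets": [0, 8]}}
header_bytes = json.dumps(header).encode("utf-8")
blob = struct.pack("<Q", len(header_bytes)) + header_bytes + bytes(8)

parsed = read_safetensors_header(blob)
print(parsed["w"]["shape"])  # [4]
```

Contrast this with pickle-based `.pt` files, where loading can trigger arbitrary code execution unless restricted loaders are used.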

Loading model weights into GPU memory is often the largest component of cold start latency on serverless GPU platforms. A 70B parameter model in FP16 requires transferring 140 GB from storage to GPU memory, which can take tens of seconds even with high-speed storage and PCIe connections. Optimization strategies include weight pre-staging on local NVMe storage, memory-mapped loading that starts serving before all weights are transferred, model parallelism that distributes weights across multiple GPUs, and quantization that reduces the total data to transfer.
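A back-of-envelope estimate makes the cold-start problem concrete. The bandwidth figures below are rough assumed values for each link in the path, not measurements from any particular platform:

```python
# Assumed effective throughput per link, in decimal GB/s (illustrative
# values only: a 10 Gb network, a fast NVMe SSD, PCIe 4.0 x16 to the GPU).
BANDWIDTH_GB_PER_S = {
    "object storage over 10 Gb network": 1.25,
    "local NVMe SSD": 7.0,
    "PCIe 4.0 x16 host-to-GPU": 25.0,
}

def transfer_seconds(size_gb: float, gb_per_s: float) -> float:
    """Time to move size_gb through a link at the given throughput."""
    return size_gb / gb_per_s

for link, bw in BANDWIDTH_GB_PER_S.items():
    print(f"140 GB via {link}: {transfer_seconds(140, bw):.1f} s")
```

Under these assumptions, pulling 140 GB from remote object storage takes minutes, while reading pre-staged weights from local NVMe takes tens of seconds, which is why weight pre-staging and quantization pay off so directly.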

Understanding model weight characteristics is essential for infrastructure planning. The total weight size determines minimum GPU memory requirements and affects storage costs. The precision format affects both memory usage and inference performance. The number of distinct models being served affects total storage and caching requirements. Serverless GPU platforms handle weight distribution and caching as part of the platform infrastructure, pre-staging popular models across their fleet for faster cold starts.
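The planning step described above can be sketched as a single calculation. The `min_gpus` helper and its 1.2x overhead factor (for KV cache, activations, and framework buffers) are illustrative assumptions, not a sizing rule from any vendor:

```python
import math

def min_gpus(weight_gb: float, gpu_mem_gb: float, overhead: float = 1.2) -> int:
    """Minimum GPU count to hold the weights plus an assumed runtime
    overhead factor (the 1.2x default is a hypothetical placeholder)."""
    return math.ceil(weight_gb * overhead / gpu_mem_gb)

# 70B model in FP16 (140 GB of weights) on 80 GB GPUs:
print(min_gpus(140, 80))  # 3
```

Note that weights alone would fit on two 80 GB GPUs (2 x 80 = 160 GB > 140 GB); the overhead term pushes the estimate to three, which is exactly why planning from raw weight size alone underestimates real memory requirements.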