Model Weights
The learned numerical parameters of a neural network that encode the model's knowledge and capabilities, stored as large multi-dimensional arrays of floating-point numbers.
Model weights are the numerical parameters that define a trained neural network's behavior. During training, these weights are iteratively adjusted through backpropagation to minimize a loss function, effectively encoding the patterns and knowledge extracted from the training data. Once training is complete, the weights are frozen and used as-is during inference to produce predictions from new inputs.
The size of model weights varies enormously across architectures. A small image classification model might have a few million parameters occupying megabytes of storage, while a large language model like LLaMA 70B has 70 billion parameters requiring 140 GB of storage in FP16 precision. The trend toward larger models has made weight management a critical infrastructure challenge, as these files must be stored, transferred, and loaded into GPU memory efficiently.
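The arithmetic above can be sketched as a small calculator: parameter count times bytes per parameter gives the raw weight size. The precision table and function names below are illustrative, not from any particular library.

```python
# Sketch: estimate the raw storage/VRAM footprint of model weights from
# parameter count and numeric precision. Bytes-per-parameter values are
# the standard widths for each format; names here are illustrative.

BYTES_PER_PARAM = {
    "fp32": 4,
    "fp16": 2,
    "bf16": 2,
    "int8": 1,
    "int4": 0.5,
}

def weight_size_gb(num_params: float, precision: str = "fp16") -> float:
    """Raw weight size in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

# A 70B-parameter model in FP16: 70e9 params * 2 bytes = 140 GB
print(weight_size_gb(70e9, "fp16"))  # -> 140.0
```

Note that this counts weights only; serving a model also needs memory for activations and (for LLMs) the KV cache.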
Model weights are typically stored in standardized formats such as PyTorch's .pt checkpoint files (pickle-based, which can execute arbitrary code when loaded), SafeTensors (a safe, efficient format that prevents code execution during loading), GGUF (optimized for quantized models), or ONNX (for cross-framework compatibility). The choice of format affects loading speed, safety, and compatibility with serving frameworks. SafeTensors has become the preferred format for production deployments due to its memory-mapping capability and resistance to arbitrary code execution vulnerabilities.
Loading model weights into GPU memory is often the largest component of cold start latency on serverless GPU platforms. A 70B parameter model in FP16 requires transferring 140 GB from storage to GPU memory, which can take tens of seconds even with high-speed storage and PCIe connections. Optimization strategies include weight pre-staging on local NVMe storage, streaming or memory-mapped loading that overlaps weight transfer with model initialization, model parallelism that distributes weights across multiple GPUs, and quantization that reduces the total data to transfer.
Understanding model weight characteristics is essential for infrastructure planning. The total weight size determines minimum GPU memory requirements and affects storage costs. The precision format affects both memory usage and inference performance. The number of distinct models being served affects total storage and caching requirements. Serverless GPU platforms handle weight distribution and caching as part of the platform infrastructure, pre-staging popular models across their fleet for faster cold starts.
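For capacity planning, the weight size translates directly into a minimum GPU count. The sketch below assumes an 80 GB GPU (A100/H100 class) and applies an overhead multiplier for activations and KV cache; that 1.2x factor is an illustrative assumption, as real overhead depends on batch size and sequence length.

```python
# Sketch: minimum number of GPUs needed to hold a model's weights,
# with a fudge factor for non-weight memory (activations, KV cache).
# The overhead multiplier is an illustrative assumption.
import math

def min_gpus(weight_gb: float, vram_gb: float, overhead: float = 1.2) -> int:
    """Minimum GPUs, rounding up whole devices."""
    return math.ceil(weight_gb * overhead / vram_gb)

# 140 GB of FP16 weights on 80 GB GPUs: 168 GB needed -> 3 GPUs
print(min_gpus(140, 80))
```

A result like this is also where quantization pays off in hardware terms: halving the weight size can cut the GPU count, not just the per-GPU memory pressure.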
Related Terms
Model Quantization
The process of reducing the numerical precision of a model's weights and activations (e.g., from 32-bit to 8-bit or 4-bit) to decrease memory usage and increase inference speed.
Fine-Tuning
The process of adapting a pre-trained machine learning model to a specific task or domain by continuing training on a smaller, task-specific dataset.
GPU Memory Bandwidth
The rate at which data can be read from or written to GPU memory (VRAM), often the primary bottleneck for AI inference performance.
Cold Start
The initial delay when a serverless GPU instance must be provisioned, loaded, and initialized before it can serve its first request.
Model Serving
The infrastructure and process of deploying trained machine learning models as accessible endpoints that can receive inputs and return predictions in real time.