Model Weights
The learned numerical parameters of a neural network that encode the model's knowledge and capabilities, stored as large multi-dimensional arrays of floating-point numbers.
Model weights are the numerical parameters that define a trained neural network's behavior. During training, these weights are iteratively adjusted through backpropagation to minimize a loss function, effectively encoding the patterns and knowledge extracted from the training data. Once training is complete, the weights are frozen and used as-is during inference to produce predictions from new inputs.
The size of model weights varies enormously across architectures. A small image classification model might have a few million parameters occupying megabytes of storage, while a large language model like LLaMA 70B has 70 billion parameters requiring 140 GB of storage in FP16 precision. The trend toward larger models has made weight management a critical infrastructure challenge, as these files must be stored, transferred, and loaded into GPU memory efficiently.
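The arithmetic above can be sketched as a small calculator: parameter count times bytes per parameter gives the raw weight size. The precision table and function names below are illustrative, not from any particular library.

```python
# Sketch: estimate the raw storage/VRAM footprint of model weights from
# parameter count and numeric precision. Bytes-per-parameter values are
# the standard widths for each format; names here are illustrative.

BYTES_PER_PARAM = {
    "fp32": 4,
    "fp16": 2,
    "bf16": 2,
    "int8": 1,
    "int4": 0.5,
}

def weight_size_gb(num_params: float, precision: str = "fp16") -> float:
    """Raw weight size in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

# A 70B-parameter model in FP16: 70e9 params * 2 bytes = 140 GB
print(weight_size_gb(70e9, "fp16"))  # -> 140.0
```

Note that this counts weights only; serving a model also needs memory for activations and (for LLMs) the KV cache.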
Model weights are typically stored in standardized formats such as PyTorch's .pt checkpoint files (pickle-based, which can execute arbitrary code when loaded), SafeTensors (a safe, efficient format that prevents code execution during loading), GGUF (optimized for quantized models), or ONNX (for cross-framework compatibility). The choice of format affects loading speed, safety, and compatibility with serving frameworks. SafeTensors has become the preferred format for production deployments due to its memory-mapping capability and resistance to arbitrary code execution vulnerabilities.
Loading model weights into GPU memory is often the largest component of cold start latency on serverless GPU platforms. A 70B parameter model in FP16 requires transferring 140 GB from storage to GPU memory, which can take tens of seconds even with high-speed storage and PCIe connections. Optimization strategies include weight pre-staging on local NVMe storage, streaming or memory-mapped loading that overlaps weight transfer with model initialization, model parallelism that distributes weights across multiple GPUs, and quantization that reduces the total data to transfer.
Understanding model weight characteristics is essential for infrastructure planning. The total weight size determines minimum GPU memory requirements and affects storage costs. The precision format affects both memory usage and inference performance. The number of distinct models being served affects total storage and caching requirements. Serverless GPU platforms handle weight distribution and caching as part of the platform infrastructure, pre-staging popular models across their fleet for faster cold starts.
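For capacity planning, the weight size translates directly into a minimum GPU count. The sketch below assumes an 80 GB GPU (A100/H100 class) and applies an overhead multiplier for activations and KV cache; that 1.2x factor is an illustrative assumption, as real overhead depends on batch size and sequence length.

```python
# Sketch: minimum number of GPUs needed to hold a model's weights,
# with a fudge factor for non-weight memory (activations, KV cache).
# The overhead multiplier is an illustrative assumption.
import math

def min_gpus(weight_gb: float, vram_gb: float, overhead: float = 1.2) -> int:
    """Minimum GPUs, rounding up whole devices."""
    return math.ceil(weight_gb * overhead / vram_gb)

# 140 GB of FP16 weights on 80 GB GPUs: 168 GB needed -> 3 GPUs
print(min_gpus(140, 80))
```

A result like this is also where quantization pays off in hardware terms: halving the weight size can cut the GPU count, not just the per-GPU memory pressure.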
Related Terms
Model Quantization
The process of reducing the numerical precision of a model's weights and activations (e.g., from 32-bit to 8-bit or 4-bit) to decrease memory usage and increase inference speed.
Fine-Tuning
The process of adapting a pre-trained machine learning model to a specific task or domain by continuing training on a smaller, task-specific dataset.
GPU Memory Bandwidth
The rate at which data can be read from or written to GPU memory (VRAM), often the primary bottleneck for AI inference performance.
Cold Start
The initial delay when a serverless GPU instance must be provisioned, loaded, and initialized before it can serve its first request.
Model Serving
The infrastructure and process of deploying trained machine learning models as accessible endpoints that can receive inputs and return predictions in real time.