GPU Glossary
A comprehensive reference for GPU computing, AI infrastructure, and serverless inference terminology. From cold starts to tensor cores, understand the concepts that power modern AI deployment.
Cold Start
The initial delay when a serverless GPU instance must be provisioned, loaded, and initialized before it can serve its first request.
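The effect is easy to see in a toy simulation. This is a minimal sketch, not a real serving stack: the `Worker` class, `MODEL_LOAD_SECONDS` constant, and the load delay are all illustrative stand-ins for provisioning and weight loading.

```python
import time

MODEL_LOAD_SECONDS = 0.2  # stand-in for provisioning + weight loading


class Worker:
    """Toy serverless worker: the first request pays the cold-start cost."""

    def __init__(self):
        self.model = None

    def handle(self, prompt):
        if self.model is None:           # cold start: load before serving
            time.sleep(MODEL_LOAD_SECONDS)
            self.model = "loaded"
        return f"response to {prompt!r}"


worker = Worker()

start = time.perf_counter()
worker.handle("first")                   # cold: includes the load delay
cold = time.perf_counter() - start

start = time.perf_counter()
worker.handle("second")                  # warm: model already resident
warm = time.perf_counter() - start
```

The first request's latency includes the entire load step; every subsequent request served by the same warm worker skips it.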
GPU Inference
The process of running a trained machine learning model on a GPU to generate predictions or outputs from new input data.
Serverless GPU
A cloud computing model where GPU resources are automatically provisioned on demand and billed per second of actual use, with no server management required.
Tensor Cores
Specialized hardware units within NVIDIA GPUs designed to accelerate matrix multiplication and convolution operations used in deep learning.
GPU Memory Bandwidth
The rate at which data can be read from or written to GPU memory (VRAM), often the primary bottleneck for AI inference performance.
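For memory-bound LLM decoding, bandwidth sets a hard ceiling on speed: each generated token must stream the full weight set from VRAM, so tokens/sec can never exceed bandwidth divided by model size. A back-of-the-envelope sketch, using assumed illustrative numbers (a hypothetical 7B-parameter model at fp16 and an H100-class ~3.35 TB/s figure):

```python
# Rough decode-speed ceiling for a memory-bound LLM: every generated token
# reads all weights once, so tokens/sec <= bandwidth / model_bytes.
params = 7e9                     # hypothetical 7B-parameter model
bytes_per_param = 2              # fp16
model_bytes = params * bytes_per_param   # 14 GB of weights

bandwidth = 3.35e12              # ~3.35 TB/s, an H100-class figure (assumed)

max_tokens_per_sec = bandwidth / model_bytes
print(round(max_tokens_per_sec))  # ~239 tokens/sec upper bound
```

Real throughput lands below this bound, but the calculation explains why quantization (smaller `model_bytes`) and batching (amortizing each weight read across requests) are such effective levers.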
Model Quantization
The process of reducing the numerical precision of a model's weights and activations (e.g., from 32-bit to 8-bit or 4-bit) to decrease memory usage and increase inference speed.
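The core idea can be shown with symmetric int8 quantization in a few lines. This is a simplified sketch (real schemes use per-channel scales, zero points, and calibration); the function names are illustrative:

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: map [-max|w|, +max|w|] onto [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]


weights = [0.05, -1.27, 0.8, 0.0]
q, scale = quantize_int8(weights)   # 4x smaller: one int8 per weight
restored = dequantize(q, scale)     # close to the originals, within the scale step
```

Each weight now takes one byte instead of four, at the cost of rounding error bounded by half the scale step.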
vLLM
An open-source, high-throughput serving engine for large language models that uses PagedAttention to efficiently manage GPU memory and maximize inference performance.
KV Cache
A memory buffer that stores previously computed key and value tensors during autoregressive language model inference to avoid redundant recalculation for each new token.
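A toy version makes the mechanism concrete. This sketch replaces the real attention projections with a stand-in function; the class and names are illustrative, not any particular framework's API:

```python
class KVCache:
    """Toy KV cache: store per-token key/value pairs so each decode step
    only computes projections for the newest token."""

    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def __len__(self):
        return len(self.keys)


def fake_projection(token):
    # stand-in for the key/value projections of a real attention layer
    return token * 2, token * 3


cache = KVCache()
for token in [1, 2, 3]:              # autoregressive decode, one token at a time
    k, v = fake_projection(token)    # computed exactly once per token...
    cache.append(k, v)               # ...then reused by every later step
```

Without the cache, step N would recompute keys and values for all N previous tokens; with it, each step does O(1) new projection work at the cost of memory that grows linearly with sequence length.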
Batch Inference
The technique of grouping multiple inference requests together and processing them simultaneously on a GPU to maximize throughput and hardware utilization.
GPU Autoscaling
The automatic adjustment of the number of GPU instances serving a workload based on real-time demand, scaling up during traffic spikes and down during quiet periods.
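A common policy is to target a fixed per-replica concurrency. The sketch below is a hypothetical policy function, not any platform's actual autoscaler:

```python
import math


def desired_replicas(in_flight, per_replica_capacity,
                     min_replicas=0, max_replicas=16):
    """Target replica count from current in-flight requests (illustrative policy)."""
    if in_flight == 0:
        return min_replicas            # enables scale-to-zero when min is 0
    want = math.ceil(in_flight / per_replica_capacity)
    return max(min_replicas, min(max_replicas, want))


assert desired_replicas(0, 4) == 0     # idle: no GPUs allocated
assert desired_replicas(9, 4) == 3     # 9 requests at 4 each -> 3 replicas
assert desired_replicas(100, 4) == 16  # capped at max_replicas
```

Production autoscalers add smoothing and cooldown windows on top of a rule like this, so brief spikes do not thrash replica counts up and down.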
CUDA
NVIDIA's parallel computing platform and programming model that enables developers to use NVIDIA GPUs for general-purpose computing, including AI training and inference.
Model Serving
The infrastructure and process of deploying trained machine learning models as accessible endpoints that can receive inputs and return predictions in real time.
GPU Utilization
A metric measuring the percentage of time a GPU's compute cores are actively processing work, indicating how efficiently the hardware is being used.
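The metric itself is a simple ratio over a sampling window; the numbers below are illustrative:

```python
def utilization_pct(busy_seconds, window_seconds):
    """Percentage of the sampling window during which kernels were executing."""
    return 100.0 * busy_seconds / window_seconds


# e.g. kernels ran for 0.45 s of a 1 s sampling window
u = utilization_pct(busy_seconds=0.45, window_seconds=1.0)   # 45.0
```

Note the caveat: this counts *time busy*, not *work done*, so a GPU can report high utilization while running small, inefficient kernels that leave most compute units idle.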
Scale to Zero
The ability of a serverless platform to completely deallocate all compute resources when there are no active requests, reducing cost to zero during idle periods.
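One common trigger is an idle timeout. This is a policy sketch with hypothetical names and a default window chosen for illustration:

```python
import time


class IdleScaler:
    """Scale-to-zero policy sketch: deallocate after a fixed idle window."""

    def __init__(self, idle_timeout_s=300):
        self.idle_timeout_s = idle_timeout_s
        self.last_request_at = time.monotonic()

    def on_request(self):
        self.last_request_at = time.monotonic()

    def should_scale_to_zero(self, now=None):
        now = time.monotonic() if now is None else now
        return now - self.last_request_at >= self.idle_timeout_s


scaler = IdleScaler(idle_timeout_s=300)
t0 = scaler.last_request_at
# 10 s after the last request: stay up; 301 s after: release the GPU
```

The trade-off is the `idle_timeout_s` knob: a short window saves money but exposes more users to cold starts, since the next request after deallocation must re-provision from scratch.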
Pay-Per-Compute
A pricing model where users are billed only for the actual GPU compute time consumed during inference, rather than paying for reserved instances by the hour.
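The difference shows up clearly for bursty workloads. The prices below are assumed for illustration only, not quotes from any provider:

```python
def serverless_cost(busy_seconds, price_per_second):
    """Bill only for seconds of actual GPU work."""
    return busy_seconds * price_per_second


def reserved_cost(wall_hours, price_per_hour):
    """A reserved instance bills for every hour, busy or idle."""
    return wall_hours * price_per_hour


# Hypothetical prices: $0.001/s serverless vs $2.50/h reserved.
# A bursty service with 600 s of GPU work spread over a 24 h day:
serverless = serverless_cost(600, 0.001)   # $0.60
reserved = reserved_cost(24, 2.50)         # $60.00
```

The crossover flips for sustained high traffic: once a GPU is busy most of the day, a reserved instance's flat rate usually beats per-second pricing.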
Container Orchestration
The automated management of containerized application lifecycles, including deployment, scaling, networking, and health monitoring across a cluster of machines.
GPU Cluster
A group of interconnected servers equipped with GPUs that work together to provide scalable compute capacity for AI training, inference, and other GPU-accelerated workloads.
Fine-Tuning
The process of adapting a pre-trained machine learning model to a specific task or domain by continuing training on a smaller, task-specific dataset.
Inference Latency
The total time elapsed from when an inference request is received to when the response is returned, including preprocessing, model execution, and postprocessing.
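Measuring the stages separately shows where the time goes. The pipeline below is a stand-in (the `model` step fakes a forward pass with a short sleep); all names are illustrative:

```python
import time


def timed(fn, *args):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def preprocess(x):
    return x.strip()


def model(x):
    time.sleep(0.01)          # stand-in for the GPU forward pass
    return x.upper()


def postprocess(x):
    return {"text": x}


request = "  hello  "
tokens, t_pre = timed(preprocess, request)
logits, t_model = timed(model, tokens)
response, t_post = timed(postprocess, logits)
total_latency = t_pre + t_model + t_post
```

Breaking latency down this way is what reveals, for example, that tokenization or response serialization, not the model itself, dominates a slow endpoint.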
Model Weights
The learned numerical parameters of a neural network that encode the model's knowledge and capabilities, stored as large multi-dimensional arrays of floating-point numbers.
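Weight count and precision together determine the memory footprint, which ties this entry back to quantization and memory bandwidth. A quick calculation for a hypothetical 7B-parameter model (the sizes are straightforward arithmetic, not vendor figures):

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory needed just to hold the weights, in GB (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


params = 7e9                            # hypothetical 7B-parameter model
fp32 = weight_memory_gb(params, 32)     # 28.0 GB
fp16 = weight_memory_gb(params, 16)     # 14.0 GB
int4 = weight_memory_gb(params, 4)      #  3.5 GB
```

This is a lower bound on VRAM use: serving also needs room for activations and the KV cache, which is why a model whose weights "just fit" on a GPU often fails to serve long sequences.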