Inference Glossary

CUDA

NVIDIA's parallel computing platform and programming model. The runtime, libraries, and API that let software use NVIDIA GPUs for general-purpose computation.

CUDA (Compute Unified Device Architecture) is the parallel computing platform created by NVIDIA. It includes a driver, a runtime, a compiler toolchain, and language extensions for C and C++; Python reaches it through bindings and libraries such as CuPy and Numba rather than through language extensions. NVIDIA also ships a set of libraries built on top of it (cuBLAS, cuDNN, NCCL, TensorRT, and many others). CUDA is what lets PyTorch, TensorFlow, JAX, and most major inference engines run on NVIDIA hardware.

A CUDA program organizes work into kernels: functions that execute in parallel across thousands of GPU threads. Threads are grouped into blocks and blocks into grids. Blocks are distributed across the GPU's streaming multiprocessors, where their threads execute in groups of 32 called warps. Developers write logic at the kernel level; the hardware handles fine-grained scheduling.
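The indexing scheme behind that hierarchy can be sketched without a GPU. The snippet below is a pure-Python emulation, not CUDA itself: `vector_add_kernel` and `launch` are hypothetical names chosen for illustration, and the nested loops stand in for threads that would run in parallel on real hardware.

```python
# Pure-Python emulation of CUDA's grid/block/thread indexing (no GPU needed).
# On real hardware every (block, thread) pair runs in parallel; here we loop.

def vector_add_kernel(block_idx, block_dim, thread_idx, a, b, out):
    """Body of a hypothetical vector-add kernel, executed once per thread."""
    i = block_idx * block_dim + thread_idx  # global thread index
    if i < len(out):                        # guard: the last block may overshoot
        out[i] = a[i] + b[i]

def launch(kernel, grid_dim, block_dim, *args):
    """Stand-in for a kernel launch: visit every thread of every block."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)

n = 10
a = list(range(n))
b = [10 * x for x in a]
out = [0] * n

block_dim = 4                                # threads per block
grid_dim = (n + block_dim - 1) // block_dim  # ceil(n / block_dim) blocks
launch(vector_add_kernel, grid_dim, block_dim, a, b, out)
print(out)
```

The bounds check matters because the grid is rounded up to whole blocks: with 10 elements and 4 threads per block, 12 threads are launched and the last two must do nothing.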

Most AI engineers never write CUDA directly. They call `torch.matmul` on a GPU tensor, and PyTorch dispatches optimized CUDA kernels, typically through cuBLAS. The framework handles memory allocation, host-device transfers, kernel selection, and synchronization. Understanding CUDA basics still matters when debugging performance or evaluating hardware: working out why an attention kernel might be memory-bound on Hopper but compute-bound on Blackwell requires reading at the CUDA layer.
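A rough way to reason about memory-bound versus compute-bound is a roofline-style estimate: compare a kernel's arithmetic intensity (FLOPs per byte moved) with the ratio of the GPU's peak compute to its peak memory bandwidth. The sketch below assumes an idealized matrix multiply that reads each input once and writes the output once; the peak numbers are illustrative placeholders, not specifications for any real GPU, and the function names are hypothetical.

```python
# Roofline-style estimate: is a kernel limited by memory bandwidth or by math?
# The hardware numbers below are illustrative placeholders, not vendor specs.

def matmul_intensity(m, n, k, bytes_per_element=2):
    """FLOPs per byte for C[m,n] = A[m,k] @ B[k,n], with A and B read once and C written once."""
    flops = 2 * m * n * k                       # one multiply and one add per term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved

def classify(intensity, peak_flops, peak_bandwidth):
    """Compare the kernel's intensity with the hardware's ridge point."""
    ridge = peak_flops / peak_bandwidth         # FLOPs available per byte loaded
    return "compute-bound" if intensity >= ridge else "memory-bound"

PEAK_FLOPS = 1.0e15        # hypothetical: 1 PFLOP/s
PEAK_BANDWIDTH = 4.0e12    # hypothetical: 4 TB/s, so the ridge is 250 FLOPs/byte

decode = matmul_intensity(1, 4096, 4096)        # single-row, matrix-vector shape
prefill = matmul_intensity(4096, 4096, 4096)    # large square matrix-matrix shape

print(classify(decode, PEAK_FLOPS, PEAK_BANDWIDTH))
print(classify(prefill, PEAK_FLOPS, PEAK_BANDWIDTH))
```

Under these assumed numbers the matrix-vector shape lands well below the ridge point and the large square shape well above it. Changing either peak value moves the ridge, which is why the same kernel can sit on different sides of it on different hardware.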

CUDA's dominance creates strong lock-in to NVIDIA hardware. AMD's ROCm and Intel's oneAPI exist as alternatives, but CUDA's mature tooling and library ecosystem keep most production inference on NVIDIA. Inference platforms like Cumulus abstract CUDA management away: driver versions, library compatibility, and kernel selection are handled by the platform infrastructure.

Related Terms

Tensor Cores

Specialized hardware units in NVIDIA GPUs that perform a fused matrix multiply-and-accumulate on a small matrix tile per clock cycle, accelerating deep learning by up to an order of magnitude over standard CUDA cores.

Ion

Cumulus' inference engine. A custom runtime with proprietary attention kernels that serves 30 to 50% more tokens per second than vLLM and SGLang on NVIDIA Grace and Blackwell.

vLLM

An open-source LLM serving engine that introduced PagedAttention. The de facto baseline for high-throughput open-weight model serving.

Inference

Running a trained model on new inputs to produce outputs — predictions, classifications, generations, embeddings. The production half of machine learning.