
CUDA

NVIDIA's parallel computing platform and programming model. The runtime, libraries, and API that let software use NVIDIA GPUs for general-purpose computation.

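CUDA (originally short for Compute Unified Device Architecture) is the parallel computing platform created by NVIDIA. It includes a driver, a runtime, a compiler toolchain built around language extensions for C and C++, and core libraries such as cuBLAS. Related NVIDIA libraries such as cuDNN, NCCL, and TensorRT build on CUDA but are distributed separately from the CUDA Toolkit. Python reaches CUDA through bindings and frameworks rather than through a language extension. CUDA is what lets PyTorch, TensorFlow, JAX, and most major inference engines run on NVIDIA hardware.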

A CUDA program organizes work into kernels: functions that execute in parallel across thousands of GPU threads. Threads are grouped into blocks, and blocks into grids. The GPU assigns each block to one of its streaming multiprocessors (SMs), where threads execute in groups of 32 called warps. Developers write logic at the kernel level; the hardware handles fine-grained scheduling.
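
The thread, block, and grid hierarchy is easier to see in a small example. The sketch below is a CPU-only emulation in plain Python, not real CUDA: the `launch` helper and the `vector_add` signature are hypothetical names made up for illustration, and the loops run sequentially where a GPU would run the threads in parallel. What it does share with a one-dimensional CUDA kernel is the index arithmetic (`blockIdx.x * blockDim.x + threadIdx.x`) and the bounds check.

```python
import math


def launch(kernel, grid_dim, block_dim, *args):
    """Emulate a 1-D kernel launch: call `kernel` once per (block, thread) pair.

    Hypothetical helper for illustration only. A real GPU runs these
    invocations in parallel; this loop runs them one after another.
    """
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


def vector_add(block_idx, block_dim, thread_idx, a, b, out):
    # Global index, as in CUDA: blockIdx.x * blockDim.x + threadIdx.x
    i = block_idx * block_dim + thread_idx
    # The last block may have threads past the end of the data, so guard.
    if i < len(out):
        out[i] = a[i] + b[i]


n = 1000
block_dim = 256                       # threads per block
grid_dim = math.ceil(n / block_dim)   # enough blocks to cover all n elements

a = list(range(n))
b = [2 * x for x in a]
out = [0] * n

launch(vector_add, grid_dim, block_dim, a, b, out)
```

With 1000 elements and 256 threads per block, the launch needs 4 blocks, or 1024 threads in total. The final 24 threads fail the bounds check and do nothing, which is why nearly every CUDA kernel starts with a guard like this one.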

Most AI engineers never write CUDA directly. They call `torch.matmul` on a GPU tensor, and PyTorch dispatches optimized CUDA kernels, typically from libraries such as cuBLAS or cuDNN. The framework handles memory allocation, host-device transfers, kernel selection, and synchronization. Understanding CUDA basics still matters when debugging performance or evaluating hardware. Working out why an attention kernel is memory-bound on one GPU generation but compute-bound on another (Hopper versus Blackwell, say) requires reading at the CUDA layer.
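
The memory-bound versus compute-bound distinction can be estimated with roofline-style arithmetic: compare a kernel's arithmetic intensity (FLOPs per byte moved) with the hardware's ratio of peak compute to peak memory bandwidth. The sketch below uses a plain matrix multiply as a stand-in for an attention kernel. The hardware figures are made-up round numbers, not the specification of any real GPU, and the byte count assumes each matrix crosses the memory bus exactly once, which is an idealization.

```python
def matmul_intensity(m, n, k, bytes_per_element=2):
    """FLOPs per byte for C[m, n] = A[m, k] @ B[k, n].

    Idealized: assumes A and B are each read once and C is written once.
    bytes_per_element=2 corresponds to a 16-bit type such as FP16 or BF16.
    """
    flops = 2 * m * n * k  # one multiply and one add per term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


def classify(intensity, peak_flops, peak_bandwidth):
    """Compare intensity with the 'ridge point' of a simple roofline model."""
    ridge = peak_flops / peak_bandwidth  # FLOPs the chip can do per byte fetched
    return "compute-bound" if intensity >= ridge else "memory-bound"


# Hypothetical accelerator: illustrative numbers only, not a real GPU spec.
PEAK_FLOPS = 1.0e15       # 1 PFLOP/s
PEAK_BANDWIDTH = 2.0e12   # 2 TB/s, so the ridge point is 500 FLOPs/byte

# One token against a 4096 x 4096 weight matrix (decode-like shape).
decode = matmul_intensity(1, 4096, 4096)
# 4096 tokens against the same matrix (prefill-like shape).
prefill = matmul_intensity(4096, 4096, 4096)

decode_regime = classify(decode, PEAK_FLOPS, PEAK_BANDWIDTH)
prefill_regime = classify(prefill, PEAK_FLOPS, PEAK_BANDWIDTH)
```

Under these assumptions the single-token shape comes out at roughly 1 FLOP per byte and is memory-bound, while the 4096-token shape comes out at about 1365 FLOPs per byte and is compute-bound. Changing `PEAK_FLOPS` or `PEAK_BANDWIDTH` moves the ridge point, which is how the same kernel can fall into different regimes on different hardware.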

CUDA's dominance creates strong lock-in to NVIDIA hardware. AMD's ROCm and Intel's oneAPI exist as alternatives, but CUDA's mature tooling and library ecosystem keep most production inference on NVIDIA. Inference platforms like Cumulus abstract CUDA management away: driver versions, library compatibility, and kernel selection are handled as part of the platform infrastructure.