GPU Glossary

Tensor Cores

Specialized hardware units within NVIDIA GPUs designed to accelerate matrix multiplication and convolution operations used in deep learning.

Tensor Cores are specialized processing units found in NVIDIA GPUs starting with the Volta architecture. They are purpose-built for mixed-precision matrix multiply-accumulate (MMA) operations: a first-generation Tensor Core computes D = A × B + C on 4x4 matrices in a single clock cycle, taking FP16 inputs and accumulating in FP32, dramatically accelerating the linear algebra at the heart of deep learning.
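The MMA operation described above can be sketched in plain Python. This is only an illustration of the arithmetic one Tensor Core performs per clock (the function name is hypothetical); the hardware executes all 64 multiply-adds in parallel, with FP16 inputs and FP32 accumulation.

```python
def tensor_core_mma(a, b, c):
    """Compute D = A @ B + C for 4x4 matrices (lists of lists).

    Models the per-clock operation of a Volta-generation Tensor Core:
    A and B stand in for the FP16 inputs, C and D for the FP32 accumulators.
    """
    n = 4
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = c[i][j]  # start from the accumulator input
            for k in range(n):
                acc += a[i][k] * b[k][j]  # one multiply-add per element pair
            d[i][j] = acc
    return d
```

For example, multiplying the identity by itself with an all-ones accumulator yields a matrix with 2.0 on the diagonal and 1.0 elsewhere.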

A standard CUDA core executes at most one fused multiply-add per clock cycle, while a Tensor Core processes an entire matrix tile. This architectural difference translates to massive throughput improvements for deep learning workloads. On an NVIDIA A100 GPU, Tensor Cores deliver up to 312 teraflops of dense FP16 performance, compared to 19.5 teraflops of FP32 from the standard CUDA cores alone.
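The A100 figures quoted above imply the headline speedup directly:

```python
# Peak throughput figures for the NVIDIA A100, from the paragraph above.
tensor_core_tflops = 312.0  # dense FP16 Tensor Core throughput
cuda_core_tflops = 19.5     # FP32 throughput from CUDA cores alone

speedup = tensor_core_tflops / cuda_core_tflops  # 16x peak advantage
```

Real workloads see less than the peak 16x because they are rarely pure matrix math and are often memory-bound, but the gap explains why deep learning kernels are written to run on Tensor Cores whenever possible.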

Tensor Cores support multiple precision formats including FP16, BF16, TF32, INT8, and FP8 depending on the GPU generation. Lower precision formats like INT8 and FP8 enable even higher throughput at the cost of numerical precision, which is often acceptable for inference workloads. This flexibility allows developers to choose the optimal balance between speed and accuracy for their specific model and use case.

The introduction of Tensor Cores fundamentally changed GPU inference economics. Matrix-heavy operations that take seconds on standard GPU cores can complete in milliseconds with Tensor Core acceleration. Frameworks like PyTorch and TensorFlow automatically leverage Tensor Cores when the data types and operation shapes are compatible, often requiring no code changes from the developer.
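In PyTorch, the usual way to opt into this is autocast, which selects reduced-precision kernels for eligible ops. A minimal sketch, assuming PyTorch is installed; note that on a CPU this only demonstrates the API — Tensor Cores themselves engage only when the code runs on a supported GPU:

```python
import torch

# Pick the GPU path (FP16 autocast) when available; fall back to
# CPU/bfloat16 purely so the snippet runs anywhere.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.bfloat16

a = torch.randn(64, 64, device=device)
b = torch.randn(64, 64, device=device)

with torch.autocast(device_type=device, dtype=dtype):
    c = a @ b  # matmul runs in reduced precision inside the autocast region
```

On a Tensor Core GPU, this matmul is dispatched to Tensor Core kernels with no further code changes, which is the "automatic leverage" described above.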

Modern AI inference engines like vLLM and TensorRT are designed to maximize Tensor Core utilization through techniques like operator fusion, optimal memory layout, and precision-aware scheduling. Getting the most out of Tensor Cores requires ensuring that matrix dimensions are aligned to the Tensor Core tile sizes, typically multiples of 8 or 16 depending on the precision format.
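The alignment requirement above amounts to rounding matrix dimensions up to the tile multiple. A minimal sketch (the function name is hypothetical, and the exact multiple varies by precision format, GPU generation, and library):

```python
def pad_for_tensor_cores(rows, cols, multiple=8):
    """Round matrix dimensions up to the next multiple of the tile alignment.

    A multiple of 8 is typical for FP16 and 16 for INT8; dimensions that
    are not aligned force libraries to fall back to slower kernels or to
    pad internally.
    """
    def round_up(n):
        return ((n + multiple - 1) // multiple) * multiple

    return round_up(rows), round_up(cols)
```

For example, a 60 x 33 FP16 matrix would be padded to 64 x 40 before the matmul, trading a little wasted memory for Tensor Core eligibility.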