Inference Glossary

Tensor Cores

Specialized hardware units in NVIDIA GPUs that perform matrix multiply-and-accumulate operations in a single clock cycle, accelerating deep learning by an order of magnitude over standard CUDA cores.

Tensor Cores are specialized processing units found in NVIDIA GPUs starting with the Volta architecture. They perform mixed-precision matrix multiply-and-accumulate operations as a single hardware primitive — one Tensor Core can multiply two 4x4 matrices and add a third in a single clock cycle. This architectural choice maps directly onto the linear algebra at the heart of deep learning.

Standard CUDA cores process one floating-point operation per clock cycle; Tensor Cores process entire matrix operations. The throughput difference is dramatic. On an NVIDIA H100, Tensor Cores deliver close to a petaflop of mixed-precision compute, compared to a small fraction of that from standard cores alone.

Tensor Cores support a growing set of precision formats depending on the GPU generation: FP16, BF16, TF32, INT8, FP8, and on Blackwell, FP4. Lower-precision formats enable higher throughput at the cost of numerical precision, which is often acceptable for inference workloads where the model has been quantized appropriately. Frameworks like PyTorch and JAX, and inference engines like vLLM, SGLang, TensorRT-LLM, and Ion, automatically dispatch matrix operations to Tensor Cores when shapes and data types are compatible.

Getting the most out of Tensor Cores requires aligning matrix dimensions to the Tensor Core tile sizes, typically multiples of 8 or 16 depending on precision. Inference engines that have done this work — Ion among them — extract substantially more throughput from the same hardware than naive implementations.

Related Terms

Model Quantization

Reducing the numerical precision of a model's weights and activations — from 32-bit to 16, 8, or 4 bits — to shrink memory footprint and speed up memory-bandwidth-bound inference.

Ion

Cumulus' inference engine. A custom runtime with proprietary attention kernels that serves 30 to 50% more tokens per second than vLLM and SGLang on NVIDIA Grace and Blackwell.

KV Cache

A buffer that stores previously computed key and value tensors during autoregressive generation, so each new token only requires computing one step of attention instead of replaying the whole sequence.

Inference

Running a trained model on new inputs to produce outputs — predictions, classifications, generations, embeddings. The production half of machine learning.