Inference Glossary

Ion

Cumulus' inference engine: a custom runtime with proprietary attention kernels that serves 30 to 50% more tokens per second than vLLM and SGLang on NVIDIA Grace Hopper and Blackwell.

Ion is the inference runtime built by Cumulus Labs. It executes the forward pass for every model that Cumulus serves on its own hardware — open-weight models, LoRA fine-tunes, and Cumulus' own custom-hosted variants. Ion is purpose-built for NVIDIA Grace Hopper and Blackwell architectures and exploits their specific advantages: NVLink-C2C coherency between CPU and GPU memory, Hopper's Tensor Memory Accelerator, and Blackwell's expanded HBM.

The core differentiator of Ion is its attention kernel, **IonAttention**. Where vLLM's PagedAttention focuses on managing KV-cache memory across many concurrent requests, IonAttention focuses on three things: overlapping prefill and decode through phantom-tile scheduling, eager KV writeback through TMA, and a bounded working set that drains to LPDDR over NVLink-C2C. On the same GH200 hardware, IonAttention delivers 30 to 50% more tokens per second than vLLM and SGLang for the production workloads Cumulus serves.

Ion is not a drop-in replacement for vLLM in every situation. At batch size one on tiny models, the overhead of warp specialization outweighs its benefit. On hardware outside the Grace Hopper and Blackwell architectures that Ion targets, IonAttention will not build. The Cumulus Router selects Ion when Ion is the right answer for a given workload and dispatches elsewhere when it is not.
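The selection rule described above can be sketched as a small decision function. This is an illustration only, not the Cumulus Router's actual logic: `Workload`, `choose_engine`, the `"fallback"` label, the supported-architecture set, and the `TINY_MODEL_PARAMS` threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass

# Assumption: architectures the kernel is built for, taken from this entry's
# statement that Ion targets Grace Hopper and Blackwell.
SUPPORTED_ARCHS = frozenset({"hopper", "blackwell"})

# Hypothetical cutoff for a "tiny" model; the entry does not define one.
TINY_MODEL_PARAMS = 1_000_000_000


@dataclass(frozen=True)
class Workload:
    gpu_arch: str       # e.g. "hopper", "blackwell", "ampere"
    batch_size: int     # concurrent sequences per forward pass
    model_params: int   # parameter count of the served model


def choose_engine(workload: Workload) -> str:
    """Return "ion" when the workload fits Ion's strengths, else "fallback"."""
    # The kernel does not build for unsupported architectures at all.
    if workload.gpu_arch.lower() not in SUPPORTED_ARCHS:
        return "fallback"
    # Batch size one on a tiny model: specialization overhead is not repaid.
    if workload.batch_size == 1 and workload.model_params < TINY_MODEL_PARAMS:
        return "fallback"
    return "ion"
```

A real router would presumably weigh more signals (latency targets, queue depth, model availability); the sketch captures only the two exclusions named in this entry.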

Ion is the engine behind Cumulus' custom hosting subsystem, the Ion-served entries in the routing graph, and the throughput advantage that lets the platform offer competitive per-token pricing. It is the lowest layer of the stack, and it is what makes the higher layers economically viable.
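The link between throughput and per-token pricing is simple arithmetic, sketched here under one assumption: the cost of running the hardware per hour stays fixed, so cost per token is inversely proportional to tokens per second. The function name `relative_cost_per_token` is hypothetical.

```python
def relative_cost_per_token(throughput_gain: float) -> float:
    """Cost per token relative to the baseline engine (baseline = 1.0).

    throughput_gain is fractional: 0.30 means 30% more tokens per second.
    Assumes hardware cost per hour is the same for both engines.
    """
    if throughput_gain <= -1.0:
        raise ValueError("throughput_gain must be greater than -1.0")
    return 1.0 / (1.0 + throughput_gain)


low = round(relative_cost_per_token(0.30), 2)   # 0.77
high = round(relative_cost_per_token(0.50), 2)  # 0.67
```

Under that assumption, a 30 to 50% throughput gain corresponds to roughly 23 to 33% lower hardware cost per token.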

Related Terms

vLLM

An open-source LLM serving engine that introduced PagedAttention. The de facto baseline for high-throughput open-weight model serving.

SGLang

An open-source LLM serving engine that uses RadixAttention to share KV cache across requests with overlapping prefixes — especially strong for agentic and tool-use workloads.
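The idea of sharing KV cache across overlapping prefixes can be illustrated with a token-level trie. This is a simplified sketch, not SGLang's implementation: a real radix tree compresses runs of tokens into single edges and handles eviction, and the names `PrefixCache` and `admit` are hypothetical.

```python
class PrefixNode:
    """One token position in the trie; children are keyed by the next token id."""

    def __init__(self):
        self.children = {}


class PrefixCache:
    """Records which token prefixes already have KV entries available."""

    def __init__(self):
        self.root = PrefixNode()

    def match_len(self, tokens):
        """Length of the longest already-cached prefix of `tokens`."""
        node, matched = self.root, 0
        for token in tokens:
            child = node.children.get(token)
            if child is None:
                break
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens):
        """Mark every prefix of `tokens` as cached."""
        node = self.root
        for token in tokens:
            node = node.children.setdefault(token, PrefixNode())


def admit(cache, tokens):
    """Return (reused, computed) token counts for a new request, then record it."""
    reused = cache.match_len(tokens)
    cache.insert(tokens)
    return reused, len(tokens) - reused
```

Two requests that begin with the same system prompt or tool description would, in this sketch, pay prefill cost for the shared tokens only once, which is why the approach suits agentic workloads with long common preambles.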

KV Cache

A buffer that stores previously computed key and value tensors during autoregressive generation, so each new token only requires computing one step of attention instead of replaying the whole sequence.
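A minimal single-head sketch shows why the cache saves work. It assumes toy projections in place of learned weight matrices (the `project` function is a stand-in, not a real model layer) and uses plain Python lists so it runs without dependencies.

```python
import math


def attend(query, keys, values):
    """Softmax attention of one query vector over all stored keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    width = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(width)]


def project(x):
    """Toy stand-in for the learned query/key/value projections (assumption)."""
    return list(x), [0.5 * xi for xi in x], [2.0 * xi for xi in x]


class KVCache:
    """Append-only store of the key and value vectors of past tokens."""

    def __init__(self):
        self.keys, self.values = [], []
        self.projections = 0  # how many tokens were projected in total

    def step(self, x):
        """Project only the new token, append it, attend over the cache."""
        query, key, value = project(x)
        self.projections += 1
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)


def step_without_cache(tokens):
    """Re-project the whole sequence, then attend from the last token."""
    projected = [project(x) for x in tokens]
    keys = [key for _, key, _ in projected]
    values = [value for _, _, value in projected]
    return attend(projected[-1][0], keys, values), len(projected)
```

Both paths produce the same attention output at every step. The difference is the work done: over a sequence of n tokens the cached path projects n tokens in total, while the uncached path projects 1 + 2 + ... + n.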

Inference

Running a trained model on new inputs to produce outputs — predictions, classifications, generations, embeddings. The production half of machine learning.