Inference Glossary

Ion

Cumulus' inference engine. A custom runtime with proprietary attention kernels that serves 30 to 50% more tokens per second than vLLM and SGLang on NVIDIA Grace Hopper and Blackwell.

Ion is the inference runtime built by Cumulus Labs. It executes the forward pass for every model that Cumulus serves on its own hardware: open-weight models, LoRA fine-tunes, and Cumulus' own custom-hosted variants. Ion is purpose-built for NVIDIA Grace Hopper and Blackwell architectures and exploits their specific advantages: NVLink-C2C coherency between CPU and GPU memory, Hopper's Tensor Memory Accelerator (TMA), and Blackwell's expanded HBM.

The core differentiator of Ion is its attention kernel, **IonAttention**. Where vLLM's PagedAttention focuses on managing KV-cache memory across many concurrent requests, IonAttention focuses on three things: overlapping prefill and decode through phantom-tile scheduling, eager KV writeback through the Tensor Memory Accelerator (TMA), and a bounded working set that drains to CPU-side LPDDR over NVLink-C2C. On the same GH200, IonAttention delivers 30 to 50% more tokens per second than vLLM and SGLang for the production workloads Cumulus serves.

Ion is not a drop-in replacement for vLLM in every situation. At batch size one on tiny models, the overhead of warp specialization outweighs its benefit. On hardware outside the Grace Hopper and Blackwell targets, IonAttention will not build. The Cumulus Router selects Ion when it is the right fit for a given workload and dispatches to another engine when it is not.
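The selection rule described above can be sketched as a small predicate. This is an illustration only, not the Cumulus Router's implementation: `Workload`, `select_engine`, the `ion_attention_builds` flag, and the one-billion-parameter cutoff for a "tiny" model are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass

# Assumed cutoff for what counts as a "tiny" model; the source gives no number.
TINY_MODEL_PARAMS = 1_000_000_000


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of a request class seen by a router."""
    ion_attention_builds: bool  # whether IonAttention builds for the target hardware
    batch_size: int             # number of sequences decoded together
    model_params: int           # parameter count of the model being served


def select_engine(w: Workload) -> str:
    """Return "ion" or "fallback", following the two exclusions named in the text."""
    # Exclusion 1: IonAttention cannot be built for this hardware.
    if not w.ion_attention_builds:
        return "fallback"
    # Exclusion 2: batch size one on a tiny model, where warp-specialization
    # overhead is not worth paying.
    if w.batch_size == 1 and w.model_params < TINY_MODEL_PARAMS:
        return "fallback"
    return "ion"


if __name__ == "__main__":
    big_batched = Workload(ion_attention_builds=True, batch_size=32,
                           model_params=70_000_000_000)
    tiny_single = Workload(ion_attention_builds=True, batch_size=1,
                           model_params=125_000_000)
    print(select_engine(big_batched))  # ion
    print(select_engine(tiny_single))  # fallback
```

A real router would presumably weigh more signals than these two, such as latency targets or current load; the sketch only encodes the cases the text states explicitly.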

Ion is the engine behind Cumulus' custom hosting subsystem, the Ion-served entries in the routing graph, and the throughput claim that lets the platform offer competitive per-token pricing. It is the lowest layer of the stack, and it is what makes the higher layers economically viable.
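The link between throughput and per-token pricing can be made concrete with a little arithmetic. The sketch below assumes that hardware cost per second is identical for both engines and that the hardware is kept equally busy, so cost per token scales as the inverse of tokens per second. The function name `relative_cost_per_token` is hypothetical, and the resulting figures are derived from the 30 to 50% throughput range, not published pricing.

```python
def relative_cost_per_token(speedup: float) -> float:
    """Cost per token relative to a baseline engine.

    Assumes the same hardware cost per second for both engines, so cost per
    token is inversely proportional to throughput.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


if __name__ == "__main__":
    for gain in (0.30, 0.50):
        rel = relative_cost_per_token(1.0 + gain)
        print(f"+{gain:.0%} throughput -> {1 - rel:.1%} lower cost per token")
    # +30% throughput -> 23.1% lower cost per token
    # +50% throughput -> 33.3% lower cost per token
```

Under these assumptions, a 30 to 50% throughput gain corresponds to roughly a 23 to 33% reduction in hardware cost per token.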