Model Serving
The infrastructure that turns a trained model into a live, scalable endpoint — handling request routing, batching, health checks, versioning, and metrics.
Model serving is the practice of making a trained machine learning model available for inference through an API endpoint. It encompasses loading model weights, preprocessing requests, running inference, postprocessing results, and returning responses to callers. Model serving turns a static artifact into a live service.
A real serving system handles more than the forward pass. It includes request routing and load balancing across replicas, health checks and automatic restart of failed instances, input validation and error handling, batching of concurrent requests, and metrics collection. Production systems also need versioning, rolling updates, canary deployments, and A/B testing — none of which are part of "just call the model."
Several open-source and commercial frameworks address these concerns at different levels. **vLLM** and **TGI** (Text Generation Inference) specialize in LLM serving with optimizations like continuous batching and PagedAttention. **TensorRT-LLM** provides NVIDIA-optimized inference. **Triton Inference Server** is a general-purpose serving framework. Each has strengths depending on the model and deployment shape.
Inference platforms like Cumulus sit one layer above these serving frameworks. Instead of choosing and configuring a serving framework, applications hit an OpenAI-compatible gateway and the platform decides which framework, which model, and which hardware serve each request. This trades some configurability for a significant reduction in operational complexity, and makes the rest of the platform — routing, caching, observability, evaluation, fine-tuning — possible in the first place.
Related Terms
vLLM
An open-source LLM serving engine that introduced PagedAttention. The de facto baseline for high-throughput open-weight model serving.
SGLang
An open-source LLM serving engine that uses RadixAttention to share KV cache across requests with overlapping prefixes — especially strong for agentic and tool-use workloads.
Ion
Cumulus' inference engine. A custom runtime with proprietary attention kernels that serves 30 to 50% more tokens per second than vLLM and SGLang on NVIDIA Grace and Blackwell.
LLM Gateway
An HTTP layer that speaks one normalized protocol — usually OpenAI-compatible — and translates to whatever each downstream provider expects. The seam between application code and the rest of the inference stack.
Inference
Running a trained model on new inputs to produce outputs — predictions, classifications, generations, embeddings. The production half of machine learning.