Back to Inference Glossary
Inference Glossary

Model Serving

The infrastructure that turns a trained model into a live, scalable endpoint — handling request routing, batching, health checks, versioning, and metrics.

Model serving is the practice of making a trained machine learning model available for inference through an API endpoint. It encompasses loading model weights, preprocessing requests, running inference, postprocessing results, and returning responses to callers. Model serving turns a static artifact into a live service.

A real serving system handles more than the forward pass. It includes request routing and load balancing across replicas, health checks and automatic restart of failed instances, input validation and error handling, batching of concurrent requests, and metrics collection. Production systems also need versioning, rolling updates, canary deployments, and A/B testing — none of which are part of "just call the model."

Several open-source and commercial frameworks address these concerns at different levels. **vLLM** and **TGI** (Text Generation Inference) specialize in LLM serving with optimizations like continuous batching and PagedAttention. **TensorRT-LLM** provides NVIDIA-optimized inference. **Triton Inference Server** is a general-purpose serving framework. Each has strengths depending on the model and deployment shape.

Inference platforms like Cumulus sit one layer above these serving frameworks. Instead of choosing and configuring a serving framework, applications hit an OpenAI-compatible gateway and the platform decides which framework, which model, and which hardware serve each request. This trades some configurability for a significant reduction in operational complexity, and makes the rest of the platform — routing, caching, observability, evaluation, fine-tuning — possible in the first place.