Model Serving
The infrastructure and process of deploying trained machine learning models as accessible endpoints that can receive inputs and return predictions in real time.
Model serving is the practice of making a trained machine learning model available for inference through an HTTP API, gRPC service, or other interface. It encompasses loading model weights onto compute hardware, preprocessing incoming requests, running inference, postprocessing results, and returning predictions to the caller. Model serving transforms a static artifact (a trained model) into a live, scalable service.
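The request path described above can be sketched as a minimal pipeline. This is an illustrative sketch, not a real framework's API: the "model" is a hypothetical stand-in (a dict of token weights), and the function names are assumptions.

```python
def load_model():
    # In practice: read weights from disk and move them to GPU memory.
    # Here, a toy dict of token weights stands in for a real model.
    return {"hello": 0.9, "world": 0.1}

MODEL = load_model()  # loaded once at startup, reused for every request

def preprocess(raw_text: str) -> list[str]:
    # Validate and tokenize the incoming request payload.
    if not raw_text:
        raise ValueError("empty input")
    return raw_text.lower().split()

def infer(tokens: list[str]) -> float:
    # Stand-in for a forward pass: score tokens against loaded weights.
    return sum(MODEL.get(t, 0.0) for t in tokens)

def postprocess(score: float) -> dict:
    # Shape the raw output into the response the caller expects.
    return {"score": round(score, 3),
            "label": "positive" if score > 0.5 else "negative"}

def handle_request(raw_text: str) -> dict:
    # The full serving path for one request.
    return postprocess(infer(preprocess(raw_text)))
```

The key structural point is that loading happens once at startup while the preprocess/infer/postprocess path runs per request; a real serving framework wraps this same shape behind an HTTP or gRPC endpoint.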
A model serving system must handle several concerns beyond simple inference. These include request routing and load balancing across multiple replicas, health checking and automatic restart of failed instances, input validation and error handling, batching of concurrent requests for efficiency, and metrics collection for monitoring. Production serving systems also need versioning support to enable rolling updates, canary deployments, and A/B testing of model versions.
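Of the concerns above, request batching is the least obvious, so here is a hedged sketch of server-side micro-batching using only the standard library: concurrent requests land on a queue, and a worker drains up to a batch-size limit (or whatever has arrived within a short timeout) and runs them as one batched forward pass. The constants, the toy `batched_infer`, and the reply-queue convention are all assumptions for illustration.

```python
import queue
import threading

MAX_BATCH = 8      # most requests fused into one forward pass
TIMEOUT_S = 0.01   # how long the worker waits for the first request

def batched_infer(inputs):
    # Stand-in for one batched forward pass; a real model amortizes
    # GPU kernel-launch and memory-transfer overhead here.
    return [len(x) for x in inputs]

def serve_loop(requests: queue.Queue, stop: threading.Event):
    # Each queued item is (payload, reply_queue) for one caller.
    while not stop.is_set() or not requests.empty():
        batch = []
        try:
            batch.append(requests.get(timeout=TIMEOUT_S))
        except queue.Empty:
            continue
        # Greedily fill the rest of the batch without waiting further.
        while len(batch) < MAX_BATCH:
            try:
                batch.append(requests.get_nowait())
            except queue.Empty:
                break
        outputs = batched_infer([payload for payload, _ in batch])
        for (_, reply), out in zip(batch, outputs):
            reply.put(out)
```

A caller submits `(payload, reply_queue)` and blocks on its own reply queue; frameworks like vLLM implement a far more sophisticated version of this idea (continuous batching) at the token level.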
Several open-source and commercial serving frameworks have emerged to address these requirements. vLLM and TGI (Text Generation Inference) specialize in large language model serving with optimizations like continuous batching and PagedAttention. TensorRT-LLM provides NVIDIA-optimized inference with quantization support. Triton Inference Server offers a general-purpose serving framework supporting multiple model formats and frameworks. Each has different strengths depending on the model type and deployment requirements.
Serverless GPU platforms like Cumulus represent a higher-level abstraction for model serving. Instead of configuring and managing a serving framework, developers deploy a model and the platform handles framework selection, infrastructure provisioning, scaling, and monitoring. This approach trades some configurability for significant reduction in operational complexity, making it accessible to teams without dedicated ML infrastructure engineers.
The choice of serving infrastructure has direct implications for latency, throughput, cost, and reliability. Key decisions include which serving framework to use, how to handle model loading and warm-up, what batch sizes and timeout policies to configure, and how to set up autoscaling. For teams deploying multiple models with different characteristics, the serving layer needs to support per-model configuration while maintaining a consistent operational model.
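The per-model knobs listed above (batch size, timeout policies, autoscaling bounds) can be captured as data so that each model is tuned independently under one operational model. This is an illustrative sketch; the field names and defaults are assumptions, not any specific framework's configuration schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ServingConfig:
    model_name: str
    max_batch_size: int = 8         # requests fused into one forward pass
    batch_timeout_ms: int = 10      # how long to wait to fill a batch
    request_timeout_s: float = 30.0 # hard deadline before erroring out
    min_replicas: int = 1           # autoscaling floor (0 = scale to zero)
    max_replicas: int = 4           # autoscaling ceiling to cap cost

    def validate(self) -> None:
        if self.max_batch_size < 1:
            raise ValueError("max_batch_size must be >= 1")
        if self.min_replicas > self.max_replicas:
            raise ValueError("min_replicas cannot exceed max_replicas")

# A latency-sensitive chat model and a throughput-oriented embedding
# model can then share one serving layer with different settings:
chat = ServingConfig("chat-model", max_batch_size=4, batch_timeout_ms=2)
embedder = ServingConfig("embedder", max_batch_size=64,
                         batch_timeout_ms=50, min_replicas=0)
```

The trade-off the fields encode is the one the section describes: a small batch and short batch timeout favor latency, a large batch favors throughput, and `min_replicas=0` trades cold-start latency for cost.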
Related Terms
GPU Inference
The process of running a trained machine learning model on a GPU to generate predictions or outputs from new input data.
vLLM
An open-source, high-throughput serving engine for large language models that uses PagedAttention to efficiently manage GPU memory and maximize inference performance.
Batch Inference
The technique of grouping multiple inference requests together and processing them simultaneously on a GPU to maximize throughput and hardware utilization.
Serverless GPU
A cloud computing model where GPU resources are automatically provisioned on demand and billed per second of actual use, with no server management required.
Container Orchestration
The automated management of containerized application lifecycles, including deployment, scaling, networking, and health monitoring across a cluster of machines.