Inference Glossary

OpenAI-Compatible API

An HTTP API that accepts the same request shape and returns the same response shape as the OpenAI Chat Completions endpoint — letting any OpenAI SDK client point at a different base URL.

An OpenAI-compatible API is an HTTP endpoint that accepts the same request bodies and returns the same response bodies as OpenAI's Chat Completions, Embeddings, and related endpoints. The practical consequence is that any client written against the OpenAI SDK — in Python, TypeScript, Go, Ruby, or any other language — can be pointed at a non-OpenAI service by changing the `base_url` and the API key.

This compatibility has become the de facto standard for LLM serving because the OpenAI SDK ecosystem is enormous. Anthropic's Claude API, the LiteLLM proxy, vLLM's serving mode, SGLang, Ollama, and inference platforms like Cumulus all expose an OpenAI-compatible endpoint. An application written six months ago against `openai.chat.completions.create` can move to any of them with a one-line change.

The features that are well-supported across compatible implementations are: chat completions with system and user messages, streaming responses, tool calling, structured output via JSON schema, embeddings, and most modalities (vision, audio). Edge cases — fine-tune training endpoints, file APIs, organization management — are less consistently supported, but for the inference path the compatibility is usually clean.

For inference platforms, OpenAI compatibility is what makes the "drop-in" pitch real. The Cumulus Gateway is OpenAI-compatible at `api.cumuluslabs.io/v1`, and the entire stack — routing, caching, observability, evaluation — sits behind that interface without requiring any application code changes.

Related Terms

LLM Gateway

An HTTP layer that speaks one normalized protocol — usually OpenAI-compatible — and translates to whatever each downstream provider expects. The seam between application code and the rest of the inference stack.

LLM Router

A component that decides, per request, which model and which provider should serve it — based on declared rules, health checks, latency budget, and cost constraints.

Inference

Running a trained model on new inputs to produce outputs — predictions, classifications, generations, embeddings. The production half of machine learning.

vLLM

An open-source LLM serving engine that introduced PagedAttention. The de facto baseline for high-throughput open-weight model serving.