Back to Inference Glossary
Inference Glossary

OpenAI-Compatible API

An HTTP API that accepts the same request shape and returns the same response shape as the OpenAI Chat Completions endpoint — letting any OpenAI SDK client point at a different base URL.

An OpenAI-compatible API is an HTTP endpoint that accepts the same request bodies and returns the same response bodies as OpenAI's Chat Completions, Embeddings, and related endpoints. The practical consequence is that any client written against the OpenAI SDK — in Python, TypeScript, Go, Ruby, or any other language — can be pointed at a non-OpenAI service by changing the `base_url` and the API key.

This compatibility has become the de facto standard for LLM serving because the OpenAI SDK ecosystem is enormous. Anthropic's Claude API, the LiteLLM proxy, vLLM's serving mode, SGLang, Ollama, and inference platforms like Cumulus all expose an OpenAI-compatible endpoint. An application written six months ago against `openai.chat.completions.create` can move to any of them with a one-line change.

The features that are well-supported across compatible implementations are: chat completions with system and user messages, streaming responses, tool calling, structured output via JSON schema, embeddings, and most modalities (vision, audio). Edge cases — fine-tune training endpoints, file APIs, organization management — are less consistently supported, but for the inference path the compatibility is usually clean.

For inference platforms, OpenAI compatibility is what makes the "drop-in" pitch real. The Cumulus Gateway is OpenAI-compatible at `api.cumuluslabs.io/v1`, and the entire stack — routing, caching, observability, evaluation — sits behind that interface without requiring any application code changes.