Back to Inference Glossary
Inference Glossary

LLM Gateway

An HTTP layer that speaks one normalized protocol — usually OpenAI-compatible — and translates to whatever each downstream provider expects. The seam between application code and the rest of the inference stack.

An LLM gateway is the front door of an inference platform. It accepts requests in a single normalized protocol — almost always OpenAI-compatible, because the OpenAI SDK is the closest thing to a lingua franca for AI clients — and translates them to whatever the downstream provider (OpenAI itself, Anthropic, an open-weight model on Ion, a fine-tune) actually expects.

The drop-in story is the headline benefit. An application built against the OpenAI SDK can move to a gateway by changing the `base_url` and the API key. Method signatures, request shapes, response shapes, streaming, tool calls, and structured outputs all pass through. The application code does not know anything has changed.

The gateway is also the place where authentication, rate limiting, request shaping, and the entry point for every other subsystem live. Routing happens after the gateway. Caching happens after the gateway. Observability is written from the gateway. Evaluation reads from the audit log the gateway populates. Without a gateway, none of the higher-level subsystems can be built without invasive code changes in the application.

The other benefit is that the gateway makes provider lock-in optional. An application that talks directly to OpenAI cannot trivially route 10% of traffic to Anthropic for evaluation, or fall back to an open-weight model during an outage, without rewriting the call sites. An application that talks to a gateway gets both for free.