Inference Glossary

LLM Gateway

An HTTP layer that speaks one normalized protocol — usually OpenAI-compatible — and translates to whatever each downstream provider expects. The seam between application code and the rest of the inference stack.

An LLM gateway is the front door of an inference platform. It accepts requests in a single normalized protocol — almost always OpenAI-compatible, because the OpenAI SDK is the closest thing to a lingua franca for AI clients — and translates them to whatever the downstream provider (OpenAI itself, Anthropic, an open-weight model on Ion, a fine-tune) actually expects.

The drop-in story is the headline benefit. An application built against the OpenAI SDK can move to a gateway by changing the `base_url` and the API key. Method signatures, request shapes, response shapes, streaming, tool calls, and structured outputs all pass through. The application code does not know anything has changed.

The gateway is also the place where authentication, rate limiting, request shaping, and the entry point for every other subsystem live. Routing happens after the gateway. Caching happens after the gateway. Observability is written from the gateway. Evaluation reads from the audit log the gateway populates. Without a gateway, none of the higher-level subsystems can be built without invasive code changes in the application.

The other benefit is that the gateway makes provider lock-in optional. An application that talks directly to OpenAI cannot trivially route 10% of traffic to Anthropic for evaluation, or fall back to an open-weight model during an outage, without rewriting the call sites. An application that talks to a gateway gets both for free.

Related Terms

LLM Router

A component that decides, per request, which model and which provider should serve it — based on declared rules, health checks, latency budget, and cost constraints.

Prompt Cache

A response cache keyed on the request — exact-match for identical requests, prefix for shared system prompts, and optionally semantic for paraphrased questions. Cuts input tokens dramatically on real traffic.

LLM Observability

A replayable audit log of every model call — input, output, model, provider, latency, cost, and quality signals — plus real-time dashboards over the same data.

Inference

Running a trained model on new inputs to produce outputs — predictions, classifications, generations, embeddings. The production half of machine learning.