Inference Glossary

Prompt Cache

A response cache keyed on the request — exact-match for identical requests, prefix for shared system prompts, and optionally semantic for paraphrased questions. Cuts input tokens dramatically on real traffic.

A prompt cache is a response cache that sits in front of model inference. When a request matches an entry in the cache, the cached response is returned without calling the model. On real production workloads, a well-designed prompt cache can reduce the input tokens sent to the model by 40 to 70%, which translates directly into lower cost and lower latency.

There are three useful kinds of cache, and the cleanest deployments stack all three.

**Exact-match** is the simplest. The cache key is a hash of the full request. If the same prompt with the same parameters arrives, the cached response is returned in single-digit milliseconds. This catches repeated calls from the same client, deterministic test loops, and bot traffic.
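As a minimal sketch of the exact-match idea, the Python below hashes a canonical JSON form of the request and stores responses in an in-memory dict. The names `cache_key`, `ExactMatchCache`, and `call_model` are hypothetical, and a production cache would likely also need expiry and a shared store; this only illustrates the keying.

```python
import hashlib
import json


def cache_key(request: dict) -> str:
    """Hash the full request. Keys are sorted so field order does not change the hash."""
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ExactMatchCache:
    """In-memory exact-match cache (illustrative only: no TTL, no eviction)."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_call(self, request: dict, call_model):
        """Return the cached response, or call the model once and store the result."""
        key = cache_key(request)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        response = call_model(request)  # call_model is a hypothetical callable
        self._store[key] = response
        return response
```

Because the key covers the whole request, changing any parameter (for example the temperature) produces a different key and therefore a miss.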

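**Prefix caching** matches on the longest common prefix between the incoming request and previously served requests. This is where most production savings come from, because system prompts, few-shot examples, tool definitions, and RAG context tend to be identical across thousands of requests. Unlike an exact-match hit, a prefix hit typically still calls the model for the part of the request that differs, but the shared prefix does not have to be processed again. Prefix caching is what makes long shared-context workflows economical.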
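The longest-common-prefix lookup can be sketched with a small trie over token sequences. This is an illustration under stated assumptions, not how any particular provider implements it: `PrefixIndex` is a hypothetical name, and tokens here are whatever the caller passes in (the usage note splits on whitespace purely for readability, where a real system would use the model's tokenizer).

```python
class PrefixIndex:
    """Trie over token sequences; each node is a dict keyed by token."""

    def __init__(self):
        self._root = {}

    def insert(self, tokens):
        """Record a served request so later requests can match its prefix."""
        node = self._root
        for tok in tokens:
            node = node.setdefault(tok, {})

    def longest_prefix_len(self, tokens):
        """Return how many leading tokens match some previously inserted request."""
        node = self._root
        matched = 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched
```

For example, after inserting `"You are a helpful support agent What is your refund policy ?".split()`, a new request that starts with the same six-token system prompt but asks a different question matches on those six tokens, and only the remainder is new work.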

**Semantic caching** uses embeddings to detect paraphrased questions. "What is your refund policy?" and "How do I get a refund?" can return the same cached answer. Semantic caches are powerful but require care — false positives return wrong answers, and the threshold needs tuning per workflow.
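A rough sketch of the threshold check follows. It assumes the caller supplies an `embed` function that maps text to a vector (for instance, a wrapper around an embedding model); `SemanticCache`, `embed`, and the default threshold of 0.9 are hypothetical choices for illustration, and the linear scan stands in for the vector index a real deployment would probably use.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Return a cached answer when a stored prompt is similar enough to the query."""

    def __init__(self, embed, threshold=0.9):
        self._embed = embed          # hypothetical: callable mapping str -> list[float]
        self._threshold = threshold  # assumed default; needs tuning per workflow
        self._entries = []           # list of (vector, response) pairs

    def put(self, prompt, response):
        self._entries.append((self._embed(prompt), response))

    def get(self, prompt):
        """Return the best match at or above the threshold, else None."""
        query = self._embed(prompt)
        best_score, best_response = 0.0, None
        for vec, response in self._entries:
            score = cosine(query, vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self._threshold else None
```

Raising the threshold trades hit rate for safety: fewer paraphrases match, but fewer unrelated questions receive a wrong cached answer.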

The platform that owns the gateway is the only place a cache can correctly account for token cost across providers. A standalone cache that does not know what the response would have cost from OpenAI versus Anthropic versus an open-weight model cannot accurately attribute savings. This is one of the structural reasons inference platforms consolidate gateway, cache, and observability in one place.
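To make the attribution point concrete, here is a small sketch of how a gateway-side cache might price the calls it avoided. Everything in it is an assumption for illustration: the `PRICES` table uses placeholder model names and made-up rates (not real provider pricing), and `CacheHit`, `avoided_cost`, and `total_savings` are hypothetical names.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens; NOT real provider rates.
PRICES = {
    "provider-a/large": {"input": 3.00, "output": 15.00},
    "provider-b/small": {"input": 0.50, "output": 1.50},
}


@dataclass
class CacheHit:
    model: str          # the model the request would have been routed to
    input_tokens: int   # prompt tokens that were not sent
    output_tokens: int  # completion tokens that were not generated


def avoided_cost(hit: CacheHit) -> float:
    """Cost in USD the request would have incurred had it reached the provider."""
    price = PRICES[hit.model]
    total = hit.input_tokens * price["input"] + hit.output_tokens * price["output"]
    return total / 1_000_000


def total_savings(hits) -> float:
    """Sum avoided cost across cache hits, each priced at its own model's rates."""
    return sum(avoided_cost(h) for h in hits)
```

The same hit is worth very different amounts depending on which model it would have reached, which is why the cache needs the gateway's routing and pricing information to report savings accurately.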

Related Terms

LLM Gateway

An HTTP layer that speaks one normalized protocol — usually OpenAI-compatible — and translates to whatever each downstream provider expects. The seam between application code and the rest of the inference stack.

KV Cache

A buffer that stores previously computed key and value tensors during autoregressive generation, so each new token only requires computing one step of attention instead of replaying the whole sequence.

LLM Observability

A replayable audit log of every model call — input, output, model, provider, latency, cost, and quality signals — plus real-time dashboards over the same data.

Inference

Running a trained model on new inputs to produce outputs — predictions, classifications, generations, embeddings. The production half of machine learning.