Back to Inference Glossary
Inference Glossary

Prompt Cache

A response cache keyed on the request — exact-match for identical requests, prefix for shared system prompts, and optionally semantic for paraphrased questions. Cuts input tokens dramatically on real traffic.

A prompt cache is a response cache that sits in front of model inference. When a request matches an entry in the cache, the cached response is returned without calling the model. On real production workloads, a well-designed prompt cache reduces input tokens by 40 to 70%, which translates directly into lower cost and lower latency.

There are three useful caches, and the cleanest deployments stack all three.

**Exact-match** is the simplest. The cache key is a hash of the full request. If the same prompt with the same parameters arrives, the cached response is returned in single-digit milliseconds. This catches repeated calls from the same client, deterministic test loops, and bot traffic.

**Prefix caching** matches on the longest common prefix between the incoming request and previously-served requests. This is where most production savings come from — system prompts, few-shot examples, tool definitions, and RAG context tend to be identical across thousands of requests. Prefix caching is what makes long shared-context workflows economical.

**Semantic caching** uses embeddings to detect paraphrased questions. "What is your refund policy?" and "How do I get a refund?" can return the same cached answer. Semantic caches are powerful but require care — false positives return wrong answers, and the threshold needs tuning per workflow.

The platform that owns the gateway is the only place a cache can correctly account for token cost across providers. A standalone cache that does not know what the response would have cost from OpenAI versus Anthropic versus an open-weight model cannot accurately attribute savings. This is one of the structural reasons inference platforms consolidate gateway, cache, and observability in one place.