Chapter 19. Prompt Caching, Reusing Work Across Calls
Every time you send a message to ChatGPT, Claude, or Gemini, the API processes your entire conversation from scratch: the system prompt, the tool definitions, every previous message, and your new question. The model has no memory between API calls. It recomputes the same KV cache for the same system prompt, the same tool definitions, and the same conversation history, over and over again, for every single request. Prompt caching is the optimization that eliminates this redundancy: the server stores the KV cache from previous requests and reuses it when a new request shares the same prefix. This saves both time (up to 85% faster time-to-first-token) and money (up to 90% cheaper input tokens). Anthropic’s Claude Code team treats drops in cache hit rate as production incidents, triggering severity alerts when the numbers slip. If you build anything that calls a language model API repeatedly, prompt caching is one of the most impactful optimizations available to you.
The Stateless API Problem
Language model APIs are stateless. Each API call is independent: the server does not remember what you sent in the previous call. When you have a multi-turn conversation with ChatGPT or Claude, the client (your browser, your app, or your code) is responsible for sending the entire conversation history with every request.
Here is what a typical multi-turn conversation looks like from the API’s perspective:
# Turn 1: User asks a question
# The client sends:
messages_turn_1 = [
    {"role": "system", "content": "You are a helpful coding assistant..."},  # ~500 tokens
    {"role": "user", "content": "How do I read a CSV file in Python?"},  # ~15 tokens
]
# Total input: ~515 tokens

# Turn 2: User asks a follow-up
# The client sends THE ENTIRE CONVERSATION again:
messages_turn_2 = [
    {"role": "system", "content": "You are a helpful coding assistant..."},  # ~500 tokens (same)
    {"role": "user", "content": "How do I read a CSV file in Python?"},  # ~15 tokens (same)
    {"role": "assistant", "content": "Here's how to read a CSV..."},  # ~200 tokens (same)
    {"role": "user", "content": "How do I filter rows by a column value?"},  # ~15 tokens (new)
]
# Total input: ~730 tokens, but only ~15 are new

# Turn 10: After a long conversation
# The client sends EVERYTHING from the beginning:
messages_turn_10 = [
    {"role": "system", "content": "You are a helpful coding assistant..."},  # ~500 tokens (same)
    # ... 9 previous turns of conversation ...  # ~3,000 tokens (same)
    {"role": "user", "content": "Can you add error handling to that?"},  # ~12 tokens (new)
]
# Total input: ~3,512 tokens, but only ~12 are new

By turn 10, you are sending over 3,500 tokens, but only 12 of them are new. The other 3,500 tokens are identical to what you sent in the previous request. Without prompt caching, the server processes all 3,500 tokens through the full transformer stack, computing Q, K, and V vectors at every layer, running attention, and running the feed-forward networks, just to arrive at the same KV cache it already computed one request ago.
This is the same kind of redundancy that the KV cache (Chapter 18) eliminates within a single request. The KV cache avoids recomputing K and V vectors for previous tokens during generation. Prompt caching extends this idea across requests: it avoids recomputing the KV cache for the shared prefix between consecutive API calls.
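To make the shared prefix concrete, the sketch below counts how many leading messages two consecutive requests have in common. The helper name shared_prefix_len is hypothetical, and it compares whole message dictionaries for simplicity; a real server compares token IDs, so treat this as an illustration rather than how any provider implements matching.

```python
def shared_prefix_len(previous, current):
    """Count the leading messages that are identical in both requests."""
    n = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        n += 1
    return n


turn_1 = [
    {"role": "system", "content": "You are a helpful coding assistant..."},
    {"role": "user", "content": "How do I read a CSV file in Python?"},
]
# Turn 2 resends turn 1 verbatim, then appends the reply and the follow-up
turn_2 = turn_1 + [
    {"role": "assistant", "content": "Here's how to read a CSV..."},
    {"role": "user", "content": "How do I filter rows by a column value?"},
]

reusable = shared_prefix_len(turn_1, turn_2)
print(f"{reusable} of {len(turn_2)} messages form a reusable prefix")
```

Everything counted by the helper is work the server has already done once; only the messages after that point need fresh computation.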
The Scale of the Problem
The waste compounds quickly in real applications. Consider a coding assistant with a 2,000-token system prompt, 5,000 tokens of tool definitions, and a conversation that runs for 20 turns with an average of 300 tokens per turn:
def compute_wasted_tokens(system_tokens, tool_tokens, turns, avg_tokens_per_turn):
    """
    Calculate how many tokens are redundantly processed
    across a multi-turn conversation without prompt caching.
    """
    prefix = system_tokens + tool_tokens
    total_input_tokens = 0
    new_tokens_per_turn = avg_tokens_per_turn  # Approximate

    for turn in range(1, turns + 1):
        # Each turn sends: prefix + all previous turns + new message
        conversation_so_far = prefix + (turn - 1) * avg_tokens_per_turn * 2  # user + assistant
        new_content = new_tokens_per_turn
        total_this_turn = conversation_so_far + new_content
        total_input_tokens += total_this_turn

    # What would be needed if we only processed new tokens each turn?
    minimal_tokens = prefix + turns * avg_tokens_per_turn  # Process prefix once + new content
    return total_input_tokens, minimal_tokens


total, minimal = compute_wasted_tokens(
    system_tokens=2_000,
    tool_tokens=5_000,
    turns=20,
    avg_tokens_per_turn=300,
)
print(f"Total tokens processed (no caching): {total:>10,}")
print(f"Minimal tokens needed (perfect cache): {minimal:>10,}")
print(f"Redundant tokens: {total - minimal:>10,}")
print(f"Waste ratio: {total / minimal:>10.1f}x")

Without caching, this 20-turn conversation processes 260,000 input tokens, 20 times the 13,000 that a perfect cache would need. The system prompt alone (2,000 tokens) is processed 20 times. The tool definitions (5,000 tokens) are processed 20 times. Every previous turn is reprocessed at every subsequent turn. This is not just a cost problem; it is a latency problem. Each redundant token adds to the prefill time, increasing the time-to-first-token (TTFT) that the user experiences as a pause before the response starts streaming.
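The same totals can be derived in closed form, which shows where the growth comes from. The symbols are introduced here for illustration: P is the static prefix (system prompt plus tools), n is the average message length, and T is the number of turns, with each earlier turn contributing one user and one assistant message as in the function above.

```latex
\begin{aligned}
\text{input}(t) &= P + 2n(t-1) + n \\
\text{total} &= \sum_{t=1}^{T} \text{input}(t) = T(P+n) + nT(T-1) \\
             &= 20 \cdot 7{,}300 + 300 \cdot 20 \cdot 19 = 146{,}000 + 114{,}000 = 260{,}000 \\
\text{minimal} &= P + Tn = 7{,}000 + 6{,}000 = 13{,}000
\end{aligned}
```

The nT(T-1) term is quadratic in the number of turns, so the overhead from resending history keeps growing as a conversation gets longer, while the minimal cost grows only linearly.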
What Prompt Caching Actually Does
Prompt caching works by storing the KV cache computed during the prefill phase (Chapter 18) and reusing it when a subsequent request shares the same prefix. Here is the process, step by step:
Request 1 arrives. The server processes the full prompt through all transformer layers, computing K and V vectors for every token at every layer. This populates the KV cache. The server stores this KV cache (or a portion of it) in GPU memory, tagged with a hash of the token sequence that produced it.
Request 2 arrives. The server compares the beginning of the new request’s token sequence against stored KV caches. If the new request starts with the same tokens as a previous request (a prefix match), the server can skip the prefill computation for those matching tokens and load the stored KV cache directly.
Only the new tokens are processed. The server runs the prefill phase only for the tokens that come after the cached prefix. This is dramatically faster because the expensive matrix multiplications for the shared prefix are skipped entirely.
import numpy as np
import hashlib


class PromptCacheServer:
    """
    Simplified prompt caching server.
    Demonstrates how KV caches are stored and reused across requests.
    """

    def __init__(self, model):
        self.model = model
        self.cache_store = {}  # hash -> {"kv_cache": ..., "num_tokens": ...}

    def process_request(self, token_ids):
        """
        Process a request, reusing cached KV state if possible.
        Returns the response and cache hit information.
        """
        # Check for prefix match in cache, longest prefix first
        best_match_len = 0
        best_match_hash = None
        for prefix_len in range(len(token_ids), 0, -1):
            prefix = tuple(token_ids[:prefix_len])
            prefix_hash = hashlib.sha256(str(prefix).encode()).hexdigest()
            if prefix_hash in self.cache_store:
                best_match_len = prefix_len
                best_match_hash = prefix_hash
                break

        if best_match_len > 0:
            # CACHE HIT: reuse stored KV cache for the prefix
            cached_kv = self.cache_store[best_match_hash]["kv_cache"]
            new_tokens = token_ids[best_match_len:]
            # Only run prefill on the new (uncached) tokens
            kv_cache = self.model.extend_kv_cache(cached_kv, new_tokens)
            cached_tokens = best_match_len
            computed_tokens = len(new_tokens)
        else:
            # CACHE MISS: process the entire prompt from scratch
            kv_cache = self.model.prefill(token_ids)
            cached_tokens = 0
            computed_tokens = len(token_ids)

        # Store the full KV cache for future reuse
        full_prefix = tuple(token_ids)
        full_hash = hashlib.sha256(str(full_prefix).encode()).hexdigest()
        self.cache_store[full_hash] = {
            "kv_cache": kv_cache,
            "num_tokens": len(token_ids),
        }

        # Generate response using the KV cache
        response = self.model.generate(kv_cache)
        return {
            "response": response,
            "cached_tokens": cached_tokens,
            "computed_tokens": computed_tokens,
            "total_tokens": len(token_ids),
            "cache_hit": cached_tokens > 0,
        }

The key insight is that the KV cache for a given prefix is deterministic: the same sequence of input tokens always produces the same K and V vectors at every layer (assuming the same model weights and the same numerical precision). Causal attention is what makes this hold for a prefix specifically: each token attends only to the tokens before it, so appending new tokens cannot change the K and V vectors of earlier positions. This means the server can safely reuse a stored KV cache without any risk of producing different results.
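The claim that a prefix's K and V vectors do not depend on what follows can be checked on a toy model. The sketch below is an assumption-laden miniature (random weights, two layers, no layer norm or feed-forward network), not a real transformer; the function name kv_per_layer and all sizes are made up for illustration. It compares the K/V tensors of a prefix run alone with the same positions taken from a longer sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, VOCAB, N_LAYERS = 8, 50, 2

# Toy weights (illustrative only): embeddings plus Q/K/V projections per layer.
embed = rng.normal(size=(VOCAB, D_MODEL))
layers = [
    {name: rng.normal(size=(D_MODEL, D_MODEL)) * 0.3 for name in ("q", "k", "v")}
    for _ in range(N_LAYERS)
]


def kv_per_layer(token_ids):
    """Return [(K, V), ...] for each layer of a toy causal transformer."""
    x = embed[np.asarray(token_ids)]
    kv = []
    for w in layers:
        q, k, v = x @ w["q"], x @ w["k"], x @ w["v"]
        kv.append((k, v))
        scores = q @ k.T / np.sqrt(D_MODEL)
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)  # causal mask
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        x = x + weights @ v  # residual connection feeds the next layer
    return kv


prefix = [3, 14, 15, 9, 2]
extended = prefix + [6, 5, 35]

kv_short = kv_per_layer(prefix)
kv_long = kv_per_layer(extended)

# Compare with a tolerance: floating-point kernels may sum in a different order.
for (k_s, v_s), (k_l, v_l) in zip(kv_short, kv_long):
    assert np.allclose(k_s, k_l[: len(prefix)])
    assert np.allclose(v_s, v_l[: len(prefix)])
print("Prefix K/V match at every layer")
```

Changing any token inside the prefix breaks the match from that position onward, which is why caching is defined on prefixes rather than on arbitrary shared substrings.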
What Gets Cached
Prompt caching works on the prefix of the request: the tokens at the beginning that match a previously seen request. In practice, the content that benefits most from caching is:
System prompts. These are typically the same across all requests for a given application. A coding assistant’s system prompt (“You are an expert Python developer…”) is identical for every user and every conversation.
Tool definitions. If your application uses function calling (tool use), the tool schemas are sent with every request and are usually identical across calls.
Few-shot examples. If you include example input/output pairs in your prompt, these are the same across requests.
Conversation history. In a multi-turn conversation, all previous turns are identical between consecutive requests. Only the latest user message is new.
Document context. If you are doing retrieval-augmented generation (RAG) and include the same documents in multiple requests, those document tokens can be cached.
The critical requirement is that the cached content must be a prefix: it must start at the very beginning of the token sequence. If you rearrange your prompt so that the variable content (the user’s new question) comes before the static content (the system prompt), caching will not work because the prefix has changed. This is why API providers recommend structuring prompts with static content first and dynamic content last.
One additional detail worth knowing, specific to Anthropic’s API: the cache hierarchy follows the order tools → system → messages. Changes at any level invalidate that level and all subsequent levels. If you modify a tool definition, the entire cache (tools, system, and messages) is invalidated. If you modify the system prompt, the tools cache remains valid but the system and messages caches are invalidated. If you only change a message, the tools and system caches remain valid.
Beyond content changes, several API-level settings also invalidate portions of the cache. On Anthropic’s API, toggling web search or citations invalidates the system and messages caches (because these features modify the system prompt internally). Changing tool_choice, adding or removing images, or modifying extended thinking parameters invalidates the messages cache. Even the ordering of keys in tool_use content blocks matters: some programming languages (Swift, Go) randomize JSON key order during serialization, which breaks cache matches. These subtle invalidation triggers are a common source of unexpectedly low cache hit rates in production.
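One way to catch accidental prefix changes in your own code is to log a fingerprint of the static prefix with every request and alert when it changes. The sketch below is a hypothetical helper (prefix_fingerprint is not part of any provider SDK, and providers do not necessarily hash prompts this way); it serializes tools and system prompt with sorted keys so that dictionary ordering alone does not change the result.

```python
import hashlib
import json


def prefix_fingerprint(tools, system_prompt):
    """Hash the static prefix using a canonical JSON encoding."""
    canonical = json.dumps(
        {"tools": tools, "system": system_prompt},
        sort_keys=True,            # stable key order regardless of insertion order
        separators=(",", ":"),     # no incidental whitespace differences
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


tools_a = [
    {"name": "read_file", "description": "Read a file", "input_schema": {"type": "object"}}
]
# Same tool definition, keys inserted in a different order
tools_b = [
    {"input_schema": {"type": "object"}, "description": "Read a file", "name": "read_file"}
]
system = "You are a helpful coding assistant..."

same = prefix_fingerprint(tools_a, system) == prefix_fingerprint(tools_b, system)
drifted = prefix_fingerprint(tools_a, system) != prefix_fingerprint(tools_a, system + " v2")
print(f"key order ignored: {same}, content change detected: {drifted}")
```

A fingerprint like this only tells you that your prefix drifted. To keep the cache warm, the request body itself still has to be serialized in a stable order before it is sent.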
from datetime import datetime

# GOOD: Static content first, dynamic content last
# The system prompt and tools are cacheable as a prefix
good_structure = [
    {"role": "system", "content": "You are a helpful assistant..."},  # Static (cacheable)
    {"role": "user", "content": "Previous question..."},  # Grows but prefix-stable
    {"role": "assistant", "content": "Previous answer..."},  # Grows but prefix-stable
    {"role": "user", "content": "New question"},  # Dynamic (not cached)
]

# BAD: Dynamic content first
# Nothing is cacheable because the prefix changes every time
bad_structure = [
    {"role": "user", "content": f"Current time: {datetime.now()}"},  # Changes every request!
    {"role": "system", "content": "You are a helpful assistant..."},  # Cannot be cached
]

How the Major Providers Implement Prompt Caching
As of March 2026, all three major API providers (OpenAI, Anthropic, and Google) offer prompt caching, but with significantly different designs. Understanding these differences is important for optimizing cost and latency in production applications.
OpenAI: Automatic Caching
OpenAI announced prompt caching at DevDay on October 1, 2024. It is automatic: there is no special API parameter to enable it. The system detects shared prefixes across recent requests and applies caching transparently.
How it works:
- Caching activates for prompts of 1,024 tokens or longer.
- Cache matches occur in 128-token increments beyond the initial 1,024 tokens. This means the number of cached tokens follows the sequence: 1,024, 1,152, 1,280, 1,408, and so on.
- By default, caches are retained for 5 to 10 minutes of inactivity and automatically cleared within an hour of their last use.
- For the Responses API, OpenAI introduced extended cache retention (around November 2025) via the prompt_cache_retention="24h" parameter, which keeps the cache alive for up to 24 hours. This is available for GPT-5.1 and newer models. A companion parameter, prompt_cache_key, helps influence cache routing by grouping requests that share the same prefix onto the same inference server.
- No code changes are required for basic caching. If your requests share a common prefix, caching is applied automatically.
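The 1,024-token minimum and 128-token increments described above translate into a simple rounding rule. The helper below is a sketch of that arithmetic only (expected_cached_tokens is a hypothetical name); it gives the most you could see in cached_tokens for a given matching prefix, assuming the request actually lands on a server that still holds the cache.

```python
def expected_cached_tokens(matching_prefix_tokens, minimum=1024, step=128):
    """Upper bound on cached tokens for a matching prefix of the given length."""
    if matching_prefix_tokens < minimum:
        return 0  # below the threshold, nothing is cached
    extra_steps = (matching_prefix_tokens - minimum) // step
    return minimum + extra_steps * step


for n in (900, 1024, 1100, 1152, 2000):
    print(f"{n:>5} matching tokens -> up to {expected_cached_tokens(n):>5} cached")
```

A 2,000-token shared prefix therefore yields at most 1,920 cached tokens; the remaining 80 fall short of the next 128-token boundary and are processed normally.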
Pricing (as of March 2026):
The cache read discount varies by model family:
| Model Family | Input Price | Cached Input Price | Cache Discount |
|---|---|---|---|
| GPT-5 / GPT-5 mini / GPT-5 nano | $1.25 / $0.25 / $0.05 per MTok | $0.125 / $0.025 / $0.005 per MTok | 90% |
| GPT-5.4 | $2.50 per MTok | $0.25 per MTok | 90% |
| GPT-4.1 / GPT-4.1 mini / GPT-4.1 nano | $2.00 / $0.40 / $0.10 per MTok | $0.50 / $0.10 / $0.025 per MTok | 75% |
| o3 / o4-mini | $2.00 / $1.10 per MTok | $0.50 / $0.275 per MTok | 75% |
| GPT-4o / GPT-4o mini | $2.50 / $0.15 per MTok | $1.25 / $0.075 per MTok | 50% |
The GPT-5 family gets the deepest discount at 90%. The GPT-4.1 family and o-series reasoning models get 75%. The older GPT-4o family gets 50%. This makes model selection for cache-heavy workloads particularly important: GPT-5’s effective cached input rate of $0.125 per million tokens is lower than GPT-4o mini’s standard input rate ($0.15) and only slightly above GPT-4.1 nano’s ($0.10).
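To see what these discounts mean for a single request, you can blend the standard and cached rates by the number of tokens served from cache. The function below is an illustrative sketch (input_cost_usd is a hypothetical helper); the prices are the per-million-token figures from the table above, and the 10,000-token prompt with 9,600 cached tokens is an invented example.

```python
def input_cost_usd(prompt_tokens, cached_tokens, price_per_mtok, cached_price_per_mtok):
    """Input cost of one request when part of the prompt is served from cache."""
    uncached_tokens = prompt_tokens - cached_tokens
    total = uncached_tokens * price_per_mtok + cached_tokens * cached_price_per_mtok
    return total / 1_000_000


# Invented example: 10,000-token prompt, 9,600 tokens served from cache
gpt5_cached = input_cost_usd(10_000, 9_600, 1.25, 0.125)
gpt5_uncached = input_cost_usd(10_000, 0, 1.25, 0.125)
gpt4o_cached = input_cost_usd(10_000, 9_600, 2.50, 1.25)

print(f"GPT-5, no cache hit: ${gpt5_uncached:.4f}")
print(f"GPT-5, 96% cached:   ${gpt5_cached:.4f}")
print(f"GPT-4o, 96% cached:  ${gpt4o_cached:.4f}")
```

At the same cache hit rate, the deeper discount matters more than the list price: the cached GPT-5 request here costs a fraction of the cached GPT-4o request.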
Note that o3’s pricing dropped 80% in June 2025 (from $10/$40 to $2/$8 per million input/output tokens). At the current $2.00 per million input tokens with a 75% cache discount, o3’s cached input rate is $0.50 per million tokens.
GPT-5.4 also has a long-context surcharge: requests exceeding 272,000 input tokens are billed at 2x the standard input rate and 1.5x the output rate for the full session. Caching does not shorten the prompt (cached tokens are still reported as part of prompt_tokens), so it should not be relied on to stay under the threshold. What helps near 272K is keeping the prompt itself compact; caching then lowers the price of whatever prefix is resent.
An empirical evaluation of prompt caching across providers (Agarwal et al., arXiv:2601.06007, January 2026) found that prompt caching reduced API costs by 41 to 80% and improved time-to-first-token by 13 to 31% on long-horizon agentic tasks involving 500 to 50,000 token prompts and 3 to 50 tool calls per session.
OpenAI’s newest model, GPT-5.4 (released March 5, 2026), introduced a feature called tool search that is directly relevant to prompt caching. Instead of loading all tool definitions into the prompt upfront (which can add thousands of tokens), the model receives a lightweight tool list and fetches full definitions on demand. On benchmarks with 36 MCP servers enabled, tool search reduced total token usage by 47% while maintaining the same accuracy. This preserves the cached prefix by keeping the prompt smaller and more stable.
Latency improvement: OpenAI reports up to 80% reduction in time-to-first-token for long cached prefixes (over 10,000 tokens).
from openai import OpenAI

client = OpenAI()

# The system prompt and tools are the same across requests.
# OpenAI automatically caches the shared prefix.
system_prompt = (
    "You are an expert Python developer. You write clean, "
    "well-documented code following PEP 8 conventions..."
)  # Must be at least 1,024 tokens for caching to apply

# Request 1: First call processes everything from scratch
response_1 = client.chat.completions.create(
    model="gpt-5",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Write a function to parse JSON files."},
    ],
)

# Check cache usage in the response
usage = response_1.usage
print(f"Total input tokens: {usage.prompt_tokens}")
print(f"Cached tokens: {usage.prompt_tokens_details.cached_tokens}")
# First request: cached_tokens = 0 (cache miss)

# Request 2: Same system prompt, different question
response_2 = client.chat.completions.create(
    model="gpt-5",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Write a function to validate email addresses."},
    ],
)
usage_2 = response_2.usage
print(f"Total input tokens: {usage_2.prompt_tokens}")
print(f"Cached tokens: {usage_2.prompt_tokens_details.cached_tokens}")
# Second request, on a cache hit: cached_tokens = 1024 (or more, in 128-token increments)
# The system prompt prefix was reused from the first request

The usage.prompt_tokens_details.cached_tokens field in the API response tells you exactly how many tokens were served from cache. This is how you monitor whether caching is working for your application.
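Per-request numbers are noisy, so it helps to aggregate them into a running hit rate. The sketch below assumes you feed it the prompt_tokens and cached_tokens values read from each response's usage object; the CacheStats class is a hypothetical helper, and the two recorded requests are invented sample values.

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    """Running totals for monitoring prompt cache effectiveness."""

    prompt_tokens: int = 0
    cached_tokens: int = 0
    requests: int = 0

    def record(self, prompt_tokens, cached_tokens):
        """Add one response's usage numbers to the totals."""
        self.prompt_tokens += prompt_tokens
        self.cached_tokens += cached_tokens
        self.requests += 1

    @property
    def hit_rate(self):
        """Fraction of all input tokens that were served from cache."""
        if self.prompt_tokens == 0:
            return 0.0
        return self.cached_tokens / self.prompt_tokens


stats = CacheStats()
# In real code: stats.record(usage.prompt_tokens, usage.prompt_tokens_details.cached_tokens)
stats.record(prompt_tokens=1_500, cached_tokens=0)      # first request, cache miss
stats.record(prompt_tokens=1_520, cached_tokens=1_408)  # second request, cache hit
print(f"Token-level cache hit rate over {stats.requests} requests: {stats.hit_rate:.1%}")
```

Tracking the rate by tokens rather than by requests weights long prompts more heavily, which matches how the bill is computed.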
For the Responses API (GPT-5.1 and newer), you can opt into extended 24-hour cache retention:
# Extended cache retention (Responses API, GPT-5.1+)
response = client.responses.create(
    model="gpt-5.4",
    prompt_cache_retention="24h",  # Keep cache for up to 24 hours
    input=[
        {"role": "developer", "content": system_prompt},
        {"role": "user", "content": "Write a function to parse JSON files."},
    ],
)

The 24-hour retention is useful for batch processing jobs or scheduled tasks where requests may be hours apart. The default 5-to-10-minute retention works well for interactive sessions where requests come in rapid succession.
Source: OpenAI, “Prompt Caching in the API,” October 1, 2024 (openai.com/index/api-prompt-caching). Announced at DevDay 2024. Caching starts at 1,024 tokens, increases in 128-token increments. Up to 80% latency reduction for prompts over 10,000 tokens. Default cache retained 5-10 minutes of inactivity; extended 24-hour retention available via prompt_cache_retention="24h" in the Responses API for GPT-5.1+ models (community.openai.com/t/new-24h-prompt-caching-retention-only-certain-models/1366221, November 2025). prompt_cache_key parameter for cache routing (community.openai.com/t/prompt-cache-routing-the-user-parameter/1267103). GPT-5 pricing: $1.25/MTok input, $0.125/MTok cached (90% discount, langcopilot.com). GPT-5.4 pricing: $2.50/MTok input, $0.25/MTok cached (90% discount, langcopilot.com, aibox365.com). GPT-5.4 long-context surcharge: requests exceeding 272K input tokens billed at 2x input and 1.5x output for the full session (openai.com/api/pricing, digitalapplied.com, community.openai.com). GPT-4.1 pricing: $2.00/MTok input, $0.50/MTok cached (75% discount, docsbot.ai, pecollective.com). o3 pricing (post-June 2025): $2.00/MTok input, $0.50/MTok cached (75% discount, apidog.com, langcopilot.com). o4-mini pricing: $1.10/MTok input, $0.275/MTok cached (75% discount, langcopilot.com, simonwillison.net). GPT-4o pricing: $2.50/MTok input, $1.25/MTok cached (50% discount, pecollective.com, langcopilot.com). GPT-5.4 tool search: 47% token reduction on 250 MCP Atlas tasks with 36 MCP servers (digitalapplied.com, openai.com/index/introducing-gpt-5-4, March 5, 2026). MCP Atlas benchmark: arXiv:2602.00933, 36 real MCP servers, 1,000 tasks (scale.com/leaderboard/mcp_atlas). Agarwal et al., “An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks,” arXiv:2601.06007, January 2026: 41-80% cost reduction, 13-31% TTFT improvement across providers on 500-50,000 token prompts with 3-50 tool calls.
Anthropic: Explicit Cache Control
Anthropic launched prompt caching for Claude on August 14, 2024. Unlike OpenAI’s automatic approach, Anthropic’s system gives developers explicit control over what gets cached using cache_control markers in the API request.
How it works:
- You add cache_control: {"type": "ephemeral"} to specific content blocks in your request to mark them as cacheable.
- Alternatively, you can add a single cache_control field at the top level of the request for automatic caching, where the system applies the cache breakpoint to the last cacheable block and moves it forward as conversations grow.
- The minimum cacheable content varies by model: 4,096 tokens for Opus 4.6, Opus 4.5, and Haiku 4.5; 2,048 tokens for Sonnet 4.6, Haiku 3.5, and Haiku 3; and 1,024 tokens for Sonnet 4.5, Opus 4.1, Opus 4, Sonnet 4, and Sonnet 3.7.
- Cached content has a time-to-live (TTL) of 5 minutes by default, refreshed each time the cache is accessed. A 1-hour TTL option is also available by setting "ttl": "1h" in the cache_control parameter.
- You can place up to 4 cache breakpoints per request. The system uses a 20-block lookback window: when checking for cache hits, it examines up to 20 content blocks before each explicit breakpoint. If your prompt has more than 20 blocks before a breakpoint and you modify content earlier than those 20 blocks, you will not get a cache hit unless you add additional breakpoints closer to that content.
Pricing (as of March 2026):
Anthropic uses a multiplier system that applies uniformly across all models:
| Operation | Multiplier | Sonnet 4.6 Price | Opus 4.6 Price | Haiku 4.5 Price |
|---|---|---|---|---|
| Standard input | 1.0x | $3.00/MTok | $5.00/MTok | $1.00/MTok |
| Cache write (5-min TTL) | 1.25x | $3.75/MTok | $6.25/MTok | $1.25/MTok |
| Cache write (1-hour TTL) | 2.0x | $6.00/MTok | $10.00/MTok | $2.00/MTok |
| Cache read (hit) | 0.1x | $0.30/MTok | $0.50/MTok | $0.10/MTok |
| Output | N/A | $15.00/MTok | $25.00/MTok | $5.00/MTok |
The 5-minute cache write costs 25% more than standard input, while the 1-hour cache write costs 100% more (double the standard price). But every cache hit saves 90% regardless of which TTL you chose. The 5-minute cache breaks even after just 1 hit (the 0.9x savings per hit exceeds the 0.25x surcharge). The 1-hour cache needs 2 hits to break even (the 1.0x surcharge requires two 0.9x savings to recover).
The 5-minute TTL automatically refreshes every time the cache is accessed. If your application makes at least one request within every 5-minute window, the cache stays alive indefinitely, making the 1-hour TTL unnecessary for high-frequency workloads. The 1-hour TTL is designed for batch processing or scheduled tasks that run every 15 to 30 minutes.
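The effect of refresh-on-access is easy to simulate. The sketch below models a single cache entry whose expiry is pushed forward on every request; simulate_hits is a hypothetical helper, the request schedule is invented, and treating a request that arrives exactly at expiry as a miss is an assumption.

```python
def simulate_hits(request_times_s, ttl_s=300):
    """Return a hit/miss flag per request for a cache whose TTL refreshes on access."""
    results = []
    expires_at = None
    for t in request_times_s:
        hit = expires_at is not None and t < expires_at
        results.append(hit)
        expires_at = t + ttl_s  # written on a miss, refreshed on a hit
    return results


minutes = [0, 4, 8, 12, 20, 24]
times = [m * 60 for m in minutes]

five_minute = simulate_hits(times, ttl_s=300)
one_hour = simulate_hits(times, ttl_s=3600)
print("5-minute TTL:", five_minute)
print("1-hour TTL:  ", one_hour)
```

With requests every four minutes the 5-minute cache never expires. The eight-minute gap before minute 20 causes one miss (and one more cache write) that the 1-hour TTL would have avoided.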
Long-context pricing update (March 13, 2026): Anthropic removed the long-context surcharge for Opus 4.6 and Sonnet 4.6. Previously, requests exceeding 200,000 input tokens were charged at 2x the standard input rate and 1.5x the output rate. Now, standard pricing applies across the full 1 million token context window with no premium. This makes prompt caching even more valuable for long-context workloads, since the cached prefix can span hundreds of thousands of tokens at the standard rate.
Cache isolation: On February 5, 2026, Anthropic moved cache isolation from the organization level to the workspace level. Caches are shared between different API keys within the same workspace, but completely isolated between different workspaces (even within the same organization) and between different organizations. This means two separate organizations sending identical prompts will each pay the cache write cost independently.
import anthropic

client = anthropic.Anthropic()

# Method 1: Automatic caching (simplest)
# Add cache_control at the top level; the system handles breakpoint placement
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    cache_control={"type": "ephemeral"},  # Enable automatic caching
    system="You are an expert Python developer who writes clean, "
           "well-documented code following PEP 8 conventions. "
           "You always include type hints and docstrings...",  # Must be >= 2,048 tokens for Sonnet 4.6
    messages=[
        {"role": "user", "content": "Write a function to parse JSON files."},
    ],
)

# Method 2: Explicit cache breakpoints with TTL control
# Place cache_control on specific content blocks
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an expert Python developer...",  # Long system prompt
            "cache_control": {"type": "ephemeral", "ttl": "1h"},  # 1-hour cache
        }
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Here is the reference documentation...",  # Large context
                    "cache_control": {"type": "ephemeral"},  # Default 5-minute cache
                },
                {
                    "type": "text",
                    "text": "Write a function to parse JSON files.",
                },
            ],
        },
    ],
)

# Check cache usage in the response
print(f"Input tokens: {response.usage.input_tokens}")
print(f"Cache creation: {response.usage.cache_creation_input_tokens}")
print(f"Cache read: {response.usage.cache_read_input_tokens}")
# First request: cache_creation > 0, cache_read = 0
# Subsequent requests: cache_creation = 0, cache_read > 0

When mixing TTLs in a single request, the 1-hour cache marker must appear before the 5-minute cache marker. A common pattern is to set the system prompt (which rarely changes) to a 1-hour cache and the few-shot examples or document context (which may change per session) to a 5-minute cache.
Anthropic’s explicit approach has a key advantage: you control exactly what gets cached. This is useful when your prompt has a complex structure with multiple static and dynamic sections. You can place up to 4 cache breakpoints per request, allowing you to cache different sections that change at different frequencies. The Claude Code team at Anthropic builds their entire harness around prompt caching and treats drops in cache hit rate as production incidents, triggering severity alerts when the numbers slip.
There is also a throughput benefit: on the Anthropic API, prompt cache read tokens do not count against your Input Tokens Per Minute (ITPM) rate limit. With a high cache hit rate, this can raise your effective throughput by 5 to 10x without upgrading your usage tier.
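The 5 to 10x figure follows from simple arithmetic if you assume that only the uncached share of your input counts against the limit. The sketch below makes that assumption explicit; effective_input_tpm is a hypothetical helper and the 100,000-token limit is a made-up tier, not a published number.

```python
def effective_input_tpm(itpm_limit, cache_hit_rate):
    """Total input tokens per minute you can send when cache reads are exempt.

    Assumes cache_hit_rate is the fraction of input tokens read from cache
    and that every other input token counts against the limit.
    """
    if not 0 <= cache_hit_rate < 1:
        raise ValueError("cache_hit_rate must be in [0, 1)")
    return itpm_limit / (1 - cache_hit_rate)


limit = 100_000  # hypothetical ITPM limit
for rate in (0.0, 0.5, 0.8, 0.9):
    total = effective_input_tpm(limit, rate)
    print(f"hit rate {rate:.0%}: ~{total:,.0f} input tokens/min ({total / limit:.0f}x)")
```

An 80% hit rate corresponds to 5x and a 90% hit rate to 10x, which is where the range in the text comes from.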
How critical is prompt caching in practice? On February 27, 2026, Anthropic reset the weekly usage limits for every Claude Code user after a prompt caching bug caused usage to be consumed at two to three times the normal rate. The bug was hotfixed in Claude Code version 2.1.62, and all affected users received a fresh quota reset. This incident illustrates how deeply prompt caching is woven into the economics of production AI tooling: when caching breaks, costs spike immediately and visibly.
Source: Anthropic, “Prompt caching” documentation (docs.anthropic.com/en/docs/build-with-claude/prompt-caching). Launched August 14, 2024 (simonwillison.net). Cache reads cost 10% of standard input price (90% discount). 5-minute cache writes cost 125% of standard input price (25% surcharge). 1-hour cache writes cost 200% of standard input price (100% surcharge, blog.wentuo.ai, March 7, 2026). 5-minute default TTL, refreshed on access. Minimum cacheable tokens vary by model: 4,096 for Opus 4.6, Opus 4.5, and Haiku 4.5; 2,048 for Sonnet 4.6, Haiku 3.5, and Haiku 3; 1,024 for Sonnet 4.5, Opus 4.1, Opus 4, Sonnet 4, and Sonnet 3.7 (docs.anthropic.com/en/docs/build-with-claude/prompt-caching, “Cache limitations” section). Up to 4 cache breakpoints per request with 20-block lookback window (docs.anthropic.com). Claude Sonnet 4.6 pricing: $3.00/MTok input, $3.75/MTok 5-min cache write, $6.00/MTok 1-hour cache write, $0.30/MTok cache read (langcopilot.com, apidog.com). Claude Opus 4.6 pricing: $5.00/MTok input, $6.25/MTok 5-min cache write, $10.00/MTok 1-hour cache write, $0.50/MTok cache read (invertedstone.com, curlscape.com, docs.anthropic.com pricing table). Automatic caching via top-level cache_control field (docs.anthropic.com). Latency reduction up to 85%, cost reduction up to 90% (aifreeapi.com). Workspace-level cache isolation since February 5, 2026 (blog.wentuo.ai). Long-context surcharge removed for Opus 4.6 and Sonnet 4.6 on March 13, 2026, standard pricing across full 1M window (the-decoder.com, winbuzzer.com, blockchain.news, karangoyal.cc). Previously, requests exceeding 200K tokens were charged at 2x input and 1.5x output (blockchain.news). Cache read tokens do not count against ITPM rate limits (anthropic.com/news/token-saving-updates). Claude Code prompt caching bug on February 27, 2026 caused 2-3x normal usage consumption, hotfixed in v2.1.62 with full quota reset (piunikaweb.com, raylogue.com). 
Claude Code treats cache misses as production incidents (implicator.ai, Thariq Shafi, Anthropic engineer).
Google Gemini: Explicit and Implicit Caching
Google pioneered context caching for the Gemini API in May 2024 with explicit caching, and added implicit caching on May 9, 2025 for Gemini 2.5 models.
Explicit caching:
- You create a named cache object via the API, specifying the content to cache and a TTL.
- The minimum cacheable content is 2,048 tokens (on Vertex AI).
- Cached content incurs a storage cost in addition to the reduced per-token price: $4.50 per million tokens per hour for Gemini 2.5 Pro.
- Cached input tokens are billed at 75% less than standard input tokens (you pay 25% of the normal price).
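Because explicit caching charges for storage by the hour, it only pays off if the cached content is read often enough. The sketch below estimates that threshold. The storage price and 75% discount come from the list above; the $1.25 per million token input price is an assumption used as a placeholder (check current pricing), the helper name is hypothetical, and the one-time cost of creating the cache is ignored.

```python
import math


def breakeven_reads_per_hour(input_price_per_mtok,
                             storage_price_per_mtok_hour=4.50,
                             read_discount=0.75):
    """Reads per hour at which discount savings equal the hourly storage cost.

    Both prices are per million cached tokens, so the size of the cached
    content cancels out of the ratio.
    """
    savings_per_read = input_price_per_mtok * read_discount
    return storage_price_per_mtok_hour / savings_per_read


ASSUMED_INPUT_PRICE = 1.25  # placeholder input price in $/MTok, not from the text

threshold = breakeven_reads_per_hour(ASSUMED_INPUT_PRICE)
print(f"Break-even at about {threshold:.1f} reads per hour")
print(f"Storage is covered from {math.ceil(threshold)} reads per hour upward")
```

Under these assumptions, content that is reused only a couple of times per hour costs more to store than it saves, which is one reason implicit caching (no storage fee) is the easier default.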
Implicit caching (May 9, 2025):
- Works automatically, like OpenAI’s approach. No cache objects to manage.
- Available for Gemini 2.5 Pro and Gemini 2.5 Flash.
- Minimum token thresholds: 1,024 tokens for 2.5 Flash, 2,048 tokens for 2.5 Pro.
- Provides the same 75% discount on cached tokens as explicit caching.
- No storage cost for implicit caching (the cost savings are passed through automatically on cache hits).
Google’s implicit caching is as simple to use as OpenAI’s automatic caching: there is nothing to configure, no markers to add, and no cache objects to manage. If your requests share a common prefix that exceeds the minimum token threshold, the discount is applied automatically.
import google.generativeai as genai

# Implicit caching: just use the API normally.
# Gemini automatically detects and caches shared prefixes.
model = genai.GenerativeModel("gemini-2.5-pro")

# Request 1: processes everything
response_1 = model.generate_content(
    "You are an expert Python developer...\n\n"  # Long system context
    "Write a function to parse JSON files."
)

# Request 2: same prefix, different question
# Gemini automatically reuses the cached prefix
response_2 = model.generate_content(
    "You are an expert Python developer...\n\n"  # Same prefix (cached)
    "Write a function to validate email addresses."  # New content
)

# Check for cached tokens in usage metadata
# The cached_content_token_count field indicates cache hits
print(f"Cached tokens: {response_2.usage_metadata.cached_content_token_count}")

Source: Google Developers Blog, “Gemini 2.5 Models now support implicit caching,” May 9, 2025 (developers.googleblog.com). Google pioneered explicit context caching in May 2024. Implicit caching provides 75% discount automatically (mlq.ai, techcrunch.com, winbuzzer.com). Minimum thresholds: 1,024 tokens for 2.5 Flash, 2,048 tokens for 2.5 Pro (ainvest.com). Explicit caching storage cost: $4.50/MTok/hour for Gemini 2.5 Pro (ai.google.dev/gemini-api/docs/pricing). Vertex AI context caching: 2,048 token minimum (cloud.google.com/blog/products/ai-machine-learning/vertex-ai-context-caching).
Provider Comparison
| Feature | OpenAI | Anthropic | Google Gemini |
|---|---|---|---|
| Launch date | October 1, 2024 | August 14, 2024 | May 2024 (explicit), May 9, 2025 (implicit) |
| Setup required | None (automatic) | cache_control markers or top-level flag | None for implicit; API call for explicit |
| Minimum tokens | 1,024 | 1,024 to 4,096 (varies by model) | 1,024 (2.5 Flash), 2,048 (2.5 Pro) |
| Cache granularity | 128-token increments | Content block level | Prefix-based |
| Cache TTL | 5-10 min default; 24 hours via Responses API (GPT-5.1+) | 5 minutes (default) or 1 hour | Configurable (explicit), automatic (implicit) |
| Cache write cost | Same as standard input | 1.25x (5-min) or 2.0x (1-hour) | Same as standard input (implicit) |
| Cache read discount | 50% (GPT-4o), 75% (GPT-4.1, o3), 90% (GPT-5) | 90% | 75% |
| Latency improvement | Up to 80% TTFT reduction | Up to 85% TTFT reduction | Significant (not quantified) |
| Storage cost | None | None | $4.50/MTok/hour (explicit only) |
| Cache isolation | Not documented | Workspace-level (since Feb 2026) | Not documented |
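The write and read multipliers in the table can be combined into a single effective input price once you know what share of your input tokens are cache reads and cache writes. The sketch below is illustrative only: `blended_input_price` and its parameter names are hypothetical, and the 80% read / 10% write split is an assumed workload, not a measured one.

```python
def blended_input_price(base_price, read_frac, write_frac, read_mult, write_mult):
    """Effective $/MTok when given fractions of input are cache reads and writes."""
    if read_frac < 0 or write_frac < 0 or read_frac + write_frac > 1:
        raise ValueError("fractions must be non-negative and sum to at most 1")
    uncached_frac = 1 - read_frac - write_frac
    return base_price * (
        read_frac * read_mult + write_frac * write_mult + uncached_frac
    )

# Multipliers come from the comparison table; the base prices are the
# per-MTok figures used in this chapter's cost examples.
anthropic = blended_input_price(3.00, 0.80, 0.10, read_mult=0.10, write_mult=1.25)
openai_41 = blended_input_price(2.00, 0.80, 0.10, read_mult=0.25, write_mult=1.00)
print(f"Anthropic (Sonnet 4.6): ${anthropic:.3f}/MTok")
print(f"OpenAI (GPT-4.1):       ${openai_41:.3f}/MTok")
```

Under the assumed split, the write surcharge matters far less than the read discount, because reads dominate the token mix.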
The Math: When Does Prompt Caching Pay Off?
Prompt caching is not free. With Anthropic, the tokens written to the cache on the first request cost 25% more than standard input with the 5-minute TTL, or 100% more with the 1-hour TTL (the cache write surcharge). With Google’s explicit caching, you pay an ongoing storage cost. Even with OpenAI’s automatic caching (no write surcharge), there is an implicit cost: the server must allocate GPU memory to store cached KV states, which reduces the memory available for serving other requests.
The question is: how many requests do you need before caching pays for itself?
Anthropic’s Break-Even Calculation
With Anthropic’s pricing, the break-even point is straightforward to calculate. Let us use Claude Sonnet 4.6 as an example, comparing both the 5-minute and 1-hour cache TTLs:
- Standard input: $3.00 per million tokens
- Cache write (5-min TTL): $3.75 per million tokens (1.25x standard)
- Cache write (1-hour TTL): $6.00 per million tokens (2.0x standard)
- Cache read: $0.30 per million tokens (0.1x standard)
import math
def anthropic_breakeven(base_price_per_mtok=3.00):
    """
    Calculate how many requests are needed for Anthropic prompt caching
    to break even compared to no caching.
    5-min cache write = 1.25x base price
    1-hour cache write = 2.0x base price
    Cache read = 0.10x base price
    """
    read_price = base_price_per_mtok * 0.10  # $0.30
    for ttl_name, write_mult in [("5-minute", 1.25), ("1-hour", 2.0)]:
        write_price = base_price_per_mtok * write_mult
        # Break-even when:
        # N * base_price = write_price + (N - 1) * read_price
        # N = (write_price - read_price) / (base_price - read_price)
        N = (write_price - read_price) / (base_price_per_mtok - read_price)
        print(f"{ttl_name} cache (write = {write_mult}x):")
        print(f" Cache write price: ${write_price:.2f}/MTok")
        print(f" Cache read price: ${read_price:.2f}/MTok")
        print(f" Break-even at: {N:.2f} requests")
        print(f" (Caching pays for itself after {math.ceil(N)} requests)")
        print()
    # Show cost comparison for 5-minute cache
    write_price = base_price_per_mtok * 1.25
    print("Cost comparison (5-minute cache, Sonnet 4.6):")
    print(f"{'Requests':>8} {'No Cache':>12} {'With Cache':>12} {'Savings':>10}")
    print("-" * 44)
    for n in [1, 2, 3, 5, 10, 20, 50]:
        no_cache = n * base_price_per_mtok
        with_cache = write_price + max(0, n - 1) * read_price
        savings = (1 - with_cache / no_cache) * 100
        print(f"{n:>8} ${no_cache:>10.2f} ${with_cache:>10.2f} {savings:>9.1f}%")
anthropic_breakeven()
The 5-minute cache breaks even at approximately 1.3 requests: after just 2 requests with the same prefix, you are saving money. The 1-hour cache breaks even at approximately 2.1 requests: after 3 requests, you are ahead. By 10 requests, the 5-minute cache saves about 78% and the 1-hour cache about 71%. By 50 requests, both save roughly 86 to 88%, approaching the 90% ceiling set by the cache read discount.
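The break-even rule used in the code can also be written without reference to any particular price. In the short derivation below, $p$ is the base input price, and $w$ and $r$ are shorthand introduced here (not Anthropic's notation) for the cache write and cache read multipliers.

```latex
\begin{aligned}
N\,p &= w\,p + (N-1)\,r\,p \\
N &= \frac{w - r}{1 - r} \\
w = 1.25,\ r = 0.1 &: \quad N = \frac{1.15}{0.9} \approx 1.28 \\
w = 2.0,\ r = 0.1 &: \quad N = \frac{1.9}{0.9} \approx 2.11 \\
\text{savings}(N) &= 1 - \frac{w + (N-1)\,r}{N} \;\longrightarrow\; 1 - r \quad (N \to \infty)
\end{aligned}
```

Because $p$ cancels, the break-even count is the same for any model priced with these multipliers, and the savings approach $1 - r = 90\%$ from below without ever reaching it.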
OpenAI’s Break-Even (GPT-4.1 and GPT-5)
OpenAI’s caching has no write surcharge, so it pays off immediately on the second request. The savings depend on which model family you use:
def openai_breakeven():
    """
    OpenAI prompt caching break-even for GPT-4.1 and GPT-5.
    No cache write surcharge; discount varies by model family.
    """
    models = [
        ("GPT-4.1", 2.00, 0.75),  # 75% discount
        ("GPT-5", 1.25, 0.90),    # 90% discount
        ("GPT-5.4", 2.50, 0.90),  # 90% discount
        ("GPT-4o", 2.50, 0.50),   # 50% discount
    ]
    for name, base_price, discount in models:
        cached_price = base_price * (1 - discount)
        print(f"{name}: ${base_price:.2f}/MTok input, "
              f"${cached_price:.3f}/MTok cached ({discount*100:.0f}% discount)")
    print()
    print("Cost comparison for GPT-4.1 ($2.00/MTok, 75% cache discount):")
    print(f"{'Requests':>8} {'No Cache':>12} {'With Cache':>12} {'Savings':>10}")
    print("-" * 44)
    base = 2.00
    cached = 0.50
    for n in [1, 2, 5, 10, 20, 50]:
        no_cache = n * base
        with_cache = base + max(0, n - 1) * cached
        savings = (1 - with_cache / no_cache) * 100
        print(f"{n:>8} ${no_cache:>10.2f} ${with_cache:>10.2f} {savings:>9.1f}%")
openai_breakeven()
A Real-World Cost Example
Let us calculate the actual dollar savings for a realistic application: a customer support chatbot using Claude Sonnet 4.6 with a 3,000-token system prompt, 2,000 tokens of tool definitions, and an average conversation of 15 turns.
def chatbot_cost_comparison():
    """
    Real-world cost comparison for a customer support chatbot.
    Claude Sonnet 4.6 pricing (March 2026). Input tokens only.
    """
    # Configuration
    system_tokens = 3_000
    tool_tokens = 2_000
    prefix_tokens = system_tokens + tool_tokens  # 5,000 tokens
    avg_new_tokens_per_turn = 200  # Tokens per message (each turn adds a user message and an assistant reply)
    num_turns = 15
    conversations_per_day = 1_000
    # Pricing (Claude Sonnet 4.6, per million tokens)
    standard_price = 3.00
    cache_write_price = 3.75  # 5-minute TTL (1.25x)
    cache_read_price = 0.30
    # WITHOUT caching: every turn processes the full prefix + conversation history
    total_input_no_cache = 0
    for turn in range(1, num_turns + 1):
        history_tokens = (turn - 1) * avg_new_tokens_per_turn * 2
        input_this_turn = prefix_tokens + history_tokens + avg_new_tokens_per_turn
        total_input_no_cache += input_this_turn
    cost_no_cache_per_convo = (total_input_no_cache / 1_000_000) * standard_price
    # WITH caching: first turn is a cache write, subsequent turns are cache reads
    # Only the prefix is cached here; conversation history is billed at the standard rate.
    total_write_tokens = prefix_tokens  # Cache write on first turn
    total_read_tokens = prefix_tokens * (num_turns - 1)  # Cache reads on subsequent turns
    total_uncached_tokens = 0
    for turn in range(1, num_turns + 1):
        history_tokens = (turn - 1) * avg_new_tokens_per_turn * 2
        total_uncached_tokens += history_tokens + avg_new_tokens_per_turn
    cost_with_cache_per_convo = (
        (total_write_tokens / 1_000_000) * cache_write_price +
        (total_read_tokens / 1_000_000) * cache_read_price +
        (total_uncached_tokens / 1_000_000) * standard_price
    )
    savings_per_convo = cost_no_cache_per_convo - cost_with_cache_per_convo
    savings_pct = (savings_per_convo / cost_no_cache_per_convo) * 100
    print("Customer Support Chatbot Cost Analysis")
    print("=" * 55)
    print("Model: Claude Sonnet 4.6")
    print(f"System prompt: {system_tokens:,} tokens")
    print(f"Tool definitions: {tool_tokens:,} tokens")
    print(f"Cacheable prefix: {prefix_tokens:,} tokens")
    print(f"Average conversation: {num_turns} turns")
    print(f"Conversations per day: {conversations_per_day:,}")
    print()
    print("Per conversation:")
    print(f" Without caching: ${cost_no_cache_per_convo:.4f}")
    print(f" With caching: ${cost_with_cache_per_convo:.4f}")
    print(f" Savings: ${savings_per_convo:.4f} ({savings_pct:.1f}%)")
    print()
    print(f"Daily ({conversations_per_day:,} conversations):")
    print(f" Without caching: ${cost_no_cache_per_convo * conversations_per_day:.2f}")
    print(f" With caching: ${cost_with_cache_per_convo * conversations_per_day:.2f}")
    print(f" Daily savings: ${savings_per_convo * conversations_per_day:.2f}")
    print()
    monthly_savings = savings_per_convo * conversations_per_day * 30
    print(f"Monthly savings: ${monthly_savings:,.2f}")
chatbot_cost_comparison()
The savings are substantial. Under these assumptions, caching just the system prompt and tool definitions cuts the input cost per conversation from about $0.36 to about $0.17 (roughly 51%), which for a chatbot handling 1,000 conversations per day comes to about $5,500 per month. And this is a conservative estimate: in practice, conversation history caching (where each turn’s prefix includes all previous turns) provides additional savings.
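To get a feel for how much conversation history caching adds, the sketch below reruns the same scenario while treating everything sent on the previous turn as a cache read and only the newly appended messages as a cache write. It assumes that every turn arrives before the 5-minute TTL expires and that the cache boundary moves to the end of each request; both are simplifications, and the function name is hypothetical.

```python
def conversation_cost_with_history_caching(
    prefix_tokens=5_000,
    tokens_per_message=200,
    num_turns=15,
    standard_price=3.00,     # $/MTok, same figures as the example above
    cache_write_price=3.75,  # 5-minute TTL
    cache_read_price=0.30,
):
    """Return (cost_without_cache, cost_with_incremental_cache) in dollars."""
    write_tokens = 0
    read_tokens = 0
    cached_so_far = 0  # tokens covered by the cache after the previous turn
    for turn in range(1, num_turns + 1):
        history = (turn - 1) * 2 * tokens_per_message
        input_tokens = prefix_tokens + history + tokens_per_message
        read_tokens += cached_so_far                 # reused from the last turn
        write_tokens += input_tokens - cached_so_far  # newly appended messages
        cached_so_far = input_tokens
    total_tokens = read_tokens + write_tokens
    no_cache = total_tokens / 1e6 * standard_price
    with_cache = (write_tokens / 1e6 * cache_write_price
                  + read_tokens / 1e6 * cache_read_price)
    return no_cache, with_cache

no_cache, with_cache = conversation_cost_with_history_caching()
print(f"Without caching:        ${no_cache:.4f}")
print(f"With history caching:   ${with_cache:.4f}")
print(f"Savings:                {(1 - with_cache / no_cache) * 100:.1f}%")
```

Under these assumptions the per-conversation input cost falls to roughly $0.07, about 80% below the uncached $0.36, compared with about 51% when only the static prefix is cached.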
Cache Hit Rates in Practice
The effectiveness of prompt caching depends on the cache hit rate: the fraction of input tokens that are served from cache rather than computed from scratch. Cache hit rates vary widely depending on the application pattern.
High Cache Hit Rates (80-95%)
Applications with large, stable prefixes achieve the highest cache hit rates:
- Chatbots with long system prompts: A 5,000-token system prompt is cached and reused across every turn of every conversation. If the average total input per turn is 6,000 tokens, the cache hit rate is 5,000/6,000 = 83%.
- RAG with stable document context: If you include the same 10,000-token document in every request and the user’s question adds 200 tokens, the cache hit rate is 10,000/10,200 = 98%.
- Coding assistants with tool definitions: Tools like Claude Code or Cursor send large tool schemas (often 3,000 to 8,000 tokens) with every request. These are identical across requests and achieve very high cache hit rates.
Moderate Cache Hit Rates (40-70%)
Applications with partially stable prefixes:
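- Multi-turn conversations: The system prompt is always cached, but the conversation history grows with each turn. Early turns have high cache hit rates (the prefix is mostly system prompt). Later turns carry a long history that is unique to one conversation: it can be reused by that conversation’s next turn if the turn arrives before the cache entry expires, but it never matches other users’ conversations, so the hit rate drops whenever turns are spaced further apart than the TTL.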
- Few-shot learning with rotating examples: If you rotate which examples are included, the prefix changes and cache hits decrease.
Low Cache Hit Rates (< 30%)
Applications where the prefix changes frequently:
- Unique prompts per request: If every request has a completely different prompt (e.g., different documents for each query), there is no shared prefix to cache.
- Prompts with dynamic prefixes: If you include timestamps, user IDs, or other variable data at the beginning of the prompt, the prefix changes every time and caching is ineffective.
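The effect of a dynamic prefix is easy to see by measuring how many leading tokens two requests share. The sketch below uses whitespace-separated words as stand-ins for tokens (a real tokenizer would split differently), and the prompts and the `user=` field are made up for illustration.

```python
def shared_prefix_length(a, b):
    """Number of leading tokens that two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

system = "You are a helpful support agent . Answer briefly .".split()
question_1 = "How do I reset my password ?".split()
question_2 = "Where is my invoice ?".split()

# Static first: the whole system prompt is a shared prefix.
static_first_1 = system + ["user=alice"] + question_1
static_first_2 = system + ["user=bob"] + question_2

# Dynamic first: the per-user field changes the very first token.
dynamic_first_1 = ["user=alice"] + system + question_1
dynamic_first_2 = ["user=bob"] + system + question_2

static_shared = shared_prefix_length(static_first_1, static_first_2)
dynamic_shared = shared_prefix_length(dynamic_first_1, dynamic_first_2)
print(f"Static-first shared prefix:  {static_shared} tokens")
print(f"Dynamic-first shared prefix: {dynamic_shared} tokens")
```

The two layouts contain exactly the same tokens; only the order differs, and that alone decides whether anything can be reused.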
def cache_hit_rate_scenarios():
    """
    Demonstrate cache hit rates for different application patterns.
    Only the stable prefix (system_tokens + tool_tokens) counts as cacheable.
    """
    scenarios = [
        {
            "name": "Chatbot (5K system prompt, 15-turn avg)",
            "system_tokens": 5_000,
            "tool_tokens": 3_000,
            "avg_history_tokens": 4_000,  # Average across all turns
            "new_tokens": 200,
        },
        {
            "name": "RAG (10K document context)",
            "system_tokens": 10_500,  # 500-token instructions + 10K-token stable document
            "tool_tokens": 0,
            "avg_history_tokens": 0,
            "new_tokens": 200,
        },
        {
            "name": "Code assistant (8K tool definitions)",
            "system_tokens": 2_000,
            "tool_tokens": 8_000,
            "avg_history_tokens": 3_000,
            "new_tokens": 500,
        },
        {
            "name": "Unique prompts (no shared prefix)",
            "system_tokens": 0,
            "tool_tokens": 0,
            "avg_history_tokens": 0,
            "new_tokens": 2_000,
        },
    ]
    print(f"{'Scenario':<45} {'Cacheable':>10} {'Total':>8} {'Hit Rate':>10}")
    print("-" * 75)
    for s in scenarios:
        cacheable = s["system_tokens"] + s["tool_tokens"]
        total = cacheable + s["avg_history_tokens"] + s["new_tokens"]
        hit_rate = cacheable / total * 100 if total > 0 else 0
        print(f"{s['name']:<45} {cacheable:>10,} {total:>8,} {hit_rate:>9.1f}%")
cache_hit_rate_scenarios()
Note that these hit rates count only the stable prefix (the system prompt, the tool definitions, and, in the RAG scenario, the shared document) as cached. In practice, conversation history is also cached (each turn’s prefix includes all previous turns), which increases the effective cache hit rate significantly for multi-turn conversations.
Prompt Caching for Self-Hosted Models
The prompt caching implementations described above are features of commercial API providers. But if you run your own models using open-source inference servers, you can get the same benefits through automatic prefix caching (APC).
vLLM: Automatic Prefix Caching
vLLM, the open-source inference engine introduced in Chapter 18 for its PagedAttention memory management, supports automatic prefix caching. When enabled, vLLM detects shared prefixes across requests and reuses the cached KV pages.
from vllm import LLM, SamplingParams
# Enable automatic prefix caching
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_prefix_caching=True,  # Enable APC
)
# First request: computes KV cache for the full prompt
response_1 = llm.generate(
    ["You are a helpful assistant.\n\nWhat is Python?"],
    SamplingParams(max_tokens=200),
)
# Second request: same prefix, different question
# vLLM reuses the cached KV for "You are a helpful assistant.\n\n"
response_2 = llm.generate(
    ["You are a helpful assistant.\n\nWhat is JavaScript?"],
    SamplingParams(max_tokens=200),
)
vLLM’s APC works at the page level (using the same PagedAttention pages from Chapter 18). When a new request shares a prefix with a previous request, vLLM reuses the cached pages for the shared portion and only computes new pages for the unique suffix. Because reuse happens in whole pages, a shared prefix shorter than one page, like the one-sentence system prompt in this toy example, would not by itself produce a cache hit; production system prompts are typically long enough that this granularity is negligible. This is transparent to the user and requires no changes to the prompt structure beyond enabling the flag.
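Page-level matching can be pictured as hashing each full page of tokens together with the hash of everything before it, so that two requests produce the same page hash only when their entire prefix up to that page is identical. The following is a minimal sketch of that idea, not vLLM's actual code: the 16-token page size, the use of SHA-256, and the function names are all assumptions made for illustration.

```python
import hashlib

def block_hashes(token_ids, block_size=16):
    """Hash each full block together with the hash of the blocks before it."""
    hashes = []
    parent = b""
    full = len(token_ids) - len(token_ids) % block_size  # ignore the partial tail
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + b"|" + ",".join(map(str, block)).encode()
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent)
    return hashes

def reusable_tokens(cached_hashes, token_ids, block_size=16):
    """Count leading tokens whose blocks are already present in the cache."""
    matched = 0
    for h in block_hashes(token_ids, block_size):
        if h not in cached_hashes:
            break
        matched += block_size
    return matched

# First request: 100 tokens, of which 6 full blocks (96 tokens) are cached.
cached = set(block_hashes(list(range(100))))
# Second request shares only the first 40 tokens with the first one.
new_request = list(range(40)) + list(range(500, 560))
reused = reusable_tokens(cached, new_request)
print(f"Reusable tokens: {reused} of {len(new_request)}")
```

Although 40 tokens are shared, only the two complete blocks (32 tokens) can be reused; the third block mixes shared and new tokens, so it has to be recomputed.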
SGLang: RadixAttention
SGLang, developed by the LMSYS team (Zheng et al., arXiv:2312.07104), takes a more sophisticated approach with RadixAttention. Instead of simple prefix matching, RadixAttention organizes the KV cache in a radix tree (a compressed trie data structure), which enables efficient matching of shared prefixes across many concurrent requests, even when those requests share different portions of their prefixes.
The radix tree structure is particularly effective for tree-structured workloads where multiple requests branch from common prefixes. For example, if you are evaluating a model on multiple questions about the same document, all requests share the document prefix but diverge at the question. RadixAttention can cache the document’s KV state once and reuse it for all questions, automatically discovering the optimal sharing pattern. SGLang reports up to 6.4x higher throughput than alternatives on workloads with high prefix sharing, and its v0.4 release (December 2024) added a cache-aware load balancer that achieved up to 1.9x throughput increase with 3.8x higher cache hit rates in multi-GPU deployments.
Source: vLLM automatic prefix caching documentation (docs.vllm.com.cn). Set enable_prefix_caching=True to enable APC. Zheng et al., “SGLang: Efficient Execution of Structured Language Model Programs,” arXiv:2312.07104, December 2023. RadixAttention uses a radix tree for automatic and efficient KV cache reuse across multiple generation calls (lmsys.org/blog/2024-01-17-sglang). SGLang up to 6.4x higher throughput (qiyanjun.github.io, repleteai.com, inference.net). SGLang v0.4 cache-aware load balancer: 1.9x throughput, 3.8x cache hit rate (lmsys.org/blog/2024-12-04-sglang-v0-4).
Structuring Prompts for Maximum Cache Hits
The single most important thing you can do to benefit from prompt caching is to structure your prompts so that the static content comes first and the dynamic content comes last. This maximizes the length of the shared prefix across requests.
The Golden Rule: Static First, Dynamic Last
# OPTIMAL: Static content at the beginning, dynamic content at the end
optimal_prompt = [
    # Layer 1: System prompt (identical across ALL requests)
    {"role": "system", "content": "You are a helpful coding assistant..."},
    # Layer 2: Tool definitions (identical across ALL requests)
    # (In function-calling APIs, tools are part of the request)
    # Layer 3: Reference documents (identical within a session)
    {"role": "user", "content": "Here is the codebase context:\n{large_document}"},
    {"role": "assistant", "content": "I've reviewed the codebase. How can I help?"},
    # Layer 4: Conversation history (grows but prefix-stable)
    {"role": "user", "content": "Previous question..."},
    {"role": "assistant", "content": "Previous answer..."},
    # Layer 5: New user message (changes every request)
    {"role": "user", "content": "New question here"},
]
Each layer is more stable than the one below it. The system prompt never changes. Tool definitions rarely change. Reference documents change per session but not per turn. Conversation history grows but the prefix (all previous turns) is stable. Only the final user message is truly new.
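Keeping the early layers stable also means serializing them identically on every request, since the cache matches on the exact token sequence. One way to guard against accidental drift is to put the tools in a canonical order and log a short fingerprint of the static prefix. The sketch below is a hypothetical helper, not part of any provider SDK; it assumes each tool is a dict with a top-level `name` key, which may not match the schema your provider uses.

```python
import hashlib
import json

def canonical_tools(tools):
    """Return tools sorted by name so every request serializes them identically."""
    return sorted(tools, key=lambda tool: tool["name"])

def prefix_fingerprint(system_prompt, tools):
    """Short hash of the static prefix; log it to spot unintended changes."""
    payload = json.dumps(
        {"system": system_prompt, "tools": canonical_tools(tools)},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

tools_a = [
    {"name": "search", "description": "Search the documentation"},
    {"name": "calc", "description": "Evaluate arithmetic"},
]
tools_b = list(reversed(tools_a))  # same tools, different registration order

fp_a = prefix_fingerprint("You are a helpful coding assistant.", tools_a)
fp_b = prefix_fingerprint("You are a helpful coding assistant.", tools_b)
print(fp_a == fp_b)  # the fingerprint ignores registration order
```

If the logged fingerprint changes between deployments, the static prefix changed too, and a drop in cache hit rate should be expected until the new prefix is cached.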
Common Mistakes That Break Caching
from datetime import datetime
import random
# MISTAKE 1: Timestamp in the system prompt
# This changes every request, invalidating the entire cache
bad_system = f"You are a helpful assistant. Current time: {datetime.now()}"
# FIX: Put timestamps in the user message, not the system prompt
# MISTAKE 2: User ID or session metadata at the beginning
# (session_id and user_name stand for per-request values)
bad_prompt = [
    {"role": "system", "content": f"Session ID: {session_id}\nUser: {user_name}\n..."},
]
# FIX: Put per-user metadata after the static system prompt
# MISTAKE 3: Randomized few-shot examples
examples = random.sample(all_examples, k=3)  # Different examples and order each time!
# FIX: Use a fixed, deterministic set of examples, or sort them consistently
# MISTAKE 4: Different tool ordering across requests
# Some frameworks serialize tools in non-deterministic order
# FIX: Ensure tools are always serialized in the same order
Measuring Cache Effectiveness
All three major providers return cache usage information in their API responses. You should monitor these metrics to verify that caching is working as expected:
def monitor_cache_effectiveness(responses):
    """
    Track cache hit rates across a series of API responses.
    Works with OpenAI's response format.
    """
    total_input = 0
    total_cached = 0
    for resp in responses:
        usage = resp.usage
        input_tokens = usage.prompt_tokens
        # prompt_tokens_details may be missing or None on some responses
        cached_tokens = getattr(
            usage.prompt_tokens_details, 'cached_tokens', 0
        ) or 0
        total_input += input_tokens
        total_cached += cached_tokens
    hit_rate = total_cached / total_input * 100 if total_input > 0 else 0
    print(f"Total input tokens: {total_input:,}")
    print(f"Cached tokens: {total_cached:,}")
    print(f"Computed tokens: {total_input - total_cached:,}")
    print(f"Cache hit rate: {hit_rate:.1f}%")
    # Estimate cost savings (GPT-4.1: 75% discount on cached tokens)
    full_cost = total_input / 1_000_000 * 2.00  # $2.00/MTok
    actual_cost = (
        (total_input - total_cached) / 1_000_000 * 2.00 +
        total_cached / 1_000_000 * 0.50  # $0.50/MTok cached
    )
    print(f"Cost without cache: ${full_cost:.4f}")
    print(f"Cost with cache: ${actual_cost:.4f}")
    if full_cost > 0:  # Avoid dividing by zero when there were no input tokens
        print(f"Savings: ${full_cost - actual_cost:.4f} "
              f"({(1 - actual_cost/full_cost)*100:.1f}%)")
If your cache hit rate is lower than expected, check for the common mistakes listed above. The most frequent cause of poor cache performance is dynamic content at the beginning of the prompt that invalidates the prefix match.
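The monitoring function above reads OpenAI's usage fields. A sketch of normalizing the three providers' usage objects into one pair of numbers might look like the following. It takes plain dicts rather than SDK objects, the helper name is hypothetical, and the field names reflect each provider's usage format as understood at the time of writing; check them against the current API references before relying on them. In particular, it assumes that Anthropic's `input_tokens` excludes tokens read from or written to the cache, while the OpenAI and Gemini prompt counts include cached tokens.

```python
def cached_and_total(provider, usage):
    """Return (cached_tokens, total_input_tokens) from a provider usage dict."""
    if provider == "openai":
        details = usage.get("prompt_tokens_details") or {}
        return details.get("cached_tokens", 0) or 0, usage["prompt_tokens"]
    if provider == "anthropic":
        read = usage.get("cache_read_input_tokens", 0) or 0
        written = usage.get("cache_creation_input_tokens", 0) or 0
        # Assumption: input_tokens counts only the uncached remainder.
        return read, usage["input_tokens"] + read + written
    if provider == "gemini":
        cached = usage.get("cached_content_token_count", 0) or 0
        return cached, usage["prompt_token_count"]
    raise ValueError(f"unknown provider: {provider}")

samples = [
    ("openai", {"prompt_tokens": 6000, "prompt_tokens_details": {"cached_tokens": 4992}}),
    ("anthropic", {"input_tokens": 200, "cache_read_input_tokens": 5000,
                   "cache_creation_input_tokens": 800}),
    ("gemini", {"prompt_token_count": 6000, "cached_content_token_count": 4096}),
]
for provider, usage in samples:
    cached, total = cached_and_total(provider, usage)
    print(f"{provider:<10} {cached:>6,} / {total:>6,} = {cached / total * 100:.1f}%")
```

The sample usage values are invented; with real traffic you would feed each response's usage object (converted to a dict) into the same function and aggregate the results.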
The Technical Implementation: Prefix Trees and Hash Matching
Under the hood, prompt caching systems use two main techniques to efficiently match prefixes across requests: hash-based matching and tree-based matching.
Hash-Based Matching
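The simplest approach computes a hash of the token sequence at fixed intervals (e.g., every 128 tokens) and checks whether that hash matches a stored cache entry. This is consistent with the behavior OpenAI documents for its automatic caching (a 1,024-token minimum and 128-token increments), although the internal implementation is not public. Hash-based matching is fast and requires minimal bookkeeping, but it only supports exact prefix matches.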
import hashlib
def compute_prefix_hashes(token_ids, block_size=128, min_prefix=1024):
    """
    Compute hashes at block boundaries for prefix matching.
    Matches OpenAI's approach: starts at 1024 tokens,
    increments in 128-token blocks.
    """
    hashes = {}
    # Start at minimum prefix length
    for end in range(min_prefix, len(token_ids) + 1, block_size):
        prefix = tuple(token_ids[:end])
        h = hashlib.sha256(str(prefix).encode()).hexdigest()[:16]
        hashes[end] = h
    return hashes
# Example: two requests with a shared 2000-token prefix
request_1_tokens = list(range(2500))  # 2500 tokens
request_2_tokens = list(range(2000)) + list(range(10000, 10300))  # Same first 2000, then different
hashes_1 = compute_prefix_hashes(request_1_tokens)
hashes_2 = compute_prefix_hashes(request_2_tokens)
# Find the longest matching prefix
max_match = 0
for length in sorted(set(hashes_1.keys()) & set(hashes_2.keys()), reverse=True):
    if hashes_1[length] == hashes_2[length]:
        max_match = length
        break
print(f"Request 1: {len(request_1_tokens)} tokens")
print(f"Request 2: {len(request_2_tokens)} tokens")
print(f"Longest matching prefix: {max_match} tokens")
print(f"Request 2 needs to compute only {len(request_2_tokens) - max_match} new tokens")
Tree-Based Matching (RadixAttention)
SGLang’s RadixAttention uses a radix tree to organize cached KV states. A radix tree is a compressed trie where each edge represents a sequence of tokens rather than a single token. This allows efficient lookup of the longest matching prefix for any new request, and it naturally supports sharing across requests that branch from different points in the prefix.
class RadixTreeNode:
    """
    Simplified radix tree node for KV cache management.
    Each node stores a sequence of tokens and a reference to cached KV data.
    """
    def __init__(self):
        self.children = {}    # first_token -> RadixTreeNode
        self.tokens = []      # Token sequence stored at this edge
        self.kv_cache = None  # Reference to cached KV data (if any)
        self.ref_count = 0    # Number of active requests using this node
class RadixKVCache:
    """
    Simplified radix tree for KV cache prefix sharing.
    Inspired by SGLang's RadixAttention.
    """
    def __init__(self):
        self.root = RadixTreeNode()
    def find_longest_prefix(self, token_ids):
        """
        Find the longest cached prefix for a given token sequence.
        Returns (matched_length, kv_cache_reference). On a partial edge
        match, only the first matched_length positions of the returned
        KV data apply to this request.
        """
        node = self.root
        matched = 0
        best_kv = None
        i = 0
        while i < len(token_ids):
            first_token = token_ids[i]
            if first_token not in node.children:
                break
            child = node.children[first_token]
            # Check how many tokens match along this edge
            edge_tokens = child.tokens
            match_len = 0
            for j in range(len(edge_tokens)):
                if i + j >= len(token_ids) or token_ids[i + j] != edge_tokens[j]:
                    break
                match_len += 1
            matched += match_len
            if child.kv_cache is not None:
                best_kv = child.kv_cache
            if match_len < len(edge_tokens):
                break  # Partial match along this edge
            node = child
            i += match_len
        return matched, best_kv
    def insert(self, token_ids, kv_cache):
        """Insert a token sequence and its KV cache into the tree."""
        # Simplified: in practice, this involves splitting edges
        # and managing memory for KV cache pages
        node = self.root
        i = 0
        while i < len(token_ids):
            first_token = token_ids[i]
            if first_token not in node.children:
                # Create new edge with remaining tokens
                new_node = RadixTreeNode()
                new_node.tokens = token_ids[i:]
                new_node.kv_cache = kv_cache
                node.children[first_token] = new_node
                return
            child = node.children[first_token]
            edge_len = len(child.tokens)
            # Check for match
            match_len = 0
            for j in range(edge_len):
                if i + j >= len(token_ids) or token_ids[i + j] != child.tokens[j]:
                    break
                match_len += 1
            if match_len == edge_len:
                node = child
                i += edge_len
            else:
                # A full implementation would split this edge at match_len.
                # That step is omitted for brevity, so this simplified
                # version stops without storing the diverging sequence.
                break
# Example usage
cache = RadixKVCache()
# Three requests sharing different prefixes:
# Request A: [system_prompt] + [doc_1] + [question_1]
# Request B: [system_prompt] + [doc_1] + [question_2]
# Request C: [system_prompt] + [doc_2] + [question_3]
# With edge splitting in place, the radix tree stores:
#   root -> [system_prompt] -> [doc_1] -> [question_1]
#                                      -> [question_2]
#                           -> [doc_2] -> [question_3]
#
# Request B reuses the KV cache for [system_prompt] + [doc_1]
# Request C reuses the KV cache for [system_prompt] only
The radix tree approach is more memory-efficient than storing separate caches for each unique prefix because shared portions are stored only once. It also supports more flexible matching patterns than simple hash-based approaches.
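The edge-splitting step that the simplified `insert` leaves out is what lets requests A, B, and C share nodes. The following is a minimal, self-contained sketch of that step under simplifying assumptions: it tracks token sequences only (no KV data, reference counts, or eviction), uses short strings as stand-in tokens, and is not SGLang's implementation. The names `Node`, `insert`, and `longest_prefix` are illustrative.

```python
class Node:
    def __init__(self, tokens=()):
        self.tokens = list(tokens)  # tokens on the edge leading into this node
        self.children = {}          # first token of a child edge -> Node

def common_len(a, b):
    """Length of the common prefix of two sequences."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n

def insert(root, token_ids):
    """Insert a sequence, splitting an edge where the new sequence diverges."""
    node, i = root, 0
    while i < len(token_ids):
        child = node.children.get(token_ids[i])
        if child is None:
            node.children[token_ids[i]] = Node(token_ids[i:])
            return
        k = common_len(child.tokens, token_ids[i:])
        if k < len(child.tokens):
            # Split: `middle` keeps the shared part, `child` keeps the rest.
            middle = Node(child.tokens[:k])
            child.tokens = child.tokens[k:]
            middle.children[child.tokens[0]] = child
            node.children[token_ids[i]] = middle
            child = middle
        node, i = child, i + k

def longest_prefix(root, token_ids):
    """Number of leading tokens already stored in the tree."""
    node, i = root, 0
    while i < len(token_ids):
        child = node.children.get(token_ids[i])
        if child is None:
            break
        k = common_len(child.tokens, token_ids[i:])
        i += k
        if k < len(child.tokens):
            break
        node = child
    return i

root = Node()
insert(root, ["S1", "S2", "D1", "Q1"])  # Request A: system + doc_1 + question_1
insert(root, ["S1", "S2", "D1", "Q2"])  # Request B: system + doc_1 + question_2
insert(root, ["S1", "S2", "D2", "Q3"])  # Request C: system + doc_2 + question_3
print(longest_prefix(root, ["S1", "S2", "D1", "Q9"]))  # shares system + doc_1
print(longest_prefix(root, ["S1", "S2", "D3", "Q9"]))  # shares system only
```

After the three inserts, the shared system prompt sits on a single edge below the root, with one branch per document, which is the structure sketched in the comments above.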
Source: Zheng et al., “SGLang: Efficient Execution of Structured Language Model Programs,” arXiv:2312.07104, December 2023. RadixAttention organizes KV cache in a radix tree for automatic prefix sharing across requests (lmsys.org/blog/2024-01-17-sglang).
Prompt Caching and the KV Cache: Connecting the Concepts
In Chapter 18, you learned that the KV cache stores key and value vectors computed during the prefill phase so they do not need to be recomputed during the decode phase (token-by-token generation). Prompt caching extends this concept one level further: it stores the KV cache from the prefill phase of one request so it does not need to be recomputed during the prefill phase of the next request.
Here is how the two optimizations work together:
Without KV cache or prompt caching:
    Request 1: Compute K,V for all tokens at every generation step (O(n^2) per request)
    Request 2: Same as Request 1, starting from scratch
With KV cache only (within a single request):
    Request 1: Compute K,V once during prefill, reuse during decode (O(n) per request)
    Request 2: Compute K,V once during prefill again (no reuse across requests)
With KV cache + prompt caching (across requests):
    Request 1: Compute K,V during prefill, store in prompt cache (O(n) first request)
    Request 2: Load cached K,V for shared prefix, compute only new tokens (O(m), where m << n is the number of new tokens)
The KV cache eliminates redundancy within a request. Prompt caching eliminates redundancy across requests. Together, they ensure that each token’s K and V vectors are computed exactly once, no matter how many generation steps or API calls use them.
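To put rough numbers on the three regimes, the sketch below counts how many token positions need their K and V vectors computed for one request. It is a back-of-the-envelope model with made-up sizes: it counts positions per layer, ignores the attention computation itself, and the function name and parameters are hypothetical.

```python
def kv_computations(prompt_len, gen_len, cached_prefix=0, use_kv_cache=True):
    """Count token positions whose K/V must be computed for one request."""
    if not use_kv_cache:
        # Every generation step re-encodes the whole sequence seen so far.
        return sum(prompt_len + step for step in range(gen_len))
    # Prefill only the uncached part of the prompt; afterwards each step
    # adds K/V for the single token produced by the previous step.
    return (prompt_len - cached_prefix) + max(gen_len - 1, 0)

prompt_len, gen_len, shared_prefix = 1_000, 100, 900

no_caching = kv_computations(prompt_len, gen_len, use_kv_cache=False)
kv_only = kv_computations(prompt_len, gen_len)
kv_plus_prompt = kv_computations(prompt_len, gen_len, cached_prefix=shared_prefix)

print(f"No KV cache:               {no_caching:,}")
print(f"KV cache only:             {kv_only:,}")
print(f"KV cache + prompt caching: {kv_plus_prompt:,}")
```

With these illustrative sizes, the KV cache removes the repeated work inside the request, and prompt caching then removes most of what remains, because 900 of the 1,000 prompt tokens were already computed by an earlier request.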
Memory Implications
Prompt caching requires additional GPU memory to store the cached KV states. This memory comes from the same pool that would otherwise be used for serving concurrent requests (as discussed in Chapter 18’s section on batch serving). There is a direct tradeoff: more memory allocated to prompt caches means fewer concurrent requests can be served, but each request that hits the cache is processed faster and cheaper.
In practice, inference servers use eviction policies (similar to the KV cache eviction strategies from Chapter 18) to manage the prompt cache. The most common policy is LRU (Least Recently Used): when the cache is full and a new entry needs to be stored, the least recently accessed cache entry is evicted. This naturally prioritizes frequently used prefixes (like popular system prompts) over rarely used ones.
from collections import OrderedDict
class LRUPromptCache:
    """
    LRU eviction policy for prompt cache management.
    When the cache is full, evict the least recently used entry.
    """
    def __init__(self, max_entries=1000):
        self.max_entries = max_entries
        self.cache = OrderedDict()  # hash -> kv_cache_data
    def get(self, prefix_hash):
        """Look up a cached prefix. Returns None on miss."""
        if prefix_hash in self.cache:
            # Move to end (most recently used)
            self.cache.move_to_end(prefix_hash)
            return self.cache[prefix_hash]
        return None
    def put(self, prefix_hash, kv_cache_data):
        """Store a new cache entry, evicting LRU if necessary."""
        if prefix_hash in self.cache:
            self.cache.move_to_end(prefix_hash)
            self.cache[prefix_hash] = kv_cache_data
        else:
            if len(self.cache) >= self.max_entries:
                # Evict least recently used
                self.cache.popitem(last=False)
            self.cache[prefix_hash] = kv_cache_data
    def stats(self):
        return {
            "entries": len(self.cache),
            "capacity": self.max_entries,
            "utilization": len(self.cache) / self.max_entries * 100,
        }
Prompt Caching vs. Conversation Memory
It is important to distinguish prompt caching from “memory” features offered by some AI products. They solve different problems:
Prompt caching is a server-side optimization that avoids redundant computation. The model still receives the full conversation history with every request. It still processes the full context. The optimization is that the KV cache for the shared prefix is loaded from storage rather than recomputed. The model’s behavior is identical whether caching is used or not; only the cost and latency change.
Conversation memory (as offered by ChatGPT’s memory feature, for example) is a different concept entirely. Memory systems store summaries or key facts from previous conversations and inject them into future prompts. This changes what the model sees and therefore changes its behavior. Memory is a product feature; prompt caching is an infrastructure optimization.
| Aspect | Prompt Caching | Conversation Memory |
|---|---|---|
| What it stores | KV cache (intermediate computation) | Summaries, facts, user preferences |
| Where it operates | Server infrastructure | Application layer |
| Effect on model output | None (identical output) | Changes output (different context) |
| Scope | Within a session or short time window | Across sessions, potentially permanent |
| Who benefits | Developers (cost/latency) | End users (personalization) |
| Requires API changes | Minimal or none | Significant application logic |
Key Takeaways
Language model APIs are stateless: every request sends the full conversation from scratch. Without prompt caching, the server recomputes the KV cache for the entire shared prefix (system prompt, tool definitions, conversation history) at every turn, wasting both compute and money.
Prompt caching stores the KV cache from the prefill phase of one request and reuses it when a subsequent request shares the same token prefix. This eliminates redundant computation across API calls, reducing both latency (up to 80-85% faster TTFT) and cost (up to 90% cheaper input tokens).
OpenAI offers automatic prompt caching (launched October 1, 2024 at DevDay). It requires no code changes, activates for prompts of 1,024+ tokens, and matches in 128-token increments. Cache discounts vary by model family: 90% for GPT-5/GPT-5.4, 75% for GPT-4.1 and o-series (o3, o4-mini), and 50% for GPT-4o. Default cache retention is 5 to 10 minutes; the Responses API supports extended 24-hour retention via prompt_cache_retention="24h" for GPT-5.1 and newer models. GPT-5.4 (March 5, 2026) introduced tool search, which reduces token usage by 47% by fetching tool definitions on demand instead of loading them all into the prompt.
Anthropic offers explicit prompt caching (launched August 14, 2024) with cache_control markers. The 5-minute cache write costs 1.25x the standard input price; the 1-hour cache write costs 2.0x. Cache reads cost 0.1x (90% discount). The 5-minute TTL refreshes on each access. Minimum cacheable content varies by model: 4,096 tokens for Opus 4.6 and Haiku 4.5, 2,048 tokens for Sonnet 4.6, and 1,024 tokens for Sonnet 4.5 and older models. Cache isolation operates at the workspace level since February 2026. As of March 13, 2026, the long-context surcharge for requests exceeding 200K tokens has been removed for Opus 4.6 and Sonnet 4.6.
Google Gemini offers both explicit caching (since May 2024) and implicit caching (since May 9, 2025). Both provide a 75% discount on cached tokens. Implicit caching requires no setup; explicit caching incurs a storage cost of $4.50 per million tokens per hour. Minimum thresholds are 1,024 tokens for 2.5 Flash and 2,048 tokens for 2.5 Pro.
Prompt caching pays for itself almost immediately. With Anthropic’s 5-minute cache (1.25x write surcharge), the break-even point is approximately 1.3 requests, so you are ahead from the second request. With the 1-hour cache (2.0x write surcharge), the break-even point is approximately 2.1 requests, so you are ahead from the third. With OpenAI’s zero-surcharge approach, every cache hit saves money from the first occurrence.
Cache hit rates depend on application structure. Chatbots with large system prompts achieve 80-95% hit rates. RAG applications with stable document context can exceed 95%. Applications with unique prompts per request see minimal benefit. Anthropic’s Claude Code team treats drops in cache hit rate as production incidents.
For self-hosted models, vLLM supports automatic prefix caching via enable_prefix_caching=True, and SGLang uses RadixAttention (a radix tree structure) for more flexible prefix sharing, achieving up to 6.4x throughput improvement on workloads with high prefix sharing.
The golden rule for maximizing cache hits: put static content (system prompt, tool definitions, reference documents) at the beginning of the prompt and dynamic content (user’s new message) at the end. Avoid timestamps, random orderings, or per-user metadata at the start of the prompt. Be aware that API-level settings (toggling web search, changing tool_choice, adding images, modifying thinking parameters) can also invalidate cached content, even if the prompt text itself has not changed.
An empirical evaluation across providers (Agarwal et al., arXiv:2601.06007) found that prompt caching reduced API costs by 41 to 80% and improved TTFT by 13 to 31% on long-horizon agentic tasks, confirming that the benefits scale with prompt size and tool call frequency.
Prompt caching extends the KV cache concept from Chapter 18. The KV cache eliminates redundant computation within a single request (across generation steps). Prompt caching eliminates redundant computation across requests (across API calls). Together, they ensure each token’s K and V vectors are computed exactly once.
What’s Next
You now understand how prompt caching reuses KV cache state across API calls to reduce both cost and latency. But there is a deeper challenge lurking behind all of these optimizations: the fundamental scaling problem of attention itself. As context windows grow from 4K to 128K to 1 million tokens and beyond, the quadratic cost of attention becomes the dominant bottleneck. In Chapter 20, we will explore how techniques like Flash Attention, Ring Attention, and sparse attention patterns make long-context inference practical, and why context windows have grown by over 250x in just three years.