Structure exact reusable prefixes, validate cache hits from usage fields, and enforce invalidation and tenant-isolation boundaries.
PagedAttention packs live KV blocks inside a serving engine. Prefix caching asks the next question: what if the same prompt prefix appears again in later requests?
The key-value (KV) cache inside one request saves work while the model generates the next token. Prefix caching saves work across requests that begin with the same tokens.
That distinction matters for long-context products. Suppose a coding assistant receives the same repository guidelines, architecture notes, tool schema, and safety instructions on every request. Without prefix caching, the model pays the prefill cost again and again. With compatible prefix caching enabled, the runtime can reuse computed KV state for the shared prefix, then process only the new user question.
During generation, a decoder-only transformer stores keys and values for tokens it has already processed. When it predicts token 501, it doesn't recompute attention keys and values for tokens 1 through 500. That's the normal KV cache.
This cache belongs to the active sequence. It helps decode speed, but it doesn't automatically help the next request.
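To make the saving concrete, here is a toy count of key/value projections for one sequence, with and without the per-request KV cache. It is a sketch under simplifying assumptions: it counts only K/V projections (not attention itself), and the function names are illustrative, not taken from any engine.

```python
def projections_without_cache(num_tokens: int) -> int:
    """Count K/V projections if every step reprocesses the whole sequence so far."""
    return sum(step for step in range(1, num_tokens + 1))


def projections_with_cache(num_tokens: int) -> int:
    """Count K/V projections when each token's keys and values are stored once."""
    kv_cache: list[int] = []  # stand-in for stored (key, value) pairs
    computed = 0
    for position in range(num_tokens):
        kv_cache.append(position)  # project only the newest token
        computed += 1
    return computed


tokens = 500
print("without cache:", projections_without_cache(tokens))
print("with cache:", projections_with_cache(tokens))
```

The cached count grows linearly with sequence length, while the uncached count grows quadratically. The saving ends when the sequence ends, which is the gap prefix caching addresses.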
Prefix caching is different. It asks: if request B starts with the exact same token prefix as request A, can the runtime reuse the already-computed KV blocks for that prefix?
vLLM calls this Automatic Prefix Caching. Its current documentation enables the feature with enable_prefix_caching=True; it's a serving configuration choice, not a behavior to assume from PagedAttention alone.[1] SGLang's RadixAttention uses a radix tree to reuse KV cache state across structured generation calls, with an LRU eviction policy and a cache-aware scheduler that groups requests with shared prefixes to raise the hit rate.[2] For a self-hosted engine, verify that reuse is enabled, scoped correctly, and measurable before designing prompt layouts around hits.
Many engines cache full KV blocks, not arbitrary partial tokens. vLLM's design docs explain that each full block is hashed with its parent hash, the block's token IDs, and extra fields such as LoRA IDs or multimodal input hashes.[1] That's why a tiny change early in the prompt can invalidate everything after it: the parent hash changes, so later blocks no longer match. vLLM also lets a request supply a cache_salt that's mixed into the first block hash, so only requests carrying the same salt can reuse those blocks.[1] This is one mechanism for tenant-aware reuse boundaries.
```python
from hashlib import sha256

def chain(blocks: list[str]) -> list[str]:
    parent = "root"
    hashes = []
    for block in blocks:
        parent = sha256(f"{parent}|{block}".encode()).hexdigest()[:8]
        hashes.append(parent)
    return hashes

original = chain(["system", "repo guide v1", "tools"])
edited = chain(["system", "repo guide v2", "tools"])

for index, (left, right) in enumerate(zip(original, edited)):
    print(f"block {index}: {'hit' if left == right else 'miss'}")
```

Output:

```text
block 0: hit
block 1: miss
block 2: miss
```

Cache hits require stable prefixes. Put shared content first and variable content last.
Good shape:

```text
system: You are the repository maintenance assistant.
repo_guide: <shared repository guide>
tools: <shared tool schema>
examples: <shared examples>
user: Why is this test failing?
```

Poor shape:

```text
user: Why is this test failing?
system: You are the repository maintenance assistant.
repo_guide: <shared repository guide>
tools: <shared tool schema>
```

In the poor shape, the first tokens change with every user question. The shared policy appears later, after the mismatch. Many caching systems match from the prefix, so the cache opportunity disappears.
That's one reason agent prompts should keep dynamic tool results and user-specific state after stable instructions when possible.
```python
def shared_prefix_tokens(left: list[str], right: list[str]) -> int:
    count = 0
    for a, b in zip(left, right):
        if a != b:
            break
        count += 1
    return count

stable_first_a = ["system", "repo guide", "tools", "test failure question"]
stable_first_b = ["system", "repo guide", "tools", "lint failure question"]
variable_first_a = ["test failure question", "system", "repo guide", "tools"]
variable_first_b = ["lint failure question", "system", "repo guide", "tools"]

print("stable-first reusable units:", shared_prefix_tokens(stable_first_a, stable_first_b))
print("variable-first reusable units:", shared_prefix_tokens(variable_first_a, variable_first_b))
```

Output:

```text
stable-first reusable units: 3
variable-first reusable units: 0
```
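One way to enforce the stable-first shape is to tag each prompt section and let a small renderer do the ordering. The sketch below is illustrative only: the `Section` type, its `stable` flag, and `render` are hypothetical names, and it assumes reordering sections does not change what the model is being asked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Section:
    name: str
    text: str
    stable: bool  # True when the text is identical across requests


def render(sections: list[Section]) -> list[str]:
    """Order stable sections first, keeping relative order within each group."""
    # sorted() is a stable sort, so sections with the same flag keep their order.
    ordered = sorted(sections, key=lambda section: not section.stable)
    return [f"{section.name}: {section.text}" for section in ordered]


request = [
    Section("user", "Why is this test failing?", stable=False),
    Section("system", "You are the repository maintenance assistant.", stable=True),
    Section("tools", "<shared tool schema>", stable=True),
]

for line in render(request):
    print(line)
```

Centralizing the ordering in one function means a new call site cannot accidentally place request-specific text ahead of the shared prefix.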
Hosted providers expose prompt caching differently. OpenAI documents automatic prompt caching for repeated prompt prefixes and reports cached tokens in usage fields.[3] Anthropic documents cache_control at the top level for automatically selected prefix caching and on content blocks for explicit breakpoints.[4]
Don't assume provider caches behave like your local runtime. Check:
| Question | Why it matters |
|---|---|
| Is caching automatic or explicit? | Determines prompt construction |
| What is the minimum cacheable prefix? | Short prompts may never hit |
| How long does the cache live? | Affects traffic batching |
| Is cache scoped by org, project, or region? | Affects privacy and hit rate |
| Where are cached tokens reported? | Needed for cost measurement |
OpenAI's current docs say caching is available for prompts of 1,024 tokens or more and report cached_tokens inside usage.prompt_tokens_details, including zero for shorter prompts.[3] OpenAI's caching is automatic, has no separate cache-write fee, and lets you pass an optional prompt_cache_key to improve routing affinity for shared prefixes. Supported models can also offer prompt_cache_retention choices such as in_memory or 24h, so check the selected model before depending on cache lifetime.[3] Anthropic exposes cache_creation_input_tokens, cache_read_input_tokens, and input_tokens so you can separate cache writes, reads, and uncached suffix tokens.[4]
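Because the two providers report cache activity under different field names, it can help to normalize them before logging. This is a minimal sketch that assumes the usage payload has already been converted to a plain dict; real SDKs may return objects instead. The `normalize_usage` helper and its output keys are hypothetical.

```python
def normalize_usage(provider: str, usage: dict) -> dict:
    """Map provider usage fields to cache_read / cache_write / uncached counts."""
    if provider == "openai":
        details = usage.get("prompt_tokens_details") or {}
        cached = details.get("cached_tokens", 0)
        # OpenAI reports cached tokens as a subset of prompt_tokens.
        return {
            "cache_read": cached,
            "cache_write": 0,  # no separately reported write count
            "uncached": usage["prompt_tokens"] - cached,
        }
    if provider == "anthropic":
        # Anthropic reports writes, reads, and the uncached suffix separately.
        return {
            "cache_read": usage.get("cache_read_input_tokens", 0),
            "cache_write": usage.get("cache_creation_input_tokens", 0),
            "uncached": usage.get("input_tokens", 0),
        }
    raise ValueError(f"unknown provider: {provider}")


print(normalize_usage("openai", {
    "prompt_tokens": 2006,
    "prompt_tokens_details": {"cached_tokens": 1920},
}))
print(normalize_usage("anthropic", {
    "input_tokens": 86,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 1920,
}))
```

With a shared shape, the same dashboards and alerts can cover both providers, and a zero in `cache_read` stands out regardless of where the request was sent.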
The two pricing models differ, which changes the break-even math. OpenAI cache hits are billed at a reduced input rate with no separate write charge. Anthropic documents a write premium and then a read discount: a 5-minute cache write costs 1.25 times the base input price, a 1-hour write costs 2 times, and a cache read costs 0.1 times.[4] Under those published ratios, a cached prefix repays the extra write cost after one read at the 5-minute tier or two reads at the 1-hour tier. Anthropic also uses model-specific minimum cacheable prefix lengths; its current active-model table spans 1,024 to 4,096 tokens.[4] Below the selected model's minimum, cache fields can stay zero and you pay normal input cost. Model and pricing behavior change over time, so production code should inspect usage fields rather than assume a fixed discount.
Anthropic's explicit breakpoints add one more boundary to remember: each breakpoint checks at most 20 preceding content blocks for reusable content.[4] If a prompt contains more than 20 content blocks before a breakpoint, add another cache_control breakpoint before that lookback window so older reusable content stays discoverable. Top-level automatic caching avoids manual breakpoint placement for many common conversation shapes.
```python
response_usage = {
    "prompt_tokens": 2006,
    "prompt_tokens_details": {"cached_tokens": 1920},
}

cached = response_usage["prompt_tokens_details"]["cached_tokens"]
uncached = response_usage["prompt_tokens"] - cached
print(f"cached prompt tokens: {cached}")
print(f"uncached prompt tokens: {uncached}")
assert 0 <= cached <= response_usage["prompt_tokens"]
```

Output:

```text
cached prompt tokens: 1920
uncached prompt tokens: 86
```

```python
from math import ceil

read_rate = 0.1
for ttl, write_rate in [("5m", 1.25), ("1h", 2.0)]:
    extra_write_cost = write_rate - 1.0
    savings_per_read = 1.0 - read_rate
    reads_needed = ceil(extra_write_cost / savings_per_read)
    print(f"{ttl} cache: {reads_needed} reuse request(s) to repay write premium")
```

Output:

```text
5m cache: 1 reuse request(s) to repay write premium
1h cache: 2 reuse request(s) to repay write premium
```
In a sharded self-hosted deployment, a prefix-cache entry generally resides on the worker that computed it. If a load balancer spreads requests with the same prefix across many replicas using round-robin, each replica may build its own copy and hit rate falls. Cache-aware routing sends matching prefixes to the same worker while it remains healthy, so the cache can stay warm where it was built. SGLang's scheduler groups shared-prefix requests,[2] and OpenAI documents prompt_cache_key as a routing-affinity control.[3] Treat affinity as a hint, not an unconditional pin: OpenAI's docs recommend adding more prompt_cache_key values when one shared prefix-key combination exceeds roughly 15 requests per minute. Pin too hard and a popular worker can overload.
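One possible way to treat affinity as a hint is to prefer the hashed worker but spill to the next one when it is busy. The sketch below is an assumption-laden illustration, not how any particular router works: `MAX_IN_FLIGHT`, `pick_worker`, and the worker names are hypothetical.

```python
from hashlib import sha256

WORKERS = ["gpu-a", "gpu-b", "gpu-c"]
MAX_IN_FLIGHT = 2  # hypothetical per-worker cap on concurrent requests


def pick_worker(stable_prefix: str, in_flight: dict[str, int]) -> str:
    """Prefer the worker the prefix hashes to, but spill over when it is full."""
    digest = int(sha256(stable_prefix.encode()).hexdigest(), 16)
    start = digest % len(WORKERS)
    # Walk the ring from the preferred worker until one has spare capacity.
    for offset in range(len(WORKERS)):
        candidate = WORKERS[(start + offset) % len(WORKERS)]
        if in_flight.get(candidate, 0) < MAX_IN_FLIGHT:
            return candidate
    # Every worker is saturated: fall back to the least-loaded one.
    return min(WORKERS, key=lambda worker: in_flight.get(worker, 0))


load = {worker: 0 for worker in WORKERS}
assignments = []
for _ in range(4):
    worker = pick_worker("access-policy-v7|tool-schema-v2", load)
    load[worker] += 1
    assignments.append(worker)

print("assignments:", assignments)
```

The first two requests land on the preferred worker and keep its cache warm; the next two spill to a neighbor, which builds a second warm copy instead of overloading the first.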
```python
from hashlib import sha256

workers = ["gpu-a", "gpu-b", "gpu-c"]
prefix = "access-policy-v7|tool-schema-v2"

def affinity_worker(stable_prefix: str) -> str:
    digest = int(sha256(stable_prefix.encode()).hexdigest(), 16)
    return workers[digest % len(workers)]

round_robin = [workers[i % len(workers)] for i in range(4)]
affinity = [affinity_worker(prefix) for _ in range(4)]
print("round-robin workers:", round_robin)
print("affinity workers:", affinity)
```

Output:

```text
round-robin workers: ['gpu-a', 'gpu-b', 'gpu-c', 'gpu-a']
affinity workers: ['gpu-c', 'gpu-c', 'gpu-c', 'gpu-c']
```

Prefix caching doesn't make generation free. It reduces repeated prefill work for shared input tokens. Output tokens still require decoding. If your cost is dominated by long generated answers, prefix caching helps less.
That's why serving teams watch time to first token (TTFT) and inter-token latency (ITL) separately. Prefix hits should lower TTFT because repeated prefill gets shorter. ITL often stays about the same because the model still has to decode each new output token.
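Both metrics can be derived from per-token arrival timestamps. The following is a minimal sketch with made-up timestamps in milliseconds; `ttft_and_itl` is a hypothetical helper, and it reports the mean gap between tokens as ITL.

```python
def ttft_and_itl(request_start: float, token_times: list[float]) -> tuple[float, float]:
    """Return (TTFT, mean ITL) in the same unit as the timestamps."""
    if not token_times:
        raise ValueError("need at least one output token")
    ttft = token_times[0] - request_start
    # Gaps between consecutive output tokens.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, itl


# Hypothetical timestamps: the warm request starts decoding sooner,
# but tokens then arrive at the same pace.
cold = ttft_and_itl(0, [410, 430, 450, 470])
warm = ttft_and_itl(0, [95, 115, 135, 155])
print("cold (TTFT, ITL):", cold)
print("warm (TTFT, ITL):", warm)
```

In this made-up data the prefix hit moves only the first number, which is the pattern to look for in real traces.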
Prefix caching also doesn't understand semantic similarity. These two prompts may mean the same thing, but they don't share an exact token prefix:

```text
Why did this test fail after the refactor?
What broke in the failing spec after the code change?
```

That's semantic caching territory. Semantic caching reuses previous final answers for similar requests, which changes product behavior and needs safety checks. Prefix caching reuses computation, not answers.
```python
uncached = {"prefill_ms": 410, "decode_ms": 980}
cached = {"prefill_ms": 95, "decode_ms": 972}

ttft_saved = uncached["prefill_ms"] - cached["prefill_ms"]
decode_change = uncached["decode_ms"] - cached["decode_ms"]
print(f"prefill/TTFT saved: {ttft_saved} ms")
print(f"decode change: {decode_change} ms")
assert ttft_saved > decode_change
```

Output:

```text
prefill/TTFT saved: 315 ms
decode change: 8 ms
```
Track cache hits like any other serving metric.
For local runtimes, log prefix-cache hit rate, reused token count, prefill latency, decode latency, and GPU memory. For hosted APIs, log cached token fields from the response. In both cases, slice by route. A code-assistant route with a long stable repository guide should have a higher cache hit rate than a free-form chat route.
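Slicing by route can be as simple as summing cached and total prompt tokens per route. This sketch assumes each log row carries `route`, `prompt_tokens`, and `cached_tokens` fields; the rows and the `cached_ratio_by_route` helper are illustrative.

```python
from collections import defaultdict


def cached_ratio_by_route(rows: list[dict]) -> dict[str, float]:
    """Return cached_tokens / prompt_tokens per route, token-weighted."""
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # route -> [cached, prompt]
    for row in rows:
        totals[row["route"]][0] += row["cached_tokens"]
        totals[row["route"]][1] += row["prompt_tokens"]
    return {
        route: (cached / prompt if prompt else 0.0)
        for route, (cached, prompt) in totals.items()
    }


# Hypothetical rows for two routes.
rows = [
    {"route": "repo-guide-rag", "prompt_tokens": 12200, "cached_tokens": 9000},
    {"route": "repo-guide-rag", "prompt_tokens": 11800, "cached_tokens": 9000},
    {"route": "free-chat", "prompt_tokens": 900, "cached_tokens": 0},
]
ratios = cached_ratio_by_route(rows)
for route, ratio in ratios.items():
    print(f"{route}: {ratio:.2f}")
```

Weighting by tokens rather than by request count keeps a handful of short requests from hiding a miss on a long, expensive prefix.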
Example log:
```json
{
  "route": "repo-guide-rag",
  "prompt_tokens": 12200,
  "cached_tokens": 9000,
  "prefill_ms": 430,
  "decode_ms": 1180,
  "cache_version_id": "access_policy_2026_05_12"
}
```

Version the static prefix in telemetry and, where your cache-key design permits, in the prefix itself. If the policy text changes, exact-prefix matching already forces a miss after the first changed token. A version label makes that expected warmup miss visible and can intentionally invalidate reuse when a behavior-changing input is outside the hashed prompt.
```python
events = [
    {"version": "policy-v6", "cached_tokens": 6000},
    {"version": "policy-v7", "cached_tokens": 0},
    {"version": "policy-v7", "cached_tokens": 6000},
]

for version in sorted({event["version"] for event in events}):
    rows = [event for event in events if event["version"] == version]
    hits = sum(event["cached_tokens"] > 0 for event in rows)
    print(f"{version}: {hits}/{len(rows)} requests hit")
```

Output:

```text
policy-v6: 1/1 requests hit
policy-v7: 1/2 requests hit
```

Cache reuse is safe only inside a declared scope. Public tool schemas or public policy text may be reusable broadly; tenant-specific instructions, private documents, adapter state, and authorization context must not become cross-tenant reusable state by accident. A local engine can incorporate a tenant salt or equivalent scope into its cache key, while a hosted API requires the provider's documented isolation guarantees and your own request design.
```python
from hashlib import sha256

def cache_key(tenant: str, prefix_version: str, tokens: str) -> str:
    material = f"{tenant}|{prefix_version}|{tokens}"
    return sha256(material.encode()).hexdigest()[:12]

prefix = "system|private-policy|tools"
alpha = cache_key("tenant-alpha", "v7", prefix)
beta = cache_key("tenant-beta", "v7", prefix)
print("same tokens, scoped keys differ:", alpha != beta)
assert alpha != beta
```

Output:

```text
same tokens, scoped keys differ: True
```

Prefix caching rewards prompt discipline: stable instructions, policy text, schemas, and examples should come first, while user-specific facts, retrieved snippets, and tool results belong later. Measure hit rate, but treat cached-token savings as an optimization rather than permission to send uncontrolled context forever.
Suppose the shared prefix for a repository assistant is 6,000 tokens:
```text
system instructions: 600 tokens
repo guide: 4,200 tokens
tool schema: 800 tokens
few-shot examples: 400 tokens
```

Three developers then ask different questions. If the prefix is identical, the runtime can reuse those 6,000 prefix tokens for the second and third requests. It still has to process each user's question and decode each answer, but it avoids repeating most of the prefill work.
Now change one thing: put request_id: R-18492 near the top of the prompt before the shared guide. The first tokens now differ per request. A prefix matcher may miss the whole shared guide even though the guide text itself is unchanged. That's why prompt shape isn't cosmetic. It controls whether the runtime can see the reusable prefix.
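A quick diagnostic for this failure is to diff two rendered prompts and report where they first diverge. The sketch below works on section strings rather than real tokens, and `first_divergence` plus the second request ID are hypothetical.

```python
from typing import Optional


def first_divergence(left: list[str], right: list[str]) -> Optional[int]:
    """Return the index of the first differing section, or None if none differ."""
    for index, (a, b) in enumerate(zip(left, right)):
        if a != b:
            return index
    return None


guide = "repo_guide: <shared repository guide>"
system = "system: You are the repository maintenance assistant."

# request_id ahead of the guide: divergence happens before the shared content.
bad_a = [system, "request_id: R-18492", guide, "user: question one"]
bad_b = [system, "request_id: R-18493", guide, "user: question two"]

# request_id after the guide: the guide stays inside the shared prefix.
good_a = [system, guide, "request_id: R-18492", "user: question one"]
good_b = [system, guide, "request_id: R-18493", "user: question two"]

print("request_id early, first divergence at section:", first_divergence(bad_a, bad_b))
print("request_id late, first divergence at section:", first_divergence(good_a, good_b))
```

Everything before the reported index is a reuse candidate. If the index points at a field that carries no meaning for the model, that field probably belongs later in the prompt or outside it.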
This accounting script makes the savings concrete:
```python
shared_prefix_tokens = 6000
question_tails = [48, 37, 51]

total_fresh_without_cache = 0
total_fresh_with_cache = 0

for request_index, tail_tokens in enumerate(question_tails, start=1):
    fresh_without_cache = shared_prefix_tokens + tail_tokens
    fresh_with_cache = fresh_without_cache if request_index == 1 else tail_tokens
    reused_prefix = 0 if request_index == 1 else shared_prefix_tokens

    total_fresh_without_cache += fresh_without_cache
    total_fresh_with_cache += fresh_with_cache

    print(
        f"request {request_index}: reused_prefix={reused_prefix:4d} "
        f"fresh_tokens={fresh_with_cache:4d}"
    )

saved_tokens = total_fresh_without_cache - total_fresh_with_cache
print(f"fresh tokens without cache: {total_fresh_without_cache}")
print(f"fresh tokens with cache: {total_fresh_with_cache}")
print(f"prefix tokens saved: {saved_tokens}")
```

Output:

```text
request 1: reused_prefix=   0 fresh_tokens=6048
request 2: reused_prefix=6000 fresh_tokens=  37
request 3: reused_prefix=6000 fresh_tokens=  51
fresh tokens without cache: 18136
fresh tokens with cache: 6136
prefix tokens saved: 12000
```

Symptom: The code assistant keeps answering with last week's lint rule after the repository guide changed.
Cause to investigate first: The rendered prompt, retrieval source, rollout version, or custom cache key still supplied old guide state. In an exact-prefix cache, changed guide tokens don't match old cached KV blocks.
Fix: Log and inspect the rendered prompt plus its policy version. When the policy text changes, confirm cached-token counts drop for the changed prefix and warm again only for requests containing the new policy. Treat any custom key that ignores changed behavior inputs as a correctness bug.
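One way to spot stale guide state is to log a fingerprint of the rendered prefix next to its version label and flag labels that changed while the text did not. This is a sketch under assumptions: `prefix_log_record`, `stale_versions`, and the record fields are hypothetical names.

```python
from hashlib import sha256


def prefix_log_record(policy_version: str, rendered_prefix: str) -> dict:
    """Pair the version label with a short fingerprint of the rendered text."""
    fingerprint = sha256(rendered_prefix.encode()).hexdigest()[:12]
    return {"policy_version": policy_version, "prefix_sha256": fingerprint}


def stale_versions(records: list[dict]) -> list[str]:
    """Return version labels whose fingerprint was first seen under another label."""
    first_seen: dict[str, str] = {}  # fingerprint -> first version that produced it
    stale = []
    for record in records:
        first = first_seen.setdefault(record["prefix_sha256"], record["policy_version"])
        if first != record["policy_version"]:
            stale.append(record["policy_version"])
    return stale


records = [
    prefix_log_record("v6", "system|keys rotate after 90 days|tools"),
    prefix_log_record("v7", "system|keys rotate after 90 days|tools"),  # label bumped, text stale
    prefix_log_record("v7", "system|keys rotate after 30 days|tools"),
]
print("versions rendered with stale text:", stale_versions(records))
```

A flagged label points at the prompt renderer, retrieval source, or rollout rather than at the KV cache, which matches the investigation order above.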
```python
from hashlib import sha256

def exact_prefix_key(policy: str) -> str:
    return sha256(f"system|{policy}|tools".encode()).hexdigest()

before = exact_prefix_key("keys rotate after 90 days")
after = exact_prefix_key("keys rotate after 30 days")
print("changed policy invalidates exact-prefix key:", before != after)
assert before != after
```

Output:

```text
changed policy invalidates exact-prefix key: True
```

Symptom: Placing request_id or retrieved snippets near the top of the prompt tanks hit rate.
Cause: The first mismatch happens before the reusable instructions and guide.
Fix: Put shared instructions, guide, schemas, and examples first, then push request-specific state later.