Prefix Caching and Prompt Caching

Structure exact reusable prefixes, validate cache hits from usage fields, and enforce invalidation and tenant-isolation boundaries.

PagedAttention packs live KV blocks inside a serving engine. Prefix caching asks the next question: what if the same prompt prefix appears again in later requests?

The key-value (KV) cache inside one request saves work while the model generates the next token. Prefix caching saves work across requests that begin with the same tokens.

That distinction matters for long-context products. Suppose a coding assistant receives the same repository guidelines, architecture notes, tool schema, and safety instructions on every request. Without prefix caching, the model pays the prefill cost again and again. With compatible prefix caching enabled, the runtime can reuse computed KV state for the shared prefix, then process only the new user question.

Figure: Cold prefill computes all blocks. A hit reuses the shared prefix and computes only the tail. Early drift invalidates reuse.

Per-request KV cache

During generation, a decoder-only transformer stores keys and values for tokens it has already processed. When it predicts token 501, it doesn't recompute attention keys and values for tokens 1 through 500. That's the normal KV cache.

This cache belongs to the active sequence. It helps decode speed, but it doesn't automatically help the next request.
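
A toy counter can make the per-request saving concrete. This is only a sketch: it assumes one key/value projection per token per step and ignores layers, heads, and batching. The function name `kv_projections` is illustrative, not part of any engine.

```python
def kv_projections(prompt_len: int, new_tokens: int, use_cache: bool) -> int:
    """Count per-token key/value projections for one request (toy model)."""
    total = 0
    for step in range(new_tokens):
        seq_len = prompt_len + step  # tokens visible when predicting the next one
        if use_cache:
            # Prefill projects the whole prompt once; each later step projects
            # only the single newest token and appends it to the cache.
            total += prompt_len if step == 0 else 1
        else:
            # Without a cache, every step re-projects the full sequence so far.
            total += seq_len
    return total


with_cache = kv_projections(prompt_len=500, new_tokens=3, use_cache=True)
without_cache = kv_projections(prompt_len=500, new_tokens=3, use_cache=False)
print(f"with per-request KV cache: {with_cache} projections")
print(f"without it: {without_cache} projections")
```

The saving grows with output length, but it ends when the request ends. Nothing in this counter carries over to a second request.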

Prefix caching is different. It asks: if request B starts with the exact same token prefix as request A, can the runtime reuse the already-computed KV blocks for that prefix?

vLLM calls this Automatic Prefix Caching. Its current documentation enables the feature with enable_prefix_caching=True; it's a serving configuration choice, not a behavior to assume from PagedAttention alone.[1] SGLang's RadixAttention uses a radix tree to reuse KV cache state across structured generation calls, with an LRU eviction policy and a cache-aware scheduler that groups requests with shared prefixes to raise the hit rate.[2] For a self-hosted engine, verify that reuse is enabled, scoped correctly, and measurable before designing prompt layouts around hits.

Many engines cache full KV blocks rather than arbitrary token spans, so a trailing partial block isn't reusable. vLLM's design docs explain that each full block is hashed with its parent hash, the block's token IDs, and extra fields such as LoRA IDs or multimodal input hashes.[1] That's why a tiny change early in the prompt can invalidate everything after it: the parent hash changes, so later blocks no longer match. vLLM also lets a request supply a cache_salt that's mixed into the first block hash, so only requests carrying the same salt can reuse those blocks.[1] This is one mechanism for tenant-aware reuse boundaries.

Figure: Both requests reuse block 0. Editing block 1 changes its fingerprint, then block 2 receives a different cumulative key even though its visible `tools` text is unchanged.

trace-prefix-block-hashes.py

from hashlib import sha256

def chain(blocks: list[str]) -> list[str]:
    parent = "root"
    hashes = []
    for block in blocks:
        parent = sha256(f"{parent}|{block}".encode()).hexdigest()[:8]
        hashes.append(parent)
    return hashes

original = chain(["system", "repo guide v1", "tools"])
edited = chain(["system", "repo guide v2", "tools"])

for index, (left, right) in enumerate(zip(original, edited)):
    print(f"block {index}: {'hit' if left == right else 'miss'}")

Output

block 0: hit
block 1: miss
block 2: miss
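
The trace above hashes one string per block. A slightly closer sketch splits token IDs into fixed-size blocks, hashes only the full ones, and mixes an optional salt into the first block. It is not vLLM's implementation: `BLOCK_SIZE = 4`, `block_hashes`, and `reusable_blocks` are assumptions chosen for readability.

```python
from hashlib import sha256

BLOCK_SIZE = 4  # assumed for readability; real engines pick their own block size


def block_hashes(token_ids: list[int], salt: str = "") -> list[str]:
    """Hash each full block together with its parent hash; skip the partial tail."""
    hashes: list[str] = []
    parent = "root"
    for index in range(len(token_ids) // BLOCK_SIZE):
        block = token_ids[index * BLOCK_SIZE:(index + 1) * BLOCK_SIZE]
        # The salt enters the first block only; later blocks inherit it
        # through the parent hash.
        extra = salt if index == 0 else ""
        parent = sha256(f"{parent}|{block}|{extra}".encode()).hexdigest()[:8]
        hashes.append(parent)
    return hashes


def reusable_blocks(cached: list[str], incoming: list[str]) -> int:
    """Count leading blocks whose cumulative hashes match."""
    count = 0
    for left, right in zip(cached, incoming):
        if left != right:
            break
        count += 1
    return count


shared = list(range(10))          # 10 shared tokens: 2 full blocks plus 2 leftover
request_a = shared + [100, 101]   # 12 tokens: 3 full blocks
request_b = shared + [200, 201]   # same shared tokens, different tail

unsalted = reusable_blocks(block_hashes(request_a), block_hashes(request_b))
cross_salt = reusable_blocks(
    block_hashes(request_a, salt="tenant-a"),
    block_hashes(request_a, salt="tenant-b"),
)
print("reusable blocks, same scope:", unsalted)
print("reusable blocks, different salts:", cross_salt)
```

Only 8 of the 10 shared tokens are reusable here, because the last two share a block with the differing tail. With different salts, even identical token sequences share nothing.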

Prompt shape

Cache hits require stable prefixes. Put shared content first and variable content last.

Good shape:

system: You are the repository maintenance assistant.
repo_guide: <shared repository guide>
tools: <shared tool schema>
examples: <shared examples>
user: Why is this test failing?

Poor shape:

user: Why is this test failing?
system: You are the repository maintenance assistant.
repo_guide: <shared repository guide>
tools: <shared tool schema>

In the poor shape, the first tokens change with every user question. The shared instructions, guide, and tool schema appear only after that mismatch. Many caching systems match from the start of the prompt, so the cache opportunity disappears.

That's one reason agent prompts should keep dynamic tool results and user-specific state after stable instructions when possible.

measure-reusable-prefix.py

def shared_prefix_tokens(left: list[str], right: list[str]) -> int:
    count = 0
    for a, b in zip(left, right):
        if a != b:
            break
        count += 1
    return count

stable_first_a = ["system", "repo guide", "tools", "test failure question"]
stable_first_b = ["system", "repo guide", "tools", "lint failure question"]
variable_first_a = ["test failure question", "system", "repo guide", "tools"]
variable_first_b = ["lint failure question", "system", "repo guide", "tools"]

print("stable-first reusable units:", shared_prefix_tokens(stable_first_a, stable_first_b))
print("variable-first reusable units:", shared_prefix_tokens(variable_first_a, variable_first_b))

Output

stable-first reusable units: 3
variable-first reusable units: 0

Figure: Stable-first prompts expose a long reusable prefix. Variable-first prompts hide the shared policy behind the first mismatch and usually miss.

Hosted API prompt caching

Hosted providers expose prompt caching differently. OpenAI documents automatic prompt caching for repeated prompt prefixes and reports cached tokens in usage fields.[3] Anthropic documents cache_control at the top level for automatically selected prefix caching and on content blocks for explicit breakpoints.[4]

Don't assume provider caches behave like your local runtime. Check:

Question | Why it matters
Is caching automatic or explicit? | Determines prompt construction
What is the minimum cacheable prefix? | Short prompts may never hit
How long does the cache live? | Affects traffic batching
Is cache scoped by org, project, or region? | Affects privacy and hit rate
Where are cached tokens reported? | Needed for cost measurement

OpenAI's current docs say caching is available for prompts of 1,024 tokens or more and report cached_tokens inside usage.prompt_tokens_details, including zero for shorter prompts.[3] OpenAI's caching is automatic, has no separate cache-write fee, and lets you pass an optional prompt_cache_key to improve routing affinity for shared prefixes. Supported models can also offer prompt_cache_retention choices such as in_memory or 24h, so check the selected model before depending on cache lifetime.[3] Anthropic exposes cache_creation_input_tokens, cache_read_input_tokens, and input_tokens so you can separate cache writes, reads, and uncached suffix tokens.[4]

The two pricing models differ, which changes the break-even math. OpenAI cache hits are billed at a reduced input rate with no separate write charge. Anthropic documents a write premium and then a read discount: a 5-minute cache write costs 1.25 times the base input price, a 1-hour write costs 2 times, and a cache read costs 0.1 times.[4] Under those published ratios, a cached prefix repays the extra write cost after one read at the 5-minute tier or two reads at the 1-hour tier. Anthropic also uses model-specific minimum cacheable prefix lengths; its current active-model table spans 1,024 to 4,096 tokens.[4] Below the selected model's minimum, cache fields can stay zero and you pay normal input cost. Model and pricing behavior change over time, so production code should inspect usage fields rather than assume a fixed discount.

Anthropic's explicit breakpoints add one more boundary to remember: each breakpoint checks at most 20 preceding content blocks for reusable content.[4] If a prompt contains more than 20 content blocks before a breakpoint, add another cache_control breakpoint before that lookback window so older reusable content stays discoverable. Top-level automatic caching avoids manual breakpoint placement for many common conversation shapes.

read-openai-cached-token-usage.py

response_usage = {
    "prompt_tokens": 2006,
    "prompt_tokens_details": {"cached_tokens": 1920},
}

cached = response_usage["prompt_tokens_details"]["cached_tokens"]
uncached = response_usage["prompt_tokens"] - cached
print(f"cached prompt tokens: {cached}")
print(f"uncached prompt tokens: {uncached}")
assert 0 <= cached <= response_usage["prompt_tokens"]

Output

cached prompt tokens: 1920
uncached prompt tokens: 86

compute-anthropic-read-break-even.py

from math import ceil

read_rate = 0.1
for ttl, write_rate in [("5m", 1.25), ("1h", 2.0)]:
    extra_write_cost = write_rate - 1.0
    savings_per_read = 1.0 - read_rate
    reads_needed = ceil(extra_write_cost / savings_per_read)
    print(f"{ttl} cache: {reads_needed} reuse request(s) to repay write premium")

Output

5m cache: 1 reuse request(s) to repay write premium
1h cache: 2 reuse request(s) to repay write premium
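
One way to compare providers is to normalize their usage fields into cache reads, cache writes, and uncached input before any pricing is applied. The sketch below works on plain dictionaries shaped like the usage fields named above; `normalize_usage` and `input_cost_units` are hypothetical helpers, and the rates are passed in by the caller because published ratios can change.

```python
def normalize_usage(provider: str, usage: dict) -> dict[str, int]:
    """Map provider usage fields to read / write / uncached input tokens."""
    if provider == "openai":
        details = usage.get("prompt_tokens_details") or {}
        cached = details.get("cached_tokens", 0)
        # prompt_tokens includes the cached portion, so subtract it.
        # No cache-write field is read here, so writes are recorded as zero.
        return {"read": cached, "write": 0, "uncached": usage["prompt_tokens"] - cached}
    if provider == "anthropic":
        # input_tokens is treated as the uncached remainder, as described above.
        return {
            "read": usage.get("cache_read_input_tokens", 0),
            "write": usage.get("cache_creation_input_tokens", 0),
            "uncached": usage.get("input_tokens", 0),
        }
    raise ValueError(f"unknown provider: {provider}")


def input_cost_units(tokens: dict[str, int], read_rate: float, write_rate: float) -> float:
    """Input cost expressed in units of one base-priced input token."""
    return tokens["uncached"] + tokens["read"] * read_rate + tokens["write"] * write_rate


cold = normalize_usage(
    "anthropic",
    {"cache_creation_input_tokens": 6000, "cache_read_input_tokens": 0, "input_tokens": 48},
)
warm = normalize_usage(
    "anthropic",
    {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 6000, "input_tokens": 37},
)
openai_hit = normalize_usage(
    "openai",
    {"prompt_tokens": 2006, "prompt_tokens_details": {"cached_tokens": 1920}},
)

# Rates below are the 5-minute write and read ratios quoted in the text.
print(f"cold request: {input_cost_units(cold, read_rate=0.1, write_rate=1.25):.1f} units")
print(f"warm request: {input_cost_units(warm, read_rate=0.1, write_rate=1.25):.1f} units")
print("openai split:", openai_hit)
```

Keeping the normalization separate from the rates means a pricing change touches one call site, while the per-request token split still comes from what the provider actually returned.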

Figure: Hosted APIs prove cache behavior through usage fields. Compare provider field names, then log cache reads, cache writes, and uncached suffix tokens per route.

Cache-aware routing

In a multi-replica self-hosted deployment, a prefix-cache entry generally resides on the worker that computed it. If a load balancer spreads requests with the same prefix across replicas round-robin, each replica may build its own copy and the hit rate falls. Cache-aware routing sends matching prefixes to the same worker while it remains healthy, so the cache stays warm where it was built. SGLang's scheduler groups shared-prefix requests,[2] and OpenAI documents prompt_cache_key as a routing-affinity control.[3] Treat affinity as a hint, not an unconditional pin: OpenAI's docs recommend adding more prompt_cache_key values when one shared prefix-key combination exceeds roughly 15 requests per minute.[3] Pin too hard and a popular worker can overload.

route-stable-prefixes-together.py

from hashlib import sha256

workers = ["gpu-a", "gpu-b", "gpu-c"]
prefix = "access-policy-v7|tool-schema-v2"

def affinity_worker(stable_prefix: str) -> str:
    digest = int(sha256(stable_prefix.encode()).hexdigest(), 16)
    return workers[digest % len(workers)]

round_robin = [workers[i % len(workers)] for i in range(4)]
affinity = [affinity_worker(prefix) for _ in range(4)]
print("round-robin workers:", round_robin)
print("affinity workers:", affinity)

Output

round-robin workers: ['gpu-a', 'gpu-b', 'gpu-c', 'gpu-a']
affinity workers: ['gpu-c', 'gpu-c', 'gpu-c', 'gpu-c']
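
Treating affinity as a hint can be sketched as a preferred worker plus a spill rule. The per-worker cap `MAX_IN_FLIGHT`, the `pick_worker` helper, and the least-loaded fallback are all assumptions for illustration; a real router would also track health and cache occupancy.

```python
from hashlib import sha256

WORKERS = ["gpu-a", "gpu-b", "gpu-c"]
MAX_IN_FLIGHT = 2  # hypothetical per-worker cap before affinity is overridden


def pick_worker(stable_prefix: str, in_flight: dict[str, int]) -> str:
    """Prefer the prefix's home worker, but spill when it is saturated."""
    digest = int(sha256(stable_prefix.encode()).hexdigest(), 16)
    preferred = WORKERS[digest % len(WORKERS)]
    if in_flight[preferred] < MAX_IN_FLIGHT:
        return preferred
    # Home worker is full: accept a likely cache miss on the least-loaded worker.
    return min(WORKERS, key=lambda worker: in_flight[worker])


in_flight = {worker: 0 for worker in WORKERS}
assignments = []
for _ in range(4):
    worker = pick_worker("access-policy-v7|tool-schema-v2", in_flight)
    in_flight[worker] += 1
    assignments.append(worker)

print("assignments:", assignments)
print("in-flight per worker:", in_flight)
```

The first two requests land on the prefix's home worker and can share its cache. Later ones spill once the cap is reached, trading a probable miss for bounded load.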

What prefix caching doesn't do

Prefix caching doesn't make generation free. It reduces repeated prefill work for shared input tokens. Output tokens still require decoding. If your cost is dominated by long generated answers, prefix caching helps less.

That's why serving teams watch time to first token (TTFT) and inter-token latency (ITL) separately. Prefix hits should lower TTFT because repeated prefill gets shorter. ITL often stays about the same because the model still has to decode each new output token.

It also doesn't understand semantic similarity. These two prompts may mean the same thing, but they don't share an exact token prefix:

Why did this test fail after the refactor?
What broke in the failing spec after the code change?

That's semantic caching territory. Semantic caching reuses previous final answers for similar requests, which changes product behavior and needs safety checks. Prefix caching reuses computation, not answers.

separate-ttft-from-decode-gains.py

uncached = {"prefill_ms": 410, "decode_ms": 980}
cached = {"prefill_ms": 95, "decode_ms": 972}

ttft_saved = uncached["prefill_ms"] - cached["prefill_ms"]
decode_change = uncached["decode_ms"] - cached["decode_ms"]
print(f"prefill/TTFT saved: {ttft_saved} ms")
print(f"decode change: {decode_change} ms")
assert ttft_saved > decode_change

Output

prefill/TTFT saved: 315 ms
decode change: 8 ms

Figure: Prefix caching mostly reduces repeated prefill and TTFT. It doesn't make output decoding free, so gains shrink when long answers dominate latency.

Instrumentation

Track cache hits like any other serving metric.

For local runtimes, log prefix-cache hit rate, reused token count, prefill latency, decode latency, and GPU memory. For hosted APIs, log cached token fields from the response. In both cases, slice by route. A code-assistant route with a long stable repository guide should have a higher cache hit rate than a free-form chat route.

Example log:

instrumentation.json

{
  "route": "repo-guide-rag",
  "prompt_tokens": 12200,
  "cached_tokens": 9000,
  "prefill_ms": 430,
  "decode_ms": 1180,
  "cache_version_id": "access_policy_2026_05_12"
}
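
Slicing by route can be sketched as a small aggregation over log records shaped like the example above. The records here are made-up sample data and `cached_share_by_route` is a hypothetical helper; a production pipeline would more likely compute this in its metrics backend.

```python
from collections import defaultdict


def cached_share_by_route(events: list[dict]) -> dict[str, float]:
    """Return cached_tokens / prompt_tokens per route."""
    totals: dict[str, dict[str, int]] = defaultdict(lambda: {"prompt": 0, "cached": 0})
    for event in events:
        totals[event["route"]]["prompt"] += event["prompt_tokens"]
        totals[event["route"]]["cached"] += event["cached_tokens"]
    return {
        route: (sums["cached"] / sums["prompt"] if sums["prompt"] else 0.0)
        for route, sums in totals.items()
    }


# Illustrative records only; field names follow the example log.
sample_events = [
    {"route": "repo-guide-rag", "prompt_tokens": 12200, "cached_tokens": 9000},
    {"route": "repo-guide-rag", "prompt_tokens": 12150, "cached_tokens": 0},
    {"route": "free-chat", "prompt_tokens": 900, "cached_tokens": 0},
]

shares = cached_share_by_route(sample_events)
for route in sorted(shares):
    print(f"{route}: cached-token share {shares[route]:.1%}")
```

A token-weighted share is used instead of a simple hit count so that one small hit on a huge prompt doesn't look the same as a fully reused prefix.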

Version the static prefix in telemetry and, where your cache-key design permits, in the prefix itself. If the policy text changes, exact-prefix matching already forces a miss after the first changed token. A version label makes that expected warmup miss visible and can intentionally invalidate reuse when a behavior-changing input is outside the hashed prompt.

Figure: Good cache dashboards answer three questions: are hits happening, is prefill dropping, and did a version change explain a miss spike.

summarize-hits-by-prefix-version.py

events = [
    {"version": "policy-v6", "cached_tokens": 6000},
    {"version": "policy-v7", "cached_tokens": 0},
    {"version": "policy-v7", "cached_tokens": 6000},
]

for version in sorted({event["version"] for event in events}):
    rows = [event for event in events if event["version"] == version]
    hits = sum(event["cached_tokens"] > 0 for event in rows)
    print(f"{version}: {hits}/{len(rows)} requests hit")

Output

policy-v6: 1/1 requests hit
policy-v7: 1/2 requests hit

Isolation boundary

Cache reuse is safe only inside a declared scope. Public tool schemas or public policy text may be reusable broadly; tenant-specific instructions, private documents, adapter state, and authorization context must not become cross-tenant reusable state by accident. A local engine can incorporate a tenant salt or equivalent scope into its cache key, while a hosted API requires the provider's documented isolation guarantees and your own request design.

scope-prefix-keys-by-tenant.py

from hashlib import sha256

def cache_key(tenant: str, prefix_version: str, tokens: str) -> str:
    material = f"{tenant}|{prefix_version}|{tokens}"
    return sha256(material.encode()).hexdigest()[:12]

prefix = "system|private-policy|tools"
alpha = cache_key("tenant-alpha", "v7", prefix)
beta = cache_key("tenant-beta", "v7", prefix)
print("same tokens, scoped keys differ:", alpha != beta)
assert alpha != beta

Output

same tokens, scoped keys differ: True

Design rule

Prefix caching rewards prompt discipline: stable instructions, policy text, schemas, and examples should come first, while user-specific facts, retrieved snippets, and tool results belong later. Measure hit rate, but treat cached-token savings as an optimization rather than permission to send uncontrolled context forever.
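
The ordering rule can be enforced in the prompt builder instead of left to convention. This is a minimal sketch: `STABLE_ORDER`, `render_prompt`, and the section names are assumptions, and the check compares characters as a stand-in for token prefixes.

```python
from os.path import commonprefix

# Fixed section order for shared content; never reordered per request.
STABLE_ORDER = ["system", "repo_guide", "tools", "examples"]


def render_prompt(stable: dict[str, str], variable: dict[str, str]) -> str:
    """Emit stable sections first, then request-specific fields."""
    stable_lines = [f"{name}: {stable[name]}" for name in STABLE_ORDER]
    variable_lines = [f"{name}: {value}" for name, value in variable.items()]
    return "\n".join(stable_lines + variable_lines)


stable_sections = {
    "system": "You are the repository maintenance assistant.",
    "repo_guide": "<shared repository guide>",
    "tools": "<shared tool schema>",
    "examples": "<shared examples>",
}

first = render_prompt(
    stable_sections, {"request_id": "R-18492", "user": "Why is this test failing?"}
)
second = render_prompt(
    stable_sections, {"request_id": "R-18493", "user": "Why does lint fail?"}
)

stable_text = render_prompt(stable_sections, {})
shared = commonprefix([first, second])  # character-level common prefix
print("stable block fully shared:", shared.startswith(stable_text))
```

Because the builder owns the ordering, a new per-request field such as a request ID can only be appended after the stable block, which keeps the reusable prefix intact.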

Work a cache hit by hand

Suppose the shared prefix for a repository assistant is 6,000 tokens:

system instructions: 600 tokens
repo guide: 4,200 tokens
tool schema: 800 tokens
few-shot examples: 400 tokens

Three developers then ask different questions. If the prefix is identical, the runtime can reuse those 6,000 prefix tokens for the second and third requests. It still has to process each user's question and decode each answer, but it avoids repeating most of the prefill work.

Now change one thing: put request_id: R-18492 near the top of the prompt before the shared guide. The first tokens now differ per request. A prefix matcher may miss the whole shared guide even though the guide text itself is unchanged. That's why prompt shape isn't cosmetic. It controls whether the runtime can see the reusable prefix.

This accounting script makes the savings concrete:

prefix-cache-hit-calculator.py

shared_prefix_tokens = 6000
question_tails = [48, 37, 51]

total_fresh_without_cache = 0
total_fresh_with_cache = 0

for request_index, tail_tokens in enumerate(question_tails, start=1):
    fresh_without_cache = shared_prefix_tokens + tail_tokens
    fresh_with_cache = fresh_without_cache if request_index == 1 else tail_tokens
    reused_prefix = 0 if request_index == 1 else shared_prefix_tokens

    total_fresh_without_cache += fresh_without_cache
    total_fresh_with_cache += fresh_with_cache

    print(
        f"request {request_index}: reused_prefix={reused_prefix:4d} "
        f"fresh_tokens={fresh_with_cache:4d}"
    )

saved_tokens = total_fresh_without_cache - total_fresh_with_cache
print(f"fresh tokens without cache: {total_fresh_without_cache}")
print(f"fresh tokens with cache: {total_fresh_with_cache}")
print(f"prefix tokens saved: {saved_tokens}")

Output

request 1: reused_prefix=   0 fresh_tokens=6048
request 2: reused_prefix=6000 fresh_tokens=  37
request 3: reused_prefix=6000 fresh_tokens=  51
fresh tokens without cache: 18136
fresh tokens with cache: 6136
prefix tokens saved: 12000

Common mistake: blaming exact-prefix caching for stale guidance

  • Symptom: The code assistant keeps answering with last week's lint rule after the repository guide changed.

  • Cause to investigate first: The rendered prompt, retrieval source, rollout version, or custom cache key still supplied old guide state. In an exact-prefix cache, changed guide tokens don't match old cached KV blocks.

  • Fix: Log and inspect the rendered prompt plus its policy version. When the policy text changes, confirm cached-token counts drop for the changed prefix and warm again only for requests containing the new policy. Treat any custom key that ignores changed behavior inputs as a correctness bug.

require-miss-on-policy-change.py

from hashlib import sha256

def exact_prefix_key(policy: str) -> str:
    return sha256(f"system|{policy}|tools".encode()).hexdigest()

before = exact_prefix_key("keys rotate after 90 days")
after = exact_prefix_key("keys rotate after 30 days")
print("changed policy invalidates exact-prefix key:", before != after)
assert before != after

Output

changed policy invalidates exact-prefix key: True

Mastery check

Evaluation rubric

  • Foundational: Explains the difference between the normal per-request KV cache and prefix caching across repeated requests.
  • Foundational: Rearranges a prompt so stable instructions, policy text, schemas, and examples stay before user-specific state.
  • Intermediate: Explains why one early token change invalidates later reusable blocks through cumulative parent-hash matching.
  • Intermediate: Distinguishes prefix caching from semantic caching and names why semantic reuse has higher product risk.
  • Advanced: Uses usage fields, route slices, and prefill latency to prove that a production route is getting real cache hits.
  • Advanced: Defines versioning, routing, and tenant-isolation rules that expose prompt rollouts and prevent cross-tenant leakage.

Common pitfalls

  • Symptom: Two questions mean nearly the same thing, but cached-token counts stay at zero. Cause: Prefix caching matches exact token prefixes, not semantic similarity. Fix: Treat semantic reuse as a different system and keep the stable text byte- and token-identical at the start of the prompt.
  • Symptom: Moving request_id or retrieved snippets near the top tanks hit rate. Cause: The first mismatch happens before the reusable instructions and guide. Fix: Put shared instructions, guide, schemas, and examples first, then push request-specific state later.
  • Symptom: Shared caches improve latency, but privacy review can't prove one tenant never reused another tenant's prefix. Cause: Reusable prefixes were keyed only by tokens, not by tenant scope. Fix: Add tenant-scoped cache keys or salts, don't rely on routing alone for isolation, and log enough metadata to prove cache reads stayed inside the right boundary.
  • Symptom: Cached-token share is high, yet user-perceived latency barely improves. Cause: Decode dominates the route, so saved prefill is a small part of total time. Fix: Measure prefill and decode separately and use prefix caching where repeated input is large and answers aren't the main bottleneck.
  • Symptom: Teams page on every miss spike after a guide or prompt update. Cause: They expect hit rate to stay flat even when the cache key should change. Fix: Expect temporary warmup misses after a new prefix version goes live and verify recovery with the new version label in logs.
  • Symptom: Cost projections or alerts drift away from reality after a provider update. Cause: The system assumed old minimum lengths, lifetimes, or discounts instead of reading usage fields. Fix: Keep thresholds and pricing out of hardcoded mental models and inspect the fields the provider returns on each request.
Mastery Check

1. An engineer says: The model already has a KV cache, so later paraphrased test-failure questions can reuse the first request's work or answer. Which correction is accurate?
2. A repository assistant sends this content on every request: system instructions, a 6,000-token guide, a tool schema, few-shot examples, plus a different request_id, retrieved file snippets, and user question. Which prompt layout gives the prefix cache the largest safe reuse opportunity?
3. A block-based prefix cache has cached blocks for [system, access-policy-v1, tool-schema]. A later request uses [system, access-policy-v2, tool-schema], where the tool-schema text is unchanged. Why can the tool-schema block still miss?
4. Anthropic charges a write premium for a cacheable prefix and then a discounted read: 1.25x base input price to write a 5-minute entry, 2.0x to write a 1-hour entry, and 0.1x for each cache read. Compared with sending the prefix uncached every time, how many later cache reads repay the extra write premium?
5. A repeated Anthropic prompt with an explicit cache breakpoint still reports zero cache fields. Which investigation follows the documented qualification rules?
6. A team needs one savings estimator for hosted prompt caching across OpenAI and Anthropic, whose cache fields, minimums, lifetimes, and pricing differ by provider and model. Which implementation measures actual cache reads and writes instead of assuming a fixed discount?
7. A route's cache dashboard shows high hit counts, but when sliced to that route and model, cached-token volume is small and prefill p50 and TTFT are unchanged while decode time dominates total latency. What should the team conclude?
8. A self-hosted service has three GPU workers. Tenant A sends many requests with the same private policy prefix, but round-robin routing sends them to different workers. You also must prove Tenant B can never reuse Tenant A's private KV blocks. Which design matches both goals?
9. A policy changed from 'keys rotate after 90 days' to 'keys rotate after 30 days', but the bot still answers with the 90-day rule. In an exact-prefix cache, changed policy tokens should miss old KV blocks. What should the team investigate first?

Prefix caching reuses work across requests; FlashAttention reduces memory traffic inside each attention computation, so the next chapter moves from request-level reuse to kernel-level efficiency.

References

[1] vLLM. Automatic Prefix Caching. 2026.

[2] Zheng, L., Yin, L., Xie, Z., et al. SGLang: Efficient Execution of Structured Language Model Programs. 2023. arXiv:2312.07104.

[3] OpenAI. Prompt caching. 2026.

[4] Anthropic. Prompt caching. 2026. Official documentation.