Semantic Caching & Cost Optimization

Reuse stable policy answers across paraphrased questions without crossing release, access, or freshness boundaries; then prove the cache is both safe and worth serving.

The previous lesson promoted an LLM system only by moving a pointer to an immutable release bundle. That discipline matters again here: a cached answer is behavior produced by one particular release, prompt policy, corpus, and access scope.

Suppose the deployed delivery-support assistant repeatedly answers the public question, "How long can I return unused headphones?" Shoppers phrase that question many ways. Reusing a verified answer could save generation work and respond faster. Reusing the answer after a return-policy update, for another tenant, or for a live order-status question could be plainly wrong.

This lesson builds one semantic cache for public policy answers. It will serve nothing until a shadow replay shows safe hits and worthwhile savings.

An answer is reusable only inside its contract

A response cache isn't a memory of generally true sentences. It's a store of outputs generated under a particular contract. The release bundle from the previous chapter gives us most of that contract already.

| Request | Can an answer be reused? | Why |
| --- | --- | --- |
| "What is the return window for unused headphones?" | Candidate | Public policy answer can remain stable within one policy release. |
| "How long can I send unused headphones back?" | Candidate | Paraphrase of the same public-policy question, after evaluation. |
| "Where is order ORD-48192 right now?" | No | Answer depends on live, customer-specific state. |
| "Create a return label for ORD-48192." | No | The request asks for a side effect, not reusable prose. |

For this system, a reusable answer must match all of these fields:

| Field | Why it matters |
| --- | --- |
| release_id | Pins model, prompt, policy logic, and serving behavior. |
| corpus_version | Prevents old policy evidence from surviving a document update. |
| tenant_id and access_scope | Prevents one customer's or merchant's information from leaking into another response. |
| locale and response_schema | Prevents the right content from appearing in the wrong language or output contract. |

[Figure: Reuse eligibility map showing public policy answers as candidates while live order status and return-label actions bypass the semantic answer cache.]
A high-volume request isn't automatically cacheable. Public, read-only policy answers can enter evaluation; live customer state and write actions must bypass answer reuse.

Start the lab with the exact release scope and one response generated by the promoted release.

define-the-reuse-contract.py
    from dataclasses import asdict, dataclass, replace
    from hashlib import sha256
    import json
    import math

    @dataclass(frozen=True)
    class ReleaseScope:
        release_id: str
        corpus_version: str
        tenant_id: str
        access_scope: str
        locale: str
        response_schema: str

    @dataclass(frozen=True)
    class Request:
        text: str
        tenant_id: str = "shopflow-public"
        access_scope: str = "public-policy"
        locale: str = "en-US"
        response_schema: str = "cited-answer-v2"
        requires_live_data: bool = False
        writes_state: bool = False

    @dataclass(frozen=True)
    class CachedAnswer:
        answer_id: str
        source_query: str
        response: str
        scope: ReleaseScope

    stable_scope = ReleaseScope(
        release_id="delivery-evidence-answerer@sha256:df2d4fe7b0c5",
        corpus_version="returns-policy-2026-04",
        tenant_id="shopflow-public",
        access_scope="public-policy",
        locale="en-US",
        response_schema="cited-answer-v2",
    )

    seed_answer = CachedAnswer(
        answer_id="ans_returns_unused_30d",
        source_query="What is the return window for unused headphones?",
        response="Unused headphones can be returned within 30 days of delivery.",
        scope=stable_scope,
    )

    print(f"release_id={stable_scope.release_id}")
    print(f"corpus_version={stable_scope.corpus_version}")
    print(f"seed_answer={seed_answer.answer_id}")
Output
    release_id=delivery-evidence-answerer@sha256:df2d4fe7b0c5
    corpus_version=returns-policy-2026-04
    seed_answer=ans_returns_unused_30d

For requests already eligible for answer reuse, an ordinary key-value cache can safely reuse normalized exact repeats as long as its key contains the full scope. It can't see through paraphrasing.

The scope fields must come from application-owned route policy and authenticated context, not from raw user text or a model's guess. Unknown response classes should bypass answer reuse. The lab also checks that the caller-selected scope agrees with the request before deriving a key.

exact-cache-respects-release-scope.py
    def normalized_text(text: str) -> str:
        return " ".join(text.lower().split())

    def request_matches_scope(request: Request, scope: ReleaseScope) -> bool:
        return (
            request.tenant_id == scope.tenant_id
            and request.access_scope == scope.access_scope
            and request.locale == scope.locale
            and request.response_schema == scope.response_schema
        )

    def exact_key(scope: ReleaseScope, request: Request) -> str:
        if not request_matches_scope(request, scope):
            raise ValueError("request does not match cache scope")
        payload = {
            "scope": asdict(scope),
            "text": normalized_text(request.text),
        }
        encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
        return sha256(encoded).hexdigest()

    same_words = Request("What is the return window for unused headphones?")
    paraphrase = Request("How long can I send unused headphones back?")
    cross_tenant = replace(same_words, tenant_id="merchant-b")
    updated_scope = replace(
        stable_scope,
        release_id="delivery-evidence-answerer@sha256:new-policy",
        corpus_version="returns-policy-2026-05",
    )

    seed_key = exact_key(stable_scope, Request(seed_answer.source_query))
    print(f"exact_repeat_hit={exact_key(stable_scope, same_words) == seed_key}")
    print(f"paraphrase_exact_hit={exact_key(stable_scope, paraphrase) == seed_key}")
    print(f"updated_policy_hit={exact_key(updated_scope, same_words) == seed_key}")
    try:
        exact_key(stable_scope, cross_tenant)
    except ValueError as error:
        print(f"cross_tenant_rejected={error}")
Output
    exact_repeat_hit=True
    paraphrase_exact_hit=False
    updated_policy_hit=False
    cross_tenant_rejected=request does not match cache scope

The exact cache does the correct thing for an exact cache: it misses on a paraphrase and refuses an old answer under a new policy release. Semantic caching changes only the first behavior, by letting evaluated paraphrases reach a stored answer. It must not weaken the second.
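The earlier requirement that scope fields come from application-owned route policy and authenticated context can be sketched as a small lookup. This is a minimal illustration, not the lab's code: `AuthContext`, `RoutePolicy`, `ROUTE_POLICIES`, the route names, and `reuse_scope_fields` are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class AuthContext:
    # Hypothetical: filled in by the application's auth layer, never from user text.
    tenant_id: str
    locale: str

@dataclass(frozen=True)
class RoutePolicy:
    # Hypothetical: one entry per response class the application knows about.
    access_scope: str
    response_schema: str
    answer_reuse_allowed: bool

# Application-owned table keyed by route name; unknown routes are absent on purpose.
ROUTE_POLICIES = {
    "public_policy_faq": RoutePolicy("public-policy", "cited-answer-v2", True),
    "order_status": RoutePolicy("customer-order", "cited-answer-v2", False),
}

def reuse_scope_fields(route: str, auth: AuthContext) -> Optional[dict[str, str]]:
    """Return scope fields for answer reuse, or None when the request must bypass."""
    policy = ROUTE_POLICIES.get(route)
    if policy is None or not policy.answer_reuse_allowed:
        return None  # unknown or ineligible response class: bypass answer reuse
    return {
        "tenant_id": auth.tenant_id,
        "access_scope": policy.access_scope,
        "locale": auth.locale,
        "response_schema": policy.response_schema,
    }

auth = AuthContext(tenant_id="shopflow-public", locale="en-US")
print(reuse_scope_fields("public_policy_faq", auth))
print(reuse_scope_fields("order_status", auth))
print(reuse_scope_fields("unrecognized_route", auth))
```

Nothing in this sketch reads the question text, so a prompt cannot talk its way into a different tenant or access scope. An unrecognized route returns `None` rather than falling back to a default scope.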

Similarity retrieves a candidate, not a truth

A semantic cache embeds a new question, searches stored question embeddings, and proposes a nearby saved answer. Systems such as GPTCache apply that retrieval step before deciding whether to call the LLM at all.[1] Sentence-BERT showed why this shape works: sentence embeddings can be compared efficiently with cosine similarity for semantic matching tasks.[2]

For two vectors $a$ and $b$, cosine similarity is:

$$\operatorname{cosine}(a, b) = \frac{a \cdot b}{\lVert a \rVert_2 \, \lVert b \rVert_2}$$

The numerator measures their aligned components. Dividing by both lengths makes the result compare direction rather than vector magnitude. A high score says two encoded questions are close under this embedding model. It does not say their answers are interchangeable.

The tiny vectors below are an instructional fixture, not scores from a commercial embedding model. They let us see the failure mode without downloading a model: an opened-item exception can sit near a general return-window question while still needing a different answer.

similarity-only-proposes-a-candidate.py
    fixture_vectors = {
        seed_answer.source_query: (1.00, 0.00, 0.00),
        "How long can I send unused headphones back?": (0.99, 0.04, 0.00),
        "Can I return opened headphones?": (0.94, 0.10, 0.00),
        "Where is order ORD-48192 right now?": (0.00, 0.05, 1.00),
    }

    def cosine(left: tuple[float, ...], right: tuple[float, ...]) -> float:
        dot = sum(a * b for a, b in zip(left, right))
        left_norm = math.sqrt(sum(value * value for value in left))
        right_norm = math.sqrt(sum(value * value for value in right))
        return dot / (left_norm * right_norm)

    seed_vector = fixture_vectors[seed_answer.source_query]
    for question in [
        "How long can I send unused headphones back?",
        "Can I return opened headphones?",
        "Where is order ORD-48192 right now?",
    ]:
        score = cosine(seed_vector, fixture_vectors[question])
        print(f"{question} | score={score:.3f}")
Output
    How long can I send unused headphones back? | score=0.999
    Can I return opened headphones? | score=0.994
    Where is order ORD-48192 right now? | score=0.000

[Figure: Release-scoped semantic cache lookup where embedding similarity proposes a policy answer and matching release, corpus, tenant, and access scope permit reuse.]
Similarity is only the candidate-retrieval step. A semantic hit becomes servable only after release identity, evidence version, access scope, and eligibility checks all pass.

Eligibility rules run before the score threshold

The assistant shouldn't serve cached responses for live order state or for actions at any threshold. Even for a public-policy question, a cached answer must come from the same release scope.

This decision procedure checks the non-negotiable rules first. Only an eligible, same-scope request reaches the similarity threshold.

gate-semantic-hits-by-contract.py
    def same_scope(request: Request, record: CachedAnswer, active: ReleaseScope) -> bool:
        return (
            record.scope == active
            and request_matches_scope(request, active)
        )

    def decide_candidate(
        request: Request,
        record: CachedAnswer,
        active: ReleaseScope,
        score: float,
        threshold: float,
    ) -> str:
        if request.requires_live_data or request.writes_state:
            return "BYPASS_DYNAMIC_OR_WRITE"
        if not same_scope(request, record, active):
            return "MISS_SCOPE_CHANGED"
        if score < threshold:
            return "MISS_BELOW_THRESHOLD"
        return "SEMANTIC_HIT"

    policy_paraphrase = Request("How long can I send unused headphones back?")
    live_order = Request(
        "Where is order ORD-48192 right now?",
        access_scope="customer-order",
        requires_live_data=True,
    )
    label_action = Request(
        "Create a return label for ORD-48192.",
        access_scope="customer-order",
        writes_state=True,
    )

    policy_score = cosine(seed_vector, fixture_vectors[policy_paraphrase.text])
    print(decide_candidate(policy_paraphrase, seed_answer, stable_scope, policy_score, 0.98))
    print(decide_candidate(live_order, seed_answer, stable_scope, 1.00, 0.98))
    print(decide_candidate(label_action, seed_answer, stable_scope, 1.00, 0.98))
Output
    SEMANTIC_HIT
    BYPASS_DYNAMIC_OR_WRITE
    BYPASS_DYNAMIC_OR_WRITE
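The gate above evaluates a single stored record. With many records, the same ordering can be kept by filtering on scope before any similarity is computed. The sketch below is one possible shape under stated assumptions: `Record`, `scope_key` (a plain string standing in for the full release scope), and `best_candidate` are hypothetical, and a linear scan stands in for a real vector index.

```python
import math
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Record:
    answer_id: str
    scope_key: str  # hypothetical: a string standing in for the full release scope
    vector: tuple[float, ...]

def cosine(left: tuple[float, ...], right: tuple[float, ...]) -> float:
    dot = sum(a * b for a, b in zip(left, right))
    norms = math.sqrt(sum(a * a for a in left)) * math.sqrt(sum(b * b for b in right))
    return dot / norms

def best_candidate(
    query_vector: tuple[float, ...],
    active_scope_key: str,
    records: list[Record],
    threshold: float,
) -> Optional[tuple[str, float]]:
    # Scope filter first: out-of-scope records are never scored.
    in_scope = [record for record in records if record.scope_key == active_scope_key]
    if not in_scope:
        return None
    best = max(in_scope, key=lambda record: cosine(query_vector, record.vector))
    score = cosine(query_vector, best.vector)
    return (best.answer_id, score) if score >= threshold else None

records = [
    Record("ans_returns_unused_30d", "release-a|corpus-2026-04", (1.00, 0.00, 0.00)),
    # Identical to the query vector, but stored under an older scope.
    Record("ans_old_release", "release-old|corpus-2026-03", (0.99, 0.04, 0.00)),
]
query = (0.99, 0.04, 0.00)
hit = best_candidate(query, "release-a|corpus-2026-04", records, 0.98)
miss = best_candidate(query, "release-b|corpus-2026-05", records, 0.98)
print(hit)
print(miss)
```

The old-release record matches the query vector exactly, yet it can never win because it is removed before scoring. Filtering after the nearest-neighbor search would risk the top result being an out-of-scope record that hides an in-scope one.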

Version changes invalidate answers without guessing

A time-to-live (TTL) can expire old entries after a period. It can't know that a returns policy changed five minutes after an answer was stored. The release bundle provides a stronger invalidation hook: if policy evidence or answer behavior changes, the release or corpus version changes and old entries are no longer in scope.

invalidate-on-policy-release.py
    policy_update = replace(
        stable_scope,
        release_id="delivery-evidence-answerer@sha256:7a12policy",
        corpus_version="returns-policy-2026-05",
    )

    same_question = Request(seed_answer.source_query)
    old_release_decision = decide_candidate(
        same_question, seed_answer, stable_scope, 1.00, 0.98
    )
    new_release_decision = decide_candidate(
        same_question, seed_answer, policy_update, 1.00, 0.98
    )

    print(f"old_release={old_release_decision}")
    print(f"new_policy_release={new_release_decision}")
    print(f"new_release_must_generate={new_release_decision != 'SEMANTIC_HIT'}")
Output
    old_release=SEMANTIC_HIT
    new_policy_release=MISS_SCOPE_CHANGED
    new_release_must_generate=True

This is why cache identity should inherit the release identity from deployment. Eviction can clean up storage later; correctness shouldn't depend on eviction finishing first.
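The gap between TTL expiry and version-based invalidation can be made concrete with a small comparison. This is a sketch with hypothetical names (`Entry`, `fresh_by_ttl`, `fresh_by_version`) and made-up timestamps; it assumes an entry written five minutes before the policy changed and a one-hour TTL.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    corpus_version: str
    stored_at_s: float  # hypothetical: seconds on the application's clock

def fresh_by_ttl(entry: Entry, now_s: float, ttl_s: float) -> bool:
    # A TTL only knows how old the entry is.
    return now_s - entry.stored_at_s < ttl_s

def fresh_by_version(entry: Entry, active_corpus_version: str) -> bool:
    # A version check knows whether the evidence behind the entry is still active.
    return entry.corpus_version == active_corpus_version

entry = Entry(corpus_version="returns-policy-2026-04", stored_at_s=0.0)
now_s = 300.0  # five minutes after the write; the policy has since changed

ttl_ok = fresh_by_ttl(entry, now_s, ttl_s=3600.0)
version_ok = fresh_by_version(entry, "returns-policy-2026-05")
print(f"ttl_says_fresh={ttl_ok}")
print(f"version_says_fresh={version_ok}")
print(f"servable={ttl_ok and version_ok}")
```

A TTL can still be useful for storage hygiene. In this sketch it is combined with the version check rather than used in place of it.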

Choose a threshold in shadow mode

Serving a semantic hit immediately turns a retrieval mistake into a user-visible wrong answer. Shadow mode runs the lookup decision but still serves the normal fresh path. Reviewers then label whether each proposed reuse would have been acceptable.

A good cache metric separates two questions:

  • Proposal rate: how often would the cache return something?
  • Hit precision: among proposed hits, how often is answer reuse acceptable?

High proposal rate without high precision is a cheaper system that is wrong more often.

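The labeled fixture below contains public-policy paraphrases, a subtle opened-item exception, and ineligible requests. The probe scores are written directly into the fixture and are not the cosine values computed from the toy vectors earlier. Thresholds are specific to this fixture and embedding setup; don't copy them into production.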

calibrate-with-shadow-replay.py
    @dataclass(frozen=True)
    class ShadowProbe:
        name: str
        score: float
        eligible: bool
        acceptable_reuse: bool

    shadow_probes = [
        ShadowProbe("return window paraphrase", 0.995, True, True),
        ShadowProbe("send unused item back", 0.989, True, True),
        ShadowProbe("refund window wording", 0.982, True, True),
        ShadowProbe("policy FAQ reworded", 0.981, True, True),
        ShadowProbe("opened-item exception", 0.965, True, False),
        ShadowProbe("live order state", 0.999, False, False),
        ShadowProbe("create label action", 0.997, False, False),
    ]

    def replay_at(threshold: float) -> dict[str, float | int]:
        proposed = [
            probe for probe in shadow_probes
            if probe.eligible and probe.score >= threshold
        ]
        accepted = [probe for probe in proposed if probe.acceptable_reuse]
        precision = len(accepted) / len(proposed) if proposed else 1.0
        return {
            "proposed": len(proposed),
            "accepted": len(accepted),
            "precision": precision,
            "proposal_rate": len(proposed) / len(shadow_probes),
        }

    for threshold in [0.960, 0.980, 0.990]:
        metrics = replay_at(threshold)
        print(
            f"threshold={threshold:.3f} "
            f"proposed={metrics['proposed']} "
            f"precision={metrics['precision']:.1%} "
            f"proposal_rate={metrics['proposal_rate']:.1%}"
        )

    selected_threshold = 0.980
Output
    threshold=0.960 proposed=5 precision=80.0% proposal_rate=71.4%
    threshold=0.980 proposed=4 precision=100.0% proposal_rate=57.1%
    threshold=0.990 proposed=1 precision=100.0% proposal_rate=14.3%

[Figure: Shadow replay threshold sweep where a loose semantic-cache threshold includes an unacceptable policy exception and a stricter threshold serves fewer but correct reuse proposals.]
Threshold tuning is an evaluation decision. In this fixture, excluding one near-but-wrong exception matters more than maximizing reuse count.
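A precision of 100% on four proposals is weak evidence on its own. One way to see how much a sample supports a precision gate is a lower confidence bound on the observed proportion. The sketch below uses the Wilson score interval, which is a standard statistical technique and an assumption added here rather than part of the lab; `wilson_lower_bound` is a hypothetical helper, and `z = 1.96` corresponds to roughly 95% confidence.

```python
import math

def wilson_lower_bound(successes: int, trials: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a binomial proportion."""
    if trials == 0:
        return 0.0
    p = successes / trials
    z2 = z * z
    centre = p + z2 / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials))
    return (centre - margin) / (1 + z2 / trials)

# Same observed precision (100%), very different amounts of evidence.
for accepted, proposed in [(4, 4), (400, 400)]:
    bound = wilson_lower_bound(accepted, proposed)
    print(f"accepted={accepted} proposed={proposed} lower_bound={bound:.3f}")
```

With 4 of 4 accepted, the lower bound is only about 0.51; with 400 of 400 it is about 0.99. A promotion gate that compares the lower bound to the minimum precision, rather than the point estimate, would require a larger labeled replay before serving.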

A safe cache still has to pay for itself

Every semantic lookup incurs work, even on a miss: embedding the request, searching an index, and recording metrics. Evaluate cost only after the precision gate passes.

Let:

  • $N$ be requests in a measured period.
  • $C_g$ be average fresh-generation cost per request.
  • $C_l$ be semantic-lookup cost per request.
  • $h$ be the observed safe-hit fraction.

If a hit skips fresh generation, expected period savings are:

$$\text{savings} = N \left(h C_g - C_l\right)$$

These quantities must come from the workload and model you plan to operate. The next example uses clearly labeled measurement fixtures, not provider prices.

measure-break-even-savings.py
    shadow_metrics = replay_at(selected_threshold)
    requests_per_day = 10_000
    fresh_generation_usd = 0.0040  # measured fixture: average full answer cost
    semantic_lookup_usd = 0.00008  # measured fixture: embed + index lookup
    # Proposal rate equals the safe-hit fraction here only because precision
    # at the selected threshold is 100% in this fixture.
    safe_hit_fraction = shadow_metrics["proposal_rate"]

    without_cache = requests_per_day * fresh_generation_usd
    with_cache = requests_per_day * (
        semantic_lookup_usd + (1 - safe_hit_fraction) * fresh_generation_usd
    )
    savings = without_cache - with_cache
    break_even_hit_fraction = semantic_lookup_usd / fresh_generation_usd

    print(f"safe_hit_fraction={safe_hit_fraction:.1%}")
    print(f"break_even_hit_fraction={break_even_hit_fraction:.1%}")
    print(f"daily_savings_fixture_usd={savings:.2f}")
    print(f"savings_positive={savings > 0}")
Output
    safe_hit_fraction=57.1%
    break_even_hit_fraction=2.0%
    daily_savings_fixture_usd=22.06
    savings_positive=True

This calculation intentionally omits any guessed list price. Measure generation and lookup cost for the actual release and traffic distribution, then repeat the gate when either changes.
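The break-even fraction printed by the lab follows from asking when savings become positive. The short derivation below assumes $N > 0$ and $C_g > 0$; $h^{*}$ is a symbol introduced here for the break-even hit fraction, and the numbers are the lab's fixture values, not provider prices.

```latex
\begin{aligned}
\text{savings} > 0
  &\iff N \left(h C_g - C_l\right) > 0 \\
  &\iff h > \frac{C_l}{C_g} \\[4pt]
h^{*} &= \frac{C_l}{C_g} = \frac{0.00008}{0.0040} = 0.02 \\[4pt]
\text{savings at } h = \tfrac{4}{7}
  &= 10{,}000 \left(\tfrac{4}{7} \cdot 0.0040 - 0.00008\right) \approx 22.06
\end{aligned}
```

The break-even point depends only on the ratio of lookup cost to generation cost, so a cheaper generation model raises the hit fraction the cache needs in order to pay for itself.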

Promote only the narrow policy you tested

The safe outcome isn't "turn on semantic caching for all support." It is "turn on semantic reuse for the public-policy scope that passed shadow evidence." Order status and write actions remain bypassed.

make-the-cache-promotion-decision.py
    @dataclass(frozen=True)
    class CachePromotionGate:
        minimum_precision: float
        minimum_daily_savings_usd: float
        required_scope: ReleaseScope

    gate = CachePromotionGate(
        minimum_precision=0.99,
        minimum_daily_savings_usd=5.00,
        required_scope=stable_scope,
    )

    passes_quality = shadow_metrics["precision"] >= gate.minimum_precision
    passes_economics = savings >= gate.minimum_daily_savings_usd
    passes_scope = seed_answer.scope == gate.required_scope
    decision = (
        "PROMOTE_PUBLIC_POLICY_SEMANTIC_CACHE"
        if passes_quality and passes_economics and passes_scope
        else "KEEP_SHADOW_ONLY"
    )

    print(f"quality_gate={passes_quality}")
    print(f"economics_gate={passes_economics}")
    print(f"scope_gate={passes_scope}")
    print(f"cache_decision={decision}")
Output
    quality_gate=True
    economics_gate=True
    scope_gate=True
    cache_decision=PROMOTE_PUBLIC_POLICY_SEMANTIC_CACHE

[Figure: Semantic-cache delivery path moving from immutable release scope through exact lookup, shadow-only semantic evaluation, quality and savings gates, and limited public-policy promotion.]
Deployment is a measured policy decision. Only the response class tested in shadow mode is promoted; dynamic customer data and side effects remain outside the answer cache.

Record why each request hit or bypassed

Once the cache is serving, traces must answer: which release generated the cached response, which cache policy reused it, and why a request bypassed reuse? Without those fields, wrong-hit incidents become difficult to reconstruct.

emit-cache-decision-traces.py
    def trace_decision(request: Request, score: float) -> dict[str, str | float]:
        cache_decision = decide_candidate(
            request, seed_answer, stable_scope, score, selected_threshold
        )
        return {
            "request": request.text,
            "release_id": stable_scope.release_id,
            "corpus_version": stable_scope.corpus_version,
            "tenant_id": request.tenant_id,
            "access_scope": request.access_scope,
            "cache_policy": "public-policy-semantic-v1",
            "answer_id": seed_answer.answer_id if cache_decision == "SEMANTIC_HIT" else "",
            "decision": cache_decision,
            "score": score,
        }

    hit_trace = trace_decision(policy_paraphrase, policy_score)
    bypass_trace = trace_decision(live_order, 1.00)

    print(f"hit_decision={hit_trace['decision']} answer_id={hit_trace['answer_id']}")
    print(f"bypass_decision={bypass_trace['decision']}")
    print(f"traced_release={hit_trace['release_id'] == stable_scope.release_id}")
    print(f"traced_scope={hit_trace['corpus_version'] == stable_scope.corpus_version and hit_trace['access_scope'] == stable_scope.access_scope}")
Output
    hit_decision=SEMANTIC_HIT answer_id=ans_returns_unused_30d
    bypass_decision=BYPASS_DYNAMIC_OR_WRITE
    traced_release=True
    traced_scope=True

Watch production for accepted-hit review failures, user retries after cache hits, scope bypass volume, latency, and realized saved generation. An incident should be able to disable this cache policy pointer without changing the production release that generates fresh responses.
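Several of those production signals are simple aggregates over the decision traces. As a minimal sketch, the hypothetical `decision_shares` helper below tallies the share of each decision in a batch of trace records; the `traces` list is invented sample data that reuses the decision labels from this lab.

```python
from collections import Counter

# Hypothetical trace records; only the fields used here are shown.
traces = [
    {"decision": "SEMANTIC_HIT", "cache_policy": "public-policy-semantic-v1"},
    {"decision": "SEMANTIC_HIT", "cache_policy": "public-policy-semantic-v1"},
    {"decision": "MISS_BELOW_THRESHOLD", "cache_policy": "public-policy-semantic-v1"},
    {"decision": "BYPASS_DYNAMIC_OR_WRITE", "cache_policy": "public-policy-semantic-v1"},
]

def decision_shares(records: list[dict[str, str]]) -> dict[str, float]:
    """Fraction of traces per decision label; empty input yields an empty dict."""
    counts = Counter(record["decision"] for record in records)
    total = sum(counts.values())
    return {decision: count / total for decision, count in counts.items()}

shares = decision_shares(traces)
for decision, share in sorted(shares.items()):
    print(f"{decision}: {share:.0%}")
```

A sudden shift in these shares, such as bypass volume dropping while hits rise, is a cue to review recent hits before trusting the savings figure.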

Semantic response caching isn't prompt-prefix caching

The cache in this lab can return a stored answer for a paraphrase and skip generation. Provider prompt caching operates at a different layer. For example, OpenAI's documented prompt caching matches repeated prompt prefixes once a prompt reaches a documented minimum token length and reduces repeated input processing; the model still computes a new output.[3]

| Layer | Matches | Result on hit | Main correctness risk |
| --- | --- | --- | --- |
| Exact response cache | Same scoped request key | Return stored answer, skip generation | Stale or incomplete scope key |
| Semantic response cache | Similar eligible question under same contract | Return stored answer, skip generation | False semantic reuse |
| Provider prompt cache | Matching input prefix under provider rules | Compute a new answer with cheaper/faster repeated input work | Missed cost opportunity, not stored-answer substitution |

The distinction determines the evaluation: semantic answer caching needs labeled reuse precision; prompt-prefix caching needs token and latency accounting. The next chapter expands that economics.

separate-answer-reuse-from-prefix-reuse.py
    @dataclass(frozen=True)
    class ReuseCase:
        name: str
        semantic_answer_hit: bool
        repeated_prefix_hit: bool

    cases = [
        ReuseCase(
            name="paraphrased public return question",
            semantic_answer_hit=True,
            repeated_prefix_hit=False,
        ),
        ReuseCase(
            name="new live order question after same long instructions",
            semantic_answer_hit=False,
            repeated_prefix_hit=True,
        ),
    ]

    for case in cases:
        print(
            f"{case.name}: "
            f"skip_generation={case.semantic_answer_hit}, "
            f"reuse_input_work={case.repeated_prefix_hit}"
        )

    print("next_measure_token_economics=True")
Output
    paraphrased public return question: skip_generation=True, reuse_input_work=False
    new live order question after same long instructions: skip_generation=False, reuse_input_work=True
    next_measure_token_economics=True
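Prefix reuse depends on where two prompts first differ, not on how similar they are overall. The sketch below illustrates that idea only: `shared_prefix_length` is a hypothetical helper, whitespace-separated words stand in for real tokens, and no provider's matching rules (minimum length, granularity, retention) are modeled.

```python
def shared_prefix_length(left: list[str], right: list[str]) -> int:
    """Count leading tokens that are identical in both sequences."""
    length = 0
    for a, b in zip(left, right):
        if a != b:
            break
        length += 1
    return length

# Whitespace-split words are a stand-in for tokens, not a real tokenizer.
instructions = "You are the delivery support assistant . Cite policy evidence .".split()
first = instructions + "Where is order A ?".split()
second = instructions + "Where is order B ?".split()
reordered = "Where is order B ?".split() + instructions

print(shared_prefix_length(first, second))     # stable instructions come first
print(shared_prefix_length(first, reordered))  # varying text comes first
```

Placing the stable instructions ahead of the varying question keeps the shared prefix long. Moving the question to the front leaves no shared prefix in this sketch, even though the two prompts contain nearly the same words.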

Mastery check

Key concepts

  • A cached answer belongs to an immutable release scope, not only to question text.
  • Exact response caches catch identical scoped requests; semantic caches retrieve answer candidates across paraphrases.
  • Similarity is evidence for candidate retrieval, never permission to ignore access, freshness, or side-effect boundaries.
  • New corpus or release identity naturally invalidates old answer reuse.
  • Shadow-mode precision and measured break-even savings are promotion gates.
  • Servable cache writes need a validated admission path; don't let unreviewed answers become reusable records.
  • Provider prompt caching reuses repeated input processing, while semantic response caching can skip generation.

Practice tasks

  1. Add response_schema="json-citations-v3" to a new scope and prove an old natural-language answer can't hit it.
  2. Add one public-policy question that is highly similar to a cached question but requires a different answer. Re-run the threshold sweep and explain the new selected threshold.
  3. Replace the fixture costs with measured numbers for a workload you control. Compute the hit fraction required to break even.
  4. Add a cache-policy rollback event that turns semantic hits back into misses while keeping fresh generation on the same release.

Evaluation rubric

  • Foundational: Explains why a paraphrase misses an exact cache and why similarity can propose reuse.
  • Foundational: Identifies live data and write actions as ineligible for response reuse before considering score.
  • Intermediate: Builds a cache key or metadata filter that includes release, corpus, tenant, access, locale, and schema scope.
  • Intermediate: Calibrates a threshold in shadow mode using accepted-hit precision rather than raw hit rate.
  • Advanced: Computes measured break-even savings and promotes only the response class proved safe and worthwhile.
  • Advanced: Distinguishes semantic stored-answer reuse from provider prefix-computation reuse and chooses evidence for each.

Common pitfalls

Optimizing hit count instead of correct reuse

Symptom: Hit rate rises while users report answers for a nearby but different policy case.
Cause: Threshold was loosened without accepted-reuse labels.
Fix: Run shadow replay, gate on precision, and exclude classes where a near match is unsafe.

Caching dynamic or write requests

Symptom: Customer sees stale tracking data or a workflow appears completed when no action ran.
Cause: Cache eligibility was treated as a similarity decision.
Fix: Bypass live-state and side-effect requests before vector lookup can produce a servable hit.

Keeping entries across a policy release

Symptom: A new return rule is live, but responses still quote the previous rule.
Cause: Cache records weren't scoped to release and evidence version.
Fix: Include release and corpus identifiers in lookup filters; treat a version change as an immediate miss.

Proving safety but not value

Symptom: Quality remains stable, yet overall request cost or latency gets worse.
Cause: Embedding and index lookup costs exceed saved generations.
Fix: Measure lookup overhead and safe-hit fraction for the target traffic before promoting.

Unreviewed answers enter the reusable store

Symptom: One incorrect fresh answer gets repeated across several paraphrased questions.
Cause: The write path admitted every generated answer directly into the servable cache.
Fix: Treat cache admission as write authorization. Admit only validated response classes with recorded evidence; quarantine or review new records before reuse.
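Treating admission as write authorization can be sketched as a two-stage store in which fresh generations are quarantined and lookups read only admitted records. The class and method names below (`AnswerStore`, `propose`, `admit`, `lookup`) and the reviewer string are hypothetical; a production system would also record the evidence behind each admission.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AnswerStore:
    # Hypothetical two-stage store: quarantined records are never served.
    quarantined: dict[str, str] = field(default_factory=dict)
    servable: dict[str, str] = field(default_factory=dict)

    def propose(self, answer_id: str, response: str) -> None:
        """Fresh generations land in quarantine, not in the servable set."""
        self.quarantined[answer_id] = response

    def admit(self, answer_id: str, reviewer: str) -> bool:
        """Move a record to the servable set only when a reviewer is named."""
        if not reviewer or answer_id not in self.quarantined:
            return False
        self.servable[answer_id] = self.quarantined.pop(answer_id)
        return True

    def lookup(self, answer_id: str) -> Optional[str]:
        """Lookups read the servable set only."""
        return self.servable.get(answer_id)

store = AnswerStore()
store.propose(
    "ans_returns_unused_30d",
    "Unused headphones can be returned within 30 days of delivery.",
)
before_review = store.lookup("ans_returns_unused_30d")
admitted = store.admit("ans_returns_unused_30d", reviewer="policy-reviewer-1")
after_review = store.lookup("ans_returns_unused_30d")

print(f"servable_before_review={before_review is not None}")
print(f"admitted={admitted}")
print(f"servable_after_review={after_review is not None}")
```

Because `lookup` never touches the quarantine, an incorrect fresh answer can be generated once without becoming a reusable record.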

Next Step
Continue to LLM Cost Engineering & Token Economics

You can now decide whether an answer is safe to reuse under one evaluated release. Next you will measure the token, model, prefix-cache, and routing costs of requests that still require generation.

References

[1] Bang, Fu (2023). GPTCache: An Open-Source Semantic Cache for LLM Applications Enabling Faster Answers and Cost Savings. NLP-OSS 2023.

[2] Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP 2019.

[3] OpenAI (2026). Prompt caching.