Reuse stable policy answers across paraphrased questions without crossing release, access, or freshness boundaries; then prove the cache is both safe and worth serving.
The previous lesson promoted an LLM system only by moving a pointer to an immutable release bundle. That discipline matters again here: a cached answer is behavior produced by one particular release, prompt policy, corpus, and access scope.
Suppose the deployed delivery-support assistant repeatedly answers the public question, "How long can I return unused headphones?" Shoppers phrase that question many ways. Reusing a verified answer could save generation work and respond faster. Reusing the answer after a return-policy update, for another tenant, or for a live order-status question could be plainly wrong.
This lesson builds one semantic cache for public policy answers. It will serve nothing until a shadow replay shows safe hits and worthwhile savings.
A response cache isn't a memory of generally true sentences. It's a store of outputs generated under a particular contract. The release bundle from the previous chapter gives us most of that contract already.
| Request | Can an answer be reused? | Why |
|---|---|---|
| "What is the return window for unused headphones?" | Candidate | Public policy answer can remain stable within one policy release. |
| "How long can I send unused headphones back?" | Candidate | Paraphrase of the same public-policy question, after evaluation. |
| "Where is order ORD-48192 right now?" | No | Answer depends on live, customer-specific state. |
| "Create a return label for ORD-48192." | No | The request asks for a side effect, not reusable prose. |
For this system, a reusable answer must match all of these fields:
| Field | Why it matters |
|---|---|
| `release_id` | Pins model, prompt, policy logic, and serving behavior. |
| `corpus_version` | Prevents old policy evidence from surviving a document update. |
| `tenant_id` and `access_scope` | Prevents one customer's or merchant's information from leaking into another response. |
| `locale` and `response_schema` | Prevents the right content from appearing in the wrong language or output contract. |
Start the lab with the exact release scope and one response generated by the promoted release.
```python
from dataclasses import asdict, dataclass, replace
from hashlib import sha256
import json
import math

@dataclass(frozen=True)
class ReleaseScope:
    release_id: str
    corpus_version: str
    tenant_id: str
    access_scope: str
    locale: str
    response_schema: str

@dataclass(frozen=True)
class Request:
    text: str
    tenant_id: str = "shopflow-public"
    access_scope: str = "public-policy"
    locale: str = "en-US"
    response_schema: str = "cited-answer-v2"
    requires_live_data: bool = False
    writes_state: bool = False

@dataclass(frozen=True)
class CachedAnswer:
    answer_id: str
    source_query: str
    response: str
    scope: ReleaseScope

stable_scope = ReleaseScope(
    release_id="delivery-evidence-answerer@sha256:df2d4fe7b0c5",
    corpus_version="returns-policy-2026-04",
    tenant_id="shopflow-public",
    access_scope="public-policy",
    locale="en-US",
    response_schema="cited-answer-v2",
)

seed_answer = CachedAnswer(
    answer_id="ans_returns_unused_30d",
    source_query="What is the return window for unused headphones?",
    response="Unused headphones can be returned within 30 days of delivery.",
    scope=stable_scope,
)

print(f"release_id={stable_scope.release_id}")
print(f"corpus_version={stable_scope.corpus_version}")
print(f"seed_answer={seed_answer.answer_id}")
```

```text
release_id=delivery-evidence-answerer@sha256:df2d4fe7b0c5
corpus_version=returns-policy-2026-04
seed_answer=ans_returns_unused_30d
```

For requests already eligible for answer reuse, an ordinary key-value cache can safely reuse normalized exact repeats as long as its key contains the full scope. It can't see through paraphrasing.
The scope fields must come from application-owned route policy and authenticated context, not from raw user text or a model's guess. Unknown response classes should bypass answer reuse. The lab also checks that the caller-selected scope agrees with the request before deriving a key.
```python
def normalized_text(text: str) -> str:
    # Lowercase and collapse whitespace so trivial variants share one key.
    return " ".join(text.lower().split())

def request_matches_scope(request: Request, scope: ReleaseScope) -> bool:
    return (
        request.tenant_id == scope.tenant_id
        and request.access_scope == scope.access_scope
        and request.locale == scope.locale
        and request.response_schema == scope.response_schema
    )

def exact_key(scope: ReleaseScope, request: Request) -> str:
    # Refuse to derive a key when the caller-selected scope disagrees with the request.
    if not request_matches_scope(request, scope):
        raise ValueError("request does not match cache scope")
    payload = {
        "scope": asdict(scope),
        "text": normalized_text(request.text),
    }
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return sha256(encoded).hexdigest()

same_words = Request("What is the return window for unused headphones?")
paraphrase = Request("How long can I send unused headphones back?")
cross_tenant = replace(same_words, tenant_id="merchant-b")
updated_scope = replace(
    stable_scope,
    release_id="delivery-evidence-answerer@sha256:new-policy",
    corpus_version="returns-policy-2026-05",
)

seed_key = exact_key(stable_scope, Request(seed_answer.source_query))
print(f"exact_repeat_hit={exact_key(stable_scope, same_words) == seed_key}")
print(f"paraphrase_exact_hit={exact_key(stable_scope, paraphrase) == seed_key}")
print(f"updated_policy_hit={exact_key(updated_scope, same_words) == seed_key}")
try:
    exact_key(stable_scope, cross_tenant)
except ValueError as error:
    print(f"cross_tenant_rejected={error}")
```

```text
exact_repeat_hit=True
paraphrase_exact_hit=False
updated_policy_hit=False
cross_tenant_rejected=request does not match cache scope
```

The exact cache does the correct thing: it refuses a paraphrase and refuses an old answer under a new policy release. Semantic caching should relax only the first refusal, so that paraphrases can match. It must not weaken the second.
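The earlier rule that scope fields come from application-owned route policy and authenticated context can be sketched as a small resolver. Everything in this sketch is hypothetical and not part of the lab: the `AuthContext` type, the `ROUTE_POLICY` table, the route names, and `resolve_scope`. It only illustrates that user text never contributes a scope field and that an unknown route yields no scope at all.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AuthContext:
    # Hypothetical: filled in by the authentication layer, never by user text.
    tenant_id: str
    locale: str


# Hypothetical application-owned table: route -> (access_scope, response_schema).
# A route that is absent is treated as not eligible for answer reuse.
ROUTE_POLICY = {
    "public_policy_faq": ("public-policy", "cited-answer-v2"),
}


def resolve_scope(route: str, auth: AuthContext) -> Optional[dict]:
    """Return scope fields for a cacheable route, or None to bypass reuse."""
    policy = ROUTE_POLICY.get(route)
    if policy is None:
        return None  # unknown response class: bypass answer reuse
    access_scope, response_schema = policy
    return {
        "tenant_id": auth.tenant_id,
        "access_scope": access_scope,
        "locale": auth.locale,
        "response_schema": response_schema,
    }


auth = AuthContext(tenant_id="shopflow-public", locale="en-US")
print(resolve_scope("public_policy_faq", auth))
print(resolve_scope("order_status", auth))
```

Returning `None` for an unlisted route makes bypass the default, so a new response class has to be added to the table deliberately before it can be cached.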
A semantic cache embeds a new question, searches stored question embeddings, and proposes a nearby saved answer. Systems such as GPTCache apply that retrieval step before deciding whether to call the LLM at all.[1] Sentence-BERT showed why this shape works: sentence embeddings can be compared efficiently with cosine similarity for semantic matching tasks.[2]
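For two vectors $\mathbf{a}$ and $\mathbf{b}$, cosine similarity is:

$$
\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
$$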
The numerator measures their aligned components. Dividing by both lengths makes the result compare direction rather than vector magnitude. A high score says two encoded questions are close under this embedding model. It does not say their answers are interchangeable.
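A quick check of the direction-versus-magnitude point, as a minimal sketch with made-up two-dimensional vectors (the names `base`, `longer`, and `perpendicular` are illustrative only): scaling a vector leaves the score unchanged, while rotating it by a right angle drives the score to zero.

```python
import math


def cosine(left, right):
    # Dot product divided by the product of the two Euclidean lengths.
    dot = sum(a * b for a, b in zip(left, right))
    return dot / (math.hypot(*left) * math.hypot(*right))


base = (3.0, 4.0)
longer = (30.0, 40.0)        # same direction, ten times the length
perpendicular = (-4.0, 3.0)  # same length as base, at a right angle

print(f"same_direction={cosine(base, longer):.3f}")
print(f"right_angle={cosine(base, perpendicular):.3f}")
```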
The tiny vectors below are an instructional fixture, not scores from a commercial embedding model. They let us see the failure mode without downloading a model: an opened-item exception can sit near a general return-window question while still needing a different answer.
```python
fixture_vectors = {
    seed_answer.source_query: (1.00, 0.00, 0.00),
    "How long can I send unused headphones back?": (0.99, 0.04, 0.00),
    "Can I return opened headphones?": (0.94, 0.10, 0.00),
    "Where is order ORD-48192 right now?": (0.00, 0.05, 1.00),
}

def cosine(left: tuple[float, ...], right: tuple[float, ...]) -> float:
    dot = sum(a * b for a, b in zip(left, right))
    left_norm = math.sqrt(sum(value * value for value in left))
    right_norm = math.sqrt(sum(value * value for value in right))
    return dot / (left_norm * right_norm)

seed_vector = fixture_vectors[seed_answer.source_query]
for question in [
    "How long can I send unused headphones back?",
    "Can I return opened headphones?",
    "Where is order ORD-48192 right now?",
]:
    score = cosine(seed_vector, fixture_vectors[question])
    print(f"{question} | score={score:.3f}")
```

```text
How long can I send unused headphones back? | score=0.999
Can I return opened headphones? | score=0.994
Where is order ORD-48192 right now? | score=0.000
```
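The lookup step described earlier (embed the question, search stored embeddings, propose the nearest saved answer) can be sketched as a brute-force scan. This is an illustrative assumption, not how GPTCache or any particular vector index works: `StoredEntry`, `nearest_entry`, the answer IDs, and the two-dimensional vectors are invented for the example, and a real system would use an embedding model and an approximate nearest-neighbor index.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StoredEntry:
    answer_id: str
    vector: tuple[float, ...]


def cosine(left, right):
    dot = sum(a * b for a, b in zip(left, right))
    return dot / (math.hypot(*left) * math.hypot(*right))


def nearest_entry(query, entries) -> Optional[tuple[StoredEntry, float]]:
    """Return the most similar entry and its score, or None for an empty store."""
    if not entries:
        return None
    best = max(entries, key=lambda entry: cosine(query, entry.vector))
    return best, cosine(query, best.vector)


store = [
    StoredEntry("ans_returns_window", (1.0, 0.0)),
    StoredEntry("ans_shipping_fees", (0.0, 1.0)),
]
candidate, score = nearest_entry((0.9, 0.1), store)
print(f"candidate={candidate.answer_id} score={score:.3f}")
```

The scan only proposes a candidate. Whether that candidate may be served is a separate decision that depends on eligibility, scope, and threshold.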
Live order state and write actions must never be served from the response cache, whatever the similarity score. Even for a public-policy question, a cached answer must come from the same release scope.
This decision procedure checks the non-negotiable rules first. Only an eligible, same-scope request reaches the similarity threshold.
```python
def same_scope(request: Request, record: CachedAnswer, active: ReleaseScope) -> bool:
    return (
        record.scope == active
        and request_matches_scope(request, active)
    )

def decide_candidate(
    request: Request,
    record: CachedAnswer,
    active: ReleaseScope,
    score: float,
    threshold: float,
) -> str:
    # Eligibility and scope are checked before similarity is considered at all.
    if request.requires_live_data or request.writes_state:
        return "BYPASS_DYNAMIC_OR_WRITE"
    if not same_scope(request, record, active):
        return "MISS_SCOPE_CHANGED"
    if score < threshold:
        return "MISS_BELOW_THRESHOLD"
    return "SEMANTIC_HIT"

policy_paraphrase = Request("How long can I send unused headphones back?")
live_order = Request(
    "Where is order ORD-48192 right now?",
    access_scope="customer-order",
    requires_live_data=True,
)
label_action = Request(
    "Create a return label for ORD-48192.",
    access_scope="customer-order",
    writes_state=True,
)

policy_score = cosine(seed_vector, fixture_vectors[policy_paraphrase.text])
print(decide_candidate(policy_paraphrase, seed_answer, stable_scope, policy_score, 0.98))
print(decide_candidate(live_order, seed_answer, stable_scope, 1.00, 0.98))
print(decide_candidate(label_action, seed_answer, stable_scope, 1.00, 0.98))
```

```text
SEMANTIC_HIT
BYPASS_DYNAMIC_OR_WRITE
BYPASS_DYNAMIC_OR_WRITE
```

A time-to-live (TTL) can expire old entries after a period. It can't know that a returns policy changed five minutes after an answer was stored. The release bundle provides a stronger invalidation hook: if policy evidence or answer behavior changes, the release or corpus version changes and old entries are no longer in scope.
```python
policy_update = replace(
    stable_scope,
    release_id="delivery-evidence-answerer@sha256:7a12policy",
    corpus_version="returns-policy-2026-05",
)

same_question = Request(seed_answer.source_query)
old_release_decision = decide_candidate(
    same_question, seed_answer, stable_scope, 1.00, 0.98
)
new_release_decision = decide_candidate(
    same_question, seed_answer, policy_update, 1.00, 0.98
)

print(f"old_release={old_release_decision}")
print(f"new_policy_release={new_release_decision}")
print(f"new_release_must_generate={new_release_decision != 'SEMANTIC_HIT'}")
```

```text
old_release=SEMANTIC_HIT
new_policy_release=MISS_SCOPE_CHANGED
new_release_must_generate=True
```

This is why cache identity should inherit the release identity from deployment. Eviction can clean up storage later; correctness shouldn't depend on eviction finishing first.
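The gap between a TTL and scope-based invalidation can be shown side by side. This is a minimal sketch under stated assumptions: the `Entry` type, the one-hour TTL, and the two freshness helpers are hypothetical and simplified to a single scope field.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    corpus_version: str
    stored_at: float  # seconds since an arbitrary starting point


def fresh_by_ttl(entry: Entry, now: float, ttl_seconds: float) -> bool:
    # A TTL only knows how old the entry is.
    return now - entry.stored_at < ttl_seconds


def fresh_by_scope(entry: Entry, active_corpus_version: str) -> bool:
    # A scope check knows whether the evidence behind the entry is still active.
    return entry.corpus_version == active_corpus_version


entry = Entry(corpus_version="returns-policy-2026-04", stored_at=0.0)
now = 300.0                        # five minutes after the answer was stored
active = "returns-policy-2026-05"  # the policy changed in the meantime

print(f"ttl_says_fresh={fresh_by_ttl(entry, now, ttl_seconds=3600.0)}")
print(f"scope_says_fresh={fresh_by_scope(entry, active)}")
```

With these values the TTL check still reports the entry as fresh, while the scope check rejects it as soon as the active corpus version changes.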
Serving a semantic hit immediately turns a retrieval mistake into a user-visible wrong answer. Shadow mode runs the lookup decision but still serves the normal fresh path. Reviewers then label whether each proposed reuse would have been acceptable.
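A good cache metric separates two questions:

- How often would the cache propose reuse? This is the proposal rate: proposed reuses divided by all replayed requests.
- When it proposes reuse, how often is that reuse acceptable? This is precision: accepted reuses divided by proposed reuses.

High proposal rate without high precision is a cheaper system that is wrong more often.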
The labeled fixture below contains public-policy paraphrases, a subtle opened-item exception, and ineligible requests. Thresholds are specific to this fixture and embedding setup; don't copy them into production.
```python
@dataclass(frozen=True)
class ShadowProbe:
    name: str
    score: float
    eligible: bool
    acceptable_reuse: bool

shadow_probes = [
    ShadowProbe("return window paraphrase", 0.995, True, True),
    ShadowProbe("send unused item back", 0.989, True, True),
    ShadowProbe("refund window wording", 0.982, True, True),
    ShadowProbe("policy FAQ reworded", 0.981, True, True),
    ShadowProbe("opened-item exception", 0.965, True, False),
    ShadowProbe("live order state", 0.999, False, False),
    ShadowProbe("create label action", 0.997, False, False),
]

def replay_at(threshold: float) -> dict[str, float | int]:
    proposed = [
        probe for probe in shadow_probes
        if probe.eligible and probe.score >= threshold
    ]
    accepted = [probe for probe in proposed if probe.acceptable_reuse]
    precision = len(accepted) / len(proposed) if proposed else 1.0
    return {
        "proposed": len(proposed),
        "accepted": len(accepted),
        "precision": precision,
        "proposal_rate": len(proposed) / len(shadow_probes),
    }

for threshold in [0.960, 0.980, 0.990]:
    metrics = replay_at(threshold)
    print(
        f"threshold={threshold:.3f} "
        f"proposed={metrics['proposed']} "
        f"precision={metrics['precision']:.1%} "
        f"proposal_rate={metrics['proposal_rate']:.1%}"
    )

selected_threshold = 0.980
```

```text
threshold=0.960 proposed=5 precision=80.0% proposal_rate=71.4%
threshold=0.980 proposed=4 precision=100.0% proposal_rate=57.1%
threshold=0.990 proposed=1 precision=100.0% proposal_rate=14.3%
```
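One way to turn the replay into a selection rule is to take the lowest candidate threshold that still meets a precision floor, which keeps the proposal rate as high as the floor allows. The sketch below is an assumption about how that rule might look, not the lab's method; `precision_at` and `pick_threshold` are hypothetical helpers, and the scores mirror the eligible probes in the fixture above.

```python
from typing import Optional

# (score, acceptable_reuse) pairs for eligible probes only.
eligible = [
    (0.995, True),
    (0.989, True),
    (0.982, True),
    (0.981, True),
    (0.965, False),
]


def precision_at(threshold: float) -> Optional[float]:
    proposed = [ok for score, ok in eligible if score >= threshold]
    if not proposed:
        return None  # no proposals means no evidence either way
    return sum(proposed) / len(proposed)


def pick_threshold(candidates: list[float], minimum_precision: float) -> Optional[float]:
    """Lowest candidate that meets the precision floor, or None if none does."""
    for threshold in sorted(candidates):
        precision = precision_at(threshold)
        if precision is not None and precision >= minimum_precision:
            return threshold
    return None


print(pick_threshold([0.960, 0.980, 0.990], minimum_precision=0.99))
```

Unlike `replay_at`, this sketch returns `None` instead of a precision of 1.0 when nothing is proposed, so a threshold that proposes nothing can't pass the floor by default.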
Every semantic lookup incurs work, even on a miss: embedding the request, searching an index, and recording metrics. Evaluate cost only after the precision gate passes.
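Let:

- $N$ be the number of requests in the period,
- $c_{\text{gen}}$ be the cost of one fresh generation,
- $c_{\text{lookup}}$ be the cost of one semantic lookup, paid on every request,
- $h$ be the safe-hit fraction, the share of requests served by acceptable reuse.

If a hit skips fresh generation, expected period savings are:

$$
\text{savings} = N \, c_{\text{gen}} - N \left( c_{\text{lookup}} + (1 - h) \, c_{\text{gen}} \right) = N \left( h \, c_{\text{gen}} - c_{\text{lookup}} \right)
$$

Savings are positive only when $h > c_{\text{lookup}} / c_{\text{gen}}$.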
These quantities must come from the workload and model you plan to operate. The next example uses clearly labeled measurement fixtures, not provider prices.
```python
shadow_metrics = replay_at(selected_threshold)
requests_per_day = 10_000
fresh_generation_usd = 0.0040  # measured fixture: average full answer cost
semantic_lookup_usd = 0.00008  # measured fixture: embed + index lookup
safe_hit_fraction = shadow_metrics["proposal_rate"]

without_cache = requests_per_day * fresh_generation_usd
with_cache = requests_per_day * (
    semantic_lookup_usd + (1 - safe_hit_fraction) * fresh_generation_usd
)
savings = without_cache - with_cache
break_even_hit_fraction = semantic_lookup_usd / fresh_generation_usd

print(f"safe_hit_fraction={safe_hit_fraction:.1%}")
print(f"break_even_hit_fraction={break_even_hit_fraction:.1%}")
print(f"daily_savings_fixture_usd={savings:.2f}")
print(f"savings_positive={savings > 0}")
```

```text
safe_hit_fraction=57.1%
break_even_hit_fraction=2.0%
daily_savings_fixture_usd=22.06
savings_positive=True
```

This calculation intentionally omits any guessed list price. Measure generation and lookup cost for the actual release and traffic distribution, then repeat the gate when either changes.
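To see why the break-even fraction matters, the same arithmetic can be swept across a few hit fractions. This is a sketch that reuses the fixture costs from the example above; `daily_savings` is a hypothetical helper, and the chosen hit fractions are arbitrary illustration points, not measurements.

```python
def daily_savings(requests, generation_usd, lookup_usd, hit_fraction):
    """Savings when every request pays for lookup and only hits skip generation."""
    return requests * (hit_fraction * generation_usd - lookup_usd)


# Fixture costs copied from the lab example; these are not provider prices.
REQUESTS = 10_000
GENERATION_USD = 0.0040
LOOKUP_USD = 0.00008

for hit_fraction in (0.01, 0.05, 0.10):
    savings = daily_savings(REQUESTS, GENERATION_USD, LOOKUP_USD, hit_fraction)
    print(f"hit_fraction={hit_fraction:.0%} daily_savings_usd={savings:+.2f}")
```

With these fixture costs the break-even hit fraction is 2%, so the 1% case loses money on lookup overhead while the 5% and 10% cases come out ahead.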
The safe outcome isn't "turn on semantic caching for all support." It is "turn on semantic reuse for the public-policy scope that passed shadow evidence." Order status and write actions remain bypassed.
```python
@dataclass(frozen=True)
class CachePromotionGate:
    minimum_precision: float
    minimum_daily_savings_usd: float
    required_scope: ReleaseScope

gate = CachePromotionGate(
    minimum_precision=0.99,
    minimum_daily_savings_usd=5.00,
    required_scope=stable_scope,
)

passes_quality = shadow_metrics["precision"] >= gate.minimum_precision
passes_economics = savings >= gate.minimum_daily_savings_usd
passes_scope = seed_answer.scope == gate.required_scope
decision = (
    "PROMOTE_PUBLIC_POLICY_SEMANTIC_CACHE"
    if passes_quality and passes_economics and passes_scope
    else "KEEP_SHADOW_ONLY"
)

print(f"quality_gate={passes_quality}")
print(f"economics_gate={passes_economics}")
print(f"scope_gate={passes_scope}")
print(f"cache_decision={decision}")
```

```text
quality_gate=True
economics_gate=True
scope_gate=True
cache_decision=PROMOTE_PUBLIC_POLICY_SEMANTIC_CACHE
```
Once the cache is serving, traces must answer: which release generated the cached response, which cache policy reused it, and why a request bypassed reuse? Without those fields, wrong-hit incidents become difficult to reconstruct.
```python
def trace_decision(request: Request, score: float) -> dict[str, str | float]:
    cache_decision = decide_candidate(
        request, seed_answer, stable_scope, score, selected_threshold
    )
    return {
        "request": request.text,
        "release_id": stable_scope.release_id,
        "corpus_version": stable_scope.corpus_version,
        "tenant_id": request.tenant_id,
        "access_scope": request.access_scope,
        "cache_policy": "public-policy-semantic-v1",
        "answer_id": seed_answer.answer_id if cache_decision == "SEMANTIC_HIT" else "",
        "decision": cache_decision,
        "score": score,
    }

hit_trace = trace_decision(policy_paraphrase, policy_score)
bypass_trace = trace_decision(live_order, 1.00)

traced_scope = (
    hit_trace["corpus_version"] == stable_scope.corpus_version
    and hit_trace["access_scope"] == stable_scope.access_scope
)

print(f"hit_decision={hit_trace['decision']} answer_id={hit_trace['answer_id']}")
print(f"bypass_decision={bypass_trace['decision']}")
print(f"traced_release={hit_trace['release_id'] == stable_scope.release_id}")
print(f"traced_scope={traced_scope}")
```

```text
hit_decision=SEMANTIC_HIT answer_id=ans_returns_unused_30d
bypass_decision=BYPASS_DYNAMIC_OR_WRITE
traced_release=True
traced_scope=True
```

Watch production for accepted-hit review failures, user retries after cache hits, scope bypass volume, latency, and realized saved generation. An incident should be able to disable this cache policy pointer without changing the production release that generates fresh responses.
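Keeping the cache policy pointer separate from the release pointer might look like the sketch below. The `ServingPointers` type and `cache_lookup_allowed` are hypothetical names introduced for illustration; the lab does not define a serving configuration.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ServingPointers:
    release_id: str              # release that generates fresh answers
    cache_policy: Optional[str]  # None means semantic reuse is disabled


def cache_lookup_allowed(pointers: ServingPointers) -> bool:
    return pointers.cache_policy is not None


live = ServingPointers(
    release_id="delivery-evidence-answerer@sha256:df2d4fe7b0c5",
    cache_policy="public-policy-semantic-v1",
)
# Incident response: clear only the cache policy pointer.
incident = replace(live, cache_policy=None)

print(f"cache_enabled_before={cache_lookup_allowed(live)}")
print(f"cache_enabled_after={cache_lookup_allowed(incident)}")
print(f"release_unchanged={incident.release_id == live.release_id}")
```

Because the two pointers are independent fields, disabling reuse falls back to the fresh path under the same release rather than forcing a rollback.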
The cache in this lab can return a stored answer for a paraphrase and skip generation. Provider prompt caching operates at a different layer. For example, OpenAI documents prompt caching that matches repeated prompt prefixes once a prompt reaches a minimum token length and reduces the cost of reprocessing that input; the model still computes a new output.[3]
| Layer | Matches | Result on hit | Main correctness risk |
|---|---|---|---|
| Exact response cache | Same scoped request key | Return stored answer, skip generation | Stale or incomplete scope key |
| Semantic response cache | Similar eligible question under same contract | Return stored answer, skip generation | False semantic reuse |
| Provider prompt cache | Matching input prefix under provider rules | Compute a new answer with cheaper/faster repeated input work | Missed cost opportunity, not stored-answer substitution |
The distinction determines the evaluation: semantic answer caching needs labeled reuse precision; prompt-prefix caching needs token and latency accounting. The next chapter expands that economics.
```python
@dataclass(frozen=True)
class ReuseCase:
    name: str
    semantic_answer_hit: bool
    repeated_prefix_hit: bool

cases = [
    ReuseCase(
        name="paraphrased public return question",
        semantic_answer_hit=True,
        repeated_prefix_hit=False,
    ),
    ReuseCase(
        name="new live order question after same long instructions",
        semantic_answer_hit=False,
        repeated_prefix_hit=True,
    ),
]

for case in cases:
    print(
        f"{case.name}: "
        f"skip_generation={case.semantic_answer_hit}, "
        f"reuse_input_work={case.repeated_prefix_hit}"
    )

print("next_measure_token_economics=True")
```

```text
paraphrased public return question: skip_generation=True, reuse_input_work=False
new live order question after same long instructions: skip_generation=False, reuse_input_work=True
next_measure_token_economics=True
```

Exercise: apply `response_schema="json-citations-v3"` to a new scope and prove an old natural-language answer can't hit it.

Symptom: Hit rate rises while users report answers for a nearby but different policy case. Cause: Threshold was loosened without accepted-reuse labels. Fix: Run shadow replay, gate on precision, and exclude classes where a near match is unsafe.
Symptom: Customer sees stale tracking data or a workflow appears completed when no action ran. Cause: Cache eligibility was treated as a similarity decision. Fix: Bypass live-state and side-effect requests before vector lookup can produce a servable hit.
Symptom: A new return rule is live, but responses still quote the previous rule. Cause: Cache records weren't scoped to release and evidence version. Fix: Include release and corpus identifiers in lookup filters; treat a version change as an immediate miss.
Symptom: Quality remains stable, yet overall request cost or latency gets worse. Cause: Embedding and index lookup costs exceed saved generations. Fix: Measure lookup overhead and safe-hit fraction for the target traffic before promoting.
Symptom: One incorrect fresh answer gets repeated across several paraphrased questions. Cause: The write path admitted every generated answer directly into the servable cache. Fix: Treat cache admission as write authorization. Admit only validated response classes with recorded evidence; quarantine or review new records before reuse.
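Treating admission as write authorization can be sketched as a gate on the write path. The names here are assumptions made for illustration: `GeneratedAnswer`, its `validated` flag, the `ADMISSIBLE_CLASSES` allow-list, the evidence ID, and the three decision strings are not defined anywhere in the lab.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneratedAnswer:
    response_class: str
    validated: bool               # hypothetical: set by a review or offline check
    evidence_ids: tuple[str, ...]


# Hypothetical allow-list of response classes that may ever be cached.
ADMISSIBLE_CLASSES = {"public-policy"}


def admission_decision(answer: GeneratedAnswer) -> str:
    if answer.response_class not in ADMISSIBLE_CLASSES:
        return "REJECT_CLASS_NOT_CACHEABLE"
    if not answer.validated or not answer.evidence_ids:
        return "QUARANTINE_FOR_REVIEW"
    return "ADMIT_SERVABLE"


validated = GeneratedAnswer("public-policy", True, ("doc-1",))
unreviewed = GeneratedAnswer("public-policy", False, ("doc-1",))
order_state = GeneratedAnswer("customer-order", True, ("doc-2",))

for answer in (validated, unreviewed, order_state):
    print(f"{answer.response_class} validated={answer.validated} -> {admission_decision(answer)}")
```

Only the first record would become servable; the unreviewed one waits in quarantine, and the order-state answer never enters the cache.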