Semantic Caching & Cost Optimization

Reuse stable policy answers across paraphrased questions without crossing release, access, or freshness boundaries; then prove the cache is both safe and worth serving.

The previous lesson promoted an LLM system only by moving a pointer to an immutable release bundle. That discipline matters again here: a cached answer is behavior produced by one particular release, prompt policy, corpus, and access scope.

Suppose the deployed delivery-support assistant repeatedly answers the public question, "How long can I return unused headphones?" Shoppers phrase that question many ways. Reusing a verified answer could save generation work and respond faster. Reusing the answer after a return-policy update, for another tenant, or for a live order-status question could be plainly wrong.

This lesson builds one semantic cache for public policy answers. It will serve nothing until a shadow replay shows safe hits and worthwhile savings.

An answer is reusable only inside its contract

A response cache isn't a memory of generally true sentences. It's a store of outputs generated under a particular contract. The release bundle from the previous chapter gives us most of that contract already.

Request	Can an answer be reused?	Why
"What is the return window for unused headphones?"	Candidate	Public policy answer can remain stable within one policy release.
"How long can I send unused headphones back?"	Candidate	Paraphrase of the same public-policy question, after evaluation.
"Where is order ORD-48192 right now?"	No	Answer depends on live, customer-specific state.
"Create a return label for ORD-48192."	No	The request asks for a side effect, not reusable prose.

For this system, a reusable answer must match all of these fields:

Field	Why it matters
`release_id`	Pins model, prompt, policy logic, and serving behavior.
`corpus_version`	Prevents old policy evidence from surviving a document update.
`tenant_id` and `access_scope`	Prevents one customer's or merchant's information from leaking into another response.
`locale` and `response_schema`	Prevents the right content from appearing in the wrong language or output contract.

Figure: Reuse eligibility map showing public policy answers as candidates while live order status and return-label actions bypass the semantic answer cache. A high-volume request isn't automatically cacheable. Public, read-only policy answers can enter evaluation; live customer state and write actions must bypass answer reuse.

Start the lab with the exact release scope and one response generated by the promoted release.

define-the-reuse-contract.py

from dataclasses import asdict, dataclass, replace
from hashlib import sha256
import json
import math

@dataclass(frozen=True)
class ReleaseScope:
    release_id: str
    corpus_version: str
    tenant_id: str
    access_scope: str
    locale: str
    response_schema: str

@dataclass(frozen=True)
class Request:
    text: str
    tenant_id: str = "shopflow-public"
    access_scope: str = "public-policy"
    locale: str = "en-US"
    response_schema: str = "cited-answer-v2"
    requires_live_data: bool = False
    writes_state: bool = False

@dataclass(frozen=True)
class CachedAnswer:
    answer_id: str
    source_query: str
    response: str
    scope: ReleaseScope

stable_scope = ReleaseScope(
    release_id="delivery-evidence-answerer@sha256:df2d4fe7b0c5",
    corpus_version="returns-policy-2026-04",
    tenant_id="shopflow-public",
    access_scope="public-policy",
    locale="en-US",
    response_schema="cited-answer-v2",
)

seed_answer = CachedAnswer(
    answer_id="ans_returns_unused_30d",
    source_query="What is the return window for unused headphones?",
    response="Unused headphones can be returned within 30 days of delivery.",
    scope=stable_scope,
)

print(f"release_id={stable_scope.release_id}")
print(f"corpus_version={stable_scope.corpus_version}")
print(f"seed_answer={seed_answer.answer_id}")

Output

release_id=delivery-evidence-answerer@sha256:df2d4fe7b0c5
corpus_version=returns-policy-2026-04
seed_answer=ans_returns_unused_30d

For requests already eligible for answer reuse, an ordinary key-value cache can safely reuse normalized exact repeats as long as its key contains the full scope. It can't see through paraphrasing.

The scope fields must come from application-owned route policy and authenticated context, not from raw user text or a model's guess. Unknown response classes should bypass answer reuse. The lab also checks that the caller-selected scope agrees with the request before deriving a key.

exact-cache-respects-release-scope.py

def normalized_text(text: str) -> str:
    return " ".join(text.lower().split())

def request_matches_scope(request: Request, scope: ReleaseScope) -> bool:
    return (
        request.tenant_id == scope.tenant_id
        and request.access_scope == scope.access_scope
        and request.locale == scope.locale
        and request.response_schema == scope.response_schema
    )

def exact_key(scope: ReleaseScope, request: Request) -> str:
    if not request_matches_scope(request, scope):
        raise ValueError("request does not match cache scope")
    payload = {
        "scope": asdict(scope),
        "text": normalized_text(request.text),
    }
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return sha256(encoded).hexdigest()

same_words = Request("What is the return window for unused headphones?")
paraphrase = Request("How long can I send unused headphones back?")
cross_tenant = replace(same_words, tenant_id="merchant-b")
updated_scope = replace(
    stable_scope,
    release_id="delivery-evidence-answerer@sha256:new-policy",
    corpus_version="returns-policy-2026-05",
)

seed_key = exact_key(stable_scope, Request(seed_answer.source_query))
print(f"exact_repeat_hit={exact_key(stable_scope, same_words) == seed_key}")
print(f"paraphrase_exact_hit={exact_key(stable_scope, paraphrase) == seed_key}")
print(f"updated_policy_hit={exact_key(updated_scope, same_words) == seed_key}")
try:
    exact_key(stable_scope, cross_tenant)
except ValueError as error:
    print(f"cross_tenant_rejected={error}")

Output

exact_repeat_hit=True
paraphrase_exact_hit=False
updated_policy_hit=False
cross_tenant_rejected=request does not match cache scope

The exact cache does the correct thing: it refuses a paraphrase and refuses an old answer under a new policy release. Semantic caching should relax only the first refusal, by matching paraphrases. It must not weaken the second.
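The earlier requirement that scope fields come from application-owned route policy and authenticated context can be sketched as follows. This is a minimal illustration, not the lesson's lab code: `AuthContext`, `RoutePolicy`, `ROUTE_POLICIES`, the route names, and `resolve_reuse_scope` are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class AuthContext:
    # Hypothetical: populated by the authentication layer, never by user text.
    tenant_id: str
    locale: str

@dataclass(frozen=True)
class RoutePolicy:
    # Hypothetical: one entry per response class the application owns.
    access_scope: str
    response_schema: str
    answer_reuse_allowed: bool

# Assumed route table; the route names are illustrative only.
ROUTE_POLICIES = {
    "public-policy-faq": RoutePolicy("public-policy", "cited-answer-v2", True),
    "order-status": RoutePolicy("customer-order", "cited-answer-v2", False),
}

def resolve_reuse_scope(route: str, auth: AuthContext) -> Optional[dict]:
    """Return scope fields for answer reuse, or None to bypass the cache."""
    policy = ROUTE_POLICIES.get(route)
    # Unknown routes and routes that forbid reuse both bypass answer reuse.
    if policy is None or not policy.answer_reuse_allowed:
        return None
    return {
        "tenant_id": auth.tenant_id,
        "access_scope": policy.access_scope,
        "locale": auth.locale,
        "response_schema": policy.response_schema,
    }

auth = AuthContext(tenant_id="shopflow-public", locale="en-US")
print(resolve_reuse_scope("public-policy-faq", auth))
print(resolve_reuse_scope("order-status", auth))
print(resolve_reuse_scope("unknown-route", auth))
```

Nothing in this sketch reads the request text, so a user cannot talk their way into a different tenant or access scope.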

Similarity retrieves a candidate, not a truth

A semantic cache embeds a new question, searches stored question embeddings, and proposes a nearby saved answer. Systems such as GPTCache apply that retrieval step before deciding whether to call the LLM at all.^[1] Sentence-BERT showed why this shape works: sentence embeddings can be compared efficiently with cosine similarity for semantic matching tasks.^[2]

For two vectors $a$ and $b$, cosine similarity is:

$$\operatorname{cosine}(a, b) = \frac{a \cdot b}{\lVert a \rVert_2 \lVert b \rVert_2}$$

The numerator measures their aligned components. Dividing by both lengths makes the result compare direction rather than vector magnitude. A high score says two encoded questions are close under this embedding model. It does not say their answers are interchangeable.
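A short check of the "direction rather than magnitude" claim, using made-up two-dimensional vectors rather than real embeddings: scaling one vector by ten leaves the score unchanged, while a perpendicular vector scores zero.

```python
import math

def cosine(left, right):
    # Dot product divided by the product of Euclidean lengths.
    # Assumes neither vector is all zeros.
    dot = sum(a * b for a, b in zip(left, right))
    return dot / (math.hypot(*left) * math.hypot(*right))

a = (3.0, 4.0)
b = (4.0, 3.0)
scaled_b = tuple(10 * x for x in b)   # same direction as b, ten times longer
perpendicular = (-4.0, 3.0)           # dot product with a is zero

print(f"a vs b:             {cosine(a, b):.3f}")
print(f"a vs 10*b:          {cosine(a, scaled_b):.3f}")
print(f"a vs perpendicular: {cosine(a, perpendicular):.3f}")
```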

The tiny vectors below are an instructional fixture, not scores from a commercial embedding model. They let us see the failure mode without downloading a model: an opened-item exception can sit near a general return-window question while still needing a different answer.

similarity-only-proposes-a-candidate.py

fixture_vectors = {
    seed_answer.source_query: (1.00, 0.00, 0.00),
    "How long can I send unused headphones back?": (0.99, 0.04, 0.00),
    "Can I return opened headphones?": (0.94, 0.10, 0.00),
    "Where is order ORD-48192 right now?": (0.00, 0.05, 1.00),
}

def cosine(left: tuple[float, ...], right: tuple[float, ...]) -> float:
    dot = sum(a * b for a, b in zip(left, right))
    left_norm = math.sqrt(sum(value * value for value in left))
    right_norm = math.sqrt(sum(value * value for value in right))
    return dot / (left_norm * right_norm)

seed_vector = fixture_vectors[seed_answer.source_query]
for question in [
    "How long can I send unused headphones back?",
    "Can I return opened headphones?",
    "Where is order ORD-48192 right now?",
]:
    score = cosine(seed_vector, fixture_vectors[question])
    print(f"{question} | score={score:.3f}")

Output

How long can I send unused headphones back? | score=0.999
Can I return opened headphones? | score=0.994
Where is order ORD-48192 right now? | score=0.000

Figure: Release-scoped semantic cache lookup where embedding similarity proposes a policy answer and matching release, corpus, tenant, and access scope permit reuse. Similarity is only the candidate-retrieval step. A semantic hit becomes servable only after release identity, evidence version, access scope, and eligibility checks all pass.
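The retrieval step itself can be sketched as a brute-force nearest-neighbor search that filters by scope before scoring. This is an assumption-laden illustration: `StoredEntry`, `scope_key`, `nearest_in_scope`, and the sample entries are hypothetical, and a production system would typically use a vector index with metadata filters rather than a linear scan.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class StoredEntry:
    answer_id: str
    scope_key: str            # hypothetical: a serialized release scope
    vector: tuple

def cosine(left, right):
    dot = sum(a * b for a, b in zip(left, right))
    return dot / (math.hypot(*left) * math.hypot(*right))

def nearest_in_scope(query_vector, entries, active_scope_key):
    """Return (entry, score) for the closest in-scope entry, or None."""
    best = None
    for entry in entries:
        # Out-of-scope records are never scored, so they can never win.
        if entry.scope_key != active_scope_key:
            continue
        score = cosine(query_vector, entry.vector)
        if best is None or score > best[1]:
            best = (entry, score)
    return best

entries = [
    StoredEntry("ans_returns_unused_30d", "release-a|public", (1.0, 0.0, 0.0)),
    StoredEntry("ans_shipping_times", "release-a|public", (0.0, 1.0, 0.0)),
    # Identical to the query vector, but stored under an older release.
    StoredEntry("ans_returns_old_release", "release-0|public", (0.99, 0.04, 0.0)),
]

query = (0.99, 0.04, 0.0)
candidate = nearest_in_scope(query, entries, "release-a|public")
print(candidate[0].answer_id, round(candidate[1], 3))
print(nearest_in_scope(query, entries, "release-b|public"))
```

The old-release entry would score highest on similarity alone; filtering first keeps it out of the candidate set entirely.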

Eligibility rules run before the score threshold

The assistant should never serve a cached response for live order state or for actions, whatever the similarity score. Even for a public-policy question, a cached answer must come from the same release scope.

This decision procedure checks the non-negotiable rules first. Only an eligible, same-scope request reaches the similarity threshold.

gate-semantic-hits-by-contract.py

def same_scope(request: Request, record: CachedAnswer, active: ReleaseScope) -> bool:
    return (
        record.scope == active
        and request_matches_scope(request, active)
    )

def decide_candidate(
    request: Request,
    record: CachedAnswer,
    active: ReleaseScope,
    score: float,
    threshold: float,
) -> str:
    if request.requires_live_data or request.writes_state:
        return "BYPASS_DYNAMIC_OR_WRITE"
    if not same_scope(request, record, active):
        return "MISS_SCOPE_CHANGED"
    if score < threshold:
        return "MISS_BELOW_THRESHOLD"
    return "SEMANTIC_HIT"

policy_paraphrase = Request("How long can I send unused headphones back?")
live_order = Request(
    "Where is order ORD-48192 right now?",
    access_scope="customer-order",
    requires_live_data=True,
)
label_action = Request(
    "Create a return label for ORD-48192.",
    access_scope="customer-order",
    writes_state=True,
)

policy_score = cosine(seed_vector, fixture_vectors[policy_paraphrase.text])
print(decide_candidate(policy_paraphrase, seed_answer, stable_scope, policy_score, 0.98))
print(decide_candidate(live_order, seed_answer, stable_scope, 1.00, 0.98))
print(decide_candidate(label_action, seed_answer, stable_scope, 1.00, 0.98))

Output

SEMANTIC_HIT
BYPASS_DYNAMIC_OR_WRITE
BYPASS_DYNAMIC_OR_WRITE
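Running eligibility first also means ineligible requests never pay for an embedding. The sketch below counts embedding calls to make that visible; `CountingEmbedder` and `lookup` are hypothetical helpers, and the returned vector is a placeholder rather than a real embedding.

```python
from dataclasses import dataclass

@dataclass
class CountingEmbedder:
    calls: int = 0

    def embed(self, text: str) -> tuple:
        # Placeholder vector; a real embedder would call a model here.
        self.calls += 1
        return (float(len(text)), 1.0)

def lookup(text, requires_live_data, writes_state, embedder):
    # Eligibility is decided from request flags before any vector work.
    if requires_live_data or writes_state:
        return "BYPASS_DYNAMIC_OR_WRITE"
    embedder.embed(text)
    return "CANDIDATE_SEARCH"

embedder = CountingEmbedder()
results = [
    lookup("How long can I send unused headphones back?", False, False, embedder),
    lookup("Where is order ORD-48192 right now?", True, False, embedder),
    lookup("Create a return label for ORD-48192.", False, True, embedder),
]
print(results)
print(f"embedding_calls={embedder.calls}")
```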

Version changes invalidate answers without guessing

A time-to-live (TTL) can expire old entries after a period. It can't know that a returns policy changed five minutes after an answer was stored. The release bundle provides a stronger invalidation hook: if policy evidence or answer behavior changes, the release or corpus version changes and old entries are no longer in scope.

invalidate-on-policy-release.py

policy_update = replace(
    stable_scope,
    release_id="delivery-evidence-answerer@sha256:7a12policy",
    corpus_version="returns-policy-2026-05",
)

same_question = Request(seed_answer.source_query)
old_release_decision = decide_candidate(
    same_question, seed_answer, stable_scope, 1.00, 0.98
)
new_release_decision = decide_candidate(
    same_question, seed_answer, policy_update, 1.00, 0.98
)

print(f"old_release={old_release_decision}")
print(f"new_policy_release={new_release_decision}")
print(f"new_release_must_generate={new_release_decision != 'SEMANTIC_HIT'}")

Output

old_release=SEMANTIC_HIT
new_policy_release=MISS_SCOPE_CHANGED
new_release_must_generate=True

This is why cache identity should inherit the release identity from deployment. Eviction can clean up storage later; correctness shouldn't depend on eviction finishing first.
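To make the TTL comparison concrete, here is a small sketch under assumed values: an entry stored five minutes ago with a one-hour TTL still looks fresh by age, while the version check rejects it. `Entry`, `ttl_fresh`, and `version_fresh` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    corpus_version: str
    stored_at: float  # seconds on a hypothetical clock

def ttl_fresh(entry: Entry, now: float, ttl_seconds: float) -> bool:
    # Age-based check: knows nothing about what changed upstream.
    return now - entry.stored_at < ttl_seconds

def version_fresh(entry: Entry, active_corpus_version: str) -> bool:
    # Identity-based check: fails the moment the active version differs.
    return entry.corpus_version == active_corpus_version

entry = Entry(corpus_version="returns-policy-2026-04", stored_at=0.0)
now = 300.0                                  # five minutes after storing
active_version = "returns-policy-2026-05"    # policy updated in the meantime

by_ttl = ttl_fresh(entry, now, ttl_seconds=3600)
by_version = version_fresh(entry, active_version)
print(f"ttl_says_fresh={by_ttl}")
print(f"version_says_fresh={by_version}")
print(f"servable={by_ttl and by_version}")
```

A TTL can still be useful for bounding storage; in this sketch it is simply not the check that protects correctness.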

Choose a threshold in shadow mode

Serving a semantic hit immediately turns a retrieval mistake into a user-visible wrong answer. Shadow mode runs the lookup decision but still serves the normal fresh path. Reviewers then label whether each proposed reuse would have been acceptable.

A good cache metric separates two questions:

Proposal rate: how often would the cache return something?
Hit precision: among proposed hits, how often is answer reuse acceptable?

High proposal rate without high precision is a cheaper system that is wrong more often.

The labeled fixture below contains public-policy paraphrases, a subtle opened-item exception, and ineligible requests. Thresholds are specific to this fixture and embedding setup; don't copy them into production.

calibrate-with-shadow-replay.py

@dataclass(frozen=True)
class ShadowProbe:
    name: str
    score: float
    eligible: bool
    acceptable_reuse: bool

shadow_probes = [
    ShadowProbe("return window paraphrase", 0.995, True, True),
    ShadowProbe("send unused item back", 0.989, True, True),
    ShadowProbe("refund window wording", 0.982, True, True),
    ShadowProbe("policy FAQ reworded", 0.981, True, True),
    ShadowProbe("opened-item exception", 0.965, True, False),
    ShadowProbe("live order state", 0.999, False, False),
    ShadowProbe("create label action", 0.997, False, False),
]

def replay_at(threshold: float) -> dict[str, float | int]:
    proposed = [
        probe for probe in shadow_probes
        if probe.eligible and probe.score >= threshold
    ]
    accepted = [probe for probe in proposed if probe.acceptable_reuse]
    precision = len(accepted) / len(proposed) if proposed else 1.0
    return {
        "proposed": len(proposed),
        "accepted": len(accepted),
        "precision": precision,
        "proposal_rate": len(proposed) / len(shadow_probes),
    }

for threshold in [0.960, 0.980, 0.990]:
    metrics = replay_at(threshold)
    print(
        f"threshold={threshold:.3f} "
        f"proposed={metrics['proposed']} "
        f"precision={metrics['precision']:.1%} "
        f"proposal_rate={metrics['proposal_rate']:.1%}"
    )

selected_threshold = 0.980

Output

threshold=0.960 proposed=5 precision=80.0% proposal_rate=71.4%
threshold=0.980 proposed=4 precision=100.0% proposal_rate=57.1%
threshold=0.990 proposed=1 precision=100.0% proposal_rate=14.3%

Figure: Shadow replay threshold sweep where a loose semantic-cache threshold includes an unacceptable policy exception and a stricter threshold serves fewer but correct reuse proposals. Threshold tuning is an evaluation decision. In this fixture, excluding one near-but-wrong exception matters more than maximizing reuse count.
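One way to turn the sweep into a repeatable rule is to pick the loosest candidate threshold whose precision meets a target. The sketch below reuses the probe scores from the fixture above as plain tuples; `precision_at` and `loosest_passing` are hypothetical helpers, and the 0.99 target is an assumption for illustration.

```python
# (score, eligible, acceptable_reuse), copied from the labeled fixture.
probes = [
    (0.995, True, True),
    (0.989, True, True),
    (0.982, True, True),
    (0.981, True, True),
    (0.965, True, False),
    (0.999, False, False),
    (0.997, False, False),
]

def precision_at(threshold):
    proposed = [p for p in probes if p[1] and p[0] >= threshold]
    if not proposed:
        return None  # no proposals means no evidence either way
    return sum(1 for p in proposed if p[2]) / len(proposed)

def loosest_passing(candidates, minimum_precision):
    # Ascending order: the first threshold that passes is the loosest one.
    for threshold in sorted(candidates):
        precision = precision_at(threshold)
        if precision is not None and precision >= minimum_precision:
            return threshold
    return None

print(loosest_passing([0.960, 0.980, 0.990], minimum_precision=0.99))
print(loosest_passing([0.960], minimum_precision=0.99))
```

Unlike the replay function above, this sketch treats an empty proposal set as missing evidence rather than perfect precision, so a threshold that proposes nothing cannot be selected.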

A safe cache still has to pay for itself

Every semantic lookup incurs work, even on a miss: embedding the request, searching an index, and recording metrics. Evaluate cost only after the precision gate passes.

Let:

$N$ be requests in a measured period.
$C_g$ be average fresh-generation cost per request.
$C_l$ be semantic-lookup cost per request.
$h$ be the observed safe-hit fraction.

If a hit skips fresh generation, expected period savings are:

$$\text{savings} = N \left(h C_g - C_l\right)$$

These quantities must come from the workload and model you plan to operate. The next example uses clearly labeled measurement fixtures, not provider prices.

measure-break-even-savings.py

shadow_metrics = replay_at(selected_threshold)
requests_per_day = 10_000
fresh_generation_usd = 0.0040  # measured fixture: average full answer cost
semantic_lookup_usd = 0.00008  # measured fixture: embed + index lookup
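# Count only reuse that reviewers accepted. This equals proposal_rate
# only when precision is 100%, as it is at the selected threshold.
safe_hit_fraction = shadow_metrics["accepted"] / len(shadow_probes)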

without_cache = requests_per_day * fresh_generation_usd
with_cache = requests_per_day * (
    semantic_lookup_usd + (1 - safe_hit_fraction) * fresh_generation_usd
)
savings = without_cache - with_cache
break_even_hit_fraction = semantic_lookup_usd / fresh_generation_usd

print(f"safe_hit_fraction={safe_hit_fraction:.1%}")
print(f"break_even_hit_fraction={break_even_hit_fraction:.1%}")
print(f"daily_savings_fixture_usd={savings:.2f}")
print(f"savings_positive={savings > 0}")

Output

safe_hit_fraction=57.1%
break_even_hit_fraction=2.0%
daily_savings_fixture_usd=22.06
savings_positive=True

This calculation intentionally omits any guessed list price. Measure generation and lookup cost for the actual release and traffic distribution, then repeat the gate when either changes.
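A quick sensitivity check shows where the sign of the savings flips. The sketch applies the same formula with the same fixture costs across a few hit fractions; `daily_savings` is a hypothetical helper and the hit fractions are illustrative, not measurements.

```python
def daily_savings(requests, hit_fraction, generation_usd, lookup_usd):
    # savings = N * (h * C_g - C_l): every request pays the lookup,
    # only hits avoid a generation.
    return requests * (hit_fraction * generation_usd - lookup_usd)

REQUESTS = 10_000
GENERATION_USD = 0.0040   # fixture value from the lab above
LOOKUP_USD = 0.00008      # fixture value from the lab above

for hit_fraction in [0.00, 0.01, 0.05, 0.25, 4 / 7]:
    savings = daily_savings(REQUESTS, hit_fraction, GENERATION_USD, LOOKUP_USD)
    print(f"hit_fraction={hit_fraction:.1%} savings_usd={savings:+.2f}")
```

Below the 2.0% break-even fraction the cache costs more than it saves, even when every hit is correct.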

Promote only the narrow policy you tested

The safe outcome isn't "turn on semantic caching for all support." It is "turn on semantic reuse for the public-policy scope that passed shadow evidence." Order status and write actions remain bypassed.

make-the-cache-promotion-decision.py

@dataclass(frozen=True)
class CachePromotionGate:
    minimum_precision: float
    minimum_daily_savings_usd: float
    required_scope: ReleaseScope

gate = CachePromotionGate(
    minimum_precision=0.99,
    minimum_daily_savings_usd=5.00,
    required_scope=stable_scope,
)

passes_quality = shadow_metrics["precision"] >= gate.minimum_precision
passes_economics = savings >= gate.minimum_daily_savings_usd
passes_scope = seed_answer.scope == gate.required_scope
decision = (
    "PROMOTE_PUBLIC_POLICY_SEMANTIC_CACHE"
    if passes_quality and passes_economics and passes_scope
    else "KEEP_SHADOW_ONLY"
)

print(f"quality_gate={passes_quality}")
print(f"economics_gate={passes_economics}")
print(f"scope_gate={passes_scope}")
print(f"cache_decision={decision}")

Output

quality_gate=True
economics_gate=True
scope_gate=True
cache_decision=PROMOTE_PUBLIC_POLICY_SEMANTIC_CACHE

Figure: Semantic-cache delivery path moving from immutable release scope through exact lookup, shadow-only semantic evaluation, quality and savings gates, and limited public-policy promotion. Deployment is a measured policy decision. Only the response class tested in shadow mode is promoted; dynamic customer data and side effects remain outside the answer cache.

Record why each request hit or bypassed

Once the cache is serving, traces must answer: which release generated the cached response, which cache policy reused it, and why a request bypassed reuse? Without those fields, wrong-hit incidents become difficult to reconstruct.

emit-cache-decision-traces.py

def trace_decision(request: Request, score: float) -> dict[str, str | float]:
    cache_decision = decide_candidate(
        request, seed_answer, stable_scope, score, selected_threshold
    )
    return {
        "request": request.text,
        "release_id": stable_scope.release_id,
        "corpus_version": stable_scope.corpus_version,
        "tenant_id": request.tenant_id,
        "access_scope": request.access_scope,
        "cache_policy": "public-policy-semantic-v1",
        "answer_id": seed_answer.answer_id if cache_decision == "SEMANTIC_HIT" else "",
        "decision": cache_decision,
        "score": score,
    }

hit_trace = trace_decision(policy_paraphrase, policy_score)
bypass_trace = trace_decision(live_order, 1.00)

print(f"hit_decision={hit_trace['decision']} answer_id={hit_trace['answer_id']}")
print(f"bypass_decision={bypass_trace['decision']}")
print(f"traced_release={hit_trace['release_id'] == stable_scope.release_id}")
print(f"traced_scope={hit_trace['corpus_version'] == stable_scope.corpus_version and hit_trace['access_scope'] == stable_scope.access_scope}")

Output

hit_decision=SEMANTIC_HIT answer_id=ans_returns_unused_30d
bypass_decision=BYPASS_DYNAMIC_OR_WRITE
traced_release=True
traced_scope=True

Watch production for accepted-hit review failures, user retries after cache hits, scope bypass volume, latency, and realized saved generation. An incident should be able to disable this cache policy pointer without changing the production release that generates fresh responses.
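The rollback property can be sketched by keeping the cache policy as a pointer separate from the release. In this hypothetical `ServingConfig`, clearing `cache_policy` turns semantic hits back into fresh generation while `release_id` stays untouched; the field and function names are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ServingConfig:
    release_id: str
    cache_policy: Optional[str]  # None means semantic reuse is disabled

def serve(config: ServingConfig, cache_decision: str) -> str:
    # A cached answer is served only if a policy is active AND the gate hit.
    if config.cache_policy is not None and cache_decision == "SEMANTIC_HIT":
        return "SERVE_CACHED"
    return "GENERATE_FRESH"

config = ServingConfig(
    release_id="delivery-evidence-answerer@sha256:df2d4fe7b0c5",
    cache_policy="public-policy-semantic-v1",
)
before = serve(config, "SEMANTIC_HIT")

release_before = config.release_id
config.cache_policy = None            # incident response: disable reuse only
after = serve(config, "SEMANTIC_HIT")

print(f"before_rollback={before}")
print(f"after_rollback={after}")
print(f"release_unchanged={config.release_id == release_before}")
```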

Semantic response caching isn't prompt-prefix caching

The cache in this lab can return a stored answer for a paraphrase and skip generation. Provider prompt caching operates at a different layer. For example, OpenAI's documented prompt caching matches repeated prompt prefixes once a prompt reaches a documented minimum token length, which reduces repeated input processing; the model still computes a new output.^[3]

Layer	Matches	Result on hit	Main correctness risk
Exact response cache	Same scoped request key	Return stored answer, skip generation	Stale or incomplete scope key
Semantic response cache	Similar eligible question under same contract	Return stored answer, skip generation	False semantic reuse
Provider prompt cache	Matching input prefix under provider rules	Compute a new answer with cheaper/faster repeated input work	Missed cost opportunity, not stored-answer substitution

The distinction determines the evaluation: semantic answer caching needs labeled reuse precision; prompt-prefix caching needs token and latency accounting. The next chapter expands on those economics.

separate-answer-reuse-from-prefix-reuse.py

@dataclass(frozen=True)
class ReuseCase:
    name: str
    semantic_answer_hit: bool
    repeated_prefix_hit: bool

cases = [
    ReuseCase(
        name="paraphrased public return question",
        semantic_answer_hit=True,
        repeated_prefix_hit=False,
    ),
    ReuseCase(
        name="new live order question after same long instructions",
        semantic_answer_hit=False,
        repeated_prefix_hit=True,
    ),
]

for case in cases:
    print(
        f"{case.name}: "
        f"skip_generation={case.semantic_answer_hit}, "
        f"reuse_input_work={case.repeated_prefix_hit}"
    )

print("next_measure_token_economics=True")

Output

paraphrased public return question: skip_generation=True, reuse_input_work=False
new live order question after same long instructions: skip_generation=False, reuse_input_work=True
next_measure_token_economics=True
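Because prefix reuse depends on matching leading input, prompt layout matters. The rough sketch below counts how many leading tokens two prompts share when the static instructions come first versus last. Whitespace-split words stand in for tokens here; real providers apply their own tokenizers and matching rules, so treat this only as an intuition aid. `shared_prefix_length` and the sample instructions are hypothetical.

```python
def shared_prefix_length(first, second):
    """Count leading items that are identical in both sequences."""
    count = 0
    for left, right in zip(first, second):
        if left != right:
            break
        count += 1
    return count

# Stand-in for long, stable system instructions.
instructions = "You are a delivery support assistant . Cite policy documents .".split()
question_one = "Where is order ORD-48192 right now ?".split()
question_two = "Create a return label for ORD-48192 .".split()

static_first = shared_prefix_length(
    instructions + question_one, instructions + question_two
)
dynamic_first = shared_prefix_length(
    question_one + instructions, question_two + instructions
)

print(f"static_first_shared_tokens={static_first}")
print(f"dynamic_first_shared_tokens={dynamic_first}")
```

With the stable instructions first, the whole instruction block is shared; with the varying question first, nothing is.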

Mastery check

Key concepts

A cached answer belongs to an immutable release scope, not only to question text.
Exact response caches catch identical scoped requests; semantic caches retrieve answer candidates across paraphrases.
Similarity is evidence for candidate retrieval, never permission to ignore access, freshness, or side-effect boundaries.
New corpus or release identity naturally invalidates old answer reuse.
Shadow-mode precision and measured break-even savings are promotion gates.
Servable cache writes need a validated admission path; don't let unreviewed answers become reusable records.
Provider prompt caching reuses repeated input processing, while semantic response caching can skip generation.

Practice tasks

Add response_schema="json-citations-v3" to a new scope and prove an old natural-language answer can't hit it.
Add one public-policy question that scores as highly similar to the seed question but requires a different answer. Re-run the threshold sweep and explain the new selected threshold.
Replace the fixture costs with measured numbers for a workload you control. Compute the hit fraction required to break even.
Add a cache-policy rollback event that turns semantic hits back into misses while keeping fresh generation on the same release.

Evaluation rubric

Foundational: Explains why a paraphrase misses an exact cache and why similarity can propose reuse.
Foundational: Identifies live data and write actions as ineligible for response reuse before considering score.
Intermediate: Builds a cache key or metadata filter that includes release, corpus, tenant, access, locale, and schema scope.
Intermediate: Calibrates a threshold in shadow mode using accepted-hit precision rather than raw hit rate.
Advanced: Computes measured break-even savings and promotes only the response class proved safe and worthwhile.
Advanced: Distinguishes semantic stored-answer reuse from provider prefix-computation reuse and chooses evidence for each.

Common pitfalls

Optimizing hit count instead of correct reuse

Symptom: Hit rate rises while users report answers for a nearby but different policy case.
Cause: Threshold was loosened without accepted-reuse labels.
Fix: Run shadow replay, gate on precision, and exclude classes where a near match is unsafe.

Caching dynamic or write requests

Symptom: Customer sees stale tracking data or a workflow appears completed when no action ran.
Cause: Cache eligibility was treated as a similarity decision.
Fix: Bypass live-state and side-effect requests before vector lookup can produce a servable hit.

Keeping entries across a policy release

Symptom: A new return rule is live, but responses still quote the previous rule.
Cause: Cache records weren't scoped to release and evidence version.
Fix: Include release and corpus identifiers in lookup filters; treat a version change as an immediate miss.

Proving safety but not value

Symptom: Quality remains stable, yet overall request cost or latency gets worse.
Cause: Embedding and index lookup costs exceed saved generations.
Fix: Measure lookup overhead and safe-hit fraction for the target traffic before promoting.

Unreviewed answers enter the reusable store

Symptom: One incorrect fresh answer gets repeated across several paraphrased questions.
Cause: The write path admitted every generated answer directly into the servable cache.
Fix: Treat cache admission as write authorization. Admit only validated response classes with recorded evidence; quarantine or review new records before reuse.
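A validated admission path might look like the sketch below. It is an illustration under stated assumptions: `AnswerStore`, `ADMISSIBLE_CLASSES`, the `validated` flag, and the sample answer ids are hypothetical, and what counts as validation (citation checks, human review, evaluation scores) is left to the application.

```python
from dataclasses import dataclass, field

# Assumed allowlist: only response classes that passed shadow evaluation.
ADMISSIBLE_CLASSES = {"public-policy"}

@dataclass
class AnswerStore:
    servable: dict = field(default_factory=dict)
    quarantined: dict = field(default_factory=dict)

    def admit(self, answer_id, response, response_class, validated):
        # Both conditions must hold before an answer becomes reusable.
        if response_class in ADMISSIBLE_CLASSES and validated:
            self.servable[answer_id] = response
            return "SERVABLE"
        # Everything else is kept for review but can never be served.
        self.quarantined[answer_id] = response
        return "QUARANTINED"

store = AnswerStore()
outcomes = [
    store.admit("ans_returns_unused_30d", "Unused headphones can be returned "
                "within 30 days of delivery.", "public-policy", validated=True),
    store.admit("ans_unreviewed", "Draft policy answer.", "public-policy",
                validated=False),
    store.admit("ans_order_status", "Order status text.", "customer-order",
                validated=True),
]
print(outcomes)
print(sorted(store.servable))
```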

Next Step

Continue to LLM Cost Engineering & Token Economics

You can now decide whether an answer is safe to reuse under one evaluated release. Next you will measure the token, model, prefix-cache, and routing costs of requests that still require generation.

References

[1] Bang, Fu (2023). GPTCache: An Open-Source Semantic Cache for LLM Applications Enabling Faster Answers and Cost Savings. NLP-OSS 2023.

[2] Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP 2019.

[3] OpenAI (2026). Prompt caching.

Back to Topics

LearnApplied LLM EngineeringSemantic Caching & Cost Optimization

🚀MediumInference Optimization

Semantic Caching & Cost Optimization

Reuse stable policy answers across paraphrased questions without crossing release, access, or freshness boundaries; then prove the cache is both safe and worth serving.

15 min read

Learning path

Step 72 of 155 in the full curriculum

Model Versioning & Deployment LLM Cost Engineering & Token Economics

Semantic Caching & Cost Optimization

This lesson builds one semantic cache for public policy answers. It will serve nothing until a shadow replay shows safe hits and worthwhile savings.

An answer is reusable only inside its contract

Request	Can an answer be reused?	Why
"What is the return window for unused headphones?"	Candidate	Public policy answer can remain stable within one policy release.
"How long can I send unused headphones back?"	Candidate	Paraphrase of the same public-policy question, after evaluation.
"Where is order ORD-48192 right now?"	No	Answer depends on live, customer-specific state.
"Create a return label for ORD-48192."	No	The request asks for a side effect, not reusable prose.

For this system, a reusable answer must match all of these fields:

Field	Why it matters
`release_id`	Pins model, prompt, policy logic, and serving behavior.
`corpus_version`	Prevents old policy evidence from surviving a document update.
`tenant_id` and `access_scope`	Prevents one customer's or merchant's information from leaking into another response.
`locale` and `response_schema`	Prevents the right content from appearing in the wrong language or output contract.

Start the lab with the exact release scope and one response generated by the promoted release.

define-the-reuse-contract.py

from dataclasses import asdict, dataclass, replace
from hashlib import sha256
import json
import math

@dataclass(frozen=True)
class ReleaseScope:
    release_id: str
    corpus_version: str
    tenant_id: str
    access_scope: str
    locale: str
    response_schema: str

@dataclass(frozen=True)
class Request:
    text: str
    tenant_id: str = "shopflow-public"
    access_scope: str = "public-policy"
    locale: str = "en-US"
    response_schema: str = "cited-answer-v2"
    requires_live_data: bool = False
    writes_state: bool = False

@dataclass(frozen=True)
class CachedAnswer:
    answer_id: str
    source_query: str
    response: str
    scope: ReleaseScope

stable_scope = ReleaseScope(
    release_id="delivery-evidence-answerer@sha256:df2d4fe7b0c5",
    corpus_version="returns-policy-2026-04",
    tenant_id="shopflow-public",
    access_scope="public-policy",
    locale="en-US",
    response_schema="cited-answer-v2",
)

seed_answer = CachedAnswer(
    answer_id="ans_returns_unused_30d",
    source_query="What is the return window for unused headphones?",
    response="Unused headphones can be returned within 30 days of delivery.",
    scope=stable_scope,
)

print(f"release_id={stable_scope.release_id}")
print(f"corpus_version={stable_scope.corpus_version}")
print(f"seed_answer={seed_answer.answer_id}")

Output

release_id=delivery-evidence-answerer@sha256:df2d4fe7b0c5
corpus_version=returns-policy-2026-04
seed_answer=ans_returns_unused_30d

For requests already eligible for answer reuse, an ordinary key-value cache can safely reuse normalized exact repeats as long as its key contains the full scope. It can't see through paraphrasing.

exact-cache-respects-release-scope.py

def normalized_text(text: str) -> str:
    return " ".join(text.lower().split())

def request_matches_scope(request: Request, scope: ReleaseScope) -> bool:
    return (
        request.tenant_id == scope.tenant_id
        and request.access_scope == scope.access_scope
        and request.locale == scope.locale
        and request.response_schema == scope.response_schema
    )

def exact_key(scope: ReleaseScope, request: Request) -> str:
    if not request_matches_scope(request, scope):
        raise ValueError("request does not match cache scope")
    payload = {
        "scope": asdict(scope),
        "text": normalized_text(request.text),
    }
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return sha256(encoded).hexdigest()

same_words = Request("What is the return window for unused headphones?")
paraphrase = Request("How long can I send unused headphones back?")
cross_tenant = replace(same_words, tenant_id="merchant-b")
updated_scope = replace(
    stable_scope,
    release_id="delivery-evidence-answerer@sha256:new-policy",
    corpus_version="returns-policy-2026-05",
)

seed_key = exact_key(stable_scope, Request(seed_answer.source_query))
print(f"exact_repeat_hit={exact_key(stable_scope, same_words) == seed_key}")
print(f"paraphrase_exact_hit={exact_key(stable_scope, paraphrase) == seed_key}")
print(f"updated_policy_hit={exact_key(updated_scope, same_words) == seed_key}")
try:
    exact_key(stable_scope, cross_tenant)
except ValueError as error:
    print(f"cross_tenant_rejected={error}")

Output

exact_repeat_hit=True
paraphrase_exact_hit=False
updated_policy_hit=False
cross_tenant_rejected=request does not match cache scope

The exact cache does the correct thing: it refuses a paraphrase and refuses an old answer under a new policy release. Semantic caching adds only the first capability. It must not weaken the second.

Similarity retrieves a candidate, not a truth

For two vectors $a$ and $b$ , cosine similarity is:

\operatorname{cosine}(a, b) = \frac{a \cdot b}{\lVert a \rVert_2 \lVert b \rVert_2}

similarity-only-proposes-a-candidate.py

fixture_vectors = {
    seed_answer.source_query: (1.00, 0.00, 0.00),
    "How long can I send unused headphones back?": (0.99, 0.04, 0.00),
    "Can I return opened headphones?": (0.94, 0.10, 0.00),
    "Where is order ORD-48192 right now?": (0.00, 0.05, 1.00),
}

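def cosine(left: tuple[float, ...], right: tuple[float, ...]) -> float:
    # Cosine similarity is only defined for equal-length, non-zero vectors.
    if len(left) != len(right):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(left, right))
    left_norm = math.sqrt(sum(value * value for value in left))
    right_norm = math.sqrt(sum(value * value for value in right))
    if left_norm == 0.0 or right_norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (left_norm * right_norm)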

seed_vector = fixture_vectors[seed_answer.source_query]
for question in [
    "How long can I send unused headphones back?",
    "Can I return opened headphones?",
    "Where is order ORD-48192 right now?",
]:
    score = cosine(seed_vector, fixture_vectors[question])
    print(f"{question} | score={score:.3f}")

Output

How long can I send unused headphones back? | score=0.999
Can I return opened headphones? | score=0.994
Where is order ORD-48192 right now? | score=0.000
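The scores above show why similarity alone cannot authorize reuse. As a minimal sketch, assume a naive rule that serves any candidate scoring at or above 0.98 (a threshold assumed here purely for illustration) and reuse the lesson's hand-written fixture vectors. The opened-headphones question clears that bar even though a policy for opened items may call for a different answer than the unused-item answer. The names `naive_hits` and `NAIVE_THRESHOLD` are hypothetical.

```python
import math

# Fixture vectors copied from the lesson; they are hand-written, not real embeddings.
SEED = (1.00, 0.00, 0.00)  # vector of the seed question
CANDIDATES = {
    "How long can I send unused headphones back?": (0.99, 0.04, 0.00),
    "Can I return opened headphones?": (0.94, 0.10, 0.00),
    "Where is order ORD-48192 right now?": (0.00, 0.05, 1.00),
}
NAIVE_THRESHOLD = 0.98  # hypothetical value for illustration


def cosine(left, right):
    dot = sum(a * b for a, b in zip(left, right))
    return dot / (math.hypot(*left) * math.hypot(*right))


def naive_hits(threshold):
    """Return the questions a score-only rule would serve from cache."""
    return [q for q, v in CANDIDATES.items() if cosine(SEED, v) >= threshold]


for question in naive_hits(NAIVE_THRESHOLD):
    print(f"score-only rule would reuse the seed answer for: {question}")
```

Both the paraphrase and the opened-item question come back, which is why the score can only nominate a candidate for further checks.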

Eligibility rules run before the score threshold

The assistant must not serve cached responses for live order state or for actions, at any threshold. Even for a public-policy question, a cached answer must come from the same release scope.

This decision procedure checks the non-negotiable rules first. Only an eligible, same-scope request reaches the similarity threshold.

gate-semantic-hits-by-contract.py

def same_scope(request: Request, record: CachedAnswer, active: ReleaseScope) -> bool:
    return (
        record.scope == active
        and request_matches_scope(request, active)
    )

def decide_candidate(
    request: Request,
    record: CachedAnswer,
    active: ReleaseScope,
    score: float,
    threshold: float,
) -> str:
    if request.requires_live_data or request.writes_state:
        return "BYPASS_DYNAMIC_OR_WRITE"
    if not same_scope(request, record, active):
        return "MISS_SCOPE_CHANGED"
    if score < threshold:
        return "MISS_BELOW_THRESHOLD"
    return "SEMANTIC_HIT"

policy_paraphrase = Request("How long can I send unused headphones back?")
live_order = Request(
    "Where is order ORD-48192 right now?",
    access_scope="customer-order",
    requires_live_data=True,
)
label_action = Request(
    "Create a return label for ORD-48192.",
    access_scope="customer-order",
    writes_state=True,
)

policy_score = cosine(seed_vector, fixture_vectors[policy_paraphrase.text])
print(decide_candidate(policy_paraphrase, seed_answer, stable_scope, policy_score, 0.98))
print(decide_candidate(live_order, seed_answer, stable_scope, 1.00, 0.98))
print(decide_candidate(label_action, seed_answer, stable_scope, 1.00, 0.98))

Output

SEMANTIC_HIT
BYPASS_DYNAMIC_OR_WRITE
BYPASS_DYNAMIC_OR_WRITE
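With more than one cached record, the same ordering applies to retrieval: filter by scope first, then rank by similarity. The sketch below is one way to do that under stated assumptions. `Record`, `best_candidate`, the tuple-based scope, and all identifier values are hypothetical stand-ins, not the lesson's `CachedAnswer` or `ReleaseScope` types. A production vector index would more likely apply the filter as a metadata predicate than as a Python loop.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:  # hypothetical stand-in for a cached answer
    answer_id: str
    scope: tuple[str, str, str]  # (release_id, corpus_version, tenant_id)
    vector: tuple[float, ...]


def cosine(left, right):
    dot = sum(a * b for a, b in zip(left, right))
    return dot / (math.hypot(*left) * math.hypot(*right))


def best_candidate(query_vector, active_scope, records, threshold):
    """Return (record, score) for the best same-scope match, or None."""
    # Scope filter first: out-of-scope records are never scored.
    in_scope = [r for r in records if r.scope == active_scope]
    scored = [(r, cosine(query_vector, r.vector)) for r in in_scope]
    eligible = [pair for pair in scored if pair[1] >= threshold]
    return max(eligible, key=lambda pair: pair[1], default=None)


active = ("release-a", "policy-2026-04", "merchant-a")
records = [
    Record("ans_current", active, (0.99, 0.04, 0.00)),
    # Identical vector to the query, but stored for another tenant.
    Record("ans_other_tenant", ("release-a", "policy-2026-04", "merchant-b"), (1.0, 0.0, 0.0)),
]

match = best_candidate((1.0, 0.0, 0.0), active, records, threshold=0.98)
print(match[0].answer_id if match else "MISS")
```

The other tenant's record has the highest possible similarity to the query, yet it is never a candidate because it is removed before scoring.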

Version changes invalidate answers without guessing

invalidate-on-policy-release.py

policy_update = replace(
    stable_scope,
    release_id="delivery-evidence-answerer@sha256:7a12policy",
    corpus_version="returns-policy-2026-05",
)

same_question = Request(seed_answer.source_query)
old_release_decision = decide_candidate(
    same_question, seed_answer, stable_scope, 1.00, 0.98
)
new_release_decision = decide_candidate(
    same_question, seed_answer, policy_update, 1.00, 0.98
)

print(f"old_release={old_release_decision}")
print(f"new_policy_release={new_release_decision}")
print(f"new_release_must_generate={new_release_decision != 'SEMANTIC_HIT'}")

Output

old_release=SEMANTIC_HIT
new_policy_release=MISS_SCOPE_CHANGED
new_release_must_generate=True

This is why cache identity should inherit the release identity from deployment. Eviction can clean up storage later; correctness shouldn't depend on eviction finishing first.
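One way to keep correctness independent of eviction is to make the scope part of the lookup key, so entries from an old release become unreachable the moment the active scope changes. The following is a minimal sketch using a hypothetical in-memory `ScopedStore`; the class, its method names, and the string scope identifiers are assumptions for illustration only.

```python
class ScopedStore:
    """Hypothetical in-memory store keyed by (scope_id, question_key)."""

    def __init__(self):
        self._entries = {}

    def put(self, scope_id, question_key, answer):
        self._entries[(scope_id, question_key)] = answer

    def get(self, active_scope_id, question_key):
        # Only the active scope is consulted, so stale entries cannot match.
        return self._entries.get((active_scope_id, question_key))

    def evict_except(self, active_scope_id):
        """Background cleanup: reclaim storage held by inactive scopes."""
        stale = [key for key in self._entries if key[0] != active_scope_id]
        for key in stale:
            del self._entries[key]
        return len(stale)


store = ScopedStore()
store.put("release-old", "return-window", "old policy answer")

# The release pointer moves before any eviction has run.
active = "release-new"
before = store.get(active, "return-window")
evicted = store.evict_except(active)
after = store.get(active, "return-window")

print(f"before_eviction={before}")
print(f"evicted={evicted}")
print(f"after_eviction={after}")
```

The lookup misses both before and after cleanup, so eviction affects only storage, not which answer is served.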

Choose a threshold in shadow mode

A good cache metric separates two questions:

Proposal rate: how often would the cache return something?
Hit precision: among proposed hits, how often is answer reuse acceptable?

A high proposal rate without high precision produces a cheaper system that is wrong more often.

calibrate-with-shadow-replay.py

@dataclass(frozen=True)
class ShadowProbe:
    name: str
    score: float
    eligible: bool
    acceptable_reuse: bool

shadow_probes = [
    ShadowProbe("return window paraphrase", 0.995, True, True),
    ShadowProbe("send unused item back", 0.989, True, True),
    ShadowProbe("refund window wording", 0.982, True, True),
    ShadowProbe("policy FAQ reworded", 0.981, True, True),
    ShadowProbe("opened-item exception", 0.965, True, False),
    ShadowProbe("live order state", 0.999, False, False),
    ShadowProbe("create label action", 0.997, False, False),
]

def replay_at(threshold: float) -> dict[str, float | int]:
    proposed = [
        probe for probe in shadow_probes
        if probe.eligible and probe.score >= threshold
    ]
    accepted = [probe for probe in proposed if probe.acceptable_reuse]
    precision = len(accepted) / len(proposed) if proposed else 1.0
    return {
        "proposed": len(proposed),
        "accepted": len(accepted),
        "precision": precision,
        "proposal_rate": len(proposed) / len(shadow_probes),
    }

for threshold in [0.960, 0.980, 0.990]:
    metrics = replay_at(threshold)
    print(
        f"threshold={threshold:.3f} "
        f"proposed={metrics['proposed']} "
        f"precision={metrics['precision']:.1%} "
        f"proposal_rate={metrics['proposal_rate']:.1%}"
    )

selected_threshold = 0.980

Output

threshold=0.960 proposed=5 precision=80.0% proposal_rate=71.4%
threshold=0.980 proposed=4 precision=100.0% proposal_rate=57.1%
threshold=0.990 proposed=1 precision=100.0% proposal_rate=14.3%
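The sweep above picks 0.980 by inspection. A small helper can make that choice explicit: among the candidate thresholds, keep those that meet a precision floor on the labeled probes and take the lowest, since a lower threshold proposes more hits. This is a sketch under assumptions: `pick_threshold`, `precision_at`, and the `(score, acceptable)` tuples are hypothetical, and only the eligible probes are included. With this few probes the measured precision is a weak estimate, so a real replay would need far more labeled traffic.

```python
# (similarity score, reuse judged acceptable) for the eligible probes in the lesson.
ELIGIBLE_PROBES = [
    (0.995, True),
    (0.989, True),
    (0.982, True),
    (0.981, True),
    (0.965, False),
]


def precision_at(threshold, probes):
    """Return (precision, proposed_count); precision is None with no proposals."""
    proposed = [ok for score, ok in probes if score >= threshold]
    if not proposed:
        return None, 0
    return sum(proposed) / len(proposed), len(proposed)


def pick_threshold(candidates, probes, min_precision):
    """Lowest candidate threshold whose measured precision meets the floor."""
    for threshold in sorted(candidates):
        precision, _ = precision_at(threshold, probes)
        if precision is not None and precision >= min_precision:
            return threshold
    return None  # nothing qualifies: keep the cache in shadow mode


chosen = pick_threshold([0.960, 0.980, 0.990], ELIGIBLE_PROBES, min_precision=0.99)
print(f"chosen_threshold={chosen}")
```

Unlike `replay_at`, this helper reports no precision when nothing is proposed, so an empty result cannot pass the floor by default.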

A safe cache still has to pay for itself

Every semantic lookup incurs work, even on a miss: embedding the request, searching an index, and recording metrics. Evaluate cost only after the precision gate passes.

Let:

$N$ be requests in a measured period.
$C_g$ be average fresh-generation cost per request.
$C_l$ be semantic-lookup cost per request.
$h$ be the observed safe-hit fraction.

If a hit skips fresh generation, expected period savings are:

$$\text{savings} = N \left(h C_g - C_l\right)$$
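The formula follows from comparing the period cost with and without the cache, assuming every request pays the lookup cost and only misses pay for generation:

```latex
\begin{aligned}
\text{cost}_{\text{without}} &= N C_g \\
\text{cost}_{\text{with}} &= N \left( C_l + (1 - h)\, C_g \right) \\
\text{savings} &= \text{cost}_{\text{without}} - \text{cost}_{\text{with}}
  = N \left( h C_g - C_l \right) \\
\text{savings} > 0 &\iff h > \frac{C_l}{C_g} \qquad (N > 0,\ C_g > 0)
\end{aligned}
```

The ratio $C_l / C_g$ is the break-even hit fraction: below it, the lookup overhead costs more than the hits save.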

These quantities must come from the workload and model you plan to operate. The next example uses clearly labeled measurement fixtures, not provider prices.

measure-break-even-savings.py

shadow_metrics = replay_at(selected_threshold)
requests_per_day = 10_000
fresh_generation_usd = 0.0040  # measured fixture: average full answer cost
semantic_lookup_usd = 0.00008  # measured fixture: embed + index lookup
# Count only accepted hits; this equals proposal_rate when precision is 100%.
safe_hit_fraction = shadow_metrics["accepted"] / len(shadow_probes)

without_cache = requests_per_day * fresh_generation_usd
with_cache = requests_per_day * (
    semantic_lookup_usd + (1 - safe_hit_fraction) * fresh_generation_usd
)
savings = without_cache - with_cache
break_even_hit_fraction = semantic_lookup_usd / fresh_generation_usd

print(f"safe_hit_fraction={safe_hit_fraction:.1%}")
print(f"break_even_hit_fraction={break_even_hit_fraction:.1%}")
print(f"daily_savings_fixture_usd={savings:.2f}")
print(f"savings_positive={savings > 0}")

Output

safe_hit_fraction=57.1%
break_even_hit_fraction=2.0%
daily_savings_fixture_usd=22.06
savings_positive=True

This calculation intentionally omits any guessed list price. Measure generation and lookup cost for the actual release and traffic distribution, then repeat the gate when either changes.
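To see why the gate has to be repeated when costs or traffic change, the sketch below re-evaluates the same formula for three scenarios. The function names are hypothetical, and every number is a placeholder, not a measurement: the first row reuses the lesson's fixture values and the other two are invented to show the effect of cheaper generation and of a lower safe-hit fraction.

```python
def daily_savings(requests, generation_usd, lookup_usd, safe_hit_fraction):
    """Savings = N * (h * C_g - C_l), the formula used in this section."""
    return requests * (safe_hit_fraction * generation_usd - lookup_usd)


def break_even(generation_usd, lookup_usd):
    """Hit fraction at which lookup overhead equals generation saved."""
    return lookup_usd / generation_usd


# name -> (C_g, C_l, h); all values are illustrative placeholders.
scenarios = {
    "lesson fixture": (0.0040, 0.00008, 4 / 7),
    "cheaper generation": (0.0004, 0.00008, 4 / 7),
    "fewer safe hits": (0.0004, 0.00008, 0.15),
}

for name, (c_g, c_l, h) in scenarios.items():
    savings = daily_savings(10_000, c_g, c_l, h)
    print(
        f"{name}: break_even={break_even(c_g, c_l):.1%} "
        f"daily_savings_usd={savings:.2f}"
    )
```

In the third scenario the hit fraction falls below break-even, so the cache costs more than it saves even though nothing about its safety changed.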

Promote only the narrow policy you tested

make-the-cache-promotion-decision.py

@dataclass(frozen=True)
class CachePromotionGate:
    minimum_precision: float
    minimum_daily_savings_usd: float
    required_scope: ReleaseScope

gate = CachePromotionGate(
    minimum_precision=0.99,
    minimum_daily_savings_usd=5.00,
    required_scope=stable_scope,
)

passes_quality = shadow_metrics["precision"] >= gate.minimum_precision
passes_economics = savings >= gate.minimum_daily_savings_usd
passes_scope = seed_answer.scope == gate.required_scope
decision = (
    "PROMOTE_PUBLIC_POLICY_SEMANTIC_CACHE"
    if passes_quality and passes_economics and passes_scope
    else "KEEP_SHADOW_ONLY"
)

print(f"quality_gate={passes_quality}")
print(f"economics_gate={passes_economics}")
print(f"scope_gate={passes_scope}")
print(f"cache_decision={decision}")

Output

quality_gate=True
economics_gate=True
scope_gate=True
cache_decision=PROMOTE_PUBLIC_POLICY_SEMANTIC_CACHE

Record why each request hit or bypassed

emit-cache-decision-traces.py

def trace_decision(request: Request, score: float) -> dict[str, str | float]:
    cache_decision = decide_candidate(
        request, seed_answer, stable_scope, score, selected_threshold
    )
    return {
        "request": request.text,
        "release_id": stable_scope.release_id,
        "corpus_version": stable_scope.corpus_version,
        "tenant_id": request.tenant_id,
        "access_scope": request.access_scope,
        "cache_policy": "public-policy-semantic-v1",
        "answer_id": seed_answer.answer_id if cache_decision == "SEMANTIC_HIT" else "",
        "decision": cache_decision,
        "score": score,
    }

hit_trace = trace_decision(policy_paraphrase, policy_score)
bypass_trace = trace_decision(live_order, 1.00)

print(f"hit_decision={hit_trace['decision']} answer_id={hit_trace['answer_id']}")
print(f"bypass_decision={bypass_trace['decision']}")
print(f"traced_release={hit_trace['release_id'] == stable_scope.release_id}")
print(f"traced_scope={hit_trace['corpus_version'] == stable_scope.corpus_version and hit_trace['access_scope'] == stable_scope.access_scope}")

Output

hit_decision=SEMANTIC_HIT answer_id=ans_returns_unused_30d
bypass_decision=BYPASS_DYNAMIC_OR_WRITE
traced_release=True
traced_scope=True
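Traces become useful once they are aggregated. As a sketch, assuming trace dictionaries that carry the `decision`, `release_id`, and `answer_id` fields shown above (the sample values and the `summarize` helper are made up for illustration), two counters can show how decisions split per release and which stored answer is served most often.

```python
from collections import Counter

# Hypothetical trace records shaped like the dictionaries emitted above.
traces = [
    {"decision": "SEMANTIC_HIT", "release_id": "release-a", "answer_id": "ans_1"},
    {"decision": "SEMANTIC_HIT", "release_id": "release-a", "answer_id": "ans_1"},
    {"decision": "BYPASS_DYNAMIC_OR_WRITE", "release_id": "release-a", "answer_id": ""},
    {"decision": "MISS_BELOW_THRESHOLD", "release_id": "release-a", "answer_id": ""},
]


def summarize(records):
    """Count decisions per release and served answers per answer_id."""
    by_decision = Counter((r["release_id"], r["decision"]) for r in records)
    by_answer = Counter(r["answer_id"] for r in records if r["answer_id"])
    return by_decision, by_answer


by_decision, by_answer = summarize(traces)
for (release_id, decision), count in sorted(by_decision.items()):
    print(f"{release_id} {decision} count={count}")
print(f"most_served_answer={by_answer.most_common(1)[0]}")
```

If a stored answer is later found to be wrong, the per-`answer_id` count indicates how many responses it affected and under which release.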

Semantic response caching isn't prompt-prefix caching

Layer	Matches	Result on hit	Main correctness risk
Exact response cache	Same scoped request key	Return stored answer, skip generation	Stale or incomplete scope key
Semantic response cache	Similar eligible question under same contract	Return stored answer, skip generation	False semantic reuse
Provider prompt cache	Matching input prefix under provider rules	Compute a new answer with cheaper/faster repeated input work	Missed cost opportunity, not stored-answer substitution

The distinction determines the evaluation: semantic answer caching needs labeled reuse precision; prompt-prefix caching needs token and latency accounting. The next chapter expands on that token and latency economics.

separate-answer-reuse-from-prefix-reuse.py

@dataclass(frozen=True)
class ReuseCase:
    name: str
    semantic_answer_hit: bool
    repeated_prefix_hit: bool

cases = [
    ReuseCase(
        name="paraphrased public return question",
        semantic_answer_hit=True,
        repeated_prefix_hit=False,
    ),
    ReuseCase(
        name="new live order question after same long instructions",
        semantic_answer_hit=False,
        repeated_prefix_hit=True,
    ),
]

for case in cases:
    print(
        f"{case.name}: "
        f"skip_generation={case.semantic_answer_hit}, "
        f"reuse_input_work={case.repeated_prefix_hit}"
    )

print("next_measure_token_economics=True")

Output

paraphrased public return question: skip_generation=True, reuse_input_work=False
new live order question after same long instructions: skip_generation=False, reuse_input_work=True
next_measure_token_economics=True

Mastery check

Key concepts

A cached answer belongs to an immutable release scope, not only to question text.
Exact response caches catch identical scoped requests; semantic caches retrieve answer candidates across paraphrases.
Similarity is evidence for candidate retrieval, never permission to ignore access, freshness, or side-effect boundaries.
A new corpus or release identity invalidates reuse of old answers by itself, without waiting for eviction.
Shadow-mode precision and measured break-even savings are promotion gates.
Servable cache writes need a validated admission path; don't let unreviewed answers become reusable records.
Provider prompt caching reuses repeated input processing, while semantic response caching can skip generation.

Practice tasks

Add response_schema="json-citations-v3" to a new scope and prove an old natural-language answer can't hit it.
Add one public-policy question that scores as highly similar to the seed question but requires a different answer. Re-run the threshold sweep and explain the new selected threshold.
Replace the fixture costs with measured numbers for a workload you control. Compute the hit fraction required to break even.
Add a cache-policy rollback event that turns semantic hits back into misses while keeping fresh generation on the same release.

Evaluation rubric

Foundational: Explains why a paraphrase misses an exact cache and why similarity can propose reuse.
Foundational: Identifies live data and write actions as ineligible for response reuse before considering score.
Intermediate: Builds a cache key or metadata filter that includes release, corpus, tenant, access, locale, and schema scope.
Intermediate: Calibrates a threshold in shadow mode using accepted-hit precision rather than raw hit rate.
Advanced: Computes measured break-even savings and promotes only the response class proved safe and worthwhile.
Advanced: Distinguishes semantic stored-answer reuse from provider prefix-computation reuse and chooses evidence for each.

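Self-check questions

Why doesn't a TTL alone solve policy staleness?
Your shadow replay gets perfect precision but proposes one hit in a million requests. Is the cache ready to promote?
Why should an answer-cache trace include release_id and answer_id?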

Common pitfalls

Optimizing hit count instead of correct reuse

Caching dynamic or write requests

Keeping entries across a policy release

Proving safety but not value

Unreviewed answers enter the reusable store

Next Step

Continue to LLM Cost Engineering & Token Economics

You can now decide whether an answer is safe to reuse under one evaluated release. Next you will measure the token, model, prefix-cache, and routing costs of requests that still require generation.
