Build an auditable LLM cost ledger from usage traces, cache decisions, output contracts, offline batch work, and release budget gates.
Your ShopFlow team just promoted a semantic answer cache for its large language model (LLM) support assistant. Safe public return-policy hits can now return an evaluated answer without generating new text. That doesn't make the bill disappear. Live order checks, exception cases, and cache misses still invoke a model, and nobody can defend the next release with "it should be cheaper."
This lesson builds the missing artifact: a cost ledger for one evaluated support release. It starts from provider-reported usage, accounts for two very different forms of caching, tests output savings against the answer contract, defers only eligible offline work, and produces a promotion decision. The following chapter will turn that budget contract into gateway routing and fallback policy. We won't build a router here.
Cost engineering begins after a request decision is known:
| Decision | What executes? | What belongs in this ledger? |
|---|---|---|
| Semantic answer hit from the previous lesson | Stored evaluated answer returned | No generation charge; record avoided generation separately |
| Generated answer with provider prompt-cache hit | Model still produces a new answer | Fresh input, cached input, and output token charges |
| Generated answer without prompt-cache hit | Model reads full input and produces a new answer | Fresh input and output token charges |
| Offline evaluation batch | Model runs later, outside interactive path | Batch-eligible token spend and completion contract |
This boundary matters. A semantic answer cache can skip generation. A provider prompt cache reuses matching prefix work but still generates a new output. Treating both as "cache hits" hides both correctness risk and spend.
Provider pricing is configuration data. For this lab, the teaching fixture snapshots OpenAI's GPT-5.4 short-context rates from May 31, 2026: standard input costs $2.50 per million tokens, standard cached input costs $0.25 per million tokens, and standard output costs $15.00 per million tokens. OpenAI's pricing page publishes Batch rates as a separate row, so the ledger stores those values explicitly too. This dated fixture teaches the ledger shape; it isn't a permanent price promise. Refresh every production rate card from current provider pricing documentation before forecasting a release.[1]
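Because the rate card is a dated snapshot, a forecast can check the snapshot's age before trusting it. The sketch below is one possible guard, not part of the lesson's ledger. The 30-day limit, the function names, and the convention of ending `rate_card_id` with an ISO date are all assumptions made for illustration.

```python
from datetime import date

# Hypothetical policy: refuse to forecast from a rate card older than this.
MAX_RATE_CARD_AGE_DAYS = 30


def snapshot_date_from_id(rate_card_id: str) -> date:
    """Read the trailing YYYY-MM-DD from an id (assumed naming convention)."""
    return date.fromisoformat(rate_card_id[-10:])


def rate_card_is_stale(
    rate_card_id: str,
    today: date,
    max_age_days: int = MAX_RATE_CARD_AGE_DAYS,
) -> bool:
    """Return True when the snapshot is older than the allowed age."""
    age_days = (today - snapshot_date_from_id(rate_card_id)).days
    return age_days > max_age_days


fixture_id = "openai-gpt-5.4-short-context-2026-05-31"
print(rate_card_is_stale(fixture_id, date(2026, 6, 15)))
print(rate_card_is_stale(fixture_id, date(2026, 8, 1)))
```

A stale result should block the forecast until the rates are refreshed from the provider's pricing documentation, rather than silently reusing old numbers.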
The provider-reported usage, not a character-count estimate, is the source of truth after a model call. A provider can expose total input tokens, the cached subset, and output tokens. The fresh-input charge is the total input minus the cached subset.
```python
from collections import defaultdict
from dataclasses import dataclass
from decimal import Decimal, ROUND_HALF_UP

MILLION = Decimal("1000000")
CENT = Decimal("0.01")

def dollars(amount: Decimal) -> Decimal:
    return amount.quantize(CENT, rounding=ROUND_HALF_UP)

@dataclass(frozen=True)
class TokenRates:
    input_per_1m: Decimal
    cached_input_per_1m: Decimal
    output_per_1m: Decimal

@dataclass(frozen=True)
class RateCard:
    rate_card_id: str
    standard: TokenRates
    batch: TokenRates

@dataclass(frozen=True)
class Usage:
    input_tokens: int
    cached_input_tokens: int
    output_tokens: int

    def __post_init__(self) -> None:
        if not 0 <= self.cached_input_tokens <= self.input_tokens:
            raise ValueError("cached input must be a subset of input tokens")
        if min(self.input_tokens, self.output_tokens) < 0:
            raise ValueError("token counts cannot be negative")

rate_card = RateCard(
    rate_card_id="openai-gpt-5.4-short-context-2026-05-31",
    standard=TokenRates(
        input_per_1m=Decimal("2.50"),
        cached_input_per_1m=Decimal("0.25"),
        output_per_1m=Decimal("15.00"),
    ),
    batch=TokenRates(
        input_per_1m=Decimal("1.25"),
        cached_input_per_1m=Decimal("0.13"),
        output_per_1m=Decimal("7.50"),
    ),
)

print(rate_card.rate_card_id)
print(f"standard_input_per_1m=${rate_card.standard.input_per_1m}")
print(f"standard_cached_input_per_1m=${rate_card.standard.cached_input_per_1m}")
print(f"standard_output_per_1m=${rate_card.standard.output_per_1m}")
print(f"batch_input_per_1m=${rate_card.batch.input_per_1m}")
print(f"batch_cached_input_per_1m=${rate_card.batch.cached_input_per_1m}")
print(f"batch_output_per_1m=${rate_card.batch.output_per_1m}")
```

```
openai-gpt-5.4-short-context-2026-05-31
standard_input_per_1m=$2.50
standard_cached_input_per_1m=$0.25
standard_output_per_1m=$15.00
batch_input_per_1m=$1.25
batch_cached_input_per_1m=$0.13
batch_output_per_1m=$7.50
```

ShopFlow's public-policy-answer path uses a long evaluated instruction prefix and cites policy evidence in every generated response. A cold request reports 1,800 input tokens and 180 output tokens. Split each billable category rather than multiplying one blended token count by one price.
```python
@dataclass(frozen=True)
class PricedUsage:
    fresh_input_usd: Decimal
    cached_input_usd: Decimal
    output_usd: Decimal

    @property
    def total_usd(self) -> Decimal:
        return self.fresh_input_usd + self.cached_input_usd + self.output_usd

def price_usage(usage: Usage, card: RateCard = rate_card, *, batch: bool = False) -> PricedUsage:
    rates = card.batch if batch else card.standard
    fresh_input_tokens = usage.input_tokens - usage.cached_input_tokens
    return PricedUsage(
        fresh_input_usd=Decimal(fresh_input_tokens) * rates.input_per_1m / MILLION,
        cached_input_usd=Decimal(usage.cached_input_tokens) * rates.cached_input_per_1m / MILLION,
        output_usd=Decimal(usage.output_tokens) * rates.output_per_1m / MILLION,
    )

cold_policy_answer = Usage(input_tokens=1_800, cached_input_tokens=0, output_tokens=180)
cold_cost = price_usage(cold_policy_answer)

assert cold_cost.total_usd == Decimal("0.0072")
assert rate_card.standard.output_per_1m > rate_card.standard.input_per_1m

print(f"fresh_input=${cold_cost.fresh_input_usd:.6f}")
print(f"output=${cold_cost.output_usd:.6f}")
print(f"total=${cold_cost.total_usd:.6f}")
print(f"output_share={(cold_cost.output_usd / cold_cost.total_usd):.1%}")
```

```
fresh_input=$0.004500
output=$0.002700
total=$0.007200
output_share=37.5%
```

The output token rate is higher in this rate card, although this long-input request still spends more dollars on input overall. That distinction is why both token volume and per-category rate belong in the ledger. Decode is autoregressive and serving systems manage growing KV state during generation; those mechanisms help explain why output can be priced differently, while the current rate card remains the billing authority.[2][1]
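The crossover between input-dominated and output-dominated spend follows directly from the rate ratio. As a rough illustration, assuming the fixture's standard rates and a request with no cached input, the hypothetical helper below computes how many output tokens it takes for output spend to equal fresh-input spend.

```python
from decimal import Decimal

# Standard rates copied from the lesson's dated teaching fixture.
INPUT_PER_1M = Decimal("2.50")
OUTPUT_PER_1M = Decimal("15.00")


def break_even_output_tokens(fresh_input_tokens: int) -> Decimal:
    """Output tokens at which output spend equals fresh-input spend.

    Only valid for uncached requests; cached input is priced separately.
    """
    return Decimal(fresh_input_tokens) * INPUT_PER_1M / OUTPUT_PER_1M


threshold = break_even_output_tokens(1_800)
cold_output_tokens = 180
print(f"break_even_output_tokens={threshold}")
print(f"input_dominates={cold_output_tokens < threshold}")
```

With a 6:1 output-to-input rate ratio, the 1,800-token cold request would need 300 output tokens before output became the larger line item, which is why its 180-token answer still spends more on input.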
OpenAI's documented prompt caching applies automatically for prompts of at least 1,024 tokens, depends on exact matching at the beginning of the prompt, and reports cached prefix tokens in usage.prompt_tokens_details.cached_tokens.[3] That gives the application one actionable design rule: put stable instructions and shared evidence first, then dynamic customer data.
A prefix hit isn't a stored response. The support model still answers the current question and still incurs output spend.
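Before pricing, the reported usage has to be split into billable categories. The sketch below assumes a Chat Completions-style usage payload in which total input and output appear as `prompt_tokens` and `completion_tokens`; only `prompt_tokens_details.cached_tokens` is named in this lesson, so treat the other field names, the `BillableTokens` type, and the helper itself as illustrative assumptions to verify against the endpoint you call.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class BillableTokens:
    fresh_input: int
    cached_input: int
    output: int


def billable_from_payload(usage: Mapping[str, Any]) -> BillableTokens:
    """Split a provider usage payload into the three billable categories."""
    prompt_tokens = int(usage["prompt_tokens"])          # assumed field name
    completion_tokens = int(usage["completion_tokens"])  # assumed field name
    # The details object may be absent or null; treat that as no cached prefix.
    details = usage.get("prompt_tokens_details") or {}
    cached = int(details.get("cached_tokens") or 0)
    if not 0 <= cached <= prompt_tokens:
        raise ValueError("cached tokens must be a subset of prompt tokens")
    return BillableTokens(
        fresh_input=prompt_tokens - cached,
        cached_input=cached,
        output=completion_tokens,
    )


# Hand-written payload mirroring the prefix-hit request in this lesson.
sample = {
    "prompt_tokens": 1800,
    "completion_tokens": 180,
    "prompt_tokens_details": {"cached_tokens": 1280},
}
print(billable_from_payload(sample))
```

Treating a missing details object as zero cached tokens keeps the ledger conservative: it never assumes a discount the provider did not report.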
```python
prefix_cached_policy_answer = Usage(
    input_tokens=1_800,
    cached_input_tokens=1_280,
    output_tokens=180,
)
prefix_cost = price_usage(prefix_cached_policy_answer)
savings = cold_cost.total_usd - prefix_cost.total_usd

assert prefix_cost.total_usd == Decimal("0.004320")
assert prefix_cost.output_usd == cold_cost.output_usd
assert savings == Decimal("0.002880")

print(f"cold_generation=${cold_cost.total_usd:.6f}")
print(f"prefix_cached_generation=${prefix_cost.total_usd:.6f}")
print(f"saved_per_generated_answer=${savings:.6f}")
print(f"still_generated={prefix_cost.output_usd > 0}")
```

```
cold_generation=$0.007200
prefix_cached_generation=$0.004320
saved_per_generated_answer=$0.002880
still_generated=True
```

The preceding lesson promoted semantic response reuse only for stable public-policy answers under one evaluated release scope. This lesson consumes that decision; it doesn't rebuild its index or threshold. A stored-answer hit records zero generated tokens, while a prompt-prefix hit records a newly generated answer at a reduced input cost.
```python
@dataclass(frozen=True)
class RequestOutcome:
    feature: str
    decision: str
    usage: Usage | None
    counterfactual_usage: Usage | None = None

def generated_cost(outcome: RequestOutcome) -> Decimal:
    if outcome.usage is None:
        return Decimal("0")
    return price_usage(outcome.usage).total_usd

semantic_hit = RequestOutcome(
    feature="public-policy-answer",
    decision="SEMANTIC_ANSWER_HIT",
    usage=None,
    counterfactual_usage=prefix_cached_policy_answer,
)
prefix_hit = RequestOutcome(
    feature="public-policy-answer",
    decision="GENERATE_PREFIX_HIT",
    usage=prefix_cached_policy_answer,
)

avoided_generation = price_usage(semantic_hit.counterfactual_usage).total_usd
assert generated_cost(semantic_hit) == Decimal("0")
assert generated_cost(prefix_hit) == Decimal("0.004320")

print(f"semantic_hit_generation=${generated_cost(semantic_hit):.6f}")
print(f"semantic_hit_avoided=${avoided_generation:.6f}")
print(f"prefix_hit_generation=${generated_cost(prefix_hit):.6f}")
```

```
semantic_hit_generation=$0.000000
semantic_hit_avoided=$0.004320
prefix_hit_generation=$0.004320
```
A per-request example is useful, but a release decision needs traffic shape. The replay that follows models one ShopFlow baseline day, before testing a shorter exception answer.
The semantic-hit counterfactual answers "what cost did safe reuse avoid?" Actual spend answers "what appears on this release ledger?"
```python
@dataclass(frozen=True)
class TrafficSlice:
    outcome: RequestOutcome
    requests: int

daily_replay = [
    TrafficSlice(semantic_hit, requests=3_200),
    TrafficSlice(prefix_hit, requests=1_800),
    TrafficSlice(
        RequestOutcome(
            feature="live-order-answer",
            decision="GENERATE_LIVE_DATA",
            usage=Usage(input_tokens=900, cached_input_tokens=0, output_tokens=95),
        ),
        requests=3_000,
    ),
    TrafficSlice(
        RequestOutcome(
            feature="return-exception-answer",
            decision="GENERATE_PREFIX_HIT",
            usage=Usage(input_tokens=2_200, cached_input_tokens=1_280, output_tokens=220),
        ),
        requests=500,
    ),
]

actual_daily_spend = sum(
    generated_cost(slice_.outcome) * slice_.requests
    for slice_ in daily_replay
)
avoided_daily_generation = sum(
    price_usage(slice_.outcome.counterfactual_usage).total_usd * slice_.requests
    for slice_ in daily_replay
    if slice_.outcome.counterfactual_usage is not None
)

assert actual_daily_spend == Decimal("21.761000")
assert avoided_daily_generation == Decimal("13.824000")

print(f"generated_daily_spend=${dollars(actual_daily_spend)}")
print(f"semantic_hit_avoided_daily=${dollars(avoided_daily_generation)}")
print(f"request_count={sum(slice_.requests for slice_ in daily_replay)}")
```

```
generated_daily_spend=$21.76
semantic_hit_avoided_daily=$13.82
request_count=8500
```

Avoided generation is useful evidence, but it isn't a refund and shouldn't be mixed into actual spend. If an answer hit later fails review, the evaluation consequence matters before its apparent savings.
Output reductions can be valuable in a rate card where output is expensive. Blind truncation isn't a cost strategy. ShopFlow's cited support answer must state eligibility, cite the policy evidence, and name the customer's next step. Compare three candidate formats generated on the same input:
| Candidate | Intended change | Safe for automatic support? |
|---|---|---|
| Verbose | Full explanation with repetition | Correct but unnecessarily long |
| Concise cited | One decision, one citation, one next step | Candidate for release |
| Truncated | Hard stop before citation and action | Reject |
```python
@dataclass(frozen=True)
class AnswerCandidate:
    label: str
    output_tokens: int
    states_decision: bool
    cites_policy: bool
    gives_next_step: bool

def satisfies_answer_contract(candidate: AnswerCandidate) -> bool:
    return (
        candidate.states_decision
        and candidate.cites_policy
        and candidate.gives_next_step
    )

candidates = [
    AnswerCandidate("verbose", 220, True, True, True),
    AnswerCandidate("concise_cited", 130, True, True, True),
    AnswerCandidate("truncated", 55, True, False, False),
]
safe_candidates = [candidate for candidate in candidates if satisfies_answer_contract(candidate)]
selected_answer = min(safe_candidates, key=lambda candidate: candidate.output_tokens)

verbose_usage = Usage(input_tokens=2_200, cached_input_tokens=1_280, output_tokens=220)
concise_usage = Usage(input_tokens=2_200, cached_input_tokens=1_280, output_tokens=selected_answer.output_tokens)
saved_per_exception = price_usage(verbose_usage).total_usd - price_usage(concise_usage).total_usd

assert selected_answer.label == "concise_cited"
assert not satisfies_answer_contract(candidates[-1])
assert saved_per_exception == Decimal("0.001350")

print(f"selected={selected_answer.label}")
print(f"rejected={candidates[-1].label}")
print(f"saved_per_exception=${saved_per_exception:.6f}")
print(f"saved_per_replay_day=${dollars(saved_per_exception * 500)}")
```

```
selected=concise_cited
rejected=truncated
saved_per_exception=$0.001350
saved_per_replay_day=$0.68
```

The cheapest candidate is rejected. Cost reductions count only after the output still meets the same correctness and evidence contract.
OpenAI's Batch documentation describes a separate asynchronous workload path with a 50% discount compared with synchronous APIs and a 24-hour completion window.[4] The pricing page publishes Batch token rates explicitly, so the ledger reads that row instead of deriving every category from one multiplier.[1] That path is useful for nightly regression evaluation of the support release. It isn't appropriate for a customer waiting in a live conversation.
```python
@dataclass(frozen=True)
class Workload:
    name: str
    interactive: bool
    deadline_hours: int
    endpoint_supported: bool
    daily_requests: int
    usage: Usage

def batch_eligible(workload: Workload) -> bool:
    return (
        not workload.interactive
        and workload.deadline_hours >= 24
        and workload.endpoint_supported
    )

live_support = Workload(
    name="live-support-answer",
    interactive=True,
    deadline_hours=0,
    endpoint_supported=True,
    daily_requests=3_000,
    usage=Usage(input_tokens=900, cached_input_tokens=0, output_tokens=95),
)
nightly_eval = Workload(
    name="nightly-release-eval",
    interactive=False,
    deadline_hours=24,
    endpoint_supported=True,
    daily_requests=2_000,
    usage=Usage(input_tokens=1_800, cached_input_tokens=1_280, output_tokens=80),
)

nightly_sync_cost = price_usage(nightly_eval.usage).total_usd * nightly_eval.daily_requests
nightly_batch_cost = price_usage(nightly_eval.usage, batch=True).total_usd * nightly_eval.daily_requests

assert not batch_eligible(live_support)
assert batch_eligible(nightly_eval)
assert nightly_batch_cost == Decimal("2.832800")
assert nightly_batch_cost < nightly_sync_cost

print(f"live_support_batchable={batch_eligible(live_support)}")
print(f"nightly_eval_batchable={batch_eligible(nightly_eval)}")
print(f"nightly_eval_sync=${dollars(nightly_sync_cost)}")
print(f"nightly_eval_batch=${dollars(nightly_batch_cost)}")
```

```
live_support_batchable=False
nightly_eval_batchable=True
nightly_eval_sync=$5.64
nightly_eval_batch=$2.83
```

An invoice tells you how much you spent. A release trace tells you why. Each recorded row should name the evaluated release, feature, decision, pricing mode, token usage, and output-contract evidence. Recording only a model name can't distinguish an unsafe answer-cache promotion from a harmless increase in live-order traffic. Make contract evidence mandatory on every trace instead of defaulting it to a pass.
```python
@dataclass(frozen=True)
class LedgerTrace:
    release_id: str
    feature: str
    decision: str
    requests: int
    usage: Usage | None
    contract_passed: bool
    contract_evidence_id: str
    batch: bool = False

def trace_spend(trace: LedgerTrace) -> Decimal:
    if trace.usage is None:
        return Decimal("0")
    return price_usage(trace.usage, batch=trace.batch).total_usd * trace.requests

release_id = "support-release-2026-05-cost-v1"
optimized_replay = [
    TrafficSlice(
        RequestOutcome(
            feature=item.outcome.feature,
            decision=item.outcome.decision,
            usage=concise_usage if item.outcome.feature == "return-exception-answer" else item.outcome.usage,
            counterfactual_usage=item.outcome.counterfactual_usage,
        ),
        requests=item.requests,
    )
    for item in daily_replay
]
traces = [
    LedgerTrace(
        release_id=release_id,
        feature=item.outcome.feature,
        decision=item.outcome.decision,
        requests=item.requests,
        usage=item.outcome.usage,
        contract_passed=True,
        contract_evidence_id=(
            "approved-public-policy-cache@v1"
            if item.outcome.decision == "SEMANTIC_ANSWER_HIT"
            else "cited-support-answer-v3@canary"
        ),
    )
    for item in optimized_replay
]
traces.append(
    LedgerTrace(
        release_id=release_id,
        feature=nightly_eval.name,
        decision="BATCH_OFFLINE_EVAL",
        requests=nightly_eval.daily_requests,
        usage=nightly_eval.usage,
        contract_passed=True,
        contract_evidence_id="nightly-release-eval@batch-eligible",
        batch=True,
    )
)

spend_by_feature: defaultdict[str, Decimal] = defaultdict(lambda: Decimal("0"))
for trace in traces:
    spend_by_feature[trace.feature] += trace_spend(trace)

total_with_eval = sum(spend_by_feature.values())
trace_contracts_complete = all(
    trace.contract_passed and trace.contract_evidence_id
    for trace in traces
)
assert spend_by_feature["public-policy-answer"] == Decimal("7.776000")
assert actual_daily_spend - sum(trace_spend(trace) for trace in traces[:-1]) == Decimal("0.675000")
assert trace_contracts_complete

for feature in sorted(spend_by_feature):
    print(f"{feature}=${dollars(spend_by_feature[feature])}")
print(f"daily_total_with_eval=${dollars(total_with_eval)}")
print(f"trace_contracts_complete={trace_contracts_complete}")
```

```
live-order-answer=$11.03
nightly-release-eval=$2.83
public-policy-answer=$7.78
return-exception-answer=$2.29
daily_total_with_eval=$23.92
trace_contracts_complete=True
```
A cost target without quality is an incentive to produce short, wrong answers. A quality target without cost leaves a production team unable to plan capacity or margins. Require both.
The canary report below evaluates the concise cited-answer policy for return exceptions. The replay's request mix is projected across 30 days, including its nightly offline evaluation run.
```python
@dataclass(frozen=True)
class QualityReport:
    cited_answer_pass_rate: Decimal
    unsafe_cache_hits: int
    evaluated_answers: int

@dataclass(frozen=True)
class PromotionPolicy:
    monthly_budget_usd: Decimal
    minimum_cited_answer_pass_rate: Decimal
    maximum_unsafe_cache_hits: int

quality = QualityReport(
    cited_answer_pass_rate=Decimal("0.997"),
    unsafe_cache_hits=0,
    evaluated_answers=2_000,
)
promotion_policy = PromotionPolicy(
    monthly_budget_usd=Decimal("750.00"),
    minimum_cited_answer_pass_rate=Decimal("0.995"),
    maximum_unsafe_cache_hits=0,
)
monthly_forecast = total_with_eval * Decimal("30")

quality_passed = (
    trace_contracts_complete
    and quality.cited_answer_pass_rate >= promotion_policy.minimum_cited_answer_pass_rate
    and quality.unsafe_cache_hits <= promotion_policy.maximum_unsafe_cache_hits
)
budget_passed = monthly_forecast <= promotion_policy.monthly_budget_usd

assert dollars(monthly_forecast) == Decimal("717.56")
assert quality_passed and budget_passed

print(f"monthly_forecast=${dollars(monthly_forecast)}")
print(f"quality_passed={quality_passed}")
print(f"budget_passed={budget_passed}")
```

```
monthly_forecast=$717.56
quality_passed=True
budget_passed=True
```
The next lesson will implement model gateway routing and fallbacks. Give it a budget contract, not a half-built router embedded in cost analysis. The contract records what a future route must preserve: citation schema, maximum generated-answer spend, and the release evidence behind the limit.
```python
@dataclass(frozen=True)
class GatewayBudgetContract:
    release_id: str
    required_answer_schema: str
    maximum_generated_answer_usd: Decimal
    monthly_forecast_usd: Decimal
    rate_card_id: str
    status: str

max_generated_cost = max(
    price_usage(trace.usage).total_usd
    for trace in traces
    if trace.usage is not None and not trace.batch
)
status = "PROMOTE_COST_POLICY" if quality_passed and budget_passed else "HOLD_RELEASE"
gateway_contract = GatewayBudgetContract(
    release_id=release_id,
    required_answer_schema="cited-support-answer-v3",
    maximum_generated_answer_usd=max_generated_cost,
    monthly_forecast_usd=dollars(monthly_forecast),
    rate_card_id=rate_card.rate_card_id,
    status=status,
)

assert gateway_contract.status == "PROMOTE_COST_POLICY"
assert gateway_contract.maximum_generated_answer_usd == Decimal("0.004570")

print(f"status={gateway_contract.status}")
print(f"required_schema={gateway_contract.required_answer_schema}")
print(f"max_generated_answer=${gateway_contract.maximum_generated_answer_usd:.6f}")
print("next_step=apply_contract_in_gateway_routing")
```

```
status=PROMOTE_COST_POLICY
required_schema=cited-support-answer-v3
max_generated_answer=$0.004570
next_step=apply_contract_in_gateway_routing
```

This release did not become cheaper because of one vague "cache" switch:
| Evidence | Correct conclusion |
|---|---|
| Semantic answer hit with valid release scope | New generation was avoided for an already approved answer class |
| Provider cached_tokens on a generated answer | Repeated prefix processing was charged at the cached-input rate |
| Concise cited-answer candidate passes evaluation | Output reduction is eligible for release |
| Nightly evaluation classified as noninteractive | Batch discount can be modeled without changing customer latency |
| Monthly forecast and quality gate pass | Cost policy can be promoted and handed to gateway design |
Don't infer a pricing benefit from a prompt rewrite until usage traces show cached tokens. Don't count semantic-cache savings unless accepted-hit quality remains valid. Don't approve a shorter output merely because it costs less.
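One way to apply the first rule is to measure the cached share of input tokens in recorded traces before and after a prompt rewrite. The sketch below is illustrative only: the helper name is hypothetical, and the token pairs are made-up trace samples rather than measurements from the ShopFlow replay.

```python
def cached_input_share(traces: list[tuple[int, int]]) -> float:
    """Fraction of input tokens billed as cached.

    Each trace is an (input_tokens, cached_input_tokens) pair taken from
    provider-reported usage.
    """
    total_input = sum(input_tokens for input_tokens, _ in traces)
    total_cached = sum(cached for _, cached in traces)
    return total_cached / total_input if total_input else 0.0


# Made-up samples: no cached tokens before the rewrite, some after it.
before_rewrite = [(1_800, 0), (1_800, 0), (1_800, 0)]
after_rewrite = [(1_800, 0), (1_800, 1_280), (1_800, 1_280)]

print(f"before={cached_input_share(before_rewrite):.1%}")
print(f"after={cached_input_share(after_rewrite):.1%}")
```

A share that stays at zero after the rewrite means the provider did not report a prefix hit, so no cached-input discount should appear in the forecast.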
warranty-claim-answer traffic slice with longer cited output. Decide whether it breaks the monthly budget.
cache_hit metric.
Symptom: Forecast misses invoice even when request count is correct. Cause: Character length or visible response length was treated as provider usage. Fix: Store billable usage categories returned by the provider and price them with a versioned rate card.
Symptom: Team believes generations were skipped, but output spend remains high. Cause: Prefix-cache reuse was confused with stored-answer reuse. Fix: Record separate decisions and cost semantics for each layer.
Symptom: Spend falls while support answers omit evidence or next actions. Cause: Output cap was promoted without evaluating the required answer schema. Fix: Make the quality contract a hard promotion gate.
Symptom: Forecast looks excellent while customer response latency becomes unacceptable. Cause: A discount was modeled without preserving completion-time requirements. Fix: Batch only explicitly asynchronous work such as nightly evaluation.
Symptom: Forecast drifts from provider billing even though token counts are correct. Cause: Standard, Batch, or other processing-mode prices were inferred from one discount instead of loaded from the provider's dated table. Fix: Version explicit rates for every pricing mode used by the release and refresh them from current provider documentation.
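The last symptom can be reproduced with the fixture's own numbers. As a worked sketch, assume a team derives the Batch cached-input rate by halving the standard rate instead of reading the published row; the variable names are illustrative, and the volume comes from the nightly evaluation workload projected over 30 days.

```python
from decimal import Decimal

MILLION = Decimal("1000000")

# Rates from the lesson's dated teaching fixture.
standard_cached_per_1m = Decimal("0.25")
published_batch_cached_per_1m = Decimal("0.13")
# Assumed shortcut: apply one 50% multiplier instead of loading the Batch row.
derived_batch_cached_per_1m = standard_cached_per_1m * Decimal("0.5")

# Nightly evaluation: 1,280 cached tokens x 2,000 requests x 30 days.
monthly_cached_tokens = 1_280 * 2_000 * 30

published_cost = Decimal(monthly_cached_tokens) * published_batch_cached_per_1m / MILLION
derived_cost = Decimal(monthly_cached_tokens) * derived_batch_cached_per_1m / MILLION
drift = published_cost - derived_cost

print(f"derived_rate_per_1m=${derived_batch_cached_per_1m}")
print(f"published_monthly=${published_cost:.3f}")
print(f"derived_monthly=${derived_cost:.3f}")
print(f"monthly_drift=${drift:.3f}")
```

The gap is small at this volume, but it is systematic: it grows with traffic and never shows up as a token-count error, which is why the ledger stores each pricing mode's rates explicitly.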