LLM Cost Engineering & Token Economics

Build an auditable LLM cost ledger from usage traces, cache decisions, output contracts, offline batch work, and release budget gates.

Your ShopFlow team just promoted a semantic answer cache for its large language model (LLM) support assistant. Safe public return-policy hits can now return an evaluated answer without generating new text. That doesn't make the bill disappear. Live order checks, exception cases, and cache misses still invoke a model, and nobody can defend the next release with "it should be cheaper."

This lesson builds the missing artifact: a cost ledger for one evaluated support release. It starts from provider-reported usage, accounts for two very different forms of caching, tests output savings against the answer contract, defers only eligible offline work, and produces a promotion decision. The following chapter will turn that budget contract into gateway routing and fallback policy. We won't build a router here.

The accounting boundary

Cost engineering begins after a request decision is known:

Decision	What executes?	What belongs in this ledger?
Semantic answer hit from the previous lesson	Stored evaluated answer returned	No generation charge; record avoided generation separately
Generated answer with provider prompt-cache hit	Model still produces a new answer	Fresh input, cached input, and output token charges
Generated answer without prompt-cache hit	Model reads full input and produces a new answer	Fresh input and output token charges
Offline evaluation batch	Model runs later, outside interactive path	Batch-eligible token spend and completion contract

This boundary matters. A semantic answer cache can skip generation. A provider prompt cache reuses matching prefix work but still generates a new output. Treating both as "cache hits" hides both correctness risk and spend.

Figure: Four measured cost controls for a support release: semantic answer reuse skips generation, prompt-prefix reuse discounts repeated input, concise cited answers reduce output, and batch queues defer offline evaluation. Start with the request decision already made; each cost control changes a different charge or workload boundary.

Use a dated rate card, not remembered prices

Provider pricing is configuration data. For this lab, the teaching fixture snapshots OpenAI's GPT-5.4 short-context rates from May 31, 2026: standard input costs $2.50 per million tokens, standard cached input costs $0.25 per million tokens, and standard output costs $15.00 per million tokens. OpenAI's pricing page publishes Batch rates as a separate row, so the ledger stores those values explicitly too. This dated fixture teaches the ledger shape; it isn't a permanent price promise. Refresh every production rate card from current provider pricing documentation before forecasting a release.^[1]

Provider-reported usage, not a character-count estimate, is the source of truth after a model call. A provider can expose total input tokens, the cached subset, and output tokens. The fresh-input charge is the total input minus the cached subset.
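
The subtraction can be sketched as follows, assuming a usage payload shaped like the Chat Completions `usage` object (`prompt_tokens`, `completion_tokens`, and `prompt_tokens_details.cached_tokens`). The `split_usage` helper and the sample payload are illustrative names and values, not part of the lesson's ledger.

```python
from typing import Any

def split_usage(payload: dict[str, Any]) -> dict[str, int]:
    """Split a provider usage payload into the three billable categories."""
    total_input = int(payload["prompt_tokens"])
    # Assumption: a missing details object means no cached tokens were reported.
    details = payload.get("prompt_tokens_details") or {}
    cached = int(details.get("cached_tokens", 0))
    if not 0 <= cached <= total_input:
        raise ValueError("cached tokens must be a subset of prompt tokens")
    return {
        "fresh_input_tokens": total_input - cached,
        "cached_input_tokens": cached,
        "output_tokens": int(payload["completion_tokens"]),
    }

# Hypothetical payload for illustration; real values come from the API response.
sample_payload = {
    "prompt_tokens": 1800,
    "completion_tokens": 180,
    "prompt_tokens_details": {"cached_tokens": 1280},
}
categories = split_usage(sample_payload)
print(categories)
```

Keeping the split in one small function means a change in the provider's field names touches a single place instead of every pricing call.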

01-rate-card-and-usage.py

from collections import defaultdict
from dataclasses import dataclass
from decimal import Decimal, ROUND_HALF_UP

MILLION = Decimal("1000000")
CENT = Decimal("0.01")

def dollars(amount: Decimal) -> Decimal:
    return amount.quantize(CENT, rounding=ROUND_HALF_UP)

@dataclass(frozen=True)
class TokenRates:
    input_per_1m: Decimal
    cached_input_per_1m: Decimal
    output_per_1m: Decimal

@dataclass(frozen=True)
class RateCard:
    rate_card_id: str
    standard: TokenRates
    batch: TokenRates

@dataclass(frozen=True)
class Usage:
    input_tokens: int
    cached_input_tokens: int
    output_tokens: int

    def __post_init__(self) -> None:
        # Reject negative counts first so the error message names the real problem.
        if min(self.input_tokens, self.cached_input_tokens, self.output_tokens) < 0:
            raise ValueError("token counts cannot be negative")
        if self.cached_input_tokens > self.input_tokens:
            raise ValueError("cached input must be a subset of input tokens")

rate_card = RateCard(
    rate_card_id="openai-gpt-5.4-short-context-2026-05-31",
    standard=TokenRates(
        input_per_1m=Decimal("2.50"),
        cached_input_per_1m=Decimal("0.25"),
        output_per_1m=Decimal("15.00"),
    ),
    batch=TokenRates(
        input_per_1m=Decimal("1.25"),
        cached_input_per_1m=Decimal("0.13"),
        output_per_1m=Decimal("7.50"),
    ),
)

print(rate_card.rate_card_id)
print(f"standard_input_per_1m=${rate_card.standard.input_per_1m}")
print(f"standard_cached_input_per_1m=${rate_card.standard.cached_input_per_1m}")
print(f"standard_output_per_1m=${rate_card.standard.output_per_1m}")
print(f"batch_input_per_1m=${rate_card.batch.input_per_1m}")
print(f"batch_cached_input_per_1m=${rate_card.batch.cached_input_per_1m}")
print(f"batch_output_per_1m=${rate_card.batch.output_per_1m}")

Output

openai-gpt-5.4-short-context-2026-05-31
standard_input_per_1m=$2.50
standard_cached_input_per_1m=$0.25
standard_output_per_1m=$15.00
batch_input_per_1m=$1.25
batch_cached_input_per_1m=$0.13
batch_output_per_1m=$7.50
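
Because the fixture is dated, a forecast can refuse to run on an old snapshot. The sketch below is one possible guard, not part of the lesson's ledger: the `snapshot_date` field, the `is_stale` helper, and the 30-day default are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class RateCardSnapshot:
    rate_card_id: str
    snapshot_date: date  # hypothetical field: the day the prices were copied

def is_stale(card: RateCardSnapshot, today: date, max_age_days: int = 30) -> bool:
    """Return True when the snapshot is older than the allowed age."""
    return (today - card.snapshot_date).days > max_age_days

fixture = RateCardSnapshot(
    rate_card_id="openai-gpt-5.4-short-context-2026-05-31",
    snapshot_date=date(2026, 5, 31),
)

fresh_check = is_stale(fixture, today=date(2026, 6, 15))  # 15 days after the snapshot
old_check = is_stale(fixture, today=date(2026, 8, 1))     # 62 days after the snapshot
print(f"stale_on_2026_06_15={fresh_check}")
print(f"stale_on_2026_08_01={old_check}")
```

An age check only prompts a refresh. It cannot detect a price change that happens inside the allowed window, so the provider's pricing page stays the authority.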

Price one generated answer

ShopFlow's public-policy-answer path uses a long evaluated instruction prefix and cites policy evidence in every generated response. A cold request reports 1,800 input tokens and 180 output tokens. Split each billable category rather than multiplying one blended token count by one price.

\begin{aligned} \text{cost} = &\frac{\text{fresh input tokens} \times \text{fresh input price}}{1{,}000{,}000} \\ &+ \frac{\text{cached input tokens} \times \text{cached input price}}{1{,}000{,}000} \\ &+ \frac{\text{output tokens} \times \text{output price}}{1{,}000{,}000} \end{aligned}

02-price-one-generated-answer.py

@dataclass(frozen=True)
class PricedUsage:
    fresh_input_usd: Decimal
    cached_input_usd: Decimal
    output_usd: Decimal

    @property
    def total_usd(self) -> Decimal:
        return self.fresh_input_usd + self.cached_input_usd + self.output_usd

def price_usage(usage: Usage, card: RateCard = rate_card, *, batch: bool = False) -> PricedUsage:
    rates = card.batch if batch else card.standard
    fresh_input_tokens = usage.input_tokens - usage.cached_input_tokens
    return PricedUsage(
        fresh_input_usd=Decimal(fresh_input_tokens) * rates.input_per_1m / MILLION,
        cached_input_usd=Decimal(usage.cached_input_tokens) * rates.cached_input_per_1m / MILLION,
        output_usd=Decimal(usage.output_tokens) * rates.output_per_1m / MILLION,
    )

cold_policy_answer = Usage(input_tokens=1_800, cached_input_tokens=0, output_tokens=180)
cold_cost = price_usage(cold_policy_answer)

assert cold_cost.total_usd == Decimal("0.0072")
assert rate_card.standard.output_per_1m > rate_card.standard.input_per_1m

print(f"fresh_input=${cold_cost.fresh_input_usd:.6f}")
print(f"output=${cold_cost.output_usd:.6f}")
print(f"total=${cold_cost.total_usd:.6f}")
print(f"output_share={(cold_cost.output_usd / cold_cost.total_usd):.1%}")

Output

fresh_input=$0.004500
output=$0.002700
total=$0.007200
output_share=37.5%

The output token rate is higher in this rate card, although this long-input request still spends more dollars on input overall. That distinction is why both token volume and per-category rate belong in the ledger. Decode is autoregressive and serving systems manage growing KV state during generation; those mechanisms help explain why output can be priced differently, while the current rate card remains the billing authority.^[2]^[1]
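
A short calculation makes the volume-versus-rate distinction concrete. This sketch uses the fixture's standard rates and ignores cached input; `break_even_output_tokens` is an illustrative helper, not part of the lesson's ledger.

```python
from decimal import Decimal

# Standard rates copied from the lesson's dated fixture (USD per million tokens).
INPUT_PER_1M = Decimal("2.50")
OUTPUT_PER_1M = Decimal("15.00")

def break_even_output_tokens(fresh_input_tokens: int) -> Decimal:
    """Output tokens at which output spend equals fresh-input spend."""
    return Decimal(fresh_input_tokens) * INPUT_PER_1M / OUTPUT_PER_1M

threshold = break_even_output_tokens(1_800)
output_dominates = Decimal(180) > threshold
print(f"break_even_output_tokens={threshold}")
print(f"cold_request_output_dominates={output_dominates}")
```

With these rates, output spend overtakes input spend only when the answer exceeds one sixth of the fresh input length. The 180-token cold answer stays below that threshold, which matches the 37.5% output share above.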

Prefix reuse is still a generated answer

OpenAI's documented prompt caching applies automatically for prompts of at least 1,024 tokens, depends on exact matching at the beginning of the prompt, and reports cached prefix tokens in usage.prompt_tokens_details.cached_tokens.^[3] That gives the application one actionable design rule: put stable instructions and shared evidence first, then dynamic customer data.

A prefix hit isn't a stored response. The support model still answers the current question and still incurs output spend.
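
The ordering rule can be sketched as a prompt builder that always places stable content first. The instruction text, policy sentence, and questions below are hypothetical placeholders, and the prefix comparison is a character-level proxy. The provider matches on tokens, and these short strings are far below the 1,024-token minimum, so this only illustrates layout.

```python
import os

# Hypothetical stable content; a real prefix would be the evaluated instructions.
STABLE_INSTRUCTIONS = "You are the ShopFlow support assistant. Cite policy evidence."
SHARED_POLICY_EVIDENCE = "Policy excerpt: unopened items may be returned within the return window."

def build_prompt(customer_question: str) -> str:
    """Stable content first, request-specific content last."""
    return "\n\n".join([STABLE_INSTRUCTIONS, SHARED_POLICY_EVIDENCE, customer_question])

first = build_prompt("Can I return headphones I bought last week?")
second = build_prompt("Is a gift card refundable?")

# Character-level proxy only: the shared beginning of two different requests.
shared_prefix = os.path.commonprefix([first, second])
stable_part = "\n\n".join([STABLE_INSTRUCTIONS, SHARED_POLICY_EVIDENCE]) + "\n\n"
print(f"stable_content_is_shared_prefix={shared_prefix == stable_part}")
```

If the customer question were placed first, the two prompts would diverge at the first character and no prefix could be reused.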

03-account-for-prefix-reuse.py

prefix_cached_policy_answer = Usage(
    input_tokens=1_800,
    cached_input_tokens=1_280,
    output_tokens=180,
)
prefix_cost = price_usage(prefix_cached_policy_answer)
savings = cold_cost.total_usd - prefix_cost.total_usd

assert prefix_cost.total_usd == Decimal("0.004320")
assert prefix_cost.output_usd == cold_cost.output_usd
assert savings == Decimal("0.002880")

print(f"cold_generation=${cold_cost.total_usd:.6f}")
print(f"prefix_cached_generation=${prefix_cost.total_usd:.6f}")
print(f"saved_per_generated_answer=${savings:.6f}")
print(f"still_generated={prefix_cost.output_usd > 0}")

Output

cold_generation=$0.007200
prefix_cached_generation=$0.004320
saved_per_generated_answer=$0.002880
still_generated=True

Carry forward the semantic-cache decision

The preceding lesson promoted semantic response reuse only for stable public-policy answers under one evaluated release scope. This lesson consumes that decision; it doesn't rebuild its index or threshold. A stored-answer hit records zero generated tokens, while a prompt-prefix hit records a newly generated answer at a reduced input cost.

04-separate-answer-hits-from-prefix-hits.py

@dataclass(frozen=True)
class RequestOutcome:
    feature: str
    decision: str
    usage: Usage | None
    counterfactual_usage: Usage | None = None

def generated_cost(outcome: RequestOutcome) -> Decimal:
    if outcome.usage is None:
        return Decimal("0")
    return price_usage(outcome.usage).total_usd

semantic_hit = RequestOutcome(
    feature="public-policy-answer",
    decision="SEMANTIC_ANSWER_HIT",
    usage=None,
    counterfactual_usage=prefix_cached_policy_answer,
)
prefix_hit = RequestOutcome(
    feature="public-policy-answer",
    decision="GENERATE_PREFIX_HIT",
    usage=prefix_cached_policy_answer,
)

avoided_generation = price_usage(semantic_hit.counterfactual_usage).total_usd
assert generated_cost(semantic_hit) == Decimal("0")
assert generated_cost(prefix_hit) == Decimal("0.004320")

print(f"semantic_hit_generation=${generated_cost(semantic_hit):.6f}")
print(f"semantic_hit_avoided=${avoided_generation:.6f}")
print(f"prefix_hit_generation=${generated_cost(prefix_hit):.6f}")

Output

semantic_hit_generation=$0.000000
semantic_hit_avoided=$0.004320
prefix_hit_generation=$0.004320

Figure: Cost ledger flow from request decision through provider usage categories to release spend and avoided-generation reporting. The ledger records actual charges separately from counterfactual avoided generation. A cheaper release must still satisfy the evaluated answer contract.

Replay a day of traffic

A per-request example is useful, but a release decision needs traffic shape. The ShopFlow baseline replay day, measured before testing a shorter exception answer, has this mix:

3,200 stable policy questions safely hit the semantic answer cache.
1,800 policy questions still generate, but reuse the evaluated prompt prefix.
3,000 live-order questions must generate from current tool evidence.
500 return exceptions generate a longer cited answer with the stable prefix.

The semantic-hit counterfactual answers "what cost did safe reuse avoid?" Actual spend answers "what appears on this release ledger?"

05-price-a-traffic-replay.py

@dataclass(frozen=True)
class TrafficSlice:
    outcome: RequestOutcome
    requests: int

daily_replay = [
    TrafficSlice(semantic_hit, requests=3_200),
    TrafficSlice(prefix_hit, requests=1_800),
    TrafficSlice(
        RequestOutcome(
            feature="live-order-answer",
            decision="GENERATE_LIVE_DATA",
            usage=Usage(input_tokens=900, cached_input_tokens=0, output_tokens=95),
        ),
        requests=3_000,
    ),
    TrafficSlice(
        RequestOutcome(
            feature="return-exception-answer",
            decision="GENERATE_PREFIX_HIT",
            usage=Usage(input_tokens=2_200, cached_input_tokens=1_280, output_tokens=220),
        ),
        requests=500,
    ),
]

actual_daily_spend = sum(
    generated_cost(slice_.outcome) * slice_.requests
    for slice_ in daily_replay
)
avoided_daily_generation = sum(
    price_usage(slice_.outcome.counterfactual_usage).total_usd * slice_.requests
    for slice_ in daily_replay
    if slice_.outcome.counterfactual_usage is not None
)

assert actual_daily_spend == Decimal("21.761000")
assert avoided_daily_generation == Decimal("13.824000")

print(f"generated_daily_spend=${dollars(actual_daily_spend)}")
print(f"semantic_hit_avoided_daily=${dollars(avoided_daily_generation)}")
print(f"request_count={sum(slice_.requests for slice_ in daily_replay)}")

Output

generated_daily_spend=$21.76
semantic_hit_avoided_daily=$13.82
request_count=8500

Avoided generation is useful evidence, but it isn't a refund and shouldn't be mixed into actual spend. If an answer hit later fails review, the quality failure takes precedence over its apparent savings.
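
Traffic shape also determines how exposed the forecast is to the semantic hit rate. The sketch below recomputes policy spend for a few hit rates, under the simplifying assumption that every miss becomes a prefix-cached generation at the replay's per-answer cost. `policy_spend` is an illustrative helper.

```python
from decimal import Decimal

POLICY_QUESTIONS_PER_DAY = 5_000                    # 3,200 hits + 1,800 generated in the replay
PREFIX_CACHED_GENERATION_USD = Decimal("0.004320")  # per generated policy answer

def policy_spend(semantic_hit_rate: Decimal) -> Decimal:
    """Actual daily policy spend when every miss is a prefix-cached generation."""
    generated = Decimal(POLICY_QUESTIONS_PER_DAY) * (Decimal("1") - semantic_hit_rate)
    return generated * PREFIX_CACHED_GENERATION_USD

replay_rate = Decimal("0.64")  # 3,200 of 5,000 policy questions
for rate in (replay_rate, Decimal("0.50"), Decimal("0.00")):
    print(f"hit_rate={rate} policy_spend=${policy_spend(rate):.2f}")
```

At the replay's 64% hit rate this reproduces the $7.78 policy spend. Lower hit rates raise actual spend, which is why the forecast should be rerun whenever the approved answer scope changes.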

Reduce output only under an answer contract

Output reductions can be valuable in a rate card where output is expensive. Blind truncation isn't a cost strategy. ShopFlow's cited support answer must state eligibility, cite the policy evidence, and name the customer's next step. Compare three candidate formats generated on the same input:

Candidate	Intended change	Safe for automatic support?
Verbose	Full explanation with repetition	Correct but unnecessarily long
Concise cited	One decision, one citation, one next step	Candidate for release
Truncated	Hard stop before citation and action	Reject

06-optimize-output-under-contract.py

@dataclass(frozen=True)
class AnswerCandidate:
    label: str
    output_tokens: int
    states_decision: bool
    cites_policy: bool
    gives_next_step: bool

def satisfies_answer_contract(candidate: AnswerCandidate) -> bool:
    return (
        candidate.states_decision
        and candidate.cites_policy
        and candidate.gives_next_step
    )

candidates = [
    AnswerCandidate("verbose", 220, True, True, True),
    AnswerCandidate("concise_cited", 130, True, True, True),
    AnswerCandidate("truncated", 55, True, False, False),
]
safe_candidates = [candidate for candidate in candidates if satisfies_answer_contract(candidate)]
selected_answer = min(safe_candidates, key=lambda candidate: candidate.output_tokens)

verbose_usage = Usage(input_tokens=2_200, cached_input_tokens=1_280, output_tokens=220)
concise_usage = Usage(input_tokens=2_200, cached_input_tokens=1_280, output_tokens=selected_answer.output_tokens)
saved_per_exception = price_usage(verbose_usage).total_usd - price_usage(concise_usage).total_usd

assert selected_answer.label == "concise_cited"
assert not satisfies_answer_contract(candidates[-1])
assert saved_per_exception == Decimal("0.001350")

print(f"selected={selected_answer.label}")
print(f"rejected={candidates[-1].label}")
print(f"saved_per_exception=${saved_per_exception:.6f}")
print(f"saved_per_replay_day=${dollars(saved_per_exception * 500)}")

Output

selected=concise_cited
rejected=truncated
saved_per_exception=$0.001350
saved_per_replay_day=$0.68

The cheapest candidate is rejected. Cost reductions count only after the output still meets the same correctness and evidence contract.
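
The three contract flags have to come from somewhere. One option, sketched here under the assumption that answers are emitted in a structured form, is to derive them from the answer's fields. `StructuredAnswer`, its field names, and the sample citation identifier are hypothetical; the lesson does not define the schema's shape.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StructuredAnswer:
    # Hypothetical fields standing in for the cited-answer schema.
    decision: str
    citation_ids: tuple[str, ...] = ()
    next_step: str = ""

def contract_flags(answer: StructuredAnswer) -> dict[str, bool]:
    """Derive the three contract checks from the answer's fields."""
    return {
        "states_decision": bool(answer.decision.strip()),
        "cites_policy": len(answer.citation_ids) > 0,
        "gives_next_step": bool(answer.next_step.strip()),
    }

concise = StructuredAnswer(
    decision="Eligible for return",
    citation_ids=("return-policy#example-section",),  # placeholder identifier
    next_step="Print the prepaid label",
)
truncated = StructuredAnswer(decision="Eligible for return")

concise_passes = all(contract_flags(concise).values())
truncated_passes = all(contract_flags(truncated).values())
print(f"concise_passes={concise_passes}")
print(f"truncated_passes={truncated_passes}")
```

Presence checks like these only confirm that each part exists. Whether the citation actually supports the decision still needs evaluation.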

Batch only work that can wait

OpenAI's Batch documentation describes a separate asynchronous workload path with a 50% discount compared with synchronous APIs and a 24-hour completion window.^[4] The pricing page publishes Batch token rates explicitly, so the ledger reads that row instead of deriving every category from one multiplier.^[1] That path is useful for nightly regression evaluation of the support release. It isn't appropriate for a customer waiting in a live conversation.

07-separate-interactive-and-batch-work.py

@dataclass(frozen=True)
class Workload:
    name: str
    interactive: bool
    deadline_hours: int
    endpoint_supported: bool
    daily_requests: int
    usage: Usage

def batch_eligible(workload: Workload) -> bool:
    return (
        not workload.interactive
        and workload.deadline_hours >= 24
        and workload.endpoint_supported
    )

live_support = Workload(
    name="live-support-answer",
    interactive=True,
    deadline_hours=0,
    endpoint_supported=True,
    daily_requests=3_000,
    usage=Usage(input_tokens=900, cached_input_tokens=0, output_tokens=95),
)
nightly_eval = Workload(
    name="nightly-release-eval",
    interactive=False,
    deadline_hours=24,
    endpoint_supported=True,
    daily_requests=2_000,
    usage=Usage(input_tokens=1_800, cached_input_tokens=1_280, output_tokens=80),
)

nightly_sync_cost = price_usage(nightly_eval.usage).total_usd * nightly_eval.daily_requests
nightly_batch_cost = price_usage(nightly_eval.usage, batch=True).total_usd * nightly_eval.daily_requests

assert not batch_eligible(live_support)
assert batch_eligible(nightly_eval)
assert nightly_batch_cost == Decimal("2.832800")
assert nightly_batch_cost < nightly_sync_cost

print(f"live_support_batchable={batch_eligible(live_support)}")
print(f"nightly_eval_batchable={batch_eligible(nightly_eval)}")
print(f"nightly_eval_sync=${dollars(nightly_sync_cost)}")
print(f"nightly_eval_batch=${dollars(nightly_batch_cost)}")

Output

live_support_batchable=False
nightly_eval_batchable=True
nightly_eval_sync=$5.64
nightly_eval_batch=$2.83
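
The reason for storing the batch row explicitly shows up in the cached-input rate. Halving the standard rate gives $0.125, while the fixture's batch row says $0.13. The sketch below compares both on the nightly evaluation workload; `daily_cost` and the halving shortcut are illustrative.

```python
from decimal import Decimal

MILLION = Decimal("1000000")
FRESH, CACHED, OUTPUT = 520, 1_280, 80  # nightly eval tokens per request
REQUESTS = 2_000

standard = {"input": Decimal("2.50"), "cached": Decimal("0.25"), "output": Decimal("15.00")}
published_batch = {"input": Decimal("1.25"), "cached": Decimal("0.13"), "output": Decimal("7.50")}
# Shortcut under test: halve every standard rate instead of reading the batch row.
derived_batch = {name: rate / 2 for name, rate in standard.items()}

def daily_cost(rates: dict[str, Decimal]) -> Decimal:
    """Daily cost of the nightly workload under one set of per-million rates."""
    per_request = (
        FRESH * rates["input"] + CACHED * rates["cached"] + OUTPUT * rates["output"]
    ) / MILLION
    return per_request * REQUESTS

drift = daily_cost(published_batch) - daily_cost(derived_batch)
print(f"published_row=${daily_cost(published_batch)}")
print(f"derived_from_multiplier=${daily_cost(derived_batch)}")
print(f"daily_drift=${drift}")
```

The drift is small for this workload, but it scales with cached-token volume and comes purely from the shortcut, not from any change in usage.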

Attribute cost to decisions, not just models

An invoice tells you how much you spent. A release trace tells you why. Each recorded row should name the evaluated release, feature, decision, pricing mode, token usage, and output-contract evidence. Recording only a model name can't distinguish an unsafe answer-cache promotion from a harmless increase in live-order traffic. Make contract evidence mandatory on every trace instead of defaulting it to a pass.

08-build-feature-cost-telemetry.py

@dataclass(frozen=True)
class LedgerTrace:
    release_id: str
    feature: str
    decision: str
    requests: int
    usage: Usage | None
    contract_passed: bool
    contract_evidence_id: str
    batch: bool = False

def trace_spend(trace: LedgerTrace) -> Decimal:
    if trace.usage is None:
        return Decimal("0")
    return price_usage(trace.usage, batch=trace.batch).total_usd * trace.requests

release_id = "support-release-2026-05-cost-v1"
optimized_replay = [
    TrafficSlice(
        RequestOutcome(
            feature=item.outcome.feature,
            decision=item.outcome.decision,
            usage=concise_usage if item.outcome.feature == "return-exception-answer" else item.outcome.usage,
            counterfactual_usage=item.outcome.counterfactual_usage,
        ),
        requests=item.requests,
    )
    for item in daily_replay
]
traces = [
    LedgerTrace(
        release_id=release_id,
        feature=item.outcome.feature,
        decision=item.outcome.decision,
        requests=item.requests,
        usage=item.outcome.usage,
        contract_passed=True,
        contract_evidence_id=(
            "approved-public-policy-cache@v1"
            if item.outcome.decision == "SEMANTIC_ANSWER_HIT"
            else "cited-support-answer-v3@canary"
        ),
    )
    for item in optimized_replay
]
traces.append(
    LedgerTrace(
        release_id=release_id,
        feature=nightly_eval.name,
        decision="BATCH_OFFLINE_EVAL",
        requests=nightly_eval.daily_requests,
        usage=nightly_eval.usage,
        contract_passed=True,
        contract_evidence_id="nightly-release-eval@batch-eligible",
        batch=True,
    )
)

spend_by_feature: defaultdict[str, Decimal] = defaultdict(lambda: Decimal("0"))
for trace in traces:
    spend_by_feature[trace.feature] += trace_spend(trace)

total_with_eval = sum(spend_by_feature.values())
trace_contracts_complete = all(
    trace.contract_passed and trace.contract_evidence_id
    for trace in traces
)
assert spend_by_feature["public-policy-answer"] == Decimal("7.776000")
assert actual_daily_spend - sum(trace_spend(trace) for trace in traces[:-1]) == Decimal("0.675000")
assert trace_contracts_complete

for feature in sorted(spend_by_feature):
    print(f"{feature}=${dollars(spend_by_feature[feature])}")
print(f"daily_total_with_eval=${dollars(total_with_eval)}")
print(f"trace_contracts_complete={trace_contracts_complete}")

Output

live-order-answer=$11.03
nightly-release-eval=$2.83
public-policy-answer=$7.78
return-exception-answer=$2.29
daily_total_with_eval=$23.92
trace_contracts_complete=True
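
The same traces can be grouped by decision instead of by feature. The sketch below copies the per-slice daily spend from the optimized replay into a small table so that it runs on its own; the row layout is illustrative.

```python
from collections import defaultdict
from decimal import Decimal

# (feature, decision, daily spend in USD), copied from the optimized replay above.
rows = [
    ("public-policy-answer", "SEMANTIC_ANSWER_HIT", Decimal("0")),
    ("public-policy-answer", "GENERATE_PREFIX_HIT", Decimal("7.776")),
    ("live-order-answer", "GENERATE_LIVE_DATA", Decimal("11.025")),
    ("return-exception-answer", "GENERATE_PREFIX_HIT", Decimal("2.285")),
    ("nightly-release-eval", "BATCH_OFFLINE_EVAL", Decimal("2.8328")),
]

spend_by_decision: defaultdict[str, Decimal] = defaultdict(lambda: Decimal("0"))
for _feature, decision, spend in rows:
    spend_by_decision[decision] += spend

for decision in sorted(spend_by_decision):
    print(f"{decision}=${spend_by_decision[decision]}")
```

Grouping by decision shows that prefix-cached generation spans two features, a fact a per-feature or per-model view would hide.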

Figure: Release decision flow: ingest provider usage traces, price them with a versioned rate card, evaluate answer quality and budget gates, then promote, hold, or investigate. Budget is one promotion gate. The evaluated answer contract is another, and it can't be waived by savings.

Forecast and gate the release

A cost target without quality is an incentive to produce short, wrong answers. A quality target without cost leaves a production team unable to plan capacity or margins. Require both.

The canary report below evaluates the concise cited-answer policy for return exceptions. The replay's request mix is projected across 30 days, including its nightly offline evaluation run.

09-evaluate-a-release-budget.py

@dataclass(frozen=True)
class QualityReport:
    cited_answer_pass_rate: Decimal
    unsafe_cache_hits: int
    evaluated_answers: int

@dataclass(frozen=True)
class PromotionPolicy:
    monthly_budget_usd: Decimal
    minimum_cited_answer_pass_rate: Decimal
    maximum_unsafe_cache_hits: int

quality = QualityReport(
    cited_answer_pass_rate=Decimal("0.997"),
    unsafe_cache_hits=0,
    evaluated_answers=2_000,
)
promotion_policy = PromotionPolicy(
    monthly_budget_usd=Decimal("750.00"),
    minimum_cited_answer_pass_rate=Decimal("0.995"),
    maximum_unsafe_cache_hits=0,
)
monthly_forecast = total_with_eval * Decimal("30")

quality_passed = (
    trace_contracts_complete
    and quality.cited_answer_pass_rate >= promotion_policy.minimum_cited_answer_pass_rate
    and quality.unsafe_cache_hits <= promotion_policy.maximum_unsafe_cache_hits
)
budget_passed = monthly_forecast <= promotion_policy.monthly_budget_usd

assert dollars(monthly_forecast) == Decimal("717.56")
assert quality_passed and budget_passed

print(f"monthly_forecast=${dollars(monthly_forecast)}")
print(f"quality_passed={quality_passed}")
print(f"budget_passed={budget_passed}")

Output

monthly_forecast=$717.56
quality_passed=True
budget_passed=True
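
A gate is easier to audit when it reports why it held. The sketch below wraps the same three comparisons in one function that returns a status and its reasons, then reruns it with a single unsafe cache hit. `promotion_decision` and its reason strings are illustrative; the thresholds are the lesson's.

```python
from decimal import Decimal

def promotion_decision(
    monthly_forecast_usd: Decimal,
    monthly_budget_usd: Decimal,
    cited_answer_pass_rate: Decimal,
    minimum_pass_rate: Decimal,
    unsafe_cache_hits: int,
    maximum_unsafe_cache_hits: int = 0,
) -> tuple[str, list[str]]:
    """Return a status plus every gate that failed."""
    reasons: list[str] = []
    if cited_answer_pass_rate < minimum_pass_rate:
        reasons.append("cited-answer pass rate below minimum")
    if unsafe_cache_hits > maximum_unsafe_cache_hits:
        reasons.append("unsafe semantic cache hits above maximum")
    if monthly_forecast_usd > monthly_budget_usd:
        reasons.append("monthly forecast above budget")
    return ("HOLD_RELEASE" if reasons else "PROMOTE_COST_POLICY", reasons)

lesson_inputs = dict(
    monthly_forecast_usd=Decimal("717.56"),
    monthly_budget_usd=Decimal("750.00"),
    cited_answer_pass_rate=Decimal("0.997"),
    minimum_pass_rate=Decimal("0.995"),
)
baseline = promotion_decision(**lesson_inputs, unsafe_cache_hits=0)
with_unsafe_hit = promotion_decision(**lesson_inputs, unsafe_cache_hits=1)
print(baseline)
print(with_unsafe_hit)
```

The second call holds the release even though the forecast is unchanged and under budget, so savings cannot offset a quality failure.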

Figure: Two release promotion gates: answer quality must remain within the evaluated contract and forecast spend must remain within the approved budget. A cost optimization is promotable only when both the quality contract and budget forecast pass on evidence.

Produce the next policy input

The next lesson will implement model gateway routing and fallbacks. Give it a budget contract, not a half-built router embedded in cost analysis. The contract records what a future route must preserve: citation schema, maximum generated-answer spend, and the release evidence behind the limit.

10-emit-the-budget-contract.py

@dataclass(frozen=True)
class GatewayBudgetContract:
    release_id: str
    required_answer_schema: str
    maximum_generated_answer_usd: Decimal
    monthly_forecast_usd: Decimal
    rate_card_id: str
    status: str

max_generated_cost = max(
    price_usage(trace.usage).total_usd
    for trace in traces
    if trace.usage is not None and not trace.batch
)
status = "PROMOTE_COST_POLICY" if quality_passed and budget_passed else "HOLD_RELEASE"
gateway_contract = GatewayBudgetContract(
    release_id=release_id,
    required_answer_schema="cited-support-answer-v3",
    maximum_generated_answer_usd=max_generated_cost,
    monthly_forecast_usd=dollars(monthly_forecast),
    rate_card_id=rate_card.rate_card_id,
    status=status,
)

assert gateway_contract.status == "PROMOTE_COST_POLICY"
assert gateway_contract.maximum_generated_answer_usd == Decimal("0.004570")

print(f"status={gateway_contract.status}")
print(f"required_schema={gateway_contract.required_answer_schema}")
print(f"max_generated_answer=${gateway_contract.maximum_generated_answer_usd:.6f}")
print("next_step=apply_contract_in_gateway_routing")

Output

status=PROMOTE_COST_POLICY
required_schema=cited-support-answer-v3
max_generated_answer=$0.004570
next_step=apply_contract_in_gateway_routing
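
To hand the contract to another component, it has to be serialized without losing decimal precision. The sketch below is one way to do that with the standard library, writing `Decimal` values as strings. `BudgetContractRecord` mirrors the contract's fields for a self-contained example, and the JSON layout is an assumption, not a format the next lesson requires.

```python
import json
from dataclasses import asdict, dataclass
from decimal import Decimal

@dataclass(frozen=True)
class BudgetContractRecord:
    release_id: str
    required_answer_schema: str
    maximum_generated_answer_usd: Decimal
    monthly_forecast_usd: Decimal
    rate_card_id: str
    status: str

record = BudgetContractRecord(
    release_id="support-release-2026-05-cost-v1",
    required_answer_schema="cited-support-answer-v3",
    maximum_generated_answer_usd=Decimal("0.004570"),
    monthly_forecast_usd=Decimal("717.56"),
    rate_card_id="openai-gpt-5.4-short-context-2026-05-31",
    status="PROMOTE_COST_POLICY",
)

# default=str serializes Decimal values as exact strings rather than floats.
payload = json.dumps(asdict(record), default=str, sort_keys=True)
restored = json.loads(payload)
restored_limit = Decimal(restored["maximum_generated_answer_usd"])
print(payload)
print(f"limit_round_trips={restored_limit == record.maximum_generated_answer_usd}")
```

Keeping the rate card identifier inside the payload lets a consumer reject a contract whose prices it cannot trace back to a dated snapshot.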

What the ledger proves

This release did not become cheaper because of one vague "cache" switch:

Evidence	Correct conclusion
Semantic answer hit with valid release scope	New generation was avoided for an already approved answer class
Provider `cached_tokens` on a generated answer	Repeated prefix processing was charged at the cached-input rate
Concise cited-answer candidate passes evaluation	Output reduction is eligible for release
Nightly evaluation classified as noninteractive	Batch discount can be modeled without changing customer latency
Monthly forecast and quality gate pass	Cost policy can be promoted and handed to gateway design

Don't infer a pricing benefit from a prompt rewrite until usage traces show cached tokens. Don't count semantic-cache savings unless accepted-hit quality remains valid. Don't approve a shorter output merely because it costs less.

Mastery check

Key concepts

Version rate cards because provider prices and discount rules change.
Price provider-reported fresh input, cached input, and output separately.
A semantic response hit can skip generation; a prompt-prefix hit can't.
Output reduction is valid only after the answer schema and evidence requirements still pass.
Batch is a workload-class decision: asynchronous evaluation can wait; live support can't.
Store explicit rates for each provider pricing mode instead of assuming every discount composes arithmetically.
A release promotion needs both a quality gate and a budget gate.

Practice tasks

Add a new warranty-claim-answer traffic slice with longer cited output. Decide whether it breaks the monthly budget.
Change the rate card to a newly verified provider/model rate card and rerun the ledger. Explain which request class moves the forecast most.
Add one unsafe semantic answer hit to the quality report. Verify that apparent savings don't allow promotion.
Replace the fixed replay with exported usage traces from an application you control, keeping rate-card identity and release identity explicit.

Evaluation rubric

Foundational: Computes a generated request charge from fresh input, cached input, and output categories.
Foundational: Explains why answer reuse and prefix reuse can't share one cache_hit metric.
Intermediate: Attributes spend by feature and decision, including avoided generation without treating it as actual spend.
Intermediate: Tests an output optimization against a required answer contract.
Advanced: Promotes only when a dated rate card, measured traffic replay, quality report, and budget policy all pass.
Advanced: Hands a budget contract to the routing layer without prematurely implementing routing in this lesson.

Common pitfalls

Estimating the bill from characters

Symptom: Forecast misses invoice even when request count is correct. Cause: Character length or visible response length was treated as provider usage. Fix: Store billable usage categories returned by the provider and price them with a versioned rate card.

Calling all reuse a cache hit

Symptom: Team believes generations were skipped, but output spend remains high. Cause: Prefix-cache reuse was confused with stored-answer reuse. Fix: Record separate decisions and cost semantics for each layer.

Cutting output below correctness

Symptom: Spend falls while support answers omit evidence or next actions. Cause: Output cap was promoted without evaluating the required answer schema. Fix: Make the quality contract a hard promotion gate.

Batching interactive requests

Symptom: Forecast looks excellent while customer response latency becomes unacceptable. Cause: A discount was modeled without preserving completion-time requirements. Fix: Batch only explicitly asynchronous work such as nightly evaluation.

Deriving every pricing mode from one multiplier

Symptom: Forecast drifts from provider billing even though token counts are correct. Cause: Standard, Batch, or other processing-mode prices were inferred from one discount instead of loaded from the provider's dated table. Fix: Version explicit rates for every pricing mode used by the release and refresh them from current provider documentation.

Next Step

Continue to Model Gateways, Routing, and Fallbacks

You now have a measured budget contract for each generated support answer. Next you will apply it when selecting lanes and safe fallbacks.

References

[1] OpenAI. OpenAI API Pricing. 2026.

[2] Kwon, W., et al. Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023.

[3] OpenAI. Prompt caching. 2026.

[4] OpenAI. OpenAI Batch API Guide. 2026.

Back to Topics

LearnApplied LLM EngineeringLLM Cost Engineering & Token Economics

⚙️MediumMLOps & Deployment

LLM Cost Engineering & Token Economics

Build an auditable LLM cost ledger from usage traces, cache decisions, output contracts, offline batch work, and release budget gates.

16 min read

Learning path

Step 73 of 155 in the full curriculum

Semantic Caching & Cost Optimization Model Gateways, Routing, and Fallbacks

The accounting boundary

Cost engineering begins after a request decision is known:

Decision	What executes?	What belongs in this ledger?
Semantic answer hit from the previous lesson	Stored evaluated answer returned	No generation charge; record avoided generation separately
Generated answer with provider prompt-cache hit	Model still produces a new answer	Fresh input, cached input, and output token charges
Generated answer without prompt-cache hit	Model reads full input and produces a new answer	Fresh input and output token charges
Offline evaluation batch	Model runs later, outside interactive path	Batch-eligible token spend and completion contract

Use a dated rate card, not remembered prices

01-rate-card-and-usage.py

from collections import defaultdict
from dataclasses import dataclass
from decimal import Decimal, ROUND_HALF_UP

MILLION = Decimal("1000000")
CENT = Decimal("0.01")

def dollars(amount: Decimal) -> Decimal:
    return amount.quantize(CENT, rounding=ROUND_HALF_UP)

@dataclass(frozen=True)
class TokenRates:
    input_per_1m: Decimal
    cached_input_per_1m: Decimal
    output_per_1m: Decimal

@dataclass(frozen=True)
class RateCard:
    rate_card_id: str
    standard: TokenRates
    batch: TokenRates

@dataclass(frozen=True)
class Usage:
    input_tokens: int
    cached_input_tokens: int
    output_tokens: int

    def __post_init__(self) -> None:
        if not 0 <= self.cached_input_tokens <= self.input_tokens:
            raise ValueError("cached input must be a subset of input tokens")
        if min(self.input_tokens, self.output_tokens) < 0:
            raise ValueError("token counts cannot be negative")

rate_card = RateCard(
    rate_card_id="openai-gpt-5.4-short-context-2026-05-31",
    standard=TokenRates(
        input_per_1m=Decimal("2.50"),
        cached_input_per_1m=Decimal("0.25"),
        output_per_1m=Decimal("15.00"),
    ),
    batch=TokenRates(
        input_per_1m=Decimal("1.25"),
        cached_input_per_1m=Decimal("0.13"),
        output_per_1m=Decimal("7.50"),
    ),
)

print(rate_card.rate_card_id)
print(f"standard_input_per_1m=${rate_card.standard.input_per_1m}")
print(f"standard_cached_input_per_1m=${rate_card.standard.cached_input_per_1m}")
print(f"standard_output_per_1m=${rate_card.standard.output_per_1m}")
print(f"batch_input_per_1m=${rate_card.batch.input_per_1m}")
print(f"batch_cached_input_per_1m=${rate_card.batch.cached_input_per_1m}")
print(f"batch_output_per_1m=${rate_card.batch.output_per_1m}")

Output

openai-gpt-5.4-short-context-2026-05-31
standard_input_per_1m=$2.50
standard_cached_input_per_1m=$0.25
standard_output_per_1m=$15.00
batch_input_per_1m=$1.25
batch_cached_input_per_1m=$0.13
batch_output_per_1m=$7.50
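
A rate card is only as trustworthy as its last verification. The sketch below is one possible freshness guard, not part of the lesson's ledger: `DatedRateCard`, `verified_on`, `is_stale`, and the 30-day window are all hypothetical names and values, and the date is assumed from the suffix of the rate card ID.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass(frozen=True)
class DatedRateCard:
    rate_card_id: str
    verified_on: date  # hypothetical field: last day the prices were checked against the provider page

def is_stale(card: DatedRateCard, today: date, max_age_days: int = 30) -> bool:
    """True when the card was verified more than max_age_days ago."""
    return today - card.verified_on > timedelta(days=max_age_days)

card = DatedRateCard(
    rate_card_id="openai-gpt-5.4-short-context-2026-05-31",
    verified_on=date(2026, 5, 31),  # assumed from the ID suffix
)

print(is_stale(card, today=date(2026, 6, 15)))  # 15 days old
print(is_stale(card, today=date(2026, 7, 15)))  # 45 days old
```

A forecast job could refuse to run, or at least flag its output, when the guard returns `True`.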

Price one generated answer

\[
\begin{aligned}
\text{cost} = &\frac{\text{fresh input tokens} \times \text{fresh input price}}{1{,}000{,}000} \\
&+ \frac{\text{cached input tokens} \times \text{cached input price}}{1{,}000{,}000} \\
&+ \frac{\text{output tokens} \times \text{output price}}{1{,}000{,}000}
\end{aligned}
\]

02-price-one-generated-answer.py

@dataclass(frozen=True)
class PricedUsage:
    fresh_input_usd: Decimal
    cached_input_usd: Decimal
    output_usd: Decimal

    @property
    def total_usd(self) -> Decimal:
        return self.fresh_input_usd + self.cached_input_usd + self.output_usd

def price_usage(usage: Usage, card: RateCard = rate_card, *, batch: bool = False) -> PricedUsage:
    # Use the explicit rates stored for the pricing mode; batch rates are not derived from standard.
    rates = card.batch if batch else card.standard
    # Cached tokens are a subset of input tokens, so the remainder is billed at the fresh rate.
    fresh_input_tokens = usage.input_tokens - usage.cached_input_tokens
    return PricedUsage(
        fresh_input_usd=Decimal(fresh_input_tokens) * rates.input_per_1m / MILLION,
        cached_input_usd=Decimal(usage.cached_input_tokens) * rates.cached_input_per_1m / MILLION,
        output_usd=Decimal(usage.output_tokens) * rates.output_per_1m / MILLION,
    )

cold_policy_answer = Usage(input_tokens=1_800, cached_input_tokens=0, output_tokens=180)
cold_cost = price_usage(cold_policy_answer)

assert cold_cost.total_usd == Decimal("0.0072")
assert rate_card.standard.output_per_1m > rate_card.standard.input_per_1m

print(f"fresh_input=${cold_cost.fresh_input_usd:.6f}")
print(f"output=${cold_cost.output_usd:.6f}")
print(f"total=${cold_cost.total_usd:.6f}")
print(f"output_share={(cold_cost.output_usd / cold_cost.total_usd):.1%}")

Output

fresh_input=$0.004500
output=$0.002700
total=$0.007200
output_share=37.5%

Prefix reuse is still a generated answer

A prefix hit isn't a stored response. The support model still answers the current question and still incurs output spend.
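
In provider-reported usage, a prefix hit appears as a cached-token count inside the input total, next to a normal output count. The sketch below splits such a payload into the three billed categories. The payload shape (`prompt_tokens`, `completion_tokens`, `prompt_tokens_details.cached_tokens`) is an assumption to verify against your provider's response format, and `ParsedUsage` and `parse_usage` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ParsedUsage:
    fresh_input_tokens: int
    cached_input_tokens: int
    output_tokens: int

def parse_usage(payload: dict) -> ParsedUsage:
    """Split one usage payload into fresh input, cached input, and output tokens.

    Assumed shape: prompt_tokens, completion_tokens, and an optional
    prompt_tokens_details.cached_tokens that counts a subset of prompt_tokens.
    """
    prompt_tokens = int(payload["prompt_tokens"])
    details = payload.get("prompt_tokens_details") or {}
    cached = int(details.get("cached_tokens", 0))
    if not 0 <= cached <= prompt_tokens:
        raise ValueError("cached tokens must be a subset of prompt tokens")
    return ParsedUsage(
        fresh_input_tokens=prompt_tokens - cached,
        cached_input_tokens=cached,
        output_tokens=int(payload["completion_tokens"]),
    )

sample = {
    "prompt_tokens": 1800,
    "completion_tokens": 180,
    "prompt_tokens_details": {"cached_tokens": 1280},
}
parsed = parse_usage(sample)
print(parsed)
```

The output count is unchanged by the cache hit, which is why the answer still carries output spend.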

03-account-for-prefix-reuse.py

prefix_cached_policy_answer = Usage(
    input_tokens=1_800,
    cached_input_tokens=1_280,
    output_tokens=180,
)
prefix_cost = price_usage(prefix_cached_policy_answer)
savings = cold_cost.total_usd - prefix_cost.total_usd

assert prefix_cost.total_usd == Decimal("0.004320")
assert prefix_cost.output_usd == cold_cost.output_usd
assert savings == Decimal("0.002880")

print(f"cold_generation=${cold_cost.total_usd:.6f}")
print(f"prefix_cached_generation=${prefix_cost.total_usd:.6f}")
print(f"saved_per_generated_answer=${savings:.6f}")
print(f"still_generated={prefix_cost.output_usd > 0}")

Output

cold_generation=$0.007200
prefix_cached_generation=$0.004320
saved_per_generated_answer=$0.002880
still_generated=True
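
The saving can also be checked by hand. With input and output counts held fixed, the output term cancels and only the price difference on the cached tokens remains. The worked figures use the standard rates from the rate card above:

```latex
\[
\begin{aligned}
\text{saving} &= \frac{\text{cached input tokens} \times (\text{fresh input price} - \text{cached input price})}{1{,}000{,}000} \\
&= \frac{1{,}280 \times (2.50 - 0.25)}{1{,}000{,}000} = 0.00288
\end{aligned}
\]
```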

Carry forward the semantic-cache decision

04-separate-answer-hits-from-prefix-hits.py

@dataclass(frozen=True)
class RequestOutcome:
    feature: str
    decision: str
    usage: Usage | None
    counterfactual_usage: Usage | None = None

def generated_cost(outcome: RequestOutcome) -> Decimal:
    if outcome.usage is None:
        return Decimal("0")
    return price_usage(outcome.usage).total_usd

semantic_hit = RequestOutcome(
    feature="public-policy-answer",
    decision="SEMANTIC_ANSWER_HIT",
    usage=None,
    counterfactual_usage=prefix_cached_policy_answer,
)
prefix_hit = RequestOutcome(
    feature="public-policy-answer",
    decision="GENERATE_PREFIX_HIT",
    usage=prefix_cached_policy_answer,
)

avoided_generation = price_usage(semantic_hit.counterfactual_usage).total_usd
assert generated_cost(semantic_hit) == Decimal("0")
assert generated_cost(prefix_hit) == Decimal("0.004320")

print(f"semantic_hit_generation=${generated_cost(semantic_hit):.6f}")
print(f"semantic_hit_avoided=${avoided_generation:.6f}")
print(f"prefix_hit_generation=${generated_cost(prefix_hit):.6f}")

Output

semantic_hit_generation=$0.000000
semantic_hit_avoided=$0.004320
prefix_hit_generation=$0.004320

Replay a day of traffic

A per-request example is useful, but a release decision needs traffic shape. For a ShopFlow baseline replay day, before testing a shorter exception answer:

3,200 stable policy questions safely hit the semantic answer cache.
1,800 policy questions still generate, but reuse the evaluated prompt prefix.
3,000 live-order questions must generate from current tool evidence.
500 return exceptions generate a longer cited answer with the stable prefix.

The semantic-hit counterfactual answers "what cost did safe reuse avoid?" Actual spend answers "what appears on this release ledger?"

05-price-a-traffic-replay.py

@dataclass(frozen=True)
class TrafficSlice:
    outcome: RequestOutcome
    requests: int

daily_replay = [
    TrafficSlice(semantic_hit, requests=3_200),
    TrafficSlice(prefix_hit, requests=1_800),
    TrafficSlice(
        RequestOutcome(
            feature="live-order-answer",
            decision="GENERATE_LIVE_DATA",
            usage=Usage(input_tokens=900, cached_input_tokens=0, output_tokens=95),
        ),
        requests=3_000,
    ),
    TrafficSlice(
        RequestOutcome(
            feature="return-exception-answer",
            decision="GENERATE_PREFIX_HIT",
            usage=Usage(input_tokens=2_200, cached_input_tokens=1_280, output_tokens=220),
        ),
        requests=500,
    ),
]

actual_daily_spend = sum(
    generated_cost(slice_.outcome) * slice_.requests
    for slice_ in daily_replay
)
avoided_daily_generation = sum(
    price_usage(slice_.outcome.counterfactual_usage).total_usd * slice_.requests
    for slice_ in daily_replay
    if slice_.outcome.counterfactual_usage is not None
)

assert actual_daily_spend == Decimal("21.761000")
assert avoided_daily_generation == Decimal("13.824000")

print(f"generated_daily_spend=${dollars(actual_daily_spend)}")
print(f"semantic_hit_avoided_daily=${dollars(avoided_daily_generation)}")
print(f"request_count={sum(slice_.requests for slice_ in daily_replay)}")

Output

generated_daily_spend=$21.76
semantic_hit_avoided_daily=$13.82
request_count=8500
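
Because the replay is a traffic shape, the daily figure moves with the semantic hit rate. The sketch below varies that rate under one simplifying assumption that is not in the replay itself: every policy question that misses the semantic cache generates with the same prefix-cached usage. `daily_spend` and the constants are hypothetical names; the dollar values are copied from the results above.

```python
from decimal import Decimal

PREFIX_HIT_ANSWER_USD = Decimal("0.004320")  # one generated policy answer with prefix reuse
OTHER_DAILY_SPEND_USD = Decimal("13.985")    # live-order (11.025) + return-exception (2.960) slices
POLICY_QUESTIONS = 5_000                     # 3,200 semantic hits + 1,800 generated answers

def daily_spend(semantic_hit_rate: Decimal) -> Decimal:
    """Generated daily spend if the given share of policy questions hit the answer cache."""
    hits = int(POLICY_QUESTIONS * semantic_hit_rate)
    generated = POLICY_QUESTIONS - hits
    return OTHER_DAILY_SPEND_USD + PREFIX_HIT_ANSWER_USD * generated

for rate in (Decimal("0.50"), Decimal("0.64"), Decimal("0.80")):
    print(f"semantic_hit_rate={rate} generated_daily_spend=${daily_spend(rate):.3f}")
```

The 0.64 row reproduces the replay's $21.761, which is a quick check that the simplification matches the baseline.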

Reduce output only under an answer contract

Candidate	Answer shape	Release verdict
Verbose	Full explanation with repetition	Correct but unnecessarily long
Concise cited	One decision, one citation, one next step	Candidate for release
Truncated	Hard stop before citation and action	Reject

06-optimize-output-under-contract.py

@dataclass(frozen=True)
class AnswerCandidate:
    label: str
    output_tokens: int
    states_decision: bool
    cites_policy: bool
    gives_next_step: bool

def satisfies_answer_contract(candidate: AnswerCandidate) -> bool:
    return (
        candidate.states_decision
        and candidate.cites_policy
        and candidate.gives_next_step
    )

candidates = [
    AnswerCandidate("verbose", 220, True, True, True),
    AnswerCandidate("concise_cited", 130, True, True, True),
    AnswerCandidate("truncated", 55, True, False, False),
]
safe_candidates = [candidate for candidate in candidates if satisfies_answer_contract(candidate)]
selected_answer = min(safe_candidates, key=lambda candidate: candidate.output_tokens)

verbose_usage = Usage(input_tokens=2_200, cached_input_tokens=1_280, output_tokens=220)
concise_usage = Usage(input_tokens=2_200, cached_input_tokens=1_280, output_tokens=selected_answer.output_tokens)
saved_per_exception = price_usage(verbose_usage).total_usd - price_usage(concise_usage).total_usd

assert selected_answer.label == "concise_cited"
assert not satisfies_answer_contract(candidates[-1])
assert saved_per_exception == Decimal("0.001350")

print(f"selected={selected_answer.label}")
print(f"rejected={candidates[-1].label}")
print(f"saved_per_exception=${saved_per_exception:.6f}")
print(f"saved_per_replay_day=${dollars(saved_per_exception * 500)}")

Output

selected=concise_cited
rejected=truncated
saved_per_exception=$0.001350
saved_per_replay_day=$0.68

The cheapest candidate, the truncated answer, is rejected because it drops the citation and the next step. A cost reduction counts only when the output still meets the same correctness and evidence contract.

Batch only work that can wait

07-separate-interactive-and-batch-work.py

@dataclass(frozen=True)
class Workload:
    name: str
    interactive: bool
    deadline_hours: int
    endpoint_supported: bool
    daily_requests: int
    usage: Usage

def batch_eligible(workload: Workload) -> bool:
    return (
        not workload.interactive
        and workload.deadline_hours >= 24
        and workload.endpoint_supported
    )

live_support = Workload(
    name="live-support-answer",
    interactive=True,
    deadline_hours=0,
    endpoint_supported=True,
    daily_requests=3_000,
    usage=Usage(input_tokens=900, cached_input_tokens=0, output_tokens=95),
)
nightly_eval = Workload(
    name="nightly-release-eval",
    interactive=False,
    deadline_hours=24,
    endpoint_supported=True,
    daily_requests=2_000,
    usage=Usage(input_tokens=1_800, cached_input_tokens=1_280, output_tokens=80),
)

nightly_sync_cost = price_usage(nightly_eval.usage).total_usd * nightly_eval.daily_requests
nightly_batch_cost = price_usage(nightly_eval.usage, batch=True).total_usd * nightly_eval.daily_requests

assert not batch_eligible(live_support)
assert batch_eligible(nightly_eval)
assert nightly_batch_cost == Decimal("2.832800")
assert nightly_batch_cost < nightly_sync_cost

print(f"live_support_batchable={batch_eligible(live_support)}")
print(f"nightly_eval_batchable={batch_eligible(nightly_eval)}")
print(f"nightly_eval_sync=${dollars(nightly_sync_cost)}")
print(f"nightly_eval_batch=${dollars(nightly_batch_cost)}")

Output

live_support_batchable=False
nightly_eval_batchable=True
nightly_eval_sync=$5.64
nightly_eval_batch=$2.83
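
The rate card stores the batch cached-input rate as $0.13, which isn't exactly half of the standard $0.25. The sketch below compares the explicit batch rates with a hypothetical "halve everything" shortcut for the nightly evaluation workload. The shortcut is an illustration of the pitfall, not a provider rule, and `cost` is a hypothetical helper.

```python
from decimal import Decimal

MILLION = Decimal("1000000")

# Nightly evaluation usage from the lesson: 1,800 input tokens, of which 1,280 cached, 80 output.
FRESH, CACHED, OUTPUT = 520, 1_280, 80
REQUESTS = 2_000

def cost(input_rate: Decimal, cached_rate: Decimal, output_rate: Decimal) -> Decimal:
    """Daily cost of the nightly evaluation at the given per-1M-token rates."""
    per_request = (FRESH * input_rate + CACHED * cached_rate + OUTPUT * output_rate) / MILLION
    return per_request * REQUESTS

# Explicit batch rates copied from the rate card.
explicit = cost(Decimal("1.25"), Decimal("0.13"), Decimal("7.50"))
# Hypothetical shortcut: derive every batch rate as half of the standard rate.
derived = cost(Decimal("2.50") / 2, Decimal("0.25") / 2, Decimal("15.00") / 2)

print(f"explicit_batch=${explicit}")
print(f"derived_batch=${derived}")
print(f"drift_per_day=${explicit - derived}")
```

The drift is small here, but it means the ledger no longer reconciles with the rate card it claims to use.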

Attribute cost to decisions, not just models

08-build-feature-cost-telemetry.py

@dataclass(frozen=True)
class LedgerTrace:
    release_id: str
    feature: str
    decision: str
    requests: int
    usage: Usage | None
    contract_passed: bool
    contract_evidence_id: str
    batch: bool = False

def trace_spend(trace: LedgerTrace) -> Decimal:
    if trace.usage is None:
        return Decimal("0")
    return price_usage(trace.usage, batch=trace.batch).total_usd * trace.requests

release_id = "support-release-2026-05-cost-v1"
optimized_replay = [
    TrafficSlice(
        RequestOutcome(
            feature=item.outcome.feature,
            decision=item.outcome.decision,
            usage=concise_usage if item.outcome.feature == "return-exception-answer" else item.outcome.usage,
            counterfactual_usage=item.outcome.counterfactual_usage,
        ),
        requests=item.requests,
    )
    for item in daily_replay
]
traces = [
    LedgerTrace(
        release_id=release_id,
        feature=item.outcome.feature,
        decision=item.outcome.decision,
        requests=item.requests,
        usage=item.outcome.usage,
        contract_passed=True,
        contract_evidence_id=(
            "approved-public-policy-cache@v1"
            if item.outcome.decision == "SEMANTIC_ANSWER_HIT"
            else "cited-support-answer-v3@canary"
        ),
    )
    for item in optimized_replay
]
traces.append(
    LedgerTrace(
        release_id=release_id,
        feature=nightly_eval.name,
        decision="BATCH_OFFLINE_EVAL",
        requests=nightly_eval.daily_requests,
        usage=nightly_eval.usage,
        contract_passed=True,
        contract_evidence_id="nightly-release-eval@batch-eligible",
        batch=True,
    )
)

spend_by_feature: defaultdict[str, Decimal] = defaultdict(lambda: Decimal("0"))
for trace in traces:
    spend_by_feature[trace.feature] += trace_spend(trace)

total_with_eval = sum(spend_by_feature.values())
trace_contracts_complete = all(
    trace.contract_passed and trace.contract_evidence_id
    for trace in traces
)
assert spend_by_feature["public-policy-answer"] == Decimal("7.776000")
assert actual_daily_spend - sum(trace_spend(trace) for trace in traces[:-1]) == Decimal("0.675000")
assert trace_contracts_complete

for feature in sorted(spend_by_feature):
    print(f"{feature}=${dollars(spend_by_feature[feature])}")
print(f"daily_total_with_eval=${dollars(total_with_eval)}")
print(f"trace_contracts_complete={trace_contracts_complete}")

Output

live-order-answer=$11.03
nightly-release-eval=$2.83
public-policy-answer=$7.78
return-exception-answer=$2.29
daily_total_with_eval=$23.92
trace_contracts_complete=True
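
The same traces can be grouped by decision instead of feature, which shows how much spend each request path carries. The sketch below is self-contained: the rows restate the optimized replay's request counts and per-request costs from the results above, and `spend_by_decision` and `requests_by_decision` are hypothetical names.

```python
from collections import defaultdict
from decimal import Decimal

# (feature, decision, requests, per-request USD) restated from the optimized replay.
rows = [
    ("public-policy-answer", "SEMANTIC_ANSWER_HIT", 3_200, Decimal("0")),
    ("public-policy-answer", "GENERATE_PREFIX_HIT", 1_800, Decimal("0.004320")),
    ("live-order-answer", "GENERATE_LIVE_DATA", 3_000, Decimal("0.003675")),
    ("return-exception-answer", "GENERATE_PREFIX_HIT", 500, Decimal("0.004570")),
    ("nightly-release-eval", "BATCH_OFFLINE_EVAL", 2_000, Decimal("0.0014164")),
]

spend_by_decision: defaultdict[str, Decimal] = defaultdict(lambda: Decimal("0"))
requests_by_decision: defaultdict[str, int] = defaultdict(int)
for _feature, decision, requests, unit_usd in rows:
    spend_by_decision[decision] += unit_usd * requests
    requests_by_decision[decision] += requests

for decision in sorted(spend_by_decision):
    print(f"{decision} requests={requests_by_decision[decision]} spend=${spend_by_decision[decision]:.4f}")
```

Grouped this way, the semantic hits show up as 3,200 requests with zero generation spend, separate from the 2,300 prefix-hit requests that still generate.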

Forecast and gate the release

A cost target without quality is an incentive to produce short, wrong answers. A quality target without cost leaves a production team unable to plan capacity or margins. Require both.

The canary report below evaluates the concise cited-answer policy for return exceptions. The replay's request mix is projected across 30 days, including its nightly offline evaluation run.

09-evaluate-a-release-budget.py

@dataclass(frozen=True)
class QualityReport:
    cited_answer_pass_rate: Decimal
    unsafe_cache_hits: int
    evaluated_answers: int

@dataclass(frozen=True)
class PromotionPolicy:
    monthly_budget_usd: Decimal
    minimum_cited_answer_pass_rate: Decimal
    maximum_unsafe_cache_hits: int

quality = QualityReport(
    cited_answer_pass_rate=Decimal("0.997"),
    unsafe_cache_hits=0,
    evaluated_answers=2_000,
)
promotion_policy = PromotionPolicy(
    monthly_budget_usd=Decimal("750.00"),
    minimum_cited_answer_pass_rate=Decimal("0.995"),
    maximum_unsafe_cache_hits=0,
)
monthly_forecast = total_with_eval * Decimal("30")

quality_passed = (
    trace_contracts_complete
    and quality.cited_answer_pass_rate >= promotion_policy.minimum_cited_answer_pass_rate
    and quality.unsafe_cache_hits <= promotion_policy.maximum_unsafe_cache_hits
)
budget_passed = monthly_forecast <= promotion_policy.monthly_budget_usd

assert dollars(monthly_forecast) == Decimal("717.56")
assert quality_passed and budget_passed

print(f"monthly_forecast=${dollars(monthly_forecast)}")
print(f"quality_passed={quality_passed}")
print(f"budget_passed={budget_passed}")

Output

monthly_forecast=$717.56
quality_passed=True
budget_passed=True
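
It helps to see the gate fail as well as pass. The sketch below wraps the same comparisons in one function and reports which gates failed. `promotion_status` and the list of failed gate names are hypothetical additions; the thresholds and the forecast are the ones used above.

```python
from decimal import Decimal

def promotion_status(
    *,
    monthly_forecast_usd: Decimal,
    monthly_budget_usd: Decimal,
    cited_answer_pass_rate: Decimal,
    minimum_pass_rate: Decimal,
    unsafe_cache_hits: int,
    maximum_unsafe_cache_hits: int = 0,
) -> tuple[str, list[str]]:
    """Return the release status and the names of any failed gates."""
    failed: list[str] = []
    if cited_answer_pass_rate < minimum_pass_rate:
        failed.append("cited_answer_pass_rate")
    if unsafe_cache_hits > maximum_unsafe_cache_hits:
        failed.append("unsafe_cache_hits")
    if monthly_forecast_usd > monthly_budget_usd:
        failed.append("monthly_budget")
    status = "HOLD_RELEASE" if failed else "PROMOTE_COST_POLICY"
    return status, failed

baseline = dict(
    monthly_forecast_usd=Decimal("717.56"),
    monthly_budget_usd=Decimal("750.00"),
    cited_answer_pass_rate=Decimal("0.997"),
    minimum_pass_rate=Decimal("0.995"),
    unsafe_cache_hits=0,
)

print(promotion_status(**baseline))
# One unsafe semantic answer hit holds the release even though the budget passes.
print(promotion_status(**{**baseline, "unsafe_cache_hits": 1}))
# A forecast over budget holds the release even though quality passes.
print(promotion_status(**{**baseline, "monthly_forecast_usd": Decimal("760.00")}))
```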

Produce the next policy input

10-emit-the-budget-contract.py

@dataclass(frozen=True)
class GatewayBudgetContract:
    release_id: str
    required_answer_schema: str
    maximum_generated_answer_usd: Decimal
    monthly_forecast_usd: Decimal
    rate_card_id: str
    status: str

max_generated_cost = max(
    price_usage(trace.usage).total_usd
    for trace in traces
    if trace.usage is not None and not trace.batch
)
status = "PROMOTE_COST_POLICY" if quality_passed and budget_passed else "HOLD_RELEASE"
gateway_contract = GatewayBudgetContract(
    release_id=release_id,
    required_answer_schema="cited-support-answer-v3",
    maximum_generated_answer_usd=max_generated_cost,
    monthly_forecast_usd=dollars(monthly_forecast),
    rate_card_id=rate_card.rate_card_id,
    status=status,
)

assert gateway_contract.status == "PROMOTE_COST_POLICY"
assert gateway_contract.maximum_generated_answer_usd == Decimal("0.004570")

print(f"status={gateway_contract.status}")
print(f"required_schema={gateway_contract.required_answer_schema}")
print(f"max_generated_answer=${gateway_contract.maximum_generated_answer_usd:.6f}")
print("next_step=apply_contract_in_gateway_routing")

Output

status=PROMOTE_COST_POLICY
required_schema=cited-support-answer-v3
max_generated_answer=$0.004570
next_step=apply_contract_in_gateway_routing
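
To hand the contract to another service it has to leave the Python process. The sketch below shows one possible JSON handoff that keeps money as decimal strings rather than floats. The dataclass mirrors the fields above; `to_json`, `from_json`, and the wire format itself are assumptions, not a format the gateway lesson prescribes.

```python
import json
from dataclasses import asdict, dataclass
from decimal import Decimal

@dataclass(frozen=True)
class GatewayBudgetContract:
    release_id: str
    required_answer_schema: str
    maximum_generated_answer_usd: Decimal
    monthly_forecast_usd: Decimal
    rate_card_id: str
    status: str

MONEY_FIELDS = ("maximum_generated_answer_usd", "monthly_forecast_usd")

def to_json(contract: GatewayBudgetContract) -> str:
    """Serialize with Decimal values as strings so no precision is lost."""
    payload = {
        key: str(value) if isinstance(value, Decimal) else value
        for key, value in asdict(contract).items()
    }
    return json.dumps(payload, sort_keys=True)

def from_json(text: str) -> GatewayBudgetContract:
    """Rebuild the contract, parsing money fields back into Decimal."""
    payload = json.loads(text)
    for key in MONEY_FIELDS:
        payload[key] = Decimal(payload[key])
    return GatewayBudgetContract(**payload)

contract = GatewayBudgetContract(
    release_id="support-release-2026-05-cost-v1",
    required_answer_schema="cited-support-answer-v3",
    maximum_generated_answer_usd=Decimal("0.004570"),
    monthly_forecast_usd=Decimal("717.56"),
    rate_card_id="openai-gpt-5.4-short-context-2026-05-31",
    status="PROMOTE_COST_POLICY",
)
wire = to_json(contract)
print(wire)
print(from_json(wire) == contract)
```

Keeping `rate_card_id` in the payload lets the receiving side detect a contract priced against an outdated card.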

What the ledger proves

This release became cheaper through separately evidenced decisions, not through one vague "cache" switch:

Evidence	Correct conclusion
Semantic answer hit with valid release scope	New generation was avoided for an already approved answer class
Provider `cached_tokens` on a generated answer	Repeated prefix processing was charged at the cached-input rate
Concise cited-answer candidate passes evaluation	Output reduction is eligible for release
Nightly evaluation classified as noninteractive	Batch discount can be modeled without changing customer latency
Monthly forecast and quality gate pass	Cost policy can be promoted and handed to gateway design

Mastery check

Key concepts

Version rate cards because provider prices and discount rules change.
Price provider-reported fresh input, cached input, and output separately.
A semantic response hit can skip generation; a prompt-prefix hit can't.
Output reduction is valid only if the answer schema and evidence requirements still pass.
Batch is a workload-class decision: asynchronous evaluation can wait; live support can't.
Store explicit rates for each provider pricing mode instead of assuming every discount composes arithmetically.
A release promotion needs both a quality gate and a budget gate.

Practice tasks

Add a new warranty-claim-answer traffic slice with longer cited output. Decide whether it breaks the monthly budget.
Change the rate card to a newly verified provider/model rate card and rerun the ledger. Explain which request class moves the forecast most.
Add one unsafe semantic answer hit to the quality report. Verify that apparent savings don't allow promotion.
Replace the fixed replay with exported usage traces from an application you control, keeping rate-card identity and release identity explicit.

Evaluation rubric

Foundational: Computes a generated request charge from fresh input, cached input, and output categories.
Foundational: Explains why answer reuse and prefix reuse can't share one cache_hit metric.
Intermediate: Attributes spend by feature and decision, including avoided generation without treating it as actual spend.
Intermediate: Tests an output optimization against a required answer contract.
Advanced: Promotes only when a dated rate card, measured traffic replay, quality report, and budget policy all pass.
Advanced: Hands a budget contract to the routing layer without prematurely implementing routing in this lesson.

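Self-check questions

An answer-cache hit and a provider prefix-cache hit both appear as "hit" in a dashboard. What must change?
A truncated response costs less but drops its cited evidence. Is it a successful output optimization?
Why should a forecast store rate_card_id beside release_id?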

Common pitfalls

Estimating the bill from characters

Calling all reuse a cache hit

Cutting output below correctness

Batching interactive requests

Deriving every pricing mode from one multiplier

Next Step

Continue to Model Gateways, Routing, and Fallbacks

You now have a measured budget contract for each generated support answer. Next you will apply it when selecting lanes and safe fallbacks.


References

OpenAI API Pricing

OpenAI · 2026

Efficient Memory Management for Large Language Model Serving with PagedAttention.

Kwon, W., et al. · 2023 · SOSP 2023

Prompt caching

OpenAI · 2026

OpenAI Batch API Guide

OpenAI · 2026

