
LLM Cost Engineering & Token Economics

Build an auditable LLM cost ledger from usage traces, cache decisions, output contracts, offline batch work, and release budget gates.


Your ShopFlow team just promoted a semantic answer cache for its large language model (LLM) support assistant. Safe public return-policy hits can now return an evaluated answer without generating new text. That doesn't make the bill disappear. Live order checks, exception cases, and cache misses still invoke a model, and nobody can defend the next release with "it should be cheaper."

This lesson builds the missing artifact: a cost ledger for one evaluated support release. It starts from provider-reported usage, accounts for two very different forms of caching, tests output savings against the answer contract, defers only eligible offline work, and produces a promotion decision. The following chapter will turn that budget contract into gateway routing and fallback policy. We won't build a router here.

The accounting boundary

Cost engineering begins after a request decision is known:

| Decision | What executes? | What belongs in this ledger? |
| --- | --- | --- |
| Semantic answer hit from the previous lesson | Stored evaluated answer returned | No generation charge; record avoided generation separately |
| Generated answer with provider prompt-cache hit | Model still produces a new answer | Fresh input, cached input, and output token charges |
| Generated answer without prompt-cache hit | Model reads full input and produces a new answer | Fresh input and output token charges |
| Offline evaluation batch | Model runs later, outside interactive path | Batch-eligible token spend and completion contract |

This boundary matters. A semantic answer cache can skip generation. A provider prompt cache reuses matching prefix work but still generates a new output. Treating both as "cache hits" hides both correctness risk and spend.

Figure: Four measured cost controls for a support release: semantic answer reuse skips generation, prompt-prefix reuse discounts repeated input, concise cited answers reduce output, and batch queues defer offline evaluation.
Start with the request decision already made. Each cost control changes a different charge or workload boundary.

Use a dated rate card, not remembered prices

Provider pricing is configuration data. For this lab, the teaching fixture snapshots OpenAI's GPT-5.4 short-context rates from May 31, 2026: standard input costs $2.50 per million tokens, standard cached input costs $0.25 per million tokens, and standard output costs $15.00 per million tokens. OpenAI's pricing page publishes Batch rates as a separate row, so the ledger stores those values explicitly too. This dated fixture teaches the ledger shape; it isn't a permanent price promise. Refresh every production rate card from current provider pricing documentation before forecasting a release.[1]

The provider-reported usage, not a character-count estimate, is the source of truth after a model call. A provider can expose total input tokens, the cached subset, and output tokens. The fresh-input charge applies to the total input minus the cached subset.

01-rate-card-and-usage.py

    from collections import defaultdict
    from dataclasses import dataclass
    from decimal import Decimal, ROUND_HALF_UP

    MILLION = Decimal("1000000")
    CENT = Decimal("0.01")

    def dollars(amount: Decimal) -> Decimal:
        # Round to whole cents for display only; the ledger keeps full precision.
        return amount.quantize(CENT, rounding=ROUND_HALF_UP)

    @dataclass(frozen=True)
    class TokenRates:
        input_per_1m: Decimal
        cached_input_per_1m: Decimal
        output_per_1m: Decimal

    @dataclass(frozen=True)
    class RateCard:
        rate_card_id: str
        standard: TokenRates
        batch: TokenRates

    @dataclass(frozen=True)
    class Usage:
        input_tokens: int
        cached_input_tokens: int
        output_tokens: int

        def __post_init__(self) -> None:
            # Reject negative counts first so the error message names the real problem.
            if min(self.input_tokens, self.cached_input_tokens, self.output_tokens) < 0:
                raise ValueError("token counts cannot be negative")
            if self.cached_input_tokens > self.input_tokens:
                raise ValueError("cached input must be a subset of input tokens")

    # Dated teaching fixture; refresh from provider documentation before forecasting.
    rate_card = RateCard(
        rate_card_id="openai-gpt-5.4-short-context-2026-05-31",
        standard=TokenRates(
            input_per_1m=Decimal("2.50"),
            cached_input_per_1m=Decimal("0.25"),
            output_per_1m=Decimal("15.00"),
        ),
        batch=TokenRates(
            input_per_1m=Decimal("1.25"),
            cached_input_per_1m=Decimal("0.13"),
            output_per_1m=Decimal("7.50"),
        ),
    )

    print(rate_card.rate_card_id)
    print(f"standard_input_per_1m=${rate_card.standard.input_per_1m}")
    print(f"standard_cached_input_per_1m=${rate_card.standard.cached_input_per_1m}")
    print(f"standard_output_per_1m=${rate_card.standard.output_per_1m}")
    print(f"batch_input_per_1m=${rate_card.batch.input_per_1m}")
    print(f"batch_cached_input_per_1m=${rate_card.batch.cached_input_per_1m}")
    print(f"batch_output_per_1m=${rate_card.batch.output_per_1m}")

Output

    openai-gpt-5.4-short-context-2026-05-31
    standard_input_per_1m=$2.50
    standard_cached_input_per_1m=$0.25
    standard_output_per_1m=$15.00
    batch_input_per_1m=$1.25
    batch_cached_input_per_1m=$0.13
    batch_output_per_1m=$7.50
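
Before any pricing happens, the raw provider payload has to be split into those three categories. The following is a minimal sketch, assuming a Chat Completions-style usage dictionary with `prompt_tokens`, `completion_tokens`, and an optional `prompt_tokens_details.cached_tokens` field. The `BillableTokens` type, the `billable_from_usage` helper, and the sample payload are hypothetical names and values introduced for illustration; other providers and endpoints may name these fields differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BillableTokens:
    fresh_input: int
    cached_input: int
    output: int


def billable_from_usage(usage: dict) -> BillableTokens:
    """Split a provider usage payload into three billable categories."""
    total_input = usage["prompt_tokens"]
    # The cached subset may be missing, for example on short prompts.
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    if not 0 <= cached <= total_input:
        raise ValueError("cached tokens must be a subset of input tokens")
    return BillableTokens(
        fresh_input=total_input - cached,
        cached_input=cached,
        output=usage["completion_tokens"],
    )


# Hypothetical payload; the numbers are illustrative only.
sample = {
    "prompt_tokens": 1800,
    "completion_tokens": 180,
    "prompt_tokens_details": {"cached_tokens": 1280},
}
tokens = billable_from_usage(sample)
uncached = billable_from_usage({"prompt_tokens": 900, "completion_tokens": 95})
print(tokens)
print(uncached)
```

Treating a missing cached-token field as zero is a conservative choice in this sketch: the request is priced entirely at the fresh-input rate until the provider reports otherwise.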

Price one generated answer

ShopFlow's public-policy-answer path uses a long evaluated instruction prefix and cites policy evidence in every generated response. A cold request reports 1,800 input tokens and 180 output tokens. Split each billable category rather than multiplying one blended token count by one price.

$$
\begin{aligned}
\text{cost} = &\frac{\text{fresh input tokens} \times \text{fresh input price}}{1{,}000{,}000} \\
&+ \frac{\text{cached input tokens} \times \text{cached input price}}{1{,}000{,}000} \\
&+ \frac{\text{output tokens} \times \text{output price}}{1{,}000{,}000}
\end{aligned}
$$

02-price-one-generated-answer.py

    @dataclass(frozen=True)
    class PricedUsage:
        fresh_input_usd: Decimal
        cached_input_usd: Decimal
        output_usd: Decimal

        @property
        def total_usd(self) -> Decimal:
            return self.fresh_input_usd + self.cached_input_usd + self.output_usd

    def price_usage(usage: Usage, card: RateCard = rate_card, *, batch: bool = False) -> PricedUsage:
        rates = card.batch if batch else card.standard
        # Only the uncached remainder of the input pays the full input rate.
        fresh_input_tokens = usage.input_tokens - usage.cached_input_tokens
        return PricedUsage(
            fresh_input_usd=Decimal(fresh_input_tokens) * rates.input_per_1m / MILLION,
            cached_input_usd=Decimal(usage.cached_input_tokens) * rates.cached_input_per_1m / MILLION,
            output_usd=Decimal(usage.output_tokens) * rates.output_per_1m / MILLION,
        )

    cold_policy_answer = Usage(input_tokens=1_800, cached_input_tokens=0, output_tokens=180)
    cold_cost = price_usage(cold_policy_answer)

    assert cold_cost.total_usd == Decimal("0.0072")
    assert rate_card.standard.output_per_1m > rate_card.standard.input_per_1m

    print(f"fresh_input=${cold_cost.fresh_input_usd:.6f}")
    print(f"output=${cold_cost.output_usd:.6f}")
    print(f"total=${cold_cost.total_usd:.6f}")
    print(f"output_share={(cold_cost.output_usd / cold_cost.total_usd):.1%}")

Output

    fresh_input=$0.004500
    output=$0.002700
    total=$0.007200
    output_share=37.5%

The output token rate is six times the input rate in this rate card, yet this long-input request still spends more dollars on input than on output. That distinction is why both token volume and per-category rate belong in the ledger. Decode is autoregressive and serving systems manage growing KV state during generation; those mechanisms help explain why output can be priced differently, while the current rate card remains the billing authority.[2][1]
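
To make the volume-versus-rate point concrete, the sketch below computes the output length at which output spend would match uncached input spend under the lesson's dated standard rates. The helper name `break_even_output_tokens` is hypothetical, and the calculation assumes no cached input.

```python
from decimal import Decimal

# Standard rates from the lesson's dated teaching fixture (USD per 1M tokens).
INPUT_PER_1M = Decimal("2.50")
OUTPUT_PER_1M = Decimal("15.00")


def break_even_output_tokens(input_tokens: int) -> Decimal:
    """Output tokens at which output spend equals uncached input spend."""
    return Decimal(input_tokens) * INPUT_PER_1M / OUTPUT_PER_1M


rate_ratio = OUTPUT_PER_1M / INPUT_PER_1M
break_even = break_even_output_tokens(1_800)
input_dominates = Decimal(180) < break_even

print(f"output_to_input_rate_ratio={rate_ratio}")
print(f"break_even_output_tokens={break_even}")
print(f"input_dominates_at_180_output_tokens={input_dominates}")
```

Under these assumptions, the cold 1,800-token request would need about 300 output tokens before output became the larger charge, which is why the 180-token answer is still input-dominated.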

Prefix reuse is still a generated answer

OpenAI's documented prompt caching applies automatically for prompts of at least 1,024 tokens, depends on exact matching at the beginning of the prompt, and reports cached prefix tokens in usage.prompt_tokens_details.cached_tokens.[3] That gives the application one actionable design rule: put stable instructions and shared evidence first, then dynamic customer data.

A prefix hit isn't a stored response. The support model still answers the current question and still incurs output spend.
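
The ordering rule can be sketched without calling any provider. The example below compares how much leading text two requests share under a stable-first layout and a dynamic-first layout. It measures characters as a rough proxy for tokens, which is an assumption: real prompt caches match on tokens and apply their own minimum lengths. `STABLE_PREFIX`, the helper names, and the request contents are hypothetical placeholders.

```python
# Hypothetical shared instructions and evidence; real prefixes are much longer.
STABLE_PREFIX = (
    "SYSTEM: Answer ShopFlow support questions and cite policy evidence.\n"
    "EVIDENCE: <shared return-policy text goes here>\n"
)


def stable_first(customer_context: str, question: str) -> str:
    """Cache-friendly order: shared instructions and evidence lead the prompt."""
    return f"{STABLE_PREFIX}{customer_context}\n{question}"


def dynamic_first(customer_context: str, question: str) -> str:
    """Cache-hostile order: per-customer data changes the start of the prompt."""
    return f"{customer_context}\n{STABLE_PREFIX}{question}"


def shared_prefix_chars(a: str, b: str) -> int:
    """Length of the common leading run of characters (a proxy for tokens)."""
    count = 0
    for left, right in zip(a, b):
        if left != right:
            break
        count += 1
    return count


request_a = ("CUSTOMER: order 1001", "Can I return this?")
request_b = ("CUSTOMER: order 2002", "Is my refund late?")

good = shared_prefix_chars(stable_first(*request_a), stable_first(*request_b))
bad = shared_prefix_chars(dynamic_first(*request_a), dynamic_first(*request_b))
print(f"stable_first_shared_chars={good}")
print(f"dynamic_first_shared_chars={bad}")
```

With the stable-first layout, the whole shared prefix matches across customers; with the dynamic-first layout, the match ends at the first differing customer detail, so little can be reused.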

03-account-for-prefix-reuse.py

    prefix_cached_policy_answer = Usage(
        input_tokens=1_800,
        cached_input_tokens=1_280,
        output_tokens=180,
    )
    prefix_cost = price_usage(prefix_cached_policy_answer)
    savings = cold_cost.total_usd - prefix_cost.total_usd

    assert prefix_cost.total_usd == Decimal("0.004320")
    # Prefix reuse discounts input only; the output charge is unchanged.
    assert prefix_cost.output_usd == cold_cost.output_usd
    assert savings == Decimal("0.002880")

    print(f"cold_generation=${cold_cost.total_usd:.6f}")
    print(f"prefix_cached_generation=${prefix_cost.total_usd:.6f}")
    print(f"saved_per_generated_answer=${savings:.6f}")
    print(f"still_generated={prefix_cost.output_usd > 0}")

Output

    cold_generation=$0.007200
    prefix_cached_generation=$0.004320
    saved_per_generated_answer=$0.002880
    still_generated=True

Carry forward the semantic-cache decision

The preceding lesson promoted semantic response reuse only for stable public-policy answers under one evaluated release scope. This lesson consumes that decision; it doesn't rebuild its index or threshold. A stored-answer hit records zero generated tokens, while a prompt-prefix hit records a newly generated answer at a reduced input cost.

04-separate-answer-hits-from-prefix-hits.py

    @dataclass(frozen=True)
    class RequestOutcome:
        feature: str
        decision: str
        usage: Usage | None
        counterfactual_usage: Usage | None = None

    def generated_cost(outcome: RequestOutcome) -> Decimal:
        # A stored-answer hit has no usage, so it carries no generation charge.
        if outcome.usage is None:
            return Decimal("0")
        return price_usage(outcome.usage).total_usd

    semantic_hit = RequestOutcome(
        feature="public-policy-answer",
        decision="SEMANTIC_ANSWER_HIT",
        usage=None,
        counterfactual_usage=prefix_cached_policy_answer,
    )
    prefix_hit = RequestOutcome(
        feature="public-policy-answer",
        decision="GENERATE_PREFIX_HIT",
        usage=prefix_cached_policy_answer,
    )

    avoided_generation = price_usage(semantic_hit.counterfactual_usage).total_usd
    assert generated_cost(semantic_hit) == Decimal("0")
    assert generated_cost(prefix_hit) == Decimal("0.004320")

    print(f"semantic_hit_generation=${generated_cost(semantic_hit):.6f}")
    print(f"semantic_hit_avoided=${avoided_generation:.6f}")
    print(f"prefix_hit_generation=${generated_cost(prefix_hit):.6f}")

Output

    semantic_hit_generation=$0.000000
    semantic_hit_avoided=$0.004320
    prefix_hit_generation=$0.004320

Figure: Cost ledger flow from request decision through provider usage categories to release spend and avoided-generation reporting.
The ledger records actual charges separately from counterfactual avoided generation. A cheaper release must still satisfy the evaluated answer contract.

Replay a day of traffic

A per-request example is useful, but a release decision needs traffic shape. Consider a ShopFlow baseline replay day, taken before testing a shorter exception answer:

  • 3,200 stable policy questions safely hit the semantic answer cache.
  • 1,800 policy questions still generate, but reuse the evaluated prompt prefix.
  • 3,000 live-order questions must generate from current tool evidence.
  • 500 return exceptions generate a longer cited answer with the stable prefix.

The semantic-hit counterfactual answers "what cost did safe reuse avoid?" Actual spend answers "what appears on this release ledger?"

05-price-a-traffic-replay.py

    @dataclass(frozen=True)
    class TrafficSlice:
        outcome: RequestOutcome
        requests: int

    daily_replay = [
        TrafficSlice(semantic_hit, requests=3_200),
        TrafficSlice(prefix_hit, requests=1_800),
        TrafficSlice(
            RequestOutcome(
                feature="live-order-answer",
                decision="GENERATE_LIVE_DATA",
                usage=Usage(input_tokens=900, cached_input_tokens=0, output_tokens=95),
            ),
            requests=3_000,
        ),
        TrafficSlice(
            RequestOutcome(
                feature="return-exception-answer",
                decision="GENERATE_PREFIX_HIT",
                usage=Usage(input_tokens=2_200, cached_input_tokens=1_280, output_tokens=220),
            ),
            requests=500,
        ),
    ]

    # Actual spend: only requests that invoked the model.
    actual_daily_spend = sum(
        generated_cost(slice_.outcome) * slice_.requests
        for slice_ in daily_replay
    )
    # Counterfactual: what semantic hits would have cost had they generated.
    avoided_daily_generation = sum(
        price_usage(slice_.outcome.counterfactual_usage).total_usd * slice_.requests
        for slice_ in daily_replay
        if slice_.outcome.counterfactual_usage is not None
    )

    assert actual_daily_spend == Decimal("21.761000")
    assert avoided_daily_generation == Decimal("13.824000")

    print(f"generated_daily_spend=${dollars(actual_daily_spend)}")
    print(f"semantic_hit_avoided_daily=${dollars(avoided_daily_generation)}")
    print(f"request_count={sum(slice_.requests for slice_ in daily_replay)}")

Output

    generated_daily_spend=$21.76
    semantic_hit_avoided_daily=$13.82
    request_count=8500

Avoided generation is useful evidence, but it isn't a refund and shouldn't be mixed into actual spend. If an answer hit later fails review, the quality failure takes precedence over its apparent savings.

Reduce output only under an answer contract

Output reductions can be valuable in a rate card where output is expensive. Blind truncation isn't a cost strategy. ShopFlow's cited support answer must state eligibility, cite the policy evidence, and name the customer's next step. Compare three candidate formats generated on the same input:

| Candidate | Intended change | Safe for automatic support? |
| --- | --- | --- |
| Verbose | Full explanation with repetition | Correct but unnecessarily long |
| Concise cited | One decision, one citation, one next step | Candidate for release |
| Truncated | Hard stop before citation and action | Reject |

06-optimize-output-under-contract.py

    @dataclass(frozen=True)
    class AnswerCandidate:
        label: str
        output_tokens: int
        states_decision: bool
        cites_policy: bool
        gives_next_step: bool

    def satisfies_answer_contract(candidate: AnswerCandidate) -> bool:
        return (
            candidate.states_decision
            and candidate.cites_policy
            and candidate.gives_next_step
        )

    candidates = [
        AnswerCandidate("verbose", 220, True, True, True),
        AnswerCandidate("concise_cited", 130, True, True, True),
        AnswerCandidate("truncated", 55, True, False, False),
    ]
    # Filter by the contract first, then pick the shortest surviving candidate.
    safe_candidates = [candidate for candidate in candidates if satisfies_answer_contract(candidate)]
    selected_answer = min(safe_candidates, key=lambda candidate: candidate.output_tokens)

    verbose_usage = Usage(input_tokens=2_200, cached_input_tokens=1_280, output_tokens=220)
    concise_usage = Usage(input_tokens=2_200, cached_input_tokens=1_280, output_tokens=selected_answer.output_tokens)
    saved_per_exception = price_usage(verbose_usage).total_usd - price_usage(concise_usage).total_usd

    assert selected_answer.label == "concise_cited"
    assert not satisfies_answer_contract(candidates[-1])
    assert saved_per_exception == Decimal("0.001350")

    print(f"selected={selected_answer.label}")
    print(f"rejected={candidates[-1].label}")
    print(f"saved_per_exception=${saved_per_exception:.6f}")
    print(f"saved_per_replay_day=${dollars(saved_per_exception * 500)}")

Output

    selected=concise_cited
    rejected=truncated
    saved_per_exception=$0.001350
    saved_per_replay_day=$0.68

The cheapest candidate is rejected. Cost reductions count only after the output still meets the same correctness and evidence contract.

Batch only work that can wait

OpenAI's Batch documentation describes a separate asynchronous workload path with a 50% discount compared with synchronous APIs and a 24-hour completion window.[4] The pricing page publishes Batch token rates explicitly, so the ledger reads that row instead of deriving every category from one multiplier.[1] That path is useful for nightly regression evaluation of the support release. It isn't appropriate for a customer waiting in a live conversation.

07-separate-interactive-and-batch-work.py

    @dataclass(frozen=True)
    class Workload:
        name: str
        interactive: bool
        deadline_hours: int
        endpoint_supported: bool
        daily_requests: int
        usage: Usage

    def batch_eligible(workload: Workload) -> bool:
        # All three conditions must hold; a discount alone never qualifies work.
        return (
            not workload.interactive
            and workload.deadline_hours >= 24
            and workload.endpoint_supported
        )

    live_support = Workload(
        name="live-support-answer",
        interactive=True,
        deadline_hours=0,
        endpoint_supported=True,
        daily_requests=3_000,
        usage=Usage(input_tokens=900, cached_input_tokens=0, output_tokens=95),
    )
    nightly_eval = Workload(
        name="nightly-release-eval",
        interactive=False,
        deadline_hours=24,
        endpoint_supported=True,
        daily_requests=2_000,
        usage=Usage(input_tokens=1_800, cached_input_tokens=1_280, output_tokens=80),
    )

    nightly_sync_cost = price_usage(nightly_eval.usage).total_usd * nightly_eval.daily_requests
    nightly_batch_cost = price_usage(nightly_eval.usage, batch=True).total_usd * nightly_eval.daily_requests

    assert not batch_eligible(live_support)
    assert batch_eligible(nightly_eval)
    assert nightly_batch_cost == Decimal("2.832800")
    assert nightly_batch_cost < nightly_sync_cost

    print(f"live_support_batchable={batch_eligible(live_support)}")
    print(f"nightly_eval_batchable={batch_eligible(nightly_eval)}")
    print(f"nightly_eval_sync=${dollars(nightly_sync_cost)}")
    print(f"nightly_eval_batch=${dollars(nightly_batch_cost)}")

Output

    live_support_batchable=False
    nightly_eval_batchable=True
    nightly_eval_sync=$5.64
    nightly_eval_batch=$2.83

Attribute cost to decisions, not just models

An invoice tells you how much you spent. A release trace tells you why. Each recorded row should name the evaluated release, feature, decision, pricing mode, token usage, and output-contract evidence. Recording only a model name can't distinguish an unsafe answer-cache promotion from a harmless increase in live-order traffic. Make contract evidence mandatory on every trace instead of defaulting it to a pass.
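
One way to make that requirement structural is to give the evidence fields no default and to reject blank identifiers at construction time. This is a minimal sketch rather than the lesson's ledger type: `EvidencedTrace` and `try_trace` are hypothetical names, and the sketch checks only that evidence is present, not that the referenced evaluation exists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidencedTrace:
    # No field has a default, so omitting evidence fails at construction time.
    release_id: str
    feature: str
    decision: str
    contract_passed: bool
    contract_evidence_id: str

    def __post_init__(self) -> None:
        if not self.contract_evidence_id.strip():
            raise ValueError("contract_evidence_id is required on every trace")


def try_trace(**fields) -> str:
    """Return 'accepted' or the kind of error that rejected the trace row."""
    try:
        EvidencedTrace(**fields)
    except (TypeError, ValueError) as exc:
        return f"rejected: {type(exc).__name__}"
    return "accepted"


base = dict(
    release_id="support-release-2026-05-cost-v1",
    feature="live-order-answer",
    decision="GENERATE_LIVE_DATA",
    contract_passed=True,
)
with_evidence = try_trace(**base, contract_evidence_id="cited-support-answer-v3@canary")
blank_evidence = try_trace(**base, contract_evidence_id="")
missing_evidence = try_trace(**base)

print(with_evidence)
print(blank_evidence)
print(missing_evidence)
```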

08-build-feature-cost-telemetry.py

    @dataclass(frozen=True)
    class LedgerTrace:
        release_id: str
        feature: str
        decision: str
        requests: int
        usage: Usage | None
        contract_passed: bool
        contract_evidence_id: str
        batch: bool = False

    def trace_spend(trace: LedgerTrace) -> Decimal:
        if trace.usage is None:
            return Decimal("0")
        return price_usage(trace.usage, batch=trace.batch).total_usd * trace.requests

    release_id = "support-release-2026-05-cost-v1"
    # Same traffic as the baseline replay, with the concise answer for return exceptions.
    optimized_replay = [
        TrafficSlice(
            RequestOutcome(
                feature=item.outcome.feature,
                decision=item.outcome.decision,
                usage=concise_usage if item.outcome.feature == "return-exception-answer" else item.outcome.usage,
                counterfactual_usage=item.outcome.counterfactual_usage,
            ),
            requests=item.requests,
        )
        for item in daily_replay
    ]
    traces = [
        LedgerTrace(
            release_id=release_id,
            feature=item.outcome.feature,
            decision=item.outcome.decision,
            requests=item.requests,
            usage=item.outcome.usage,
            contract_passed=True,
            contract_evidence_id=(
                "approved-public-policy-cache@v1"
                if item.outcome.decision == "SEMANTIC_ANSWER_HIT"
                else "cited-support-answer-v3@canary"
            ),
        )
        for item in optimized_replay
    ]
    traces.append(
        LedgerTrace(
            release_id=release_id,
            feature=nightly_eval.name,
            decision="BATCH_OFFLINE_EVAL",
            requests=nightly_eval.daily_requests,
            usage=nightly_eval.usage,
            contract_passed=True,
            contract_evidence_id="nightly-release-eval@batch-eligible",
            batch=True,
        )
    )

    spend_by_feature: defaultdict[str, Decimal] = defaultdict(lambda: Decimal("0"))
    for trace in traces:
        spend_by_feature[trace.feature] += trace_spend(trace)

    total_with_eval = sum(spend_by_feature.values())
    trace_contracts_complete = all(
        trace.contract_passed and trace.contract_evidence_id
        for trace in traces
    )
    assert spend_by_feature["public-policy-answer"] == Decimal("7.776000")
    # traces[:-1] excludes the nightly batch row, leaving only interactive traffic.
    assert actual_daily_spend - sum(trace_spend(trace) for trace in traces[:-1]) == Decimal("0.675000")
    assert trace_contracts_complete

    for feature in sorted(spend_by_feature):
        print(f"{feature}=${dollars(spend_by_feature[feature])}")
    print(f"daily_total_with_eval=${dollars(total_with_eval)}")
    print(f"trace_contracts_complete={trace_contracts_complete}")

Output

    live-order-answer=$11.03
    nightly-release-eval=$2.83
    public-policy-answer=$7.78
    return-exception-answer=$2.29
    daily_total_with_eval=$23.92
    trace_contracts_complete=True

Figure: Release decision flow: ingest provider usage traces, price them with a versioned rate card, evaluate answer quality and budget gates, then promote, hold, or investigate.
Budget is one promotion gate. The evaluated answer contract is another, and it can't be waived by savings.

Forecast and gate the release

A cost target without quality is an incentive to produce short, wrong answers. A quality target without cost leaves a production team unable to plan capacity or margins. Require both.

The canary report below evaluates the concise cited-answer policy for return exceptions. The replay's request mix is projected across 30 days, including its nightly offline evaluation run.

09-evaluate-a-release-budget.py

    @dataclass(frozen=True)
    class QualityReport:
        cited_answer_pass_rate: Decimal
        unsafe_cache_hits: int
        evaluated_answers: int

    @dataclass(frozen=True)
    class PromotionPolicy:
        monthly_budget_usd: Decimal
        minimum_cited_answer_pass_rate: Decimal
        maximum_unsafe_cache_hits: int

    quality = QualityReport(
        cited_answer_pass_rate=Decimal("0.997"),
        unsafe_cache_hits=0,
        evaluated_answers=2_000,
    )
    promotion_policy = PromotionPolicy(
        monthly_budget_usd=Decimal("750.00"),
        minimum_cited_answer_pass_rate=Decimal("0.995"),
        maximum_unsafe_cache_hits=0,
    )
    # Project the replay day, including its nightly evaluation run, across 30 days.
    monthly_forecast = total_with_eval * Decimal("30")

    quality_passed = (
        trace_contracts_complete
        and quality.cited_answer_pass_rate >= promotion_policy.minimum_cited_answer_pass_rate
        and quality.unsafe_cache_hits <= promotion_policy.maximum_unsafe_cache_hits
    )
    budget_passed = monthly_forecast <= promotion_policy.monthly_budget_usd

    assert dollars(monthly_forecast) == Decimal("717.56")
    assert quality_passed and budget_passed

    print(f"monthly_forecast=${dollars(monthly_forecast)}")
    print(f"quality_passed={quality_passed}")
    print(f"budget_passed={budget_passed}")

Output

    monthly_forecast=$717.56
    quality_passed=True
    budget_passed=True

Figure: Two release promotion gates: answer quality must remain within the evaluated contract and forecast spend must remain within the approved budget.
A cost optimization is promotable only when both the quality contract and budget forecast pass on evidence.
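
A passing budget gate says little about how close the release sits to its limit. The sketch below estimates headroom from the lesson's unrounded daily total of $23.9188 (displayed as $23.92) and its $750.00 monthly budget. It assumes spend scales linearly with traffic, which overstates the effect of growth on the fixed-size nightly evaluation; `budget_headroom` is a hypothetical helper.

```python
from decimal import Decimal


def budget_headroom(daily_spend: Decimal, monthly_budget: Decimal, days: int = 30):
    """Return (forecast, headroom in USD, tolerable proportional spend growth)."""
    forecast = daily_spend * days
    headroom = monthly_budget - forecast
    growth = headroom / forecast
    return forecast, headroom, growth


forecast, headroom, growth = budget_headroom(Decimal("23.9188"), Decimal("750.00"))

print(f"forecast=${forecast:.2f}")
print(f"headroom=${headroom:.2f}")
print(f"tolerable_growth={growth:.1%}")
```

Under these assumptions the release has roughly $32 of monthly headroom, so spend growth of about 4.5% would be enough to fail the gate. That margin is worth recording next to the pass/fail result.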

Produce the next policy input

The next lesson will implement model gateway routing and fallbacks. Give it a budget contract, not a half-built router embedded in cost analysis. The contract records what a future route must preserve: citation schema, maximum generated-answer spend, and the release evidence behind the limit.

10-emit-the-budget-contract.py

    @dataclass(frozen=True)
    class GatewayBudgetContract:
        release_id: str
        required_answer_schema: str
        maximum_generated_answer_usd: Decimal
        monthly_forecast_usd: Decimal
        rate_card_id: str
        status: str

    # The per-answer ceiling comes from the most expensive interactive generated answer.
    max_generated_cost = max(
        price_usage(trace.usage).total_usd
        for trace in traces
        if trace.usage is not None and not trace.batch
    )
    status = "PROMOTE_COST_POLICY" if quality_passed and budget_passed else "HOLD_RELEASE"
    gateway_contract = GatewayBudgetContract(
        release_id=release_id,
        required_answer_schema="cited-support-answer-v3",
        maximum_generated_answer_usd=max_generated_cost,
        monthly_forecast_usd=dollars(monthly_forecast),
        rate_card_id=rate_card.rate_card_id,
        status=status,
    )

    assert gateway_contract.status == "PROMOTE_COST_POLICY"
    assert gateway_contract.maximum_generated_answer_usd == Decimal("0.004570")

    print(f"status={gateway_contract.status}")
    print(f"required_schema={gateway_contract.required_answer_schema}")
    print(f"max_generated_answer=${gateway_contract.maximum_generated_answer_usd:.6f}")
    print("next_step=apply_contract_in_gateway_routing")

Output

    status=PROMOTE_COST_POLICY
    required_schema=cited-support-answer-v3
    max_generated_answer=$0.004570
    next_step=apply_contract_in_gateway_routing

What the ledger proves

This release did not become cheaper because of one vague "cache" switch:

| Evidence | Correct conclusion |
| --- | --- |
| Semantic answer hit with valid release scope | New generation was avoided for an already approved answer class |
| Provider cached_tokens on a generated answer | Repeated prefix processing was charged at the cached-input rate |
| Concise cited-answer candidate passes evaluation | Output reduction is eligible for release |
| Nightly evaluation classified as noninteractive | Batch discount can be modeled without changing customer latency |
| Monthly forecast and quality gate pass | Cost policy can be promoted and handed to gateway design |

Don't infer a pricing benefit from a prompt rewrite until usage traces show cached tokens. Don't count semantic-cache savings unless accepted-hit quality remains valid. Don't approve a shorter output merely because it costs less.
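
The second rule can be checked with a small gate function. The sketch below reuses the lesson's thresholds and status strings, but `promotion_status` is a hypothetical helper and the lower forecast in the second case is an invented figure, chosen only to show that a cheaper forecast cannot offset an unsafe cache hit.

```python
from decimal import Decimal


def promotion_status(
    pass_rate: Decimal,
    unsafe_cache_hits: int,
    monthly_forecast: Decimal,
    *,
    min_pass_rate: Decimal = Decimal("0.995"),
    max_unsafe_cache_hits: int = 0,
    monthly_budget: Decimal = Decimal("750.00"),
) -> str:
    """Both gates must pass; savings never compensate for a quality failure."""
    quality_ok = pass_rate >= min_pass_rate and unsafe_cache_hits <= max_unsafe_cache_hits
    budget_ok = monthly_forecast <= monthly_budget
    return "PROMOTE_COST_POLICY" if quality_ok and budget_ok else "HOLD_RELEASE"


clean_release = promotion_status(Decimal("0.997"), 0, Decimal("717.56"))
# Hypothetical: one unsafe semantic hit, even with a much lower forecast.
cheaper_but_unsafe = promotion_status(Decimal("0.997"), 1, Decimal("500.00"))

print(f"clean_release={clean_release}")
print(f"cheaper_but_unsafe={cheaper_but_unsafe}")
```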

Mastery check

Key concepts

  • Version rate cards because provider prices and discount rules change.
  • Price provider-reported fresh input, cached input, and output separately.
  • A semantic response hit can skip generation; a prompt-prefix hit can't.
  • Output reduction is valid only after the answer schema and evidence requirements still pass.
  • Batch is a workload-class decision: asynchronous evaluation can wait; live support can't.
  • Store explicit rates for each provider pricing mode instead of assuming every discount composes arithmetically.
  • A release promotion needs both a quality gate and a budget gate.

Practice tasks

  1. Add a new warranty-claim-answer traffic slice with longer cited output. Decide whether it breaks the monthly budget.
  2. Change the rate card to a newly verified provider/model rate card and rerun the ledger. Explain which request class moves the forecast most.
  3. Add one unsafe semantic answer hit to the quality report. Verify that apparent savings don't allow promotion.
  4. Replace the fixed replay with exported usage traces from an application you control, keeping rate-card identity and release identity explicit.

Evaluation rubric

  • Foundational: Computes a generated request charge from fresh input, cached input, and output categories.
  • Foundational: Explains why answer reuse and prefix reuse can't share one cache_hit metric.
  • Intermediate: Attributes spend by feature and decision, including avoided generation without treating it as actual spend.
  • Intermediate: Tests an output optimization against a required answer contract.
  • Advanced: Promotes only when a dated rate card, measured traffic replay, quality report, and budget policy all pass.
  • Advanced: Hands a budget contract to the routing layer without prematurely implementing routing in this lesson.

Common pitfalls

Estimating the bill from characters

Symptom: Forecast misses invoice even when request count is correct.
Cause: Character length or visible response length was treated as provider usage.
Fix: Store billable usage categories returned by the provider and price them with a versioned rate card.

Calling all reuse a cache hit

Symptom: Team believes generations were skipped, but output spend remains high.
Cause: Prefix-cache reuse was confused with stored-answer reuse.
Fix: Record separate decisions and cost semantics for each layer.

Cutting output below correctness

Symptom: Spend falls while support answers omit evidence or next actions.
Cause: Output cap was promoted without evaluating the required answer schema.
Fix: Make the quality contract a hard promotion gate.

Batching interactive requests

Symptom: Forecast looks excellent while customer response latency becomes unacceptable.
Cause: A discount was modeled without preserving completion-time requirements.
Fix: Batch only explicitly asynchronous work such as nightly evaluation.

Deriving every pricing mode from one multiplier

Symptom: Forecast drifts from provider billing even though token counts are correct.
Cause: Standard, Batch, or other processing-mode prices were inferred from one discount instead of loaded from the provider's dated table.
Fix: Version explicit rates for every pricing mode used by the release and refresh them from current provider documentation.
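
The lesson's own fixture shows how this drift arises. The sketch below compares the explicit Batch cached-input rate from the dated rate card with a rate derived by halving the standard one, then applies both to the nightly evaluation's cached tokens (1,280 per request across 2,000 requests). The variable names are hypothetical and the rates are the teaching fixture's values, not current prices.

```python
from decimal import Decimal

MILLION = Decimal("1000000")

# Values from the lesson's dated teaching fixture (USD per 1M tokens).
STANDARD_CACHED_INPUT_PER_1M = Decimal("0.25")
BATCH_CACHED_INPUT_PER_1M = Decimal("0.13")  # explicit published row

# Assumption under test: every Batch rate is exactly half the standard rate.
derived_batch_cached_per_1m = STANDARD_CACHED_INPUT_PER_1M * Decimal("0.5")

cached_tokens_per_day = 1_280 * 2_000  # nightly evaluation workload
explicit_cost = Decimal(cached_tokens_per_day) * BATCH_CACHED_INPUT_PER_1M / MILLION
derived_cost = Decimal(cached_tokens_per_day) * derived_batch_cached_per_1m / MILLION
daily_drift = explicit_cost - derived_cost

print(f"derived_rate_per_1m=${derived_batch_cached_per_1m}")
print(f"explicit_daily=${explicit_cost:.4f}")
print(f"derived_daily=${derived_cost:.4f}")
print(f"daily_drift=${daily_drift:.4f}")
```

The gap is small for this workload, but it grows with cached-token volume and makes a forecast disagree with the invoice even when every token count is right.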

Next Step
Continue to Model Gateways, Routing, and Fallbacks

You now have a measured budget contract for each generated support answer. Next you will apply it when selecting lanes and safe fallbacks.

References

[1] OpenAI. OpenAI API Pricing. 2026.
[2] Kwon, W., et al. Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023.
[3] OpenAI. Prompt caching. 2026.
[4] OpenAI. OpenAI Batch API Guide. 2026.