Model Gateways, Routing, and Fallbacks

Turn an audited cost contract into a model gateway that preserves data, schema, review, and budget requirements across routing and fallback.


Your CodeAssist developer assistant can price each generated answer. The budget contract says an evaluated answer may cost at most 0.004570 USD and must still keep its evidence. A production app faces the harder routing question: which large language model (LLM) lane may answer each request, and what happens when that lane is unavailable?

A model gateway makes that decision. The app submits one request and one set of requirements. The gateway selects a compatible lane, invokes a model adapter, records why it chose that lane, and either finds a compatible fallback or escalates safely after failure.

The important word is compatible. A private, high-risk break-glass access request needs privacy, human review, cited structured output, and a budget ceiling together. A small gateway can enforce that full contract so code can't discard one requirement while satisfying another.

Figure: Gateway routing flow that hard-filters lanes by contract, keeps compatible private lanes, then ranks survivors by measured cost.
The gateway first filters on the full contract. Only three private cited-review lanes survive; measured answer cost then ranks primary ahead of local and regional.

What belongs at the gateway boundary

In the previous lesson, the cost ledger measured model generation after cache decisions had already been made. The gateway consumes that budget contract alongside product and safety requirements. It doesn't decide whether a stored semantic-cache answer is trustworthy; it decides where a request that still requires generation may run.

Use three terms precisely:

| Term | Job | Example here |
| --- | --- | --- |
| Adapter | Translates a stable application request into one provider's API shape | Send messages and parse a structured reply |
| Lane | Names a model path with measured capabilities and cost | local-private-cited-review |
| Gateway | Compiles requirements, filters lanes, selects or falls back, and logs the decision | Reject a cheap lane that drops citations |

An OpenAI-shaped adapter isn't proof that two lanes have the same contract. Anthropic's OpenAI SDK compatibility documentation describes that path as a comparison/testing option rather than its usual production path, and documents limitations including unsupported prompt caching and an ignored strict tool-calling parameter. Native structured outputs are required for guaranteed schema conformance there.[1] A gateway therefore records capabilities explicitly instead of assuming interface similarity means behavior similarity.
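One way to record capabilities explicitly is a small per-lane record that the gateway checks by name. The sketch below is illustrative only: `AdapterCapabilities`, its field names, and the two example lanes are assumptions made for this example, not part of the lab or of any provider's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdapterCapabilities:
    """Hypothetical per-lane capability record; field names are illustrative."""
    interface: str            # wire format the adapter speaks
    guaranteed_schema: bool   # schema conformance is enforced, not best-effort
    prompt_caching: bool
    strict_tool_calls: bool


# Two made-up lanes that share an interface shape but not a contract.
native = AdapterCapabilities("openai-chat", True, True, True)
compat_layer = AdapterCapabilities("openai-chat", False, False, False)


def capability_gaps(required: set[str], caps: AdapterCapabilities) -> list[str]:
    """Return the required capability names that the lane does not provide."""
    return sorted(name for name in required if not getattr(caps, name))


required = {"guaranteed_schema", "strict_tool_calls"}
print(native.interface == compat_layer.interface)  # True: same interface shape
print(capability_gaps(required, native))           # []
print(capability_gaps(required, compat_layer))     # ['guaranteed_schema', 'strict_tool_calls']
```

The interface string matches for both lanes, yet only one of them satisfies the required capabilities. That is the gap the lane registry in the lab is meant to close.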

The first lab cell defines the artifact arriving from cost engineering, the request facts the app knows, the contract the gateway must compile, and a registry of candidate lanes. The dollar figures are teaching fixtures measured before promotion, not live provider prices. They are preflight evidence, not the final bill: after generation, the gateway still records provider-reported usage and sends it back through the cost ledger.

01-gateway-types-and-lanes.py

```python
from dataclasses import dataclass
from decimal import Decimal
from enum import Enum
import json

# Identifiers and limits inherited from the cost-engineering release.
COST_RELEASE_ID = "assistant-release-2026-05-cost-v1"
GATEWAY_POLICY_ID = "gateway-policy-v1"
MAX_GENERATED_ANSWER_USD = Decimal("0.004570")
MAX_GENERATION_ATTEMPTS = 2
REQUEST_DEADLINE_MS = 2_500

class DataClass(str, Enum):
    PUBLIC = "public"
    TENANT_PRIVATE = "tenant_private"

# Facts the application knows about one request.
@dataclass(frozen=True)
class GatewayRequest:
    request_id: str
    task: str
    data_class: DataClass
    context_tokens: int
    risk_amount_cents: int = 0
    requires_citations: bool = False
    requires_schema: bool = True

# Hard requirements the gateway compiles from those facts.
@dataclass(frozen=True)
class RouteContract:
    request_id: str
    data_class: DataClass
    context_tokens: int
    needs_citations: bool
    needs_schema: bool
    needs_human_review: bool
    max_answer_cost_usd: Decimal

# One model path with measured capabilities and cost.
@dataclass(frozen=True)
class Lane:
    name: str
    provider: str
    allowed_data_classes: frozenset[DataClass]
    max_context_tokens: int
    supports_schema: bool
    supports_citations: bool
    supports_human_review: bool
    evaluated_answer_cost_usd: Decimal
    expected_latency_ms: int

LANES = [
    Lane(
        "fast-public-json", "hosted-fast", frozenset({DataClass.PUBLIC}),
        16_000, True, False, False, Decimal("0.001100"), 260,
    ),
    Lane(
        "public-cited-review", "hosted-cited", frozenset({DataClass.PUBLIC}),
        64_000, True, True, True, Decimal("0.003800"), 860,
    ),
    Lane(
        "primary-private-cited-review", "hosted-private", frozenset({DataClass.TENANT_PRIVATE}),
        64_000, True, True, True, Decimal("0.004200"), 940,
    ),
    Lane(
        "local-private-cited-review", "local-private", frozenset({DataClass.TENANT_PRIVATE}),
        32_000, True, True, True, Decimal("0.004500"), 1_100,
    ),
    Lane(
        "regional-private-cited-review", "regional-private", frozenset({DataClass.TENANT_PRIVATE}),
        64_000, True, True, True, Decimal("0.004560"), 1_300,
    ),
    Lane(
        "cheap-text-fallback", "hosted-cheap",
        frozenset({DataClass.PUBLIC, DataClass.TENANT_PRIVATE}),
        32_000, False, False, False, Decimal("0.001500"), 420,
    ),
]

print(f"cost_release={COST_RELEASE_ID}")
print(f"gateway_policy={GATEWAY_POLICY_ID}")
print(f"max_generated_answer_usd={MAX_GENERATED_ANSWER_USD}")
print(f"max_generation_attempts={MAX_GENERATION_ATTEMPTS}")
print(f"request_deadline_ms={REQUEST_DEADLINE_MS}")
print(f"registered_lanes={len(LANES)}")
```

Output

```text
cost_release=assistant-release-2026-05-cost-v1
gateway_policy=gateway-policy-v1
max_generated_answer_usd=0.004570
max_generation_attempts=2
request_deadline_ms=2500
registered_lanes=6
```

Compile requirements before choosing a lane

A tempting router is a stack of early returns:

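unsafe-first-match.py

```python
# Anti-pattern: the first matching branch wins and later requirements are never checked.
def unsafe_route(request):
    if request.data_class == "tenant_private":
        return "local-private"
    if request.risk_amount_cents >= 50_000:
        return "human-review"
```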

For a private break-glass access request with 900 USD of risk, that code returns from the first branch and silently drops the second requirement. The safe sequence is different:

  1. Convert all request facts into one contract.
  2. Filter out every lane that violates any contract field.
  3. Rank only the remaining lanes by measured preferences such as cost or latency.

The next cell compiles the requirements. A break-glass access request at or above the lesson's policy threshold needs human review; private incident notes independently require a private data lane. Neither check erases the other.

02-compile-route-contract.py

```python
HIGH_RISK_ACCESS_CENTS = 50_000

def compile_contract(request: GatewayRequest) -> RouteContract:
    # Every request fact becomes a contract field; no branch can skip another.
    return RouteContract(
        request_id=request.request_id,
        data_class=request.data_class,
        context_tokens=request.context_tokens,
        needs_citations=request.requires_citations,
        needs_schema=request.requires_schema,
        needs_human_review=request.risk_amount_cents >= HIGH_RISK_ACCESS_CENTS,
        max_answer_cost_usd=MAX_GENERATED_ANSWER_USD,
    )

private_access = GatewayRequest(
    request_id="access-R900",
    task="prod_access_decision",
    data_class=DataClass.TENANT_PRIVATE,
    context_tokens=24_000,
    risk_amount_cents=90_000,
    requires_citations=True,
)
private_contract = compile_contract(private_access)

print(f"request={private_contract.request_id}")
print(f"data_class={private_contract.data_class.value}")
print(f"needs_citations={private_contract.needs_citations}")
print(f"needs_human_review={private_contract.needs_human_review}")
print(f"max_answer_cost_usd={private_contract.max_answer_cost_usd}")
```

Output

```text
request=access-R900
data_class=tenant_private
needs_citations=True
needs_human_review=True
max_answer_cost_usd=0.004570
```

Now the filter can explain every rejection. This is more useful than returning a Boolean: if no lane survives, an operator needs to know whether the missing capability is private-data handling, context length, citations, review, or budget.

03-filter-compatible-lanes.py

```python
def contract_violations(lane: Lane, contract: RouteContract) -> list[str]:
    # Collect every violated field instead of stopping at the first one.
    failures: list[str] = []
    if contract.data_class not in lane.allowed_data_classes:
        failures.append("data_boundary")
    if lane.max_context_tokens < contract.context_tokens:
        failures.append("context_length")
    if contract.needs_schema and not lane.supports_schema:
        failures.append("schema")
    if contract.needs_citations and not lane.supports_citations:
        failures.append("citations")
    if contract.needs_human_review and not lane.supports_human_review:
        failures.append("human_review")
    if lane.evaluated_answer_cost_usd > contract.max_answer_cost_usd:
        failures.append("budget")
    return failures

def compatible_lanes(contract: RouteContract) -> list[Lane]:
    return [lane for lane in LANES if not contract_violations(lane, contract)]

for lane in LANES:
    failures = contract_violations(lane, private_contract)
    status = "compatible" if not failures else "reject=" + ",".join(failures)
    print(f"{lane.name}: {status}")
```

Output

```text
fast-public-json: reject=data_boundary,context_length,citations,human_review
public-cited-review: reject=data_boundary
primary-private-cited-review: compatible
local-private-cited-review: compatible
regional-private-cited-review: compatible
cheap-text-fallback: reject=schema,citations,human_review
```
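The per-lane reasons can also be rolled up into a short operator report. This is a minimal sketch, assuming the rejection lists are copied from the filter output above; `summarize_rejections` is a hypothetical helper and not part of the lab.

```python
from collections import Counter

# Rejection reasons copied from the filter output for the private access request.
rejections = {
    "fast-public-json": ["data_boundary", "context_length", "citations", "human_review"],
    "public-cited-review": ["data_boundary"],
    "cheap-text-fallback": ["schema", "citations", "human_review"],
}


def summarize_rejections(rejections: dict[str, list[str]]) -> list[tuple[str, int]]:
    """Count how many rejected lanes failed each contract field."""
    counts = Counter(reason for reasons in rejections.values() for reason in reasons)
    # Sort by descending count, then by name, so the report is deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


for reason, count in summarize_rejections(rejections):
    print(f"{reason}: {count}")
```

A report like this tells an operator which missing capability blocks the most lanes. That is usually the first thing to fix when a contract has no compatible lane.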

Three private cited review lanes survive. The gateway can now choose between them by a soft preference without weakening a hard requirement. Here it chooses the lower measured answer cost, then latency as a deterministic tie-breaker.

04-choose-primary-lane.py

```python
@dataclass(frozen=True)
class RouteDecision:
    request_id: str
    lane: str | None
    action: str
    reasons: tuple[str, ...]

def choose_primary(contract: RouteContract) -> RouteDecision:
    feasible = compatible_lanes(contract)
    if not feasible:
        return RouteDecision(contract.request_id, None, "escalate", ("no_compatible_lane",))
    # Soft preference: lowest measured cost, then latency as a deterministic tie-breaker.
    lane = min(feasible, key=lambda candidate: (candidate.evaluated_answer_cost_usd, candidate.expected_latency_ms))
    reasons = (
        f"data={contract.data_class.value}",
        f"review={str(contract.needs_human_review).lower()}",
        f"citations={str(contract.needs_citations).lower()}",
        f"budget<={contract.max_answer_cost_usd}",
    )
    return RouteDecision(contract.request_id, lane.name, "generate", reasons)

status_request = GatewayRequest(
    "docs-Q102", "deploy_policy", DataClass.PUBLIC, 2_000,
    requires_citations=False,
)

for request in (status_request, private_access):
    decision = choose_primary(compile_contract(request))
    print(f"{decision.request_id} -> {decision.lane} action={decision.action}")
    print(" " + " ".join(decision.reasons))
```

Output

```text
docs-Q102 -> fast-public-json action=generate
 data=public review=false citations=false budget<=0.004570
access-R900 -> primary-private-cited-review action=generate
 data=tenant_private review=true citations=true budget<=0.004570
```

A route is a policy decision, not a provider popularity vote

The filter above is deliberately rule-based. Data boundaries, schema requirements, review requirements, and approved cost ceilings are policy constraints. Letting a probabilistic classifier relax any of them would make a plausible prediction more important than authorization.

Learned routing becomes useful inside the feasible set. RouteLLM trains routers from preference data to choose between a stronger and a weaker LLM; its paper reports cost reductions greater than two times in some evaluated settings without a reported response-quality loss on those settings.[2] That result supports experimenting with quality-cost routing. It doesn't turn a learned router into a privacy or authorization check.
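A minimal sketch of that ordering, assuming a hypothetical `predicted_quality` score stands in for a learned router. The score only reorders or narrows lanes that already passed the hard filter. The `Candidate` type, lane names, and numbers are made up for illustration and are not RouteLLM's interface.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Candidate:
    name: str
    passes_contract: bool      # result of the hard, rule-based filter
    cost_usd: Decimal
    predicted_quality: float   # hypothetical learned-router score in [0, 1]


def rank_feasible(candidates: list[Candidate], min_quality: float) -> list[Candidate]:
    """Drop contract violators first, then let the soft score order the survivors."""
    feasible = [c for c in candidates if c.passes_contract]
    good_enough = [c for c in feasible if c.predicted_quality >= min_quality]
    # If the score rejects everything, keep all feasible lanes.
    # The score never re-admits a lane that the hard filter removed.
    pool = good_enough or feasible
    return sorted(pool, key=lambda c: (c.cost_usd, -c.predicted_quality))


candidates = [
    Candidate("cheap-but-noncompliant", False, Decimal("0.0015"), 0.95),
    Candidate("private-a", True, Decimal("0.0042"), 0.70),
    Candidate("private-b", True, Decimal("0.0045"), 0.90),
]
print([c.name for c in rank_feasible(candidates, min_quality=0.8)])  # ['private-b']
print([c.name for c in rank_feasible(candidates, min_quality=0.5)])  # ['private-a', 'private-b']
```

The noncompliant lane has the best score and the lowest cost, and it still never appears in the result.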

This flow diagram shows the control boundary. A soft router may rank already-compatible candidates; a failure re-enters the same contract filter rather than jumping directly to a convenient provider.

Diagram: the normal route and the pre-output failure path, both starting from (1) request facts and (2) compile contract.

Fallback is another contract decision

A fallback is a second route attempt after the first attempt fails. It isn't permission to drop requirements. If a cited access answer must be private, structured, reviewable, and within budget before an outage, it must remain so during an outage.

Gateway frameworks expose fallback mechanisms, but configuration doesn't prove semantic compatibility. For example, LiteLLM documents ordered fallbacks and separate regular, context-window, and content-policy fallback settings.[3] Those mechanisms decide when another lane may be attempted. Your contract still decides which replacement lane is acceptable.

Figure: Fallback flow where retryable failures pass a failure gate, compatible lanes pass a lane gate, and contract-breaking fallbacks stop.
A retry needs two approvals. The failure must be transparently retryable, and the replacement lane must preserve the contract and have a healthy circuit. Here local-private wins, and the cheap fallback is rejected on contract.

Failure type matters before the gateway even looks for a candidate. A rate limit or timeout before any output is shown can be retried against a compatible lane. Once a model has streamed visible text, silently switching models can produce a contradictory continuation. The safer product behavior is to stop that answer and ask the user to retry or hand it to an operator.

This fixture routes answer generation only. It doesn't replay side-effecting tool calls. A write action such as granting break-glass access needs its own authorization boundary and idempotency key; a gateway must not silently execute it again because generation failed. A context rejection also stops here: the filter already checked context capacity, so a provider rejection signals stale registry data or another mismatch that needs investigation rather than a hidden downgrade.
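The idempotency half of that boundary can be sketched as follows. `WriteLedger` and `access_grant_key` are hypothetical names, and the in-memory dict stands in for a durable store. Authorization is a separate check that this sketch doesn't cover.

```python
import hashlib


class WriteLedger:
    """In-memory stand-in for a durable store of completed write actions."""

    def __init__(self) -> None:
        self._results: dict[str, str] = {}
        self.executions = 0

    def run_once(self, idempotency_key: str, action) -> str:
        # A replay with the same key returns the stored result without re-executing.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.executions += 1
        result = action()
        self._results[idempotency_key] = result
        return result


def access_grant_key(request_id: str, principal: str, resource: str) -> str:
    """Derive a stable key from the facts that define one logical grant."""
    raw = f"{request_id}|{principal}|{resource}".encode()
    return hashlib.sha256(raw).hexdigest()


ledger = WriteLedger()
key = access_grant_key("access-R900", "oncall-engineer", "prod-db")
first = ledger.run_once(key, lambda: "granted")
second = ledger.run_once(key, lambda: "granted")  # e.g. replayed after a generation retry
print(first, second, ledger.executions)  # granted granted 1
```

Because the key depends only on the logical grant, a generation retry that reaches the write step again finds the stored result and skips the side effect.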

The next cell classifies failures by whether a transparent fallback is still allowed.

05-classify-fallback-eligible-failures.py

```python
class FailureKind(str, Enum):
    RATE_LIMIT_BEFORE_OUTPUT = "rate_limit_before_output"
    TIMEOUT_BEFORE_OUTPUT = "timeout_before_output"
    CONTEXT_REJECTED = "context_rejected"
    MID_STREAM_DROP = "mid_stream_drop"
    SCHEMA_INVALID = "schema_invalid"

def may_retry_transparently(failure: FailureKind) -> bool:
    # Only failures that occur before any visible output qualify.
    return failure in {
        FailureKind.RATE_LIMIT_BEFORE_OUTPUT,
        FailureKind.TIMEOUT_BEFORE_OUTPUT,
    }

for failure in FailureKind:
    action = "try_compatible_fallback" if may_retry_transparently(failure) else "stop_or_escalate"
    print(f"{failure.value}: {action}")
```

Output

```text
rate_limit_before_output: try_compatible_fallback
timeout_before_output: try_compatible_fallback
context_rejected: stop_or_escalate
mid_stream_drop: stop_or_escalate
schema_invalid: stop_or_escalate
```

For the private break-glass access request, suppose the lower-cost primary lane times out before emitting output. The fallback selector excludes the failed lane, applies the same contract filter, and then chooses from what remains.

06-select-contract-preserving-fallback.py

```python
def choose_fallback(contract: RouteContract, failed_lane: str) -> RouteDecision:
    # Same contract filter as the primary route, minus the lane that just failed.
    candidates = [
        lane for lane in compatible_lanes(contract)
        if lane.name != failed_lane
    ]
    if not candidates:
        return RouteDecision(contract.request_id, None, "escalate", ("fallback_contract_unmet",))
    lane = min(candidates, key=lambda candidate: (candidate.evaluated_answer_cost_usd, candidate.expected_latency_ms))
    return RouteDecision(
        contract.request_id,
        lane.name,
        "fallback_generate",
        (f"primary_failed={failed_lane}", "contract_preserved=true"),
    )

primary = choose_primary(private_contract)
fallback = choose_fallback(private_contract, primary.lane or "")
print(f"primary={primary.lane}")
print(f"fallback={fallback.lane} action={fallback.action}")
print("cheap_text_rejected=" + ",".join(contract_violations(LANES[-1], private_contract)))
```

Output

```text
primary=primary-private-cited-review
fallback=local-private-cited-review action=fallback_generate
cheap_text_rejected=schema,citations,human_review
```

Both candidate lanes fit the per-answer admission ceiling, but a failed primary attempt can still consume billable provider work before the timeout reaches the gateway. That means max_answer_cost_usd isn't a guarantee on total request spend during an incident. Production policy also needs a retry budget: a maximum attempt count, one wall-clock deadline shared across attempts, and post-call pricing for usage reported by every provider attempt. This lab allows at most one fallback (MAX_GENERATION_ATTEMPTS = 2) and carries a 2.5-second request deadline into the exported policy. The simulator doesn't advance a clock, so the downstream runtime must enforce that deadline.
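Since the simulator doesn't advance a clock, here is a rough sketch of what a runtime retry budget might track. `RetryBudget` is a hypothetical helper, and timestamps are passed in as plain integers so the example stays deterministic. A real runtime would more likely read a monotonic clock such as `time.monotonic()`.

```python
from dataclasses import dataclass


@dataclass
class RetryBudget:
    """Tracks attempts and one deadline shared by every attempt of a request."""
    max_attempts: int
    deadline_ms: int
    started_ms: int
    attempts: int = 0

    def may_attempt(self, now_ms: int) -> bool:
        within_attempts = self.attempts < self.max_attempts
        within_deadline = (now_ms - self.started_ms) < self.deadline_ms
        return within_attempts and within_deadline

    def record_attempt(self) -> None:
        self.attempts += 1

    def remaining_ms(self, now_ms: int) -> int:
        # Use this as the timeout for the next provider call.
        return max(0, self.deadline_ms - (now_ms - self.started_ms))


budget = RetryBudget(max_attempts=2, deadline_ms=2_500, started_ms=0)
print(budget.may_attempt(now_ms=0))     # True: primary attempt
budget.record_attempt()
print(budget.may_attempt(now_ms=900), budget.remaining_ms(now_ms=900))  # True 1600
budget.record_attempt()
print(budget.may_attempt(now_ms=1_800))  # False: attempts exhausted

late = RetryBudget(max_attempts=2, deadline_ms=2_500, started_ms=0, attempts=1)
print(late.may_attempt(now_ms=2_600))    # False: deadline exhausted
```

Passing `remaining_ms` to the fallback call keeps the second attempt from restarting the clock, which is the point of sharing one deadline across attempts.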

Stop incidents from multiplying

Fallback can make an outage worse if every request keeps trying an unhealthy primary before moving to the backup. A circuit breaker tracks recent failures for an upstream target. Once that target reaches a failure threshold, its circuit opens and new requests skip it during a cooldown period. After cooldown, one probe is allowed through; success closes the circuit, while failure opens it again.

This compact implementation models those three states: closed, open, and half_open. Its key is provider because each fixture provider names one upstream target. A production registry may need a provider, endpoint, deployment, or lane key depending on failure scope, plus bounded retry attempts, backoff, and a deadline for the half-open probe.

07-circuit-breaker.py

```python
class CircuitStatus(str, Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

@dataclass
class CircuitState:
    status: CircuitStatus = CircuitStatus.CLOSED
    failures: int = 0
    opened_until: float = 0.0

class CircuitBreaker:
    def __init__(self, threshold: int = 2, cooldown_seconds: float = 10.0) -> None:
        self.threshold = threshold
        self.cooldown_seconds = cooldown_seconds
        self.states: dict[str, CircuitState] = {}

    def permit(self, provider: str, now: float) -> bool:
        state = self.states.setdefault(provider, CircuitState())
        if state.status == CircuitStatus.OPEN:
            if now < state.opened_until:
                return False
            # Cooldown elapsed: let exactly one probe through.
            state.status = CircuitStatus.HALF_OPEN
            return True
        # A half-open circuit blocks further requests until its probe resolves.
        return state.status == CircuitStatus.CLOSED

    def failure(self, provider: str, now: float) -> None:
        state = self.states.setdefault(provider, CircuitState())
        state.failures += 1
        if state.status == CircuitStatus.HALF_OPEN or state.failures >= self.threshold:
            state.status = CircuitStatus.OPEN
            state.opened_until = now + self.cooldown_seconds

    def success(self, provider: str) -> None:
        # Any success resets the circuit to a fresh closed state.
        self.states[provider] = CircuitState()

breaker = CircuitBreaker()
breaker.failure("hosted-private", 100.0)
breaker.failure("hosted-private", 101.0)
print(f"during_cooldown={breaker.permit('hosted-private', 105.0)}")
print(f"probe_after_cooldown={breaker.permit('hosted-private', 112.0)}")
breaker.success("hosted-private")
print(f"after_success={breaker.states['hosted-private'].status.value}")
```

Output

```text
during_cooldown=False
probe_after_cooldown=True
after_success=closed
```

Execute one outage path and record why

Routing decisions are operational evidence. An audit row should carry the gateway-policy identifier, inherited cost-release identifier, selected lane, fallback event, evaluated cost, and every hard requirement that caused selection. That record lets a later incident review distinguish a model-quality problem from a bad policy decision.

The next cell simulates several attempts for the private break-glass access request. It consults circuit state before attempting the primary lane and records a failure whenever an attempt breaks. A timeout before output uses the compatible local fallback. A mid-stream drop escalates because changing generators after output begins would hide an inconsistent user experience. The final attempt runs with both the hosted-private and local-private circuits open, and shows that the gateway scans the ranked, contract-compatible fallbacks until it finds a permitted regional lane.

08-execute-gateway-path.py

```python
@dataclass(frozen=True)
class RouteEvent:
    request_id: str
    policy_id: str
    cost_release_id: str
    action: str
    lane: str | None
    contract_summary: str
    reason: str
    evaluated_cost_usd: Decimal

def lane_by_name(name: str) -> Lane:
    return next(lane for lane in LANES if lane.name == name)

def summarize_contract(contract: RouteContract) -> str:
    return (
        f"data={contract.data_class.value};"
        f"schema={str(contract.needs_schema).lower()};"
        f"citations={str(contract.needs_citations).lower()};"
        f"review={str(contract.needs_human_review).lower()};"
        f"budget<={contract.max_answer_cost_usd}"
    )

def permitted_fallback_lane(contract: RouteContract, failed_lane: str, now: float) -> Lane | None:
    # Scan contract-compatible lanes in ranked order and skip any with an open circuit.
    candidates = sorted(
        (lane for lane in compatible_lanes(contract) if lane.name != failed_lane),
        key=lambda candidate: (candidate.evaluated_answer_cost_usd, candidate.expected_latency_ms),
    )
    for lane in candidates:
        if breaker.permit(lane.provider, now):
            return lane
    return None

def execute_with_failure(request: GatewayRequest, failure: FailureKind | None, now: float = 200.0) -> RouteEvent:
    contract = compile_contract(request)
    contract_summary = summarize_contract(contract)
    primary_decision = choose_primary(contract)
    if primary_decision.lane is None:
        return RouteEvent(request.request_id, GATEWAY_POLICY_ID, COST_RELEASE_ID, "escalate", None, contract_summary, "no_primary_lane", Decimal("0"))
    primary_lane = lane_by_name(primary_decision.lane)
    # Primary circuit open: go straight to a permitted fallback.
    if not breaker.permit(primary_lane.provider, now):
        fallback_lane = permitted_fallback_lane(contract, primary_lane.name, now)
        if fallback_lane is None:
            return RouteEvent(request.request_id, GATEWAY_POLICY_ID, COST_RELEASE_ID, "escalate", None, contract_summary, "no_healthy_safe_fallback", Decimal("0"))
        breaker.success(fallback_lane.provider)
        return RouteEvent(
            request.request_id, GATEWAY_POLICY_ID, COST_RELEASE_ID, "served_fallback", fallback_lane.name,
            contract_summary, "primary_circuit_open;contract_preserved", fallback_lane.evaluated_answer_cost_usd,
        )
    if failure is None:
        breaker.success(primary_lane.provider)
        return RouteEvent(
            request.request_id, GATEWAY_POLICY_ID, COST_RELEASE_ID, "served", primary_lane.name,
            contract_summary, "primary_contract_match", primary_lane.evaluated_answer_cost_usd,
        )
    # The simulated primary attempt failed: record it before deciding what to do next.
    breaker.failure(primary_lane.provider, now)
    if not may_retry_transparently(failure):
        return RouteEvent(
            request.request_id, GATEWAY_POLICY_ID, COST_RELEASE_ID, "escalate", None,
            contract_summary, f"primary_{failure.value}", Decimal("0"),
        )
    fallback_lane = permitted_fallback_lane(contract, primary_lane.name, now)
    if fallback_lane is None:
        return RouteEvent(request.request_id, GATEWAY_POLICY_ID, COST_RELEASE_ID, "escalate", None, contract_summary, "no_healthy_safe_fallback", Decimal("0"))
    breaker.success(fallback_lane.provider)
    return RouteEvent(
        request.request_id, GATEWAY_POLICY_ID, COST_RELEASE_ID, "served_fallback", fallback_lane.name,
        contract_summary, f"primary_{failure.value};contract_preserved", fallback_lane.evaluated_answer_cost_usd,
    )

breaker = CircuitBreaker()
before_output = execute_with_failure(private_access, FailureKind.TIMEOUT_BEFORE_OUTPUT, now=200.0)
mid_stream = execute_with_failure(private_access, FailureKind.MID_STREAM_DROP, now=201.0)
# Open the local circuit by hand so the last attempt must skip it.
breaker.failure("local-private", 201.0)
breaker.failure("local-private", 202.0)
skip_open_local = execute_with_failure(private_access, None, now=203.0)

print(f"before_output={before_output.action} lane={before_output.lane} reason={before_output.reason}")
print(f"audit_policy={before_output.policy_id} cost_release={before_output.cost_release_id}")
print(f"audit_contract={before_output.contract_summary}")
print(f"mid_stream={mid_stream.action} lane={mid_stream.lane} reason={mid_stream.reason}")
print(f"circuit_after_failures={breaker.states['hosted-private'].status.value}")
print(f"open_local_skipped={skip_open_local.action} lane={skip_open_local.lane} reason={skip_open_local.reason}")
```

Output

```text
before_output=served_fallback lane=local-private-cited-review reason=primary_timeout_before_output;contract_preserved
audit_policy=gateway-policy-v1 cost_release=assistant-release-2026-05-cost-v1
audit_contract=data=tenant_private;schema=true;citations=true;review=true;budget<=0.004570
mid_stream=escalate lane=None reason=primary_mid_stream_drop
circuit_after_failures=open
open_local_skipped=served_fallback lane=regional-private-cited-review reason=primary_circuit_open;contract_preserved
```

Test policy before promoting it

A gateway policy shouldn't be promoted because three hand-picked examples look sensible. Replay an evaluated set that represents low-risk traffic, high-risk cases, private data, outage paths, and unsupported contracts. For every generated response, check that the selected lane preserves the contract and stays inside the inherited budget.

This small replay adds a request with a private context too large for any approved private lane. It also opens the primary circuit on a simulated timeout and confirms that the next request uses a contract-preserving fallback during cooldown. An honest gateway escalates unsupported work rather than truncating evidence or routing private incident notes somewhere unapproved.

09-replay-gateway-policy.py
1too_large_private_request = GatewayRequest( 2 "access-long-context", "prod_access_decision", DataClass.TENANT_PRIVATE, 70_000, 3 risk_amount_cents=90_000, requires_citations=True, 4) 5 6breaker = CircuitBreaker(threshold=1, cooldown_seconds=10.0) 7replay_cases = [ 8 (status_request, None), 9 (private_access, None), 10 (private_access, FailureKind.TIMEOUT_BEFORE_OUTPUT), 11 (private_access, None), 12 (too_large_private_request, None), 13] 14events = [ 15 execute_with_failure(request, failure, now=300.0 + index) 16 for index, (request, failure) in enumerate(replay_cases) 17] 18 19served = [] 20for (request, _), event in zip(replay_cases, events): 21 if event.action.startswith("served"): 22 lane = lane_by_name(event.lane or "") 23 assert not contract_violations(lane, compile_contract(request)) 24 assert event.evaluated_cost_usd <= MAX_GENERATED_ANSWER_USD 25 served.append(event) 26 27for event in events: 28 print(f"{event.request_id}: {event.action} lane={event.lane}") 29print(f"generated_with_contract={len(served)}/{len(events)}") 30print(f"unsafe_generation_events=0")
Output
docs-Q102: served lane=fast-public-json
access-R900: served lane=primary-private-cited-review
access-R900: served_fallback lane=local-private-cited-review
access-R900: served_fallback lane=local-private-cited-review
access-long-context: escalate lane=None
generated_with_contract=4/5
unsafe_generation_events=0

The replay isn't an answer-quality evaluation by itself. Its cost assertion checks preflight lane evidence, not the final bill. After generation, record provider-reported usage and price it with the inherited rate card in the ledger from the previous lesson. Before promotion, also join these route events with the answer-correctness and citation checks from the evaluation lessons. A routing policy can satisfy the mechanical contract and still send too many difficult public questions to a weak but formally compatible lane.

Only after hard requirements pass should you test learned or heuristic ranking for quality and cost. Useful comparison metrics are generated-answer correctness, citation correctness, human-escalation rate, fallback success rate, p95 latency, and actual spend by lane.
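The comparison metrics listed above can be computed from route events once they are joined with evaluation results. The following is a minimal sketch under stated assumptions: the `RouteOutcome` record, its field names, and the sample latencies and costs are hypothetical and do not come from the lesson's code.

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class RouteOutcome:
    """Hypothetical joined row: one route event plus its evaluation results."""
    action: str                        # "served", "served_fallback", or "escalate"
    lane: Optional[str]
    primary_failed: bool               # True when the primary lane failed first
    latency_ms: float
    actual_cost_usd: Decimal           # priced from provider-reported usage
    answer_correct: Optional[bool] = None      # None when nothing was generated
    citations_correct: Optional[bool] = None


def rate(hits, total):
    """Return hits/total, or None when the denominator is empty."""
    return hits / total if total else None


def nearest_rank_p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]


def summarize(outcomes):
    if not outcomes:
        raise ValueError("need at least one outcome")
    generated = [o for o in outcomes if o.action.startswith("served")]
    needed_fallback = [o for o in outcomes if o.primary_failed]
    spend = defaultdict(Decimal)       # Decimal() starts each lane at 0
    for o in generated:
        spend[o.lane] += o.actual_cost_usd
    return {
        "answer_correctness": rate(
            sum(o.answer_correct is True for o in generated), len(generated)),
        "citation_correctness": rate(
            sum(o.citations_correct is True for o in generated), len(generated)),
        "human_escalation_rate": rate(
            sum(o.action == "escalate" for o in outcomes), len(outcomes)),
        "fallback_success_rate": rate(
            sum(o.action == "served_fallback" for o in needed_fallback),
            len(needed_fallback)),
        "p95_latency_ms": nearest_rank_p95([o.latency_ms for o in outcomes]),
        "spend_by_lane_usd": dict(spend),
    }


# Illustrative rows only; latencies and costs are invented for the sketch.
outcomes = [
    RouteOutcome("served", "fast-public-json", False, 120.0, Decimal("0.000900"), True, True),
    RouteOutcome("served", "primary-private-cited-review", False, 800.0, Decimal("0.004200"), True, True),
    RouteOutcome("served_fallback", "local-private-cited-review", True, 1400.0, Decimal("0.004500"), True, False),
    RouteOutcome("escalate", None, False, 15.0, Decimal("0")),
]
summary = summarize(outcomes)
for name, value in summary.items():
    print(f"{name}={value}")
```

Keeping correctness, escalation, fallback, latency, and spend in one summary makes it harder to promote a ranker that improves one metric by quietly degrading another.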

Hand the developer assistant a policy artifact

The next lesson assembles a complete developer assistant. It shouldn't re-create routing rules inside orchestration code. The gateway should export a small, versioned policy artifact that states what has been approved and what must escalate.

The final cell emits the artifact from the lesson: an inherited cost contract, two demonstrated routes, and failure behavior that refuses unsafe degradation.

10-export-gateway-policy.py
import json

policy_artifact = {
    "policy_id": GATEWAY_POLICY_ID,
    "cost_release_id": COST_RELEASE_ID,
    # Serialized as a string so the decimal ceiling survives JSON unchanged.
    "max_generated_answer_usd": str(MAX_GENERATED_ANSWER_USD),
    "retry_limits": {
        "max_generation_attempts": MAX_GENERATION_ATTEMPTS,
        "request_deadline_ms": REQUEST_DEADLINE_MS,
    },
    "approved_examples": {
        "public_deploy_policy": "fast-public-json",
        "private_high_risk_access": "primary-private-cited-review",
        "private_high_risk_access_fallback": "local-private-cited-review",
    },
    "escalate_when": [
        "no lane preserves all contract fields",
        "failure occurs after visible output begins",
        "approved private context capacity is exceeded",
        "retry attempts or request deadline are exhausted",
    ],
}

print(json.dumps(policy_artifact, indent=2))
Output
{
  "policy_id": "gateway-policy-v1",
  "cost_release_id": "assistant-release-2026-05-cost-v1",
  "max_generated_answer_usd": "0.004570",
  "retry_limits": {
    "max_generation_attempts": 2,
    "request_deadline_ms": 2500
  },
  "approved_examples": {
    "public_deploy_policy": "fast-public-json",
    "private_high_risk_access": "primary-private-cited-review",
    "private_high_risk_access_fallback": "local-private-cited-review"
  },
  "escalate_when": [
    "no lane preserves all contract fields",
    "failure occurs after visible output begins",
    "approved private context capacity is exceeded",
    "retry attempts or request deadline are exhausted"
  ]
}
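A downstream consumer can read this artifact instead of re-creating routing rules. The sketch below shows one way to do that, assuming a hypothetical `load_policy` helper and a `REQUIRED_KEYS` set that mirror the exported fields. The lesson does not define a loader, so treat this as an illustration rather than the assistant's actual code.

```python
import json
from decimal import Decimal

# Hypothetical: the top-level keys the consumer refuses to run without.
REQUIRED_KEYS = {
    "policy_id",
    "cost_release_id",
    "max_generated_answer_usd",
    "retry_limits",
    "approved_examples",
    "escalate_when",
}


def load_policy(raw: str) -> dict:
    """Parse the exported artifact and fail loudly if a field is missing."""
    policy = json.loads(raw)
    missing = REQUIRED_KEYS - policy.keys()
    if missing:
        raise ValueError(f"policy artifact missing keys: {sorted(missing)}")
    # The ceiling was exported as a string; restore it as an exact decimal.
    policy["max_generated_answer_usd"] = Decimal(policy["max_generated_answer_usd"])
    return policy


raw_artifact = json.dumps({
    "policy_id": "gateway-policy-v1",
    "cost_release_id": "assistant-release-2026-05-cost-v1",
    "max_generated_answer_usd": "0.004570",
    "retry_limits": {"max_generation_attempts": 2, "request_deadline_ms": 2500},
    "approved_examples": {"public_deploy_policy": "fast-public-json"},
    "escalate_when": ["no lane preserves all contract fields"],
})

policy = load_policy(raw_artifact)
print(policy["policy_id"], policy["max_generated_answer_usd"])
```

Parsing the ceiling back into `Decimal` avoids float rounding when the consumer later compares priced usage against it.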

Mastery check

What you built

  • A request-to-contract compiler that accumulates privacy, schema, citation, review, and budget requirements.
  • A filter that can explain why each unsafe lane is rejected.
  • A primary and fallback selector that ranks only compatible lanes.
  • A failure classifier and circuit breaker that avoid unsafe or amplifying retries.
  • Retry-limit policy fields that cap generation attempts and define one request deadline for the downstream runtime.
  • Audit rows that pin the gateway policy and inherited cost release used for each decision.
  • A replay check and versioned gateway-policy artifact for the developer-assistant design.

Evaluation rubric

  • Foundational: Explains why an adapter isn't a gateway and why OpenAI-shaped requests don't prove capability equality.
  • Intermediate: Compiles a private high-risk break-glass access request into one contract without dropping either privacy or review.
  • Intermediate: Rejects a low-cost fallback that breaks citations, schema, or data handling.
  • Advanced: Distinguishes a retryable pre-output failure from a mid-stream failure that must stop or escalate.
  • Advanced: Separates the per-answer admission ceiling from request-level retry cost, attempt, and deadline controls.
  • Advanced: Promotes routing only after route-contract checks are joined with answer-quality checks and post-generation cost ledgers.

Self-check questions

Common failures

Treating the first matching rule as the full contract

  • Symptom: Private break-glass access requests stay private but bypass required human review.
  • Cause: Router returned on the data classification branch before evaluating risk requirements.
  • Fix: Compile all requirements first and accept only lanes with an empty violation list.
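The fix for this first failure can be sketched in a few lines. The types below are simplified stand-ins, not the lesson's `GatewayRequest` or lane classes, and the field names are assumptions. The 50,000-cent review threshold is the one quoted in the mastery check.

```python
from dataclasses import dataclass

HIGH_RISK_CENTS = 50_000  # review threshold quoted in the mastery check


@dataclass(frozen=True)
class Request:            # simplified stand-in for the lesson's request type
    tenant_private: bool
    risk_amount_cents: int
    requires_citations: bool


@dataclass(frozen=True)
class Contract:
    private_data: bool
    citations: bool
    human_review: bool


@dataclass(frozen=True)
class Lane:               # hypothetical capability record
    name: str
    allows_private_data: bool
    supports_citations: bool
    supports_human_review: bool


def compile_contract(request: Request) -> Contract:
    """Evaluate every rule; no branch returns before the others have run."""
    return Contract(
        private_data=request.tenant_private,
        citations=request.requires_citations,
        human_review=request.risk_amount_cents >= HIGH_RISK_CENTS,
    )


def contract_violations(lane: Lane, contract: Contract) -> list:
    """List every requirement the lane cannot meet, so rejection is explainable."""
    violations = []
    if contract.private_data and not lane.allows_private_data:
        violations.append("private data not allowed")
    if contract.citations and not lane.supports_citations:
        violations.append("citations unsupported")
    if contract.human_review and not lane.supports_human_review:
        violations.append("human review unsupported")
    return violations


def admissible(lanes, contract):
    """Keep only lanes with an empty violation list."""
    return [lane for lane in lanes if not contract_violations(lane, contract)]


request = Request(tenant_private=True, risk_amount_cents=90_000, requires_citations=True)
contract = compile_contract(request)
lanes = [
    Lane("local-private", True, False, False),
    Lane("primary-private-cited-review", True, True, True),
]
for lane in lanes:
    print(lane.name, contract_violations(lane, contract))
```

A first-match router would have accepted `local-private` on the privacy branch alone. The violation list shows exactly which requirements that shortcut drops.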

Assuming interface compatibility means safe fallback

  • Symptom: Fallback returns text, but structured parsing or citation enforcement fails.
  • Cause: Adapter shape was mistaken for a capability guarantee.
  • Fix: Maintain tested lane capabilities and reject a fallback that can't meet the output contract.

Retrying through an incident

  • Symptom: Latency and error volume grow while a failing provider is still attempted on every request.
  • Cause: Gateway had fallback order but no circuit state.
  • Fix: Open the unhealthy circuit, permit a controlled probe after cooldown, and record every fallback route.
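One way to picture the circuit state this fix calls for is the sketch below. It reuses the `threshold` and `cooldown_seconds` parameters seen in the replay cell, but the method names, the `half_open` state, and the single-probe rule are assumptions for illustration, not the lesson's implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CircuitBreaker:
    """Illustrative breaker: closed -> open -> half_open (one probe) -> closed."""
    threshold: int
    cooldown_seconds: float
    failures: int = 0
    opened_at: Optional[float] = None
    probe_in_flight: bool = False

    def state(self, now: float) -> str:
        if self.opened_at is None:
            return "closed"
        if now - self.opened_at < self.cooldown_seconds:
            return "open"
        return "half_open"

    def allow(self, now: float) -> bool:
        """Permit calls when closed, and exactly one probe after cooldown."""
        current = self.state(now)
        if current == "closed":
            return True
        if current == "half_open" and not self.probe_in_flight:
            self.probe_in_flight = True
            return True
        return False

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None
        self.probe_in_flight = False

    def record_failure(self, now: float) -> None:
        self.failures += 1
        self.probe_in_flight = False
        if self.failures >= self.threshold:
            self.opened_at = now  # open the circuit or restart the cooldown


breaker = CircuitBreaker(threshold=1, cooldown_seconds=10.0)
breaker.record_failure(now=302.0)
print(breaker.state(303.0), breaker.allow(303.0))  # still cooling down
print(breaker.state(313.0), breaker.allow(313.0))  # one probe is permitted
print(breaker.allow(313.5))                        # second caller must wait
```

Limiting recovery to a single probe keeps a half-recovered provider from receiving full traffic the moment the cooldown ends.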

Stopping after the first open-circuit fallback

  • Symptom: Gateway escalates even though a later contract-compatible lane is healthy.
  • Cause: Fallback code selected one safe lane, discovered its circuit was open, and stopped scanning.
  • Fix: Rank contract-compatible fallbacks, then choose the first lane whose circuit permits a call.
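The scan described in this fix is short enough to sketch directly. The function name `first_callable_lane` and the `circuit_allows` callback are hypothetical. The lane names are the ones that appear in the lesson's output.

```python
from typing import Callable, Optional, Sequence


def first_callable_lane(
    ranked_compatible: Sequence[str],
    circuit_allows: Callable[[str], bool],
) -> Optional[str]:
    """Walk the ranked, contract-compatible lanes and return the first callable one.

    Returns None when every circuit is open, which the caller treats as escalate.
    """
    for lane in ranked_compatible:
        if circuit_allows(lane):
            return lane
    return None


# Lanes already filtered by contract and ranked by cost.
ranked = [
    "primary-private-cited-review",
    "local-private-cited-review",
    "regional-private-cited-review",
]
open_circuits = {"primary-private-cited-review", "local-private-cited-review"}

chosen = first_callable_lane(ranked, lambda lane: lane not in open_circuits)
print(chosen or "escalate")
```

The important property is that an open circuit skips one lane and the loop continues. Escalation happens only when the whole ranked list is exhausted.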

Treating preflight cost evidence as the final bill

  • Symptom: Route replay passes, but actual generated-answer spend drifts above the approved ceiling.
  • Cause: Gateway checked the lane's evaluated fixture cost and skipped provider-usage pricing after generation.
  • Fix: Use evaluated cost for route admission, then record provider-reported usage and run the inherited ledger check on every generated answer.
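The post-generation check in this fix can be sketched as follows. The `RateCard` and `ProviderUsage` shapes, the per-million-token prices, and the token counts are invented for illustration. Only the 0.004570 USD ceiling comes from the lesson.

```python
from dataclasses import dataclass
from decimal import Decimal

MAX_GENERATED_ANSWER_USD = Decimal("0.004570")  # ceiling from the policy artifact
TOKENS_PER_MILLION = Decimal(1_000_000)


@dataclass(frozen=True)
class RateCard:           # hypothetical shape of the inherited rate card
    input_usd_per_million: Decimal
    output_usd_per_million: Decimal


@dataclass(frozen=True)
class ProviderUsage:      # token counts as reported by the provider
    input_tokens: int
    output_tokens: int


def price_usage(usage: ProviderUsage, card: RateCard) -> Decimal:
    """Price reported tokens with exact decimal arithmetic."""
    total = (
        Decimal(usage.input_tokens) * card.input_usd_per_million
        + Decimal(usage.output_tokens) * card.output_usd_per_million
    )
    return total / TOKENS_PER_MILLION


def ledger_check(evaluated_cost_usd: Decimal, usage: ProviderUsage, card: RateCard) -> dict:
    """Compare the preflight estimate with the priced bill for one answer."""
    actual = price_usage(usage, card)
    return {
        "evaluated_cost_usd": evaluated_cost_usd,
        "actual_cost_usd": actual,
        "drift_usd": actual - evaluated_cost_usd,
        "within_ceiling": actual <= MAX_GENERATED_ANSWER_USD,
    }


card = RateCard(Decimal("0.50"), Decimal("1.50"))          # invented prices
usage = ProviderUsage(input_tokens=6_000, output_tokens=1_200)
row = ledger_check(Decimal("0.004200"), usage, card)
print(row)
```

In this made-up case the lane was admitted at 0.004200 USD, but the priced usage lands above the ceiling. That is the drift a replay alone cannot see.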

Treating fallback attempts as free

  • Symptom: Every chosen lane fits the answer-cost ceiling, but incident spend and tail latency still exceed policy.
  • Cause: The gateway applied the ceiling independently to each lane and ignored work consumed by failed attempts.
  • Fix: Share one retry budget across the request: cap attempts, carry one deadline, and price usage from every provider attempt after the call.
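A shared budget of this kind might look like the sketch below. The `RequestBudget` class and its method names are hypothetical, and the per-attempt costs are invented. The limits of 2 attempts and 2,500 ms match the exported `retry_limits`.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class RequestBudget:
    """One budget per request, shared by the primary call and every fallback."""
    max_attempts: int
    deadline_ms: int
    started_ms: int
    attempts: int = 0
    spent_usd: Decimal = Decimal("0")

    def can_attempt(self, now_ms: int) -> bool:
        """Allow another provider call only while attempts and time remain."""
        within_attempts = self.attempts < self.max_attempts
        within_deadline = now_ms - self.started_ms < self.deadline_ms
        return within_attempts and within_deadline

    def record_attempt(self, cost_usd: Decimal) -> None:
        """Count the attempt and its priced usage, even when the call failed."""
        self.attempts += 1
        self.spent_usd += cost_usd


budget = RequestBudget(max_attempts=2, deadline_ms=2500, started_ms=0)

assert budget.can_attempt(now_ms=0)
budget.record_attempt(Decimal("0.001100"))   # failed primary still consumed tokens

assert budget.can_attempt(now_ms=900)
budget.record_attempt(Decimal("0.004500"))   # fallback produced the answer

print(f"attempts={budget.attempts} spent_usd={budget.spent_usd}")
print(f"third_attempt_allowed={budget.can_attempt(now_ms=1500)}")
```

The total here is higher than either lane's individual cost. A per-lane ceiling hides that difference, and a request-level budget exposes it.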

Replaying side effects as if they were generation

  • Symptom: A transient model failure causes a write tool to run twice.
  • Cause: Retry policy treated answer generation and authorized side effects as one operation.
  • Fix: Keep gateway retries scoped to idempotent generation attempts. Put authorization, idempotency keys, and replay handling around each write tool.
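The separation between a write and its follow-up generation can be illustrated with a small sketch. `AccessWriteTool`, `generate_with_fallback`, the idempotency key format, and the user and resource names are all hypothetical. Authorization is omitted to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class AccessWriteTool:
    """Hypothetical write tool that deduplicates on an idempotency key."""
    results: dict = field(default_factory=dict)
    executions: int = 0

    def grant(self, idempotency_key: str, user: str, resource: str) -> str:
        # A replayed key returns the stored receipt instead of writing again.
        if idempotency_key in self.results:
            return self.results[idempotency_key]
        self.executions += 1
        receipt = f"granted {user} -> {resource}"
        self.results[idempotency_key] = receipt
        return receipt


def generate_with_fallback(generators, receipt):
    """Retry only the generation step; the write receipt is passed in, not redone."""
    for generate in generators:
        try:
            return generate(receipt)
        except TimeoutError:
            continue
    return None  # caller escalates


def primary(receipt):
    raise TimeoutError("primary timed out before output")


def fallback(receipt):
    return f"Access change recorded: {receipt}"


tool = AccessWriteTool()
receipt = tool.grant("access-R900-write-1", "oncall-engineer", "prod-db")
answer = generate_with_fallback([primary, fallback], receipt)
replayed = tool.grant("access-R900-write-1", "oncall-engineer", "prod-db")

print(answer)
print(f"write_executions={tool.executions}")
```

The retry loop never calls the tool, and even an accidental replay with the same key returns the stored receipt, so the write runs once.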
Complete the lesson

Mastery Check

Answer every question, then check your score. Score above 75% to mark this lesson complete.

1. An unsafe router returns "local-private" as soon as it sees tenant-private data, before checking whether a break-glass access request carries at least 50,000 cents of risk. For a tenant-private request with 900 USD of risk that requires citations, what is the failure?
2. Two lanes can be called through the same OpenAI-shaped chat adapter. A request is a private high-risk break-glass access request that requires citations and structured output. What evidence should the gateway use before admitting either lane?
3. A model has already streamed visible text to the user when the connection drops mid-answer. What should the gateway do?
4. A replay lets a learned router rank only lanes that already passed deterministic filtering. The replay shows every served event had no route-contract violations and evaluated lane cost was at most 0.004570 USD. What conclusion is supported before promotion?
5. A private high-risk break-glass access request's primary private cited-review lane times out before output. A healthy cheap fallback is under budget and allows tenant-private data, but it returns plain text and has no citation or human-review support. Should the gateway use it?
6. A private high-risk break-glass access request has compatible lanes ranked by cost: primary-private, local-private, then regional-private. The primary provider circuit is open, and the local-private circuit is also open during cooldown. Regional-private is healthy and within budget. What should the gateway do?
7. A contract requires tenant-private handling, 24,000 context tokens, schema, citations, human review, and cost at most 0.004570 USD. Primary, local, and regional lanes satisfy every field at costs of 0.004200, 0.004500, and 0.004560 USD. A cheaper lane lacks schema, citations, and review. A learned ranker recommends the cheaper lane. Which lane should the gateway select?
8. A tenant-private break-glass access request requires 70,000 context tokens, schema, citations, and human review. Every approved private lane supports at most 64,000 tokens, and no other lane satisfies all contract fields. What should the gateway do?
9. An authorized access write has already succeeded. The model then times out before visible output, and a compatible generation fallback exists. What should the gateway do?
10. After an outage, an incident reviewer must verify which approved policy and cost release governed a fallback, what contract applied, and why a lane was selected. Which audit record supports that review?


Next Step
Continue to Design a Conversational AI Developer Assistant

You now have an auditable gateway policy that preserves cost and safety requirements across normal and failure paths. Next you'll place that policy inside a stateful assistant that retrieves evidence, invokes tools, and escalates real developer requests.

References

OpenAI SDK compatibility

Anthropic · 2026

RouteLLM: Learning to Route LLMs with Preference Data

Ong, I., et al. · 2024

Fallbacks (Proxy Reliability)

BerriAI/LiteLLM · 2026