Model Gateways, Routing, and Fallbacks

Turn an audited cost contract into a model gateway that preserves data, schema, review, and budget requirements across routing and fallback.

Your CodeAssist developer assistant can price each generated answer. The budget contract says an evaluated answer may cost at most 0.004570 USD and must still keep its evidence. A production app faces the harder routing question: which large language model (LLM) lane may answer each request, and what happens when that lane is unavailable?

A model gateway makes that decision. The app submits one request and one set of requirements. The gateway selects a compatible lane, invokes a model adapter, records why it chose that lane, and either finds a compatible fallback or escalates safely after failure.

The important word is compatible. A private, high-risk break-glass access request needs privacy, human review, cited structured output, and a budget ceiling together. A small gateway can enforce that full contract so code can't discard one requirement while satisfying another.

Figure: Gateway routing flow. The gateway first hard-filters lanes on the full contract; only the three private cited-review lanes survive, and measured answer cost then ranks primary ahead of local and regional.

What belongs at the gateway boundary

In the previous lesson, the cost ledger measured model generation after cache decisions had already been made. The gateway consumes that budget contract alongside product and safety requirements. It doesn't decide whether a stored semantic-cache answer is trustworthy; it decides where a request that still requires generation may run.

Use three terms precisely:

| Term | Job | Example here |
| --- | --- | --- |
| Adapter | Translates a stable application request into one provider's API shape | Send messages and parse a structured reply |
| Lane | Names a model path with measured capabilities and cost | `local-private-cited-review` |
| Gateway | Compiles requirements, filters lanes, selects or falls back, and logs the decision | Reject a cheap lane that drops citations |

An OpenAI-shaped adapter isn't proof that two lanes have the same contract. Anthropic's OpenAI SDK compatibility documentation describes that path as a comparison/testing option rather than its usual production path, and documents limitations including unsupported prompt caching and an ignored strict tool-calling parameter. Native structured outputs are required for guaranteed schema conformance there.^[1] A gateway therefore records capabilities explicitly instead of assuming interface similarity means behavior similarity.
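A minimal sketch of what recording capabilities explicitly can look like. The `Capabilities` record, the two stub adapters, and their capability values are hypothetical illustrations, not measurements of any real provider; in practice the values would come from tests run against each lane.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Capabilities:
    """Hypothetical capability record; real values must come from tests."""
    strict_schema: bool
    prompt_caching: bool


class ModelAdapter(Protocol):
    capabilities: Capabilities

    def complete(self, messages: list[dict[str, str]]) -> str: ...


class NativeStubAdapter:
    # Stub standing in for a provider's native API path.
    capabilities = Capabilities(strict_schema=True, prompt_caching=True)

    def complete(self, messages: list[dict[str, str]]) -> str:
        return '{"answer": "stub"}'


class CompatStubAdapter:
    # Stub standing in for an OpenAI-shaped compatibility path.
    capabilities = Capabilities(strict_schema=False, prompt_caching=False)

    def complete(self, messages: list[dict[str, str]]) -> str:
        return '{"answer": "stub"}'


def can_serve(adapter: ModelAdapter, needs_strict_schema: bool) -> bool:
    # Both adapters expose the same call shape, so only the record tells them apart.
    return adapter.capabilities.strict_schema or not needs_strict_schema


adapters = {"native": NativeStubAdapter(), "compat": CompatStubAdapter()}
eligible = [name for name, adapter in adapters.items() if can_serve(adapter, True)]
print(eligible)
```

Both stubs accept identical calls, yet only one is eligible once strict schema conformance is required. That difference is invisible at the interface and visible only in the recorded capabilities.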

The first lab cell defines the artifact arriving from cost engineering, the request facts the app knows, the contract the gateway must compile, and a registry of candidate lanes. The dollar figures are teaching fixtures measured before promotion, not live provider prices. They are preflight evidence, not the final bill: after generation, the gateway still records provider-reported usage and sends it back through the cost ledger.

01-gateway-types-and-lanes.py

from dataclasses import dataclass
from decimal import Decimal
from enum import Enum
import json

COST_RELEASE_ID = "assistant-release-2026-05-cost-v1"
GATEWAY_POLICY_ID = "gateway-policy-v1"
MAX_GENERATED_ANSWER_USD = Decimal("0.004570")
MAX_GENERATION_ATTEMPTS = 2
REQUEST_DEADLINE_MS = 2_500

class DataClass(str, Enum):
    PUBLIC = "public"
    TENANT_PRIVATE = "tenant_private"

@dataclass(frozen=True)
class GatewayRequest:
    request_id: str
    task: str
    data_class: DataClass
    context_tokens: int
    risk_amount_cents: int = 0
    requires_citations: bool = False
    requires_schema: bool = True

@dataclass(frozen=True)
class RouteContract:
    request_id: str
    data_class: DataClass
    context_tokens: int
    needs_citations: bool
    needs_schema: bool
    needs_human_review: bool
    max_answer_cost_usd: Decimal

@dataclass(frozen=True)
class Lane:
    name: str
    provider: str
    allowed_data_classes: frozenset[DataClass]
    max_context_tokens: int
    supports_schema: bool
    supports_citations: bool
    supports_human_review: bool
    evaluated_answer_cost_usd: Decimal
    expected_latency_ms: int

LANES = [
    Lane(
        "fast-public-json", "hosted-fast", frozenset({DataClass.PUBLIC}),
        16_000, True, False, False, Decimal("0.001100"), 260,
    ),
    Lane(
        "public-cited-review", "hosted-cited", frozenset({DataClass.PUBLIC}),
        64_000, True, True, True, Decimal("0.003800"), 860,
    ),
    Lane(
        "primary-private-cited-review", "hosted-private", frozenset({DataClass.TENANT_PRIVATE}),
        64_000, True, True, True, Decimal("0.004200"), 940,
    ),
    Lane(
        "local-private-cited-review", "local-private", frozenset({DataClass.TENANT_PRIVATE}),
        32_000, True, True, True, Decimal("0.004500"), 1_100,
    ),
    Lane(
        "regional-private-cited-review", "regional-private", frozenset({DataClass.TENANT_PRIVATE}),
        64_000, True, True, True, Decimal("0.004560"), 1_300,
    ),
    Lane(
        "cheap-text-fallback", "hosted-cheap", frozenset({DataClass.PUBLIC, DataClass.TENANT_PRIVATE}),
        32_000, False, False, False, Decimal("0.001500"), 420,
    ),
]

print(f"cost_release={COST_RELEASE_ID}")
print(f"gateway_policy={GATEWAY_POLICY_ID}")
print(f"max_generated_answer_usd={MAX_GENERATED_ANSWER_USD}")
print(f"max_generation_attempts={MAX_GENERATION_ATTEMPTS}")
print(f"request_deadline_ms={REQUEST_DEADLINE_MS}")
print(f"registered_lanes={len(LANES)}")

Output

cost_release=assistant-release-2026-05-cost-v1
gateway_policy=gateway-policy-v1
max_generated_answer_usd=0.004570
max_generation_attempts=2
request_deadline_ms=2500
registered_lanes=6

Compile requirements before choosing a lane

A tempting router is a stack of early returns:

unsafe-first-match.py

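def unsafe_route(request: GatewayRequest) -> str:
    # Anti-pattern: the first matching branch returns, so later checks never run.
    if request.data_class == DataClass.TENANT_PRIVATE:
        return "local-private"
    if request.risk_amount_cents >= 50_000:
        return "human-review"
    return "default"  # placeholder for the remaining rules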

For a private break-glass access request with 900 USD of risk, that code returns from the first branch and silently drops the second requirement. The safe sequence is different:

Convert all request facts into one contract.
Filter out every lane that violates any contract field.
Rank only the remaining lanes by measured preferences such as cost or latency.
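The difference between first-match routing and accumulating requirements can be made concrete with a small, self-contained comparison. The `Facts` type and the requirement labels below are hypothetical names used only for this sketch; the threshold mirrors the 50,000-cent value in the snippet above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Facts:
    data_class: str
    risk_amount_cents: int


def first_match(facts: Facts) -> set[str]:
    # Early returns: only the first matching rule contributes a requirement.
    if facts.data_class == "tenant_private":
        return {"private_lane"}
    if facts.risk_amount_cents >= 50_000:
        return {"human_review"}
    return set()


def accumulate(facts: Facts) -> set[str]:
    # Every rule is evaluated; requirements are unioned, never replaced.
    required: set[str] = set()
    if facts.data_class == "tenant_private":
        required.add("private_lane")
    if facts.risk_amount_cents >= 50_000:
        required.add("human_review")
    return required


facts = Facts(data_class="tenant_private", risk_amount_cents=90_000)
dropped = accumulate(facts) - first_match(facts)
print(sorted(dropped))
```

The set difference names exactly what the early-return router lost: the human-review requirement for a request that is both private and high risk.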

The next cell compiles the requirements. A break-glass access request at or above the lesson's policy threshold needs human review; private incident notes independently require a private data lane. Neither check erases the other.

02-compile-route-contract.py

HIGH_RISK_ACCESS_CENTS = 50_000

def compile_contract(request: GatewayRequest) -> RouteContract:
    return RouteContract(
        request_id=request.request_id,
        data_class=request.data_class,
        context_tokens=request.context_tokens,
        needs_citations=request.requires_citations,
        needs_schema=request.requires_schema,
        needs_human_review=request.risk_amount_cents >= HIGH_RISK_ACCESS_CENTS,
        max_answer_cost_usd=MAX_GENERATED_ANSWER_USD,
    )

private_access = GatewayRequest(
    request_id="access-R900",
    task="prod_access_decision",
    data_class=DataClass.TENANT_PRIVATE,
    context_tokens=24_000,
    risk_amount_cents=90_000,
    requires_citations=True,
)
private_contract = compile_contract(private_access)

print(f"request={private_contract.request_id}")
print(f"data_class={private_contract.data_class.value}")
print(f"needs_citations={private_contract.needs_citations}")
print(f"needs_human_review={private_contract.needs_human_review}")
print(f"max_answer_cost_usd={private_contract.max_answer_cost_usd}")

Output

request=access-R900
data_class=tenant_private
needs_citations=True
needs_human_review=True
max_answer_cost_usd=0.004570

Now the filter can explain every rejection. This is more useful than returning a Boolean: if no lane survives, an operator needs to know whether the missing capability is private-data handling, context length, citations, review, or budget.

03-filter-compatible-lanes.py

def contract_violations(lane: Lane, contract: RouteContract) -> list[str]:
    failures: list[str] = []
    if contract.data_class not in lane.allowed_data_classes:
        failures.append("data_boundary")
    if lane.max_context_tokens < contract.context_tokens:
        failures.append("context_length")
    if contract.needs_schema and not lane.supports_schema:
        failures.append("schema")
    if contract.needs_citations and not lane.supports_citations:
        failures.append("citations")
    if contract.needs_human_review and not lane.supports_human_review:
        failures.append("human_review")
    if lane.evaluated_answer_cost_usd > contract.max_answer_cost_usd:
        failures.append("budget")
    return failures

def compatible_lanes(contract: RouteContract) -> list[Lane]:
    return [lane for lane in LANES if not contract_violations(lane, contract)]

for lane in LANES:
    failures = contract_violations(lane, private_contract)
    status = "compatible" if not failures else "reject=" + ",".join(failures)
    print(f"{lane.name}: {status}")

Output

fast-public-json: reject=data_boundary,context_length,citations,human_review
public-cited-review: reject=data_boundary
primary-private-cited-review: compatible
local-private-cited-review: compatible
regional-private-cited-review: compatible
cheap-text-fallback: reject=schema,citations,human_review

Three private cited review lanes survive. The gateway can now choose between them by a soft preference without weakening a hard requirement. Here it chooses the lower measured answer cost, then latency as a deterministic tie-breaker.
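The cost-then-latency preference relies on Python comparing tuples element by element. A minimal sketch with hypothetical rows (two of them deliberately tied on cost) shows latency deciding only when cost is equal:

```python
from decimal import Decimal

# Hypothetical (name, cost_usd, latency_ms) rows; two lanes tie on cost.
survivors = [
    ("lane-a", Decimal("0.004200"), 940),
    ("lane-b", Decimal("0.004200"), 700),
    ("lane-c", Decimal("0.004500"), 300),
]

# Tuples compare left to right: cost first, latency only on a cost tie.
ranked = sorted(survivors, key=lambda row: (row[1], row[2]))
print([name for name, _, _ in ranked])
```

`lane-c` has the best latency but still ranks last because cost is compared first. Building `Decimal` values from strings keeps the cost comparison exact, which matters when candidates differ by fractions of a cent.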

04-choose-primary-lane.py

@dataclass(frozen=True)
class RouteDecision:
    request_id: str
    lane: str | None
    action: str
    reasons: tuple[str, ...]

def choose_primary(contract: RouteContract) -> RouteDecision:
    feasible = compatible_lanes(contract)
    if not feasible:
        return RouteDecision(contract.request_id, None, "escalate", ("no_compatible_lane",))
    lane = min(feasible, key=lambda candidate: (candidate.evaluated_answer_cost_usd, candidate.expected_latency_ms))
    reasons = (
        f"data={contract.data_class.value}",
        f"review={str(contract.needs_human_review).lower()}",
        f"citations={str(contract.needs_citations).lower()}",
        f"budget<={contract.max_answer_cost_usd}",
    )
    return RouteDecision(contract.request_id, lane.name, "generate", reasons)

status_request = GatewayRequest(
    "docs-Q102", "deploy_policy", DataClass.PUBLIC, 2_000,
    requires_citations=False,
)

for request in (status_request, private_access):
    decision = choose_primary(compile_contract(request))
    print(f"{decision.request_id} -> {decision.lane} action={decision.action}")
    print("  " + " ".join(decision.reasons))

Output

docs-Q102 -> fast-public-json action=generate
  data=public review=false citations=false budget<=0.004570
access-R900 -> primary-private-cited-review action=generate
  data=tenant_private review=true citations=true budget<=0.004570

A route is a policy decision, not a provider popularity vote

The filter above is deliberately rule-based. Data boundaries, schema requirements, review requirements, and approved cost ceilings are policy constraints. Letting a probabilistic classifier relax any of them would make a plausible prediction more important than authorization.

Learned routing becomes useful inside the feasible set. RouteLLM trains routers from preference data to choose between a stronger and a weaker LLM; its paper reports cost reductions greater than two times in some evaluated settings without a reported response-quality loss on those settings.^[2] That result supports experimenting with quality-cost routing. It doesn't turn a learned router into a privacy or authorization check.
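One way to keep a learned router inside the feasible set is to apply the hard filter first and hand the scorer only the survivors. The sketch below is an assumption-laden illustration: `Candidate`, `toy_score`, and the quality numbers are hypothetical stand-ins, not RouteLLM's interface or results.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Candidate:
    name: str
    compatible: bool  # result of the hard contract filter
    cost_usd: float


def rank_within_feasible(
    candidates: list[Candidate],
    score: Callable[[Candidate], float],
) -> list[str]:
    # Hard filter first; the scorer never sees an incompatible lane.
    feasible = [c for c in candidates if c.compatible]
    return [c.name for c in sorted(feasible, key=score, reverse=True)]


def toy_score(candidate: Candidate) -> float:
    # Hypothetical stand-in for a learned quality-versus-cost predictor.
    predicted_quality = {"strong": 0.92, "weak": 0.80, "cheap-unsafe": 0.99}
    return predicted_quality[candidate.name] - 20.0 * candidate.cost_usd


lanes = [
    Candidate("strong", True, 0.0042),
    Candidate("weak", True, 0.0011),
    Candidate("cheap-unsafe", False, 0.0005),
]
ranking = rank_within_feasible(lanes, toy_score)
print(ranking)
```

The incompatible lane has the highest predicted quality and the lowest cost, yet it never appears in the ranking. The scorer can only reorder lanes that policy has already authorized.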

This flow diagram shows the control boundary. A soft router may rank already-compatible candidates; a failure re-enters the same contract filter rather than jumping directly to a convenient provider.

Figure: Gateway control flow, showing the normal route and the pre-output failure path, starting from (1) request facts and (2) compile contract.

Fallback is another contract decision

A fallback is a second route attempt after the first attempt fails. It isn't permission to drop requirements. If a cited access answer must be private, structured, reviewable, and within budget before an outage, it must remain so during an outage.

Gateway frameworks expose fallback mechanisms, but configuration doesn't prove semantic compatibility. For example, LiteLLM documents ordered fallbacks and separate regular, context-window, and content-policy fallback settings.^[3] Those mechanisms decide when another lane may be attempted. Your contract still decides which replacement lane is acceptable.

Figure: Fallback flow. A retry needs two approvals: the failure must be transparently retryable, and the replacement lane must preserve the contract and have a healthy circuit. Here the local-private lane wins and the cheap fallback is rejected on contract.

Failure type matters before the gateway even looks for a candidate. A rate limit or timeout before any output is shown can be retried against a compatible lane. Once a model has streamed visible text, silently switching models can produce a contradictory continuation. The safer product behavior is to stop that answer and ask the user to retry or hand it to an operator.

This fixture routes answer generation only. It doesn't replay side-effecting tool calls. A write action such as granting break-glass access needs its own authorization boundary and idempotency key; a gateway must not silently execute it again because generation failed. A context rejection also stops here: the filter already checked context capacity, so a provider rejection signals stale registry data or another mismatch that needs investigation rather than a hidden downgrade.
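The idempotency-key idea for write actions can be sketched as follows. `GrantExecutor`, the key format, and the principal name are hypothetical; a real system would also persist the key store and check authorization before the write.

```python
from dataclasses import dataclass, field


@dataclass
class GrantExecutor:
    """Hypothetical write-side boundary, separate from the gateway."""
    completed: dict[str, str] = field(default_factory=dict)
    executions: int = 0

    def grant_access(self, idempotency_key: str, principal: str) -> str:
        # A repeated key returns the stored result instead of writing again.
        if idempotency_key in self.completed:
            return self.completed[idempotency_key]
        self.executions += 1
        result = f"granted:{principal}"
        self.completed[idempotency_key] = result
        return result


executor = GrantExecutor()
first = executor.grant_access("access-R900:grant", "oncall-engineer")
# A generation retry reuses the same key, so the write is not repeated.
second = executor.grant_access("access-R900:grant", "oncall-engineer")
print(first == second, executor.executions)
```

Because the key is derived from the request rather than from the generation attempt, a fallback that regenerates the answer cannot trigger a second grant.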

The next cell classifies failures by whether a transparent fallback is still allowed.

05-classify-fallback-eligible-failures.py

class FailureKind(str, Enum):
    RATE_LIMIT_BEFORE_OUTPUT = "rate_limit_before_output"
    TIMEOUT_BEFORE_OUTPUT = "timeout_before_output"
    CONTEXT_REJECTED = "context_rejected"
    MID_STREAM_DROP = "mid_stream_drop"
    SCHEMA_INVALID = "schema_invalid"

def may_retry_transparently(failure: FailureKind) -> bool:
    return failure in {
        FailureKind.RATE_LIMIT_BEFORE_OUTPUT,
        FailureKind.TIMEOUT_BEFORE_OUTPUT,
    }

for failure in FailureKind:
    action = "try_compatible_fallback" if may_retry_transparently(failure) else "stop_or_escalate"
    print(f"{failure.value}: {action}")

Output

rate_limit_before_output: try_compatible_fallback
timeout_before_output: try_compatible_fallback
context_rejected: stop_or_escalate
mid_stream_drop: stop_or_escalate
schema_invalid: stop_or_escalate

For the private break-glass access request, suppose the primary lane, the cheapest of the three survivors, times out before emitting output. The fallback selector excludes the failed lane, applies the same contract filter, and then chooses from what remains.

06-select-contract-preserving-fallback.py

def choose_fallback(contract: RouteContract, failed_lane: str) -> RouteDecision:
    candidates = [
        lane for lane in compatible_lanes(contract)
        if lane.name != failed_lane
    ]
    if not candidates:
        return RouteDecision(contract.request_id, None, "escalate", ("fallback_contract_unmet",))
    lane = min(candidates, key=lambda candidate: (candidate.evaluated_answer_cost_usd, candidate.expected_latency_ms))
    return RouteDecision(
        contract.request_id,
        lane.name,
        "fallback_generate",
        (f"primary_failed={failed_lane}", "contract_preserved=true"),
    )

primary = choose_primary(private_contract)
fallback = choose_fallback(private_contract, primary.lane or "")
print(f"primary={primary.lane}")
print(f"fallback={fallback.lane} action={fallback.action}")
print("cheap_text_rejected=" + ",".join(contract_violations(LANES[-1], private_contract)))

Output

primary=primary-private-cited-review
fallback=local-private-cited-review action=fallback_generate
cheap_text_rejected=schema,citations,human_review

Both candidate lanes fit the per-answer admission ceiling, but a failed primary attempt can still consume billable provider work before the timeout reaches the gateway. That means max_answer_cost_usd isn't a guarantee on total request spend during an incident. Production policy also needs a retry budget: a maximum attempt count, one wall-clock deadline shared across attempts, and post-call pricing for usage reported by every provider attempt. This lab allows at most one fallback (MAX_GENERATION_ATTEMPTS = 2) and carries a 2.5-second request deadline into the exported policy. The simulator doesn't advance a clock, so the downstream runtime must enforce that deadline.
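A request-level retry budget along those lines might look like the sketch below. `RetryBudget` and the two per-attempt costs are hypothetical; the clock is injected as plain millisecond integers so the example stays deterministic.

```python
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass
class RetryBudget:
    """Hypothetical request-level budget shared by every attempt."""
    max_attempts: int
    deadline_ms: int
    started_ms: int
    attempts: int = 0
    spent_usd: Decimal = field(default_factory=lambda: Decimal("0"))

    def may_attempt(self, now_ms: int) -> bool:
        # One deadline covers all attempts; it does not reset per lane.
        within_deadline = now_ms - self.started_ms < self.deadline_ms
        return self.attempts < self.max_attempts and within_deadline

    def record(self, reported_cost_usd: Decimal) -> None:
        # Failed attempts count too: they may have consumed billable work.
        self.attempts += 1
        self.spent_usd += reported_cost_usd


budget = RetryBudget(max_attempts=2, deadline_ms=2_500, started_ms=0)
allowed_first = budget.may_attempt(now_ms=0)
budget.record(Decimal("0.001900"))  # hypothetical cost of a timed-out primary
allowed_second = budget.may_attempt(now_ms=1_200)
budget.record(Decimal("0.004500"))  # hypothetical cost of the fallback answer
allowed_third = budget.may_attempt(now_ms=1_800)
print(allowed_first, allowed_second, allowed_third, budget.spent_usd)
```

With these illustrative numbers the request spends 0.006400 USD in total, above the 0.004570 per-answer ceiling, even though each lane was individually admissible. The third attempt is refused by the attempt cap while the deadline still has time left.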

Stop incidents from multiplying

Fallback can make an outage worse if every request keeps trying an unhealthy primary before moving to the backup. A circuit breaker tracks recent failures for an upstream target. Once that target reaches a failure threshold, its circuit opens and new requests skip it during a cooldown period. After cooldown, one probe is allowed through; success closes the circuit, while failure opens it again.

This compact implementation models those three states: closed, open, and half_open. Its key is provider because each fixture provider names one upstream target. A production registry may need a provider, endpoint, deployment, or lane key depending on failure scope, plus bounded retry attempts, backoff, and a deadline for the half-open probe.

07-circuit-breaker.py

class CircuitStatus(str, Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

@dataclass
class CircuitState:
    status: CircuitStatus = CircuitStatus.CLOSED
    failures: int = 0
    opened_until: float = 0.0

class CircuitBreaker:
    def __init__(self, threshold: int = 2, cooldown_seconds: float = 10.0) -> None:
        self.threshold = threshold
        self.cooldown_seconds = cooldown_seconds
        self.states: dict[str, CircuitState] = {}

    def permit(self, provider: str, now: float) -> bool:
        state = self.states.setdefault(provider, CircuitState())
        if state.status == CircuitStatus.OPEN:
            if now < state.opened_until:
                return False
            state.status = CircuitStatus.HALF_OPEN
            return True
        return state.status == CircuitStatus.CLOSED

    def failure(self, provider: str, now: float) -> None:
        state = self.states.setdefault(provider, CircuitState())
        state.failures += 1
        if state.status == CircuitStatus.HALF_OPEN or state.failures >= self.threshold:
            state.status = CircuitStatus.OPEN
            state.opened_until = now + self.cooldown_seconds

    def success(self, provider: str) -> None:
        self.states[provider] = CircuitState()

breaker = CircuitBreaker()
breaker.failure("hosted-private", 100.0)
breaker.failure("hosted-private", 101.0)
print(f"during_cooldown={breaker.permit('hosted-private', 105.0)}")
print(f"probe_after_cooldown={breaker.permit('hosted-private', 112.0)}")
breaker.success("hosted-private")
print(f"after_success={breaker.states['hosted-private'].status.value}")

Output

during_cooldown=False
probe_after_cooldown=True
after_success=closed
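The earlier note that a production registry may key circuits by provider, endpoint, deployment, or lane can be sketched with a small key builder. `Upstream`, the scope names, and the region labels are hypothetical; the point is only that the key should match how failures actually spread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Upstream:
    provider: str
    region: str
    deployment: str


def circuit_key(upstream: Upstream, scope: str) -> str:
    # Hypothetical scopes: pick the narrowest one that matches the failure domain.
    if scope == "provider":
        return upstream.provider
    if scope == "region":
        return f"{upstream.provider}/{upstream.region}"
    if scope == "deployment":
        return f"{upstream.provider}/{upstream.region}/{upstream.deployment}"
    raise ValueError(f"unknown scope: {scope}")


east = Upstream("hosted-private", "us-east", "v3")
west = Upstream("hosted-private", "us-west", "v3")
same_provider_circuit = circuit_key(east, "provider") == circuit_key(west, "provider")
same_region_circuit = circuit_key(east, "region") == circuit_key(west, "region")
print(same_provider_circuit, same_region_circuit)
```

With provider scope, an outage in one region opens the circuit for both. With region scope, the healthy region keeps serving. A key that is too broad hides healthy capacity, and one that is too narrow is slow to notice a provider-wide incident.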

Execute one outage path and record why

Routing decisions are operational evidence. An audit row should carry the gateway-policy identifier, inherited cost-release identifier, selected lane, fallback event, evaluated cost, and every hard requirement that caused selection. That record lets a later incident review distinguish a model-quality problem from a bad policy decision.
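As a rough illustration of such an audit row, the sketch below serializes one decision to a JSON line. `AuditRow` and `to_json_line` are hypothetical names for this example, and the field set is an assumption based on the list above rather than a required schema.

```python
import json
from dataclasses import asdict, dataclass
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class AuditRow:
    """Hypothetical row shape mirroring the fields listed above."""
    request_id: str
    policy_id: str
    cost_release_id: str
    lane: Optional[str]
    fallback: bool
    evaluated_cost_usd: Decimal
    requirements: tuple[str, ...]


def to_json_line(row: AuditRow) -> str:
    payload = asdict(row)
    # Decimal is not JSON-serializable; a string keeps the exact digits.
    payload["evaluated_cost_usd"] = str(row.evaluated_cost_usd)
    return json.dumps(payload, sort_keys=True)


row = AuditRow(
    "access-R900", "gateway-policy-v1", "assistant-release-2026-05-cost-v1",
    "local-private-cited-review", True, Decimal("0.004500"),
    ("data=tenant_private", "review=true", "citations=true"),
)
line = to_json_line(row)
print(line)
```

Writing the cost as a string avoids float rounding when the row is parsed later, so the ledger can compare it against the ceiling exactly.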

The next cell simulates several attempts for the private break-glass access request. It consults circuit state before attempting the primary lane and records failures when an attempt breaks. A timeout before output uses the compatible local fallback. A mid-stream drop escalates because switching generators after output begins would paper over an inconsistent answer. Before the final attempt, the fixture also trips the local circuit, so both the primary and local circuits are open; the gateway then scans the ranked contract-compatible fallbacks until it finds the permitted regional lane.

08-execute-gateway-path.py

@dataclass(frozen=True)
class RouteEvent:
    request_id: str
    policy_id: str
    cost_release_id: str
    action: str
    lane: str | None
    contract_summary: str
    reason: str
    evaluated_cost_usd: Decimal

def lane_by_name(name: str) -> Lane:
    return next(lane for lane in LANES if lane.name == name)

def summarize_contract(contract: RouteContract) -> str:
    return (
        f"data={contract.data_class.value};"
        f"schema={str(contract.needs_schema).lower()};"
        f"citations={str(contract.needs_citations).lower()};"
        f"review={str(contract.needs_human_review).lower()};"
        f"budget<={contract.max_answer_cost_usd}"
    )

def permitted_fallback_lane(contract: RouteContract, failed_lane: str, now: float) -> Lane | None:
    candidates = sorted(
        (lane for lane in compatible_lanes(contract) if lane.name != failed_lane),
        key=lambda candidate: (candidate.evaluated_answer_cost_usd, candidate.expected_latency_ms),
    )
    for lane in candidates:
        if breaker.permit(lane.provider, now):
            return lane
    return None

def execute_with_failure(request: GatewayRequest, failure: FailureKind | None, now: float = 200.0) -> RouteEvent:
    contract = compile_contract(request)
    contract_summary = summarize_contract(contract)
    primary_decision = choose_primary(contract)
    if primary_decision.lane is None:
        return RouteEvent(request.request_id, GATEWAY_POLICY_ID, COST_RELEASE_ID, "escalate", None, contract_summary, "no_primary_lane", Decimal("0"))
    primary_lane = lane_by_name(primary_decision.lane)
    if not breaker.permit(primary_lane.provider, now):
        fallback_lane = permitted_fallback_lane(contract, primary_lane.name, now)
        if fallback_lane is None:
            return RouteEvent(request.request_id, GATEWAY_POLICY_ID, COST_RELEASE_ID, "escalate", None, contract_summary, "no_healthy_safe_fallback", Decimal("0"))
        breaker.success(fallback_lane.provider)
        return RouteEvent(
            request.request_id, GATEWAY_POLICY_ID, COST_RELEASE_ID, "served_fallback", fallback_lane.name,
            contract_summary, "primary_circuit_open;contract_preserved", fallback_lane.evaluated_answer_cost_usd,
        )
    if failure is None:
        breaker.success(primary_lane.provider)
        return RouteEvent(
            request.request_id, GATEWAY_POLICY_ID, COST_RELEASE_ID, "served", primary_lane.name,
            contract_summary, "primary_contract_match", primary_lane.evaluated_answer_cost_usd,
        )
    breaker.failure(primary_lane.provider, now)
    if not may_retry_transparently(failure):
        return RouteEvent(
            request.request_id, GATEWAY_POLICY_ID, COST_RELEASE_ID, "escalate", None,
            contract_summary, f"primary_{failure.value}", Decimal("0"),
        )
    fallback_lane = permitted_fallback_lane(contract, primary_lane.name, now)
    if fallback_lane is None:
        return RouteEvent(request.request_id, GATEWAY_POLICY_ID, COST_RELEASE_ID, "escalate", None, contract_summary, "no_healthy_safe_fallback", Decimal("0"))
    breaker.success(fallback_lane.provider)
    return RouteEvent(
        request.request_id, GATEWAY_POLICY_ID, COST_RELEASE_ID, "served_fallback", fallback_lane.name,
        contract_summary, f"primary_{failure.value};contract_preserved", fallback_lane.evaluated_answer_cost_usd,
    )

breaker = CircuitBreaker()
before_output = execute_with_failure(private_access, FailureKind.TIMEOUT_BEFORE_OUTPUT, now=200.0)
mid_stream = execute_with_failure(private_access, FailureKind.MID_STREAM_DROP, now=201.0)
breaker.failure("local-private", 201.0)
breaker.failure("local-private", 202.0)
skip_open_local = execute_with_failure(private_access, None, now=203.0)

print(f"before_output={before_output.action} lane={before_output.lane} reason={before_output.reason}")
print(f"audit_policy={before_output.policy_id} cost_release={before_output.cost_release_id}")
print(f"audit_contract={before_output.contract_summary}")
print(f"mid_stream={mid_stream.action} lane={mid_stream.lane} reason={mid_stream.reason}")
print(f"circuit_after_failures={breaker.states['hosted-private'].status.value}")
print(f"open_local_skipped={skip_open_local.action} lane={skip_open_local.lane} reason={skip_open_local.reason}")

Output

before_output=served_fallback lane=local-private-cited-review reason=primary_timeout_before_output;contract_preserved
audit_policy=gateway-policy-v1 cost_release=assistant-release-2026-05-cost-v1
audit_contract=data=tenant_private;schema=true;citations=true;review=true;budget<=0.004570
mid_stream=escalate lane=None reason=primary_mid_stream_drop
circuit_after_failures=open
open_local_skipped=served_fallback lane=regional-private-cited-review reason=primary_circuit_open;contract_preserved

Test policy before promoting it

A gateway policy shouldn't be promoted because three hand-picked examples look sensible. Replay an evaluated set that represents low-risk traffic, high-risk cases, private data, outage paths, and unsupported contracts. For every generated response, check that the selected lane preserves the contract and stays inside the inherited budget.

This small replay adds a request with a private context too large for any approved private lane. It also opens the primary circuit on a simulated timeout and confirms that the next request uses a contract-preserving fallback during cooldown. An honest gateway escalates unsupported work rather than truncating evidence or routing private incident notes somewhere unapproved.

09-replay-gateway-policy.py

too_large_private_request = GatewayRequest(
    "access-long-context", "prod_access_decision", DataClass.TENANT_PRIVATE, 70_000,
    risk_amount_cents=90_000, requires_citations=True,
)

breaker = CircuitBreaker(threshold=1, cooldown_seconds=10.0)
replay_cases = [
    (status_request, None),
    (private_access, None),
    (private_access, FailureKind.TIMEOUT_BEFORE_OUTPUT),
    (private_access, None),
    (too_large_private_request, None),
]
events = [
    execute_with_failure(request, failure, now=300.0 + index)
    for index, (request, failure) in enumerate(replay_cases)
]

served = []
for (request, _), event in zip(replay_cases, events):
    if event.action.startswith("served"):
        lane = lane_by_name(event.lane or "")
        assert not contract_violations(lane, compile_contract(request))
        assert event.evaluated_cost_usd <= MAX_GENERATED_ANSWER_USD
        served.append(event)

for event in events:
    print(f"{event.request_id}: {event.action} lane={event.lane}")
print(f"generated_with_contract={len(served)}/{len(events)}")
print("unsafe_generation_events=0")  # reached only if every assert above passed

Output

docs-Q102: served lane=fast-public-json
access-R900: served lane=primary-private-cited-review
access-R900: served_fallback lane=local-private-cited-review
access-R900: served_fallback lane=local-private-cited-review
access-long-context: escalate lane=None
generated_with_contract=4/5
unsafe_generation_events=0

The replay isn't an answer-quality evaluation by itself. Its cost assertion checks preflight lane evidence, not the final bill. After generation, record provider-reported usage and price it with the inherited rate card in the ledger from the previous lesson. Before promotion, also join these route events with the answer-correctness and citation checks from the evaluation lessons. A routing policy can satisfy the mechanical contract and still send too many difficult public questions to a weak but formally compatible lane.
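The post-generation check can be sketched as pricing provider-reported token usage against a rate card. Everything numeric here is an assumption for illustration: `RATE_CARD` and the token counts are hypothetical, and the real rate card is the one inherited from the cost ledger.

```python
from decimal import Decimal

# Hypothetical rate card in USD per million tokens; not real provider prices.
RATE_CARD = {
    "local-private-cited-review": {
        "input": Decimal("0.15"),
        "output": Decimal("0.60"),
    },
}
MAX_GENERATED_ANSWER_USD = Decimal("0.004570")
PER_MILLION = Decimal(1_000_000)


def price_usage(lane: str, input_tokens: int, output_tokens: int) -> Decimal:
    rates = RATE_CARD[lane]
    return (
        rates["input"] * input_tokens + rates["output"] * output_tokens
    ) / PER_MILLION


def within_ceiling(actual_usd: Decimal) -> bool:
    return actual_usd <= MAX_GENERATED_ANSWER_USD


# Usage numbers stand in for what a provider response would report.
actual = price_usage("local-private-cited-review", 24_000, 1_800)
print(actual, within_ceiling(actual))
```

With these made-up numbers the priced answer lands slightly above the ceiling even though the lane's evaluated cost was admitted at route time. That gap is the drift the ledger check exists to catch.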

Only after hard requirements pass should you test learned or heuristic ranking for quality and cost. Useful comparison metrics are generated-answer correctness, citation correctness, human-escalation rate, fallback success rate, p95 latency, and actual spend by lane.
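Two of those metrics, escalation rate and spend by lane, can be computed directly from route events joined with ledger costs. The event tuples below are hypothetical replay data, not results from this lesson's fixtures.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical replayed events: (action, lane, actual_cost_usd).
events = [
    ("served", "fast-public-json", Decimal("0.001050")),
    ("served", "primary-private-cited-review", Decimal("0.004100")),
    ("served_fallback", "local-private-cited-review", Decimal("0.004450")),
    ("escalate", None, Decimal("0")),
]

escalations = sum(1 for action, _, _ in events if action == "escalate")
escalation_rate = escalations / len(events)

# Decimal() is zero, so defaultdict(Decimal) gives exact running totals.
spend_by_lane: dict[str, Decimal] = defaultdict(Decimal)
for action, lane, cost in events:
    if lane is not None:
        spend_by_lane[lane] += cost

print(f"escalation_rate={escalation_rate:.2f}")
for lane, spend in sorted(spend_by_lane.items()):
    print(f"{lane}: {spend}")
```

Comparing these figures between two ranking policies over the same replay set shows whether a cheaper ranking is buying its savings with more escalations or with more spend concentrated on fallback lanes.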

Hand the developer assistant a policy artifact

The next lesson assembles a complete developer assistant. It shouldn't re-create routing rules inside orchestration code. The gateway should export a small, versioned policy artifact that states what has been approved and what must escalate.

The final cell emits the artifact from the lesson: an inherited cost contract, retry limits, three approved example routes (one of them the private fallback), and failure behavior that refuses unsafe degradation.

10-export-gateway-policy.py

policy_artifact = {
    "policy_id": GATEWAY_POLICY_ID,
    "cost_release_id": COST_RELEASE_ID,
    "max_generated_answer_usd": str(MAX_GENERATED_ANSWER_USD),
    "retry_limits": {
        "max_generation_attempts": MAX_GENERATION_ATTEMPTS,
        "request_deadline_ms": REQUEST_DEADLINE_MS,
    },
    "approved_examples": {
        "public_deploy_policy": "fast-public-json",
        "private_high_risk_access": "primary-private-cited-review",
        "private_high_risk_access_fallback": "local-private-cited-review",
    },
    "escalate_when": [
        "no lane preserves all contract fields",
        "failure occurs after visible output begins",
        "approved private context capacity is exceeded",
        "retry attempts or request deadline are exhausted",
    ],
}

print(json.dumps(policy_artifact, indent=2))

Output

{
  "policy_id": "gateway-policy-v1",
  "cost_release_id": "assistant-release-2026-05-cost-v1",
  "max_generated_answer_usd": "0.004570",
  "retry_limits": {
    "max_generation_attempts": 2,
    "request_deadline_ms": 2500
  },
  "approved_examples": {
    "public_deploy_policy": "fast-public-json",
    "private_high_risk_access": "primary-private-cited-review",
    "private_high_risk_access_fallback": "local-private-cited-review"
  },
  "escalate_when": [
    "no lane preserves all contract fields",
    "failure occurs after visible output begins",
    "approved private context capacity is exceeded",
    "retry attempts or request deadline are exhausted"
  ]
}
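A consumer of this artifact would reasonably validate it before trusting it. The loader below is a sketch under that assumption: `load_policy` and `REQUIRED_KEYS` are hypothetical names, and the embedded artifact is a trimmed copy of the output above.

```python
import json
from decimal import Decimal

REQUIRED_KEYS = {
    "policy_id", "cost_release_id", "max_generated_answer_usd",
    "retry_limits", "approved_examples", "escalate_when",
}


def load_policy(raw: str) -> dict:
    policy = json.loads(raw)
    missing = REQUIRED_KEYS - policy.keys()
    if missing:
        raise ValueError(f"policy artifact missing keys: {sorted(missing)}")
    # Parse the ceiling back into Decimal so comparisons stay exact.
    policy["max_generated_answer_usd"] = Decimal(policy["max_generated_answer_usd"])
    if policy["retry_limits"]["max_generation_attempts"] < 1:
        raise ValueError("max_generation_attempts must be at least 1")
    return policy


raw_artifact = json.dumps({
    "policy_id": "gateway-policy-v1",
    "cost_release_id": "assistant-release-2026-05-cost-v1",
    "max_generated_answer_usd": "0.004570",
    "retry_limits": {"max_generation_attempts": 2, "request_deadline_ms": 2500},
    "approved_examples": {"public_deploy_policy": "fast-public-json"},
    "escalate_when": ["no lane preserves all contract fields"],
})
policy = load_policy(raw_artifact)
print(policy["policy_id"], policy["max_generated_answer_usd"])
```

Failing loudly on a missing key keeps orchestration code from silently running with a partial policy, which would be another way of dropping a contract field.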

Mastery check

What you built

A request-to-contract compiler that accumulates privacy, schema, citation, review, and budget requirements.
A filter that can explain why each unsafe lane is rejected.
A primary and fallback selector that ranks only compatible lanes.
A failure classifier and circuit breaker that avoid unsafe or amplifying retries.
Retry-limit policy fields that cap generation attempts and define one request deadline for the downstream runtime.
Audit rows that pin the gateway policy and inherited cost release used for each decision.
A replay check and versioned gateway-policy artifact for the developer-assistant design.

Evaluation rubric

Foundational: Explains why an adapter isn't a gateway and why OpenAI-shaped requests don't prove capability equality.
Intermediate: Compiles a private high-risk break-glass access request into one contract without dropping either privacy or review.
Intermediate: Rejects a low-cost fallback that breaks citations, schema, or data handling.
Advanced: Distinguishes a retryable pre-output failure from a mid-stream failure that must stop or escalate.
Advanced: Separates the per-answer admission ceiling from request-level retry cost, attempt, and deadline controls.
Advanced: Promotes routing only after route-contract checks are joined with answer-quality checks and post-generation cost ledgers.

Common failures

Treating the first matching rule as the full contract

Symptom: Private break-glass access requests stay private but bypass required human review.
Cause: Router returned on the data classification branch before evaluating risk requirements.
Fix: Compile all requirements first and accept only lanes with an empty violation list.

Assuming interface compatibility means safe fallback

Symptom: Fallback returns text, but structured parsing or citation enforcement fails.
Cause: Adapter shape was mistaken for a capability guarantee.
Fix: Maintain tested lane capabilities and reject a fallback that can't meet the output contract.

Retrying through an incident

Symptom: Latency and error volume grow while a failing provider is still attempted on every request.
Cause: Gateway had fallback order but no circuit state.
Fix: Open the unhealthy circuit, permit a controlled probe after cooldown, and record every fallback route.

Stopping after the first open-circuit fallback

Symptom: Gateway escalates even though a later contract-compatible lane is healthy.
Cause: Fallback code selected one safe lane, discovered its circuit was open, and stopped scanning.
Fix: Rank contract-compatible fallbacks, then choose the first lane whose circuit permits a call.

Treating preflight cost evidence as the final bill

Symptom: Route replay passes, but actual generated-answer spend drifts above the approved ceiling.
Cause: Gateway checked the lane's evaluated fixture cost and skipped provider-usage pricing after generation.
Fix: Use evaluated cost for route admission, then record provider-reported usage and run the inherited ledger check on every generated answer.
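
The post-generation half of that fix can be sketched as follows. This is a minimal illustration, not the lesson's cost ledger: the `ProviderUsage` record and both per-million-token prices are hypothetical placeholders, and real prices would come from the inherited cost release.

```python
from dataclasses import dataclass
from decimal import Decimal

MAX_GENERATED_ANSWER_USD = Decimal("0.004570")  # ceiling from the lesson

# Hypothetical prices in USD per million tokens; not taken from any provider.
INPUT_USD_PER_MTOK = Decimal("0.15")
OUTPUT_USD_PER_MTOK = Decimal("0.60")


@dataclass(frozen=True)
class ProviderUsage:
    """Token counts reported by the provider after generation (assumed shape)."""
    input_tokens: int
    output_tokens: int


def billed_cost_usd(usage: ProviderUsage) -> Decimal:
    """Price the reported usage with the hypothetical per-token rates."""
    total = (
        usage.input_tokens * INPUT_USD_PER_MTOK
        + usage.output_tokens * OUTPUT_USD_PER_MTOK
    )
    return total / Decimal(1_000_000)


def ledger_check(usage: ProviderUsage) -> tuple[Decimal, bool]:
    """Return the billed cost and whether it stays within the answer ceiling."""
    cost = billed_cost_usd(usage)
    return cost, cost <= MAX_GENERATED_ANSWER_USD


# Same prompt size, different answer lengths: only the first fits the ceiling.
_, short_ok = ledger_check(ProviderUsage(input_tokens=24_000, output_tokens=1_200))
_, long_ok = ledger_check(ProviderUsage(input_tokens=24_000, output_tokens=2_000))
print(f"short_answer_within_ceiling={short_ok}")  # True
print(f"long_answer_within_ceiling={long_ok}")    # False
```

A lane admitted on its evaluated fixture cost can still fail this check when a real answer runs longer than the fixture did, which is why the check runs on every generated answer.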

Treating fallback attempts as free

Symptom: Every chosen lane fits the answer-cost ceiling, but incident spend and tail latency still exceed policy.
Cause: The gateway applied the ceiling independently to each lane and ignored work consumed by failed attempts.
Fix: Share one retry budget across the request: cap attempts, carry one deadline, and price usage from every provider attempt after the call.
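
One way to share that budget is a small per-request object that every attempt consults, sketched below. The `RetryBudget` class and the billed amounts in the usage lines are hypothetical; only the two limits come from the lesson's `retry_limits`.

```python
from dataclasses import dataclass
from decimal import Decimal

MAX_GENERATION_ATTEMPTS = 2   # from the lesson's retry_limits
REQUEST_DEADLINE_MS = 2_500   # from the lesson's retry_limits
MAX_GENERATED_ANSWER_USD = Decimal("0.004570")


@dataclass
class RetryBudget:
    """Hypothetical budget shared by the primary attempt and every fallback."""
    started_ms: int
    attempts: int = 0
    spent_usd: Decimal = Decimal("0")

    def may_attempt(self, now_ms: int, expected_latency_ms: int) -> bool:
        """Allow a call only if an attempt remains and it can finish in time."""
        within_attempts = self.attempts < MAX_GENERATION_ATTEMPTS
        elapsed_ms = now_ms - self.started_ms
        finishes_in_time = elapsed_ms + expected_latency_ms <= REQUEST_DEADLINE_MS
        return within_attempts and finishes_in_time

    def record(self, billed_usd: Decimal) -> None:
        """Count the attempt and add whatever the provider billed for it."""
        self.attempts += 1
        self.spent_usd += billed_usd


budget = RetryBudget(started_ms=0)
assert budget.may_attempt(now_ms=0, expected_latency_ms=940)
budget.record(Decimal("0.001200"))   # hypothetical bill for a timed-out primary
assert budget.may_attempt(now_ms=1_000, expected_latency_ms=1_100)
budget.record(Decimal("0.004500"))   # hypothetical bill for the fallback answer

print(f"third_attempt_allowed={budget.may_attempt(2_200, 100)}")             # False
print(f"request_over_ceiling={budget.spent_usd > MAX_GENERATED_ANSWER_USD}")  # True
```

Each lane passed admission on its own, yet the request as a whole spent more than the per-answer ceiling. That gap is what request-level accounting has to surface.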

Replaying side effects as if they were generation

Symptom: A transient model failure causes a write tool to run twice.
Cause: Retry policy treated answer generation and authorized side effects as one operation.
Fix: Keep gateway retries scoped to idempotent generation attempts. Put authorization, idempotency keys, and replay handling around each write tool.
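
A minimal sketch of the write-tool side of that fix, assuming an in-memory store and a caller-supplied idempotency key. `WriteToolGuard` and the key format are hypothetical; a production guard would persist results and verify authorization against a real policy.

```python
from collections.abc import Callable
from dataclasses import dataclass, field


@dataclass
class WriteToolGuard:
    """Hypothetical wrapper: runs a write action at most once per idempotency key."""
    results: dict[str, str] = field(default_factory=dict)
    executions: int = 0

    def run(self, idempotency_key: str, authorized: bool, action: Callable[[], str]) -> str:
        if not authorized:
            raise PermissionError("write tool not authorized for this request")
        if idempotency_key in self.results:
            # Replay: return the stored result instead of writing again.
            return self.results[idempotency_key]
        self.executions += 1
        result = action()
        self.results[idempotency_key] = result
        return result


guard = WriteToolGuard()
key = "access-R900:grant_access"  # hypothetical key: request id plus tool name

first = guard.run(key, authorized=True, action=lambda: "granted")
# A gateway retry regenerates the answer and reaches the same tool call again.
second = guard.run(key, authorized=True, action=lambda: "granted")

print(f"results_match={first == second}")       # True
print(f"write_executions={guard.executions}")   # 1
```

The gateway can then retry generation freely, because the second pass through the tool call returns the recorded result rather than repeating the write.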

Next Step

Continue to Design a Conversational AI Developer Assistant

You now have an auditable gateway policy that preserves cost and safety requirements across normal and failure paths. Next you'll place that policy inside a stateful assistant that retrieves evidence, invokes tools, and escalates real developer requests.

References

OpenAI SDK compatibility. Anthropic, 2026.
RouteLLM: Learning to Route LLMs with Preference Data. Ong, I., et al., 2024.
Fallbacks (Proxy Reliability). BerriAI/LiteLLM, 2026.

What belongs at the gateway boundary

Use three terms precisely:

Term	Job	Example here
Adapter	Translates a stable application request into one provider's API shape	Send messages and parse a structured reply
Lane	Names a model path with measured capabilities and cost	`local-private-cited-review`
Gateway	Compiles requirements, filters lanes, selects or falls back, and logs the decision	Reject a cheap lane that drops citations

01-gateway-types-and-lanes.py

from dataclasses import dataclass
from decimal import Decimal
from enum import Enum
import json

COST_RELEASE_ID = "assistant-release-2026-05-cost-v1"
GATEWAY_POLICY_ID = "gateway-policy-v1"
MAX_GENERATED_ANSWER_USD = Decimal("0.004570")
MAX_GENERATION_ATTEMPTS = 2
REQUEST_DEADLINE_MS = 2_500

class DataClass(str, Enum):
    PUBLIC = "public"
    TENANT_PRIVATE = "tenant_private"

@dataclass(frozen=True)
class GatewayRequest:
    request_id: str
    task: str
    data_class: DataClass
    context_tokens: int
    risk_amount_cents: int = 0
    requires_citations: bool = False
    requires_schema: bool = True

@dataclass(frozen=True)
class RouteContract:
    request_id: str
    data_class: DataClass
    context_tokens: int
    needs_citations: bool
    needs_schema: bool
    needs_human_review: bool
    max_answer_cost_usd: Decimal

@dataclass(frozen=True)
class Lane:
    name: str
    provider: str
    allowed_data_classes: frozenset[DataClass]
    max_context_tokens: int
    supports_schema: bool
    supports_citations: bool
    supports_human_review: bool
    evaluated_answer_cost_usd: Decimal
    expected_latency_ms: int

LANES = [
    Lane(
        "fast-public-json", "hosted-fast", frozenset({DataClass.PUBLIC}),
        16_000, True, False, False, Decimal("0.001100"), 260,
    ),
    Lane(
        "public-cited-review", "hosted-cited", frozenset({DataClass.PUBLIC}),
        64_000, True, True, True, Decimal("0.003800"), 860,
    ),
    Lane(
        "primary-private-cited-review", "hosted-private", frozenset({DataClass.TENANT_PRIVATE}),
        64_000, True, True, True, Decimal("0.004200"), 940,
    ),
    Lane(
        "local-private-cited-review", "local-private", frozenset({DataClass.TENANT_PRIVATE}),
        32_000, True, True, True, Decimal("0.004500"), 1_100,
    ),
    Lane(
        "regional-private-cited-review", "regional-private", frozenset({DataClass.TENANT_PRIVATE}),
        64_000, True, True, True, Decimal("0.004560"), 1_300,
    ),
    Lane(
        "cheap-text-fallback", "hosted-cheap", frozenset({DataClass.PUBLIC, DataClass.TENANT_PRIVATE}),
        32_000, False, False, False, Decimal("0.001500"), 420,
    ),
]

print(f"cost_release={COST_RELEASE_ID}")
print(f"gateway_policy={GATEWAY_POLICY_ID}")
print(f"max_generated_answer_usd={MAX_GENERATED_ANSWER_USD}")
print(f"max_generation_attempts={MAX_GENERATION_ATTEMPTS}")
print(f"request_deadline_ms={REQUEST_DEADLINE_MS}")
print(f"registered_lanes={len(LANES)}")

Output

cost_release=assistant-release-2026-05-cost-v1
gateway_policy=gateway-policy-v1
max_generated_answer_usd=0.004570
max_generation_attempts=2
request_deadline_ms=2500
registered_lanes=6

Compile requirements before choosing a lane

A tempting router is a stack of early returns:

unsafe-first-match.py

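# Anti-pattern: each early return hides every requirement checked after it.
def unsafe_route(request):
    if request.data_class == "tenant_private":
        return "local-private"  # returns here, so the risk check below never runs
    if request.risk_amount_cents >= 50_000:
        return "human-review"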

For a private break-glass access request with 900 USD of risk, that code returns from the first branch and silently drops the second requirement. The safe sequence is different:

Convert all request facts into one contract.
Filter out every lane that violates any contract field.
Rank only the remaining lanes by measured preferences such as cost or latency.

02-compile-route-contract.py

HIGH_RISK_ACCESS_CENTS = 50_000

def compile_contract(request: GatewayRequest) -> RouteContract:
    return RouteContract(
        request_id=request.request_id,
        data_class=request.data_class,
        context_tokens=request.context_tokens,
        needs_citations=request.requires_citations,
        needs_schema=request.requires_schema,
        needs_human_review=request.risk_amount_cents >= HIGH_RISK_ACCESS_CENTS,
        max_answer_cost_usd=MAX_GENERATED_ANSWER_USD,
    )

private_access = GatewayRequest(
    request_id="access-R900",
    task="prod_access_decision",
    data_class=DataClass.TENANT_PRIVATE,
    context_tokens=24_000,
    risk_amount_cents=90_000,
    requires_citations=True,
)
private_contract = compile_contract(private_access)

print(f"request={private_contract.request_id}")
print(f"data_class={private_contract.data_class.value}")
print(f"needs_citations={private_contract.needs_citations}")
print(f"needs_human_review={private_contract.needs_human_review}")
print(f"max_answer_cost_usd={private_contract.max_answer_cost_usd}")

Output

request=access-R900
data_class=tenant_private
needs_citations=True
needs_human_review=True
max_answer_cost_usd=0.004570

03-filter-compatible-lanes.py

def contract_violations(lane: Lane, contract: RouteContract) -> list[str]:
    failures: list[str] = []
    if contract.data_class not in lane.allowed_data_classes:
        failures.append("data_boundary")
    if lane.max_context_tokens < contract.context_tokens:
        failures.append("context_length")
    if contract.needs_schema and not lane.supports_schema:
        failures.append("schema")
    if contract.needs_citations and not lane.supports_citations:
        failures.append("citations")
    if contract.needs_human_review and not lane.supports_human_review:
        failures.append("human_review")
    if lane.evaluated_answer_cost_usd > contract.max_answer_cost_usd:
        failures.append("budget")
    return failures

def compatible_lanes(contract: RouteContract) -> list[Lane]:
    return [lane for lane in LANES if not contract_violations(lane, contract)]

for lane in LANES:
    failures = contract_violations(lane, private_contract)
    status = "compatible" if not failures else "reject=" + ",".join(failures)
    print(f"{lane.name}: {status}")

Output

fast-public-json: reject=data_boundary,context_length,citations,human_review
public-cited-review: reject=data_boundary
primary-private-cited-review: compatible
local-private-cited-review: compatible
regional-private-cited-review: compatible
cheap-text-fallback: reject=schema,citations,human_review

04-choose-primary-lane.py

@dataclass(frozen=True)
class RouteDecision:
    request_id: str
    lane: str | None
    action: str
    reasons: tuple[str, ...]

def choose_primary(contract: RouteContract) -> RouteDecision:
    feasible = compatible_lanes(contract)
    if not feasible:
        return RouteDecision(contract.request_id, None, "escalate", ("no_compatible_lane",))
    lane = min(feasible, key=lambda candidate: (candidate.evaluated_answer_cost_usd, candidate.expected_latency_ms))
    reasons = (
        f"data={contract.data_class.value}",
        f"review={str(contract.needs_human_review).lower()}",
        f"citations={str(contract.needs_citations).lower()}",
        f"budget<={contract.max_answer_cost_usd}",
    )
    return RouteDecision(contract.request_id, lane.name, "generate", reasons)

status_request = GatewayRequest(
    "docs-Q102", "deploy_policy", DataClass.PUBLIC, 2_000,
    requires_citations=False,
)

for request in (status_request, private_access):
    decision = choose_primary(compile_contract(request))
    print(f"{decision.request_id} -> {decision.lane} action={decision.action}")
    print("  " + " ".join(decision.reasons))

Output

docs-Q102 -> fast-public-json action=generate
  data=public review=false citations=false budget<=0.004570
access-R900 -> primary-private-cited-review action=generate
  data=tenant_private review=true citations=true budget<=0.004570

A route is a policy decision, not a provider popularity vote

The control boundary is strict. A soft router may rank only candidates that have already passed the contract filter, and a failure re-enters that same filter rather than jumping directly to a convenient provider.

Fallback is another contract decision

The next cell classifies failures by whether a transparent fallback is still allowed.

05-classify-fallback-eligible-failures.py

class FailureKind(str, Enum):
    RATE_LIMIT_BEFORE_OUTPUT = "rate_limit_before_output"
    TIMEOUT_BEFORE_OUTPUT = "timeout_before_output"
    CONTEXT_REJECTED = "context_rejected"
    MID_STREAM_DROP = "mid_stream_drop"
    SCHEMA_INVALID = "schema_invalid"

def may_retry_transparently(failure: FailureKind) -> bool:
    return failure in {
        FailureKind.RATE_LIMIT_BEFORE_OUTPUT,
        FailureKind.TIMEOUT_BEFORE_OUTPUT,
    }

for failure in FailureKind:
    action = "try_compatible_fallback" if may_retry_transparently(failure) else "stop_or_escalate"
    print(f"{failure.value}: {action}")

Output

rate_limit_before_output: try_compatible_fallback
timeout_before_output: try_compatible_fallback
context_rejected: stop_or_escalate
mid_stream_drop: stop_or_escalate
schema_invalid: stop_or_escalate

06-select-contract-preserving-fallback.py

def choose_fallback(contract: RouteContract, failed_lane: str) -> RouteDecision:
    candidates = [
        lane for lane in compatible_lanes(contract)
        if lane.name != failed_lane
    ]
    if not candidates:
        return RouteDecision(contract.request_id, None, "escalate", ("fallback_contract_unmet",))
    lane = min(candidates, key=lambda candidate: (candidate.evaluated_answer_cost_usd, candidate.expected_latency_ms))
    return RouteDecision(
        contract.request_id,
        lane.name,
        "fallback_generate",
        (f"primary_failed={failed_lane}", "contract_preserved=true"),
    )

primary = choose_primary(private_contract)
fallback = choose_fallback(private_contract, primary.lane or "")
print(f"primary={primary.lane}")
print(f"fallback={fallback.lane} action={fallback.action}")
print("cheap_text_rejected=" + ",".join(contract_violations(LANES[-1], private_contract)))

Output

primary=primary-private-cited-review
fallback=local-private-cited-review action=fallback_generate
cheap_text_rejected=schema,citations,human_review

Stop incidents from multiplying

07-circuit-breaker.py

class CircuitStatus(str, Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

@dataclass
class CircuitState:
    status: CircuitStatus = CircuitStatus.CLOSED
    failures: int = 0
    opened_until: float = 0.0

class CircuitBreaker:
    def __init__(self, threshold: int = 2, cooldown_seconds: float = 10.0) -> None:
        self.threshold = threshold
        self.cooldown_seconds = cooldown_seconds
        self.states: dict[str, CircuitState] = {}

    def permit(self, provider: str, now: float) -> bool:
        state = self.states.setdefault(provider, CircuitState())
        if state.status == CircuitStatus.OPEN:
            if now < state.opened_until:
                return False  # still cooling down: skip this provider
            # Cooldown elapsed: let one probe call through.
            state.status = CircuitStatus.HALF_OPEN
            return True
        # CLOSED permits calls; HALF_OPEN blocks further calls until the probe resolves.
        return state.status == CircuitStatus.CLOSED

    def failure(self, provider: str, now: float) -> None:
        state = self.states.setdefault(provider, CircuitState())
        state.failures += 1
        # A failed probe reopens immediately; otherwise open at the failure threshold.
        if state.status == CircuitStatus.HALF_OPEN or state.failures >= self.threshold:
            state.status = CircuitStatus.OPEN
            state.opened_until = now + self.cooldown_seconds

    def success(self, provider: str) -> None:
        self.states[provider] = CircuitState()

breaker = CircuitBreaker()
breaker.failure("hosted-private", 100.0)
breaker.failure("hosted-private", 101.0)
print(f"during_cooldown={breaker.permit('hosted-private', 105.0)}")
print(f"probe_after_cooldown={breaker.permit('hosted-private', 112.0)}")
breaker.success("hosted-private")
print(f"after_success={breaker.states['hosted-private'].status.value}")

Output

during_cooldown=False
probe_after_cooldown=True
after_success=closed

Execute one outage path and record why

08-execute-gateway-path.py

@dataclass(frozen=True)
class RouteEvent:
    request_id: str
    policy_id: str
    cost_release_id: str
    action: str
    lane: str | None
    contract_summary: str
    reason: str
    evaluated_cost_usd: Decimal

def lane_by_name(name: str) -> Lane:
    return next(lane for lane in LANES if lane.name == name)

def summarize_contract(contract: RouteContract) -> str:
    return (
        f"data={contract.data_class.value};"
        f"schema={str(contract.needs_schema).lower()};"
        f"citations={str(contract.needs_citations).lower()};"
        f"review={str(contract.needs_human_review).lower()};"
        f"budget<={contract.max_answer_cost_usd}"
    )

def permitted_fallback_lane(contract: RouteContract, failed_lane: str, now: float) -> Lane | None:
    candidates = sorted(
        (lane for lane in compatible_lanes(contract) if lane.name != failed_lane),
        key=lambda candidate: (candidate.evaluated_answer_cost_usd, candidate.expected_latency_ms),
    )
    for lane in candidates:
        if breaker.permit(lane.provider, now):
            return lane
    return None

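def execute_with_failure(request: GatewayRequest, failure: FailureKind | None, now: float = 200.0) -> RouteEvent:
    # Simulation: `failure` is the injected outcome of the primary call (None means success).
    # A fallback that passes the contract and circuit checks is assumed to succeed.
    contract = compile_contract(request)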
    contract_summary = summarize_contract(contract)
    primary_decision = choose_primary(contract)
    if primary_decision.lane is None:
        return RouteEvent(request.request_id, GATEWAY_POLICY_ID, COST_RELEASE_ID, "escalate", None, contract_summary, "no_primary_lane", Decimal("0"))
    primary_lane = lane_by_name(primary_decision.lane)
    if not breaker.permit(primary_lane.provider, now):
        fallback_lane = permitted_fallback_lane(contract, primary_lane.name, now)
        if fallback_lane is None:
            return RouteEvent(request.request_id, GATEWAY_POLICY_ID, COST_RELEASE_ID, "escalate", None, contract_summary, "no_healthy_safe_fallback", Decimal("0"))
        breaker.success(fallback_lane.provider)
        return RouteEvent(
            request.request_id, GATEWAY_POLICY_ID, COST_RELEASE_ID, "served_fallback", fallback_lane.name,
            contract_summary, "primary_circuit_open;contract_preserved", fallback_lane.evaluated_answer_cost_usd,
        )
    if failure is None:
        breaker.success(primary_lane.provider)
        return RouteEvent(
            request.request_id, GATEWAY_POLICY_ID, COST_RELEASE_ID, "served", primary_lane.name,
            contract_summary, "primary_contract_match", primary_lane.evaluated_answer_cost_usd,
        )
    breaker.failure(primary_lane.provider, now)
    if not may_retry_transparently(failure):
        return RouteEvent(
            request.request_id, GATEWAY_POLICY_ID, COST_RELEASE_ID, "escalate", None,
            contract_summary, f"primary_{failure.value}", Decimal("0"),
        )
    fallback_lane = permitted_fallback_lane(contract, primary_lane.name, now)
    if fallback_lane is None:
        return RouteEvent(request.request_id, GATEWAY_POLICY_ID, COST_RELEASE_ID, "escalate", None, contract_summary, "no_healthy_safe_fallback", Decimal("0"))
    breaker.success(fallback_lane.provider)
    return RouteEvent(
        request.request_id, GATEWAY_POLICY_ID, COST_RELEASE_ID, "served_fallback", fallback_lane.name,
        contract_summary, f"primary_{failure.value};contract_preserved", fallback_lane.evaluated_answer_cost_usd,
    )

breaker = CircuitBreaker()
before_output = execute_with_failure(private_access, FailureKind.TIMEOUT_BEFORE_OUTPUT, now=200.0)
mid_stream = execute_with_failure(private_access, FailureKind.MID_STREAM_DROP, now=201.0)
breaker.failure("local-private", 201.0)
breaker.failure("local-private", 202.0)
skip_open_local = execute_with_failure(private_access, None, now=203.0)

print(f"before_output={before_output.action} lane={before_output.lane} reason={before_output.reason}")
print(f"audit_policy={before_output.policy_id} cost_release={before_output.cost_release_id}")
print(f"audit_contract={before_output.contract_summary}")
print(f"mid_stream={mid_stream.action} lane={mid_stream.lane} reason={mid_stream.reason}")
print(f"circuit_after_failures={breaker.states['hosted-private'].status.value}")
print(f"open_local_skipped={skip_open_local.action} lane={skip_open_local.lane} reason={skip_open_local.reason}")

Output

before_output=served_fallback lane=local-private-cited-review reason=primary_timeout_before_output;contract_preserved
audit_policy=gateway-policy-v1 cost_release=assistant-release-2026-05-cost-v1
audit_contract=data=tenant_private;schema=true;citations=true;review=true;budget<=0.004570
mid_stream=escalate lane=None reason=primary_mid_stream_drop
circuit_after_failures=open
open_local_skipped=served_fallback lane=regional-private-cited-review reason=primary_circuit_open;contract_preserved

Test policy before promoting it

09-replay-gateway-policy.py

too_large_private_request = GatewayRequest(
    "access-long-context", "prod_access_decision", DataClass.TENANT_PRIVATE, 70_000,
    risk_amount_cents=90_000, requires_citations=True,
)

breaker = CircuitBreaker(threshold=1, cooldown_seconds=10.0)
replay_cases = [
    (status_request, None),
    (private_access, None),
    (private_access, FailureKind.TIMEOUT_BEFORE_OUTPUT),
    (private_access, None),
    (too_large_private_request, None),
]
events = [
    execute_with_failure(request, failure, now=300.0 + index)
    for index, (request, failure) in enumerate(replay_cases)
]

served = []
for (request, _), event in zip(replay_cases, events):
    if event.action.startswith("served"):
        lane = lane_by_name(event.lane or "")
        assert not contract_violations(lane, compile_contract(request))
        assert event.evaluated_cost_usd <= MAX_GENERATED_ANSWER_USD
        served.append(event)

for event in events:
    print(f"{event.request_id}: {event.action} lane={event.lane}")
print(f"generated_with_contract={len(served)}/{len(events)}")
print("unsafe_generation_events=0")  # reached only if every assert above passed

Output

docs-Q102: served lane=fast-public-json
access-R900: served lane=primary-private-cited-review
access-R900: served_fallback lane=local-private-cited-review
access-R900: served_fallback lane=local-private-cited-review
access-long-context: escalate lane=None
generated_with_contract=4/5
unsafe_generation_events=0

Hand the developer assistant a policy artifact

The final cell emits the lesson's artifact: the inherited cost contract, the retry limits, two demonstrated routes with one approved fallback, and failure behavior that refuses unsafe degradation.

10-export-gateway-policy.py

policy_artifact = {
    "policy_id": GATEWAY_POLICY_ID,
    "cost_release_id": COST_RELEASE_ID,
    "max_generated_answer_usd": str(MAX_GENERATED_ANSWER_USD),
    "retry_limits": {
        "max_generation_attempts": MAX_GENERATION_ATTEMPTS,
        "request_deadline_ms": REQUEST_DEADLINE_MS,
    },
    "approved_examples": {
        "public_deploy_policy": "fast-public-json",
        "private_high_risk_access": "primary-private-cited-review",
        "private_high_risk_access_fallback": "local-private-cited-review",
    },
    "escalate_when": [
        "no lane preserves all contract fields",
        "failure occurs after visible output begins",
        "approved private context capacity is exceeded",
        "retry attempts or request deadline are exhausted",
    ],
}

print(json.dumps(policy_artifact, indent=2))

Output

{
  "policy_id": "gateway-policy-v1",
  "cost_release_id": "assistant-release-2026-05-cost-v1",
  "max_generated_answer_usd": "0.004570",
  "retry_limits": {
    "max_generation_attempts": 2,
    "request_deadline_ms": 2500
  },
  "approved_examples": {
    "public_deploy_policy": "fast-public-json",
    "private_high_risk_access": "primary-private-cited-review",
    "private_high_risk_access_fallback": "local-private-cited-review"
  },
  "escalate_when": [
    "no lane preserves all contract fields",
    "failure occurs after visible output begins",
    "approved private context capacity is exceeded",
    "retry attempts or request deadline are exhausted"
  ]
}
