Model Versioning & Deployment

Turn an evaluated LLM change into an immutable release bundle, promote it through measured traffic, and roll back without losing lineage.

A precision candidate can stay blocked until a target-GPU profile proves speed and stability. Suppose the profile that was missing now exists and passes. You still don't have something safe to deploy. A live answer can change because of weights, an evidence gate, a prompt, a tokenizer, a policy corpus, a container image, or a schema.

For the incident-response assistant you've been building, deployment means answering one exact question: which complete, evaluated release produced this response, and how quickly can traffic return to the last known good release?

Figure: Release identity changes when any behavior-producing field changes. The stable and candidate bundles keep the answer model, policy corpus, and evaluation contract pinned and reviewable. Changing the evidence gate, prompt version, or serving image creates a new release bundle and a new release ID.

A release is more than model weights

The service now contains an answer model plus the runbook-evidence-classifier-v2 gate trained in the previous lesson. That gate blocks answers whose incident claim isn't supported by a runbook excerpt. If you roll back only its weights but leave a newer prompt, policy, or corpus active, you haven't restored previous behavior.

This is the release-bundle idea: store every input that can change visible output or operational safety in one immutable manifest. Keep the declared evaluation contract there too, then attach the resulting report artifact to the promotion decision. ML systems accumulate hidden dependencies between data, code, configuration, and serving infrastructure; deployment records need to make those dependencies reviewable. [1] (Hidden Technical Debt in Machine Learning Systems, https://research.google/pubs/hidden-technical-debt-in-machine-learning-systems/) [2] (Challenges in Deploying Machine Learning: a Survey of Case Studies, https://arxiv.org/abs/2011.09926)

Bundle field	Why it belongs in the release
Answer-model and evidence-gate identifiers	Determine model behavior
Training run and precision policy	Explain how the new gate was produced
Tokenizer and prompt version	Change the text the model sees
Policy and corpus versions	Decide which evidence is available and when it's sufficient to serve
Serving image and schema	Change runtime behavior and API compatibility
Evaluation-suite hash and evaluator version	Declare the comparable scoring contract required before promotion

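Start with two full bundles. stable represents what's serving today. candidate swaps in the v2 evidence gate, along with its training run and precision policy, after the profiled BF16 training run has passed outside the previous lesson's blocked fixture. It also changes the prompt version and the serving image.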

define-release-bundles.py

from dataclasses import asdict, dataclass, replace
import hashlib
import json

@dataclass(frozen=True)
class ReleaseBundle:
    service: str
    answer_model: str
    evidence_gate: str
    gate_training_run: str
    precision_policy: str
    tokenizer: str
    prompt_version: str
    policy_version: str
    corpus_version: str
    serving_image: str
    input_schema: str
    eval_suite: str
    evaluator_version: str

stable = ReleaseBundle(
    service="incident-evidence-answerer",
    answer_model="answer-model@sha256:3b07",
    evidence_gate="runbook-evidence-classifier-v1@sha256:0f12",
    gate_training_run="run_fp32_baseline",
    precision_policy="fp32",
    tokenizer="incident-tokenizer@sha256:91aa",
    prompt_version="[email protected]",
    policy_version="[email protected]",
    corpus_version="incident-runbooks@sha256:corpus-5",
    serving_image="registry.example/answerer@sha256:image-a",
    input_schema="incident-answer.v2",
    eval_suite="incident-grounding-suite@sha256:suite-7",
    evaluator_version="claim-evidence-eval-v2",
)

candidate = replace(
    stable,
    evidence_gate="runbook-evidence-classifier-v2@sha256:41d8",
    gate_training_run="run_bf16_profiled_target_gpu",
    precision_policy="bf16",
    prompt_version="[email protected]",
    serving_image="registry.example/answerer@sha256:image-b",
)

print(f"stable_gate={stable.evidence_gate}")
print(f"candidate_gate={candidate.evidence_gate}")
print(f"candidate_training_run={candidate.gate_training_run}")
print(f"candidate_precision={candidate.precision_policy}")

Output

stable_gate=runbook-evidence-classifier-v1@sha256:0f12
candidate_gate=runbook-evidence-classifier-v2@sha256:41d8
candidate_training_run=run_bf16_profiled_target_gpu
candidate_precision=bf16

A version name such as v2 is useful for people, but it doesn't prove contents stayed fixed. The next cell derives identity from the canonical manifest. Change a prompt or image digest and the release ID changes too.

content-address-the-release.py

def release_id(bundle: ReleaseBundle) -> str:
    payload = json.dumps(asdict(bundle), sort_keys=True, separators=(",", ":"))
    # Short prefix keeps the teaching output readable. Retain the full digest in production.
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
    return f"{bundle.service}@sha256:{digest}"

stable_id = release_id(stable)
candidate_id = release_id(candidate)
prompt_patch_id = release_id(replace(candidate, prompt_version="[email protected]"))

print(f"stable={stable_id}")
print(f"candidate={candidate_id}")
print(f"prompt_patch={prompt_patch_id}")
print(f"prompt_patch_is_new_release={prompt_patch_id != candidate_id}")

Output

stable=incident-evidence-answerer@sha256:026746ee8fb8
candidate=incident-evidence-answerer@sha256:fa60321b1dce
prompt_patch=incident-evidence-answerer@sha256:e016f9a2b187
prompt_patch_is_new_release=True

This lab abbreviates each SHA-256 digest to 12 hexadecimal characters so the state transitions stay readable. A production registry should retain the full digest as identity and use short prefixes only for display.
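One way to honor that split is to keep the full digest as the stored key and shorten it only when rendering. The sketch here is an illustration, not the lesson's registry: `full_release_digest` and `display_id` are hypothetical helper names, and the two manifests are trimmed stand-ins for a real bundle. It also shows why the canonical JSON form matters, since key order doesn't change the digest.

```python
import hashlib
import json


def full_release_digest(manifest: dict) -> str:
    """Hash the canonical JSON form so key order cannot change identity."""
    payload = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def display_id(service: str, digest: str, width: int = 12) -> str:
    """Shorten only for logs and dashboards; never store the short form as the key."""
    return f"{service}@sha256:{digest[:width]}"


# Hypothetical, trimmed manifests with the same content in a different key order.
manifest_a = {"service": "incident-evidence-answerer", "input_schema": "incident-answer.v2"}
manifest_b = {"input_schema": "incident-answer.v2", "service": "incident-evidence-answerer"}

digest_a = full_release_digest(manifest_a)
digest_b = full_release_digest(manifest_b)

print(f"stored_key_length={len(digest_a)}")
print(f"same_identity_despite_key_order={digest_a == digest_b}")
print(f"display={display_id(manifest_a['service'], digest_a)}")
```

A 12-character prefix is convenient to read, but two different bundles could eventually share it, so lookups and audit records should use the 64-character value.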

Artifacts stay fixed; aliases move

A registry record stores an immutable bundle. An alias such as production or canary is a mutable pointer used by traffic. This separation is what makes rollback simple: preserve both bundles and move the pointer back.

MLflow's current Model Registry workflow provides version aliases and tags, and its documentation marks fixed Model Stages as deprecated. That distinction matters: a model version is history; an alias expresses the current deployment decision. [3] (Model Registry Workflows | MLflow AI Platform, https://mlflow.org/docs/latest/ml/model-registry/workflow/)

An MLflow model alias still points to one registered model version, not automatically to the prompt, corpus, policy, schema, and serving image in the release bundle. Treat it as one component pointer, or register a wrapper artifact whose manifest resolves the complete bundle. Don't mistake a movable model alias for complete release identity.

Figure: Alias step chart over two fixed release-bundle rails. The fixed rails are stored release IDs; the colored lines are aliases. Canary reaches the candidate (fa60321b1dce) first, production starts on stable (026746ee8fb8) and moves only after promotion, and rollback returns production to the retained stable bundle without rewriting either manifest.

The small registry below enforces that rule. register() keeps a deep copy of the bundle, and move_alias() only points at a registered ID.

registry-and-aliases.py

from copy import deepcopy

class ReleaseRegistry:
    def __init__(self) -> None:
        self._bundles: dict[str, ReleaseBundle] = {}
        self._aliases: dict[str, str] = {}

    def register(self, bundle: ReleaseBundle) -> str:
        bundle_id = release_id(bundle)
        existing = self._bundles.get(bundle_id)
        if existing is not None and existing != bundle:
            raise ValueError("release digest collision")
        self._bundles[bundle_id] = deepcopy(bundle)
        return bundle_id

    def move_alias(self, alias: str, bundle_id: str) -> None:
        if bundle_id not in self._bundles:
            raise KeyError(f"unregistered release: {bundle_id}")
        self._aliases[alias] = bundle_id

    def resolve(self, alias: str) -> str:
        return self._aliases[alias]

registry = ReleaseRegistry()
assert registry.register(stable) == stable_id
assert registry.register(candidate) == candidate_id
registry.move_alias("production", stable_id)

print(f"registered={len(registry._bundles)}")
print(f"production={registry.resolve('production')}")
print(f"candidate_registered={candidate_id in registry._bundles}")

Output

registered=2
production=incident-evidence-answerer@sha256:026746ee8fb8
candidate_registered=True

The teaching registry keeps move_alias() deliberately small. A production control plane should make that update atomic, record the expected previous target, and reject a stale promotion if another rollout moved the alias first.
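A minimal sketch of that compare-and-swap behavior, assuming a single-process control plane guarded by a lock. `AliasTable`, `compare_and_move`, `StaleAliasError`, and the `release-a`/`release-b`/`release-c` IDs are hypothetical names for illustration; a real control plane would more likely lean on its datastore's conditional write than on an in-process lock.

```python
import threading
from typing import Optional


class StaleAliasError(RuntimeError):
    """Raised when the alias no longer points at the target the caller expected."""


class AliasTable:
    """Hypothetical in-memory alias table with compare-and-swap updates."""

    def __init__(self, registered_ids: set[str]) -> None:
        self._registered = set(registered_ids)
        self._aliases: dict[str, str] = {}
        self._lock = threading.Lock()

    def compare_and_move(self, alias: str, expected: Optional[str], new_id: str) -> None:
        if new_id not in self._registered:
            raise KeyError(f"unregistered release: {new_id}")
        with self._lock:  # the check and the write happen as one step
            current = self._aliases.get(alias)
            if current != expected:
                raise StaleAliasError(f"{alias} points at {current!r}, expected {expected!r}")
            self._aliases[alias] = new_id

    def resolve(self, alias: str) -> str:
        with self._lock:
            return self._aliases[alias]


table = AliasTable({"release-a", "release-b", "release-c"})
table.compare_and_move("production", expected=None, new_id="release-a")
table.compare_and_move("production", expected="release-a", new_id="release-b")

try:
    # A second rollout that still believes production is release-a is rejected.
    table.compare_and_move("production", expected="release-a", new_id="release-c")
    stale_rejected = False
except StaleAliasError:
    stale_rejected = True

print(f"production={table.resolve('production')}")
print(f"stale_rejected={stale_rejected}")
```

Passing the expected previous target also gives the event log a natural "from" and "to" pair for every alias move.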

Promotion begins with controlled evidence

Registration isn't approval. Continuous delivery for machine learning adds evaluation gates and monitoring to ordinary build-and-deploy practices because a valid artifact can still produce unacceptable behavior. [4] (Continuous Delivery for Machine Learning, https://martinfowler.com/articles/cd4ml.html)

In this running system, don't introduce a generic "quality score" after spending several lessons defining a grounded evidence contract. The release gate should use the same measurements the incident, evaluation, and experiment lessons established:

supported_evidence_f1 measures whether supported incident claims are served correctly.
unsupported_serve_count must remain zero in the frozen high-risk suite.
p95_latency_ms (p95 latency) prevents a behaviorally acceptable gate from breaking the response budget.
Schema hash, evaluation-suite hash, and evaluator version prevent incomparable evidence from entering the decision.
evaluation_report preserves the report artifact a reviewer or incident responder can inspect later.

Figure: Offline promotion gate. Offline evidence counts only when release identity and evaluation contract match. Passing that gate can open canary traffic, while production stays pinned to the stable bundle.

offline-promotion-gate.py

@dataclass(frozen=True)
class OfflineEvidence:
    release_id: str
    eval_suite: str
    evaluator_version: str
    input_schema: str
    evaluation_report: str
    supported_evidence_f1: float
    unsupported_serve_count: int
    p95_latency_ms: int

@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str

def offline_gate(bundle: ReleaseBundle, evidence: OfflineEvidence) -> Decision:
    if evidence.release_id != release_id(bundle):
        return Decision(False, "evidence belongs to another release")
    if evidence.eval_suite != bundle.eval_suite:
        return Decision(False, "evaluation suite changed")
    if evidence.evaluator_version != bundle.evaluator_version:
        return Decision(False, "evaluator version changed")
    if evidence.input_schema != bundle.input_schema:
        return Decision(False, "schema mismatch")
    if not evidence.evaluation_report:
        return Decision(False, "evaluation report missing")
    if evidence.supported_evidence_f1 < 0.92:
        return Decision(False, "supported_evidence_f1 below 0.92")
    if evidence.unsupported_serve_count != 0:
        return Decision(False, "unsupported answer was served")
    if evidence.p95_latency_ms > 500:
        return Decision(False, "p95 latency exceeds 500 ms")
    return Decision(True, "offline gate passed")

candidate_offline = OfflineEvidence(
    release_id=candidate_id,
    eval_suite=candidate.eval_suite,
    evaluator_version=candidate.evaluator_version,
    input_schema=candidate.input_schema,
    evaluation_report="reports/candidate-suite-7-redacted.json",
    supported_evidence_f1=0.93,
    unsupported_serve_count=0,
    p95_latency_ms=472,
)
weaker_candidate = replace(candidate_offline, supported_evidence_f1=0.89)
changed_evaluator = replace(candidate_offline, evaluator_version="claim-evidence-eval-v3")

print(f"candidate={offline_gate(candidate, candidate_offline)}")
print(f"weak_metric={offline_gate(candidate, weaker_candidate)}")
print(f"changed_evaluator={offline_gate(candidate, changed_evaluator)}")

Output

candidate=Decision(allowed=True, reason='offline gate passed')
weak_metric=Decision(allowed=False, reason='supported_evidence_f1 below 0.92')
changed_evaluator=Decision(allowed=False, reason='evaluator version changed')

Passing the offline gate permits further evaluation; it doesn't immediately replace production. Open a canary alias while production still points to the known-good bundle.

open-canary-only-after-gate.py

def open_canary(bundle: ReleaseBundle, evidence: OfflineEvidence) -> Decision:
    decision = offline_gate(bundle, evidence)
    if decision.allowed:
        registry.move_alias("canary", release_id(bundle))
    return decision

canary_decision = open_canary(candidate, candidate_offline)

print(f"canary_opened={canary_decision.allowed}")
print(f"canary={registry.resolve('canary')}")
print(f"production_still_stable={registry.resolve('production') == stable_id}")

Output

canary_opened=True
canary=incident-evidence-answerer@sha256:fa60321b1dce
production_still_stable=True

Managed models need a documented pin

When your team owns weights, a digest can identify them directly. If a service calls a hosted model, the bundle must instead record the strongest fixed identifier that provider documents. For example, OpenAI model pages that expose snapshots describe snapshots as locking a particular model version for consistent behavior. [5] (Models | OpenAI API, https://platform.openai.com/docs/models)

A provider can change the model behind a name you thought was stable, shifting your outputs with no deploy on your side. A controlled study comparing the March and June 2023 releases of GPT-4 and GPT-3.5 reported large behavior swings on the same tasks between snapshots, so the "same" service was not always the same service. [6] (How Is ChatGPT's Behavior Changing over Time?, https://arxiv.org/abs/2307.09009) Pinning a documented snapshot reduces that risk but doesn't remove it: snapshots get deprecated, and a golden-set monitor against production still earns its keep. [7] (Deprecations | OpenAI API, https://developers.openai.com/api/docs/deprecations)

Don't generalize that guarantee to every provider or every alias. Verify the exact provider documentation, store the chosen model identifier in the bundle, monitor deprecation notices, and rerun release gates before changing it.
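A release check can refuse a hosted-model identifier that looks like a floating name. The sketch assumes the provider marks fixed snapshots with a trailing `YYYY-MM-DD` date, which is a provider-specific naming convention and not a universal rule; `is_pinned_snapshot` and `validate_hosted_model` are hypothetical helpers, and the identifiers are only examples to check against the provider's own model page.

```python
import re

# Assumption: fixed snapshots end in a YYYY-MM-DD date. Verify this against the
# provider's documentation before relying on it; other providers name pins differently.
DATED_SNAPSHOT = re.compile(r"-\d{4}-\d{2}-\d{2}$")


def is_pinned_snapshot(model_id: str) -> bool:
    return bool(DATED_SNAPSHOT.search(model_id))


def validate_hosted_model(model_id: str) -> str:
    """Return the identifier unchanged, or refuse to put a floating name in a bundle."""
    if not is_pinned_snapshot(model_id):
        raise ValueError(f"floating model name is not a release pin: {model_id}")
    return model_id


print(f"dated_is_pinned={is_pinned_snapshot('gpt-4o-2024-08-06')}")
print(f"bare_name_is_pinned={is_pinned_snapshot('gpt-4o')}")
```

A name check like this only catches an obviously floating identifier. It can't prove the provider keeps the snapshot fixed, which is why the deprecation monitoring and rerun gates still apply.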

Replay a failed production trace against the candidate

Frozen fixtures test the failures you already imagined. A deterministic replay tests a failure you actually shipped: pull the recorded trace for a request that went wrong under the current release, then re-run its exact prompt and tool-call sequence against the candidate release while holding the recorded evidence snapshot fixed. The only thing that varies is the release under test, so a difference in behavior is attributable to the candidate, not to a new request or a moved corpus.

This answers a precise regression question: would the candidate have made the same mistake on this real request? It sits between the offline gate and the shadow because it reuses recorded inputs rather than inventing fixtures or spending live traffic. The replay is read-only: it feeds recorded tool results back in rather than re-executing tools, so no incident state changes.

The stable release served a claim unsupported by admitted evidence. Replay that trace against the candidate's evidence gate:

deterministic-replay.py

@dataclass(frozen=True)
class RecordedTrace:
    request_id: str
    origin_release: str
    prompt_version: str
    evidence_version: str
    tool_sequence: tuple[str, ...]
    evidence_supports_claim: bool
    served_unsupported_claim: bool

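def candidate_gate_serves(evidence_supports_claim: bool) -> bool:
    # Teaching stub for the v2 evidence gate: it serves a factual claim only when
    # admitted evidence supports it. A real harness would run the bundle's actual gate.
    return evidence_supports_claim

def replay_against(trace: RecordedTrace, bundle: ReleaseBundle) -> dict[str, object]:
    would_serve = candidate_gate_serves(trace.evidence_supports_claim)
    # The recorded failure was an unsupported serve, so the candidate reproduces it
    # only if it would serve while the evidence still fails to support the claim.
    reproduces = would_serve and not trace.evidence_supports_claim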
    return {
        "candidate_release": release_id(bundle),
        "tool_sequence": trace.tool_sequence,
        "candidate_reproduces_failure": reproduces,
    }

failed_trace = RecordedTrace(
    request_id="req_88213",
    origin_release=stable_id,
    prompt_version=stable.prompt_version,
    evidence_version="incident-runbooks@sha256:corpus-5",
    tool_sequence=("lookup_incident", "fetch_runbook", "answer"),
    evidence_supports_claim=False,
    served_unsupported_claim=True,
)
replay_result = replay_against(failed_trace, candidate)

print(f"origin_release={failed_trace.origin_release}")
print(f"replayed_against={replay_result['candidate_release']}")
print(f"inputs_held_fixed={replay_result['tool_sequence']}")
print(f"candidate_reproduces_failure={replay_result['candidate_reproduces_failure']}")

Output

origin_release=incident-evidence-answerer@sha256:026746ee8fb8
replayed_against=incident-evidence-answerer@sha256:fa60321b1dce
inputs_held_fixed=('lookup_incident', 'fetch_runbook', 'answer')
candidate_reproduces_failure=False

A clean replay is real evidence that the candidate fixes this specific incident, but it isn't a general guarantee. A production replay harness needs the recorded prompt, retrieved evidence version, tool inputs and outputs, and any non-deterministic settings pinned (temperature, seed, and model snapshot), or the "replay" quietly becomes a fresh run whose difference you can't attribute. Keep a growing library of failed traces and add each new incident to it, so a future candidate must clear every past production mistake before promotion.
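The library idea can be sketched as a gate that refuses any trace with an unrecorded non-deterministic setting and then asks whether the candidate repeats the recorded mistake. Everything here is an assumption for illustration: `PinnedTrace`, `missing_pins`, and `replay_library_gate` are hypothetical names, and `candidate_serves` is a stand-in callable for running the candidate's real pipeline on recorded inputs.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class PinnedTrace:
    request_id: str
    evidence_version: str
    recorded_tool_outputs: tuple[str, ...]
    temperature: Optional[float]
    seed: Optional[int]
    model_snapshot: Optional[str]
    evidence_supports_claim: bool


def missing_pins(trace: PinnedTrace) -> list[str]:
    """Name every non-deterministic setting the trace failed to record."""
    pins = {
        "temperature": trace.temperature,
        "seed": trace.seed,
        "model_snapshot": trace.model_snapshot,
    }
    return [name for name, value in pins.items() if value is None]


def replay_library_gate(
    traces: Iterable[PinnedTrace],
    candidate_serves: Callable[[PinnedTrace], bool],
) -> tuple[bool, list[str]]:
    failures: list[str] = []
    for trace in traces:
        unpinned = missing_pins(trace)
        if unpinned:
            # Without these pins the run is a fresh sample, not a replay.
            failures.append(f"{trace.request_id}: unpinned {', '.join(unpinned)}")
            continue
        if candidate_serves(trace) and not trace.evidence_supports_claim:
            failures.append(f"{trace.request_id}: candidate reproduces unsupported serve")
    return (not failures, failures)


incident = PinnedTrace(
    request_id="req_88213",
    evidence_version="incident-runbooks@sha256:corpus-5",
    recorded_tool_outputs=("incident record", "runbook excerpt"),  # illustrative values
    temperature=0.0,
    seed=7,
    model_snapshot="answer-model@sha256:3b07",
    evidence_supports_claim=False,
)
library = (incident,)

strict = replay_library_gate(library, lambda trace: trace.evidence_supports_claim)
permissive = replay_library_gate(library, lambda trace: True)
unpinned = replay_library_gate((replace(incident, seed=None),), lambda trace: False)

print(f"strict_gate_passes={strict[0]}")
print(f"permissive_gate_failures={permissive[1]}")
print(f"unpinned_trace_failures={unpinned[1]}")
```

Passing the candidate's behavior in as a callable keeps the check honest: a permissive candidate fails on the recorded incident, and a trace missing its seed is rejected before any comparison is made.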

Shadow evaluation must be read-only

Offline fixtures can't cover every real request shape. A shadow sends a sanitized copy of a production request to the candidate while the stable release alone supplies the user-visible answer. It can reveal latency or evidence-support problems without exposing the candidate's text to customers.

The safety rule is easy to miss: a shadow isn't permitted to execute tools, send messages, change incident state, or write production state. Its output is evaluation data only. The envelope below also redacts an incident identifier before queuing the shadow request.

shadow-envelope.py

import re

@dataclass(frozen=True)
class ShadowEnvelope:
    candidate_release: str
    sanitized_text: str
    side_effects_enabled: bool
    response_visible_to_user: bool

def make_shadow(text: str) -> ShadowEnvelope:
    sanitized = re.sub(r"INC-\d+", "[INCIDENT_ID]", text)
    return ShadowEnvelope(
        candidate_release=registry.resolve("canary"),
        sanitized_text=sanitized,
        side_effects_enabled=False,
        response_visible_to_user=False,
    )

shadow = make_shadow("Can INC-48192 roll back based on runbook RB-7?")

print(f"shadow_text={shadow.sanitized_text}")
print(f"candidate={shadow.candidate_release}")
print(f"side_effects_enabled={shadow.side_effects_enabled}")
print(f"response_visible={shadow.response_visible_to_user}")

Output

shadow_text=Can [INCIDENT_ID] roll back based on runbook RB-7?
candidate=incident-evidence-answerer@sha256:fa60321b1dce
side_effects_enabled=False
response_visible=False

The one-pattern redaction is only a teaching fixture. A production shadow path needs schema-aware data minimization that covers every customer identifier and secret before the request reaches a queue, log, or candidate service.

In a deployed service, enqueue this envelope to a bounded worker queue and count dropped or failed comparisons. Don't start an untracked background task in request scope and assume its evaluation record will survive process restarts.
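A minimal sketch of the bounded-queue half of that advice, using the standard library's `queue.Queue`. The names `shadow_queue`, `shadow_stats`, and `enqueue_shadow` are hypothetical, and the tiny `maxsize` exists only to force drops in the demo. An in-memory queue does not survive a restart, so a deployed path would still need a durable broker or store behind it.

```python
import queue
from collections import Counter

shadow_queue: "queue.Queue[dict]" = queue.Queue(maxsize=2)  # small bound for the demo
shadow_stats: Counter = Counter()


def enqueue_shadow(envelope: dict) -> bool:
    """Never block the user-facing request; count what could not be queued."""
    try:
        shadow_queue.put_nowait(envelope)
    except queue.Full:
        shadow_stats["dropped"] += 1
        return False
    shadow_stats["enqueued"] += 1
    return True


for index in range(5):
    enqueue_shadow({"request": f"req-{index}"})

offered = shadow_stats["enqueued"] + shadow_stats["dropped"]
drop_rate = shadow_stats["dropped"] / offered

print(f"enqueued={shadow_stats['enqueued']}")
print(f"dropped={shadow_stats['dropped']}")
print(f"shadow_drop_rate={drop_rate:.2f}")
```

Counting drops explicitly is what lets a later gate tell "the candidate looked fine" apart from "most comparisons never ran".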

The boolean in this teaching envelope is metadata, not an authorization boundary. Its worker still needs a read-only tool allowlist and credentials that can't write production state. Reject a shadow request if it asks for a side effect.
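The allowlist half of that boundary might look like the following sketch. The tool names `lookup_incident` and `fetch_runbook` are taken from the replay trace earlier in the lesson and are assumed to be read-only; `update_incident_status`, `ShadowSideEffectError`, `authorize_shadow_tool`, and `check_shadow_plan` are hypothetical. Credentials that can't write production state remain a separate control that code like this doesn't replace.

```python
READ_ONLY_TOOLS = frozenset({"lookup_incident", "fetch_runbook"})  # assumed allowlist


class ShadowSideEffectError(PermissionError):
    """Raised when a shadow request asks for a tool outside the read-only allowlist."""


def authorize_shadow_tool(tool_name: str) -> str:
    if tool_name not in READ_ONLY_TOOLS:
        raise ShadowSideEffectError(f"shadow may not call {tool_name}")
    return tool_name


def check_shadow_plan(tool_names: list[str]) -> bool:
    """Return True only if every requested tool is allowed in shadow mode."""
    try:
        for name in tool_names:
            authorize_shadow_tool(name)
    except ShadowSideEffectError:
        return False
    return True


print(f"read_only_plan_allowed={check_shadow_plan(['lookup_incident', 'fetch_runbook'])}")
print(f"write_plan_allowed={check_shadow_plan(['lookup_incident', 'update_incident_status'])}")
```

An allowlist fails closed: a tool added later is rejected in shadow mode until someone decides it is safe.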

Canary traffic is visible and sticky

Shadow results can justify limited exposure, not automatic promotion. A canary sends a small share of real conversations to the candidate and returns those candidate responses to users. For conversational systems, one thread must remain on one bundle throughout the rollout. Otherwise a user can receive conflicting answers from stable and candidate releases in adjacent turns.

Use deterministic hashing of a conversation ID to choose new conversations reproducibly across workers. Python's built-in hash() is intentionally process-dependent, so it's the wrong bucketing function for that job. Hashing alone isn't enough, though: when canary traffic widens from 1% to 10%, the higher threshold could move an existing conversation from stable to candidate. Persist the first resolved release ID for the conversation lifetime.

sticky-canary-routing.py

def bucket(conversation_id: str) -> int:
    digest = hashlib.sha256(conversation_id.encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % 100

conversation_assignments: dict[str, str] = {}
aborted_releases: set[str] = set()

def assigned_release(conversation_id: str, canary_percent: int) -> str:
    existing = conversation_assignments.get(conversation_id)
    if existing is not None and existing not in aborted_releases:
        return existing
    alias = "canary" if bucket(conversation_id) < canary_percent else "production"
    bundle_id = registry.resolve(alias)
    if bundle_id in aborted_releases:
        bundle_id = registry.resolve("production")
    conversation_assignments[conversation_id] = bundle_id
    return bundle_id

canary_thread = next(
    f"thread-{index}" for index in range(1000) if bucket(f"thread-{index}") < 10
)
assignments = [assigned_release(canary_thread, canary_percent=10) for _ in range(3)]
stable_thread = next(
    f"thread-{index}" for index in range(1000) if bucket(f"thread-{index}") >= 10
)
stable_before_widen = assigned_release(stable_thread, canary_percent=10)
stable_after_widen = assigned_release(stable_thread, canary_percent=100)

print(f"canary_thread={canary_thread}")
print(f"bucket={bucket(canary_thread)}")
print(f"same_release_each_turn={len(set(assignments)) == 1}")
print(f"assigned_to_candidate={assignments[0] == candidate_id}")
print(f"existing_stable_thread_pinned_after_widen={stable_before_widen == stable_after_widen == stable_id}")

Output

canary_thread=thread-6
bucket=6
same_release_each_turn=True
assigned_to_candidate=True
existing_stable_thread_pinned_after_widen=True

The dictionary is a teaching fixture. A real router persists assignments in conversation state or a rollout store, writes the first assignment atomically so concurrent opening turns can't disagree, expires it when the conversation ends, and records the resolved release ID in traces. Stickiness is normal-routing behavior, not permission to keep serving a failed candidate: an abort must override it.
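The atomic first write can be sketched as an insert-if-absent operation. `AssignmentStore` and `assign_once` are hypothetical names, and the lock stands in for whatever conditional write the real rollout store provides; the release names are placeholders.

```python
import threading


class AssignmentStore:
    """Hypothetical in-memory stand-in for a rollout store with insert-if-absent writes."""

    def __init__(self) -> None:
        self._assignments: dict[str, str] = {}
        self._lock = threading.Lock()

    def assign_once(self, conversation_id: str, proposed_release: str) -> str:
        # The first writer wins; later callers receive the stored release, not their proposal.
        with self._lock:
            return self._assignments.setdefault(conversation_id, proposed_release)


store = AssignmentStore()
results: list[str] = []


def opening_turn(proposed_release: str) -> None:
    results.append(store.assign_once("thread-6", proposed_release))


# Concurrent opening turns propose different releases for the same conversation.
threads = [
    threading.Thread(target=opening_turn, args=(release,))
    for release in ["release-stable", "release-candidate"] * 4
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(f"turns_observed={len(results)}")
print(f"distinct_releases_seen={len(set(results))}")
```

Which proposal wins depends on thread scheduling, but every turn observes the same winner, and that agreement is the property the router needs.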

Figure: Progressive rollout trace. Unsupported serves stay at 0% in the 1% window, then appear at 10%. The zero-tolerance support gate aborts the canary, canary traffic drops to zero, and every conversation returns to stable.

A canary window answers a new question

Offline evidence proves behavior on frozen examples. A live window tests traffic mix, serving latency, error rate, and evidence failures under actual request volume. Define those thresholds and a minimum sample size before sending any candidate traffic. A rate computed from a handful of requests isn't enough evidence to widen exposure. For this incident assistant, support is a hard safety invariant: one live unsupported serve fails the window rather than being absorbed into an error budget.

live-canary-window.py

@dataclass(frozen=True)
class LiveWindow:
    name: str
    request_count: int
    p95_latency_ms: int
    error_rate: float
    unsupported_serve_rate: float
    shadow_drop_rate: float

def live_gate(window: LiveWindow) -> Decision:
    if window.request_count < 1_000:
        return Decision(False, "fewer than 1000 requests observed")
    if window.p95_latency_ms > 550:
        return Decision(False, "live latency exceeded 550 ms")
    if window.error_rate > 0.01:
        return Decision(False, "error rate exceeded 1%")
    if window.unsupported_serve_rate > 0.0:
        return Decision(False, "unsupported serve rate must remain zero")
    if window.shadow_drop_rate > 0.02:
        return Decision(False, "shadow telemetry incomplete")
    return Decision(True, "live window passed")

window_1_percent = LiveWindow("1%", 1_200, 481, 0.002, 0.000, 0.001)
window_10_percent = LiveWindow("10%", 8_000, 493, 0.003, 0.014, 0.001)

print(f"one_percent={live_gate(window_1_percent)}")
print(f"ten_percent={live_gate(window_10_percent)}")

Output

one_percent=Decision(allowed=True, reason='live window passed')
ten_percent=Decision(allowed=False, reason='unsupported serve rate must remain zero')

Both windows exceed the 1,000-request minimum, so the metric verdicts are meaningful under this teaching contract. A low-volume window would remain blocked even if every observed rate happened to be zero.
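To see why a zero rate from a handful of requests is weak evidence, a standard binomial bound helps. Assuming requests are independent (real traffic is only roughly so), observing zero failures in n requests bounds the true failure rate at 1 - (1 - confidence)^(1/n), which is close to 3/n at 95% confidence. The function name is illustrative, and this is arithmetic context for the 1,000-request minimum, not part of the lesson's gate.

```python
def zero_failure_upper_bound(request_count: int, confidence: float = 0.95) -> float:
    """Upper bound on the true failure rate after observing zero failures.

    Assumes independent requests with a constant failure probability.
    """
    if request_count <= 0:
        raise ValueError("request_count must be positive")
    return 1.0 - (1.0 - confidence) ** (1.0 / request_count)


small_window = zero_failure_upper_bound(20)
teaching_window = zero_failure_upper_bound(1_200)

print(f"bound_after_20_clean_requests={small_window:.3f}")      # roughly 0.139
print(f"bound_after_1200_clean_requests={teaching_window:.4f}")  # roughly 0.0025
```

Twenty clean requests are still consistent with a failure rate near 14%, while 1,200 clean requests narrow that to about a quarter of a percent.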

Abort a canary; roll back a promotion

These actions sound similar but refer to different alias states:

If the candidate fails while only the canary alias receives traffic, abort the rollout. production never moved.
If a candidate was already promoted and later fails, roll back by repointing production to the retained stable release.

Keeping the distinction explicit prevents an incident report from claiming production was rolled back when the candidate was never production.

abort-and-rollback.py

canary_percent = 10
failed_window = live_gate(window_10_percent)
if not failed_window.allowed:
    aborted_releases.add(candidate_id)
    canary_percent = 0

print(f"canary_percent_after_abort={canary_percent}")
print(f"production_after_abort={registry.resolve('production')}")
print(f"pinned_canary_thread_restored_stable={assigned_release(canary_thread, canary_percent) == stable_id}")

# Separate rollback drill: a promoted candidate later regresses at wider traffic.
rollback_drill = deepcopy(registry)
previous_production = rollback_drill.resolve("production")
rollback_drill.move_alias("production", candidate_id)
post_promotion_incident = replace(window_10_percent, name="100%")

if not live_gate(post_promotion_incident).allowed:
    rollback_drill.move_alias("production", previous_production)

print(f"drill_production_after_rollback={rollback_drill.resolve('production')}")
print(f"drill_restored_stable={rollback_drill.resolve('production') == stable_id}")
print(f"actual_production_unchanged={registry.resolve('production') == stable_id}")

Output

canary_percent_after_abort=0
production_after_abort=incident-evidence-answerer@sha256:026746ee8fb8
pinned_canary_thread_restored_stable=True
drill_production_after_rollback=incident-evidence-answerer@sha256:026746ee8fb8
drill_restored_stable=True
actual_production_unchanged=True

Real progressive-delivery controllers encode the same operational mechanics. Argo Rollouts supports weighted canary steps and pauses; its background analysis can abort an unsuccessful rollout. With traffic routing, keeping the stable replica set available allows traffic to move back immediately on abort, at the cost of additional capacity during rollout. [8] (Argo Rollouts - Kubernetes Progressive Delivery Controller, https://argoproj.github.io/argo-rollouts/)

Record the decision a later engineer can reconstruct

A pipeline that moves aliases but doesn't persist why it moved them still creates mystery during an incident. Store the release IDs, evidence references, gate verdicts, rollout windows, actor or controller identity, decision timestamp, and final alias state as append-only events.

The candidate below is correctly rejected for production after its 10% canary window serves unsupported incident claims. The release isn't deleted. It remains available for diagnosis, while traffic stays with the known-good bundle.

release-decision-record.py

@dataclass(frozen=True)
class ReleaseEvent:
    stage: str
    release_id: str
    decision: str
    evidence: str

events = (
    ReleaseEvent("register", candidate_id, "RECORDED", "manifest_sha"),
    ReleaseEvent("offline_gate", candidate_id, "PASSED", candidate_offline.evaluation_report),
    ReleaseEvent("canary_1_percent", candidate_id, "PASSED", "live:window-001"),
    ReleaseEvent("canary_10_percent", candidate_id, "ABORTED", "live:window-010"),
    ReleaseEvent("production", stable_id, "UNCHANGED", "rollback:not-needed"),
)

for event in events:
    print(f"{event.stage}:{event.decision}:{event.evidence}")
print("release_decision=REJECT_CANDIDATE_AFTER_CANARY_REGRESSION")
print(f"active_production={registry.resolve('production')}")

Output

register:RECORDED:manifest_sha
offline_gate:PASSED:reports/candidate-suite-7-redacted.json
canary_1_percent:PASSED:live:window-001
canary_10_percent:ABORTED:live:window-010
production:UNCHANGED:rollback:not-needed
release_decision=REJECT_CANDIDATE_AFTER_CANARY_REGRESSION
active_production=incident-evidence-answerer@sha256:026746ee8fb8
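The teaching record keeps four fields per event. A fuller sketch might add the actor, timestamp, and resulting alias state, and store each event as one serialized line in a log that exposes no update or delete. `AuditEvent`, `AppendOnlyLog`, `utc_now`, and the `rollout-controller` actor are hypothetical; an in-memory list only imitates append-only storage, which in practice has to be enforced by the store itself.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEvent:
    stage: str
    release_id: str
    decision: str
    evidence: str
    actor: str          # person or controller that made the decision
    decided_at: str     # ISO 8601 timestamp in UTC
    alias_state: dict   # alias -> release ID after the decision


class AppendOnlyLog:
    """Hypothetical in-memory log; it exposes append and read, never update or delete."""

    def __init__(self) -> None:
        self._lines: list[str] = []

    def append(self, event: AuditEvent) -> int:
        self._lines.append(json.dumps(asdict(event), sort_keys=True))
        return len(self._lines) - 1  # sequence number of the new record

    def read(self) -> tuple[dict, ...]:
        return tuple(json.loads(line) for line in self._lines)


def utc_now() -> str:
    return datetime.now(timezone.utc).isoformat()


log = AppendOnlyLog()
sequence = log.append(
    AuditEvent(
        stage="canary_10_percent",
        release_id="incident-evidence-answerer@sha256:fa60321b1dce",
        decision="ABORTED",
        evidence="live:window-010",
        actor="rollout-controller",  # hypothetical controller identity
        decided_at=utc_now(),
        alias_state={"production": "incident-evidence-answerer@sha256:026746ee8fb8"},
    )
)
record = log.read()[sequence]

print(f"{record['stage']}:{record['decision']}:{record['actor']}")
print(f"production_after={record['alias_state']['production']}")
```

Serializing at append time means a later change to the in-memory event object can't rewrite what the log already recorded.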

Production mapping

The lab deliberately uses plain Python so the state transitions are visible. A deployed stack usually splits the same responsibilities:

Responsibility	Production form
Store immutable model or component version	Artifact store plus model registry
Store prompt, policy, corpus, tokenizer, schema, and evaluator pins	Release manifest in source control or deployment registry
Move candidate/production pointers	Registry aliases, deployment config, or feature flags limited to registered releases
Run offline evidence gates	CI job tied to exact manifest digest with a retained report artifact
Shift live traffic and pause on regressions	Progressive-delivery controller and metric analysis
Reconstruct impact	Request trace logs with resolved release ID and rollout event log

Feature flags remain useful, but their values must resolve to registered immutable release IDs. A flag that points at an arbitrary model name makes rapid changes easy and incident reconstruction impossible.
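That constraint is small enough to enforce at flag-resolution time. In this sketch, `REGISTERED_RELEASES` is a hard-coded stand-in for a lookup against the release registry, and `resolve_release_flag` is a hypothetical helper; the two IDs are the short display forms used in this lab.

```python
# Stand-in for a registry lookup; a real service would query the release registry.
REGISTERED_RELEASES = frozenset({
    "incident-evidence-answerer@sha256:026746ee8fb8",
    "incident-evidence-answerer@sha256:fa60321b1dce",
})


def resolve_release_flag(flag_value: str) -> str:
    """Accept a flag only if its value is a registered immutable release ID."""
    if flag_value not in REGISTERED_RELEASES:
        raise ValueError(f"flag does not name a registered release: {flag_value}")
    return flag_value


resolved = resolve_release_flag("incident-evidence-answerer@sha256:026746ee8fb8")

try:
    resolve_release_flag("answer-model-latest")  # arbitrary model name, not a release ID
    arbitrary_name_rejected = False
except ValueError:
    arbitrary_name_rejected = True

print(f"resolved={resolved}")
print(f"arbitrary_name_rejected={arbitrary_name_rejected}")
```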

Mastery check

Key concepts

Release bundle: all pinned components that determine served behavior and operational compatibility, plus the declared evaluation contract.
Release ID: a content-derived identity for one immutable bundle.
Alias: a movable traffic pointer such as production or canary.
Offline gate: comparable controlled evidence required before any user exposure.
Shadow: read-only candidate evaluation on sanitized production-shaped traffic.
Canary: limited user-visible candidate traffic with predefined live thresholds.
Abort versus rollback: stop a not-yet-promoted candidate versus restore production after promotion.

Practice tasks

Change corpus_version in ReleaseBundle. Show that a new runbook-corpus snapshot creates a different release ID even if weights and prompts remain fixed.
Add a gate that blocks candidates with a missing gate_training_run. Explain why this prevents a registered model from escaping experiment lineage.
Add a rollback_ready check that refuses canary exposure unless the stable serving target is healthy, warm, and compatible with the active request schema.
Write a release trace for a candidate that passes at 1% and 10%, is promoted, then rolls back after a latency incident at 100%.
Build a small library of recorded failed traces and require a candidate to pass each replay as part of the offline gate before shadow or canary exposure. Explain why pinning temperature, seed, and model snapshot is what makes the replay a replay.

Evaluation rubric

Foundational: Explains why weights alone don't identify served behavior and names the prompt, policy, corpus, tokenizer, runtime, schema, evaluation suite, and evaluator version as bundle dependencies.
Foundational: Distinguishes an immutable release ID from mutable production and canary aliases.
Intermediate: Implements controlled offline and live gates that compare the candidate against declared support, latency, telemetry, and minimum-volume requirements.
Intermediate: Explains why shadow execution is read-only and why canary routing must be sticky by conversation.
Advanced: Distinguishes a canary abort from a production rollback using alias state and retained evidence.
Advanced: Produces an append-only decision record from which an incident reviewer can reconstruct what was exposed and why traffic stayed or moved.

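Self-check questions

The model weights are unchanged, but a serving-image digest changes. Is that a new release?
An offline suite passes, then the 10% canary serves unsupported incident claims. Should production be marked rolled back?
Why is a floating hosted-model alias risky inside an otherwise immutable release bundle?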

Common pitfalls

Only weights are versioned

Symptom: A rollback restores model files, but answers still differ from the last known good release.
Cause: Prompt, policy, corpus, tokenizer, schema, or serving image continued to float.
Fix: Content-address the complete release bundle and log its resolved ID per request.
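One way to picture the per-request logging half of that fix is the sketch below. The RequestTrace fields, the to_log_line and requests_per_release helpers, and the request and thread IDs are all hypothetical; only the two release IDs are taken from the lab output.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RequestTrace:
    request_id: str
    conversation_id: str
    alias: str       # the movable pointer the router consulted
    release_id: str  # the immutable ID that alias resolved to at serve time


def to_log_line(trace: RequestTrace) -> str:
    """Serialize one trace as a JSON line with stable key order."""
    return json.dumps(asdict(trace), sort_keys=True)


def requests_per_release(log_lines: list[str]) -> Counter:
    """Count served requests by resolved release ID, not by alias name."""
    return Counter(json.loads(line)["release_id"] for line in log_lines)


STABLE = "incident-evidence-answerer@sha256:026746ee8fb8"
CANDIDATE = "incident-evidence-answerer@sha256:fa60321b1dce"

log_lines = [
    to_log_line(RequestTrace("req_a1", "thread-a", "canary", CANDIDATE)),
    to_log_line(RequestTrace("req_b1", "thread-b", "production", STABLE)),
    to_log_line(RequestTrace("req_a2", "thread-a", "canary", CANDIDATE)),
]
exposure = requests_per_release(log_lines)

print(f"candidate_requests={exposure[CANDIDATE]}")
print(f"stable_requests={exposure[STABLE]}")
```

Logging the resolved ID rather than only the alias matters because the alias may point somewhere else by the time an incident is reviewed.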

Offline evidence changes between candidates

Symptom: The candidate appears better, but its score came from a newer suite, a changed evaluator, or a missing report artifact.
Cause: Promotion compared different evidence contracts or discarded the evidence a reviewer needs to inspect.
Fix: Pin eval_suite and evaluator_version in the bundle, then retain the exact evaluation_report artifact used for promotion.

Shadow evaluation performs actions

Symptom: A candidate that users never saw still duplicates a workflow action or touches customer state.
Cause: Shadow execution reused the live tool path instead of a read-only evaluation path.
Fix: Disable side effects in the shadow envelope, sanitize payloads, and monitor dropped comparisons.
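A read-only evaluation path could look roughly like the sketch below. The Tool wrapper, the has_side_effects marker, run_in_shadow, and the page_on_call tool are assumptions made for illustration; the lab itself only records side_effects_enabled=False on the shadow envelope.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Tool:
    name: str
    run: Callable[[str], str]
    has_side_effects: bool


def run_in_shadow(tool: Tool, argument: str, skipped: list[str]) -> Optional[str]:
    """Run read-only tools; record, but never execute, side-effecting ones."""
    if tool.has_side_effects:
        skipped.append(tool.name)
        return None
    return tool.run(argument)


executed_actions: list[str] = []


def page_on_call(incident: str) -> str:
    # In the live path this would change external state.
    executed_actions.append(incident)
    return "paged"


fetch_runbook = Tool("fetch_runbook", lambda runbook_id: f"excerpt:{runbook_id}", False)
page = Tool("page_on_call", page_on_call, True)

skipped: list[str] = []
read_result = run_in_shadow(fetch_runbook, "RB-7", skipped)
write_result = run_in_shadow(page, "[INCIDENT_ID]", skipped)

print(f"read_result={read_result}")
print(f"write_result={write_result}")
print(f"skipped={skipped}")
print(f"executed_actions={executed_actions}")
```

Recording the skipped tool names keeps the comparison informative: a reviewer can see what the candidate would have attempted without any customer state being touched.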

Conversation routing flickers

Symptom: One customer thread switches between answers during canary rollout.
Cause: Request-level randomness, process-local hashing, or a widening threshold reassigns an existing thread.
Fix: Use deterministic bucketing for new conversations, then persist the first resolved release ID for the lifetime of each conversation.

Rollback target is cold

Symptom: The pointer moves back immediately, but recovery still produces timeouts.
Cause: The stable deployment was scaled down or unloaded too early.
Fix: Keep schema-compatible stable capacity ready until the candidate passes burn-in, then test rollback in a drill.
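The readiness condition in that fix can be expressed as a small precondition check. This is only a sketch under stated assumptions: the StableTarget fields, the min_warm_replicas default, and the rollback_ready signature are hypothetical, and a real check would read health and replica counts from the serving platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StableTarget:
    release_id: str
    healthy: bool
    warm_replicas: int
    input_schema: str


def rollback_ready(
    target: StableTarget, active_schema: str, min_warm_replicas: int = 1
) -> tuple[bool, str]:
    """Return (ready, reason); canary exposure should wait until ready is True."""
    if not target.healthy:
        return (False, "stable target unhealthy")
    if target.warm_replicas < min_warm_replicas:
        return (False, "stable target is cold")
    if target.input_schema != active_schema:
        return (False, "stable target schema incompatible")
    return (True, "rollback target ready")


warm = StableTarget(
    "incident-evidence-answerer@sha256:026746ee8fb8", True, 2, "incident-answer.v2"
)
cold = StableTarget(
    "incident-evidence-answerer@sha256:026746ee8fb8", True, 0, "incident-answer.v2"
)

print(f"warm={rollback_ready(warm, 'incident-answer.v2')}")
print(f"cold={rollback_ready(cold, 'incident-answer.v2')}")
```

Returning a reason alongside the boolean lets the rollout event log state why exposure was refused.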

Canary evidence is too sparse

Symptom: A tiny canary window reports zero errors and is promoted before it has seen enough traffic to reveal rare failures.
Cause: The live gate checked rates but didn't require a minimum request count or complete observation window.
Fix: Declare minimum volume and duration before exposure, then keep the rollout paused until both requirements and every metric threshold pass.
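A sketch of the completeness check that would run before any metric threshold is shown below. The 1,000-request minimum mirrors the lab's live gate; the 1,800-second minimum duration, the WindowRequirement type, and the PAUSE and EVALUATE labels are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowRequirement:
    min_requests: int
    min_seconds: int


def window_status(
    request_count: int, observed_seconds: int, requirement: WindowRequirement
) -> str:
    """PAUSE until both volume and duration are met; only then EVALUATE metrics."""
    if request_count < requirement.min_requests:
        return "PAUSE"
    if observed_seconds < requirement.min_seconds:
        return "PAUSE"
    return "EVALUATE"


# Declared before exposure. The duration value is a hypothetical example.
requirement = WindowRequirement(min_requests=1_000, min_seconds=1_800)

print(f"sparse={window_status(40, 3_600, requirement)}")
print(f"too_short={window_status(1_200, 300, requirement)}")
print(f"complete={window_status(1_200, 1_800, requirement)}")
```

An incomplete window is neither a pass nor a failure, so the sketch returns a separate paused state instead of folding it into the metric decision.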

A "replay" is actually a fresh run

Symptom: Replaying a failed production trace against the candidate gives a different answer each time, so it can't confirm a fix.
Cause: The prompt, retrieved evidence version, tool outputs, temperature, seed, or model snapshot weren't pinned, so the replay re-derived inputs instead of reusing them.
Fix: Record and feed back the exact inputs and non-deterministic settings, and keep a growing trace library the candidate must clear before promotion.
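One hedged way to check that a replay reuses its recorded inputs is to fingerprint them and compare before running. The ReplayInputs fields, the replay_fingerprint helper, and the sample prompt version, tool outputs, temperature, and seed are hypothetical; the corpus and model identifiers are borrowed from the lab bundle.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class ReplayInputs:
    prompt_version: str
    evidence_version: str
    tool_outputs: tuple[str, ...]
    temperature: float
    seed: int
    model_snapshot: str


def replay_fingerprint(inputs: ReplayInputs) -> str:
    """Hash every recorded input and sampling setting in canonical JSON form."""
    payload = json.dumps(asdict(inputs), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


recorded = ReplayInputs(
    prompt_version="incident-prompt-recorded",  # hypothetical label
    evidence_version="incident-runbooks@sha256:corpus-5",
    tool_outputs=("incident:[INCIDENT_ID]", "runbook:RB-7 excerpt"),
    temperature=0.0,
    seed=7,
    model_snapshot="answer-model@sha256:3b07",
)
same_inputs = replace(recorded)        # identical copy of the recorded inputs
fresh_run = replace(recorded, seed=8)  # one setting drifted

print(f"replay_matches={replay_fingerprint(same_inputs) == replay_fingerprint(recorded)}")
print(f"fresh_run_matches={replay_fingerprint(fresh_run) == replay_fingerprint(recorded)}")
```

A mismatch means the run is a fresh sample, so its result shouldn't be recorded as evidence that the original failure is fixed.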

Next Step

Continue to Semantic Caching & Cost Optimization

You now know that every response must resolve to an exact release bundle. Next you'll reuse prior responses safely by making cache scope and invalidation depend on the model, prompt, policy, corpus, and tenant context that produced them.

References

1. Hidden Technical Debt in Machine Learning Systems. Sculley et al., 2015.
2. Challenges in Deploying Machine Learning: a Survey of Case Studies. Paleyes, A., Urma, R. G., & Lawrence, N. D., 2022. ACM Computing Surveys.
3. Model Registry Workflows | MLflow AI Platform. MLflow, 2026.
4. Continuous Delivery for Machine Learning. Sato, D., Wider, A., & Windheuser, C., 2019.
5. Models | OpenAI API. OpenAI, 2026.
6. How Is ChatGPT's Behavior Changing over Time? Chen, L., Zaharia, M., & Zou, J., 2023.
7. Deprecations | OpenAI API. OpenAI, 2026.
8. Argo Rollouts - Kubernetes Progressive Delivery Controller. Argo Project, 2026.
