Turn an evaluated LLM change into an immutable release bundle, promote it through measured traffic, and roll back without losing lineage.
The previous lesson ended with a precision candidate blocked until a target-GPU profile could prove its speed and stability. Suppose that missing profile now passes. You still don't have something safe to deploy. A live answer can change because of weights, a support gate, a prompt, a tokenizer, a policy, a container image, or a schema.
For the delivery-support system you've been building, deployment means answering one exact question: which complete, evaluated release produced this response, and how quickly can traffic return to the last known good release?
The service now contains an answer model plus the carrier-evidence-classifier-v2 gate trained in the previous lesson. That gate blocks answers whose delivery claim isn't supported by a carrier scan. If you roll back only its weights but leave a newer prompt or policy active, you haven't restored previous behavior.
This is the release-bundle idea: store every input that can change visible output or operational safety in one immutable manifest. ML systems accumulate hidden dependencies between data, code, configuration, and serving infrastructure; deployment records need to make those dependencies reviewable.[1][2]
| Bundle field | Why it belongs in the release |
|---|---|
| Answer-model and evidence-gate identifiers | Determine model behavior |
| Training run and precision policy | Explain how the new gate was produced |
| Tokenizer and prompt version | Change the text the model sees |
| Policy version | Decides when evidence is sufficient to serve |
| Serving image and schema | Change runtime behavior and API compatibility |
| Evaluation-suite hash | Identifies the test set that justified promotion |
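Start with two full bundles. `stable` represents what's serving today. `candidate` swaps in the new evidence gate, along with the training run, precision policy, prompt version, and serving image that ship with it, now that the profiled BF16 training run has passed outside the previous lesson's blocked fixture.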
```python
from dataclasses import asdict, dataclass, replace
import hashlib
import json

@dataclass(frozen=True)
class ReleaseBundle:
    service: str
    answer_model: str
    evidence_gate: str
    gate_training_run: str
    precision_policy: str
    tokenizer: str
    prompt_version: str
    policy_version: str
    serving_image: str
    input_schema: str
    eval_suite: str

stable = ReleaseBundle(
    service="delivery-evidence-answerer",
    answer_model="answer-model@sha256:3b07",
    evidence_gate="carrier-evidence-classifier-v1@sha256:0f12",
    gate_training_run="run_fp32_baseline",
    precision_policy="fp32",
    tokenizer="delivery-tokenizer@sha256:91aa",
    prompt_version="[email protected]",
    policy_version="[email protected]",
    serving_image="registry.example/answerer@sha256:image-a",
    input_schema="delivery-answer.v2",
    eval_suite="delivery-grounding-suite@sha256:suite-7",
)

candidate = replace(
    stable,
    evidence_gate="carrier-evidence-classifier-v2@sha256:41d8",
    gate_training_run="run_bf16_profiled_target_gpu",
    precision_policy="bf16",
    prompt_version="[email protected]",
    serving_image="registry.example/answerer@sha256:image-b",
)

print(f"stable_gate={stable.evidence_gate}")
print(f"candidate_gate={candidate.evidence_gate}")
print(f"candidate_training_run={candidate.gate_training_run}")
print(f"candidate_precision={candidate.precision_policy}")
```

```text
stable_gate=carrier-evidence-classifier-v1@sha256:0f12
candidate_gate=carrier-evidence-classifier-v2@sha256:41d8
candidate_training_run=run_bf16_profiled_target_gpu
candidate_precision=bf16
```

A version name such as v2 is useful for people, but it doesn't prove contents stayed fixed. The next cell derives identity from the canonical manifest. Change a prompt or image digest and the release ID changes too.
```python
def release_id(bundle: ReleaseBundle) -> str:
    payload = json.dumps(asdict(bundle), sort_keys=True, separators=(",", ":"))
    # Short prefix keeps the teaching output readable. Retain the full digest in production.
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
    return f"{bundle.service}@sha256:{digest}"

stable_id = release_id(stable)
candidate_id = release_id(candidate)
prompt_patch_id = release_id(replace(candidate, prompt_version="[email protected]"))

print(f"stable={stable_id}")
print(f"candidate={candidate_id}")
print(f"prompt_patch={prompt_patch_id}")
print(f"prompt_patch_is_new_release={prompt_patch_id != candidate_id}")
```

```text
stable=delivery-evidence-answerer@sha256:df2d4fe7b0c5
candidate=delivery-evidence-answerer@sha256:d9c7ecff4a9c
prompt_patch=delivery-evidence-answerer@sha256:9a1e41a39815
prompt_patch_is_new_release=True
```

This lab abbreviates each SHA-256 digest to 12 hexadecimal characters so the state transitions stay readable. A production registry should retain the full digest as identity and use short prefixes only for display.
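As a minimal sketch of that split between identity and display, the block below keeps the full digest as the key and derives a short prefix only for printing. It also checks that `sort_keys=True` makes the digest independent of key order. The manifest fields and the `full_digest()` and `short()` helpers are illustrative assumptions, not part of the lab.

```python
import hashlib
import json

def full_digest(manifest: dict[str, str]) -> str:
    """Return the full SHA-256 hex digest of a canonically serialized manifest."""
    payload = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def short(digest: str, length: int = 12) -> str:
    """Display-only prefix; it should not be used as a lookup key."""
    return digest[:length]

# Hypothetical manifest fields, chosen only for illustration.
manifest_a = {"service": "demo-answerer", "prompt_version": "prompt-a", "policy_version": "policy-a"}
# Same content, different insertion order.
manifest_b = {"policy_version": "policy-a", "prompt_version": "prompt-a", "service": "demo-answerer"}
# One field changed.
manifest_c = {**manifest_a, "prompt_version": "prompt-b"}

digest_a = full_digest(manifest_a)
digest_b = full_digest(manifest_b)
digest_c = full_digest(manifest_c)

print(f"full_length={len(digest_a)}")
print(f"order_independent={digest_a == digest_b}")
print(f"content_change_detected={digest_a != digest_c}")
print(f"display={short(digest_a)}")
```

Storing all 64 hexadecimal characters keeps lookups unambiguous, while the prefix stays a convenience for logs and dashboards.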
A registry record stores an immutable bundle. An alias such as production or canary is a mutable pointer used by traffic. This separation is what makes rollback simple: preserve both bundles and move the pointer back.
MLflow's current Model Registry workflow provides version aliases and tags, and its documentation marks fixed Model Stages as deprecated. That distinction matters: a model version is history; an alias expresses the current deployment decision.[3]
The small registry below enforces that rule. register() keeps a deep copy of the bundle, and move_alias() only points at a registered ID.
```python
from copy import deepcopy

class ReleaseRegistry:
    def __init__(self) -> None:
        self._bundles: dict[str, ReleaseBundle] = {}
        self._aliases: dict[str, str] = {}

    def register(self, bundle: ReleaseBundle) -> str:
        bundle_id = release_id(bundle)
        existing = self._bundles.get(bundle_id)
        if existing is not None and existing != bundle:
            raise ValueError("release digest collision")
        self._bundles[bundle_id] = deepcopy(bundle)
        return bundle_id

    def move_alias(self, alias: str, bundle_id: str) -> None:
        if bundle_id not in self._bundles:
            raise KeyError(f"unregistered release: {bundle_id}")
        self._aliases[alias] = bundle_id

    def resolve(self, alias: str) -> str:
        return self._aliases[alias]

registry = ReleaseRegistry()
assert registry.register(stable) == stable_id
assert registry.register(candidate) == candidate_id
registry.move_alias("production", stable_id)

print(f"registered={len(registry._bundles)}")
print(f"production={registry.resolve('production')}")
print(f"candidate_registered={candidate_id in registry._bundles}")
```

```text
registered=2
production=delivery-evidence-answerer@sha256:df2d4fe7b0c5
candidate_registered=True
```

Registration isn't approval. Continuous delivery for machine learning adds evaluation and monitoring to ordinary build-and-deploy practices because a valid artifact can still produce unacceptable behavior.[4]
In this running system, don't introduce a generic "quality score" after spending several lessons defining a grounded support contract. The release gate should use the same measurements the incident, evaluation, and experiment lessons established:
- `supported_evidence_f1` measures whether supported delivery claims are served correctly.
- `unsupported_serve_count` must remain zero in the frozen high-risk suite.
- `p95_latency_ms` prevents a behaviorally acceptable gate from breaking the response budget.
```python
@dataclass(frozen=True)
class OfflineEvidence:
    release_id: str
    eval_suite: str
    input_schema: str
    supported_evidence_f1: float
    unsupported_serve_count: int
    p95_latency_ms: int

@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str

def offline_gate(bundle: ReleaseBundle, evidence: OfflineEvidence) -> Decision:
    if evidence.release_id != release_id(bundle):
        return Decision(False, "evidence belongs to another release")
    if evidence.eval_suite != bundle.eval_suite:
        return Decision(False, "evaluation suite changed")
    if evidence.input_schema != bundle.input_schema:
        return Decision(False, "schema mismatch")
    if evidence.supported_evidence_f1 < 0.92:
        return Decision(False, "supported_evidence_f1 below 0.92")
    if evidence.unsupported_serve_count != 0:
        return Decision(False, "unsupported answer was served")
    if evidence.p95_latency_ms > 500:
        return Decision(False, "p95 latency exceeds 500 ms")
    return Decision(True, "offline gate passed")

candidate_offline = OfflineEvidence(
    release_id=candidate_id,
    eval_suite=candidate.eval_suite,
    input_schema=candidate.input_schema,
    supported_evidence_f1=0.93,
    unsupported_serve_count=0,
    p95_latency_ms=472,
)
weaker_candidate = replace(candidate_offline, supported_evidence_f1=0.89)

print(f"candidate={offline_gate(candidate, candidate_offline)}")
print(f"weak_metric={offline_gate(candidate, weaker_candidate)}")
```

```text
candidate=Decision(allowed=True, reason='offline gate passed')
weak_metric=Decision(allowed=False, reason='supported_evidence_f1 below 0.92')
```

Passing the offline gate permits further evaluation; it doesn't immediately replace production. Open a canary alias while production still points to the known-good bundle.
```python
def open_canary(bundle: ReleaseBundle, evidence: OfflineEvidence) -> Decision:
    decision = offline_gate(bundle, evidence)
    if decision.allowed:
        registry.move_alias("canary", release_id(bundle))
    return decision

canary_decision = open_canary(candidate, candidate_offline)

print(f"canary_opened={canary_decision.allowed}")
print(f"canary={registry.resolve('canary')}")
print(f"production_still_stable={registry.resolve('production') == stable_id}")
```

```text
canary_opened=True
canary=delivery-evidence-answerer@sha256:d9c7ecff4a9c
production_still_stable=True
```

When your team owns weights, a digest can identify them directly. If a service calls a hosted model, the bundle must instead record the strongest fixed identifier that provider documents. For example, OpenAI model pages that expose snapshots describe snapshots as locking a particular model version for consistent behavior.[5]
This matters because a provider can change the model behind a name you thought was stable, shifting your outputs with no deploy on your side. A controlled study comparing the March and June 2023 releases of GPT-4 and GPT-3.5 reported large behavior swings on the same tasks between snapshots, so the "same" service was not always the same service.[6] Pinning a documented snapshot reduces that risk but doesn't remove it: snapshots get deprecated, and a golden-set monitor against production still earns its keep.[7]
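A golden-set monitor can be sketched as replaying a small set of fixed prompts and comparing the answers with stored expectations. Everything here is hypothetical: `call_model` stands in for whatever client the service uses, the two cases are invented, and exact-match comparison is a simplification that a real monitor would likely replace with the release gate's own metrics.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class GoldenCase:
    prompt: str
    expected: str

def drift_rate(cases: list[GoldenCase], call_model: Callable[[str], str]) -> float:
    """Fraction of golden cases whose answer no longer matches the stored expectation."""
    if not cases:
        raise ValueError("golden set is empty")
    mismatches = sum(1 for case in cases if call_model(case.prompt).strip() != case.expected)
    return mismatches / len(cases)

golden_set = [
    GoldenCase("Is scan S3 a delivery scan?", "yes"),
    GoldenCase("Is scan S1 a delivery scan?", "no"),
]

# Stand-in "models": fixed lookup tables, not real provider calls.
pinned_behavior = {"Is scan S3 a delivery scan?": "yes", "Is scan S1 a delivery scan?": "no"}
drifted_behavior = {**pinned_behavior, "Is scan S1 a delivery scan?": "yes"}

stable_rate = drift_rate(golden_set, pinned_behavior.__getitem__)
drifted_rate = drift_rate(golden_set, drifted_behavior.__getitem__)

print(f"stable_drift_rate={stable_rate}")    # 0.0
print(f"drifted_drift_rate={drifted_rate}")  # 0.5
```

Running such a check on a schedule against the production alias gives an early signal when behavior shifts without a deploy on your side.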
Don't generalize that guarantee to every provider or every alias. Verify the exact provider documentation, store the chosen model identifier in the bundle, monitor deprecation notices, and rerun release gates before changing it.
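One way to enforce that habit is a registration-time check that refuses floating identifiers. The naming conventions below are assumptions for illustration only: each provider documents its own snapshot format, so the patterns must come from that documentation, and `is_pinned()` is a hypothetical helper.

```python
import re

# Assumed conventions, for illustration only: a hosted snapshot ends in
# -YYYY-MM-DD, and a self-hosted artifact carries an @sha256:<hex> digest.
DATED_SNAPSHOT = re.compile(r".+-\d{4}-\d{2}-\d{2}$")
CONTENT_DIGEST = re.compile(r".+@sha256:[0-9a-f]+$")

def is_pinned(model_identifier: str) -> bool:
    """Return True when the identifier looks fixed rather than floating."""
    return bool(
        DATED_SNAPSHOT.match(model_identifier) or CONTENT_DIGEST.match(model_identifier)
    )

# Hypothetical identifiers.
checks = {
    "hosted-model-latest": is_pinned("hosted-model-latest"),
    "hosted-model-2025-01-15": is_pinned("hosted-model-2025-01-15"),
    "answer-model@sha256:3b07": is_pinned("answer-model@sha256:3b07"),
}
for identifier, pinned in checks.items():
    print(f"{identifier}: pinned={pinned}")
```

A pattern match shows only that the name looks pinned. Whether the provider actually holds that snapshot fixed is a documentation question the check cannot answer.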
Offline fixtures can't cover every real request shape. A shadow sends a sanitized copy of a production request to the candidate while the stable release alone supplies the user-visible answer. It can reveal latency or evidence-support problems without exposing the candidate's text to customers.
The critical safety rule is easy to miss: a shadow isn't permitted to execute tools, send messages, change orders, or write customer state. Its output is evaluation data only. The envelope below also redacts an order identifier before queuing the shadow request.
```python
import re

@dataclass(frozen=True)
class ShadowEnvelope:
    candidate_release: str
    sanitized_text: str
    side_effects_enabled: bool
    response_visible_to_user: bool

def make_shadow(text: str) -> ShadowEnvelope:
    sanitized = re.sub(r"ORD-\d+", "[ORDER_ID]", text)
    return ShadowEnvelope(
        candidate_release=registry.resolve("canary"),
        sanitized_text=sanitized,
        side_effects_enabled=False,
        response_visible_to_user=False,
    )

shadow = make_shadow("Will ORD-48192 arrive Friday based on scan S3?")

print(f"shadow_text={shadow.sanitized_text}")
print(f"candidate={shadow.candidate_release}")
print(f"side_effects_enabled={shadow.side_effects_enabled}")
print(f"response_visible={shadow.response_visible_to_user}")
```

```text
shadow_text=Will [ORDER_ID] arrive Friday based on scan S3?
candidate=delivery-evidence-answerer@sha256:d9c7ecff4a9c
side_effects_enabled=False
response_visible=False
```

In a deployed service, enqueue this envelope to a bounded worker queue and count dropped or failed comparisons. Don't start an untracked background task in request scope and assume its evaluation record will survive process restarts.
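A minimal sketch of such a bounded hand-off, using the standard library's `queue.Queue`. The `ShadowQueue` class and its counters are hypothetical, and the consuming worker is omitted, so the queue fills up and the drop path becomes visible.

```python
import queue

class ShadowQueue:
    """Bounded hand-off to a shadow worker that counts what it cannot accept."""

    def __init__(self, max_pending: int) -> None:
        self._pending: queue.Queue[str] = queue.Queue(maxsize=max_pending)
        self.accepted = 0
        self.dropped = 0

    def offer(self, sanitized_text: str) -> bool:
        """Enqueue without blocking the live request; record a drop when full."""
        try:
            self._pending.put_nowait(sanitized_text)
        except queue.Full:
            self.dropped += 1
            return False
        self.accepted += 1
        return True

    def drop_rate(self) -> float:
        offered = self.accepted + self.dropped
        return self.dropped / offered if offered else 0.0

shadow_queue = ShadowQueue(max_pending=2)
results = [shadow_queue.offer(f"request {index}") for index in range(5)]

print(f"accepted={shadow_queue.accepted}")        # 2
print(f"dropped={shadow_queue.dropped}")          # 3
print(f"drop_rate={shadow_queue.drop_rate()}")    # 0.6
```

Because `put_nowait` never blocks, a slow shadow worker costs comparisons rather than user-facing latency, and the counter makes that cost measurable.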
The boolean in this teaching envelope is metadata, not an authorization boundary. Its worker still needs a read-only tool allowlist and credentials that can't write production state. Reject a shadow request if it asks for a side effect.
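The allowlist half of that boundary might look like the sketch below. The tool names, `authorize_shadow_tool()`, and `SideEffectRejected` are hypothetical, and the check decides from the requested tool alone without consulting any flag on the envelope.

```python
# Hypothetical tool names; a real allowlist would come from the service's tool registry.
READ_ONLY_TOOLS = frozenset({"lookup_carrier_scan", "get_order_status"})

class SideEffectRejected(Exception):
    """Raised when a shadow request asks for a tool outside the read-only allowlist."""

def authorize_shadow_tool(tool_name: str) -> str:
    """Allow only read-only tools; the envelope's boolean is never consulted."""
    if tool_name not in READ_ONLY_TOOLS:
        raise SideEffectRejected(f"shadow may not call {tool_name}")
    return tool_name

def try_tool(tool_name: str) -> str:
    """Return a printable verdict for one requested tool."""
    try:
        return f"allowed:{authorize_shadow_tool(tool_name)}"
    except SideEffectRejected as error:
        return f"rejected:{error}"

read_result = try_tool("lookup_carrier_scan")
write_result = try_tool("refund_order")
print(read_result)
print(write_result)
```

An in-process allowlist is still only one layer. Credentials that cannot write production state remain the stronger control, because they hold even if this check is bypassed.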
Shadow results can justify limited exposure, not automatic promotion. A canary sends a small share of real conversations to the candidate and returns those candidate responses to users. For conversational systems, one thread must remain on one bundle throughout the rollout. Otherwise a user can receive conflicting answers from stable and candidate releases in adjacent turns.
Use deterministic hashing of a conversation ID to choose new conversations reproducibly across workers. Python's built-in hash() is intentionally process-dependent, so it's the wrong bucketing function for that job. Hashing alone isn't enough, though: when canary traffic widens from 1% to 10%, the higher threshold could move an existing conversation from stable to candidate. Persist the first resolved release ID for the conversation lifetime.
```python
def bucket(conversation_id: str) -> int:
    digest = hashlib.sha256(conversation_id.encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % 100

conversation_assignments: dict[str, str] = {}
aborted_releases: set[str] = set()

def assigned_release(conversation_id: str, canary_percent: int) -> str:
    existing = conversation_assignments.get(conversation_id)
    if existing is not None and existing not in aborted_releases:
        return existing
    alias = "canary" if bucket(conversation_id) < canary_percent else "production"
    bundle_id = registry.resolve(alias)
    if bundle_id in aborted_releases:
        bundle_id = registry.resolve("production")
    conversation_assignments[conversation_id] = bundle_id
    return bundle_id

canary_thread = next(
    f"thread-{index}" for index in range(1000) if bucket(f"thread-{index}") < 10
)
assignments = [assigned_release(canary_thread, canary_percent=10) for _ in range(3)]
stable_thread = next(
    f"thread-{index}" for index in range(1000) if bucket(f"thread-{index}") >= 10
)
stable_before_widen = assigned_release(stable_thread, canary_percent=10)
stable_after_widen = assigned_release(stable_thread, canary_percent=100)

print(f"canary_thread={canary_thread}")
print(f"bucket={bucket(canary_thread)}")
print(f"same_release_each_turn={len(set(assignments)) == 1}")
print(f"assigned_to_candidate={assignments[0] == candidate_id}")
print(f"existing_stable_thread_pinned_after_widen={stable_before_widen == stable_after_widen == stable_id}")
```

```text
canary_thread=thread-6
bucket=6
same_release_each_turn=True
assigned_to_candidate=True
existing_stable_thread_pinned_after_widen=True
```

The dictionary is a teaching fixture. A real router persists assignments in conversation state or a rollout store, expires them when each conversation ends, and records the resolved release ID in traces. Stickiness is normal-routing behavior, not permission to keep serving a failed candidate: an abort must override it.
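A persisted version of that pin could be sketched with SQLite, where `INSERT OR IGNORE` on a primary key gives first-write-wins semantics. The table layout, the release names, and the `pin_release()` and `end_conversation()` helpers are assumptions for illustration. An in-memory database stands in for a durable rollout store, and the abort override is left out to keep the sketch short.

```python
import sqlite3

# In-memory database for the sketch; a deployed router would use a durable store.
store = sqlite3.connect(":memory:")
store.execute(
    "CREATE TABLE assignments (conversation_id TEXT PRIMARY KEY, release_id TEXT NOT NULL)"
)

def pin_release(conversation_id: str, proposed_release: str) -> str:
    """Persist the first proposed release and return whatever is stored."""
    with store:  # commits on success, rolls back on error
        store.execute(
            "INSERT OR IGNORE INTO assignments (conversation_id, release_id) VALUES (?, ?)",
            (conversation_id, proposed_release),
        )
    row = store.execute(
        "SELECT release_id FROM assignments WHERE conversation_id = ?",
        (conversation_id,),
    ).fetchone()
    return row[0]

def end_conversation(conversation_id: str) -> None:
    """Expire the pin when the conversation ends."""
    with store:
        store.execute("DELETE FROM assignments WHERE conversation_id = ?", (conversation_id,))

first = pin_release("thread-6", "release-candidate")
second = pin_release("thread-6", "release-stable")  # later proposal is ignored
end_conversation("thread-6")
after_expiry = pin_release("thread-6", "release-stable")

print(f"first={first}")                # release-candidate
print(f"second={second}")              # release-candidate
print(f"after_expiry={after_expiry}")  # release-stable
```

Letting the database enforce the primary key means two workers racing on the same conversation still converge on one stored release.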
Offline evidence proves behavior on frozen examples. A live window tests traffic mix, serving latency, error rate, and evidence failures under actual request volume. Define those thresholds before sending any candidate traffic.
```python
@dataclass(frozen=True)
class LiveWindow:
    name: str
    p95_latency_ms: int
    error_rate: float
    unsupported_serve_rate: float
    shadow_drop_rate: float

def live_gate(window: LiveWindow) -> Decision:
    if window.p95_latency_ms > 550:
        return Decision(False, "live latency exceeded 550 ms")
    if window.error_rate > 0.01:
        return Decision(False, "error rate exceeded 1%")
    if window.unsupported_serve_rate > 0.005:
        return Decision(False, "unsupported serve rate exceeded 0.5%")
    if window.shadow_drop_rate > 0.02:
        return Decision(False, "shadow telemetry incomplete")
    return Decision(True, "live window passed")

window_1_percent = LiveWindow("1%", 481, 0.002, 0.000, 0.001)
window_10_percent = LiveWindow("10%", 493, 0.003, 0.014, 0.001)

print(f"one_percent={live_gate(window_1_percent)}")
print(f"ten_percent={live_gate(window_10_percent)}")
```

```text
one_percent=Decision(allowed=True, reason='live window passed')
ten_percent=Decision(allowed=False, reason='unsupported serve rate exceeded 0.5%')
```

These actions sound similar but refer to different alias states:
- Abort: while only the `canary` alias receives traffic, abort the rollout. `production` never moved.
- Rollback: after the candidate has been promoted, move `production` back to the retained stable release.

Keeping the distinction explicit prevents an incident report from claiming production was rolled back when the candidate was never production.
```python
canary_percent = 10
failed_window = live_gate(window_10_percent)
if not failed_window.allowed:
    aborted_releases.add(candidate_id)
    canary_percent = 0

print(f"canary_percent_after_abort={canary_percent}")
print(f"production_after_abort={registry.resolve('production')}")
print(f"pinned_canary_thread_restored_stable={assigned_release(canary_thread, canary_percent) == stable_id}")

# Separate rollback drill: a promoted candidate later regresses at wider traffic.
rollback_drill = deepcopy(registry)
previous_production = rollback_drill.resolve("production")
rollback_drill.move_alias("production", candidate_id)
post_promotion_incident = replace(window_10_percent, name="100%")

if not live_gate(post_promotion_incident).allowed:
    rollback_drill.move_alias("production", previous_production)

print(f"drill_production_after_rollback={rollback_drill.resolve('production')}")
print(f"drill_restored_stable={rollback_drill.resolve('production') == stable_id}")
print(f"actual_production_unchanged={registry.resolve('production') == stable_id}")
```

```text
canary_percent_after_abort=0
production_after_abort=delivery-evidence-answerer@sha256:df2d4fe7b0c5
pinned_canary_thread_restored_stable=True
drill_production_after_rollback=delivery-evidence-answerer@sha256:df2d4fe7b0c5
drill_restored_stable=True
actual_production_unchanged=True
```

Real progressive-delivery controllers encode the same operational mechanics. Argo Rollouts supports weighted canary steps and pauses; its background analysis can abort an unsuccessful rollout. With traffic routing, keeping the stable replica set available allows traffic to move back immediately on abort, at the cost of additional capacity during rollout.[8]
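The step-pause-analyze loop in that paragraph can be approximated in plain Python. This is a toy model under stated assumptions, not Argo Rollouts' API or configuration format: `run_rollout()`, the step weights, and the `analysis` callback are hypothetical names.

```python
from typing import Callable

def run_rollout(steps: list[int], analysis: Callable[[int], bool]) -> tuple[str, int, list[int]]:
    """Walk weighted canary steps; abort to 0% on the first failed analysis.

    Returns (status, final_canary_weight, weights_that_received_traffic).
    """
    visited: list[int] = []
    for weight in steps:
        visited.append(weight)      # traffic is shifted to this weight
        if not analysis(weight):    # analysis runs while the step is paused
            return ("ABORTED", 0, visited)
    return ("PROMOTED", 100, visited)

# Hypothetical analysis result: healthy at 1%, regresses from 10% upward.
def regresses_at_ten(weight: int) -> bool:
    return weight < 10

status, final_weight, visited = run_rollout([1, 10, 50, 100], regresses_at_ten)
healthy = run_rollout([1, 10, 50, 100], lambda weight: True)

print(f"status={status}")                    # ABORTED
print(f"final_canary_weight={final_weight}") # 0
print(f"weights_visited={visited}")          # [1, 10]
print(f"healthy_rollout={healthy[0]}")       # PROMOTED
```

The abort path returns the canary weight to zero and leaves the stable target untouched, which is why the stable capacity has to stay available for the whole rollout.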
A pipeline that moves aliases but doesn't persist why it moved them still creates mystery during an incident. Store the release IDs, evidence references, gate verdicts, rollout windows, and final alias state as append-only events.
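One way to picture an append-only store is a log that exposes only `append` and `read`. The sketch below is an assumption-laden illustration: `RolloutEvent` and `AppendOnlyLog` are hypothetical names, the release IDs are placeholders, and an in-memory list stands in for durable storage such as a JSON-lines file or a database table.

```python
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class RolloutEvent:
    stage: str
    release_id: str
    decision: str
    evidence: str

class AppendOnlyLog:
    """Exposes append and read only; there is no update or delete method."""

    def __init__(self) -> None:
        self._lines: list[str] = []

    def append(self, event: RolloutEvent) -> int:
        """Serialize the event as one JSON line and return its sequence number."""
        sequence = len(self._lines)
        self._lines.append(json.dumps({"seq": sequence, **asdict(event)}, sort_keys=True))
        return sequence

    def read(self) -> tuple[dict, ...]:
        """Return parsed copies so callers cannot mutate stored lines."""
        return tuple(json.loads(line) for line in self._lines)

log = AppendOnlyLog()
log.append(RolloutEvent("offline_gate", "release-b", "PASSED", "offline:suite-7"))
log.append(RolloutEvent("canary_10_percent", "release-b", "ABORTED", "live:window-010"))

history = log.read()
history[0]["decision"] = "TAMPERED"  # mutates the returned copy only

print(f"events={len(log.read())}")                        # 2
print(f"first_decision={log.read()[0]['decision']}")      # PASSED
print(f"last_stage={log.read()[-1]['stage']}")            # canary_10_percent
```

Python cannot stop a determined caller from reaching `_lines`, so in a deployed system the append-only guarantee would come from the storage layer's permissions rather than from this class.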
The candidate below is correctly rejected for production after its 10% canary window serves unsupported delivery claims. The release isn't deleted. It remains available for diagnosis, while traffic stays with the known-good bundle.
```python
@dataclass(frozen=True)
class ReleaseEvent:
    stage: str
    release_id: str
    decision: str
    evidence: str

events = (
    ReleaseEvent("register", candidate_id, "RECORDED", "manifest_sha"),
    ReleaseEvent("offline_gate", candidate_id, "PASSED", "offline:suite-7"),
    ReleaseEvent("canary_1_percent", candidate_id, "PASSED", "live:window-001"),
    ReleaseEvent("canary_10_percent", candidate_id, "ABORTED", "live:window-010"),
    ReleaseEvent("production", stable_id, "UNCHANGED", "rollback:not-needed"),
)

for event in events:
    print(f"{event.stage}:{event.decision}:{event.evidence}")
print("release_decision=REJECT_CANDIDATE_AFTER_CANARY_REGRESSION")
print(f"active_production={registry.resolve('production')}")
```

```text
register:RECORDED:manifest_sha
offline_gate:PASSED:offline:suite-7
canary_1_percent:PASSED:live:window-001
canary_10_percent:ABORTED:live:window-010
production:UNCHANGED:rollback:not-needed
release_decision=REJECT_CANDIDATE_AFTER_CANARY_REGRESSION
active_production=delivery-evidence-answerer@sha256:df2d4fe7b0c5
```

The lab deliberately uses plain Python so the state transitions are visible. A deployed stack usually splits the same responsibilities:
| Responsibility | Production form |
|---|---|
| Store immutable model or component version | Artifact store plus model registry |
| Store prompt, policy, tokenizer, and schema pins | Release manifest in source control or deployment registry |
| Move candidate/production pointers | Registry aliases, deployment config, or feature flags limited to registered releases |
| Run offline evidence gates | CI job tied to exact manifest digest |
| Shift live traffic and pause on regressions | Progressive-delivery controller and metric analysis |
| Reconstruct impact | Request trace logs with resolved release ID and rollout event log |
Feature flags remain useful, but their values must resolve to registered immutable release IDs. A flag that points at an arbitrary model name makes rapid changes easy and incident reconstruction impossible.
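A flag resolver that honors this constraint might look like the sketch below. The release IDs, `resolve_flag()`, and `safe_resolve()` are hypothetical, and a frozen set stands in for a lookup against the real registry.

```python
# Hypothetical IDs standing in for the registry's contents.
REGISTERED_RELEASES = frozenset({
    "delivery-evidence-answerer@sha256:aaaa",
    "delivery-evidence-answerer@sha256:bbbb",
})

def resolve_flag(flag_value: str) -> str:
    """Accept a flag value only when it names a registered immutable release."""
    if flag_value not in REGISTERED_RELEASES:
        raise KeyError(f"flag points at unregistered release: {flag_value}")
    return flag_value

def safe_resolve(flag_value: str, fallback: str) -> str:
    """Fall back to a known registered release instead of serving an arbitrary name."""
    try:
        return resolve_flag(flag_value)
    except KeyError:
        return resolve_flag(fallback)

good = safe_resolve(
    "delivery-evidence-answerer@sha256:bbbb",
    "delivery-evidence-answerer@sha256:aaaa",
)
bad = safe_resolve("some-model-latest", "delivery-evidence-answerer@sha256:aaaa")

print(f"good={good}")
print(f"bad_falls_back_to={bad}")
```

Validating the fallback through the same function means a misconfigured default fails loudly instead of silently serving an unregistered release.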
- … `production` or `canary`.
- Add `corpus_version` to `ReleaseBundle`. Show that a new policy-document snapshot creates a different release ID even if weights and prompts remain fixed.
- … `gate_training_run`. Explain why this prevents a registered model from escaping experiment lineage.
- Add a `rollback_ready` check that refuses canary exposure unless the stable serving target is healthy and warm.
- … `production` and `canary` aliases.

Symptom: A rollback restores model files, but answers still differ from the last known good release.
Cause: Prompt, policy, tokenizer, schema, or serving image continued to float.
Fix: Content-address the complete release bundle and log its resolved ID per request.
Symptom: Candidate appears better, but its score came from a newer or easier evaluation suite.
Cause: Promotion compared different evidence contracts.
Fix: Pin eval_suite in the bundle and reject evidence produced for another suite.
Symptom: A candidate that users never saw still duplicates a workflow action or touches customer state.
Cause: Shadow execution reused the live tool path instead of a read-only evaluation path.
Fix: Disable side effects in the shadow envelope, sanitize payloads, and monitor dropped comparisons.
Symptom: One customer thread switches between answers during canary rollout.
Cause: Request-level randomness, process-local hashing, or a widening threshold reassigns an existing thread.
Fix: Use deterministic bucketing for new conversations, then persist the first resolved release ID for each conversation lifetime.
Symptom: The pointer moves back immediately, but recovery still produces timeouts.
Cause: The stable deployment was scaled down or unloaded too early.
Fix: Keep stable capacity ready until the candidate passes burn-in, then test rollback in a drill.
[1] Hidden Technical Debt in Machine Learning Systems. Sculley et al. · 2015
[2] Challenges in Deploying Machine Learning: a Survey of Case Studies. Paleyes, A., Urma, R. G., & Lawrence, N. D. · 2022 · ACM Computing Surveys
[3] Model Registry Workflows | MLflow AI Platform. MLflow · 2026
[4] Continuous Delivery for Machine Learning. Sato, D., Wider, A., & Windheuser, C. · 2019
[5] Models | OpenAI API. OpenAI · 2026
[6] How Is ChatGPT's Behavior Changing over Time? Chen, L., Zaharia, M., & Zou, J. · 2023
[7] Deprecations | OpenAI API. OpenAI · 2026
[8] Argo Rollouts - Kubernetes Progressive Delivery Controller. Argo Project · 2026