Model Versioning & Deployment

Turn an evaluated LLM change into an immutable release bundle, promote it through measured traffic, and roll back without losing lineage.


The previous lesson ended with a precision candidate blocked until a target-GPU profile could prove its speed and stability. Suppose that missing profile now passes. You still don't have something safe to deploy. A live answer can change because of weights, a support gate, a prompt, a tokenizer, a policy, a container image, or a schema.

For the delivery-support system you've been building, deployment means answering one exact question: which complete, evaluated release produced this response, and how quickly can traffic return to the last known good release?

Figure: Release bundle history showing pinned gate, prompt, and runtime snapshots beside a production alias that moves for promotion or rollback.
A deployable release is a fixed bundle of behavior-producing components. Production changes by moving a traffic pointer to a different bundle, not by editing a deployed bundle in place.

A release is more than model weights

The service now contains an answer model plus the carrier-evidence-classifier-v2 gate trained in the previous lesson. That gate blocks answers whose delivery claim isn't supported by a carrier scan. If you roll back only its weights but leave a newer prompt or policy active, you haven't restored previous behavior.

This is the release-bundle idea: store every input that can change visible output or operational safety in one immutable manifest. ML systems accumulate hidden dependencies between data, code, configuration, and serving infrastructure; deployment records need to make those dependencies reviewable.[1][2]

| Bundle field | Why it belongs in the release |
| --- | --- |
| Answer-model and evidence-gate identifiers | Determine model behavior |
| Training run and precision policy | Explain how the new gate was produced |
| Tokenizer and prompt version | Change the text the model sees |
| Policy version | Decides when evidence is sufficient to serve |
| Serving image and schema | Change runtime behavior and API compatibility |
| Evaluation-suite hash | Identifies the test set that justified promotion |

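Start with two full bundles. stable represents what's serving today. candidate swaps in the new evidence gate, together with the training run, precision policy, prompt version, and serving image recorded for it, after the profiled BF16 training run has passed outside the previous lesson's blocked fixture.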

define-release-bundles.py
```python
from dataclasses import asdict, dataclass, replace
import hashlib
import json

@dataclass(frozen=True)
class ReleaseBundle:
    service: str
    answer_model: str
    evidence_gate: str
    gate_training_run: str
    precision_policy: str
    tokenizer: str
    prompt_version: str
    policy_version: str
    serving_image: str
    input_schema: str
    eval_suite: str

stable = ReleaseBundle(
    service="delivery-evidence-answerer",
    answer_model="answer-model@sha256:3b07",
    evidence_gate="carrier-evidence-classifier-v1@sha256:0f12",
    gate_training_run="run_fp32_baseline",
    precision_policy="fp32",
    tokenizer="delivery-tokenizer@sha256:91aa",
    prompt_version="[email protected]",
    policy_version="[email protected]",
    serving_image="registry.example/answerer@sha256:image-a",
    input_schema="delivery-answer.v2",
    eval_suite="delivery-grounding-suite@sha256:suite-7",
)

candidate = replace(
    stable,
    evidence_gate="carrier-evidence-classifier-v2@sha256:41d8",
    gate_training_run="run_bf16_profiled_target_gpu",
    precision_policy="bf16",
    prompt_version="[email protected]",
    serving_image="registry.example/answerer@sha256:image-b",
)

print(f"stable_gate={stable.evidence_gate}")
print(f"candidate_gate={candidate.evidence_gate}")
print(f"candidate_training_run={candidate.gate_training_run}")
print(f"candidate_precision={candidate.precision_policy}")
```

Output

```
stable_gate=carrier-evidence-classifier-v1@sha256:0f12
candidate_gate=carrier-evidence-classifier-v2@sha256:41d8
candidate_training_run=run_bf16_profiled_target_gpu
candidate_precision=bf16
```

A version name such as v2 is useful for people, but it doesn't prove contents stayed fixed. The next cell derives identity from the canonical manifest. Change a prompt or image digest and the release ID changes too.

content-address-the-release.py
```python
def release_id(bundle: ReleaseBundle) -> str:
    payload = json.dumps(asdict(bundle), sort_keys=True, separators=(",", ":"))
    # Short prefix keeps the teaching output readable. Retain the full digest in production.
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
    return f"{bundle.service}@sha256:{digest}"

stable_id = release_id(stable)
candidate_id = release_id(candidate)
prompt_patch_id = release_id(replace(candidate, prompt_version="[email protected]"))

print(f"stable={stable_id}")
print(f"candidate={candidate_id}")
print(f"prompt_patch={prompt_patch_id}")
print(f"prompt_patch_is_new_release={prompt_patch_id != candidate_id}")
```

Output

```
stable=delivery-evidence-answerer@sha256:df2d4fe7b0c5
candidate=delivery-evidence-answerer@sha256:d9c7ecff4a9c
prompt_patch=delivery-evidence-answerer@sha256:9a1e41a39815
prompt_patch_is_new_release=True
```

This lab abbreviates each SHA-256 digest to 12 hexadecimal characters so the state transitions stay readable. A production registry should retain the full digest as identity and use short prefixes only for display.
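One way to keep the full digest as identity while still showing short prefixes is sketched below. It is a minimal, self-contained illustration rather than part of the lab: full_release_digest, display_id, and resolve_prefix are hypothetical helper names, and the manifest contents are made up for the example.

```python
import hashlib
import json

def full_release_digest(manifest: dict[str, str]) -> str:
    """Return the full 64-character SHA-256 hex digest of a canonical manifest."""
    payload = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def display_id(service: str, digest: str, width: int = 12) -> str:
    """Shorten a digest for logs and dashboards only; never store this as identity."""
    return f"{service}@sha256:{digest[:width]}"

def resolve_prefix(prefix: str, known_digests: set[str]) -> str:
    """Expand a display prefix, refusing unknown or ambiguous prefixes."""
    matches = [digest for digest in known_digests if digest.startswith(prefix)]
    if len(matches) != 1:
        raise LookupError(f"prefix {prefix!r} matched {len(matches)} releases")
    return matches[0]

# Hypothetical manifest used only for this illustration.
manifest = {"service": "delivery-evidence-answerer", "prompt_version": "example-prompt-v1"}
digest = full_release_digest(manifest)
known = {digest}

print(len(digest))
print(resolve_prefix(digest[:12], known) == digest)
```

Refusing ambiguous prefixes is the design choice that matters here: a short prefix is convenient for people, but only the full digest should ever decide which bundle receives traffic.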

Artifacts stay fixed; aliases move

A registry record stores an immutable bundle. An alias such as production or canary is a mutable pointer used by traffic. This separation is what makes rollback simple: preserve both bundles and move the pointer back.

MLflow's current Model Registry workflow provides version aliases and tags, and its documentation marks fixed Model Stages as deprecated. That distinction matters: a model version is history; an alias expresses the current deployment decision.[3]

Figure: Immutable release bundles on left, deployment aliases on right, and promotion or rollback moving aliases without rewriting stored manifests.
Both stable and candidate releases remain available for audit. Promotion and rollback update named pointers while the stored manifests remain unchanged.

The small registry below enforces that rule. register() keeps a deep copy of the bundle, and move_alias() only points at a registered ID.

registry-and-aliases.py
```python
from copy import deepcopy

class ReleaseRegistry:
    def __init__(self) -> None:
        self._bundles: dict[str, ReleaseBundle] = {}
        self._aliases: dict[str, str] = {}

    def register(self, bundle: ReleaseBundle) -> str:
        bundle_id = release_id(bundle)
        existing = self._bundles.get(bundle_id)
        if existing is not None and existing != bundle:
            raise ValueError("release digest collision")
        self._bundles[bundle_id] = deepcopy(bundle)
        return bundle_id

    def move_alias(self, alias: str, bundle_id: str) -> None:
        if bundle_id not in self._bundles:
            raise KeyError(f"unregistered release: {bundle_id}")
        self._aliases[alias] = bundle_id

    def resolve(self, alias: str) -> str:
        return self._aliases[alias]

registry = ReleaseRegistry()
assert registry.register(stable) == stable_id
assert registry.register(candidate) == candidate_id
registry.move_alias("production", stable_id)

print(f"registered={len(registry._bundles)}")
print(f"production={registry.resolve('production')}")
print(f"candidate_registered={candidate_id in registry._bundles}")
```

Output

```
registered=2
production=delivery-evidence-answerer@sha256:df2d4fe7b0c5
candidate_registered=True
```

Promotion begins with controlled evidence

Registration isn't approval. Continuous delivery for machine learning adds evaluation and monitoring to ordinary build-and-deploy practices because a valid artifact can still produce unacceptable behavior.[4]

In this running system, don't introduce a generic "quality score" after spending several lessons defining a grounded support contract. The release gate should use the same measurements the incident, evaluation, and experiment lessons established:

  • supported_evidence_f1 measures whether supported delivery claims are served correctly.
  • unsupported_serve_count must remain zero in the frozen high-risk suite.
  • p95_latency_ms prevents a behaviorally acceptable gate from breaking the response budget.
  • Schema and evaluation-suite hashes prevent incomparable evidence from entering the decision.
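As a rough illustration of how two of these measurements could be computed from raw suite results, the sketch below uses the standard F1 formula and a nearest-rank percentile. It assumes the positive class is "claim supported by a carrier scan", which is an assumption for this example rather than the definition from earlier lessons; the counts and latencies are invented.

```python
def f1_from_counts(true_positive: int, false_positive: int, false_negative: int) -> float:
    """Standard F1: harmonic mean of precision and recall."""
    if true_positive == 0:
        return 0.0
    precision = true_positive / (true_positive + false_positive)
    recall = true_positive / (true_positive + false_negative)
    return 2 * precision * recall / (precision + recall)

def p95_nearest_rank(latencies_ms: list[int]) -> int:
    """Nearest-rank p95: smallest sample with at least 95% of samples at or below it."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.95 * n avoids floating-point rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]

# Hypothetical results from a frozen suite.
supported_evidence_f1 = f1_from_counts(true_positive=93, false_positive=7, false_negative=7)
p95_latency_ms = p95_nearest_rank(list(range(401, 501)))

print(round(supported_evidence_f1, 2))
print(p95_latency_ms)
```

Whatever definitions a team settles on, the point from the list above still holds: the release gate should reuse them unchanged so that evidence stays comparable between candidates.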

Figure: Release ladder from immutable delivery bundle through offline evidence and limited live exposure to promotion or rollback.
Build the immutable candidate once, then gather progressively more realistic evidence. A release doesn't earn live traffic until its controlled gate passes.

offline-promotion-gate.py
```python
@dataclass(frozen=True)
class OfflineEvidence:
    release_id: str
    eval_suite: str
    input_schema: str
    supported_evidence_f1: float
    unsupported_serve_count: int
    p95_latency_ms: int

@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str

def offline_gate(bundle: ReleaseBundle, evidence: OfflineEvidence) -> Decision:
    if evidence.release_id != release_id(bundle):
        return Decision(False, "evidence belongs to another release")
    if evidence.eval_suite != bundle.eval_suite:
        return Decision(False, "evaluation suite changed")
    if evidence.input_schema != bundle.input_schema:
        return Decision(False, "schema mismatch")
    if evidence.supported_evidence_f1 < 0.92:
        return Decision(False, "supported_evidence_f1 below 0.92")
    if evidence.unsupported_serve_count != 0:
        return Decision(False, "unsupported answer was served")
    if evidence.p95_latency_ms > 500:
        return Decision(False, "p95 latency exceeds 500 ms")
    return Decision(True, "offline gate passed")

candidate_offline = OfflineEvidence(
    release_id=candidate_id,
    eval_suite=candidate.eval_suite,
    input_schema=candidate.input_schema,
    supported_evidence_f1=0.93,
    unsupported_serve_count=0,
    p95_latency_ms=472,
)
weaker_candidate = replace(candidate_offline, supported_evidence_f1=0.89)

print(f"candidate={offline_gate(candidate, candidate_offline)}")
print(f"weak_metric={offline_gate(candidate, weaker_candidate)}")
```

Output

```
candidate=Decision(allowed=True, reason='offline gate passed')
weak_metric=Decision(allowed=False, reason='supported_evidence_f1 below 0.92')
```

Passing the offline gate permits further evaluation; it doesn't immediately replace production. Open a canary alias while production still points to the known-good bundle.

open-canary-only-after-gate.py
```python
def open_canary(bundle: ReleaseBundle, evidence: OfflineEvidence) -> Decision:
    decision = offline_gate(bundle, evidence)
    if decision.allowed:
        registry.move_alias("canary", release_id(bundle))
    return decision

canary_decision = open_canary(candidate, candidate_offline)

print(f"canary_opened={canary_decision.allowed}")
print(f"canary={registry.resolve('canary')}")
print(f"production_still_stable={registry.resolve('production') == stable_id}")
```

Output

```
canary_opened=True
canary=delivery-evidence-answerer@sha256:d9c7ecff4a9c
production_still_stable=True
```

Managed models need a documented pin

When your team owns weights, a digest can identify them directly. If a service calls a hosted model, the bundle must instead record the strongest fixed identifier that the provider documents. For example, OpenAI model pages that expose snapshots describe snapshots as locking a particular model version for consistent behavior.[5]

This matters because a provider can change the model behind a name you thought was stable, shifting your outputs with no deploy on your side. A controlled study comparing the March and June 2023 releases of GPT-4 and GPT-3.5 reported large behavior swings on the same tasks between snapshots, so the "same" service was not always the same service.[6] Pinning a documented snapshot reduces that risk but doesn't remove it: snapshots get deprecated, and a golden-set monitor against production still earns its keep.[7]

Don't generalize that guarantee to every provider or every alias. Verify the exact provider documentation, store the chosen model identifier in the bundle, monitor deprecation notices, and rerun release gates before changing it.
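A bundle field for a hosted model could be guarded by a check like the one sketched below. The rule it applies, "a pinned identifier ends in a YYYY-MM-DD date", is only a heuristic assumed for this example and has to be confirmed against each provider's documentation; HostedModelPin, is_dated_snapshot, and require_pin are hypothetical names, and the model identifiers are illustrative.

```python
import re
from dataclasses import dataclass

# Assumption for this sketch: the provider marks fixed snapshots with a trailing date.
SNAPSHOT_SUFFIX = re.compile(r"-\d{4}-\d{2}-\d{2}$")

@dataclass(frozen=True)
class HostedModelPin:
    provider: str
    model_id: str
    docs_checked_on: str  # date a person last verified the provider documentation

def is_dated_snapshot(model_id: str) -> bool:
    """Heuristic only: True when the identifier ends in a YYYY-MM-DD suffix."""
    return SNAPSHOT_SUFFIX.search(model_id) is not None

def require_pin(pin: HostedModelPin) -> HostedModelPin:
    """Refuse identifiers that look like floating aliases."""
    if not is_dated_snapshot(pin.model_id):
        raise ValueError(f"{pin.model_id!r} looks like a floating alias, not a snapshot")
    return pin

pinned = require_pin(HostedModelPin("openai", "gpt-4o-2024-08-06", "2026-01-15"))
print(pinned.model_id)

try:
    require_pin(HostedModelPin("openai", "gpt-4o", "2026-01-15"))
except ValueError as error:
    print(f"rejected: {error}")
```

Recording docs_checked_on next to the identifier keeps the human verification step visible in the manifest, so a reviewer can see when the pin was last confirmed.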

Shadow evaluation must be read-only

Offline fixtures can't cover every real request shape. A shadow sends a sanitized copy of a production request to the candidate while the stable release alone supplies the user-visible answer. It can reveal latency or evidence-support problems without exposing the candidate's text to customers.

The critical safety rule is easy to miss: a shadow isn't permitted to execute tools, send messages, change orders, or write customer state. Its output is evaluation data only. The envelope below also redacts an order identifier before queuing the shadow request.

shadow-envelope.py
```python
import re

@dataclass(frozen=True)
class ShadowEnvelope:
    candidate_release: str
    sanitized_text: str
    side_effects_enabled: bool
    response_visible_to_user: bool

def make_shadow(text: str) -> ShadowEnvelope:
    sanitized = re.sub(r"ORD-\d+", "[ORDER_ID]", text)
    return ShadowEnvelope(
        candidate_release=registry.resolve("canary"),
        sanitized_text=sanitized,
        side_effects_enabled=False,
        response_visible_to_user=False,
    )

shadow = make_shadow("Will ORD-48192 arrive Friday based on scan S3?")

print(f"shadow_text={shadow.sanitized_text}")
print(f"candidate={shadow.candidate_release}")
print(f"side_effects_enabled={shadow.side_effects_enabled}")
print(f"response_visible={shadow.response_visible_to_user}")
```

Output

```
shadow_text=Will [ORDER_ID] arrive Friday based on scan S3?
candidate=delivery-evidence-answerer@sha256:d9c7ecff4a9c
side_effects_enabled=False
response_visible=False
```

In a deployed service, enqueue this envelope to a bounded worker queue and count dropped or failed comparisons. Don't start an untracked background task in request scope and assume its evaluation record will survive process restarts.
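The bounded queue with drop accounting might look like the sketch below. It is a single-process stand-in built on the standard library's queue module, not a durable queue; BoundedShadowQueue and ShadowStats are hypothetical names for this example.

```python
import queue
from dataclasses import dataclass

@dataclass
class ShadowStats:
    enqueued: int = 0
    dropped: int = 0

class BoundedShadowQueue:
    """Accepts shadow requests up to a fixed backlog and counts what it drops."""

    def __init__(self, max_pending: int) -> None:
        self._queue: queue.Queue = queue.Queue(maxsize=max_pending)
        self.stats = ShadowStats()

    def offer(self, sanitized_text: str) -> bool:
        """Enqueue without blocking the request path; record a drop when full."""
        try:
            self._queue.put_nowait(sanitized_text)
        except queue.Full:
            self.stats.dropped += 1
            return False
        self.stats.enqueued += 1
        return True

    def drop_rate(self) -> float:
        total = self.stats.enqueued + self.stats.dropped
        return self.stats.dropped / total if total else 0.0

shadow_queue = BoundedShadowQueue(max_pending=2)
results = [shadow_queue.offer(f"request-{index}") for index in range(3)]

print(results)
print(shadow_queue.stats)
```

Using put_nowait keeps the user-facing request from waiting on shadow capacity, and the drop counter is what lets a later gate notice that shadow telemetry is incomplete.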

The boolean in this teaching envelope is metadata, not an authorization boundary. Its worker still needs a read-only tool allowlist and credentials that can't write production state. Reject a shadow request if it asks for a side effect.
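A read-only allowlist for the shadow worker could be sketched as follows. The tool names are invented for illustration, and this in-process check complements, rather than replaces, credentials that cannot write production state.

```python
from dataclasses import dataclass, field

# Hypothetical tool names; a real allowlist would list the service's actual read-only tools.
READ_ONLY_TOOLS = frozenset({"lookup_carrier_scan", "get_order_status"})

class ShadowSideEffectError(PermissionError):
    """Raised when a shadow request asks for a tool outside the read-only allowlist."""

@dataclass(frozen=True)
class ToolCall:
    name: str
    arguments: dict[str, str] = field(default_factory=dict)

def authorize_shadow_call(call: ToolCall) -> ToolCall:
    """Allow only explicitly listed read-only tools; everything else is rejected."""
    if call.name not in READ_ONLY_TOOLS:
        raise ShadowSideEffectError(f"shadow may not call {call.name!r}")
    return call

allowed = authorize_shadow_call(ToolCall("lookup_carrier_scan", {"scan_id": "S3"}))
print(allowed.name)

try:
    authorize_shadow_call(ToolCall("issue_refund", {"order": "[ORDER_ID]"}))
except ShadowSideEffectError as error:
    print(f"rejected: {error}")
```

An allowlist fails closed: a newly added write tool is blocked in shadow mode until someone deliberately marks it read-only, which a denylist would not guarantee.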

Canary traffic is visible and sticky

Shadow results can justify limited exposure, not automatic promotion. A canary sends a small share of real conversations to the candidate and returns those candidate responses to users. For conversational systems, one thread must remain on one bundle throughout the rollout. Otherwise a user can receive conflicting answers from stable and candidate releases in adjacent turns.

Use deterministic hashing of a conversation ID to choose new conversations reproducibly across workers. Python's built-in hash() is intentionally process-dependent, so it's the wrong bucketing function for that job. Hashing alone isn't enough, though: when canary traffic widens from 1% to 10%, the higher threshold could move an existing conversation from stable to candidate. Persist the first resolved release ID for the conversation lifetime.

sticky-canary-routing.py
```python
def bucket(conversation_id: str) -> int:
    digest = hashlib.sha256(conversation_id.encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % 100

conversation_assignments: dict[str, str] = {}
aborted_releases: set[str] = set()

def assigned_release(conversation_id: str, canary_percent: int) -> str:
    existing = conversation_assignments.get(conversation_id)
    if existing is not None and existing not in aborted_releases:
        return existing
    alias = "canary" if bucket(conversation_id) < canary_percent else "production"
    bundle_id = registry.resolve(alias)
    if bundle_id in aborted_releases:
        bundle_id = registry.resolve("production")
    conversation_assignments[conversation_id] = bundle_id
    return bundle_id

canary_thread = next(
    f"thread-{index}" for index in range(1000) if bucket(f"thread-{index}") < 10
)
assignments = [assigned_release(canary_thread, canary_percent=10) for _ in range(3)]
stable_thread = next(
    f"thread-{index}" for index in range(1000) if bucket(f"thread-{index}") >= 10
)
stable_before_widen = assigned_release(stable_thread, canary_percent=10)
stable_after_widen = assigned_release(stable_thread, canary_percent=100)

print(f"canary_thread={canary_thread}")
print(f"bucket={bucket(canary_thread)}")
print(f"same_release_each_turn={len(set(assignments)) == 1}")
print(f"assigned_to_candidate={assignments[0] == candidate_id}")
print(f"existing_stable_thread_pinned_after_widen={stable_before_widen == stable_after_widen == stable_id}")
```

Output

```
canary_thread=thread-6
bucket=6
same_release_each_turn=True
assigned_to_candidate=True
existing_stable_thread_pinned_after_widen=True
```

The dictionary is a teaching fixture. A real router persists assignments in conversation state or a rollout store, expires them when each conversation ends, and records the resolved release ID in traces. Stickiness is normal-routing behavior, not permission to keep serving a failed candidate: an abort must override it.
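An assignment store with expiry could be approximated as in the sketch below. It is still in-memory and stands in for durable conversation state; using a fixed time-to-live as a proxy for "the conversation ended" is an assumption of this example, and AssignmentStore is a hypothetical name.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass(frozen=True)
class Assignment:
    release_id: str
    expires_at: float

class AssignmentStore:
    """Pins the first resolved release per conversation until the entry expires."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._rows: dict[str, Assignment] = {}

    def get(self, conversation_id: str) -> Optional[str]:
        row = self._rows.get(conversation_id)
        if row is None:
            return None
        if row.expires_at <= self._clock():
            del self._rows[conversation_id]  # expired: the next turn is routed afresh
            return None
        return row.release_id

    def pin(self, conversation_id: str, release_id: str) -> str:
        """Return the existing pin if one is live; otherwise record this release."""
        existing = self.get(conversation_id)
        if existing is not None:
            return existing
        self._rows[conversation_id] = Assignment(release_id, self._clock() + self._ttl)
        return release_id

# A controllable clock makes expiry observable without sleeping.
now = [0.0]
store = AssignmentStore(ttl_seconds=60.0, clock=lambda: now[0])
first = store.pin("thread-1", "release-a")
second = store.pin("thread-1", "release-b")
now[0] = 61.0
after_expiry = store.pin("thread-1", "release-b")

print(first, second, after_expiry)
```

Injecting the clock is a small design choice that pays off in tests: expiry can be exercised deterministically instead of waiting on wall-clock time.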

Figure: Offline evidence gate feeding read-only shadow, sticky canary, production promotion, and rollback when support metrics regress.
Canary traffic is real exposure. Widen traffic only while monitored windows pass, and keep the stable release available for an immediate abort or rollback.

A canary window answers a new question

Offline evidence proves behavior on frozen examples. A live window tests traffic mix, serving latency, error rate, and evidence failures under actual request volume. Define those thresholds before sending any candidate traffic.

live-canary-window.py
```python
@dataclass(frozen=True)
class LiveWindow:
    name: str
    p95_latency_ms: int
    error_rate: float
    unsupported_serve_rate: float
    shadow_drop_rate: float

def live_gate(window: LiveWindow) -> Decision:
    if window.p95_latency_ms > 550:
        return Decision(False, "live latency exceeded 550 ms")
    if window.error_rate > 0.01:
        return Decision(False, "error rate exceeded 1%")
    if window.unsupported_serve_rate > 0.005:
        return Decision(False, "unsupported serve rate exceeded 0.5%")
    if window.shadow_drop_rate > 0.02:
        return Decision(False, "shadow telemetry incomplete")
    return Decision(True, "live window passed")

window_1_percent = LiveWindow("1%", 481, 0.002, 0.000, 0.001)
window_10_percent = LiveWindow("10%", 493, 0.003, 0.014, 0.001)

print(f"one_percent={live_gate(window_1_percent)}")
print(f"ten_percent={live_gate(window_10_percent)}")
```

Output

```
one_percent=Decision(allowed=True, reason='live window passed')
ten_percent=Decision(allowed=False, reason='unsupported serve rate exceeded 0.5%')
```

Abort a canary; roll back a promotion

These actions sound similar but refer to different alias states:

  • If the candidate fails while only the canary alias receives traffic, abort the rollout. production never moved.
  • If a candidate was already promoted and later fails, roll back by repointing production to the retained stable release.

Keeping the distinction explicit prevents an incident report from claiming production was rolled back when the candidate was never production.
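The distinction can be made mechanical by deriving the action name from alias state, as in this small sketch. incident_action is a hypothetical helper, and the release IDs are placeholders.

```python
def incident_action(failed_release: str, aliases: dict[str, str]) -> str:
    """Name the response to a failing release from where its traffic came from."""
    if aliases.get("production") == failed_release:
        return "ROLLBACK"  # production itself must be repointed
    if aliases.get("canary") == failed_release:
        return "ABORT"  # production never moved; only canary exposure stops
    return "NO_TRAFFIC"  # the release was registered but never served

during_canary = {"production": "release-stable", "canary": "release-candidate"}
after_promotion = {"production": "release-candidate"}

print(incident_action("release-candidate", during_canary))
print(incident_action("release-candidate", after_promotion))
```

Checking production first matters: once a candidate has been promoted, the incident is a rollback even if a stale canary alias still points at the same release.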

abort-and-rollback.py
```python
canary_percent = 10
failed_window = live_gate(window_10_percent)
if not failed_window.allowed:
    aborted_releases.add(candidate_id)
    canary_percent = 0

print(f"canary_percent_after_abort={canary_percent}")
print(f"production_after_abort={registry.resolve('production')}")
print(f"pinned_canary_thread_restored_stable={assigned_release(canary_thread, canary_percent) == stable_id}")

# Separate rollback drill: a promoted candidate later regresses at wider traffic.
rollback_drill = deepcopy(registry)
previous_production = rollback_drill.resolve("production")
rollback_drill.move_alias("production", candidate_id)
post_promotion_incident = replace(window_10_percent, name="100%")

if not live_gate(post_promotion_incident).allowed:
    rollback_drill.move_alias("production", previous_production)

print(f"drill_production_after_rollback={rollback_drill.resolve('production')}")
print(f"drill_restored_stable={rollback_drill.resolve('production') == stable_id}")
print(f"actual_production_unchanged={registry.resolve('production') == stable_id}")
```

Output

```
canary_percent_after_abort=0
production_after_abort=delivery-evidence-answerer@sha256:df2d4fe7b0c5
pinned_canary_thread_restored_stable=True
drill_production_after_rollback=delivery-evidence-answerer@sha256:df2d4fe7b0c5
drill_restored_stable=True
actual_production_unchanged=True
```

Real progressive-delivery controllers encode the same operational mechanics. Argo Rollouts supports weighted canary steps and pauses; its background analysis can abort an unsuccessful rollout. With traffic routing, keeping the stable replica set available allows traffic to move back immediately on abort, at the cost of additional capacity during rollout.[8]
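The stepped behavior described above can be imitated in plain Python to make the control flow concrete. This sketch is not the Argo Rollouts API or its configuration format; run_rollout and the window_passes callback are hypothetical names, and the pass rule is contrived so that the second step fails.

```python
from typing import Callable, Sequence

def run_rollout(
    steps: Sequence[int],
    window_passes: Callable[[int], bool],
) -> tuple[str, list[int]]:
    """Walk canary weights in order and stop at the first failed analysis window."""
    if not steps:
        raise ValueError("rollout needs at least one canary step")
    completed: list[int] = []
    for percent in steps:
        if not window_passes(percent):
            # Abort: traffic returns to stable; later steps never run.
            return ("ABORTED", completed)
        completed.append(percent)
    return ("READY_FOR_PROMOTION", completed)

# Contrived analysis result: every window below 10% passes, wider windows fail.
outcome, completed_steps = run_rollout([1, 10, 50], lambda percent: percent < 10)

print(outcome, completed_steps)
```

Returning the completed steps alongside the outcome gives the decision record something concrete to store about how far exposure actually went.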

Record the decision a later engineer can reconstruct

A pipeline that moves aliases but doesn't persist why it moved them still creates mystery during an incident. Store the release IDs, evidence references, gate verdicts, rollout windows, and final alias state as append-only events.

The candidate below is correctly rejected for production after its 10% canary window serves unsupported delivery claims. The release isn't deleted. It remains available for diagnosis, while traffic stays with the known-good bundle.

release-decision-record.py
```python
@dataclass(frozen=True)
class ReleaseEvent:
    stage: str
    release_id: str
    decision: str
    evidence: str

events = (
    ReleaseEvent("register", candidate_id, "RECORDED", "manifest_sha"),
    ReleaseEvent("offline_gate", candidate_id, "PASSED", "offline:suite-7"),
    ReleaseEvent("canary_1_percent", candidate_id, "PASSED", "live:window-001"),
    ReleaseEvent("canary_10_percent", candidate_id, "ABORTED", "live:window-010"),
    ReleaseEvent("production", stable_id, "UNCHANGED", "rollback:not-needed"),
)

for event in events:
    print(f"{event.stage}:{event.decision}:{event.evidence}")
print("release_decision=REJECT_CANDIDATE_AFTER_CANARY_REGRESSION")
print(f"active_production={registry.resolve('production')}")
```

Output

```
register:RECORDED:manifest_sha
offline_gate:PASSED:offline:suite-7
canary_1_percent:PASSED:live:window-001
canary_10_percent:ABORTED:live:window-010
production:UNCHANGED:rollback:not-needed
release_decision=REJECT_CANDIDATE_AFTER_CANARY_REGRESSION
active_production=delivery-evidence-answerer@sha256:df2d4fe7b0c5
```
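To persist events rather than only print them, one simple option is a JSON Lines file opened in append mode, sketched below. Append mode is a convention here, not an enforcement mechanism; a real system would likely rely on storage that rejects rewrites. append_event, read_events, and the file name are hypothetical.

```python
import json
import tempfile
from pathlib import Path

def append_event(log_path: Path, event: dict[str, str]) -> None:
    """Write one event as a JSON line at the end of the log."""
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(event, sort_keys=True) + "\n")

def read_events(log_path: Path) -> list[dict[str, str]]:
    """Replay the log in the order events were recorded."""
    with log_path.open("r", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]

with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "release-events.jsonl"
    append_event(log_path, {"stage": "offline_gate", "decision": "PASSED", "evidence": "offline:suite-7"})
    append_event(log_path, {"stage": "canary_10_percent", "decision": "ABORTED", "evidence": "live:window-010"})
    replayed = read_events(log_path)

for event in replayed:
    print(f"{event['stage']}:{event['decision']}:{event['evidence']}")
```

One event per line keeps the log easy to tail and replay during an incident, and corrections are recorded as new events instead of edits to old ones.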

Where this maps to production systems

The lab deliberately uses plain Python so the state transitions are visible. A deployed stack usually splits the same responsibilities:

| Responsibility | Production form |
| --- | --- |
| Store immutable model or component version | Artifact store plus model registry |
| Store prompt, policy, tokenizer, and schema pins | Release manifest in source control or deployment registry |
| Move candidate/production pointers | Registry aliases, deployment config, or feature flags limited to registered releases |
| Run offline evidence gates | CI job tied to exact manifest digest |
| Shift live traffic and pause on regressions | Progressive-delivery controller and metric analysis |
| Reconstruct impact | Request trace logs with resolved release ID and rollout event log |

Feature flags remain useful, but their values must resolve to registered immutable release IDs. A flag that points at an arbitrary model name makes rapid changes easy and incident reconstruction impossible.
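One way to enforce that rule is to validate at the moment a flag is written, as in the sketch below. set_release_flag and UnregisteredReleaseError are hypothetical names, and the plain dictionary stands in for whatever flag service a team actually uses.

```python
class UnregisteredReleaseError(ValueError):
    """Raised when a flag would point at something other than a registered release."""

def set_release_flag(
    flags: dict[str, str],
    flag_name: str,
    release_id: str,
    registered_ids: frozenset[str],
) -> None:
    """Update a flag only if its value is a registered immutable release ID."""
    if release_id not in registered_ids:
        raise UnregisteredReleaseError(f"{release_id!r} is not a registered release")
    flags[flag_name] = release_id

# Placeholder IDs for illustration.
registered = frozenset({"release-stable", "release-candidate"})
flags: dict[str, str] = {}
set_release_flag(flags, "answerer_release", "release-stable", registered)

try:
    set_release_flag(flags, "answerer_release", "some-model-name", registered)
except UnregisteredReleaseError as error:
    print(f"rejected: {error}")

print(flags)
```

Rejecting the write, instead of silently falling back at read time, leaves the flag store in a state that always maps to a known bundle.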

Mastery check

Key concepts

  • Release bundle: all pinned components that determine served behavior and operational compatibility.
  • Release ID: a content-derived identity for one immutable bundle.
  • Alias: a movable traffic pointer such as production or canary.
  • Offline gate: comparable controlled evidence required before any user exposure.
  • Shadow: read-only candidate evaluation on sanitized production-shaped traffic.
  • Canary: limited user-visible candidate traffic with predefined live thresholds.
  • Abort versus rollback: stop a not-yet-promoted candidate versus restore production after promotion.

Practice tasks

  1. Add corpus_version to ReleaseBundle. Show that a new policy-document snapshot creates a different release ID even if weights and prompts remain fixed.
  2. Add a gate that blocks candidates with a missing gate_training_run. Explain why this prevents a registered model from escaping experiment lineage.
  3. Add a rollback_ready check that refuses canary exposure unless the stable serving target is healthy and warm.
  4. Write a release trace for a candidate that passes at 1% and 10%, is promoted, then rolls back after a latency incident at 100%.

Evaluation rubric

  • Foundational: Explains why weights alone don't identify served behavior and names the prompt, policy, tokenizer, runtime, schema, and evidence suite as bundle dependencies.
  • Foundational: Distinguishes an immutable release ID from mutable production and canary aliases.
  • Intermediate: Implements a controlled offline gate that compares the candidate against the declared support and latency contract.
  • Intermediate: Explains why shadow execution is read-only and why canary routing must be sticky by conversation.
  • Advanced: Distinguishes a canary abort from a production rollback using alias state and retained evidence.
  • Advanced: Produces an append-only decision record from which an incident reviewer can reconstruct what was exposed and why traffic stayed or moved.

Common pitfalls

Only weights are versioned

Symptom: A rollback restores model files, but answers still differ from the last known good release.
Cause: Prompt, policy, tokenizer, schema, or serving image continued to float.
Fix: Content-address the complete release bundle and log its resolved ID per request.

Offline metrics change between candidates

Symptom: Candidate appears better, but its score came from a newer or easier evaluation suite.
Cause: Promotion compared different evidence contracts.
Fix: Pin eval_suite in the bundle and reject evidence produced for another suite.

Shadow evaluation performs actions

Symptom: A candidate that users never saw still duplicates a workflow action or touches customer state.
Cause: Shadow execution reused the live tool path instead of a read-only evaluation path.
Fix: Disable side effects in the shadow envelope, sanitize payloads, and monitor dropped comparisons.

Conversation routing flickers

Symptom: One customer thread switches between answers during canary rollout.
Cause: Request-level randomness, process-local hashing, or a widening threshold reassigns an existing thread.
Fix: Use deterministic bucketing for new conversations, then persist the first resolved release ID for each conversation lifetime.

Rollback target is cold

Symptom: The pointer moves back immediately, but recovery still produces timeouts.
Cause: The stable deployment was scaled down or unloaded too early.
Fix: Keep stable capacity ready until the candidate passes burn-in, then test rollback in a drill.

Next Step
Continue to Semantic Caching & Cost Optimization

You now know that every response must resolve to an exact release bundle. Next you will reuse prior responses safely by making cache scope and invalidation depend on the model, prompt, policy, corpus, and tenant context that produced them.

References

[1] Sculley et al. (2015). Hidden Technical Debt in Machine Learning Systems.
[2] Paleyes, A., Urma, R. G., & Lawrence, N. D. (2022). Challenges in Deploying Machine Learning: a Survey of Case Studies. ACM Computing Surveys.
[3] MLflow (2026). Model Registry Workflows | MLflow AI Platform.
[4] Sato, D., Wider, A., & Windheuser, C. (2019). Continuous Delivery for Machine Learning.
[5] OpenAI (2026). Models | OpenAI API.
[6] Chen, L., Zaharia, M., & Zou, J. (2023). How Is ChatGPT's Behavior Changing over Time?
[7] OpenAI (2026). Deprecations | OpenAI API.
[8] Argo Project (2026). Argo Rollouts - Kubernetes Progressive Delivery Controller.