
© 2026 LeetLLM. All rights reserved.


Experiment Tracking with MLflow and W&B

Turn a live LLM regression into a reproducible candidate decision by logging inputs, metrics, artifacts, and promotion evidence.


A live failure anchors this lesson: ci-answerer-v1.1-regression served two unsupported CI root-cause claims. Monitoring created a proposed fix, ci-answerer-v1.2-fix, but correctly left its status as BLOCKED_PENDING_EVALUATION.

Now you need to answer a different question: did the proposed fix earn a controlled next step? Experiment tracking records the hypothesis, frozen test inputs, changed artifact, measured outputs, and decision so that someone else can reproduce that answer.

For a large language model (LLM) application, an experiment isn't limited to training new weights. A run can evaluate a changed prompt template, evidence policy, retriever, tool schema, model, or fine-tuned checkpoint. This lesson uses a prompt-and-policy fix because it's the immediate continuation of the production incident.

Figure: Experiment tracking board where one shared evaluation contract fans into three runs: run 001 is blocked for unsafe serves, run 002 is blocked for unnecessary abstention, and run 003 is the only candidate eligible for canary review.
Only shared-contract runs are comparable. Run 002 fixes safety but still fails usefulness, so only run 003 reaches canary review.

From alert to experiment

The monitoring page established what happened in production. It didn't test any fix. Keep that boundary clear:

| System | Question it answers | Record here |
| --- | --- | --- |
| Observability | What went wrong on live requests? | req_205 served an unsupported root-cause claim |
| Experiment tracking | Does a controlled candidate fix the failure without new regressions? | three offline runs over frozen cases |
| Deployment control | May an approved artifact receive limited traffic? | canary alias only after review |

The objects in an experiment record are straightforward once they're tied to this failure.

| Object | Meaning | CI-answerer example |
| --- | --- | --- |
| Experiment | Related attempts to answer one question | ci-claim-gate-repair |
| Run | One execution against fixed inputs | evaluate ci-answerer-v1.2-fix-b |
| Parameter | Input chosen before evaluation | template id, policy id, evaluator version |
| Metric | Measured result | unsafe root-cause claims served, unnecessary abstentions |
| Artifact | Versioned output or evidence file | template bundle, redacted failure report |
| Tag | Searchable context | incident id, hypothesis, reviewer status |
| Registry decision | Which artifact may proceed | canary_candidate -> v1.2-fix-b |

MLflow Tracking records runs with parameters, metrics, tags, artifacts, source-code versions, and dataset inputs.[1] Weights & Biases (W&B) uses runs with configuration, metrics, and artifacts for the same evidence workflow.[2] Start with a local version of that workflow so the logic is visible without credentials or a hosted service.

Freeze the question before running candidates

Changing both the candidate and the test cases at the same time makes a win hard to interpret. Start from the handoff that monitoring produced, then create a small regression suite from known behavior.

The production exemplar is carried forward here without its raw log text. The experiment needs stable evidence categories and expected routes, not durable copies of log text.

incident-contract.py
```python
from dataclasses import dataclass, field
import hashlib
import json

@dataclass(frozen=True)
class IncidentHandoff:
    incident_policy_id: str
    failing_template_id: str
    exemplar_request_id: str
    evidence_version: str
    candidate_template_id: str
    required_metric: str

@dataclass(frozen=True)
class EvalCase:
    case_id: str
    slice_name: str
    evidence_state: str
    expected_route: str
    source: str

handoff = IncidentHandoff(
    incident_policy_id="ci-grounding-slo-v1",
    failing_template_id="ci-answerer-v1.1-regression",
    exemplar_request_id="req_205",
    evidence_version="ci-logs@v17",
    candidate_template_id="ci-answerer-v1.2-fix",
    required_metric="unsafe_claim_served == 0 on regression and holdout windows",
)

regression_cases = (
    EvalCase("req_205", "unsupported_root_cause", "no_source", "ABSTAIN", "incident"),
    EvalCase("reg_002", "confirmed_failure", "test_failure_log", "SERVE", "known_good"),
)

print(f"failing_template={handoff.failing_template_id}")
print(f"candidate={handoff.candidate_template_id}")
print(f"regression_cases={len(regression_cases)}")
print(f"incident_expected_route={regression_cases[0].expected_route}")
print(f"required_metric={handoff.required_metric}")
```
Output
```text
failing_template=ci-answerer-v1.1-regression
candidate=ci-answerer-v1.2-fix
regression_cases=2
incident_expected_route=ABSTAIN
required_metric=unsafe_claim_served == 0 on regression and holdout windows
```

The first case checks the incident: evidence can't support a root-cause claim, so the application must abstain. The second case prevents the blunt "fix" of always refusing to answer when trustworthy evidence is present.

Fingerprint the evaluation inputs

A run id isn't enough to reproduce a decision. If someone quietly changes a label, adds a case, or updates the evidence snapshot, the comparison is a new experiment condition.

A fingerprint is a deterministic checksum of the evaluation inputs. Equal fingerprints mean two runs were measured against the same frozen representation. They don't prove the labels are correct; review still has to establish that.

evaluation-fingerprint.py
```python
def suite_fingerprint(cases: tuple[EvalCase, ...]) -> str:
    normalized = [
        {
            "case_id": case.case_id,
            "evidence_state": case.evidence_state,
            "expected_route": case.expected_route,
            "slice_name": case.slice_name,
            "source": case.source,
        }
        for case in sorted(cases, key=lambda item: item.case_id)
    ]
    payload = json.dumps(normalized, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(payload).hexdigest()[:16]

regression_fingerprint = suite_fingerprint(regression_cases)
corrupted_cases = (
    regression_cases[0],
    EvalCase("reg_002", "confirmed_failure", "test_failure_log", "ABSTAIN", "known_good"),
)

print(f"regression_sha={regression_fingerprint}")
print(f"same_rows_same_sha={suite_fingerprint(tuple(reversed(regression_cases))) == regression_fingerprint}")
print(f"corrupted_label_detected={suite_fingerprint(corrupted_cases) != regression_fingerprint}")
```
Output
```text
regression_sha=2be42790a4f41355
same_rows_same_sha=True
corrupted_label_detected=True
```

The row order isn't meaningful, so the fingerprint normalizes order. The example deliberately corrupts the known-good test-failure-log label; that mistake must create a different checksum. In a real project, also record dataset version, evidence snapshot version, labeling policy, and the code commit that built the cases.
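
One way to fold those extra fields into the checksum is sketched below. It is a minimal sketch under stated assumptions, not the lesson's implementation: the `contract_fingerprint` helper, the metadata key names, and the `ci-regression@v1` and `labeling-policy-v3` values are hypothetical and exist only for illustration.

```python
import hashlib
import json


def contract_fingerprint(case_rows: list[dict[str, str]], metadata: dict[str, str]) -> str:
    """Hash the case rows together with the provenance that produced them."""
    payload = {
        # Sort rows so that input order never changes the checksum.
        "cases": sorted(case_rows, key=lambda row: row["case_id"]),
        "metadata": metadata,
    }
    encoded = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(encoded).hexdigest()[:16]


rows = [
    {"case_id": "req_205", "evidence_state": "no_source", "expected_route": "ABSTAIN"},
    {"case_id": "reg_002", "evidence_state": "test_failure_log", "expected_route": "SERVE"},
]
# Hypothetical provenance values, used only for illustration.
provenance = {
    "dataset_version": "ci-regression@v1",
    "evidence_version": "ci-logs@v17",
    "labeling_policy": "labeling-policy-v3",
    "code_commit": "fixture-commit-8b27f2",
}

baseline = contract_fingerprint(rows, provenance)
reordered = contract_fingerprint(list(reversed(rows)), provenance)
new_snapshot = contract_fingerprint(rows, {**provenance, "evidence_version": "ci-logs@v18"})

print(f"order_independent={baseline == reordered}")
print(f"snapshot_change_detected={baseline != new_snapshot}")
```

With this shape, a new evidence snapshot changes the checksum even when every case row is identical, so the drift shows up in the run parameters instead of in a reviewer's memory.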

A regression pass is necessary, not sufficient

Suppose the first fix rejects unsupported root-cause claims and serves explicit test failure logs. It passes the two known cases. That doesn't show how it behaves when a dependency resolver provides a version-conflict error, which is safe to quote but has a different evidence shape.

Add representative holdout cases before selecting a candidate. Keep them separate from the original incident suite so reviewers can tell whether a candidate repaired the known failure or generalized beyond it.

holdout-suite.py
```python
holdout_cases = (
    EvalCase("hold_001", "dependency_error", "dependency_error", "SERVE", "holdout"),
    EvalCase("hold_002", "missing_stacktrace", "no_source", "ABSTAIN", "holdout"),
)
all_cases = regression_cases + holdout_cases
holdout_fingerprint = suite_fingerprint(holdout_cases)
evaluation_fingerprint = suite_fingerprint(all_cases)

print(f"holdout_cases={len(holdout_cases)}")
print(f"holdout_sha={holdout_fingerprint}")
print(f"evaluation_sha={evaluation_fingerprint}")
print("guardrails=unsafe_claim_served==0, unnecessary_abstention==0")
```
Output
```text
holdout_cases=2
holdout_sha=d47cdbbfe1caae89
evaluation_sha=94dfc930d2829d2c
guardrails=unsafe_claim_served==0, unnecessary_abstention==0
```

Notice the second guardrail. An application that abstains on every answer is factually safe on this metric, but it isn't useful. unsafe_claim_served protects trust; unnecessary_abstention protects the product from an overly restrictive repair.
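
The always-abstain case is easy to check directly. The sketch below is self-contained and uses a simplified `Case` type and a `guardrail_counts` helper that are assumptions for illustration; only the case ids and expected routes come from the suites above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    case_id: str
    expected_route: str


def guardrail_counts(predicted: list[str], cases: list[Case]) -> tuple[int, int]:
    """Return (unsafe_claim_served, unnecessary_abstention)."""
    unsafe = sum(
        route == "SERVE" and case.expected_route == "ABSTAIN"
        for route, case in zip(predicted, cases)
    )
    unnecessary = sum(
        route == "ABSTAIN" and case.expected_route == "SERVE"
        for route, case in zip(predicted, cases)
    )
    return unsafe, unnecessary


cases = [
    Case("req_205", "ABSTAIN"),
    Case("reg_002", "SERVE"),
    Case("hold_001", "SERVE"),
    Case("hold_002", "ABSTAIN"),
]

# A degenerate "repair" that refuses every question.
always_abstain = ["ABSTAIN"] * len(cases)
unsafe, unnecessary = guardrail_counts(always_abstain, cases)

print(f"unsafe_claim_served={unsafe}")
print(f"unnecessary_abstention={unnecessary}")
print(f"passes_both_guardrails={unsafe == 0 and unnecessary == 0}")
```

The degenerate policy scores zero unsafe serves and two unnecessary abstentions, so it clears the safety guardrail alone but not the pair.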

Evaluate the behavior, not the proposed story

The next cell models three template versions. These functions stand in for deterministic offline evaluation results so you can focus on run comparison:

  • v1.1-regression serves every claimed root cause, including unsupported ones.
  • v1.2-fix repairs the incident but recognizes only explicit test failure logs.
  • v1.2-fix-b handles both test failure logs and dependency-error evidence.

The candidate implementation doesn't get to read expected_route. It derives its route from evidence_state alone, and the evaluator compares that output with the frozen expectation.

evaluate-templates.py
```python
@dataclass(frozen=True)
class Evaluation:
    template_id: str
    case_count: int
    unsafe_claim_served: int
    unnecessary_abstention: int
    exact_route_rate: float

def predict_route(template_id: str, case: EvalCase) -> str:
    if template_id == "ci-answerer-v1.1-regression":
        return "SERVE"
    if template_id == "ci-answerer-v1.2-fix":
        return "SERVE" if case.evidence_state == "test_failure_log" else "ABSTAIN"
    if template_id == "ci-answerer-v1.2-fix-b":
        supported = {"test_failure_log", "dependency_error"}
        return "SERVE" if case.evidence_state in supported else "ABSTAIN"
    raise ValueError(f"unknown template: {template_id}")

def evaluate(template_id: str, cases: tuple[EvalCase, ...]) -> Evaluation:
    predicted = [predict_route(template_id, case) for case in cases]
    unsafe = sum(
        route == "SERVE" and case.expected_route == "ABSTAIN"
        for route, case in zip(predicted, cases)
    )
    unnecessary = sum(
        route == "ABSTAIN" and case.expected_route == "SERVE"
        for route, case in zip(predicted, cases)
    )
    exact = sum(route == case.expected_route for route, case in zip(predicted, cases))
    return Evaluation(template_id, len(cases), unsafe, unnecessary, exact / len(cases))

known_fix = evaluate("ci-answerer-v1.2-fix", regression_cases)
print(f"candidate={known_fix.template_id}")
print(f"regression_unsafe={known_fix.unsafe_claim_served}")
print(f"regression_unnecessary_abstention={known_fix.unnecessary_abstention}")
print(f"regression_pass={known_fix.exact_route_rate == 1.0}")
print("decision=CONTINUE_TO_HOLDOUT")
```
Output
```text
candidate=ci-answerer-v1.2-fix
regression_unsafe=0
regression_unnecessary_abstention=0
regression_pass=True
decision=CONTINUE_TO_HOLDOUT
```

A clean regression result is progress, not approval. It means the candidate deserves evaluation on the holdout contract that was already declared.

compare-candidate-runs.py
```python
templates = (
    "ci-answerer-v1.1-regression",
    "ci-answerer-v1.2-fix",
    "ci-answerer-v1.2-fix-b",
)
evaluations = [evaluate(template_id, all_cases) for template_id in templates]

print("template unsafe unnecessary exact")
for result in evaluations:
    print(
        f"{result.template_id:<33} "
        f"{result.unsafe_claim_served:>6} "
        f"{result.unnecessary_abstention:>12} "
        f"{result.exact_route_rate:>6.0%}"
    )

first_fix = evaluations[1]
revised_fix = evaluations[2]
print(f"first_fix_blocked={first_fix.unnecessary_abstention > 0}")
print(f"revised_fix_passes_guardrails={revised_fix.unsafe_claim_served == 0 and revised_fix.unnecessary_abstention == 0}")
```
Output
```text
template unsafe unnecessary exact
ci-answerer-v1.1-regression            2            0    50%
ci-answerer-v1.2-fix                   0            1    75%
ci-answerer-v1.2-fix-b                 0            0   100%
first_fix_blocked=True
revised_fix_passes_guardrails=True
```

Tracking isn't a scoreboard for flattering metrics. The first fix satisfies the incident's safety condition but introduces a usefulness regression on the holdout suite: it abstains on hold_001, where dependency-error evidence supports an answer. Its run should remain visible and rejected, not silently replaced by the revised fix.

Store one evidence record per run

Now put the experiment structure around those results. A tracking service should let a reviewer answer:

  1. What changed?
  2. What exactly was evaluated?
  3. Which metrics and guardrails moved?
  4. Which evidence artifacts explain the decision?
  5. Which incident or hypothesis motivated the run?

The local tracker below uses the same conceptual buckets as MLflow and W&B. It deliberately logs stable identifiers and a redacted exemplar report, not raw logs.

local-run-records.py
```python
@dataclass
class RunRecord:
    run_id: str
    params: dict[str, str] = field(default_factory=dict)
    metrics: dict[str, float] = field(default_factory=dict)
    artifacts: list[str] = field(default_factory=list)
    tags: dict[str, str] = field(default_factory=dict)

class LocalTracker:
    def __init__(self) -> None:
        self.runs: list[RunRecord] = []

    def start_run(self, run_id: str) -> RunRecord:
        run = RunRecord(run_id)
        self.runs.append(run)
        return run

    def log_params(self, run: RunRecord, params: dict[str, str]) -> None:
        run.params.update(params)

    def log_metrics(self, run: RunRecord, metrics: dict[str, float]) -> None:
        run.metrics.update(metrics)

    def log_artifact(self, run: RunRecord, path: str) -> None:
        run.artifacts.append(path)

    def set_tags(self, run: RunRecord, tags: dict[str, str]) -> None:
        run.tags.update(tags)

evaluator_version = "claim-route-eval-v2"
tracker = LocalTracker()
for index, result in enumerate(evaluations, start=1):
    run = tracker.start_run(f"run_{index:03d}")
    tracker.log_params(run, {
        "template_id": result.template_id,
        "policy_id": handoff.incident_policy_id,
        "evaluator_version": evaluator_version,
        "regression_sha": regression_fingerprint,
        "holdout_sha": holdout_fingerprint,
        "evaluation_sha": evaluation_fingerprint,
        "evidence_version": handoff.evidence_version,
        "code_commit": "fixture-commit-8b27f2",
    })
    tracker.log_metrics(run, {
        "unsafe_claim_served": float(result.unsafe_claim_served),
        "unnecessary_abstention": float(result.unnecessary_abstention),
        "exact_route_rate": result.exact_route_rate,
    })
    tracker.log_artifact(run, f"reports/{result.template_id}-redacted-eval.json")
    tracker.log_artifact(run, f"templates/{result.template_id}.json")
    tracker.set_tags(run, {
        "incident_request_id": handoff.exemplar_request_id,
        "hypothesis": "restore evidence-gated CI root-cause claims",
        "raw_log_text_stored": "false",
    })

print(f"tracked_runs={len(tracker.runs)}")
print(f"shared_evaluation_sha={tracker.runs[0].params['evaluation_sha']}")
print(f"run_002_artifact={tracker.runs[1].artifacts[0]}")
print(f"run_002_template_artifact={tracker.runs[1].artifacts[1]}")
print(f"raw_log_text_stored={tracker.runs[2].tags['raw_log_text_stored']}")
```
Output
```text
tracked_runs=3
shared_evaluation_sha=94dfc930d2829d2c
run_002_artifact=reports/ci-answerer-v1.2-fix-redacted-eval.json
run_002_template_artifact=templates/ci-answerer-v1.2-fix.json
raw_log_text_stored=false
```

All three runs point to the same regression, holdout, and combined evaluation checksums plus policy, evidence snapshot, and evaluator version. Each run also preserves its redacted report and versioned template artifact. That makes their metric comparison meaningful. If a new test case or scoring rule is needed tomorrow, update the corresponding fingerprint or evaluator version and label subsequent runs as a new comparison set.

Before ranking runs, audit that they are genuinely comparable. A dashboard can display results with different evaluation inputs or scoring logic side by side, but the reviewer must refuse to call that a controlled comparison.

comparability-audit.py
```python
def comparable(runs: list[RunRecord]) -> bool:
    comparison_fields = (
        "regression_sha",
        "holdout_sha",
        "evaluation_sha",
        "policy_id",
        "evidence_version",
        "evaluator_version",
    )
    return all(
        len({run.params[field] for run in runs}) == 1
        for field in comparison_fields
    )

changed_suite_run = RunRecord(
    run_id="run_004",
    params={**tracker.runs[2].params, "holdout_sha": "different-holdout-sha"},
)
changed_evaluator_run = RunRecord(
    run_id="run_005",
    params={**tracker.runs[2].params, "evaluator_version": "claim-route-eval-v3"},
)

print(f"tracked_runs_comparable={comparable(tracker.runs)}")
print(f"changed_suite_comparable={comparable(tracker.runs + [changed_suite_run])}")
print(f"changed_evaluator_comparable={comparable(tracker.runs + [changed_evaluator_run])}")
print("review_rule=never_rank_metrics_until_inputs_and_evaluator_match")
```
Output
```text
tracked_runs_comparable=True
changed_suite_comparable=False
changed_evaluator_comparable=False
review_rule=never_rank_metrics_until_inputs_and_evaluator_match
```
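
A boolean audit says whether a set of runs is comparable, but not which runs belong together. A possible extension, sketched here with a hypothetical `comparison_sets` helper and plain dictionaries in place of run records, groups runs by their contract fields so that each group can be ranked on its own.

```python
from collections import defaultdict

CONTRACT_FIELDS = (
    "regression_sha",
    "holdout_sha",
    "evaluation_sha",
    "policy_id",
    "evidence_version",
    "evaluator_version",
)


def comparison_sets(run_params: dict[str, dict[str, str]]) -> dict[tuple[str, ...], list[str]]:
    """Group run ids by the tuple of contract values they were evaluated under."""
    groups: dict[tuple[str, ...], list[str]] = defaultdict(list)
    for run_id, params in run_params.items():
        key = tuple(params[name] for name in CONTRACT_FIELDS)
        groups[key].append(run_id)
    return dict(groups)


shared = {
    "regression_sha": "2be42790a4f41355",
    "holdout_sha": "d47cdbbfe1caae89",
    "evaluation_sha": "94dfc930d2829d2c",
    "policy_id": "ci-grounding-slo-v1",
    "evidence_version": "ci-logs@v17",
    "evaluator_version": "claim-route-eval-v2",
}
run_params = {
    "run_001": dict(shared),
    "run_002": dict(shared),
    "run_003": dict(shared),
    # Same suite, newer scoring logic: this run starts its own comparison set.
    "run_005": {**shared, "evaluator_version": "claim-route-eval-v3"},
}

for index, run_ids in enumerate(comparison_sets(run_params).values(), start=1):
    print(f"comparison_set_{index}={run_ids}")
```

Metrics are then ranked only inside one set. A run that lands alone in a set has nothing to be compared against yet.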

How this maps to MLflow and W&B

The local tracker isn't a substitute for a shared tracking service. It exposes the record design first. Once that design is correct, the platform mappings are small.

| Evidence in our run | MLflow Tracking | W&B |
| --- | --- | --- |
| Experiment/run boundary | mlflow.set_experiment(...), mlflow.start_run() | wandb.init(project=..., name=...) |
| Candidate and policy configuration | mlflow.log_params(...) | run.config |
| Guardrail results | mlflow.log_metrics(...) | run.log(...) |
| Incident and decision context | mlflow.set_tags(...) | run tags / config fields |
| Redacted report or template bundle | mlflow.log_artifact(...) | run.log_artifact(...) |
| Evaluation input lineage | mlflow.log_input(...) with dataset metadata | run.use_artifact(...) for a versioned input artifact |

Those APIs are documented by their respective tracking, artifact, and run-tag guides.[1][2][3][4] A real integration would require installing the chosen SDK and configuring storage and authentication, so these snippets are intentionally not executed here:

mlflow-shape.py
```python
import mlflow

mlflow.set_experiment("ci-claim-gate-repair")
with mlflow.start_run(run_name="ci-answerer-v1.2-fix-b"):
    mlflow.log_params({"template_id": "...", "evaluation_sha": "..."})
    mlflow.log_metrics({"unsafe_claim_served": 0, "unnecessary_abstention": 0})
    mlflow.set_tags({"incident_request_id": "req_205", "decision": "canary_review"})
    mlflow.log_artifact("templates/ci-answerer-v1.2-fix-b.json")
    mlflow.log_artifact("reports/redacted-eval.json")
```
wandb-shape.py
```python
import wandb

candidate_bundle = wandb.Artifact("v1.2-fix-b-bundle", type="candidate")
candidate_bundle.add_file("templates/ci-answerer-v1.2-fix-b.json")
candidate_bundle.add_file("reports/redacted-eval.json")

with wandb.init(
    project="ci-claim-gate-repair",
    name="ci-answerer-v1.2-fix-b",
    config={"template_id": "...", "evaluation_sha": "..."},
    tags=["incident:req_205", "canary_review"],
) as run:
    run.log({"unsafe_claim_served": 0, "unnecessary_abstention": 0})
    run.log_artifact(candidate_bundle)
```

Choose a platform because it fits your team's storage, access, dashboards, and deployment workflow. Don't confuse tool selection with experimental rigor. Either platform can preserve an under-specified or misleading run if you don't define inputs and guardrails first.
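
One way to keep an under-specified run out of either platform is a pre-flight check that runs before any SDK call. The sketch below assumes a hypothetical `missing_fields` helper and one team's choice of required fields; neither MLflow nor W&B enforces such a list on its own.

```python
# Assumed minimum contract for a comparable run; adjust to your own review rules.
REQUIRED_PARAMS = {"template_id", "evaluation_sha", "evaluator_version", "evidence_version"}
REQUIRED_METRICS = {"unsafe_claim_served", "unnecessary_abstention"}


def missing_fields(params: dict[str, str], metrics: dict[str, float]) -> list[str]:
    """List every required parameter or metric that the run does not carry."""
    missing = [f"param:{name}" for name in sorted(REQUIRED_PARAMS - set(params))]
    missing += [f"metric:{name}" for name in sorted(REQUIRED_METRICS - set(metrics))]
    return missing


under_specified = missing_fields(
    params={"template_id": "ci-answerer-v1.2-fix-b"},
    metrics={"unsafe_claim_served": 0.0},
)
complete = missing_fields(
    params={
        "template_id": "ci-answerer-v1.2-fix-b",
        "evaluation_sha": "94dfc930d2829d2c",
        "evaluator_version": "claim-route-eval-v2",
        "evidence_version": "ci-logs@v17",
    },
    metrics={"unsafe_claim_served": 0.0, "unnecessary_abstention": 0.0},
)

print(f"under_specified_missing={under_specified}")
print(f"complete_missing={complete}")
```

A logging wrapper could refuse to open a run while the returned list is non-empty, which keeps the rigor in the record design and out of the tool choice.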

Figure: Forked experiment review contract where one incident trace and one pinned evaluation suite feed two repair runs; run 002 stays recorded as rejected for unnecessary abstention, while run 003 keeps rollback target and canary-only scope because it passes both review gates.
The selected branch doesn't erase the failed repair. Both outcomes stay attached to the same contract, while only run 003 carries the rollback target and canary scope.

Don't delete rejected experiments. Their failure reason is part of the evidence. A future reviewer should see that the first repair eliminated unsafe claims but failed a usefulness guardrail.

rejection-reasons.py
```python
def review_status(run: RunRecord) -> str:
    if run.metrics["unsafe_claim_served"] > 0:
        return "REJECT_UNSAFE_SERVE"
    if run.metrics["unnecessary_abstention"] > 0:
        return "REJECT_UNNECESSARY_ABSTENTION"
    return "ELIGIBLE_FOR_CANARY_REVIEW"

for run in tracker.runs:
    print(f"{run.params['template_id']}={review_status(run)}")
```
Output
```text
ci-answerer-v1.1-regression=REJECT_UNSAFE_SERVE
ci-answerer-v1.2-fix=REJECT_UNNECESSARY_ABSTENTION
ci-answerer-v1.2-fix-b=ELIGIBLE_FOR_CANARY_REVIEW
```
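
Keeping rejected runs visible can be as simple as an append-only decision log. The following self-contained sketch assumes an in-memory list of JSON lines named `decision_log`; a shared tracker would store the same status as a run tag. The metric values repeat the three-run comparison above.

```python
import json


def review_status(metrics: dict[str, float]) -> str:
    """Mirror the lesson's guardrail order: safety first, then usefulness."""
    if metrics["unsafe_claim_served"] > 0:
        return "REJECT_UNSAFE_SERVE"
    if metrics["unnecessary_abstention"] > 0:
        return "REJECT_UNNECESSARY_ABSTENTION"
    return "ELIGIBLE_FOR_CANARY_REVIEW"


runs = {
    "run_001": {"unsafe_claim_served": 2.0, "unnecessary_abstention": 0.0},
    "run_002": {"unsafe_claim_served": 0.0, "unnecessary_abstention": 1.0},
    "run_003": {"unsafe_claim_served": 0.0, "unnecessary_abstention": 0.0},
}

# Append one line per reviewed run; nothing is overwritten or dropped.
decision_log: list[str] = []
for run_id, metrics in runs.items():
    entry = {"run_id": run_id, "review_status": review_status(metrics), "metrics": metrics}
    decision_log.append(json.dumps(entry, sort_keys=True))

# A later reviewer can still query why each rejected run was rejected.
entries = [json.loads(line) for line in decision_log]
rejected = [entry for entry in entries if entry["review_status"].startswith("REJECT")]
for entry in rejected:
    print(f"{entry['run_id']}={entry['review_status']}")
```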

Turn passing metrics into a limited decision

Even a passing offline run isn't authority to deploy broadly. This lab's cases are synthetic and tiny. A correct decision says exactly what the evidence supports: the revised candidate may enter canary review, with a rollback pointer and continued monitoring of the same safety invariant.

For this prompt-only change, the registered object in your application's artifact registry is a template artifact. Don't pretend that every LLM application change produces a new model. If a later experiment fine-tunes model weights, MLflow Model Registry can link the registered model version to its source run and use aliases to identify a deployment target.[5]

promotion-record.py
```python
@dataclass(frozen=True)
class PromotionDecision:
    artifact_name: str
    alias: str
    source_run_id: str
    evaluation_sha: str
    rollback_artifact: str
    status: str
    limitation: str

def passes_contract(run: RunRecord) -> bool:
    return (
        run.metrics["unsafe_claim_served"] == 0
        and run.metrics["unnecessary_abstention"] == 0
        and run.params["evaluation_sha"] == evaluation_fingerprint
        and run.params["policy_id"] == handoff.incident_policy_id
        and run.params["evidence_version"] == handoff.evidence_version
        and run.params["evaluator_version"] == evaluator_version
        and run.tags["incident_request_id"] == handoff.exemplar_request_id
    )

def only_eligible_run(runs: list[RunRecord]) -> RunRecord:
    eligible = [run for run in runs if passes_contract(run)]
    if len(eligible) != 1:
        raise ValueError(f"expected one eligible run, found {len(eligible)}")
    return eligible[0]

selected = only_eligible_run(tracker.runs)
decision = PromotionDecision(
    artifact_name=selected.params["template_id"],
    alias="canary_candidate",
    source_run_id=selected.run_id,
    evaluation_sha=selected.params["evaluation_sha"],
    rollback_artifact="ci-answerer-v1",
    status="APPROVED_FOR_CANARY_IN_SYNTHETIC_LAB",
    limitation="requires production review and monitored canary",
)

print(f"selected={decision.artifact_name}")
print(f"source_run={decision.source_run_id}")
print(f"alias={decision.alias}")
print(f"status={decision.status}")
print(f"rollback={decision.rollback_artifact}")
print(f"limitation={decision.limitation}")
```
Output
```text
selected=ci-answerer-v1.2-fix-b
source_run=run_003
alias=canary_candidate
status=APPROVED_FOR_CANARY_IN_SYNTHETIC_LAB
rollback=ci-answerer-v1
limitation=requires production review and monitored canary
```

The fixture has one eligible run. A real review must resolve ties explicitly rather than silently selecting the first passing candidate. The artifact, source run, fixed suite, rollback target, and limitation now travel together. If canary observability later reports another unsupported root-cause claim, the investigator can trace back to this exact decision rather than guessing which "fix" shipped.
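
Explicit tie resolution might look like the sketch below. It assumes a recorded reviewer choice is the tie-breaker; the `Candidate` type, the `reviewer_choice` flag, and the second passing run (`run_006`, `ci-answerer-v1.2-fix-c`) are hypothetical and appear nowhere in the fixture.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Candidate:
    run_id: str
    template_id: str
    reviewer_choice: bool = False  # set only by a recorded human decision


def select_candidate(eligible: list[Candidate]) -> Candidate:
    """Return the single promotable run, refusing to break ties silently."""
    if not eligible:
        raise ValueError("no eligible run")
    if len(eligible) == 1:
        return eligible[0]
    chosen = [candidate for candidate in eligible if candidate.reviewer_choice]
    if len(chosen) != 1:
        raise ValueError(
            f"{len(eligible)} eligible runs need exactly one reviewer choice, found {len(chosen)}"
        )
    return chosen[0]


tied_a = Candidate("run_003", "ci-answerer-v1.2-fix-b")
tied_b = Candidate("run_006", "ci-answerer-v1.2-fix-c")  # hypothetical second passing run

try:
    select_candidate([tied_a, tied_b])
except ValueError as error:
    print(f"unresolved_tie={error}")

resolved = select_candidate([replace(tied_a, reviewer_choice=True), tied_b])
print(f"resolved_tie={resolved.run_id}")
```

List position never decides the outcome here: an unresolved tie raises, and a resolved tie is traceable to the flag a reviewer set.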

Promotion evidence should also define the stop action before canary traffic begins. This canary check doesn't claim the revised candidate fails. It makes the rollback rule executable for any future unsafe serve.

canary-rollback-rule.py
```python
def canary_action(unsafe_claim_served: int, rollback_artifact: str) -> str:
    if unsafe_claim_served > 0:
        return f"ROLL_BACK_TO:{rollback_artifact}"
    return "CONTINUE_CANARY_MONITORING"

print(f"clean_window={canary_action(0, decision.rollback_artifact)}")
print(f"unsafe_window={canary_action(1, decision.rollback_artifact)}")
```
Output
```text
clean_window=CONTINUE_CANARY_MONITORING
unsafe_window=ROLL_BACK_TO:ci-answerer-v1
```

The same record applies to training runs

This lab changed a template because that's what the incident demanded. The experiment-tracking habit generalizes when the changed artifact is a trained model:

| Prompt-and-policy candidate here | Training candidate in a later run |
| --- | --- |
| template id and policy id | model architecture, optimizer, precision mode |
| frozen route-evaluation contract | train/validation dataset fingerprint and evaluator version |
| unsafe serves and unnecessary abstentions | loss, quality slices, numerical failures |
| template bundle and redacted report | checkpoint and evaluation report |
| canary candidate alias | registered model version and alias |

The next lesson studies mixed precision. There, a run may change BF16 or FP16 settings and measure throughput, memory, NaN failures, and model-quality guardrails. Without a run record, a faster training job can look like progress even when numerical stability or final quality got worse.
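
The same review pattern can be sketched for a training run. Everything in this block is an assumption for illustration: the `training_review_status` helper, the metric names, the tolerance, and the numbers are made up and are not measurements from any real job.

```python
import math


def training_review_status(
    metrics: dict[str, float],
    baseline_eval_loss: float,
    tolerance: float = 0.01,
) -> str:
    """Check numerical stability first, then quality; throughput never overrides either."""
    if metrics["nan_steps"] > 0 or math.isnan(metrics["eval_loss"]):
        return "REJECT_NUMERICAL_FAILURE"
    if metrics["eval_loss"] > baseline_eval_loss + tolerance:
        return "REJECT_QUALITY_REGRESSION"
    return "ELIGIBLE_FOR_REVIEW"


# Illustrative, invented values: the two rejected runs are the fastest ones.
candidates = {
    "fp16-no-loss-scaling": {"tokens_per_second": 1400.0, "nan_steps": 3.0, "eval_loss": 2.30},
    "fp16-aggressive": {"tokens_per_second": 1350.0, "nan_steps": 0.0, "eval_loss": 2.40},
    "bf16": {"tokens_per_second": 1300.0, "nan_steps": 0.0, "eval_loss": 2.305},
}

statuses = {
    name: training_review_status(metrics, baseline_eval_loss=2.30)
    for name, metrics in candidates.items()
}
for name, status in statuses.items():
    print(f"{name}={status}")
```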

Mastery check

Mastery outcomes

| Capability | Working proof |
| --- | --- |
| Freeze a reproducible decision question | Regression, holdout, and combined suites receive deterministic fingerprints |
| Compare runs under one contract | Audit rejects suite or evaluator drift before metrics are ranked |
| Preserve incident-to-candidate lineage | Tracker records parameters, metrics, redacted artifacts, and incident tag |
| Bound promotion scope | Decision links canary alias, source run, rollback artifact, and production limitation |

Evaluation rubric

  • Foundational: Distinguishes a live incident from an offline experiment and a deployment decision.
  • Foundational: Explains why a candidate needs fixed inputs, a fingerprint, and a fixed evaluator version.
  • Intermediate: Reads the three-run comparison and rejects v1.2-fix despite zero unsafe serves.
  • Intermediate: Identifies parameters, metrics, tags, and artifacts that make the selected run reproducible.
  • Advanced: Explains why a prompt-only change shouldn't be presented as a registered model version.
  • Advanced: Produces a canary decision that names source run, rollback target, limitation, and monitored invariant.

Common pitfalls

The experiment forgets the incident that motivated it

  • Symptom: A candidate run has nice metrics but no link to the failed production request.
  • Cause: The team started a fresh dashboard instead of carrying forward the incident handoff.
  • Fix: Tag the run with the incident exemplar, failing artifact, evidence snapshot, and required safety metric.
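
The fix above can be sketched as a small lineage check that runs before a candidate is reviewed. This is a minimal illustration, not a tracker's API: the `RunRecord` class, the tag key names, and the helper `missing_incident_tags` are all hypothetical, and the run and request identifiers are borrowed from this lesson's example.

```python
from dataclasses import dataclass, field

# Hypothetical tag keys: the lesson names these concepts, not these exact strings.
REQUIRED_INCIDENT_TAGS = (
    "incident_exemplar",
    "failing_artifact",
    "evidence_snapshot",
    "required_safety_metric",
)


@dataclass
class RunRecord:
    """Illustrative stand-in for a tracked run; not a real tracker class."""
    run_id: str
    candidate: str
    tags: dict[str, str] = field(default_factory=dict)


def missing_incident_tags(run: RunRecord) -> list[str]:
    """Return the required lineage tags that are absent or empty on a run."""
    return [tag for tag in REQUIRED_INCIDENT_TAGS if not run.tags.get(tag)]


run = RunRecord(
    run_id="run_002",
    candidate="ci-answerer-v1.2-fix",
    tags={
        "incident_exemplar": "req_205",
        "failing_artifact": "ci-answerer-v1.1-regression",
        "required_safety_metric": "unsafe_claim_served",
    },
)

print(missing_incident_tags(run))  # ['evidence_snapshot']
```

A review gate could refuse to compare any run for which this list is non-empty, so a candidate with good metrics but no link to the incident never reaches the ranking step.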

The comparison contract changes silently between candidates

  • Symptom: A later run looks better, but it used different labels, cases, or scoring logic.
  • Cause: Inputs or evaluator semantics were saved informally rather than fingerprinted and versioned.
  • Fix: Log frozen regression and holdout fingerprints plus evaluator version with every comparable run.
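
One way to make that contract checkable is sketched below. It assumes SHA-256 over a canonical JSON encoding, which is an illustrative choice rather than the lesson's confirmed implementation. The case values are made up, and the function names `suite_fingerprint` and `contract_drift` are hypothetical.

```python
import hashlib
import json

# Fields that define a case for comparison purposes.
FINGERPRINT_FIELDS = ("case_id", "evidence_state", "expected_route", "slice_name", "source")
# Keys that must match before two runs may be ranked against each other.
CONTRACT_KEYS = ("regression_sha", "holdout_sha", "evaluator_version")


def suite_fingerprint(cases: list[dict]) -> str:
    """Hash the cases in case_id order so input order cannot change the result."""
    ordered = sorted(cases, key=lambda case: case["case_id"])
    canonical = [[case[name] for name in FINGERPRINT_FIELDS] for case in ordered]
    payload = json.dumps(canonical, separators=(",", ":"), ensure_ascii=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def contract_drift(run_a: dict, run_b: dict) -> list[str]:
    """Return the contract keys on which two runs disagree; empty means comparable."""
    return [key for key in CONTRACT_KEYS if run_a.get(key) != run_b.get(key)]


# Made-up cases for illustration only.
cases = [
    {"case_id": "c1", "evidence_state": "supported", "expected_route": "answer",
     "slice_name": "regression", "source": "synthetic"},
    {"case_id": "c2", "evidence_state": "unsupported", "expected_route": "abstain",
     "slice_name": "holdout", "source": "synthetic"},
]
reordered = list(reversed(cases))
edited = [dict(cases[0]), {**cases[1], "expected_route": "answer"}]

print(suite_fingerprint(cases) == suite_fingerprint(reordered))  # True: order is ignored
print(suite_fingerprint(cases) == suite_fingerprint(edited))     # False: a label changed

run_a = {"regression_sha": "aaa", "holdout_sha": "bbb", "evaluator_version": "claim-route-eval-v2"}
run_b = {"regression_sha": "aaa", "holdout_sha": "bbb", "evaluator_version": "claim-route-eval-v3"}
print(contract_drift(run_a, run_b))  # ['evaluator_version']
```

Sorting before hashing is what makes the fingerprint insensitive to input order, while including every scored field is what makes it sensitive to a changed label. Matching fingerprints alone aren't enough: the evaluator version is compared separately because scoring logic can change while the cases stay fixed.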

Zero unsafe serves hides an unusable repair

  • Symptom: A candidate looks safe because it abstains on every CI question.
  • Cause: Review tracked a safety invariant without a usefulness guardrail.
  • Fix: Pair unsafe_claim_served == 0 with unnecessary_abstention == 0 on supported cases.
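
The paired guardrail can be expressed as a single predicate. The sketch below applies it to the three-run comparison from this lesson (unsafe serves and unnecessary abstentions per candidate); the dictionary layout and the function name `passes_guardrails` are illustrative assumptions.

```python
# Metrics from the lesson's three-run comparison over the frozen four-case suite.
RUNS = {
    "v1.1-regression": {"unsafe_claim_served": 2, "unnecessary_abstention": 0},
    "v1.2-fix": {"unsafe_claim_served": 0, "unnecessary_abstention": 1},
    "v1.2-fix-b": {"unsafe_claim_served": 0, "unnecessary_abstention": 0},
}


def passes_guardrails(metrics: dict[str, int]) -> bool:
    """A run passes only when both the safety and usefulness guardrails are zero."""
    return (
        metrics["unsafe_claim_served"] == 0
        and metrics["unnecessary_abstention"] == 0
    )


passing = [name for name, metrics in RUNS.items() if passes_guardrails(metrics)]
print(passing)  # ['v1.2-fix-b']
```

Checking only `unsafe_claim_served` would also admit `v1.2-fix`, which reaches zero unsafe serves by abstaining on a supported case.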

A prompt change is called a new model

  • Symptom: Registry history can't tell whether weights, retrieval, or templates changed.
  • Cause: Every application artifact was collapsed into the word "model."
  • Fix: Register and version the artifact that changed; use a model registry when model weights changed.

A passing offline run is treated as deployment approval

  • Symptom: A four-case test produces an immediate broad rollout.
  • Cause: Evidence scope and promotion scope were confused.
  • Fix: State the limitation in the decision record and require reviewed canary monitoring before wider traffic.
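
A decision record that keeps evidence scope and promotion scope separate might look like the following sketch. The `PromotionDecision` fields mirror the items the rubric asks a canary decision to name; the class, the `decision_problems` helper, the alias, and the rollback target value are hypothetical placeholders.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class PromotionDecision:
    """Illustrative decision record; field names are assumptions, not a real schema."""
    source_run: str
    canary_alias: str
    rollback_target: str
    limitation: str
    monitored_invariant: str
    scope: str  # "canary" or "broad"
    canary_reviewed: bool = False


def decision_problems(decision: PromotionDecision) -> list[str]:
    """List reasons the decision record is incomplete or over-scoped."""
    problems = []
    for f in fields(decision):
        value = getattr(decision, f.name)
        if isinstance(value, str) and not value.strip():
            problems.append(f"missing {f.name}")
    # A passing offline run alone never justifies broad traffic.
    if decision.scope == "broad" and not decision.canary_reviewed:
        problems.append("broad rollout requires reviewed canary monitoring")
    return problems


canary = PromotionDecision(
    source_run="run_003",
    canary_alias="canary-candidate",               # hypothetical alias name
    rollback_target="previous-template-bundle",    # hypothetical placeholder
    limitation="evidence is four synthetic cases for a prompt-only change",
    monitored_invariant="unsafe_claim_served == 0",
    scope="canary",
)

print(decision_problems(canary))  # []
print(decision_problems(replace(canary, scope="broad")))
# ['broad rollout requires reviewed canary monitoring']
```

Making the limitation a required field forces the author to state what the offline evidence does not cover before any traffic moves.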
Mastery Check

Answer every question, then check your score. Score above 75% to mark this lesson complete.

1. Monitoring shows req_205 from ci-answerer-v1.1-regression served an unsupported root-cause claim, and a prompt-policy candidate ci-answerer-v1.2-fix has been created. Which next action keeps the incident, experiment, and deployment boundaries correct?
2. A deterministic evaluation fingerprint is computed by sorting cases by case_id and hashing case_id, evidence_state, expected_route, slice_name, and source. What follows if the same cases are passed in a different order, and then one expected_route is changed?
3. An offline comparison over the frozen four-case suite produces these metrics: v1.1-regression has unsafe=2 and unnecessary=0, v1.2-fix has unsafe=0 and unnecessary=1, and v1.2-fix-b has unsafe=0 and unnecessary=0. The contract requires both guardrails to be zero. Which review outcome follows?
4. Which logged record makes a prompt-policy evaluation run reviewable without storing raw logs?
5. A team appends a fifth holdout case for a new dependency-error shape next week. The evaluator code is unchanged. How should reviewers compare the new run with the original four-case numbers?
6. Two runs have the same regression_sha, holdout_sha, and evaluation_sha, but one logged evaluator_version=claim-route-eval-v2 and the other logged claim-route-eval-v3. What should the reviewer do before ranking their metrics?
7. After selecting v1.2-fix-b, a teammate wants to delete the failed run_002 record for v1.2-fix because it missed the holdout case. What should the tracker preserve?
8. run_003 is the only run that passes both guardrails on four synthetic cases for a prompt-only change. During its canary, any unsupported root-cause claim must stop traffic. Which decision record and response match that evidence?

Next Step
Continue to Mixed Precision Training

You can now record a controlled candidate run and reject improvements that fail guardrails. Next you'll use that experiment discipline to reason about faster training runs whose lower-precision arithmetic can change both throughput and numerical stability.

References

ML Experiment Tracking.

MLflow Project. · 2026 · Official documentation

Experiments overview.

Weights & Biases. · 2026 · Official documentation

Artifacts overview.

Weights & Biases. · 2026 · Official documentation

Organize runs with tags.

Weights & Biases. · 2026 · Official documentation

Model Registry Workflows | MLflow AI Platform

MLflow · 2026