Turn a live LLM regression into a reproducible candidate decision by logging inputs, metrics, artifacts, and promotion evidence.
A live failure anchors this lesson: ci-answerer-v1.1-regression served two unsupported CI root-cause claims. Monitoring created a proposed fix, ci-answerer-v1.2-fix, but correctly left its status as BLOCKED_PENDING_EVALUATION.
Now you need to answer a different question: did the proposed fix earn a controlled next step? Experiment tracking records the hypothesis, frozen test inputs, changed artifact, measured outputs, and decision so that someone else can reproduce that answer.
For a large language model (LLM) application, an experiment isn't limited to training new weights. A run can evaluate a changed prompt template, evidence policy, retriever, tool schema, model, or fine-tuned checkpoint. This lesson uses a prompt-and-policy fix because it's the immediate continuation of the production incident.
The monitoring page established what happened in production. It didn't test any fix. Keep that boundary clear:
| System | Question it answers | Record here |
|---|---|---|
| Observability | What went wrong on live requests? | req_205 served an unsupported root-cause claim |
| Experiment tracking | Does a controlled candidate fix the failure without new regressions? | three offline runs over frozen cases |
| Deployment control | May an approved artifact receive limited traffic? | canary alias only after review |
The objects in an experiment record are straightforward once they're tied to this failure.
| Object | Meaning | CI-answerer example |
|---|---|---|
| Experiment | Related attempts to answer one question | ci-claim-gate-repair |
| Run | One execution against fixed inputs | evaluate ci-answerer-v1.2-fix-b |
| Parameter | Input chosen before evaluation | template id, policy id, evaluator version |
| Metric | Measured result | unsafe root-cause claims served, unnecessary abstentions |
| Artifact | Versioned output or evidence file | template bundle, redacted failure report |
| Tag | Searchable context | incident id, hypothesis, reviewer status |
| Registry decision | Which artifact may proceed | canary_candidate -> v1.2-fix-b |
MLflow Tracking records runs with parameters, metrics, tags, artifacts, source-code versions, and dataset inputs.[1] Weights & Biases (W&B) uses runs with configuration, metrics, and artifacts for the same evidence workflow.[2] Start with a local version of that workflow so the logic is visible without credentials or a hosted service.
Changing both the candidate and the test cases at the same time makes a win hard to interpret. Start from the handoff that monitoring produced, then create a small regression suite from known behavior.
The handoff deliberately carries no raw log text. The experiment needs stable evidence categories and expected routes, not durable copies of log text.
```python
from dataclasses import dataclass, field
import hashlib
import json

@dataclass(frozen=True)
class IncidentHandoff:
    incident_policy_id: str
    failing_template_id: str
    exemplar_request_id: str
    evidence_version: str
    candidate_template_id: str
    required_metric: str

@dataclass(frozen=True)
class EvalCase:
    case_id: str
    slice_name: str
    evidence_state: str
    expected_route: str
    source: str

handoff = IncidentHandoff(
    incident_policy_id="ci-grounding-slo-v1",
    failing_template_id="ci-answerer-v1.1-regression",
    exemplar_request_id="req_205",
    evidence_version="ci-logs@v17",
    candidate_template_id="ci-answerer-v1.2-fix",
    required_metric="unsafe_claim_served == 0 on regression and holdout windows",
)

# One case from the incident and one known-good case.
regression_cases = (
    EvalCase("req_205", "unsupported_root_cause", "no_source", "ABSTAIN", "incident"),
    EvalCase("reg_002", "confirmed_failure", "test_failure_log", "SERVE", "known_good"),
)

print(f"failing_template={handoff.failing_template_id}")
print(f"candidate={handoff.candidate_template_id}")
print(f"regression_cases={len(regression_cases)}")
print(f"incident_expected_route={regression_cases[0].expected_route}")
print(f"required_metric={handoff.required_metric}")
```

```text
failing_template=ci-answerer-v1.1-regression
candidate=ci-answerer-v1.2-fix
regression_cases=2
incident_expected_route=ABSTAIN
required_metric=unsafe_claim_served == 0 on regression and holdout windows
```

The first case checks the incident: the evidence can't support a root-cause claim, so the application must abstain. The second case guards against the blunt "fix" of refusing to answer even when trustworthy evidence is present.
A run id isn't enough to reproduce a decision. If someone quietly changes a label, adds a case, or updates the evidence snapshot, the comparison now runs under a different experimental condition.
A fingerprint is a deterministic checksum of the evaluation inputs. Equal fingerprints mean two runs were measured against the same frozen representation. They don't prove the labels are correct; review still has to establish that.
```python
def suite_fingerprint(cases: tuple[EvalCase, ...]) -> str:
    normalized = [
        {
            "case_id": case.case_id,
            "evidence_state": case.evidence_state,
            "expected_route": case.expected_route,
            "slice_name": case.slice_name,
            "source": case.source,
        }
        for case in sorted(cases, key=lambda item: item.case_id)
    ]
    payload = json.dumps(normalized, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(payload).hexdigest()[:16]

regression_fingerprint = suite_fingerprint(regression_cases)
corrupted_cases = (
    regression_cases[0],
    EvalCase("reg_002", "confirmed_failure", "test_failure_log", "ABSTAIN", "known_good"),
)

print(f"regression_sha={regression_fingerprint}")
print(f"same_rows_same_sha={suite_fingerprint(tuple(reversed(regression_cases))) == regression_fingerprint}")
print(f"corrupted_label_detected={suite_fingerprint(corrupted_cases) != regression_fingerprint}")
```

```text
regression_sha=2be42790a4f41355
same_rows_same_sha=True
corrupted_label_detected=True
```

The row order isn't meaningful, so the fingerprint normalizes order. The example deliberately corrupts the known-good test-failure-log label; that mistake must create a different checksum. In a real project, also record dataset version, evidence snapshot version, labeling policy, and the code commit that built the cases.
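One way to bind that extra provenance to the case checksum is to hash them together into a single manifest fingerprint. The sketch below is illustrative only: `manifest_fingerprint` is a hypothetical helper, and the `dataset_version` and `labeling_policy` values are made-up placeholders, while the evidence version and commit string reuse values that appear in this lesson.

```python
import hashlib
import json

def manifest_fingerprint(case_sha: str, metadata: dict[str, str]) -> str:
    """Hash the case checksum together with its provenance fields."""
    payload = json.dumps(
        {"case_sha": case_sha, **metadata},
        sort_keys=True,
        separators=(",", ":"),
    ).encode()
    return hashlib.sha256(payload).hexdigest()[:16]

# Assumed provenance values; dataset_version and labeling_policy are invented here.
metadata = {
    "dataset_version": "ci-regression-suite@v1",
    "evidence_version": "ci-logs@v17",
    "labeling_policy": "route-labels-v1",
    "code_commit": "fixture-commit-8b27f2",
}
case_sha = "2be42790a4f41355"  # the regression_sha shown in the output above

baseline = manifest_fingerprint(case_sha, metadata)
relabeled = manifest_fingerprint(
    case_sha, {**metadata, "labeling_policy": "route-labels-v2"}
)

print(f"same_metadata_same_sha={manifest_fingerprint(case_sha, dict(metadata)) == baseline}")
print(f"policy_change_detected={relabeled != baseline}")
```

With this shape, a change to the labeling policy alters the manifest fingerprint even when the case rows themselves are byte-for-byte identical.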
Suppose the first fix rejects unsupported root-cause claims and serves explicit test failure logs. It passes the two known cases. That doesn't show how it behaves when a dependency resolver provides a version-conflict error, which is safe to quote but has a different evidence shape.
Add representative holdout cases before selecting a candidate. Keep them separate from the original incident suite so reviewers can tell whether a candidate repaired the known failure or generalized beyond it.
```python
holdout_cases = (
    EvalCase("hold_001", "dependency_error", "dependency_error", "SERVE", "holdout"),
    EvalCase("hold_002", "missing_stacktrace", "no_source", "ABSTAIN", "holdout"),
)
all_cases = regression_cases + holdout_cases
holdout_fingerprint = suite_fingerprint(holdout_cases)
evaluation_fingerprint = suite_fingerprint(all_cases)

print(f"holdout_cases={len(holdout_cases)}")
print(f"holdout_sha={holdout_fingerprint}")
print(f"evaluation_sha={evaluation_fingerprint}")
print("guardrails=unsafe_claim_served==0, unnecessary_abstention==0")
```

```text
holdout_cases=2
holdout_sha=d47cdbbfe1caae89
evaluation_sha=94dfc930d2829d2c
guardrails=unsafe_claim_served==0, unnecessary_abstention==0
```

Notice the second guardrail. An application that abstains on every answer is factually safe on this metric, but it isn't useful. unsafe_claim_served protects trust; unnecessary_abstention protects the product from an overly restrictive repair.
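To see why both guardrails are needed, it can help to score a degenerate candidate that abstains on everything. The following is a minimal, self-contained sketch: `BaselineCase` and `always_abstain` are hypothetical names introduced for illustration, and the four rows copy the evidence states and expected routes of the cases defined in this lesson.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BaselineCase:
    case_id: str
    evidence_state: str
    expected_route: str

# Same evidence states and expected routes as the regression and holdout suites.
baseline_cases = (
    BaselineCase("req_205", "no_source", "ABSTAIN"),
    BaselineCase("reg_002", "test_failure_log", "SERVE"),
    BaselineCase("hold_001", "dependency_error", "SERVE"),
    BaselineCase("hold_002", "no_source", "ABSTAIN"),
)

def always_abstain(case: BaselineCase) -> str:
    """Degenerate candidate: refuses regardless of the evidence."""
    return "ABSTAIN"

baseline_unsafe = sum(
    always_abstain(case) == "SERVE" and case.expected_route == "ABSTAIN"
    for case in baseline_cases
)
baseline_unnecessary = sum(
    always_abstain(case) == "ABSTAIN" and case.expected_route == "SERVE"
    for case in baseline_cases
)

print(f"unsafe_claim_served={baseline_unsafe}")
print(f"unnecessary_abstention={baseline_unnecessary}")
print(f"passes_both_guardrails={baseline_unsafe == 0 and baseline_unnecessary == 0}")
```

The always-abstain candidate serves no unsafe claims, yet it abstains on both cases that should be served, so the second guardrail blocks it.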
The next cell models three template versions. These functions stand in for deterministic offline evaluation results so you can focus on run comparison:
- v1.1-regression serves every claimed root cause, including unsupported ones.
- v1.2-fix repairs the incident but recognizes only explicit test failure logs.
- v1.2-fix-b handles both test failure logs and dependency-error evidence.

The candidate implementation doesn't get to read expected_route. It derives its route from evidence_state alone, and the evaluator compares that output with the frozen expectation.
```python
@dataclass(frozen=True)
class Evaluation:
    template_id: str
    case_count: int
    unsafe_claim_served: int
    unnecessary_abstention: int
    exact_route_rate: float

def predict_route(template_id: str, case: EvalCase) -> str:
    # Stand-in routing rules; each reads evidence_state only, never expected_route.
    if template_id == "ci-answerer-v1.1-regression":
        return "SERVE"
    if template_id == "ci-answerer-v1.2-fix":
        return "SERVE" if case.evidence_state == "test_failure_log" else "ABSTAIN"
    if template_id == "ci-answerer-v1.2-fix-b":
        supported = {"test_failure_log", "dependency_error"}
        return "SERVE" if case.evidence_state in supported else "ABSTAIN"
    raise ValueError(f"unknown template: {template_id}")

def evaluate(template_id: str, cases: tuple[EvalCase, ...]) -> Evaluation:
    predicted = [predict_route(template_id, case) for case in cases]
    # Served a claim where the frozen expectation was to abstain.
    unsafe = sum(
        route == "SERVE" and case.expected_route == "ABSTAIN"
        for route, case in zip(predicted, cases)
    )
    # Abstained where the frozen expectation was to serve.
    unnecessary = sum(
        route == "ABSTAIN" and case.expected_route == "SERVE"
        for route, case in zip(predicted, cases)
    )
    exact = sum(route == case.expected_route for route, case in zip(predicted, cases))
    return Evaluation(template_id, len(cases), unsafe, unnecessary, exact / len(cases))

known_fix = evaluate("ci-answerer-v1.2-fix", regression_cases)
print(f"candidate={known_fix.template_id}")
print(f"regression_unsafe={known_fix.unsafe_claim_served}")
print(f"regression_unnecessary_abstention={known_fix.unnecessary_abstention}")
print(f"regression_pass={known_fix.exact_route_rate == 1.0}")
print("decision=CONTINUE_TO_HOLDOUT")
```

```text
candidate=ci-answerer-v1.2-fix
regression_unsafe=0
regression_unnecessary_abstention=0
regression_pass=True
decision=CONTINUE_TO_HOLDOUT
```

A clean regression result is progress, not approval. It means the candidate deserves evaluation on the holdout contract that was already declared.
```python
templates = (
    "ci-answerer-v1.1-regression",
    "ci-answerer-v1.2-fix",
    "ci-answerer-v1.2-fix-b",
)
evaluations = [evaluate(template_id, all_cases) for template_id in templates]

# Header uses the same column widths as the rows below.
print(f"{'template':<33} {'unsafe':>6} {'unnecessary':>12} {'exact':>6}")
for result in evaluations:
    print(
        f"{result.template_id:<33} "
        f"{result.unsafe_claim_served:>6} "
        f"{result.unnecessary_abstention:>12} "
        f"{result.exact_route_rate:>6.0%}"
    )

first_fix = evaluations[1]
revised_fix = evaluations[2]
print(f"first_fix_blocked={first_fix.unnecessary_abstention > 0}")
print(f"revised_fix_passes_guardrails={revised_fix.unsafe_claim_served == 0 and revised_fix.unnecessary_abstention == 0}")
```

```text
template                          unsafe  unnecessary  exact
ci-answerer-v1.1-regression            2            0    50%
ci-answerer-v1.2-fix                   0            1    75%
ci-answerer-v1.2-fix-b                 0            0   100%
first_fix_blocked=True
revised_fix_passes_guardrails=True
```

Tracking isn't a scoreboard for flattering metrics. The first fix satisfies the incident's safety condition but regresses usefulness on the holdout suite: it abstains on the dependency-error case that it should serve. Its run should remain visible and rejected, not silently replaced by the revised fix.
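A rejected run is easier to review when its report names the failing cases without copying any log text. The sketch below is one possible shape for such a redacted report, not the lesson's actual artifact format: the field names and the `failure_kind` and `redacted_failures` helpers are assumptions, and the rows restate the frozen expectations alongside the routes that the stand-in rule for ci-answerer-v1.2-fix produces.

```python
import json
from typing import Optional

# (case_id, slice_name, expected_route, predicted_route) for ci-answerer-v1.2-fix.
fix_results = (
    ("req_205", "unsupported_root_cause", "ABSTAIN", "ABSTAIN"),
    ("reg_002", "confirmed_failure", "SERVE", "SERVE"),
    ("hold_001", "dependency_error", "SERVE", "ABSTAIN"),
    ("hold_002", "missing_stacktrace", "ABSTAIN", "ABSTAIN"),
)

def failure_kind(expected: str, predicted: str) -> Optional[str]:
    """Name the guardrail a mismatch violates, or None when the routes agree."""
    if predicted == "SERVE" and expected == "ABSTAIN":
        return "unsafe_claim_served"
    if predicted == "ABSTAIN" and expected == "SERVE":
        return "unnecessary_abstention"
    return None

def redacted_failures(results) -> list[dict[str, str]]:
    """Keep identifiers and routes only; no prompt, answer, or log text."""
    failures = []
    for case_id, slice_name, expected, predicted in results:
        kind = failure_kind(expected, predicted)
        if kind is not None:
            failures.append({
                "case_id": case_id,
                "slice_name": slice_name,
                "expected_route": expected,
                "predicted_route": predicted,
                "failure_kind": kind,
            })
    return failures

report = {
    "template_id": "ci-answerer-v1.2-fix",
    "raw_log_text_stored": False,
    "failures": redacted_failures(fix_results),
}
print(json.dumps(report, indent=2))
```

The report points a reviewer at hold_001 as the single unnecessary abstention, which is enough to explain the rejection without retaining raw evidence.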
Now put the experiment structure around those results. A tracking service should let a reviewer answer which inputs were frozen, which artifact changed, what was measured, and what was decided.
The local tracker below uses the same conceptual buckets as MLflow and W&B. It deliberately logs stable identifiers and a redacted exemplar report, not raw logs.
```python
@dataclass
class RunRecord:
    run_id: str
    params: dict[str, str] = field(default_factory=dict)
    metrics: dict[str, float] = field(default_factory=dict)
    artifacts: list[str] = field(default_factory=list)
    tags: dict[str, str] = field(default_factory=dict)

class LocalTracker:
    def __init__(self) -> None:
        self.runs: list[RunRecord] = []

    def start_run(self, run_id: str) -> RunRecord:
        run = RunRecord(run_id)
        self.runs.append(run)
        return run

    def log_params(self, run: RunRecord, params: dict[str, str]) -> None:
        run.params.update(params)

    def log_metrics(self, run: RunRecord, metrics: dict[str, float]) -> None:
        run.metrics.update(metrics)

    def log_artifact(self, run: RunRecord, path: str) -> None:
        run.artifacts.append(path)

    def set_tags(self, run: RunRecord, tags: dict[str, str]) -> None:
        run.tags.update(tags)

evaluator_version = "claim-route-eval-v2"
tracker = LocalTracker()
for index, result in enumerate(evaluations, start=1):
    run = tracker.start_run(f"run_{index:03d}")
    tracker.log_params(run, {
        "template_id": result.template_id,
        "policy_id": handoff.incident_policy_id,
        "evaluator_version": evaluator_version,
        "regression_sha": regression_fingerprint,
        "holdout_sha": holdout_fingerprint,
        "evaluation_sha": evaluation_fingerprint,
        "evidence_version": handoff.evidence_version,
        "code_commit": "fixture-commit-8b27f2",
    })
    tracker.log_metrics(run, {
        "unsafe_claim_served": float(result.unsafe_claim_served),
        "unnecessary_abstention": float(result.unnecessary_abstention),
        "exact_route_rate": result.exact_route_rate,
    })
    tracker.log_artifact(run, f"reports/{result.template_id}-redacted-eval.json")
    tracker.log_artifact(run, f"templates/{result.template_id}.json")
    tracker.set_tags(run, {
        "incident_request_id": handoff.exemplar_request_id,
        "hypothesis": "restore evidence-gated CI root-cause claims",
        "raw_log_text_stored": "false",
    })

print(f"tracked_runs={len(tracker.runs)}")
print(f"shared_evaluation_sha={tracker.runs[0].params['evaluation_sha']}")
print(f"run_002_artifact={tracker.runs[1].artifacts[0]}")
print(f"run_002_template_artifact={tracker.runs[1].artifacts[1]}")
print(f"raw_log_text_stored={tracker.runs[2].tags['raw_log_text_stored']}")
```

```text
tracked_runs=3
shared_evaluation_sha=94dfc930d2829d2c
run_002_artifact=reports/ci-answerer-v1.2-fix-redacted-eval.json
run_002_template_artifact=templates/ci-answerer-v1.2-fix.json
raw_log_text_stored=false
```

All three runs point to the same regression, holdout, and combined evaluation checksums plus policy, evidence snapshot, and evaluator version. Each run also preserves its redacted report and versioned template artifact. That makes their metric comparison meaningful. If a new test case or scoring rule is needed tomorrow, update the corresponding fingerprint or evaluator version and label subsequent runs as a new comparison set.
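One way to label a comparison set is to derive an identifier from exactly the fields that define the evaluation contract. This is a sketch under assumptions: `comparison_set_id`, `COMPARISON_FIELDS`, and the `cmp-` prefix are hypothetical, and the parameter values are copied from the outputs shown in this lesson.

```python
import hashlib
import json

# Fields that must match for two runs to share a comparison set (assumed list).
COMPARISON_FIELDS = (
    "regression_sha",
    "holdout_sha",
    "evaluation_sha",
    "policy_id",
    "evidence_version",
    "evaluator_version",
)

def comparison_set_id(params: dict[str, str]) -> str:
    """Hash only the contract fields, so candidate-specific params are ignored."""
    contract = {name: params[name] for name in COMPARISON_FIELDS}
    payload = json.dumps(contract, sort_keys=True, separators=(",", ":")).encode()
    return "cmp-" + hashlib.sha256(payload).hexdigest()[:12]

run_today = {
    "template_id": "ci-answerer-v1.2-fix-b",
    "regression_sha": "2be42790a4f41355",
    "holdout_sha": "d47cdbbfe1caae89",
    "evaluation_sha": "94dfc930d2829d2c",
    "policy_id": "ci-grounding-slo-v1",
    "evidence_version": "ci-logs@v17",
    "evaluator_version": "claim-route-eval-v2",
}
other_template = {**run_today, "template_id": "ci-answerer-v1.2-fix"}
run_tomorrow = {**run_today, "evaluator_version": "claim-route-eval-v3"}

print(f"same_set_across_templates={comparison_set_id(other_template) == comparison_set_id(run_today)}")
print(f"new_set_after_evaluator_change={comparison_set_id(run_tomorrow) != comparison_set_id(run_today)}")
```

Stored as a tag, such an identifier lets a dashboard group runs by contract, while the candidate's own template id stays free to vary within a set.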
Before ranking runs, audit that they are genuinely comparable. A dashboard can display results with different evaluation inputs or scoring logic side by side, but the reviewer must refuse to call that a controlled comparison.
```python
def comparable(runs: list[RunRecord]) -> bool:
    comparison_fields = (
        "regression_sha",
        "holdout_sha",
        "evaluation_sha",
        "policy_id",
        "evidence_version",
        "evaluator_version",
    )
    # Loop variable is named `name` so it doesn't shadow dataclasses.field.
    return all(
        len({run.params[name] for run in runs}) == 1
        for name in comparison_fields
    )

changed_suite_run = RunRecord(
    run_id="run_004",
    params={**tracker.runs[2].params, "holdout_sha": "different-holdout-sha"},
)
changed_evaluator_run = RunRecord(
    run_id="run_005",
    params={**tracker.runs[2].params, "evaluator_version": "claim-route-eval-v3"},
)

print(f"tracked_runs_comparable={comparable(tracker.runs)}")
print(f"changed_suite_comparable={comparable(tracker.runs + [changed_suite_run])}")
print(f"changed_evaluator_comparable={comparable(tracker.runs + [changed_evaluator_run])}")
print("review_rule=never_rank_metrics_until_inputs_and_evaluator_match")
```

```text
tracked_runs_comparable=True
changed_suite_comparable=False
changed_evaluator_comparable=False
review_rule=never_rank_metrics_until_inputs_and_evaluator_match
```

The local tracker isn't a substitute for a shared tracking service. It exposes the record design first. Once that design is correct, the platform mappings are small.
| Evidence in our run | MLflow Tracking | W&B |
|---|---|---|
| Experiment/run boundary | mlflow.set_experiment(...), mlflow.start_run() | wandb.init(project=..., name=...) |
| Candidate and policy configuration | mlflow.log_params(...) | run.config |
| Guardrail results | mlflow.log_metrics(...) | run.log(...) |
| Incident and decision context | mlflow.set_tags(...) | run tags / config fields |
| Redacted report or template bundle | mlflow.log_artifact(...) | run.log_artifact(...) |
| Evaluation input lineage | mlflow.log_input(...) with dataset metadata | run.use_artifact(...) for a versioned input artifact |
Those APIs are documented by their respective tracking, artifact, and run-tag guides.[1][2][3][4] A real integration would require installing the chosen SDK and configuring storage and authentication, so these snippets are intentionally not executed here:
```python
import mlflow

mlflow.set_experiment("ci-claim-gate-repair")
with mlflow.start_run(run_name="ci-answerer-v1.2-fix-b"):
    mlflow.log_params({"template_id": "...", "evaluation_sha": "..."})
    mlflow.log_metrics({"unsafe_claim_served": 0, "unnecessary_abstention": 0})
    mlflow.set_tags({"incident_request_id": "req_205", "decision": "canary_review"})
    mlflow.log_artifact("templates/ci-answerer-v1.2-fix-b.json")
    mlflow.log_artifact("reports/redacted-eval.json")
```

```python
import wandb

candidate_bundle = wandb.Artifact("v1.2-fix-b-bundle", type="candidate")
candidate_bundle.add_file("templates/ci-answerer-v1.2-fix-b.json")
candidate_bundle.add_file("reports/redacted-eval.json")

with wandb.init(
    project="ci-claim-gate-repair",
    name="ci-answerer-v1.2-fix-b",
    config={"template_id": "...", "evaluation_sha": "..."},
    tags=["incident:req_205", "canary_review"],
) as run:
    run.log({"unsafe_claim_served": 0, "unnecessary_abstention": 0})
    run.log_artifact(candidate_bundle)
```

Choose a platform because it fits your team's storage, access, dashboards, and deployment workflow. Don't confuse tool selection with experimental rigor. Either platform can preserve an under-specified or misleading run if you don't define inputs and guardrails first.
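A small pre-logging check can catch an under-specified run before it reaches any platform. The sketch below is platform-independent and assumes a hypothetical `missing_evidence` helper; the required names mirror the parameters and guardrail metrics used in this lesson, but the exact list is a project decision rather than anything MLflow or W&B enforces.

```python
# Assumed minimum evidence for a run in this lesson's comparison set.
REQUIRED_PARAMS = (
    "template_id",
    "policy_id",
    "evaluator_version",
    "evaluation_sha",
    "evidence_version",
    "code_commit",
)
REQUIRED_METRICS = ("unsafe_claim_served", "unnecessary_abstention")

def missing_evidence(params: dict[str, str], metrics: dict[str, float]) -> list[str]:
    """List required params (absent or empty) and metrics (absent) for a run."""
    missing = [f"param:{name}" for name in REQUIRED_PARAMS if not params.get(name)]
    missing += [f"metric:{name}" for name in REQUIRED_METRICS if name not in metrics]
    return missing

complete_params = {
    "template_id": "ci-answerer-v1.2-fix-b",
    "policy_id": "ci-grounding-slo-v1",
    "evaluator_version": "claim-route-eval-v2",
    "evaluation_sha": "94dfc930d2829d2c",
    "evidence_version": "ci-logs@v17",
    "code_commit": "fixture-commit-8b27f2",
}
complete_metrics = {"unsafe_claim_served": 0.0, "unnecessary_abstention": 0.0}

# An under-specified run: a flattering score with no inputs or guardrails recorded.
vague_params = {"template_id": "ci-answerer-v1.2-fix-b"}
vague_metrics = {"exact_route_rate": 1.0}

print(f"complete_run_missing={missing_evidence(complete_params, complete_metrics)}")
print(f"vague_run_missing_count={len(missing_evidence(vague_params, vague_metrics))}")
```

Running the check before the platform's logging calls means a run either carries the full contract or is refused, whichever tracking service sits behind it.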
Don't delete rejected experiments. Their failure reason is part of the evidence. A future reviewer should see that the first repair eliminated unsafe claims but failed a usefulness guardrail.
```python
def review_status(run: RunRecord) -> str:
    if run.metrics["unsafe_claim_served"] > 0:
        return "REJECT_UNSAFE_SERVE"
    if run.metrics["unnecessary_abstention"] > 0:
        return "REJECT_UNNECESSARY_ABSTENTION"
    return "ELIGIBLE_FOR_CANARY_REVIEW"

for run in tracker.runs:
    print(f"{run.params['template_id']}={review_status(run)}")
```

```text
ci-answerer-v1.1-regression=REJECT_UNSAFE_SERVE
ci-answerer-v1.2-fix=REJECT_UNNECESSARY_ABSTENTION
ci-answerer-v1.2-fix-b=ELIGIBLE_FOR_CANARY_REVIEW
```

Even a passing offline run isn't authority to deploy broadly. This lab's cases are synthetic and tiny. A correct decision says exactly what the evidence supports: the revised candidate may enter canary review, with a rollback pointer and continued monitoring of the same safety invariant.
For this prompt-only change, the registered object in your application's artifact registry is a template artifact. Don't pretend that every LLM application change produces a new model. If a later experiment fine-tunes model weights, MLflow Model Registry can link the registered model version to its source run and use aliases to identify a deployment target.[5]
```python
@dataclass(frozen=True)
class PromotionDecision:
    artifact_name: str
    alias: str
    source_run_id: str
    evaluation_sha: str
    rollback_artifact: str
    status: str
    limitation: str

def passes_contract(run: RunRecord) -> bool:
    return (
        run.metrics["unsafe_claim_served"] == 0
        and run.metrics["unnecessary_abstention"] == 0
        and run.params["evaluation_sha"] == evaluation_fingerprint
        and run.params["policy_id"] == handoff.incident_policy_id
        and run.params["evidence_version"] == handoff.evidence_version
        and run.params["evaluator_version"] == evaluator_version
        and run.tags["incident_request_id"] == handoff.exemplar_request_id
    )

def only_eligible_run(runs: list[RunRecord]) -> RunRecord:
    eligible = [run for run in runs if passes_contract(run)]
    if len(eligible) != 1:
        raise ValueError(f"expected one eligible run, found {len(eligible)}")
    return eligible[0]

selected = only_eligible_run(tracker.runs)
decision = PromotionDecision(
    artifact_name=selected.params["template_id"],
    alias="canary_candidate",
    source_run_id=selected.run_id,
    evaluation_sha=selected.params["evaluation_sha"],
    rollback_artifact="ci-answerer-v1",
    status="APPROVED_FOR_CANARY_IN_SYNTHETIC_LAB",
    limitation="requires production review and monitored canary",
)

print(f"selected={decision.artifact_name}")
print(f"source_run={decision.source_run_id}")
print(f"alias={decision.alias}")
print(f"status={decision.status}")
print(f"rollback={decision.rollback_artifact}")
print(f"limitation={decision.limitation}")
```

```text
selected=ci-answerer-v1.2-fix-b
source_run=run_003
alias=canary_candidate
status=APPROVED_FOR_CANARY_IN_SYNTHETIC_LAB
rollback=ci-answerer-v1
limitation=requires production review and monitored canary
```

The fixture has one eligible run. A real review must resolve ties explicitly rather than silently selecting the first passing candidate. The artifact, source run, fixed suite, rollback target, and limitation now travel together. If canary observability later reports another unsupported root-cause claim, the investigator can trace back to this exact decision rather than guessing which "fix" shipped.
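If two runs ever pass the same contract, one explicit way to resolve the tie is to require a named reviewer and a recorded reason. The sketch below is an assumption about how that could look: `TieBreak`, `resolve_tie`, the reviewer name, and the second passing run `run_004` are all hypothetical and do not exist in this lesson's fixture.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TieBreak:
    selected_run_id: str
    passed_over: tuple[str, ...]
    reviewer: str
    reason: str

def resolve_tie(
    eligible_run_ids: list[str], chosen_run_id: str, reviewer: str, reason: str
) -> TieBreak:
    """Record a human choice among passing runs; never pick by list position."""
    if len(eligible_run_ids) < 2:
        raise ValueError("no tie to resolve")
    if chosen_run_id not in eligible_run_ids:
        raise ValueError(f"{chosen_run_id} did not pass the contract")
    if not reason.strip():
        raise ValueError("a tie-break needs a recorded reason")
    passed_over = tuple(sorted(r for r in eligible_run_ids if r != chosen_run_id))
    return TieBreak(chosen_run_id, passed_over, reviewer, reason.strip())

# Hypothetical tie: run_004 is an invented second passing candidate.
tie = resolve_tie(
    ["run_003", "run_004"],
    "run_003",
    reviewer="reviewer-a",
    reason="smaller template diff against the rollback target",
)
print(f"selected={tie.selected_run_id}")
print(f"passed_over={tie.passed_over}")

try:
    resolve_tie(["run_003", "run_004"], "run_004", reviewer="reviewer-a", reason=" ")
except ValueError as error:
    print(f"rejected={error}")
```

Keeping the passed-over run ids in the record preserves the same property as keeping rejected runs: a later investigator can see what else was on the table.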
Promotion evidence should also define the stop action before canary traffic begins. This canary check doesn't claim the revised candidate fails. It makes the rollback rule executable for any future unsafe serve.
```python
def canary_action(unsafe_claim_served: int, rollback_artifact: str) -> str:
    if unsafe_claim_served > 0:
        return f"ROLL_BACK_TO:{rollback_artifact}"
    return "CONTINUE_CANARY_MONITORING"

print(f"clean_window={canary_action(0, decision.rollback_artifact)}")
print(f"unsafe_window={canary_action(1, decision.rollback_artifact)}")
```

```text
clean_window=CONTINUE_CANARY_MONITORING
unsafe_window=ROLL_BACK_TO:ci-answerer-v1
```

This lab changed a template because that's what the incident demanded. The experiment-tracking habit generalizes when the changed artifact is a trained model:
| Prompt-and-policy candidate here | Training candidate in a later run |
|---|---|
| template id and policy id | model architecture, optimizer, precision mode |
| frozen route-evaluation contract | train/validation dataset fingerprint and evaluator version |
| unsafe serves and unnecessary abstentions | loss, quality slices, numerical failures |
| template bundle and redacted report | checkpoint and evaluation report |
| canary candidate alias | registered model version and alias |
The next lesson studies mixed precision. There, a run may change BF16 or FP16 settings and measure throughput, memory, NaN failures, and model-quality guardrails. Without a run record, a faster training job can look like progress even when numerical stability or final quality got worse.
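The same review pattern can be sketched for a training run. Everything in this example is hypothetical: the `TrainingRun` fields, the `training_review` rule, the loss tolerance, and all numbers are invented for illustration and are not measurements from any real job.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingRun:
    run_id: str
    precision: str
    tokens_per_second: float
    nan_steps: int
    eval_loss: float

def training_review(
    run: TrainingRun, baseline: TrainingRun, max_loss_regression: float = 0.01
) -> str:
    """Check stability and quality guardrails before crediting a speedup."""
    if run.nan_steps > 0:
        return "REJECT_NUMERICAL_FAILURE"
    if run.eval_loss > baseline.eval_loss + max_loss_regression:
        return "REJECT_QUALITY_REGRESSION"
    if run.tokens_per_second <= baseline.tokens_per_second:
        return "NO_SPEEDUP"
    return "ELIGIBLE_FOR_REVIEW"

# Invented numbers, chosen only to exercise each branch.
fp32_baseline = TrainingRun("train_001", "fp32", 1000.0, 0, 2.10)
bf16_run = TrainingRun("train_002", "bf16", 1600.0, 0, 2.105)
fp16_run = TrainingRun("train_003", "fp16", 1700.0, 3, 2.40)

for run in (bf16_run, fp16_run):
    print(f"{run.run_id}:{run.precision}={training_review(run, fp32_baseline)}")
```

In this made-up comparison the fastest run is the one that gets rejected, which is the situation a run record is meant to make visible.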
| Capability | Working proof |
|---|---|
| Freeze a reproducible decision question | Regression, holdout, and combined suites receive deterministic fingerprints |
| Compare runs under one contract | Audit rejects suite or evaluator drift before metrics are ranked |
| Preserve incident-to-candidate lineage | Tracker records parameters, metrics, redacted artifacts, and incident tag |
| Bound promotion scope | Decision links canary alias, source run, rollback artifact, and production limitation |
v1.2-fix despite zero unsafe serves.
unsafe_claim_served == 0 with unnecessary_abstention == 0 on supported cases.
[1] MLflow Project. ML Experiment Tracking. Official documentation, 2026.
[2] Weights & Biases. Experiments overview. Official documentation, 2026.
[3] Weights & Biases. Artifacts overview. Official documentation, 2026.
[4] Weights & Biases. Organize runs with tags. Official documentation, 2026.
[5] MLflow. Model Registry Workflows | MLflow AI Platform. 2026.