Experiment Tracking with MLflow and W&B

Turn a live LLM regression into a reproducible candidate decision by logging inputs, metrics, artifacts, and promotion evidence.

A live failure anchors this lesson: ci-answerer-v1.1-regression served two unsupported CI root-cause claims. Monitoring created a proposed fix, ci-answerer-v1.2-fix, but correctly left its status as BLOCKED_PENDING_EVALUATION.

Now you need to answer a different question: did the proposed fix earn a controlled next step? Experiment tracking records the hypothesis, frozen test inputs, changed artifact, measured outputs, and decision so that someone else can reproduce that answer.

For a large language model (LLM) application, an experiment isn't limited to training new weights. A run can evaluate a changed prompt template, evidence policy, retriever, tool schema, model, or fine-tuned checkpoint. This lesson uses a prompt-and-policy fix because it's the immediate continuation of the production incident.

Figure: Experiment tracking board where one shared evaluation contract fans into three runs: run 001 is blocked for unsafe serves, run 002 is blocked for unnecessary abstention, and run 003 is the only candidate eligible for canary review. Only runs that share the contract are comparable. Run 002 fixes safety but still fails usefulness, so only run 003 reaches canary review.

From alert to experiment

The monitoring page established what happened in production. It didn't test any fix. Keep that boundary clear:

System	Question it answers	Record here
Observability	What went wrong on live requests?	`req_205` served an unsupported root-cause claim
Experiment tracking	Does a controlled candidate fix the failure without new regressions?	three offline runs over frozen cases
Deployment control	May an approved artifact receive limited traffic?	canary alias only after review

The objects in an experiment record are straightforward once they're tied to this failure.

Object	Meaning	CI-answerer example
Experiment	Related attempts to answer one question	`ci-claim-gate-repair`
Run	One execution against fixed inputs	evaluate `ci-answerer-v1.2-fix-b`
Parameter	Input chosen before evaluation	template id, policy id, evaluator version
Metric	Measured result	unsafe root-cause claims served, unnecessary abstentions
Artifact	Versioned output or evidence file	template bundle, redacted failure report
Tag	Searchable context	incident id, hypothesis, reviewer status
Registry decision	Which artifact may proceed	`canary_candidate -> v1.2-fix-b`

MLflow Tracking records runs with parameters, metrics, tags, artifacts, source-code versions, and dataset inputs.^[1] Weights & Biases (W&B) uses runs with configuration, metrics, and artifacts for the same evidence workflow.^[2] Start with a local version of that workflow so the logic is visible without credentials or a hosted service.

Freeze the question before running candidates

Changing both the candidate and the test cases at the same time makes a win hard to interpret. Start from the handoff that monitoring produced, then create a small regression suite from known behavior.

The handoff below identifies the production exemplar by request id and carries no raw log text. The experiment needs stable evidence categories and expected routes, not durable copies of log text.

incident-contract.py

from dataclasses import dataclass, field
import hashlib
import json

@dataclass(frozen=True)
class IncidentHandoff:
    incident_policy_id: str
    failing_template_id: str
    exemplar_request_id: str
    evidence_version: str
    candidate_template_id: str
    required_metric: str

@dataclass(frozen=True)
class EvalCase:
    case_id: str
    slice_name: str
    evidence_state: str
    expected_route: str
    source: str

handoff = IncidentHandoff(
    incident_policy_id="ci-grounding-slo-v1",
    failing_template_id="ci-answerer-v1.1-regression",
    exemplar_request_id="req_205",
    evidence_version="ci-logs@v17",
    candidate_template_id="ci-answerer-v1.2-fix",
    required_metric="unsafe_claim_served == 0 on regression and holdout windows",
)

regression_cases = (
    EvalCase("req_205", "unsupported_root_cause", "no_source", "ABSTAIN", "incident"),
    EvalCase("reg_002", "confirmed_failure", "test_failure_log", "SERVE", "known_good"),
)

print(f"failing_template={handoff.failing_template_id}")
print(f"candidate={handoff.candidate_template_id}")
print(f"regression_cases={len(regression_cases)}")
print(f"incident_expected_route={regression_cases[0].expected_route}")
print(f"required_metric={handoff.required_metric}")

Output

failing_template=ci-answerer-v1.1-regression
candidate=ci-answerer-v1.2-fix
regression_cases=2
incident_expected_route=ABSTAIN
required_metric=unsafe_claim_served == 0 on regression and holdout windows

The first case checks the incident: evidence can't support a root-cause claim, so the application must abstain. The second case prevents the blunt "fix" of always refusing to answer when trustworthy evidence is present.

Fingerprint the evaluation inputs

A run id isn't enough to reproduce a decision. If someone quietly changes a label, adds a case, or updates the evidence snapshot, later runs are measured under a new experimental condition and can no longer be compared with earlier ones.

A fingerprint is a deterministic checksum of the evaluation inputs. Equal fingerprints mean two runs were measured against the same frozen representation. They don't prove the labels are correct; review still has to establish that.

evaluation-fingerprint.py

def suite_fingerprint(cases: tuple[EvalCase, ...]) -> str:
    normalized = [
        {
            "case_id": case.case_id,
            "evidence_state": case.evidence_state,
            "expected_route": case.expected_route,
            "slice_name": case.slice_name,
            "source": case.source,
        }
        for case in sorted(cases, key=lambda item: item.case_id)
    ]
    payload = json.dumps(normalized, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(payload).hexdigest()[:16]

regression_fingerprint = suite_fingerprint(regression_cases)
corrupted_cases = (
    regression_cases[0],
    EvalCase("reg_002", "confirmed_failure", "test_failure_log", "ABSTAIN", "known_good"),
)

print(f"regression_sha={regression_fingerprint}")
print(f"same_rows_same_sha={suite_fingerprint(tuple(reversed(regression_cases))) == regression_fingerprint}")
print(f"corrupted_label_detected={suite_fingerprint(corrupted_cases) != regression_fingerprint}")

Output

regression_sha=2be42790a4f41355
same_rows_same_sha=True
corrupted_label_detected=True

The row order isn't meaningful, so the fingerprint normalizes order. The example deliberately corrupts the known-good test-failure-log label; that mistake must create a different checksum. In a real project, also record dataset version, evidence snapshot version, labeling policy, and the code commit that built the cases.
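The extra provenance fields listed above could be folded into a second checksum alongside the suite fingerprint. This is a minimal sketch under stated assumptions: `manifest_fingerprint` and the manifest keys are hypothetical names, and every version string is a placeholder rather than a value from a real project.

```python
import hashlib
import json

def manifest_fingerprint(manifest: dict[str, str]) -> str:
    """Hash a flat manifest of version identifiers deterministically."""
    # sort_keys makes the checksum independent of dictionary insertion order.
    payload = json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(payload).hexdigest()[:16]

# Placeholder values for illustration only.
manifest = {
    "suite_sha": "placeholder-suite-sha",
    "dataset_version": "ci-eval-cases@v1",
    "evidence_version": "ci-logs@v17",
    "labeling_policy": "route-labels-v1",
    "code_commit": "fixture-commit-8b27f2",
}

baseline = manifest_fingerprint(manifest)
# Changing only the labeling policy must produce a different checksum.
relabeled = manifest_fingerprint({**manifest, "labeling_policy": "route-labels-v2"})
# Reordering keys must not change the checksum.
reordered = manifest_fingerprint(dict(reversed(list(manifest.items()))))

print(f"policy_change_detected={relabeled != baseline}")
print(f"key_order_ignored={reordered == baseline}")
```

Hashing the manifest separately from the cases keeps two questions apart: whether the rows changed, and whether the process that produced them changed.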

A regression pass is necessary, not sufficient

Suppose the first fix rejects unsupported root-cause claims and serves explicit test failure logs. It passes the two known cases. That doesn't show how it behaves when a dependency resolver provides a version-conflict error, which is safe to quote but has a different evidence shape.

Add representative holdout cases before selecting a candidate. Keep them separate from the original incident suite so reviewers can tell whether a candidate repaired the known failure or generalized beyond it.

holdout-suite.py

holdout_cases = (
    EvalCase("hold_001", "dependency_error", "dependency_error", "SERVE", "holdout"),
    EvalCase("hold_002", "missing_stacktrace", "no_source", "ABSTAIN", "holdout"),
)
all_cases = regression_cases + holdout_cases
holdout_fingerprint = suite_fingerprint(holdout_cases)
evaluation_fingerprint = suite_fingerprint(all_cases)

print(f"holdout_cases={len(holdout_cases)}")
print(f"holdout_sha={holdout_fingerprint}")
print(f"evaluation_sha={evaluation_fingerprint}")
print("guardrails=unsafe_claim_served==0, unnecessary_abstention==0")

Output

holdout_cases=2
holdout_sha=d47cdbbfe1caae89
evaluation_sha=94dfc930d2829d2c
guardrails=unsafe_claim_served==0, unnecessary_abstention==0

Notice the second guardrail. An application that abstains on every question is trivially safe on the first metric, but it isn't useful. unsafe_claim_served protects trust; unnecessary_abstention protects the product from an overly restrictive repair.
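To see why both guardrails are needed, consider a degenerate repair that refuses everything. The sketch below is illustrative only: `always_abstain` and `count_guardrails` are hypothetical helpers, and the case tuples simply mirror the evidence states and expected routes of the four lesson cases.

```python
# (evidence_state, expected_route) pairs mirroring the lesson's four cases.
cases = (
    ("no_source", "ABSTAIN"),
    ("test_failure_log", "SERVE"),
    ("dependency_error", "SERVE"),
    ("no_source", "ABSTAIN"),
)

def always_abstain(evidence_state: str) -> str:
    """Hypothetical degenerate repair: refuse every question."""
    return "ABSTAIN"

def count_guardrails(policy, labeled_cases) -> tuple[int, int]:
    """Return (unsafe serves, unnecessary abstentions) for a routing policy."""
    unsafe = 0
    unnecessary = 0
    for evidence_state, expected in labeled_cases:
        route = policy(evidence_state)
        if route == "SERVE" and expected == "ABSTAIN":
            unsafe += 1
        elif route == "ABSTAIN" and expected == "SERVE":
            unnecessary += 1
    return unsafe, unnecessary

unsafe, unnecessary = count_guardrails(always_abstain, cases)
print(f"unsafe_claim_served={unsafe}")
print(f"unnecessary_abstention={unnecessary}")
```

The refusing policy scores zero unsafe serves and two unnecessary abstentions, so only the second guardrail catches it.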

Evaluate the behavior, not the proposed story

The next cell models three template versions. These functions stand in for deterministic offline evaluation results so you can focus on run comparison:

v1.1-regression serves every claimed root cause, including unsupported ones.
v1.2-fix repairs the incident but recognizes only explicit test failure logs.
v1.2-fix-b handles both test failure logs and dependency-error evidence.

The candidate implementation doesn't get to read expected_route. It derives its route from evidence_state alone, and the evaluator compares that output with the frozen expectation.

evaluate-templates.py

@dataclass(frozen=True)
class Evaluation:
    template_id: str
    case_count: int
    unsafe_claim_served: int
    unnecessary_abstention: int
    exact_route_rate: float

def predict_route(template_id: str, case: EvalCase) -> str:
    if template_id == "ci-answerer-v1.1-regression":
        return "SERVE"
    if template_id == "ci-answerer-v1.2-fix":
        return "SERVE" if case.evidence_state == "test_failure_log" else "ABSTAIN"
    if template_id == "ci-answerer-v1.2-fix-b":
        supported = {"test_failure_log", "dependency_error"}
        return "SERVE" if case.evidence_state in supported else "ABSTAIN"
    raise ValueError(f"unknown template: {template_id}")

def evaluate(template_id: str, cases: tuple[EvalCase, ...]) -> Evaluation:
    predicted = [predict_route(template_id, case) for case in cases]
    unsafe = sum(
        route == "SERVE" and case.expected_route == "ABSTAIN"
        for route, case in zip(predicted, cases)
    )
    unnecessary = sum(
        route == "ABSTAIN" and case.expected_route == "SERVE"
        for route, case in zip(predicted, cases)
    )
    exact = sum(route == case.expected_route for route, case in zip(predicted, cases))
    return Evaluation(template_id, len(cases), unsafe, unnecessary, exact / len(cases))

known_fix = evaluate("ci-answerer-v1.2-fix", regression_cases)
print(f"candidate={known_fix.template_id}")
print(f"regression_unsafe={known_fix.unsafe_claim_served}")
print(f"regression_unnecessary_abstention={known_fix.unnecessary_abstention}")
print(f"regression_pass={known_fix.exact_route_rate == 1.0}")
print("decision=CONTINUE_TO_HOLDOUT")

Output

candidate=ci-answerer-v1.2-fix
regression_unsafe=0
regression_unnecessary_abstention=0
regression_pass=True
decision=CONTINUE_TO_HOLDOUT

A clean regression result is progress, not approval. It means the candidate deserves evaluation on the holdout contract that was already declared.
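The rule that a candidate never reads expected_route can also be enforced structurally, by keeping labels out of the object the candidate receives. The following is a sketch of one possible split, not the lesson's implementation: `CaseInput`, `candidate_route`, and the `labels` mapping are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CaseInput:
    """What the candidate is allowed to see: no expected route."""
    case_id: str
    evidence_state: str

def candidate_route(case: CaseInput) -> str:
    # Hypothetical rule standing in for a template under test.
    return "SERVE" if case.evidence_state == "test_failure_log" else "ABSTAIN"

inputs = (
    CaseInput("req_205", "no_source"),
    CaseInput("reg_002", "test_failure_log"),
)
# Labels live only on the evaluator's side, keyed by case id.
labels = {"req_205": "ABSTAIN", "reg_002": "SERVE"}

predictions = {case.case_id: candidate_route(case) for case in inputs}
mismatches = [case_id for case_id, route in predictions.items() if route != labels[case_id]]

print(f"label_visible_to_candidate={hasattr(inputs[0], 'expected_route')}")
print(f"mismatches={mismatches}")
```

With this shape, a candidate that tried to read the label would fail with an attribute error instead of silently scoring perfectly.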

compare-candidate-runs.py

templates = (
    "ci-answerer-v1.1-regression",
    "ci-answerer-v1.2-fix",
    "ci-answerer-v1.2-fix-b",
)
evaluations = [evaluate(template_id, all_cases) for template_id in templates]

print("template                          unsafe  unnecessary  exact")
for result in evaluations:
    print(
        f"{result.template_id:<33} "
        f"{result.unsafe_claim_served:>6} "
        f"{result.unnecessary_abstention:>12} "
        f"{result.exact_route_rate:>6.0%}"
    )

first_fix = evaluations[1]
revised_fix = evaluations[2]
print(f"first_fix_blocked={first_fix.unnecessary_abstention > 0}")
print(f"revised_fix_passes_guardrails={revised_fix.unsafe_claim_served == 0 and revised_fix.unnecessary_abstention == 0}")

Output

template                          unsafe  unnecessary  exact
ci-answerer-v1.1-regression            2            0    50%
ci-answerer-v1.2-fix                   0            1    75%
ci-answerer-v1.2-fix-b                 0            0   100%
first_fix_blocked=True
revised_fix_passes_guardrails=True

Tracking isn't a scoreboard for flattering metrics. The first fix satisfies the incident's safety condition but introduces a usefulness regression on the holdout suite: it abstains on dependency-error evidence that it should serve. Its run should remain visible and rejected, not silently replaced by the revised fix.
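Aggregate counts show that a run failed, but not where. A per-slice breakdown, sketched below under the assumption that each result row carries its slice name, makes the rejection reason easier to review. `failures_by_slice` is a hypothetical helper, and the rows restate the first fix's behavior on the four lesson cases.

```python
from collections import defaultdict

# (slice_name, expected_route, predicted_route) rows for the first fix,
# which serves only explicit test failure logs.
rows = (
    ("unsupported_root_cause", "ABSTAIN", "ABSTAIN"),
    ("confirmed_failure", "SERVE", "SERVE"),
    ("dependency_error", "SERVE", "ABSTAIN"),
    ("missing_stacktrace", "ABSTAIN", "ABSTAIN"),
)

def failures_by_slice(result_rows) -> dict[str, list[str]]:
    """Map each failing slice to the guardrails it violated."""
    failures: dict[str, list[str]] = defaultdict(list)
    for slice_name, expected, predicted in result_rows:
        if predicted == "SERVE" and expected == "ABSTAIN":
            failures[slice_name].append("unsafe_claim_served")
        elif predicted == "ABSTAIN" and expected == "SERVE":
            failures[slice_name].append("unnecessary_abstention")
    return dict(failures)

slice_failures = failures_by_slice(rows)
print(slice_failures)
```

Only the dependency_error slice appears in the result, which points the next revision at exactly one evidence shape.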

Store one evidence record per run

Now put the experiment structure around those results. A tracking service should let a reviewer answer:

What changed?
What exactly was evaluated?
Which metrics and guardrails moved?
Which evidence artifacts explain the decision?
Which incident or hypothesis motivated the run?

The local tracker below uses the same conceptual buckets as MLflow and W&B. It deliberately logs stable identifiers and a redacted exemplar report, not raw logs.

local-run-records.py

@dataclass
class RunRecord:
    run_id: str
    params: dict[str, str] = field(default_factory=dict)
    metrics: dict[str, float] = field(default_factory=dict)
    artifacts: list[str] = field(default_factory=list)
    tags: dict[str, str] = field(default_factory=dict)

class LocalTracker:
    def __init__(self) -> None:
        self.runs: list[RunRecord] = []

    def start_run(self, run_id: str) -> RunRecord:
        run = RunRecord(run_id)
        self.runs.append(run)
        return run

    def log_params(self, run: RunRecord, params: dict[str, str]) -> None:
        run.params.update(params)

    def log_metrics(self, run: RunRecord, metrics: dict[str, float]) -> None:
        run.metrics.update(metrics)

    def log_artifact(self, run: RunRecord, path: str) -> None:
        run.artifacts.append(path)

    def set_tags(self, run: RunRecord, tags: dict[str, str]) -> None:
        run.tags.update(tags)

evaluator_version = "claim-route-eval-v2"
tracker = LocalTracker()
for index, result in enumerate(evaluations, start=1):
    run = tracker.start_run(f"run_{index:03d}")
    tracker.log_params(run, {
        "template_id": result.template_id,
        "policy_id": handoff.incident_policy_id,
        "evaluator_version": evaluator_version,
        "regression_sha": regression_fingerprint,
        "holdout_sha": holdout_fingerprint,
        "evaluation_sha": evaluation_fingerprint,
        "evidence_version": handoff.evidence_version,
        "code_commit": "fixture-commit-8b27f2",
    })
    tracker.log_metrics(run, {
        "unsafe_claim_served": float(result.unsafe_claim_served),
        "unnecessary_abstention": float(result.unnecessary_abstention),
        "exact_route_rate": result.exact_route_rate,
    })
    tracker.log_artifact(run, f"reports/{result.template_id}-redacted-eval.json")
    tracker.log_artifact(run, f"templates/{result.template_id}.json")
    tracker.set_tags(run, {
        "incident_request_id": handoff.exemplar_request_id,
        "hypothesis": "restore evidence-gated CI root-cause claims",
        "raw_log_text_stored": "false",
    })

print(f"tracked_runs={len(tracker.runs)}")
print(f"shared_evaluation_sha={tracker.runs[0].params['evaluation_sha']}")
print(f"run_002_artifact={tracker.runs[1].artifacts[0]}")
print(f"run_002_template_artifact={tracker.runs[1].artifacts[1]}")
print(f"raw_log_text_stored={tracker.runs[2].tags['raw_log_text_stored']}")

Output

tracked_runs=3
shared_evaluation_sha=94dfc930d2829d2c
run_002_artifact=reports/ci-answerer-v1.2-fix-redacted-eval.json
run_002_template_artifact=templates/ci-answerer-v1.2-fix.json
raw_log_text_stored=false

All three runs point to the same regression, holdout, and combined evaluation checksums plus policy, evidence snapshot, and evaluator version. Each run also preserves its redacted report and versioned template artifact. That makes their metric comparison meaningful. If a new test case or scoring rule is needed tomorrow, update the corresponding fingerprint or evaluator version and label subsequent runs as a new comparison set.
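The raw_log_text_stored tag is a declaration, not a check. A cheap guard before logging can catch obvious leaks. The sketch below is a heuristic under stated assumptions: `raw_text_violations` is a hypothetical helper, the 120-character limit is an arbitrary choice rather than a platform rule, and the leaky value is made-up text for illustration.

```python
MAX_VALUE_LENGTH = 120  # assumed ceiling for identifier-like values

def raw_text_violations(fields: dict[str, str], max_length: int = MAX_VALUE_LENGTH) -> list[str]:
    """Return keys whose values look like pasted text rather than identifiers."""
    return sorted(
        key
        for key, value in fields.items()
        if "\n" in value or len(value) > max_length
    )

clean = {
    "template_id": "ci-answerer-v1.2-fix-b",
    "evidence_version": "ci-logs@v17",
}
# Made-up multi-line text standing in for a pasted log excerpt.
leaky = {**clean, "exemplar_text": "step failed\nsee trace below"}

print(f"clean_violations={raw_text_violations(clean)}")
print(f"leaky_violations={raw_text_violations(leaky)}")
```

A guard like this will not detect every leak, so it complements the redaction review rather than replacing it.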

Before ranking runs, audit that they are genuinely comparable. A dashboard can display results with different evaluation inputs or scoring logic side by side, but the reviewer must refuse to call that a controlled comparison.

comparability-audit.py

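def comparable(runs: list[RunRecord]) -> bool:
    # Every contract field must hold exactly one distinct value across the runs.
    comparison_fields = (
        "regression_sha",
        "holdout_sha",
        "evaluation_sha",
        "policy_id",
        "evidence_version",
        "evaluator_version",
    )
    # The loop variable is `name` so it does not shadow dataclasses.field.
    return all(
        len({run.params[name] for run in runs}) == 1
        for name in comparison_fields
    )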

changed_suite_run = RunRecord(
    run_id="run_004",
    params={**tracker.runs[2].params, "holdout_sha": "different-holdout-sha"},
)
changed_evaluator_run = RunRecord(
    run_id="run_005",
    params={**tracker.runs[2].params, "evaluator_version": "claim-route-eval-v3"},
)

print(f"tracked_runs_comparable={comparable(tracker.runs)}")
print(f"changed_suite_comparable={comparable(tracker.runs + [changed_suite_run])}")
print(f"changed_evaluator_comparable={comparable(tracker.runs + [changed_evaluator_run])}")
print("review_rule=never_rank_metrics_until_inputs_and_evaluator_match")

Output

tracked_runs_comparable=True
changed_suite_comparable=False
changed_evaluator_comparable=False
review_rule=never_rank_metrics_until_inputs_and_evaluator_match
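When drifted runs do exist, a reviewer may prefer to see them partitioned rather than just flagged. The sketch below groups runs by their contract fields so that ranking only happens inside one group. It is an assumption about how that could be done: `comparison_sets` and `CONTRACT_FIELDS` are hypothetical names, runs are plain dictionaries, and `sha-a` is a placeholder checksum.

```python
from collections import defaultdict

CONTRACT_FIELDS = ("evaluation_sha", "evaluator_version", "evidence_version", "policy_id")

def comparison_sets(runs: dict[str, dict[str, str]]) -> dict[tuple[str, ...], list[str]]:
    """Group run ids by the tuple of contract values they were evaluated under."""
    groups: dict[tuple[str, ...], list[str]] = defaultdict(list)
    for run_id, params in runs.items():
        key = tuple(params[name] for name in CONTRACT_FIELDS)
        groups[key].append(run_id)
    return dict(groups)

base = {
    "evaluation_sha": "sha-a",  # placeholder checksum
    "evaluator_version": "claim-route-eval-v2",
    "evidence_version": "ci-logs@v17",
    "policy_id": "ci-grounding-slo-v1",
}
runs = {
    "run_001": base,
    "run_002": base,
    "run_003": base,
    # Same suite, newer evaluator: belongs to a different comparison set.
    "run_005": {**base, "evaluator_version": "claim-route-eval-v3"},
}

sets = comparison_sets(runs)
for run_ids in sets.values():
    print(run_ids)
```

Metrics from run_005 stay recorded, but they land in their own group and are never ranked against the first three runs.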

How this maps to MLflow and W&B

The local tracker isn't a substitute for a shared tracking service. It exposes the record design first. Once that design is correct, the platform mappings are small.

Evidence in our run	MLflow Tracking	W&B
Experiment/run boundary	`mlflow.set_experiment(...)`, `mlflow.start_run()`	`wandb.init(project=..., name=...)`
Candidate and policy configuration	`mlflow.log_params(...)`	`run.config`
Guardrail results	`mlflow.log_metrics(...)`	`run.log(...)`
Incident and decision context	`mlflow.set_tags(...)`	run tags / config fields
Redacted report or template bundle	`mlflow.log_artifact(...)`	`run.log_artifact(...)`
Evaluation input lineage	`mlflow.log_input(...)` with dataset metadata	`run.use_artifact(...)` for a versioned input artifact

Those APIs are documented by their respective tracking, artifact, and run-tag guides.^[1]^[2]^[3]^[4] A real integration would require installing the chosen SDK and configuring storage and authentication, so these snippets are intentionally not executed here:

mlflow-shape.py

import mlflow

mlflow.set_experiment("ci-claim-gate-repair")
with mlflow.start_run(run_name="ci-answerer-v1.2-fix-b"):
    mlflow.log_params({"template_id": "...", "evaluation_sha": "..."})
    mlflow.log_metrics({"unsafe_claim_served": 0, "unnecessary_abstention": 0})
    mlflow.set_tags({"incident_request_id": "req_205", "decision": "canary_review"})
    mlflow.log_artifact("templates/ci-answerer-v1.2-fix-b.json")
    mlflow.log_artifact("reports/redacted-eval.json")

wandb-shape.py

import wandb

candidate_bundle = wandb.Artifact("v1.2-fix-b-bundle", type="candidate")
candidate_bundle.add_file("templates/ci-answerer-v1.2-fix-b.json")
candidate_bundle.add_file("reports/redacted-eval.json")

with wandb.init(
    project="ci-claim-gate-repair",
    name="ci-answerer-v1.2-fix-b",
    config={"template_id": "...", "evaluation_sha": "..."},
    tags=["incident:req_205", "canary_review"],
) as run:
    run.log({"unsafe_claim_served": 0, "unnecessary_abstention": 0})
    run.log_artifact(candidate_bundle)

Choose a platform because it fits your team's storage, access, dashboards, and deployment workflow. Don't confuse tool selection with experimental rigor. Either platform can preserve an under-specified or misleading run if you don't define inputs and guardrails first.

Figure: Forked experiment review contract where one incident trace and one pinned evaluation suite feed two repair runs; run 002 stays recorded as rejected for unnecessary abstention, while run 003 keeps a rollback target and canary-only scope because it passes both review gates. The selected branch doesn't erase the failed repair. Both outcomes stay attached to the same contract, while only run 003 carries rollback and canary scope.

Don't delete rejected experiments. Their failure reason is part of the evidence. A future reviewer should see that the first repair eliminated unsafe claims but failed a usefulness guardrail.

rejection-reasons.py

def review_status(run: RunRecord) -> str:
    if run.metrics["unsafe_claim_served"] > 0:
        return "REJECT_UNSAFE_SERVE"
    if run.metrics["unnecessary_abstention"] > 0:
        return "REJECT_UNNECESSARY_ABSTENTION"
    return "ELIGIBLE_FOR_CANARY_REVIEW"

for run in tracker.runs:
    print(f"{run.params['template_id']}={review_status(run)}")

Output

ci-answerer-v1.1-regression=REJECT_UNSAFE_SERVE
ci-answerer-v1.2-fix=REJECT_UNNECESSARY_ABSTENTION
ci-answerer-v1.2-fix-b=ELIGIBLE_FOR_CANARY_REVIEW
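For the rejection reason to survive as evidence, it helps to write it back onto the run instead of only printing it. The following sketch assumes runs are plain dictionaries with a tags field. `annotate` and the `review_status` tag key are hypothetical, and the status strings reuse the ones from the lesson.

```python
def review_status(metrics: dict[str, float]) -> str:
    """Same precedence as the lesson: safety first, then usefulness."""
    if metrics["unsafe_claim_served"] > 0:
        return "REJECT_UNSAFE_SERVE"
    if metrics["unnecessary_abstention"] > 0:
        return "REJECT_UNNECESSARY_ABSTENTION"
    return "ELIGIBLE_FOR_CANARY_REVIEW"

def annotate(runs: list[dict]) -> None:
    """Store the review outcome as a searchable tag on every run."""
    for run in runs:
        run["tags"]["review_status"] = review_status(run["metrics"])

runs = [
    {"run_id": "run_001", "metrics": {"unsafe_claim_served": 2, "unnecessary_abstention": 0}, "tags": {}},
    {"run_id": "run_002", "metrics": {"unsafe_claim_served": 0, "unnecessary_abstention": 1}, "tags": {}},
    {"run_id": "run_003", "metrics": {"unsafe_claim_served": 0, "unnecessary_abstention": 0}, "tags": {}},
]
annotate(runs)

# Rejected runs remain in the list and can be filtered by reason later.
rejected = {
    run["run_id"]: run["tags"]["review_status"]
    for run in runs
    if run["tags"]["review_status"].startswith("REJECT")
}
print(rejected)
```

Nothing is deleted: the filter reads the tag, so a later reviewer can still ask why run_002 was rejected.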

Turn passing metrics into a limited decision

Even a passing offline run isn't authority to deploy broadly. This lab's cases are synthetic and tiny. A correct decision says exactly what the evidence supports: the revised candidate may enter canary review, with a rollback pointer and continued monitoring of the same safety invariant.

For this prompt-only change, the registered object in your application's artifact registry is a template artifact. Don't pretend that every LLM application change produces a new model. If a later experiment fine-tunes model weights, MLflow Model Registry can link the registered model version to its source run and use aliases to identify a deployment target.^[5]

promotion-record.py

@dataclass(frozen=True)
class PromotionDecision:
    artifact_name: str
    alias: str
    source_run_id: str
    evaluation_sha: str
    rollback_artifact: str
    status: str
    limitation: str

def passes_contract(run: RunRecord) -> bool:
    return (
        run.metrics["unsafe_claim_served"] == 0
        and run.metrics["unnecessary_abstention"] == 0
        and run.params["evaluation_sha"] == evaluation_fingerprint
        and run.params["policy_id"] == handoff.incident_policy_id
        and run.params["evidence_version"] == handoff.evidence_version
        and run.params["evaluator_version"] == evaluator_version
        and run.tags["incident_request_id"] == handoff.exemplar_request_id
    )

def only_eligible_run(runs: list[RunRecord]) -> RunRecord:
    eligible = [run for run in runs if passes_contract(run)]
    if len(eligible) != 1:
        raise ValueError(f"expected one eligible run, found {len(eligible)}")
    return eligible[0]

selected = only_eligible_run(tracker.runs)
decision = PromotionDecision(
    artifact_name=selected.params["template_id"],
    alias="canary_candidate",
    source_run_id=selected.run_id,
    evaluation_sha=selected.params["evaluation_sha"],
    rollback_artifact="ci-answerer-v1",
    status="APPROVED_FOR_CANARY_IN_SYNTHETIC_LAB",
    limitation="requires production review and monitored canary",
)

print(f"selected={decision.artifact_name}")
print(f"source_run={decision.source_run_id}")
print(f"alias={decision.alias}")
print(f"status={decision.status}")
print(f"rollback={decision.rollback_artifact}")
print(f"limitation={decision.limitation}")

Output

selected=ci-answerer-v1.2-fix-b
source_run=run_003
alias=canary_candidate
status=APPROVED_FOR_CANARY_IN_SYNTHETIC_LAB
rollback=ci-answerer-v1
limitation=requires production review and monitored canary

The fixture has one eligible run. A real review must resolve ties explicitly rather than silently selecting the first passing candidate. The artifact, source run, fixed suite, rollback target, and limitation now travel together. If canary observability later reports another unsupported root-cause claim, the investigator can trace back to this exact decision rather than guessing which "fix" shipped.
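Explicit tie resolution could look like the sketch below, which refuses to pick among several eligible runs unless a reviewer names one. This is an illustration under assumptions: `select_candidate`, the `reviewer_choice` parameter, and `run_006` are hypothetical and don't appear in the fixture.

```python
from typing import Optional

def select_candidate(eligible_run_ids: list[str], reviewer_choice: Optional[str] = None) -> str:
    """Return the promoted run id, requiring a named choice when there is a tie."""
    if not eligible_run_ids:
        raise ValueError("no eligible run")
    if reviewer_choice is not None and reviewer_choice not in eligible_run_ids:
        raise ValueError(f"reviewer chose ineligible run: {reviewer_choice}")
    if len(eligible_run_ids) == 1:
        return eligible_run_ids[0]
    if reviewer_choice is None:
        # Never fall back to list order; that would hide the decision.
        raise ValueError(f"tie between {eligible_run_ids}; reviewer choice required")
    return reviewer_choice

print(select_candidate(["run_003"]))
print(select_candidate(["run_003", "run_006"], reviewer_choice="run_006"))
try:
    select_candidate(["run_003", "run_006"])
except ValueError as error:
    print(f"blocked: {error}")
```

Recording the reviewer's choice alongside the decision keeps the tie-break itself part of the promotion evidence.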

Promotion evidence should also define the stop action before canary traffic begins. This canary check doesn't claim the revised candidate fails. It makes the rollback rule executable for any future unsafe serve.

canary-rollback-rule.py

def canary_action(unsafe_claim_served: int, rollback_artifact: str) -> str:
    if unsafe_claim_served > 0:
        return f"ROLL_BACK_TO:{rollback_artifact}"
    return "CONTINUE_CANARY_MONITORING"

print(f"clean_window={canary_action(0, decision.rollback_artifact)}")
print(f"unsafe_window={canary_action(1, decision.rollback_artifact)}")

Output

clean_window=CONTINUE_CANARY_MONITORING
unsafe_window=ROLL_BACK_TO:ci-answerer-v1

The same record applies to training runs

This lab changed a template because that's what the incident demanded. The experiment-tracking habit generalizes when the changed artifact is a trained model:

Prompt-and-policy candidate here	Training candidate in a later run
template id and policy id	model architecture, optimizer, precision mode
frozen route-evaluation contract	train/validation dataset fingerprint and evaluator version
unsafe serves and unnecessary abstentions	loss, quality slices, numerical failures
template bundle and redacted report	checkpoint and evaluation report
canary candidate alias	registered model version and alias

The next lesson studies mixed precision. There, a run may change BF16 or FP16 settings and measure throughput, memory, NaN failures, and model-quality guardrails. Without a run record, a faster training job can look like progress even when numerical stability or final quality got worse.
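The same guardrail-first review can be sketched for training runs. Everything here is an assumption for illustration: `TrainingRun`, `review_training_run`, the 0.01 loss tolerance, and all metric values are made up, not measurements.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingRun:
    run_id: str
    precision: str
    tokens_per_second: float
    nan_steps: int
    eval_loss: float

def review_training_run(candidate: TrainingRun, baseline: TrainingRun,
                        max_loss_regression: float = 0.01) -> str:
    """Check stability and quality guardrails before crediting any speedup."""
    if candidate.nan_steps > 0:
        return "REJECT_NUMERICAL_FAILURE"
    if candidate.eval_loss > baseline.eval_loss + max_loss_regression:
        return "REJECT_QUALITY_REGRESSION"
    if candidate.tokens_per_second <= baseline.tokens_per_second:
        return "REJECT_NO_SPEEDUP"
    return "ELIGIBLE_FOR_REVIEW"

# Made-up numbers: the FP16 run is fastest but hit NaN steps.
baseline = TrainingRun("train_001", "fp32", 1000.0, 0, 2.10)
fp16_run = TrainingRun("train_002", "fp16", 1800.0, 3, 2.09)
bf16_run = TrainingRun("train_003", "bf16", 1700.0, 0, 2.105)

for run in (fp16_run, bf16_run):
    print(f"{run.run_id} ({run.precision})={review_training_run(run, baseline)}")
```

Ordering the checks this way means throughput is only considered after stability and quality have passed, which mirrors how unsafe serves were checked before usefulness in the template runs.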

Mastery check

Mastery outcomes

Capability	Working proof
Freeze a reproducible decision question	Regression, holdout, and combined suites receive deterministic fingerprints
Compare runs under one contract	Audit rejects suite or evaluator drift before metrics are ranked
Preserve incident-to-candidate lineage	Tracker records parameters, metrics, redacted artifacts, and incident tag
Bound promotion scope	Decision links canary alias, source run, rollback artifact, and production limitation

Evaluation rubric

Foundational: Distinguishes a live incident from an offline experiment and a deployment decision.
Foundational: Explains why a candidate needs fixed inputs, a fingerprint, and a fixed evaluator version.
Intermediate: Reads the three-run comparison and rejects v1.2-fix despite zero unsafe serves.
Intermediate: Identifies parameters, metrics, tags, and artifacts that make the selected run reproducible.
Advanced: Explains why a prompt-only change shouldn't be presented as a registered model version.
Advanced: Produces a canary decision that names source run, rollback target, limitation, and monitored invariant.

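Follow-up questions

Why does ci-answerer-v1.2-fix remain blocked even though it never serves an unsupported root-cause claim?
Why keep the rejected first fix in the tracker?
If the team adds a new dependency-error case next week, should its run be compared directly with these numbers?
Why isn't a shared evaluation SHA enough to call two runs comparable?
Why is the final status only APPROVED_FOR_CANARY_IN_SYNTHETIC_LAB?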

Common pitfalls

The experiment forgets the incident that motivated it

Symptom: A candidate run has nice metrics but no link to the failed production request.
Cause: The team started a fresh dashboard instead of carrying forward the incident handoff.
Fix: Tag the run with the incident exemplar, failing artifact, evidence snapshot, and required safety metric.

The comparison contract changes silently between candidates

Symptom: A later run looks better, but it used different labels, cases, or scoring logic.
Cause: Inputs or evaluator semantics were saved informally rather than fingerprinted and versioned.
Fix: Log frozen regression and holdout fingerprints plus evaluator version with every comparable run.

Zero unsafe serves hides an unusable repair

Symptom: A candidate looks safe because it abstains on every CI question.
Cause: Review tracked a safety invariant without a usefulness guardrail.
Fix: Pair unsafe_claim_served == 0 with unnecessary_abstention == 0 on supported cases.

A prompt change is called a new model

Symptom: Registry history can't tell whether weights, retrieval, or templates changed.
Cause: Every application artifact was collapsed into the word "model."
Fix: Register and version the artifact that changed; use a model registry when model weights changed.

A passing offline run is treated as deployment approval

Symptom: A four-case test produces an immediate broad rollout.
Cause: Evidence scope and promotion scope were confused.
Fix: State the limitation in the decision record and require reviewed canary monitoring before wider traffic.

Next Step

Continue to Mixed Precision Training

You can now record a controlled candidate run and reject improvements that fail guardrails. Next you'll use that experiment discipline to reason about faster training runs whose lower-precision arithmetic can change both throughput and numerical stability.

References

1. MLflow Project. ML Experiment Tracking. Official documentation, 2026.
2. Weights & Biases. Experiments overview. Official documentation, 2026.
3. Weights & Biases. Artifacts overview. Official documentation, 2026.
4. Weights & Biases. Organize runs with tags. Official documentation, 2026.
5. MLflow. Model Registry Workflows (MLflow AI Platform). 2026.

