Monitoring Predictive Models

Monitor predictive models from feature freshness through delayed labels, then gate retraining, promotion, and rollback.

A deployed model isn't done when its endpoint responds. A system that scores job-SLA risk, ranks alerts, or forecasts runner-pool load can still fail after it reaches live traffic: inputs break now, labels arrive days later, and a new model can regress under real usage.

Monitoring answers separate questions on separate clocks: can this request be scored safely, has traffic changed, did eventual outcomes remain good enough, and should a candidate replace the current release?

Figure: Predictive-model monitoring dashboard separating request-time safety, delayed outcome quality, and reversible release rollback. Request-time checks score `4/6` rows as safe and expose parity and drift problems before labels arrive. Later, quality is measured with `5/5` label coverage in the mature cohort, and a delayed review cost of `8 > 5` triggers the rollback that restores `v3`.

Start with a served release and feature traces

Use a job-SLA model as the running example. Each job request carries two model inputs:

| Feature | Meaning | Why it can fail |
| --- | --- | --- |
| `queue_backlog` | jobs waiting in the runner pool | feed can become missing or use wrong units |
| `hours_since_last_heartbeat` | age of latest scheduler heartbeat | heartbeat pipeline can become stale |

Store a feature trace with each prediction. A feature trace records the values and versions that produced one score. It lets an engineer reproduce the decision after traffic, code, or source data changes.

Build six incoming requests against deployed release job-risk-v3. Two rows contain failures that should be caught before delayed SLA labels exist.

build-live-feature-traces.py

from datetime import datetime, timedelta, timezone
from hashlib import sha256
from json import dumps
from math import log

now = datetime(2026, 2, 4, 12, 0, tzinfo=timezone.utc)
deployed = {
    "model": "job-risk-v3",
    "feature_contract": "features-v3",
    "threshold": 0.50,
}
requests = [
    {"job_id": "s-100", "runner_pool": "fc-a", "queue_backlog": 18, "hours_since_last_heartbeat": 2, "feature_as_of": now - timedelta(minutes=10), "risk_score": 0.18},
    {"job_id": "s-101", "runner_pool": "fc-a", "queue_backlog": 27, "hours_since_last_heartbeat": 8, "feature_as_of": now - timedelta(minutes=20), "risk_score": 0.44},
    {"job_id": "s-102", "runner_pool": "fc-b", "queue_backlog": 35, "hours_since_last_heartbeat": 20, "feature_as_of": now - timedelta(minutes=25), "risk_score": 0.76},
    {"job_id": "s-103", "runner_pool": "fc-b", "queue_backlog": None, "hours_since_last_heartbeat": 26, "feature_as_of": now - timedelta(minutes=15), "risk_score": None},
    {"job_id": "s-104", "runner_pool": "fc-a", "queue_backlog": 22, "hours_since_last_heartbeat": 6, "feature_as_of": now - timedelta(minutes=180), "risk_score": 0.52},
    {"job_id": "s-105", "runner_pool": "fc-b", "queue_backlog": 40, "hours_since_last_heartbeat": 30, "feature_as_of": now - timedelta(minutes=5), "risk_score": 0.91},
]
feature_traces = [
    {
        **request,
        "model": deployed["model"],
        "feature_contract": deployed["feature_contract"],
    }
    for request in requests
]

print("deployed:", deployed["model"])
print("feature contract:", deployed["feature_contract"])
print("feature traces:", len(feature_traces))
print("label status: pending")

Output

deployed: job-risk-v3
feature contract: features-v3
feature traces: 6
label status: pending

At request time, you don't know which jobs will miss their SLA. You do know whether required inputs exist and whether their timestamps are fresh enough to trust. Each trace binds those values to the model and feature-contract versions that consumed them.

Block invalid inputs before waiting for labels

Define two request-time invariants:

1. Required features can't be missing.
2. Feature snapshots can't be older than 60 minutes.

An invariant is a condition that must stay true for the system to operate safely. Check each trace and block unsafe rows instead of sending a plausible-looking score downstream.

check-request-time-input-health.py

required_features = ["queue_backlog", "hours_since_last_heartbeat"]
max_feature_age_minutes = 60

def request_health(row):
    missing = [
        name
        for name in required_features
        if row[name] is None
    ]
    feature_age = int((now - row["feature_as_of"]).total_seconds() / 60)
    reasons = []
    if missing:
        reasons.append("missing=" + ",".join(missing))
    if feature_age > max_feature_age_minutes:
        reasons.append(f"stale={feature_age}m")
    return reasons

healthy_requests = []
for row in feature_traces:
    reasons = request_health(row)
    if reasons:
        print(row["job_id"], "-> block", ";".join(reasons))
    else:
        healthy_requests.append(row)
        print(row["job_id"], "-> score")

print("scored:", len(healthy_requests))
print("blocked:", len(feature_traces) - len(healthy_requests))

Output

s-100 -> score
s-101 -> score
s-102 -> score
s-103 -> block missing=queue_backlog
s-104 -> block stale=180m
s-105 -> score
scored: 4
blocked: 2

Job s-103 lacks backlog data. Job s-104 carries a three-hour-old snapshot. A label-based accuracy report would discover the unsafe scoring path too late.

Sculley et al. describe unstable data dependencies, input-data testing, and live monitoring as central production ML concerns [1]. Their point is practical: models depend on surrounding data systems, not model code alone.
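The feature table notes that `queue_backlog` can arrive in the wrong units, which neither invariant above would catch. A minimal sketch of a third invariant follows. It assumes hypothetical plausibility bounds: `plausible_ranges`, its numbers, and `range_violations` are illustrative names and values, not part of the lab's feature contract.

```python
# Hypothetical plausibility bounds; real limits would come from the feature contract.
plausible_ranges = {
    "queue_backlog": (0, 500),
    "hours_since_last_heartbeat": (0, 72),
}

def range_violations(row):
    """Return reasons for present values that fall outside their plausible range."""
    reasons = []
    for name, (low, high) in plausible_ranges.items():
        value = row.get(name)
        if value is None:
            continue  # missing values belong to the missing-feature invariant
        if not low <= value <= high:
            reasons.append(f"out_of_range={name}:{value}")
    return reasons

# A backlog reported in the wrong unit looks implausibly large.
print(range_violations({"queue_backlog": 18, "hours_since_last_heartbeat": 2}))
print(range_violations({"queue_backlog": 18000, "hours_since_last_heartbeat": 2}))
```

A range check of this kind only catches gross unit errors. A smaller scaling mistake can stay inside the bounds, which is one reason drift and parity checks are still needed.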

Measure drift without calling it failure

Healthy rows can still look different from training traffic. Data drift means an input distribution changed over time. It's a reason to inspect, not proof that accuracy fell [2].

Compare the historical and current distributions for hours_since_last_heartbeat:

| Bucket | Training window | Current window | Change |
| --- | --- | --- | --- |
| `0-4h` | 50% | 30% | -20 points |
| `4-12h` | 30% | 25% | -5 points |
| `12-24h` | 15% | 25% | +10 points |
| `24h+` | 5% | 20% | +15 points |

Older heartbeats are more common now. A scheduler outage could cause that shift, but so could a holiday or runner-pool mix change.

This lab summarizes the bucket changes with the Population Stability Index (PSI):

$$
\text{PSI} = \sum_i (a_i - e_i)\log\left(\frac{a_i}{e_i}\right)
$$

Here $e_i$ is the historical share for bucket $i$, $a_i$ is its current share, and $\log$ is the natural logarithm. This exercise uses non-zero shares in every bucket so the logarithm is defined. Real monitoring code must choose an explicit policy for empty buckets.
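As a worked check with the shares from the table, assuming natural logarithms as in Python's `math.log`, the four bucket terms are:

$$
\begin{aligned}
\text{0-4h:}\quad & (0.30 - 0.50)\log(0.30 / 0.50) \approx 0.1022 \\
\text{4-12h:}\quad & (0.25 - 0.30)\log(0.25 / 0.30) \approx 0.0091 \\
\text{12-24h:}\quad & (0.25 - 0.15)\log(0.25 / 0.15) \approx 0.0511 \\
\text{24h+:}\quad & (0.20 - 0.05)\log(0.20 / 0.05) \approx 0.2079 \\
\text{PSI}\quad & \approx 0.1022 + 0.0091 + 0.0511 + 0.2079 = 0.3703
\end{aligned}
$$

Every term is non-negative because the difference and the logarithm always share a sign. The `24h+` bucket contributes more than half of the total, so the summary is dominated by the growth in very old heartbeats.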

Calculate PSI and apply a local investigation threshold of 0.20.

calculate-input-drift-diagnostic.py

bucket_names = ["0-4h", "4-12h", "12-24h", "24h+"]
reference_counts = [50, 30, 15, 5]
current_counts = [30, 25, 25, 20]

def normalize(counts):
    total = sum(counts)
    return [count / total for count in counts]

def population_stability_index(reference, current):
    return sum(
        (actual - expected) * log(actual / expected)
        for expected, actual in zip(reference, current)
    )

reference = normalize(reference_counts)
current = normalize(current_counts)
psi = population_stability_index(reference, current)
investigate_at = 0.20

for name, before, after in zip(bucket_names, reference, current):
    print(f"{name}: reference={before:.2f} current={after:.2f} delta={after - before:+.2f}")
print("PSI:", round(psi, 3))
print("local action:", "inspect shift" if psi >= investigate_at else "continue")

Output

0-4h: reference=0.50 current=0.30 delta=-0.20
4-12h: reference=0.30 current=0.25 delta=-0.05
12-24h: reference=0.15 current=0.25 delta=+0.10
24h+: reference=0.05 current=0.20 delta=+0.15
PSI: 0.37
local action: inspect shift

PSI is a compact diagnostic here, and 0.20 is a local investigation threshold, not a universal rule. The output says heartbeat age changed enough to inspect. It doesn't say release v3 is inaccurate, and it doesn't say retraining is the first fix.
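The PSI definition earlier notes that real monitoring code needs an explicit policy for empty buckets. One possible policy, shown here as a sketch rather than the lab's method, adds a small pseudo-count to every bucket before normalizing. The `epsilon` value and both function names are assumptions chosen for illustration.

```python
from math import log

def smoothed_shares(counts, epsilon=0.5):
    """Add a pseudo-count to every bucket so that no share is exactly zero."""
    total = sum(counts) + epsilon * len(counts)
    return [(count + epsilon) / total for count in counts]

def psi_with_smoothing(reference_counts, current_counts, epsilon=0.5):
    """PSI over smoothed shares; defined even when a raw bucket count is zero."""
    reference = smoothed_shares(reference_counts, epsilon)
    current = smoothed_shares(current_counts, epsilon)
    return sum(
        (actual - expected) * log(actual / expected)
        for expected, actual in zip(reference, current)
    )

# The reference window has no rows in the last bucket; unsmoothed PSI would divide by zero.
empty_bucket_psi = psi_with_smoothing([50, 30, 20, 0], [30, 25, 25, 20])
print("PSI with an empty reference bucket:", round(empty_bucket_psi, 3))
```

Smoothing shifts the value slightly even when no bucket is empty, so a threshold tuned on unsmoothed PSI should be revisited if this policy is adopted.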

Reproduce training-serving skew

Training-serving skew means production computes an input differently from training. Suppose training divided backlog count by staffed runner-pool capacity, while a serving change divides by a fixed 100.

Probe the same raw row through both transformations.

reproduce-training-serving-skew.py

probe = {"queue_backlog": 20, "staffed_capacity": 50}

def training_transform(row):
    return round(row["queue_backlog"] / row["staffed_capacity"], 3)

def buggy_serving_transform(row):
    return round(row["queue_backlog"] / 100, 3)

training_value = training_transform(probe)
serving_value = buggy_serving_transform(probe)
print("training backlog ratio:", training_value)
print("serving backlog ratio:", serving_value)
print("parity:", training_value == serving_value)

Output

training backlog ratio: 0.4
serving backlog ratio: 0.2
parity: False

The same raw row yields a backlog ratio of 0.4 in training and 0.2 online. A model retrained on the existing training function won't repair that serving bug. Fix the shared transformation contract and keep a parity test in the release gate.

Google Cloud's MLOps guidance includes unit tests for feature engineering, prediction-service tests with expected inputs, data validation, and predictive-performance validation before deployment [3].
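A parity test for the release gate can be sketched as a function that runs several raw probe rows through both code paths and reports any disagreement. The names `parity_failures`, `serving_transform`, and `probe_rows` are hypothetical, and the probe values are made up for this example.

```python
def training_transform(row):
    return round(row["queue_backlog"] / row["staffed_capacity"], 3)

def serving_transform(row):
    # Repaired serving path: same capacity-based ratio as training.
    return round(row["queue_backlog"] / row["staffed_capacity"], 3)

def buggy_serving_transform(row):
    return round(row["queue_backlog"] / 100, 3)

def parity_failures(rows, training_fn, serving_fn, tolerance=1e-9):
    """Return (row, training_value, serving_value) for every probe that disagrees."""
    failures = []
    for row in rows:
        training_value = training_fn(row)
        serving_value = serving_fn(row)
        if abs(training_value - serving_value) > tolerance:
            failures.append((row, training_value, serving_value))
    return failures

probe_rows = [
    {"queue_backlog": 20, "staffed_capacity": 50},
    {"queue_backlog": 35, "staffed_capacity": 100},  # capacity 100 hides the bug
    {"queue_backlog": 40, "staffed_capacity": 80},
]

fixed = parity_failures(probe_rows, training_transform, serving_transform)
buggy = parity_failures(probe_rows, training_transform, buggy_serving_transform)
print("parity gate with repaired serving path:", not fixed)
print("probes that expose the buggy path:", len(buggy))
```

The second probe agrees under both serving paths because its capacity happens to equal the hard-coded 100. A parity gate therefore benefits from several probes that cover different capacities.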

Join delayed labels back to stored predictions

Request-time checks move fast because they don't need outcomes. Accuracy moves slower. An SLA-miss label may arrive only after the promised job window closes.

Store model ID, score, job ID, decision context, and the time when each outcome becomes mature. Later, join outcomes by immutable prediction ID. Compute quality on cohorts whose outcome windows have closed, and report label coverage inside those cohorts. Labels that happen to arrive early can be a biased subset; a missing label in a mature cohort is a coverage gap, not an on-schedule outcome.

join-delayed-job-SLA-labels.py

predictions = [
    {"prediction_id": "p-001", "job_id": "s-200", "risk_score": 0.15, "priority_job": False, "label_due_at": now - timedelta(hours=5)},
    {"prediction_id": "p-002", "job_id": "s-201", "risk_score": 0.32, "priority_job": False, "label_due_at": now - timedelta(hours=4)},
    {"prediction_id": "p-003", "job_id": "s-202", "risk_score": 0.68, "priority_job": True, "label_due_at": now - timedelta(hours=3)},
    {"prediction_id": "p-004", "job_id": "s-203", "risk_score": 0.78, "priority_job": False, "label_due_at": now - timedelta(hours=2)},
    {"prediction_id": "p-005", "job_id": "s-204", "risk_score": 0.91, "priority_job": True, "label_due_at": now - timedelta(hours=1)},
    {"prediction_id": "p-006", "job_id": "s-205", "risk_score": 0.84, "priority_job": True, "label_due_at": now + timedelta(hours=1)},
]
decision_policy = "missed-sla-threshold-050-v1"
predictions = [
    {
        **prediction,
        "model": deployed["model"],
        "feature_contract": deployed["feature_contract"],
        "decision_policy": decision_policy,
        "decision_threshold": deployed["threshold"],
    }
    for prediction in predictions
]
labels = {
    "p-001": False,
    "p-002": False,
    "p-003": True,
    "p-004": False,
    "p-005": True,
}

matured = [prediction for prediction in predictions if prediction["label_due_at"] <= now]
immature = [prediction for prediction in predictions if prediction["label_due_at"] > now]
matured_labeled = []
missing_matured_labels = []
for prediction in matured:
    prediction_id = prediction["prediction_id"]
    if prediction_id in labels:
        matured_labeled.append({**prediction, "missed_sla": labels[prediction_id]})
    else:
        missing_matured_labels.append(prediction_id)

label_coverage = len(matured_labeled) / len(matured) if matured else 0.0

print("stored predictions:", len(predictions))
print("matured predictions:", len(matured))
print("joined matured labels:", len(matured_labeled))
print(f"matured label coverage: {label_coverage:.3f}")
print("missing matured labels:", missing_matured_labels)
print("immature predictions:", [row["prediction_id"] for row in immature])

Output

stored predictions: 6
matured predictions: 5
joined matured labels: 5
matured label coverage: 1.000
missing matured labels: []
immature predictions: ['p-006']

Prediction p-006 is excluded because its SLA outcome window hasn't closed. Label arrival doesn't define cohort maturity. The five-row mature cohort has 100% label coverage. If a mature row were still unlabeled, the report would expose the coverage gap and withhold the quality decision rather than assume the observed subset is representative.
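The withheld-decision behavior described above can be sketched with a cohort where one mature label never arrived. The function name `quality_window_status` and the `required_coverage` parameter are assumptions for illustration, and `partial_labels` is made-up data.

```python
def quality_window_status(matured_ids, labels, required_coverage=1.0):
    """Summarize whether a mature cohort has enough labels to report quality."""
    joined = [pid for pid in matured_ids if pid in labels]
    missing = [pid for pid in matured_ids if pid not in labels]
    coverage = len(joined) / len(matured_ids) if matured_ids else 0.0
    return {
        "coverage": round(coverage, 3),
        "missing": missing,
        # An empty cohort never qualifies: there is nothing to measure.
        "report_quality": bool(matured_ids) and coverage >= required_coverage,
    }

matured_ids = ["p-001", "p-002", "p-003", "p-004", "p-005"]
partial_labels = {"p-001": False, "p-002": False, "p-003": True, "p-005": True}

status = quality_window_status(matured_ids, partial_labels)
print(status)
```

Here `p-004` is mature but unlabeled, so coverage is 4 of 5 and the quality decision is withheld instead of being computed on the four labels that did arrive.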

Evaluate decisions and calibration

A binary decision uses deployed threshold 0.50: scores at or above the threshold enter job-SLA review. Require complete label coverage for this mature lab cohort, then compute precision, recall, priority-job misses, and a small review-cost receipt.

measure-delayed-quality-window.py

threshold = deployed["threshold"]

if missing_matured_labels:
    raise RuntimeError("quality metrics unavailable: mature cohort has missing labels")

def predicted_missed_sla(row):
    return row["risk_score"] >= threshold

true_positives = sum(predicted_missed_sla(row) and row["missed_sla"] for row in matured_labeled)
false_positives = sum(predicted_missed_sla(row) and not row["missed_sla"] for row in matured_labeled)
false_negatives = sum(not predicted_missed_sla(row) and row["missed_sla"] for row in matured_labeled)

def rate_or_none(numerator, denominator):
    return round(numerator / denominator, 3) if denominator else None

precision = rate_or_none(true_positives, true_positives + false_positives)
recall = rate_or_none(true_positives, true_positives + false_negatives)
missed_priority_job = sum(
    row["priority_job"] and row["missed_sla"] and not predicted_missed_sla(row)
    for row in matured_labeled
)
review_cost = false_positives * 2 + false_negatives * 10

print("precision:", precision)
print("recall:", recall)
print("missed priority jobs:", missed_priority_job)
print("review cost:", review_cost)

Output

precision: 0.667
recall: 1.0
missed priority jobs: 0
review cost: 2

The local cost policy assigns 2 units to an unnecessary review and 10 to a missed SLA. It's an example decision policy, not a universal business value. If a denominator is zero, report None: missing evidence isn't the same as measured failure.
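Worked through with the counts from the mature cohort (two true positives, one false positive from `p-004`, and no false negatives), the receipt follows directly:

$$
\begin{aligned}
\text{precision} &= \frac{TP}{TP + FP} = \frac{2}{2 + 1} \approx 0.667 \\
\text{recall} &= \frac{TP}{TP + FN} = \frac{2}{2 + 0} = 1.0 \\
\text{review cost} &= 2 \cdot FP + 10 \cdot FN = 2 \cdot 1 + 10 \cdot 0 = 2
\end{aligned}
$$

Under this policy a single false negative costs as much as five unnecessary reviews.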

Scores also need a calibration check. A score near 0.80 shouldn't be interpreted as an 80% risk unless outcomes from adequately covered mature cohorts support that use [4]. Compare the average score with the observed SLA-miss rate inside two tiny buckets.

measure-risk-score-calibration.py

score_buckets = [
    ("low", [row for row in matured_labeled if row["risk_score"] < 0.50]),
    ("high", [row for row in matured_labeled if row["risk_score"] >= 0.50]),
]

for name, rows_in_bucket in score_buckets:
    if not rows_in_bucket:
        print(f"{name}: n=0 average_score=None observed_missed_sla_rate=None")
        continue
    average_score = sum(row["risk_score"] for row in rows_in_bucket) / len(rows_in_bucket)
    observed_rate = sum(row["missed_sla"] for row in rows_in_bucket) / len(rows_in_bucket)
    print(
        f"{name}: n={len(rows_in_bucket)} "
        f"average_score={average_score:.3f} "
        f"observed_missed_sla_rate={observed_rate:.3f}"
    )

Output

low: n=2 average_score=0.235 observed_missed_sla_rate=0.000
high: n=3 average_score=0.790 observed_missed_sla_rate=0.667

Five labels with full mature-cohort coverage are enough to illustrate the check, but not enough to validate calibration. An empty bucket should report None instead of pretending evidence exists. A production report needs larger mature windows, explicit overall and slice coverage, and slices such as runner pool, service tier, geography, and forecast age.
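The per-bucket comparison can be folded into one number by weighting each bucket's gap by its share of rows. This is a sketch in the spirit of the expected calibration error used in the calibration literature, adapted here to binary risk scores. The function name and the bucket edges are assumptions, and five rows remain far too few for the result to mean much.

```python
def weighted_calibration_gap(rows, edges):
    """Weight each bucket's |mean score - observed rate| by its share of rows."""
    if not rows:
        return None  # no evidence, so no gap to report
    gap = 0.0
    last = len(edges) - 2
    for index, (low, high) in enumerate(zip(edges[:-1], edges[1:])):
        bucket = [
            row for row in rows
            if low <= row["risk_score"] < high
            or (index == last and row["risk_score"] == high)  # top edge is inclusive
        ]
        if not bucket:
            continue  # an empty bucket contributes no evidence
        mean_score = sum(row["risk_score"] for row in bucket) / len(bucket)
        observed_rate = sum(row["missed_sla"] for row in bucket) / len(bucket)
        gap += len(bucket) / len(rows) * abs(mean_score - observed_rate)
    return gap

# Same scores and outcomes as the five-row mature cohort.
rows = [
    {"risk_score": 0.15, "missed_sla": False},
    {"risk_score": 0.32, "missed_sla": False},
    {"risk_score": 0.68, "missed_sla": True},
    {"risk_score": 0.78, "missed_sla": False},
    {"risk_score": 0.91, "missed_sla": True},
]
gap = weighted_calibration_gap(rows, edges=[0.0, 0.5, 1.0])
print("weighted calibration gap:", round(gap, 3))
```

A single summary hides which bucket is off, so it works best alongside the per-bucket table rather than in place of it.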

Triage repair, inspection, and retraining separately

Drift, skew, and delayed quality degradation can appear together. They don't imply the same action:

| Evidence | First action |
| --- | --- |
| missing or stale online features | repair data path |
| parity probe fails | repair training-serving contract |
| distribution shifts | inspect source health and business context |
| delayed decision cost regresses after inputs are valid | train frozen-snapshot candidate |

Encode that order. The first scenario has broken data and parity. The second has clean inputs plus delayed quality regression.

triage-monitoring-signals.py

def triage(schema_failures, stale_features, parity_ok, psi, cost_regressed):
    actions = []
    if schema_failures or stale_features:
        actions.append("repair online data path")
    if not parity_ok:
        actions.append("repair training-serving parity")
    if psi >= investigate_at:
        actions.append("inspect distribution shift")
    if not schema_failures and not stale_features and parity_ok and cost_regressed:
        actions.append("train frozen-snapshot candidate")
    return actions

print("dirty window:", triage(1, 1, False, psi, True))
print("clean changed window:", triage(0, 0, True, psi, True))

Output

dirty window: ['repair online data path', 'repair training-serving parity', 'inspect distribution shift']
clean changed window: ['inspect distribution shift', 'train frozen-snapshot candidate']

Continuous training doesn't mean every alert immediately overwrites production. Google Cloud documents scheduled, new-data, performance-degradation, and distribution-change triggers for retraining pipelines, with metadata and model-validation stages around those pipelines [3].

Figure: diagram with four stages: production requests (feature traces), fast checks (schema + freshness), predictions (release v3), and delayed labels with a quality window (cost + slices).

Freeze an immutable candidate bundle

When clean evidence justifies retraining, freeze the snapshot, training pipeline, feature contract, decision policy, evaluation policy, and previous model pointer. The resulting candidate bundle is an immutable release proposal. Promotion moves an alias; it doesn't rewrite the bundle.

Create a deterministic bundle ID from the candidate configuration.

publish-immutable-candidate-bundle.py

candidate_config = {
    "model": "job-risk-v4",
    "snapshot": "labels-through-2026-02-03",
    "training_pipeline": "job-risk-train-v4",
    "feature_contract": "features-v3",
    "decision_policy": decision_policy,
    "threshold": 0.50,
    "evaluation_policy": "job-risk-eval-v1",
    "previous_model": deployed["model"],
}
payload = dumps(candidate_config, sort_keys=True, separators=(",", ":"))
bundle_id = "job-risk-v4-" + sha256(payload.encode()).hexdigest()[:10]
candidate_bundle = {**candidate_config, "bundle_id": bundle_id}

print("candidate bundle:", candidate_bundle["bundle_id"])
print("rollback pointer:", candidate_bundle["previous_model"])

Output

candidate bundle: job-risk-v4-edde9c2176
rollback pointer: job-risk-v3

Google Cloud's MLOps guidance recommends recording pipeline versions, arguments, produced artifacts, evaluation metrics, and a pointer to the previous trained model for rollback or comparison [3].
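The bundle ID is deterministic because the configuration is serialized in a canonical form before hashing. A small sketch of that property follows, using the same `dumps` and `sha256` calls as the lab. The helper name `bundle_digest` and the three trimmed-down configurations are illustrative.

```python
from hashlib import sha256
from json import dumps

def bundle_digest(config):
    """Hash a canonical JSON form so key order cannot change the ID."""
    payload = dumps(config, sort_keys=True, separators=(",", ":"))
    return sha256(payload.encode()).hexdigest()[:10]

config_a = {"model": "job-risk-v4", "threshold": 0.50, "feature_contract": "features-v3"}
config_b = {"feature_contract": "features-v3", "threshold": 0.50, "model": "job-risk-v4"}
config_c = {**config_a, "threshold": 0.55}  # one policy value changed

print("same content, different key order:", bundle_digest(config_a) == bundle_digest(config_b))
print("changed threshold:", bundle_digest(config_a) == bundle_digest(config_c))
```

Any change to a frozen field produces a new ID, so a bundle can't be edited in place without detection. The 10-character prefix is convenient for display. A registry would probably store the full digest as well.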

Gate offline metrics and canary traffic

Compare candidate v4 with current v3 before changing live traffic. These thresholds are explicit local policies:

check-offline-promotion-gates.py

current_metrics = {
    "recall": 0.80,
    "priority_misses": 1,
    "review_cost": 6,
    "p95_latency_ms": 40,
}
candidate_metrics = {
    "recall": 0.90,
    "priority_misses": 0,
    "review_cost": 4,
    "p95_latency_ms": 42,
}
evaluation_cohort = {
    "maturity_cutoff": "2026-02-03T12:00:00Z",
    "matured_predictions": 1000,
    "joined_labels": 1000,
}
evaluation_cohort["label_coverage"] = (
    evaluation_cohort["joined_labels"] / evaluation_cohort["matured_predictions"]
)
offline_evaluation = {
    "policy": candidate_bundle["evaluation_policy"],
    "label_snapshot": candidate_bundle["snapshot"],
    "cohort": evaluation_cohort,
    "current": current_metrics,
    "candidate": candidate_metrics,
}
offline_gates = {
    "mature_label_coverage": evaluation_cohort["label_coverage"] == 1.0,
    "recall": candidate_metrics["recall"] >= current_metrics["recall"],
    "priority_misses": candidate_metrics["priority_misses"] == 0,
    "review_cost": candidate_metrics["review_cost"] <= current_metrics["review_cost"],
    "latency_budget": candidate_metrics["p95_latency_ms"] <= 50,
}

for name, passed in offline_gates.items():
    print(f"{name}: {passed}")
print("offline gate:", all(offline_gates.values()))

Output

mature_label_coverage: True
recall: True
priority_misses: True
review_cost: True
latency_budget: True
offline gate: True

Offline evidence is necessary but incomplete. A canary sends limited live traffic to a candidate before full promotion. Gate fast signals such as service errors, latency, request count, and immediate review-queue pressure. Those signals can stop a bad rollout quickly, but they don't replace delayed-label quality.

check-canary-promotion-gates.py

canary_metrics = {
    "requests": 500,
    "error_rate": 0.002,
    "p95_latency_ms": 44,
    "review_queue_load_ratio": 0.72,
}
canary_gates = {
    "enough_requests": canary_metrics["requests"] >= 500,
    "error_rate": canary_metrics["error_rate"] <= 0.005,
    "p95_latency_ms": canary_metrics["p95_latency_ms"] <= 50,
    "review_queue_load_ratio": canary_metrics["review_queue_load_ratio"] <= 0.80,
}
canary_evaluation = {
    "policy": "job-risk-canary-v1",
    "metrics": canary_metrics,
    "checks": canary_gates,
}

for name, passed in canary_gates.items():
    print(f"{name}: {passed}")
print("canary gate:", all(canary_gates.values()))

Output

enough_requests: True
error_rate: True
p95_latency_ms: True
review_queue_load_ratio: True
canary gate: True

Argo Rollouts documents canary steps, metric analysis, unsuccessful-analysis aborts, and post-promotion analysis that can switch traffic back to a previous stable release [5]. The exact deployment tool may differ, but promotion needs measured conditions and a rollback path.
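How limited traffic reaches the candidate depends on the deployment tool. As one tool-independent illustration, and not a description of how Argo Rollouts splits traffic, the sketch below assigns jobs to the canary by hashing the job ID so the same job always takes the same path. The names `routes_to_canary` and `canary_percent` are hypothetical.

```python
from hashlib import sha256

def routes_to_canary(job_id, canary_percent):
    """Deterministically assign a job to the canary using a hash of its ID."""
    digest = sha256(job_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return bucket < canary_percent

job_ids = [f"s-{number}" for number in range(1000)]
canary_share = sum(routes_to_canary(job_id, 5) for job_id in job_ids) / len(job_ids)
print(f"canary share at 5%: {canary_share:.3f}")
```

Deterministic assignment keeps each stored prediction attributable to exactly one release, which matters later when delayed labels are joined back per model.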

Promote an alias, then rehearse rollback

Keep immutable bundle IDs separate from a movable production alias. If both gate sets pass, move the alias from v3 to the candidate. Then simulate a slower delayed-label guardrail arriving after promotion. The cost threshold is evaluated only after the declared mature cohort reaches full label coverage; an incomplete cohort would report insufficient evidence rather than treat early labels as representative.

promote-and-rollback-model-alias.py

alias = {"production": deployed["model"]}
rollback_pointer = alias["production"]

if all(offline_gates.values()) and all(canary_gates.values()):
    alias["production"] = candidate_bundle["bundle_id"]

print("production alias:", alias["production"])
print("rollback pointer:", rollback_pointer)

post_promotion = {
    "matured_predictions": 500,
    "joined_labels": 500,
    "delayed_review_cost": 8,
}
post_promotion["label_coverage"] = (
    post_promotion["joined_labels"] / post_promotion["matured_predictions"]
)
rollback_limit = 5
rollback_reason = None
if (
    post_promotion["label_coverage"] == 1.0
    and post_promotion["delayed_review_cost"] > rollback_limit
):
    rollback_reason = (
        f"delayed review cost {post_promotion['delayed_review_cost']} "
        f"> {rollback_limit}"
    )
    alias["production"] = rollback_pointer

print(f"post-promotion label coverage: {post_promotion['label_coverage']:.3f}")
print("delayed review cost:", post_promotion["delayed_review_cost"])
print("final alias:", alias["production"])

receipt = {
    "candidate_bundle": candidate_bundle,
    "previous": rollback_pointer,
    "offline_evaluation": {**offline_evaluation, "checks": offline_gates},
    "canary_evaluation": canary_evaluation,
    "post_promotion": {**post_promotion, "rollback_limit": rollback_limit},
    "rollback_reason": rollback_reason,
    "release_action": "rolled_back" if rollback_reason else "keep_candidate",
    "final_alias": alias["production"],
}
receipt_json = dumps(receipt, sort_keys=True, separators=(",", ":"))
print("release action:", receipt["release_action"])
print("receipt sha256:", sha256(receipt_json.encode()).hexdigest()[:12])

Output

production alias: job-risk-v4-edde9c2176
rollback pointer: job-risk-v3
post-promotion label coverage: 1.000
delayed review cost: 8
final alias: job-risk-v3
release action: rolled_back
receipt sha256: 54de7af4a622

The rehearsal promotes v4, waits for a fully labeled 500-row mature cohort, observes delayed review cost 8 above local limit 5, and restores alias v3. Its receipt binds candidate lineage, offline evidence, canary evidence, delayed-label coverage, and the derived rollback reason. A rollback drill is stronger than a diagram: it proves the pointer and policy work together.
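The rehearsal exercises only the fully labeled case. A sketch of the three outcomes the delayed guardrail could report, including the insufficient-evidence case described earlier, might look like this. The function name and the returned strings are assumptions, not part of the lab's receipt format.

```python
def post_promotion_decision(matured, joined, delayed_cost, cost_limit):
    """Return a release action only when the mature cohort is fully labeled."""
    if matured == 0 or joined < matured:
        return "insufficient_evidence"  # early labels may be a biased subset
    return "roll_back" if delayed_cost > cost_limit else "keep_candidate"

print(post_promotion_decision(matured=500, joined=500, delayed_cost=8, cost_limit=5))
print(post_promotion_decision(matured=500, joined=430, delayed_cost=8, cost_limit=5))
print(post_promotion_decision(matured=500, joined=500, delayed_cost=4, cost_limit=5))
```

Making the middle state explicit keeps an incomplete cohort from being read as either a pass or a failure.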

Explain the release loop without looking back

Before continuing, explain the full path in your own words:

1. Which checks protect a job before delayed labels exist?
2. Why is drift evidence different from quality evidence?
3. Why can't retraining fix a training-serving skew bug?
4. Why must quality metrics use mature cohorts and report label coverage?
5. Which artifacts make a candidate reproducible?
6. Why do offline gates, canary gates, and rollback drills solve different problems?

Practice

1. Change the `s-104` feature age from 180 to 30 minutes. Predict how many requests score before rerunning the lab.
2. Change the current `24h+` heartbeat count from 20 to 5 and move the removed 15 into `0-4h`. Predict whether PSI rises or falls.
3. Fix `buggy_serving_transform()` to divide by `staffed_capacity`. Explain which triage action disappears.
4. Advance `now` past the `label_due_at` of `p-006`, add the label `p-006: True`, and compute the new precision, recall, and mature-cohort coverage.
5. Change the canary `p95_latency_ms` from 44 to 65. Explain whether the production alias ever moves to v4.
6. Remove the `rollback_pointer` assignment. Explain why a model registry entry alone doesn't prove rollback works.

Practice answer sketches

| Prompt | Reasoning check |
| --- | --- |
| Freshen `s-104` | Five requests score. Only `s-103` remains blocked for missing backlog. |
| Restore recent-heartbeat mix | PSI falls because current traffic moves closer to reference traffic. |
| Fix serving transform | Training-serving parity becomes `True`, so `repair training-serving parity` disappears. |
| Resolve `p-006` as missed_sla | True positives rise from `2` to `3`; precision rises to `3 / 4 = 0.75`; recall remains `1.0`; mature-cohort coverage is `6 / 6 = 1.0`. |
| Slow canary | `p95_latency_ms` gate fails. Alias remains `job-risk-v3`. |
| Delete rollback pointer | Candidate may still be reproducible, but rollback target is no longer proven by execution. |

What strong answers show

| Evidence | A strong explanation demonstrates |
| --- | --- |
| fast controls | blocks missing or stale features before delayed labels arrive |
| drift reasoning | treats PSI as a local investigation diagnostic, not an accuracy verdict |
| parity | reproduces skew with one raw row and fixes contract before retraining |
| delayed quality | joins immutable prediction IDs, gates on mature cohorts and label coverage, and checks slices |
| candidate control | freezes snapshot, training pipeline, feature contract, decision policy, evaluation policy, and previous model pointer |
| rollout | separates offline gate, limited live canary, promotion alias, and rollback receipt |

When monitoring breaks

| Symptom | Cause | Fix |
| --- | --- | --- |
| Accuracy report arrives after days of bad inputs | only label metrics are monitored | add schema and freshness alarms |
| Every drift alert starts retraining | drift is confused with failure | inspect data path, context, and delayed outcomes |
| New training run repeats wrong scores | serving transform differs from training | add parity probes before retraining |
| Report looks good while labels are still arriving | metrics use whichever labels arrived first | define mature cohorts, report label coverage, and block or qualify incomplete windows |
| New model can't be rolled back cleanly | bundle and alias aren't versioned separately | publish immutable candidate plus rollback pointer |
| Canary passes but delayed cost later rises | only fast rollout metrics were checked | keep post-promotion guardrails and rehearse rollback |

You can build, serve, monitor, promote, and roll back conventional predictive models with explicit evidence. Next, apply the same measurement discipline to language models whose capabilities grow with data and computation.

References

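[1] Sculley et al. (2015). Hidden Technical Debt in Machine Learning Systems. https://research.google/pubs/hidden-technical-debt-in-machine-learning-systems/

[2] Klaise, J., et al. (2020). Monitoring Machine Learning Models in Production. https://arxiv.org/abs/2007.06299

[3] Google Cloud (2026). MLOps: Continuous Delivery and Automation Pipelines in Machine Learning. Official documentation. https://docs.cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning

[4] Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On Calibration of Modern Neural Networks. https://arxiv.org/abs/1706.04599

[5] Argo Project (2026). Argo Rollouts - Kubernetes Progressive Delivery Controller. https://argoproj.github.io/argo-rollouts/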

    else:
        healthy_requests.append(row)
        print(row["job_id"], "-> score")

print("scored:", len(healthy_requests))
print("blocked:", len(feature_traces) - len(healthy_requests))

Output

s-100 -> score
s-101 -> score
s-102 -> score
s-103 -> block missing=queue_backlog
s-104 -> block stale=180m
s-105 -> score
scored: 4
blocked: 2

Job s-103 lacks backlog data. Job s-104 carries a three-hour-old snapshot. A label-based accuracy report would discover the unsafe scoring path too late.
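
The feature table also lists wrong units as a failure mode, which the two invariants above don't cover. A minimal sketch of a third, range-based check follows. The bounds in `plausible_ranges` and the helper name `range_violations` are illustrative assumptions, not part of the lab's feature contract.

```python
# Hypothetical plausible ranges; real bounds would come from the feature contract.
plausible_ranges = {
    "queue_backlog": (0, 10_000),
    "hours_since_last_heartbeat": (0, 24 * 14),
}

def range_violations(row, ranges=plausible_ranges):
    """Return reasons for features that are present but outside their range."""
    reasons = []
    for name, (low, high) in ranges.items():
        value = row.get(name)
        if value is None:
            continue  # missing values belong to the required-feature check
        if not low <= value <= high:
            reasons.append(f"out_of_range={name}:{value}")
    return reasons

# A heartbeat age reported in minutes instead of hours looks implausibly large.
ok_row = {"queue_backlog": 18, "hours_since_last_heartbeat": 2}
bad_units_row = {"queue_backlog": 18, "hours_since_last_heartbeat": 1200}
print(range_violations(ok_row))
print(range_violations(bad_units_row))
```

A range check only catches unit errors that push values outside the assumed bounds. A backlog reported in the wrong unit but still inside the range would need a parity or drift check instead.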

Measure drift without calling it failure

Compare the historical and current distributions for hours_since_last_heartbeat:

Bucket	Training window	Current window	Change
`0-4h`	50%	30%	-20 points
`4-12h`	30%	25%	-5 points
`12-24h`	15%	25%	+10 points
`24h+`	5%	20%	+15 points

Older heartbeats are more common now. A scheduler outage could cause that shift, but so could a holiday or runner-pool mix change.

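This lab summarizes the bucket changes with the Population Stability Index (PSI). Here $e_i$ is the reference (training-window) share of bucket $i$, $a_i$ is the current share, and $\log$ is the natural logarithm:

$$\text{PSI} = \sum_i (a_i - e_i)\log\left(\frac{a_i}{e_i}\right)$$

Calculate PSI and apply a local investigation threshold of 0.20.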

calculate-input-drift-diagnostic.py

bucket_names = ["0-4h", "4-12h", "12-24h", "24h+"]
reference_counts = [50, 30, 15, 5]
current_counts = [30, 25, 25, 20]

def normalize(counts):
    total = sum(counts)
    return [count / total for count in counts]

def population_stability_index(reference, current):
    # Natural log; every share must be positive or the log ratio is undefined.
    return sum(
        (actual - expected) * log(actual / expected)
        for expected, actual in zip(reference, current)
    )

reference = normalize(reference_counts)
current = normalize(current_counts)
psi = population_stability_index(reference, current)
investigate_at = 0.20

for name, before, after in zip(bucket_names, reference, current):
    print(f"{name}: reference={before:.2f} current={after:.2f} delta={after - before:+.2f}")
print("PSI:", round(psi, 3))
print("local action:", "inspect shift" if psi >= investigate_at else "continue")

Output

0-4h: reference=0.50 current=0.30 delta=-0.20
4-12h: reference=0.30 current=0.25 delta=-0.05
12-24h: reference=0.15 current=0.25 delta=+0.10
24h+: reference=0.05 current=0.20 delta=+0.15
PSI: 0.37
local action: inspect shift
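
The PSI formula is undefined when a bucket has zero mass in either window, because the log ratio then divides by zero or takes the log of zero. One common workaround is to floor each share at a small epsilon and renormalize. The sketch below assumes `epsilon=1e-4`, which is an arbitrary choice, and the helper names `smoothed_shares` and `psi_with_smoothing` are hypothetical.

```python
from math import log

def smoothed_shares(counts, epsilon=1e-4):
    """Convert counts to shares, flooring each share at epsilon (an assumed value)."""
    total = sum(counts)
    floored = [max(count / total, epsilon) for count in counts]
    norm = sum(floored)
    return [share / norm for share in floored]

def psi_with_smoothing(reference_counts, current_counts, epsilon=1e-4):
    reference = smoothed_shares(reference_counts, epsilon)
    current = smoothed_shares(current_counts, epsilon)
    return sum(
        (actual - expected) * log(actual / expected)
        for expected, actual in zip(reference, current)
    )

# Same counts as the lab: no share is floored, so the result is unchanged.
lab_psi = psi_with_smoothing([50, 30, 15, 5], [30, 25, 25, 20])
# An empty 24h+ bucket in the reference window no longer raises an error.
empty_bucket_psi = psi_with_smoothing([55, 30, 15, 0], [30, 25, 25, 20])
print(round(lab_psi, 3))
print(empty_bucket_psi > lab_psi)
```

The smoothed value depends strongly on the chosen epsilon when a bucket is empty, so it is best read as a prompt to inspect the bucket rather than as a comparable score.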

Reproduce training-serving skew

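Training-serving skew means the serving path computes a feature differently from the training pipeline, so the model receives inputs it was never trained on. Probe the same raw row through both transformations.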

reproduce-training-serving-skew.py

probe = {"queue_backlog": 20, "staffed_capacity": 50}

def training_transform(row):
    return round(row["queue_backlog"] / row["staffed_capacity"], 3)

def buggy_serving_transform(row):
    # Bug: divides by a hard-coded 100 instead of the row's staffed_capacity.
    return round(row["queue_backlog"] / 100, 3)

training_value = training_transform(probe)
serving_value = buggy_serving_transform(probe)
print("training backlog ratio:", training_value)
print("serving backlog ratio:", serving_value)
print("parity:", training_value == serving_value)

Output

training backlog ratio: 0.4
serving backlog ratio: 0.2
parity: False
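
A single probe can pass by coincidence: with `staffed_capacity` equal to 100, the buggy transform above would agree with training. A hedged sketch of a multi-row parity check follows. The probe rows, the tolerance, and the helper name `parity_failures` are illustrative assumptions.

```python
from math import isclose

def training_transform(row):
    return row["queue_backlog"] / row["staffed_capacity"]

def buggy_serving_transform(row):
    return row["queue_backlog"] / 100  # hard-coded capacity, as in the lab

def parity_failures(probes, train_fn, serve_fn, abs_tol=1e-9):
    """Return the probe rows where the two transforms disagree."""
    return [
        row for row in probes
        if not isclose(train_fn(row), serve_fn(row), abs_tol=abs_tol)
    ]

# Hypothetical probe rows; capacity 100 hides the bug, the others expose it.
probes = [
    {"queue_backlog": 20, "staffed_capacity": 100},
    {"queue_backlog": 20, "staffed_capacity": 50},
    {"queue_backlog": 35, "staffed_capacity": 70},
]
failures = parity_failures(probes, training_transform, buggy_serving_transform)
print("probes:", len(probes), "failures:", len(failures))
```

Comparing with a tolerance instead of `==` avoids false alarms from floating-point rounding while still flagging real contract differences.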

Join delayed labels back to stored predictions

Request-time checks move fast because they don't need outcomes. Quality metrics move more slowly: an SLA-miss label may arrive only after the promised job window closes.

join-delayed-job-SLA-labels.py

predictions = [
    {"prediction_id": "p-001", "job_id": "s-200", "risk_score": 0.15, "priority_job": False, "label_due_at": now - timedelta(hours=5)},
    {"prediction_id": "p-002", "job_id": "s-201", "risk_score": 0.32, "priority_job": False, "label_due_at": now - timedelta(hours=4)},
    {"prediction_id": "p-003", "job_id": "s-202", "risk_score": 0.68, "priority_job": True, "label_due_at": now - timedelta(hours=3)},
    {"prediction_id": "p-004", "job_id": "s-203", "risk_score": 0.78, "priority_job": False, "label_due_at": now - timedelta(hours=2)},
    {"prediction_id": "p-005", "job_id": "s-204", "risk_score": 0.91, "priority_job": True, "label_due_at": now - timedelta(hours=1)},
    {"prediction_id": "p-006", "job_id": "s-205", "risk_score": 0.84, "priority_job": True, "label_due_at": now + timedelta(hours=1)},
]
decision_policy = "missed-sla-threshold-050-v1"
predictions = [
    {
        **prediction,
        "model": deployed["model"],
        "feature_contract": deployed["feature_contract"],
        "decision_policy": decision_policy,
        "decision_threshold": deployed["threshold"],
    }
    for prediction in predictions
]
labels = {
    "p-001": False,
    "p-002": False,
    "p-003": True,
    "p-004": False,
    "p-005": True,
}

matured = [prediction for prediction in predictions if prediction["label_due_at"] <= now]
immature = [prediction for prediction in predictions if prediction["label_due_at"] > now]
matured_labeled = []
missing_matured_labels = []
for prediction in matured:
    prediction_id = prediction["prediction_id"]
    if prediction_id in labels:
        matured_labeled.append({**prediction, "missed_sla": labels[prediction_id]})
    else:
        missing_matured_labels.append(prediction_id)

label_coverage = len(matured_labeled) / len(matured) if matured else 0.0

print("stored predictions:", len(predictions))
print("matured predictions:", len(matured))
print("joined matured labels:", len(matured_labeled))
print(f"matured label coverage: {label_coverage:.3f}")
print("missing matured labels:", missing_matured_labels)
print("immature predictions:", [row["prediction_id"] for row in immature])

Output

stored predictions: 6
matured predictions: 5
joined matured labels: 5
matured label coverage: 1.000
missing matured labels: []
immature predictions: ['p-006']
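
Excluding `p-006` matters because an unlabeled prediction is not a negative outcome. The sketch below copies the lab's scores and labels and compares precision on the mature cohort with precision under the mistaken assumption that every unlabeled job finished on schedule. The tuple layout and the helper name `precision` are illustrative.

```python
# Scores and known labels copied from the lab; p-006 has no label yet.
rows = [
    ("p-001", 0.15, False),
    ("p-002", 0.32, False),
    ("p-003", 0.68, True),
    ("p-004", 0.78, False),
    ("p-005", 0.91, True),
    ("p-006", 0.84, None),  # immature: outcome unknown
]
threshold = 0.50

def precision(labeled_rows):
    flagged = [label for _, score, label in labeled_rows if score >= threshold]
    return sum(flagged) / len(flagged) if flagged else None

mature_only = [row for row in rows if row[2] is not None]
# Wrong treatment: assume every unlabeled job finished on schedule.
assumed_negative = [(pid, score, bool(label)) for pid, score, label in rows]

print("mature cohort precision:", round(precision(mature_only), 3))
print("assume-negative precision:", round(precision(assumed_negative), 3))
```

Treating the pending row as a negative turns a high-risk prediction into a false positive before its outcome is known, which understates precision for this window.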

Evaluate decisions and calibration

measure-delayed-quality-window.py

threshold = deployed["threshold"]

if missing_matured_labels:
    raise RuntimeError("quality metrics unavailable: mature cohort has missing labels")

def predicted_missed_sla(row):
    return row["risk_score"] >= threshold

true_positives = sum(predicted_missed_sla(row) and row["missed_sla"] for row in matured_labeled)
false_positives = sum(predicted_missed_sla(row) and not row["missed_sla"] for row in matured_labeled)
false_negatives = sum(not predicted_missed_sla(row) and row["missed_sla"] for row in matured_labeled)

def rate_or_none(numerator, denominator):
    return round(numerator / denominator, 3) if denominator else None

precision = rate_or_none(true_positives, true_positives + false_positives)
recall = rate_or_none(true_positives, true_positives + false_negatives)
missed_priority_job = sum(
    row["priority_job"] and row["missed_sla"] and not predicted_missed_sla(row)
    for row in matured_labeled
)
review_cost = false_positives * 2 + false_negatives * 10

print("precision:", precision)
print("recall:", recall)
print("missed priority jobs:", missed_priority_job)
print("review cost:", review_cost)

Output

precision: 0.667
recall: 1.0
missed priority jobs: 0
review cost: 2
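
The review cost weights a false negative at 10 and a false positive at 2, so the decision threshold changes the cost even when the scores stay the same. A small sketch of that trade-off on the lab's five mature rows follows. The candidate thresholds are chosen only for illustration, and five rows are far too few to select a threshold in practice.

```python
# (risk_score, missed_sla) pairs from the lab's mature cohort.
cohort = [(0.15, False), (0.32, False), (0.68, True), (0.78, False), (0.91, True)]
FALSE_POSITIVE_COST = 2   # same weights as review_cost in the lab
FALSE_NEGATIVE_COST = 10

def review_cost_at(threshold):
    false_positives = sum(score >= threshold and not missed for score, missed in cohort)
    false_negatives = sum(score < threshold and missed for score, missed in cohort)
    return false_positives * FALSE_POSITIVE_COST + false_negatives * FALSE_NEGATIVE_COST

# Hypothetical candidate thresholds, not part of the lab's decision policy.
costs = {threshold: review_cost_at(threshold) for threshold in (0.30, 0.50, 0.70, 0.80)}
for threshold, cost in costs.items():
    print(f"threshold={threshold:.2f} review_cost={cost}")
```

Raising the threshold past 0.68 misses job `s-202`, and the single false negative costs more than the false positives it removes.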

measure-risk-score-calibration.py

score_buckets = [
    ("low", [row for row in matured_labeled if row["risk_score"] < 0.50]),
    ("high", [row for row in matured_labeled if row["risk_score"] >= 0.50]),
]

for name, rows_in_bucket in score_buckets:
    if not rows_in_bucket:
        print(f"{name}: n=0 average_score=None observed_missed_sla_rate=None")
        continue
    average_score = sum(row["risk_score"] for row in rows_in_bucket) / len(rows_in_bucket)
    observed_rate = sum(row["missed_sla"] for row in rows_in_bucket) / len(rows_in_bucket)
    print(
        f"{name}: n={len(rows_in_bucket)} "
        f"average_score={average_score:.3f} "
        f"observed_missed_sla_rate={observed_rate:.3f}"
    )

Output

low: n=2 average_score=0.235 observed_missed_sla_rate=0.000
high: n=3 average_score=0.790 observed_missed_sla_rate=0.667
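
The two bucket rows can be condensed into one number: the gap between average score and observed miss rate, weighted by bucket size. This is a two-bucket analog of expected calibration error (ECE), sketched here under the assumption that the lab's low and high buckets are the only bins. With five labeled rows the value is illustrative, not a stable estimate.

```python
# (bucket name, n, average_score, observed_missed_sla_rate) from the lab output.
buckets = [
    ("low", 2, 0.235, 0.0),
    ("high", 3, 0.790, 2 / 3),
]

def expected_calibration_error(bucket_rows):
    """Weighted mean absolute gap between average score and observed rate."""
    total = sum(n for _, n, _, _ in bucket_rows)
    return sum(
        (n / total) * abs(average_score - observed_rate)
        for _, n, average_score, observed_rate in bucket_rows
    )

ece = expected_calibration_error(buckets)
print("ECE:", round(ece, 3))
```

Both buckets score higher than their observed miss rate in this window, so the gap here reflects overestimated risk.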

Triage repair, inspection, and retraining separately

Drift, skew, and delayed quality degradation can appear together. They don't imply the same action:

Evidence	First action
missing or stale online features	repair data path
parity probe fails	repair training-serving contract
distribution shifts	inspect source health and business context
delayed decision cost regresses after inputs are valid	train frozen-snapshot candidate

Encode that order. The first scenario has broken data and parity. The second has clean inputs plus delayed quality regression.

triage-monitoring-signals.py

def triage(schema_failures, stale_features, parity_ok, psi, cost_regressed):
    actions = []
    if schema_failures or stale_features:
        actions.append("repair online data path")
    if not parity_ok:
        actions.append("repair training-serving parity")
    if psi >= investigate_at:
        actions.append("inspect distribution shift")
    if not schema_failures and not stale_features and parity_ok and cost_regressed:
        actions.append("train frozen-snapshot candidate")
    return actions

print("dirty window:", triage(1, 1, False, psi, True))
print("clean changed window:", triage(0, 0, True, psi, True))

Output

dirty window: ['repair online data path', 'repair training-serving parity', 'inspect distribution shift']
clean changed window: ['inspect distribution shift', 'train frozen-snapshot candidate']

Freeze an immutable candidate bundle

Create a deterministic bundle ID from the candidate configuration.

publish-immutable-candidate-bundle.py

candidate_config = {
    "model": "job-risk-v4",
    "snapshot": "labels-through-2026-02-03",
    "training_pipeline": "job-risk-train-v4",
    "feature_contract": "features-v3",
    "decision_policy": decision_policy,
    "threshold": 0.50,
    "evaluation_policy": "job-risk-eval-v1",
    "previous_model": deployed["model"],
}
payload = dumps(candidate_config, sort_keys=True, separators=(",", ":"))
bundle_id = "job-risk-v4-" + sha256(payload.encode()).hexdigest()[:10]
candidate_bundle = {**candidate_config, "bundle_id": bundle_id}

print("candidate bundle:", candidate_bundle["bundle_id"])
print("rollback pointer:", candidate_bundle["previous_model"])

Output

candidate bundle: job-risk-v4-edde9c2176
rollback pointer: job-risk-v3
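
The bundle ID is deterministic because the configuration is serialized in a canonical form before hashing. The sketch below isolates that property with small illustrative configs, which are not the lab's full `candidate_config`. The helper name `bundle_digest` is hypothetical.

```python
from hashlib import sha256
from json import dumps

def bundle_digest(config):
    """Hash a canonical JSON encoding so key order cannot change the ID."""
    payload = dumps(config, sort_keys=True, separators=(",", ":"))
    return sha256(payload.encode()).hexdigest()[:10]

# Illustrative configs: same content in a different key order, then one changed field.
config_a = {"model": "job-risk-v4", "threshold": 0.50, "feature_contract": "features-v3"}
config_b = {"feature_contract": "features-v3", "threshold": 0.50, "model": "job-risk-v4"}
config_c = {**config_a, "threshold": 0.55}

print("same content, different key order:", bundle_digest(config_a) == bundle_digest(config_b))
print("changed threshold:", bundle_digest(config_a) == bundle_digest(config_c))
```

Any change to a recorded field produces a different ID, so a bundle can't be silently edited after its gates were evaluated. The ID only covers what the config names, though: the snapshot and pipeline it points to must themselves be immutable.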

Gate offline metrics and canary traffic

Compare candidate v4 with current v3 before changing live traffic. These thresholds are explicit local policies:

check-offline-promotion-gates.py

current_metrics = {
    "recall": 0.80,
    "priority_misses": 1,
    "review_cost": 6,
    "p95_latency_ms": 40,
}
candidate_metrics = {
    "recall": 0.90,
    "priority_misses": 0,
    "review_cost": 4,
    "p95_latency_ms": 42,
}
evaluation_cohort = {
    "maturity_cutoff": "2026-02-03T12:00:00Z",
    "matured_predictions": 1000,
    "joined_labels": 1000,
}
evaluation_cohort["label_coverage"] = (
    evaluation_cohort["joined_labels"] / evaluation_cohort["matured_predictions"]
)
offline_evaluation = {
    "policy": candidate_bundle["evaluation_policy"],
    "label_snapshot": candidate_bundle["snapshot"],
    "cohort": evaluation_cohort,
    "current": current_metrics,
    "candidate": candidate_metrics,
}
offline_gates = {
    "mature_label_coverage": evaluation_cohort["label_coverage"] == 1.0,
    "recall": candidate_metrics["recall"] >= current_metrics["recall"],
    "priority_misses": candidate_metrics["priority_misses"] == 0,
    "review_cost": candidate_metrics["review_cost"] <= current_metrics["review_cost"],
    "latency_budget": candidate_metrics["p95_latency_ms"] <= 50,
}

for name, passed in offline_gates.items():
    print(f"{name}: {passed}")
print("offline gate:", all(offline_gates.values()))

Output

mature_label_coverage: True
recall: True
priority_misses: True
review_cost: True
latency_budget: True
offline gate: True

check-canary-promotion-gates.py

canary_metrics = {
    "requests": 500,
    "error_rate": 0.002,
    "p95_latency_ms": 44,
    "review_queue_load_ratio": 0.72,
}
canary_gates = {
    "enough_requests": canary_metrics["requests"] >= 500,
    "error_rate": canary_metrics["error_rate"] <= 0.005,
    "p95_latency_ms": canary_metrics["p95_latency_ms"] <= 50,
    "review_queue_load_ratio": canary_metrics["review_queue_load_ratio"] <= 0.80,
}
canary_evaluation = {
    "policy": "job-risk-canary-v1",
    "metrics": canary_metrics,
    "checks": canary_gates,
}

for name, passed in canary_gates.items():
    print(f"{name}: {passed}")
print("canary gate:", all(canary_gates.values()))

Output

enough_requests: True
error_rate: True
p95_latency_ms: True
review_queue_load_ratio: True
canary gate: True
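
When a gate fails, the aggregate `all(...)` result says only that promotion is blocked. A small sketch that names the failing gates follows. The slow canary reading is hypothetical, the limits mirror the lab's local policy, and the helper name `failed_gates` is an assumption.

```python
def failed_gates(gates):
    """Return the names of gates that did not pass, in declaration order."""
    return [name for name, passed in gates.items() if not passed]

# Hypothetical canary reading with slow latency.
slow_canary = {
    "requests": 500,
    "error_rate": 0.002,
    "p95_latency_ms": 65,
    "review_queue_load_ratio": 0.72,
}
slow_canary_gates = {
    "enough_requests": slow_canary["requests"] >= 500,
    "error_rate": slow_canary["error_rate"] <= 0.005,
    "p95_latency_ms": slow_canary["p95_latency_ms"] <= 50,
    "review_queue_load_ratio": slow_canary["review_queue_load_ratio"] <= 0.80,
}
blocked_by = failed_gates(slow_canary_gates)
print("canary gate:", not blocked_by)
print("blocked by:", blocked_by)
```

Recording the failing gate names alongside the metrics gives the release receipt a reason, not just a verdict.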

Promote an alias, then rehearse rollback

promote-and-rollback-model-alias.py

alias = {"production": deployed["model"]}
rollback_pointer = alias["production"]

if all(offline_gates.values()) and all(canary_gates.values()):
    alias["production"] = candidate_bundle["bundle_id"]

print("production alias:", alias["production"])
print("rollback pointer:", rollback_pointer)

post_promotion = {
    "matured_predictions": 500,
    "joined_labels": 500,
    "delayed_review_cost": 8,
}
post_promotion["label_coverage"] = (
    post_promotion["joined_labels"] / post_promotion["matured_predictions"]
)
rollback_limit = 5
rollback_reason = None
if (
    post_promotion["label_coverage"] == 1.0
    and post_promotion["delayed_review_cost"] > rollback_limit
):
    rollback_reason = (
        f"delayed review cost {post_promotion['delayed_review_cost']} "
        f"> {rollback_limit}"
    )
    alias["production"] = rollback_pointer

print(f"post-promotion label coverage: {post_promotion['label_coverage']:.3f}")
print("delayed review cost:", post_promotion["delayed_review_cost"])
print("final alias:", alias["production"])

receipt = {
    "candidate_bundle": candidate_bundle,
    "previous": rollback_pointer,
    "offline_evaluation": {**offline_evaluation, "checks": offline_gates},
    "canary_evaluation": canary_evaluation,
    "post_promotion": {**post_promotion, "rollback_limit": rollback_limit},
    "rollback_reason": rollback_reason,
    "release_action": "rolled_back" if rollback_reason else "keep_candidate",
    "final_alias": alias["production"],
}
receipt_json = dumps(receipt, sort_keys=True, separators=(",", ":"))
print("release action:", receipt["release_action"])
print("receipt sha256:", sha256(receipt_json.encode()).hexdigest()[:12])

Output

production alias: job-risk-v4-edde9c2176
rollback pointer: job-risk-v3
post-promotion label coverage: 1.000
delayed review cost: 8
final alias: job-risk-v3
release action: rolled_back
receipt sha256: 54de7af4a622
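
In the lab code, the rollback branch runs only when label coverage is complete, and the receipt records `keep_candidate` whenever no rollback reason was set. A window with incomplete labels would therefore also read as `keep_candidate`. One way to separate those cases is a three-state decision, sketched below. The `hold` state and the helper name `post_promotion_action` are assumptions, not part of the lab.

```python
def post_promotion_action(label_coverage, delayed_review_cost, rollback_limit):
    """Return 'hold', 'rollback', or 'keep' for a post-promotion window."""
    if label_coverage < 1.0:
        return "hold"  # evidence incomplete: don't confirm or judge cost yet
    if delayed_review_cost > rollback_limit:
        return "rollback"
    return "keep"

print(post_promotion_action(1.0, 8, 5))
print(post_promotion_action(1.0, 4, 5))
print(post_promotion_action(0.9, 8, 5))
```

Whether a `hold` window should keep serving the candidate or fall back to the previous release is a local policy choice that this sketch leaves open.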

Explain the release loop without looking back

Before continuing, explain the full path in your own words:

Which checks protect a job before delayed labels exist?
Why is drift evidence different from quality evidence?
Why can't retraining fix a training-serving skew bug?
Why must quality metrics use mature cohorts and report label coverage?
Which artifacts make a candidate reproducible?
Why do offline gates, canary gates, and rollback drills solve different problems?

Practice

Change the s-104 feature age from 180 to 30 minutes. Before rerunning the lab, predict how many requests score.
Change current 24h+ heartbeat share from 20 to 5 and move the removed 15 counts into 0-4h. Predict whether PSI rises or falls.
Fix buggy_serving_transform() to divide by staffed_capacity. Explain which triage action disappears.
Advance now past p-006's label_due_at, add label p-006: True, and compute the new precision, recall, and mature-cohort coverage.
Change canary p95_latency_ms from 44 to 65. Explain whether production alias ever moves to v4.
Remove rollback_pointer assignment. Explain why a model registry entry alone doesn't prove rollback works.

Practice answer sketches

Prompt	Reasoning check
Freshen `s-104`	Five requests score. Only `s-103` remains blocked for missing backlog.
Restore recent-heartbeat mix	PSI falls because current traffic moves closer to reference traffic.
Fix serving transform	Training-serving parity becomes `True`, so `repair training-serving parity` disappears.
Resolve `p-006` as missed_sla	True positives rise from `2` to `3`; precision rises to `3 / 4 = 0.75`; recall remains `1.0`; mature-cohort coverage stays complete at `6 / 6 = 1.000`.
Slow canary	`p95_latency_ms` gate fails. Alias remains `job-risk-v3`.
Delete rollback pointer	Candidate may still be reproducible, but rollback target is no longer proven by execution.

What strong answers show

Evidence	A strong explanation demonstrates
fast controls	blocks missing or stale features before delayed labels arrive
drift reasoning	treats PSI as a local investigation diagnostic, not an accuracy verdict
parity	reproduces skew with one raw row and fixes contract before retraining
delayed quality	joins immutable prediction IDs, gates on mature cohorts and label coverage, and checks slices
candidate control	freezes snapshot, training pipeline, feature contract, decision policy, evaluation policy, and previous model pointer
rollout	separates offline gate, limited live canary, promotion alias, and rollback receipt

When monitoring breaks

Symptom	Cause	Fix
Accuracy report arrives after days of bad inputs	only label metrics are monitored	add schema and freshness alarms
Every drift alert starts retraining	drift is confused with failure	inspect data path, context, and delayed outcomes
New training run repeats wrong scores	serving transform differs from training	add parity probes before retraining
Report looks good while labels are still arriving	metrics use whichever labels arrived first	define mature cohorts, report label coverage, and block or qualify incomplete windows
New model can't be rolled back cleanly	bundle and alias aren't versioned separately	publish immutable candidate plus rollback pointer
Canary passes but delayed cost later rises	only fast rollout metrics were checked	keep post-promotion guardrails and rehearse rollback


