Monitor predictive models from feature freshness through delayed labels, then gate retraining and promotion.
You now have models that score delivery risk, rank products, and forecast warehouse load. A deployed model isn't done when its endpoint responds. Its inputs change, labels arrive late, and business policies change what a prediction means.
Monitoring should answer three separate questions: is input data valid now, does model behavior still look plausible, and did eventual outcomes remain good enough to keep serving this release?
For late-delivery prediction, the true label may arrive days after scoring. You still need immediate controls.
| Signal | Available when | What it catches |
|---|---|---|
| missing-feature rate | request time | broken feed or schema |
| feature freshness | request time | stale carrier or backlog data |
| score distribution | request time | sudden model/input shift |
| latency and error rate | request time | serving regression |
| precision, recall, cost | after delivery label | quality degradation |
| calibration by risk bucket | after enough labels | unreliable probability use |
An input alarm is not an accuracy claim. If origin_backlog becomes null for half of shipments, you can block or degrade predictions before labels confirm delays. If the score distribution shifts because of legitimate holiday demand, delayed labels may show that retraining, rather than rollback, is the appropriate response.
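The request-time checks can be sketched as a small window summary. This is a minimal sketch, not a prescribed implementation: the record layout (`features`, `feature_updated_at`), the function `input_health`, and all three thresholds are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical thresholds; tune them per feature and business policy.
MAX_MISSING_RATE = 0.05
MAX_STALE_RATE = 0.10
MAX_FEATURE_AGE = timedelta(hours=6)


def input_health(requests, feature, now):
    """Summarise missing rate and staleness for one feature over a request window."""
    total = len(requests)
    missing = sum(1 for r in requests if r["features"].get(feature) is None)
    stale = sum(1 for r in requests if now - r["feature_updated_at"] > MAX_FEATURE_AGE)
    missing_rate = missing / total if total else 0.0
    stale_rate = stale / total if total else 0.0

    # Input alarms act before any label exists; they say nothing about accuracy.
    if missing_rate > MAX_MISSING_RATE:
        action = "block_or_degrade"
    elif stale_rate > MAX_STALE_RATE:
        action = "warn_stale"
    else:
        action = "serve"
    return {"missing_rate": missing_rate, "stale_rate": stale_rate, "action": action}


now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
window = [
    {"features": {"origin_backlog": 120}, "feature_updated_at": now - timedelta(hours=1)},
    {"features": {"origin_backlog": None}, "feature_updated_at": now - timedelta(hours=1)},
    {"features": {"origin_backlog": 95}, "feature_updated_at": now - timedelta(hours=8)},
    {"features": {"origin_backlog": None}, "feature_updated_at": now - timedelta(hours=2)},
]
report = input_health(window, "origin_backlog", now)
print(report)
```

With half of the window missing origin_backlog, the sketch returns `block_or_degrade` without waiting for any delivery label.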
Sculley et al. identify configuration, data dependencies, and feedback loops as central sources of production ML debt.[1] Google Cloud's MLOps pipeline guidance makes monitoring and continuous training part of an automated lifecycle, with validation and deployment controls around each candidate.[2]
One lightweight drift diagnostic compares a training distribution with a current window. Suppose late-risk model training saw these hours_since_last_scan buckets, then current traffic changed:

```python
from math import log

# Share of traffic per hours_since_last_scan bucket (each list sums to 1).
training = [0.50, 0.30, 0.15, 0.05]
current = [0.30, 0.25, 0.25, 0.20]
labels = ["0-4h", "4-12h", "12-24h", "24h+"]


def population_stability_index(expected, actual):
    # Sum of (actual - expected) * ln(actual / expected) over buckets.
    # Every bucket share must be greater than zero.
    return sum((a - e) * log(a / e) for e, a in zip(expected, actual))


psi = population_stability_index(training, current)
# Index of the bucket whose share grew the most.
largest = max(range(len(labels)), key=lambda i: current[i] - training[i])
print("PSI:", round(psi, 3))
print("largest increase:", labels[largest])
print("action:", "inspect and await labels" if psi > 0.20 else "continue normal monitoring")
```

```
PSI: 0.37
largest increase: 24h+
action: inspect and await labels
```

Population Stability Index (PSI) is a diagnostic convention, not a universal truth threshold. Here it says old scans are far more common in the current window. Investigate carrier ingestion, storms, and capacity changes. Don't automatically declare the model bad or retrain from corrupted inputs.
When deliveries complete, join outcomes back to stored predictions using immutable release and feature identifiers. Compare:
| Quality gate | Example policy |
|---|---|
| expedited missed-delay rate | zero misses in critical reviewed slice |
| expected intervention cost | no more than approved baseline |
| calibration | high-risk bucket doesn't overpromise |
| warehouse slices | no material regression hidden by aggregate |
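
The join from delayed outcomes back to stored predictions might look like the sketch below. The log layout, the release name `late-risk-v7`, the function `delayed_label_metrics`, and the 0.5 decision threshold are all hypothetical; a real system would read these from its own prediction store and policy.

```python
from collections import defaultdict

# Hypothetical prediction log. Each row keeps the release that produced the score.
predictions = [
    {"shipment_id": "s1", "release_id": "late-risk-v7", "score": 0.92},
    {"shipment_id": "s2", "release_id": "late-risk-v7", "score": 0.81},
    {"shipment_id": "s3", "release_id": "late-risk-v7", "score": 0.15},
    {"shipment_id": "s4", "release_id": "late-risk-v7", "score": 0.40},
    {"shipment_id": "s5", "release_id": "late-risk-v7", "score": 0.85},
]
# Delayed labels keyed by shipment_id; True means the delivery was late.
# s5 has not been delivered yet, so it has no label.
outcomes = {"s1": True, "s2": False, "s3": False, "s4": True}


def delayed_label_metrics(predictions, outcomes, threshold=0.5):
    """Join labels to stored predictions and compute precision/recall per release."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0, "tn": 0, "unlabeled": 0})
    for row in predictions:
        bucket = counts[row["release_id"]]
        label = outcomes.get(row["shipment_id"])
        if label is None:
            bucket["unlabeled"] += 1  # do not guess; wait for the outcome
            continue
        predicted_late = row["score"] >= threshold
        if predicted_late and label:
            bucket["tp"] += 1
        elif predicted_late:
            bucket["fp"] += 1
        elif label:
            bucket["fn"] += 1
        else:
            bucket["tn"] += 1

    report = {}
    for release_id, c in counts.items():
        flagged = c["tp"] + c["fp"]
        actual_late = c["tp"] + c["fn"]
        report[release_id] = {
            "precision": c["tp"] / flagged if flagged else None,
            "recall": c["tp"] / actual_late if actual_late else None,
            "unlabeled": c["unlabeled"],
        }
    return report


metrics = delayed_label_metrics(predictions, outcomes)
print(metrics)
```

Keeping the unlabeled count in the report makes it visible when a metric rests on too few completed deliveries to support a gate decision.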
If quality deteriorates because traffic has changed and data remains valid, start a retraining job against a new frozen snapshot. If quality deteriorates because online transformations don't match training, repair parity before retraining. Training on broken inputs produces a newer broken model.
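
One way to make that distinction concrete is to check parity on a sample of shipments before choosing a next step. The sketch below assumes numeric features recomputed offline for the same shipment IDs; `parity_mismatches`, `next_step`, and the tolerance are hypothetical names and values, not part of any particular library.

```python
import math


def parity_mismatches(offline_rows, online_rows, tolerance=1e-6):
    """Return (shipment_id, feature) pairs where online and offline values disagree."""
    mismatches = []
    for shipment_id, offline in offline_rows.items():
        online = online_rows.get(shipment_id, {})
        for feature, expected in offline.items():
            actual = online.get(feature)
            if actual is None or not math.isclose(
                actual, expected, rel_tol=0.0, abs_tol=tolerance
            ):
                mismatches.append((shipment_id, feature))
    return mismatches


def next_step(quality_degraded, mismatches):
    """Pick the follow-up action; retraining only makes sense on trustworthy inputs."""
    if not quality_degraded:
        return "continue_monitoring"
    if mismatches:
        return "repair_parity"
    return "retrain_on_frozen_snapshot"


# Hypothetical sample: the online path reports origin_backlog in different units.
offline = {"s1": {"hours_since_last_scan": 3.0, "origin_backlog": 120.0}}
online = {"s1": {"hours_since_last_scan": 3.0, "origin_backlog": 2.0}}

found = parity_mismatches(offline, online)
print(found)
print(next_step(quality_degraded=True, mismatches=found))
```

In this sample the mismatch on origin_backlog routes the incident to a parity repair rather than a retraining job.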
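Continuous training doesn't mean automatic production replacement. Treat it as candidate generation: each retraining run produces a candidate that must pass the quality gates, registry review, and canary checks, with a rollback condition defined, before it replaces the serving release.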
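
A minimal sketch of a gate between candidate generation and promotion, assuming evaluation results are already computed for both releases. The `Evaluation` fields, the release names, and the `max_slice_drop` tolerance are illustrative assumptions that loosely mirror the quality gates listed earlier.

```python
from dataclasses import dataclass


@dataclass
class Evaluation:
    release_id: str
    critical_slice_misses: int  # missed delays in the reviewed critical slice
    expected_cost: float        # expected intervention cost on the evaluation set
    slice_recall: dict          # recall per warehouse slice


def promotion_gate(candidate, baseline, max_slice_drop=0.02):
    """Compare a retrained candidate with the serving baseline; never auto-promote."""
    failures = []
    if candidate.critical_slice_misses > 0:
        failures.append("critical slice has missed delays")
    if candidate.expected_cost > baseline.expected_cost:
        failures.append("expected intervention cost exceeds baseline")
    for name, base_recall in baseline.slice_recall.items():
        cand_recall = candidate.slice_recall.get(name, 0.0)
        if base_recall - cand_recall > max_slice_drop:
            failures.append(f"recall regression in slice {name}")
    return {
        "candidate": candidate.release_id,
        "decision": "start_canary" if not failures else "keep_baseline",
        "rollback_to": baseline.release_id,  # pointer kept even when the gate passes
        "failures": failures,
    }


baseline = Evaluation("late-risk-v7", 0, 1000.0, {"north": 0.80, "south": 0.78})
candidate = Evaluation("late-risk-v8", 0, 950.0, {"north": 0.84, "south": 0.70})

result = promotion_gate(candidate, baseline)
print(result)
```

Here the candidate is cheaper overall and better in one slice, yet the gate keeps the baseline because the regression in the south slice would be hidden by the aggregate.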
This workflow unifies predictive ML with the LLM release discipline later in the course. Model weights, features, thresholds, prompts, or retrieval indexes may differ, but a reliable engineer always knows exactly which artifact produced a decision.
| Evidence | A production-ready answer demonstrates |
|---|---|
| monitoring layers | separates input health, score shifts, delayed outcomes, and action metrics |
| trigger design | turns drift or new labels into candidate evaluation rather than automatic promotion |
| release control | names registry evidence, canary checks, and rollback conditions |
| Symptom | Cause | Fix |
|---|---|---|
| Accuracy report arrives after days of bad inputs | only label metrics monitored | add schema and freshness alarms |
| Every drift alert starts retraining | drift confused with failure | inspect data path and wait for outcome evidence |
| New model can't be rolled back cleanly | bundle and alias not versioned | publish immutable candidate plus rollback pointer |