Assemble predictive ML artifacts into validated training, registry promotion, canary monitoring, and rollback.
You have built four products: a late-delivery warning model, a product ranker, a warehouse demand forecast, and a damaged-package photo classifier. Each uses different metrics, but each relies on the same operational discipline: immutable data evidence, validated candidates, controlled promotion, monitoring, and rollback.
This capstone assembles that discipline into one ML platform workflow. It isn't tied to a particular orchestrator or cloud vendor. A reviewer must be able to trace any live decision back to data, feature, model, policy, and promotion evidence.
Models differ, but their release manifest can share a schema:
| Field | ETA example | Ranking example | Forecast example | Vision example |
|---|---|---|---|---|
| data snapshot | carrier events through cutoff | catalog and judged queries | daily counts through cutoff | return photos grouped by shipment |
| feature version | eta_features_v1 | ranking_features_v1 | demand_lags_v1 | parcel_rgb_224_v1 |
| model artifact | delay_model_v1 | ranker_v1 | forecast_v1 | damage_cnn_v1 |
| action policy | warning threshold | eligibility and slate rule | alert threshold | quality gate and review threshold |
| gates | critical-lane recall | blocked listings and NDCG | peak underforecast cost | usable-image and source-slice checks |
| monitor | delayed labels and freshness | impressions and returns | residuals and alert review | photo quality and reviewer labels |
Recording this release tuple means an incident review never has to reconstruct which threshold or feature transform happened to be active. Sculley et al. warn that ML systems accumulate debt through data dependencies, configuration, and feedback loops unless those boundaries are managed explicitly.[1]
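One way to make the shared schema enforceable is to reject any manifest that is missing a tuple field before it reaches the registry. The sketch below assumes the manifest is a plain dict whose keys (`data_snapshot`, `feature_version`, `model_artifact`, `action_policy`, `gates`, `monitor`) are hypothetical names for the table's rows. A real `release_manifest.schema.json` would more likely be enforced with a JSON Schema validator.

```python
from dataclasses import dataclass

# Hypothetical field names mirroring the manifest table; not a published schema.
REQUIRED_FIELDS = (
    "data_snapshot",
    "feature_version",
    "model_artifact",
    "action_policy",
    "gates",
    "monitor",
)


@dataclass(frozen=True)
class ManifestCheck:
    ok: bool
    missing: tuple


def check_manifest(manifest: dict) -> ManifestCheck:
    """Report which release-tuple fields are absent or empty."""
    missing = tuple(
        field for field in REQUIRED_FIELDS
        if not manifest.get(field)  # None, "", and [] all count as missing
    )
    return ManifestCheck(ok=not missing, missing=missing)


eta_v1 = {
    "data_snapshot": "events_2026_04",
    "feature_version": "eta_features_v1",
    "model_artifact": "delay_model_v1",
    "action_policy": "warning_v1",
    "gates": ["critical_lane_recall"],
    "monitor": ["delayed_labels", "freshness"],
}
# A bundle that ships weights without the threshold policy.
weights_only = {k: v for k, v in eta_v1.items() if k != "action_policy"}

print(check_manifest(eta_v1))
print(check_manifest(weights_only))
```

A check like this catches the "rollback restores weights but not threshold" failure at registration time rather than during an incident.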
Submit a small but inspectable platform surface:
```text
production-ml-platform/
  contracts/
    release_manifest.schema.json
    promotion_policy.json
  pipelines/
    validate_snapshot.py
    train_candidate.py
    evaluate_candidate.py
    promote_alias.py
  registry/
    releases.jsonl
  monitoring/
    live_windows.py
    rollback_policy.py
  projects/
    eta/
    ranking/
    forecast/
    vision/
  tests/
    test_failed_gate_never_promotes.py
    test_rollback_restores_manifest.py
```

Google Cloud's MLOps architecture separates automated data/model validation, metadata, serving, monitoring, and continuous-training triggers around promotion.[2] Your repository needn't copy that platform, but it should prove each boundary through a deterministic local fixture and test.
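The `registry/releases.jsonl` file can be treated as append-only: each release name is written once and never edited, so history survives every promotion and rollback. A minimal sketch, assuming one JSON object per line with a `name` key; `load_releases` and `register` are hypothetical helper names, not part of any particular registry product.

```python
import json
import tempfile
from pathlib import Path


def load_releases(path: Path) -> dict:
    """Read every release record, keyed by release name."""
    if not path.exists():
        return {}
    releases = {}
    for line in path.read_text().splitlines():
        if line.strip():
            record = json.loads(line)
            releases[record["name"]] = record
    return releases


def register(path: Path, record: dict) -> None:
    """Append a new release; refuse to overwrite an existing name."""
    if record["name"] in load_releases(path):
        raise ValueError(f"release {record['name']!r} already registered")
    with path.open("a") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")


rejected = None
with tempfile.TemporaryDirectory() as tmp:
    registry_path = Path(tmp) / "releases.jsonl"
    register(registry_path, {"name": "eta_v1", "model": "delay_model_v1", "policy": "warning_v1"})
    register(registry_path, {"name": "eta_v2", "model": "delay_model_v2", "policy": "warning_v1"})
    try:
        # Attempt to silently replace the weights behind an existing name.
        register(registry_path, {"name": "eta_v1", "model": "delay_model_v9", "policy": "warning_v1"})
    except ValueError as err:
        rejected = str(err)
    names = sorted(load_releases(registry_path))
    stored = load_releases(registry_path)["eta_v1"]["model"]

print("rejected:", rejected)
print("registered:", names, "eta_v1 model:", stored)
```

A fixture like this also gives the tests directory something deterministic to assert against, with no external registry service.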
The code below treats validation as permission to open canary traffic, not permission to overwrite production. A failed critical gate leaves the production alias untouched.
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    """One immutable release tuple: data, features, model, and action policy."""
    name: str
    data: str
    features: str
    model: str
    policy: str


# Registry history keeps every release; aliases say which one receives traffic.
registry = {
    "eta_v1": Release("eta_v1", "events_2026_04", "eta_features_v1", "delay_model_v1", "warning_v1"),
    "eta_v2": Release("eta_v2", "events_2026_05", "eta_features_v1", "delay_model_v2", "warning_v1"),
}
aliases = {"production": "eta_v1"}


def open_canary(candidate, gates):
    """Point the canary alias at candidate only if every required gate passed."""
    required = {"schema_valid", "no_leakage", "critical_slice_pass", "cost_improves"}
    # A gate missing from the report counts as failed.
    failed = sorted(required - {gate for gate, passed in gates.items() if passed})
    if failed:
        return "hold:" + ",".join(failed)
    aliases["canary"] = candidate  # production is never touched here
    return "canary_open"


bad = {"schema_valid": True, "no_leakage": True, "critical_slice_pass": False, "cost_improves": True}
good = {"schema_valid": True, "no_leakage": True, "critical_slice_pass": True, "cost_improves": True}
print("first decision:", open_canary("eta_v2", bad))
print("production:", aliases["production"])
print("second decision:", open_canary("eta_v2", good))
print("canary:", aliases["canary"])
```

```text
first decision: hold:critical_slice_pass
production: eta_v1
second decision: canary_open
canary: eta_v2
```

This boundary is the heart of the capstone. A finished training run is not an approved release. Registry history keeps both artifacts, aliases express current traffic decisions, and the monitor can restore the known-good alias without rebuilding a model.
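Rollback can then be an alias move rather than a rebuild. A minimal sketch, assuming the controller keeps a hypothetical append-only `alias_log` of moves next to the frozen `Release` records, and assuming `eta_v2` also ships a retuned threshold (a hypothetical `warning_v2`, unlike the example where both releases share `warning_v1`). Restoring the earlier entry brings back the whole tuple, policy included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    name: str
    data: str
    features: str
    model: str
    policy: str


registry = {
    "eta_v1": Release("eta_v1", "events_2026_04", "eta_features_v1", "delay_model_v1", "warning_v1"),
    # Hypothetical: the candidate changes both weights and threshold policy.
    "eta_v2": Release("eta_v2", "events_2026_05", "eta_features_v1", "delay_model_v2", "warning_v2"),
}
# Hypothetical append-only history of (alias, release name) moves.
alias_log = [("production", "eta_v1")]


def current(alias):
    """Latest release name recorded for an alias."""
    return next(name for a, name in reversed(alias_log) if a == alias)


def promote(alias, name):
    alias_log.append((alias, name))


def rollback(alias):
    """Re-point the alias at the release it held just before the current one."""
    history = [name for a, name in alias_log if a == alias]
    if len(history) < 2:
        raise RuntimeError(f"no earlier release recorded for {alias!r}")
    previous = history[-2]
    promote(alias, previous)  # logged as a new move; history is never rewritten
    return registry[previous]


promote("production", "eta_v2")
restored = rollback("production")
print(current("production"), restored.model, restored.policy)
```

Because the rollback is itself logged, a later reviewer can see that `eta_v2` was live, when it was withdrawn, and which complete tuple replaced it.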
Live checks differ by product, but the promotion controller handles the same categories:
| Gate type | ETA | Ranking | Forecast | Vision |
|---|---|---|---|---|
| immediate data health | scan freshness | eligible candidate supply | latest counts loaded | photo quality |
| immediate service health | latency/errors | scoring latency | forecast API availability | image scoring latency |
| delayed quality | late warning cost | purchase/return experiment | MAE and peak residual | reviewer-confirmed damage |
| rollback event | stale warning spike | blocked listing exposure | broken alert flood | unsupported escalations |
For scoring systems with delayed labels, canary monitoring should pause wider promotion until enough outcomes arrive. A model that hasn't failed yet is not the same as a model that has passed.
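The four monitor outcomes (pause, promote, abort, rollback) can be written as one decision over a canary window. A minimal sketch under assumed inputs: the `CanaryWindow` fields, the `min_labeled` default of 500, and the single cost comparison are hypothetical placeholders for each project's own gates from the table above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CanaryWindow:
    data_healthy: bool      # e.g. scan freshness, latest counts loaded, photo quality
    service_healthy: bool   # e.g. scoring latency and error rate within budget
    rollback_event: bool    # e.g. blocked-listing exposure or an alert flood
    labeled_outcomes: int   # delayed labels that have arrived so far
    candidate_cost: float   # delayed-quality cost measured on those labels
    production_cost: float  # same metric for the stable release


def canary_decision(window: CanaryWindow, min_labeled: int = 500) -> str:
    # Immediate failure or a product-specific rollback event: restore stable now.
    if window.rollback_event or not window.service_healthy:
        return "rollback"
    # Untrustworthy inputs or too few outcomes: hold traffic where it is.
    if not window.data_healthy or window.labeled_outcomes < min_labeled:
        return "pause"
    # Enough outcomes to compare delayed quality against the stable release.
    if window.candidate_cost < window.production_cost:
        return "promote"
    return "abort"  # close the canary; production never moved


base = CanaryWindow(data_healthy=True, service_healthy=True, rollback_event=False,
                    labeled_outcomes=120, candidate_cost=0.8, production_cost=1.0)
cases = {
    "early": base,
    "ready": replace(base, labeled_outcomes=800),
    "worse": replace(base, labeled_outcomes=800, candidate_cost=1.2),
    "spike": replace(base, labeled_outcomes=800, rollback_event=True),
}
decisions = {label: canary_decision(window) for label, window in cases.items()}
print(decisions)
```

The "early" window has a lower cost than production but still pauses: with only 120 labels it has not yet failed, which is different from having passed.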
Continuous training is appropriate when a schedule or a monitored condition triggers a candidate run. It must never skip data validation, offline comparison, or the promotion record. The pipeline's value is not automation alone; it is the refusal to accept untraceable changes.
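A scheduled or drift-triggered run can enter the same gated path as a manual one. The sketch below passes hypothetical step functions in as callables, and the snapshot name `events_2026_06` and the `late_partition` problem are invented for illustration. Every trigger yields a record, holds included, and nothing in the path writes an alias.

```python
from datetime import datetime, timezone


def run_candidate(trigger, snapshot, validate, train, evaluate):
    """Run one candidate through validation and evaluation, recording every outcome."""
    record = {
        "trigger": trigger,
        "snapshot": snapshot,
        "started_at": datetime.now(timezone.utc).isoformat(),
    }
    problems = validate(snapshot)
    if problems:
        # Invalid data never reaches training.
        record.update(status="hold", reason="data:" + ",".join(problems))
        return record
    candidate = train(snapshot)
    gates = evaluate(candidate)
    failed = sorted(gate for gate, passed in gates.items() if not passed)
    record.update(candidate=candidate, gates=gates)
    if failed:
        record.update(status="hold", reason="gates:" + ",".join(failed))
    else:
        # Eligible for canary only; production is not moved here.
        record["status"] = "canary_eligible"
    return record


def refuse_training(snapshot):
    raise AssertionError("training must not run on an invalid snapshot")


promotion_log = [
    run_candidate("schedule", "events_2026_05",
                  validate=lambda s: [],
                  train=lambda s: "delay_model_v2",
                  evaluate=lambda c: {"no_leakage": True, "critical_slice_pass": True}),
    run_candidate("drift_alert", "events_2026_06",
                  validate=lambda s: ["late_partition"],
                  train=refuse_training,
                  evaluate=lambda c: {}),
]
for record in promotion_log:
    print(record["trigger"], record["status"], record.get("reason", ""))
```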
| Artifact | Acceptance condition |
|---|---|
| release schema | identifies data, features, model, policy, gates |
| registry | contains immutable stable and candidate releases |
| evaluation report | records why each candidate passed or held |
| alias promotion code | never moves production on failed gates |
| monitor policy | defines canary pause, promote, abort, rollback |
| tests | execute hold and rollback paths |
This completes the conventional production ML portfolio. The next capstone returns to LLM products: document QA must apply the same lineage and release discipline to retrieved evidence and generated answers.
| Artifact | Strong submission demonstrates |
|---|---|
| reproducible run | versioned data, features, model artifact, threshold policy, and evaluation evidence |
| controlled promotion | candidate alias, automated gates, canary criteria, and explicit production move |
| recovery | monitoring tied to actions, rollback trigger, and deployable prior release |
| Symptom | Cause | Fix |
|---|---|---|
| Retrain job changes behavior with no review | training and promotion merged | separate candidate registry from aliases |
| Rollback restores weights but not threshold | policy omitted from release bundle | version complete release tuple |
| Canary promotes before outcomes exist | only latency checked | require delayed quality window |