Ship a delivery-delay prediction service with time-safe features, threshold gates, API contract, and drift evidence.
The production ML lessons gave you each component in isolation. This capstone packages them into a service another engineer can evaluate: given an in-transit order at a defined timestamp, estimate late-delivery risk and decide whether the product may show a proactive delay warning.
The product contract is intentionally narrow. The model doesn't promise an exact arrival minute, issue refunds, or change carrier routing. It returns a risk score with a controlled action: normal_tracking, warn_customer, or manual_review when inputs are unreliable.
Use one decision moment: two hours after carrier pickup. Use one label: whether delivery occurred after the promised date. A prediction stored without those definitions can't be replayed later.
| Contract field | Pinned value |
|---|---|
| prediction event | two hours after first carrier pickup |
| label | delivered after promised end-of-day |
| score output | late_risk between zero and one |
| displayed action | warn only when late_risk meets or exceeds the versioned threshold |
| unavailable data action | route to manual_review, no narrow ETA claim |
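The pinned prediction event and label can be written down as two small functions so that a stored prediction is replayable. This is a minimal sketch, not the course's reference implementation: the function names are hypothetical, and evaluating "end of day" in UTC is an assumption (a real contract would likely pin the destination's local timezone).

```python
from datetime import date, datetime, time, timedelta, timezone

# Pinned by the contract: score two hours after first carrier pickup.
PICKUP_OFFSET = timedelta(hours=2)


def prediction_time(first_pickup: datetime) -> datetime:
    """Return the single decision moment for an order."""
    return first_pickup + PICKUP_OFFSET


def is_late(delivered_at: datetime, promised_date: date, tz=timezone.utc) -> bool:
    """Label: delivered after the promised end-of-day.

    Assumption: end-of-day is evaluated in `tz` (UTC here for illustration).
    """
    promised_end = datetime.combine(promised_date, time.max, tzinfo=tz)
    return delivered_at > promised_end


pickup = datetime(2024, 5, 1, 9, 0, tzinfo=timezone.utc)
print(prediction_time(pickup))
print(is_late(datetime(2024, 5, 4, 0, 5, tzinfo=timezone.utc), date(2024, 5, 3)))
```

Keeping both definitions in code next to the feature contract means the replay test and the label job read the same constants.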
The feature bundle includes route distance, service tier, origin backlog, scan age, weekday, and carrier code. Every field must be reconstructed as of prediction time. The earlier feature-store lesson explained why point-in-time joins protect training from future scans; Feast documents this historical retrieval pattern for production feature data.[1]
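To make "as of prediction time" concrete, here is a small as-of lookup in plain Python. It is an illustrative sketch only and does not use Feast's API; the `as_of` helper and the sample scan events are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime


def as_of(history, cutoff):
    """Return the latest (timestamp, value) record at or before `cutoff`.

    `history` must be sorted by timestamp. Returns None when nothing
    was known yet, so the caller can apply the missing-value policy.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, cutoff)
    return history[idx - 1] if idx else None


# Hypothetical scan events for one order.
scans = [
    (datetime(2024, 5, 1, 9, 0), "picked_up"),
    (datetime(2024, 5, 1, 10, 30), "arrived_hub"),
    (datetime(2024, 5, 1, 14, 0), "departed_hub"),  # happens after prediction time
]
cutoff = datetime(2024, 5, 1, 11, 0)  # two hours after pickup

last_scan = as_of(scans, cutoff)
scan_age_hours = (cutoff - last_scan[0]).total_seconds() / 3600
print(last_scan[1], scan_age_hours)
```

The 14:00 scan is excluded because it postdates the cutoff; a replay test in tests/test_point_in_time_features.py could assert exactly that.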
Your repository artifact should contain this layout:
```
eta-prediction/
  data/
    feature_contract.json
    train_snapshot_manifest.json
  training/
    baseline.py
    train_booster.py
    evaluate_slices.py
  artifacts/
    delay_model_v1.json
    threshold_policy_v1.json
    metrics_v1.json
  service/
    api.py
    schemas.py
  monitoring/
    drift_window.py
  tests/
    test_point_in_time_features.py
    test_warning_gate.py
```
First fit a rule baseline such as hours_since_last_scan >= 18. Then fit the tree candidate using the same time-ordered splits. XGBoost is a defensible implementation for structured features because its boosted-tree system is designed for sparse, scalable tabular learning.[2] It still must beat the baseline on the exact action policy, not only on a model metric.
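Comparing the rule baseline and the candidate on the action policy can be sketched as an expected-cost calculation. Everything numeric here is an assumption for illustration: the two cost weights, the 0.40 threshold, and the six toy rows are hypothetical, not results from a trained model.

```python
# Hypothetical cost weights: a missed warning hurts more than a false one.
COST_FALSE_WARNING = 1.0
COST_MISSED_WARNING = 5.0


def expected_cost(warned, late):
    """Average cost per order for a list of warn decisions and true labels."""
    total = 0.0
    for did_warn, was_late in zip(warned, late):
        if did_warn and not was_late:
            total += COST_FALSE_WARNING
        elif was_late and not did_warn:
            total += COST_MISSED_WARNING
    return total / len(late)


# Toy validation rows: (hours_since_last_scan, late_risk, was_late).
rows = [
    (20, 0.70, True),
    (19, 0.20, False),
    (4, 0.55, True),
    (2, 0.10, False),
    (25, 0.45, True),
    (3, 0.42, False),
]
late = [r[2] for r in rows]
baseline_warn = [r[0] >= 18 for r in rows]     # rule baseline
candidate_warn = [r[1] >= 0.40 for r in rows]  # model score plus threshold

baseline_cost = expected_cost(baseline_warn, late)
candidate_cost = expected_cost(candidate_warn, late)
print(f"baseline={baseline_cost:.2f} candidate={candidate_cost:.2f}")
```

On these toy rows the baseline pays for one false warning and one missed warning, while the candidate pays for a single false warning. The point of the sketch is that both policies are scored by the same cost function on the same time-ordered rows.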
Required release rows:
| Gate | Requirement |
|---|---|
| no feature leakage | replay test excludes post-prediction scans |
| expedited shipments | no missed warning in required validation slice |
| expected warning cost | better than rule baseline |
| feature freshness | stale scan/backlog returns fallback |
| API schema | model, feature, and threshold versions emitted |
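The release rows above can be checked mechanically before promotion. The following is a sketch under assumptions: the `evidence` keys and the three version field names are hypothetical, and a real pipeline would populate them from the replay test, slice evaluation, and API schema test.

```python
REQUIRED_VERSION_FIELDS = {"model_version", "feature_version", "threshold_version"}


def release_decision(evidence):
    """Evaluate every release gate; any single failure blocks the release."""
    gates = {
        "no_feature_leakage": evidence["post_prediction_scans_in_replay"] == 0,
        "expedited_shipments": evidence["expedited_missed_warnings"] == 0,
        "expected_warning_cost": evidence["candidate_cost"] < evidence["baseline_cost"],
        "feature_freshness": evidence["stale_inputs_routed_to_fallback"],
        "api_schema": REQUIRED_VERSION_FIELDS <= set(evidence["response_fields"]),
    }
    failed = sorted(name for name, passed in gates.items() if not passed)
    return {"release": not failed, "failed_gates": failed}


# Hypothetical evidence gathered from evaluation and tests.
evidence = {
    "post_prediction_scans_in_replay": 0,
    "expedited_missed_warnings": 0,
    "candidate_cost": 0.17,
    "baseline_cost": 1.00,
    "stale_inputs_routed_to_fallback": True,
    "response_fields": ["action", "model_version", "feature_version", "threshold_version"],
}
print(release_decision(evidence))
```

Returning the list of failed gates, not only a boolean, gives the reviewer a concrete item to investigate.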
The service code below focuses on the boundary between a score and a customer-facing action. A real late_risk would be produced by the trained model artifact; this compact example tests the policy the service must preserve.
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    order_id: str
    late_risk: float
    scan_age_hours: float
    tier: str


POLICY = {"threshold": 0.40, "max_scan_age_hours": 24, "model": "delay_model_v1"}


def route(prediction):
    # Freshness is checked first: a stale scan overrides any model score.
    if prediction.scan_age_hours > POLICY["max_scan_age_hours"]:
        return {"action": "manual_review", "reason": "stale_features", "model": POLICY["model"]}
    if prediction.late_risk >= POLICY["threshold"]:
        return {"action": "warn_customer", "reason": "late_risk_threshold", "model": POLICY["model"]}
    return {"action": "normal_tracking", "reason": "below_threshold", "model": POLICY["model"]}


cases = [
    Prediction("O-201", 0.62, 5, "expedited"),
    Prediction("O-202", 0.25, 3, "standard"),
    Prediction("O-203", 0.81, 31, "standard"),
]
for case in cases:
    decision = route(case)  # route once, then print both fields
    print(case.order_id, decision["action"], decision["reason"])
```
```
O-201 warn_customer late_risk_threshold
O-202 normal_tracking below_threshold
O-203 manual_review stale_features
```
Case O-203 is the key design result. A high model score does not authorize messaging a customer when the supporting scan features are stale. The product needs a safe fallback that is independent of model confidence.
The deployment emits one row per score: request timestamp, feature version, model version, threshold version, feature freshness, score, action, and eventually the delivery label. Immediate monitoring catches nulls, stale scans, error rate, and score-distribution shift. Delayed monitoring computes missed-warning cost, calibration by score bucket, and slice performance.
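One way to quantify score-distribution shift for the immediate monitoring path is the population stability index (PSI). This is a swapped-in technique chosen for illustration, not something the lesson prescribes; the bin count, the epsilon floor, and the sample scores are assumptions, and any alert cutoff would need to be set by the team.

```python
import math


def psi(reference, live, bins=10, eps=1e-6):
    """Population stability index between two sets of scores in [0, 1]."""

    def proportions(scores):
        counts = [0] * bins
        for score in scores:
            # Clamp score == 1.0 into the last bin.
            counts[min(int(score * bins), bins - 1)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(count / len(scores), eps) for count in counts]

    ref = proportions(reference)
    cur = proportions(live)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Hypothetical score windows: validation-time scores versus a live window.
reference_scores = [0.10, 0.20, 0.30, 0.20, 0.10, 0.40]
live_scores = [0.70, 0.80, 0.90, 0.80, 0.70, 0.60]

print(psi(reference_scores, reference_scores))
print(psi(reference_scores, live_scores))
```

Identical windows give a PSI of zero and a shifted window gives a positive value. Because PSI needs no labels, it belongs with the immediate checks; missed-warning cost and calibration wait for the delayed delivery labels.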
Promotion should be an alias move from delay_model_v1 to a separately evaluated candidate. Google Cloud's MLOps guidance describes this separation between validation, metadata, serving, monitoring, and continuous training stages.[3] A triggered retraining job creates evidence; it doesn't silently rewrite live behavior.
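The alias move and its rollback can be sketched with a tiny in-memory registry. This is a hypothetical illustration: `AliasRegistry`, the alias name `delay_model_live`, and the candidate name `delay_model_v2` are assumptions, and a production system would use its model registry's own alias mechanism.

```python
class AliasRegistry:
    """Track which model artifact a serving alias points to."""

    def __init__(self, alias, initial_artifact):
        self.alias = alias
        self.history = [initial_artifact]

    @property
    def live(self):
        return self.history[-1]

    def promote(self, candidate, gates_passed):
        # Retraining only produces evidence; promotion is a separate, gated step.
        if not gates_passed:
            raise ValueError(f"{candidate} has not passed the release gates")
        self.history.append(candidate)

    def rollback(self):
        if len(self.history) < 2:
            raise RuntimeError("no prior artifact to roll back to")
        self.history.pop()
        return self.live


registry = AliasRegistry("delay_model_live", "delay_model_v1")
registry.promote("delay_model_v2", gates_passed=True)  # hypothetical candidate
print(registry.live)
print(registry.rollback())
```

Because the prior artifact stays in the history, rollback is a pointer move and does not require retraining.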
| Artifact | Reviewer should verify |
|---|---|
| feature contract | every field has type, timestamp boundary, and missing policy |
| training manifest | time split and dataset fingerprint exist |
| baseline comparison | candidate improves declared cost without required-slice misses |
| service API | stale inputs fail to a safer route |
| monitoring plan | input checks and delayed label metrics are distinct |
| rollback plan | prior artifact and threshold remain deployable |
Submission rubric:
| Artifact | Strong submission demonstrates |
|---|---|
| model package | time-safe feature contract, baseline comparison, and calibrated warning threshold |
| service | versioned response trace and safe behavior for missing or stale scans |
| operations | input monitoring, delayed-label evaluation, candidate promotion, and rollback |
Troubleshooting:
| Symptom | Cause | Fix |
|---|---|---|
| Warning appears accurate offline but misses live disruptions | future scans leaked into training | enforce as-of tests |
| Customer receives unsupported ETA warning | service trusts score despite stale inputs | gate freshness before action |
| Team can't reproduce a warning | artifact versions absent from response trace | log full release tuple |