Capstone: Delivery ETA Prediction

Ship a delivery-delay warning service with as-of features, versioned policy gates, baseline evidence, and monitored fallback.

The production ML lessons gave you each component in isolation. This capstone packages them into a service another engineer can evaluate: given an in-transit order at a defined timestamp, estimate late-delivery risk and decide whether the product may show a proactive delay warning.

The product contract is intentionally narrow. The model doesn't promise an exact arrival minute, issue refunds, or change carrier routing. It returns a risk score with a controlled action: normal_tracking, warn_customer, or manual_review when inputs are unreliable.

Figure: Point-in-time evidence and release gate for delivery order `O-201`. A noon prediction admits the in-transit scan that occurred at 11:00 and arrived at 11:03, rejects a hub scan that occurred at 11:30 but arrived at 12:30, and rejects a future carrier-delay event at 15:00. Fresh versioned features produce risk 0.62 above threshold 0.40, so the service warns the customer. The release receipt lowers fixture cost from 150 to 8, has zero expedited misses, passes 12 of 12 gates, and advances only to shadow traffic.

Caption: The noon replay admits only evidence that both occurred and arrived before scoring. `O-201` passes freshness and threshold checks, while the release receipt proves the candidate is ready for shadow traffic, not automatic production promotion.

Define the Contract Before Choosing a Model

Use one decision moment: two hours after carrier pickup. Use one label: whether delivery occurred after the promised date. A prediction stored without those definitions can't be replayed later.

Contract field	Pinned value
prediction event	two hours after first carrier pickup
label	delivered after promised end-of-day
score output	`late_risk` between zero and one
displayed action	warn only when threshold passes
unavailable data action	route to `manual_review`, no narrow ETA claim

The feature bundle includes route distance, service tier, origin backlog, scan age, weekday, and carrier code. Every field must be reconstructed from data that both occurred and arrived by prediction time. The earlier pipeline lesson explained why a point-in-time replay needs `event_time <= prediction_time` and `ingested_at <= prediction_time`; Feast documents point-in-time correct historical retrieval for production feature data [1].

Diagram: Carrier events as of pickup + 2h, Feature bundle v1 freshness checked, Delay model v1 risk score, and Policy gate evidence + score.

Establish Baseline and Release Evidence

Your repository artifact should contain this layout:

eta-prediction/
  data/
    feature_contract.json
    train_snapshot_manifest.json
  training/
    baseline.py
    train_booster.py
    evaluate_slices.py
  artifacts/
    delay_model_v1.json
    threshold_policy_v1.json
    metrics_v1.json
  service/
    api.py
    schemas.py
  monitoring/
    drift_window.py
  tests/
    test_point_in_time_features.py
    test_warning_gate.py

First fit a rule baseline such as `hours_since_last_scan >= 18`. Then fit the tree candidate using the same time-ordered train, validation, and test splits. XGBoost is a defensible implementation for structured features because its boosted-tree system is designed for sparse, scalable tabular learning [2]. It still must beat the baseline on the exact action policy, not a model metric alone.
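
A time-ordered split and the rule baseline can be sketched without any model library. The block below is an illustration under stated assumptions: `LabeledRow`, `time_ordered_split`, the cut dates, and the ten fixture rows are hypothetical and do not come from the lesson's dataset. It cuts by timestamp so that no validation or test row predates a training row.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass(frozen=True)
class LabeledRow:
    order_id: str
    prediction_at: datetime
    hours_since_last_scan: float
    delivered_late: bool

def time_ordered_split(rows, val_start: datetime, test_start: datetime):
    """Cut by prediction time so later rows never leak into earlier splits."""
    train = [row for row in rows if row.prediction_at < val_start]
    val = [row for row in rows if val_start <= row.prediction_at < test_start]
    test = [row for row in rows if row.prediction_at >= test_start]
    return train, val, test

def rule_baseline_warns(row: LabeledRow) -> bool:
    """Rule baseline from the text: warn when the last scan is at least 18 hours old."""
    return row.hours_since_last_scan >= 18

# Made-up fixtures: one shipment per day from 1 April to 10 April 2026.
start = datetime(2026, 4, 1, 12, tzinfo=timezone.utc)
fixtures = [
    (3, False), (20, True), (5, False), (19, True), (2, False),
    (22, True), (4, True), (6, False), (25, True), (1, False),
]
rows = [
    LabeledRow(f"R-{index:02d}", start + timedelta(days=index), hours, late)
    for index, (hours, late) in enumerate(fixtures)
]

train, val, test = time_ordered_split(
    rows,
    val_start=datetime(2026, 4, 7, tzinfo=timezone.utc),
    test_start=datetime(2026, 4, 9, tzinfo=timezone.utc),
)
print("sizes:", len(train), len(val), len(test))
print("baseline warns on test:", [row.order_id for row in test if rule_baseline_warns(row)])
```

Cutting on explicit timestamps rather than row fractions keeps the boundary stable when the snapshot grows, which makes the split easier to record in a manifest.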

Required release rows:

Gate	Requirement
no feature leakage	replay test excludes post-prediction and late-arriving scans
expedited shipments	no missed warning in required validation slice
expected warning cost	better than rule baseline
feature freshness	stale scan/backlog returns fallback
API schema	model, feature, threshold, and freshness trace emitted

Prove the Snapshot Uses Past Events Only

The first portfolio receipt should prove the timestamp rule with actual events. Order O-201 has one scan that occurred and arrived before the prediction moment, one pre-prediction scan that arrived late, and one scan that occurs three hours after the prediction moment. The feature builder must select only the scan known at prediction time.

01-scan-event-contract.py

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from math import isfinite
import json

BASE_TIME = datetime(2026, 5, 1, 10, tzinfo=timezone.utc)

def at(hours: float) -> datetime:
    return BASE_TIME + timedelta(hours=hours)

@dataclass(frozen=True)
class ScanEvent:
    order_id: str
    event_time: datetime
    ingested_at: datetime
    status: str

@dataclass(frozen=True)
class FeatureRow:
    order_id: str
    prediction_at: datetime
    scan_age_hours: float | None
    backlog_age_hours: float | None
    tier: str
    feature_contract_id: str = "eta-features-v1"

@dataclass(frozen=True)
class ScoredRow:
    features: FeatureRow
    late_risk: float

SCAN_EVENTS = [
    ScanEvent("O-201", at(1), at(1.05), "in_transit"),
    ScanEvent("O-201", at(1.5), at(2.5), "hub_scan"),
    ScanEvent("O-201", at(5), at(5.05), "carrier_delay_posted"),
]

print("scan_events:", len(SCAN_EVENTS))
print("statuses:", [event.status for event in SCAN_EVENTS])

Output

scan_events: 3
statuses: ['in_transit', 'hub_scan', 'carrier_delay_posted']

02-as-of-admission-filter.py

def scans_known_by(order_id: str, prediction_at: datetime) -> list[ScanEvent]:
    # Admit a scan only if it both occurred and was ingested by prediction time.
    return [
        event for event in SCAN_EVENTS
        if (
            event.order_id == order_id
            and event.event_time <= prediction_at
            and event.ingested_at <= prediction_at
        )
    ]

def latest_scan_known_by(order_id: str, prediction_at: datetime) -> ScanEvent:
    admitted = scans_known_by(order_id, prediction_at)
    if not admitted:
        raise ValueError("no scan available at prediction time")
    # Most recent event wins; ingestion time breaks ties between equal event times.
    return max(admitted, key=lambda event: (event.event_time, event.ingested_at))

prediction_at = at(2)
admitted_scan_statuses = {event.status for event in scans_known_by("O-201", prediction_at)}
selected_scan = latest_scan_known_by("O-201", prediction_at)
print("prediction_at:", prediction_at.isoformat())
print("admitted:", sorted(admitted_scan_statuses))
print("selected:", selected_scan.status)

03-build-feature-row.py

feature_row = FeatureRow(
    order_id="O-201",
    prediction_at=prediction_at,
    scan_age_hours=(prediction_at - selected_scan.event_time).total_seconds() / 3600,
    backlog_age_hours=2,
    tier="expedited",
)
feature_score = ScoredRow(feature_row, late_risk=0.62)

assert selected_scan.status == "in_transit"
assert admitted_scan_statuses == {"in_transit"}
assert feature_row.scan_age_hours == 1
print("selected_scan:", selected_scan.status, selected_scan.event_time.isoformat(), selected_scan.ingested_at.isoformat())
print("ignored_late_arrival:", SCAN_EVENTS[1].status, SCAN_EVENTS[1].event_time.isoformat(), SCAN_EVENTS[1].ingested_at.isoformat())
print("ignored_future_scan:", SCAN_EVENTS[2].status, SCAN_EVENTS[2].event_time.isoformat(), SCAN_EVENTS[2].ingested_at.isoformat())

Output

selected_scan: in_transit 2026-05-01T11:00:00+00:00 2026-05-01T11:03:00+00:00
ignored_late_arrival: hub_scan 2026-05-01T11:30:00+00:00 2026-05-01T12:30:00+00:00
ignored_future_scan: carrier_delay_posted 2026-05-01T15:00:00+00:00 2026-05-01T15:03:00+00:00

The future scan is useful when the true outcome arrives, but it can't help a model that scores at noon. Neither can hub_scan: it occurred before noon but arrived afterward. Keeping this test near the feature builder makes both forms of leakage visible before model training begins.
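
To make the two forms of leakage distinguishable in a test report, each scan can be labeled with the reason it was admitted or rejected. This is a self-contained sketch that reuses the O-201 timestamps from the output above; the `Scan` tuple, `admission_reason`, and the reason strings are hypothetical names, not part of the lesson's harness.

```python
from datetime import datetime, timezone
from typing import NamedTuple

class Scan(NamedTuple):
    status: str
    event_time: datetime
    ingested_at: datetime

def admission_reason(scan: Scan, prediction_at: datetime) -> str:
    """Classify a scan against the as-of rule; boundary timestamps are admitted."""
    if scan.event_time > prediction_at:
        return "future_event"
    if scan.ingested_at > prediction_at:
        return "late_arrival"
    return "admitted"

def utc(hour: int, minute: int = 0) -> datetime:
    return datetime(2026, 5, 1, hour, minute, tzinfo=timezone.utc)

scans = [
    Scan("in_transit", utc(11), utc(11, 3)),
    Scan("hub_scan", utc(11, 30), utc(12, 30)),
    Scan("carrier_delay_posted", utc(15), utc(15, 3)),
]
noon = utc(12)
audit = {scan.status: admission_reason(scan, noon) for scan in scans}
print(audit)
```

A replay test can then assert on the reasons as well as on the admitted set, so a regression that starts admitting late arrivals is reported differently from one that admits future events.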

Route Scores Through a Versioned Policy

A real late_risk comes from the trained model artifact. The boundary below focuses on what happens after scoring, so the harness keeps selected features separate from frozen scored output. The response always emits a versioned trace: model, observed and expected feature contracts, threshold policy, prediction time, and freshness fields. An operator can replay why a customer saw a warning or why a mismatched request fell back.

Freshness is part of the action contract. Both the latest carrier scan and the origin-backlog snapshot must be present, finite, non-negative, and recent enough. A missing contract version, invalid score, or invalid feature routes to review before the score can trigger a customer-facing message.

04-release-policy-contract.py

@dataclass(frozen=True)
class ReleasePolicy:
    threshold: float
    max_scan_age_hours: float
    max_backlog_age_hours: float
    model_id: str
    feature_contract_id: str
    threshold_policy_id: str

POLICY = ReleasePolicy(
    threshold=0.40,
    max_scan_age_hours=24,
    max_backlog_age_hours=8,
    model_id="delay-model-v1",
    feature_contract_id="eta-features-v1",
    threshold_policy_id="eta-threshold-v1",
)

def score(
    order_id: str,
    scan_age_hours: float | None,
    backlog_age_hours: float | None,
    tier: str,
    late_risk: float,
    feature_contract_id: str = "eta-features-v1",
) -> ScoredRow:
    return ScoredRow(
        FeatureRow(order_id, prediction_at, scan_age_hours, backlog_age_hours, tier, feature_contract_id),
        late_risk,
    )

def response(scored: ScoredRow, action: str, reason: str) -> dict[str, object]:
    row = scored.features
    return {
        "order_id": row.order_id,
        "action": action,
        "reason": reason,
        "late_risk": scored.late_risk,
        "prediction_at": row.prediction_at.isoformat(),
        "scan_age_hours": row.scan_age_hours,
        "backlog_age_hours": row.backlog_age_hours,
        "model_id": POLICY.model_id,
        "feature_contract_id": row.feature_contract_id,
        "expected_feature_contract_id": POLICY.feature_contract_id,
        "threshold_policy_id": POLICY.threshold_policy_id,
    }

print("policy:", POLICY.threshold_policy_id, "threshold=", POLICY.threshold)

05-freshness-and-route.py

def invalid_age(value: float | None) -> bool:
    # Missing, non-finite, or negative ages can't be compared against a freshness limit.
    return value is None or not isfinite(value) or value < 0

def input_issue(row: FeatureRow) -> str | None:
    # Check the contract version first, then validity, then staleness.
    if row.feature_contract_id != POLICY.feature_contract_id:
        return "feature_contract_mismatch"
    if invalid_age(row.scan_age_hours):
        return "invalid_scan_age"
    if invalid_age(row.backlog_age_hours):
        return "invalid_backlog_age"
    if row.scan_age_hours > POLICY.max_scan_age_hours:
        return "stale_scan_features"
    if row.backlog_age_hours > POLICY.max_backlog_age_hours:
        return "stale_backlog_features"
    return None

def route(scored: ScoredRow) -> dict[str, object]:
    # Input problems and malformed scores go to review before any threshold comparison.
    issue = input_issue(scored.features)
    if issue is not None:
        return response(scored, "manual_review", issue)
    if not isfinite(scored.late_risk) or not 0 <= scored.late_risk <= 1:
        return response(scored, "manual_review", "invalid_late_risk")
    if scored.late_risk >= POLICY.threshold:
        return response(scored, "warn_customer", "late_risk_threshold")
    return response(scored, "normal_tracking", "below_threshold")

print("O-201 route:", route(feature_score)["action"], route(feature_score)["reason"])

06-policy-case-matrix.py

policy_cases = [
    feature_score,
    score("O-202", 3, 2, "standard", 0.25),
    score("O-203", 31, 2, "standard", 0.81),
    score("O-204", 4, 12, "standard", 0.75),
    score("O-205", 4, 2, "standard", float("nan")),
    score("O-206", 4, None, "standard", 0.75),
    score("O-207", float("nan"), 2, "standard", 0.75),
    score("O-208", -1, 2, "standard", 0.75),
    score("O-209", 4, 2, "standard", 0.75, "eta-features-v0"),
]

for scored in policy_cases:
    result = route(scored)
    print(scored.features.order_id, result["action"], result["reason"])

trace = route(feature_score)
print("release_tuple:", trace.get("feature_contract_id"), trace.get("model_id"), trace.get("threshold_policy_id"))

Output

O-201 warn_customer late_risk_threshold
O-202 normal_tracking below_threshold
O-203 manual_review stale_scan_features
O-204 manual_review stale_backlog_features
O-205 manual_review invalid_late_risk
O-206 manual_review invalid_backlog_age
O-207 manual_review invalid_scan_age
O-208 manual_review invalid_scan_age
O-209 manual_review feature_contract_mismatch
release_tuple: eta-features-v1 delay-model-v1 eta-threshold-v1

Orders O-203 and O-204 are key design results. High model scores aren't authority to message a customer when the evidence is stale. Orders O-205 through O-209 show the same rule for malformed output, missing freshness, impossible ages, and version mismatch: unreliable evidence reaches review, not a narrow ETA claim.

Publish Evidence Against the Rule Baseline

The release gate must test the action policy as well as the fitted score. The holdout below uses later shipments with known outcomes. A missed expedited warning costs 150 fixture units, a standard miss costs 60, and a false warning costs 8. These are local teaching values, not universal business constants.

07-holdout-warning-cost.py

@dataclass(frozen=True)
class HoldoutCase:
    row: ScoredRow
    delivered_late: bool

holdout = [
    HoldoutCase(score("E-301", 5, 1, "expedited", 0.78), True),
    HoldoutCase(score("E-302", 20, 2, "standard", 0.64), True),
    HoldoutCase(score("E-303", 4, 1, "standard", 0.12), False),
    HoldoutCase(score("E-304", 3, 2, "standard", 0.58), False),
]

def warning_cost(case: HoldoutCase, warn: bool) -> int:
    if warn and not case.delivered_late:
        return 8
    if not warn and case.delivered_late:
        return 150 if case.row.features.tier == "expedited" else 60
    return 0

def baseline_warn(case: HoldoutCase) -> bool:
    row = case.row.features
    return input_issue(row) is None and row.scan_age_hours is not None and row.scan_age_hours >= 18

def candidate_warn(case: HoldoutCase) -> bool:
    return route(case.row)["action"] == "warn_customer"

print("holdout rows:", len(holdout))
print("baseline warns:", [case.row.features.order_id for case in holdout if baseline_warn(case)])

08-baseline-vs-candidate-cost.py

baseline_cost = sum(warning_cost(case, baseline_warn(case)) for case in holdout)
candidate_cost = sum(warning_cost(case, candidate_warn(case)) for case in holdout)
expedited_misses = sum(
    case.row.features.tier == "expedited" and case.delivered_late and not candidate_warn(case)
    for case in holdout
)
fallback_reasons = {row.features.order_id: route(row)["reason"] for row in policy_cases[2:]}
print("baseline_cost:", baseline_cost, "candidate_cost:", candidate_cost, "expedited_misses:", expedited_misses)

09-release-gate-checklist.py

trace_keys = {"feature_contract_id", "expected_feature_contract_id", "model_id", "threshold_policy_id"}
freshness_trace_keys = {"prediction_at", "scan_age_hours", "backlog_age_hours"}

release_gates = {
    "replay_excludes_unavailable_scans": admitted_scan_statuses == {"in_transit"},
    "lower_cost_than_rule_baseline": candidate_cost < baseline_cost,
    "zero_expedited_misses": expedited_misses == 0,
    "stale_scan_falls_back": fallback_reasons["O-203"] == "stale_scan_features",
    "stale_backlog_falls_back": fallback_reasons["O-204"] == "stale_backlog_features",
    "invalid_score_falls_back": fallback_reasons["O-205"] == "invalid_late_risk",
    "missing_backlog_falls_back": fallback_reasons["O-206"] == "invalid_backlog_age",
    "invalid_scan_age_falls_back": fallback_reasons["O-207"] == "invalid_scan_age",
    "future_scan_age_falls_back": fallback_reasons["O-208"] == "invalid_scan_age",
    "feature_contract_mismatch_falls_back": fallback_reasons["O-209"] == "feature_contract_mismatch",
    "versioned_trace": all(trace.get(key) for key in trace_keys),
    "freshness_trace": freshness_trace_keys.issubset(trace),
}

print("release_gates_pass:", all(release_gates.values()))

10-publish-release-receipt.py

receipt = {
    "bundle_id": "delivery-risk-v1",
    "evaluation_snapshot": "eta-holdout-2026-05",
    "previous_bundle": "delivery-risk-v0",
    "baseline_cost": baseline_cost,
    "candidate_cost": candidate_cost,
    "expedited_misses": expedited_misses,
    "release_gates": release_gates,
    "candidate_decision": "candidate_for_shadow" if all(release_gates.values()) else "hold",
}

print(json.dumps(receipt, indent=2))

Output

{
  "bundle_id": "delivery-risk-v1",
  "evaluation_snapshot": "eta-holdout-2026-05",
  "previous_bundle": "delivery-risk-v0",
  "baseline_cost": 150,
  "candidate_cost": 8,
  "expedited_misses": 0,
  "release_gates": {
    "replay_excludes_unavailable_scans": true,
    "lower_cost_than_rule_baseline": true,
    "zero_expedited_misses": true,
    "stale_scan_falls_back": true,
    "stale_backlog_falls_back": true,
    "invalid_score_falls_back": true,
    "missing_backlog_falls_back": true,
    "invalid_scan_age_falls_back": true,
    "future_scan_age_falls_back": true,
    "feature_contract_mismatch_falls_back": true,
    "versioned_trace": true,
    "freshness_trace": true
  },
  "candidate_decision": "candidate_for_shadow"
}

The receipt doesn't claim broad production readiness. It proves one immutable candidate deserves shadow traffic: its feature snapshot excludes unavailable events, its policy tuple and freshness fields are visible, its cost beats the rule baseline on held-out fixtures, and its required fallback paths execute.
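
The cost figures in the receipt depend on the 0.40 threshold, and it can help to see how the same four holdout fixtures respond to other values. The sketch below is illustrative only: it copies the scores, tiers, outcomes, and cost units from the holdout script, skips the freshness gate because all four fixtures are fresh, and uses a hypothetical `policy_cost` helper. Four rows are far too few to choose a threshold, and in practice the threshold would be tuned on validation data, not on the holdout.

```python
# (order_id, tier, late_risk, delivered_late), copied from the holdout fixtures.
ROWS = [
    ("E-301", "expedited", 0.78, True),
    ("E-302", "standard", 0.64, True),
    ("E-303", "standard", 0.12, False),
    ("E-304", "standard", 0.58, False),
]
MISS_COST = {"expedited": 150, "standard": 60}
FALSE_WARNING_COST = 8

def policy_cost(threshold: float) -> int:
    """Total fixture cost if every row at or above the threshold gets a warning."""
    total = 0
    for _order_id, tier, late_risk, delivered_late in ROWS:
        warn = late_risk >= threshold
        if warn and not delivered_late:
            total += FALSE_WARNING_COST
        elif not warn and delivered_late:
            total += MISS_COST[tier]
    return total

for threshold in (0.40, 0.60, 0.70, 0.80):
    print(f"threshold={threshold:.2f} cost={policy_cost(threshold)}")
```

On these fixtures the cost is 8 at 0.40, 0 at 0.60, 60 at 0.70, and 210 at 0.80. The jump at 0.80 comes from missing the expedited order E-301, which is why the expedited-miss gate is checked separately from total cost.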

Operate the Service After Release

The deployment emits one row per score: request timestamp, feature version, model version, threshold version, feature freshness, score, action, and eventually the delivery label. Immediate monitoring catches nulls, stale scans, error rate, and score-distribution shift. Delayed monitoring computes missed-warning cost, calibration by score bucket, and slice performance.

Promotion should move a production alias from delivery-risk-v0 to separately evaluated delivery-risk-v1. Google Cloud's MLOps guidance describes this separation between validation, metadata, serving, monitoring, and continuous training stages [3]. A triggered retraining job creates evidence; it doesn't silently rewrite live behavior. Keep rollback available by retaining the prior alias target.
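
Alias promotion with a retained rollback target can be illustrated with an in-memory stand-in. This sketch does not model any real registry product: `AliasRegistry` and the `approved_for_production` decision value are hypothetical, introduced only to show that a shadow-only receipt should not move the production alias.

```python
class AliasRegistry:
    """In-memory stand-in for a model registry alias (hypothetical, not a real library)."""

    def __init__(self, alias: str, target: str) -> None:
        self.alias = alias
        self.target = target
        self.previous: str | None = None

    def promote(self, candidate: str, receipt: dict) -> None:
        if receipt.get("bundle_id") != candidate:
            raise ValueError("receipt does not describe this candidate")
        # Assumed decision value; the lesson's receipt only reaches candidate_for_shadow.
        if receipt.get("candidate_decision") != "approved_for_production":
            raise ValueError("candidate has no production approval")
        # Retain the old target so rollback stays a single alias move.
        self.previous, self.target = self.target, candidate

    def rollback(self) -> None:
        if self.previous is None:
            raise RuntimeError("no prior target retained")
        self.target, self.previous = self.previous, None

registry = AliasRegistry("production", "delivery-risk-v0")
shadow_receipt = {"bundle_id": "delivery-risk-v1", "candidate_decision": "candidate_for_shadow"}

try:
    registry.promote("delivery-risk-v1", shadow_receipt)
except ValueError as error:
    print("blocked:", error)

production_receipt = {**shadow_receipt, "candidate_decision": "approved_for_production"}
registry.promote("delivery-risk-v1", production_receipt)
print("after promote:", registry.target)

registry.rollback()
print("after rollback:", registry.target)
```

Because promotion only swaps the alias target, the v0 artifact and its threshold policy stay deployable, and rollback does not require retraining or rebuilding anything.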

Submission checklist

Artifact	Reviewer should verify
feature contract	every field has type, timestamp boundary, and missing policy
training manifest	time split and dataset fingerprint exist
baseline comparison	candidate improves declared cost without required-slice misses
service API	stale inputs fail to a safer route
monitoring plan	input checks and delayed label metrics are distinct
rollback plan	prior artifact and threshold remain deployable
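
A reviewer can check the first checklist row mechanically. The sketch below assumes a possible shape for `feature_contract.json`, shown as a Python dict: the field names follow the feature bundle listed earlier, but the key names, types, timestamp boundaries, and missing policies are illustrative guesses, not the lesson's actual contract.

```python
REQUIRED_KEYS = {"type", "timestamp_boundary", "missing_policy"}

# Assumed contract shape; every value below is a placeholder for illustration.
FEATURE_CONTRACT = {
    "contract_id": "eta-features-v1",
    "fields": {
        "route_distance_km": {"type": "float", "timestamp_boundary": "prediction_time", "missing_policy": "manual_review"},
        "service_tier": {"type": "str", "timestamp_boundary": "prediction_time", "missing_policy": "manual_review"},
        "backlog_age_hours": {"type": "float", "timestamp_boundary": "prediction_time", "missing_policy": "manual_review"},
        "scan_age_hours": {"type": "float", "timestamp_boundary": "prediction_time", "missing_policy": "manual_review"},
        "weekday": {"type": "int", "timestamp_boundary": "prediction_time", "missing_policy": "manual_review"},
        "carrier_code": {"type": "str", "timestamp_boundary": "prediction_time", "missing_policy": "manual_review"},
    },
}

def contract_problems(contract: dict) -> list[str]:
    """List every field that lacks a type, timestamp boundary, or missing policy."""
    problems = []
    for name, spec in contract.get("fields", {}).items():
        for key in sorted(REQUIRED_KEYS - set(spec)):
            problems.append(f"{name}: missing {key}")
    return problems

# A deliberately broken copy: scan_age_hours loses its missing policy.
broken = {
    "contract_id": "eta-features-v1",
    "fields": {
        **FEATURE_CONTRACT["fields"],
        "scan_age_hours": {"type": "float", "timestamp_boundary": "prediction_time"},
    },
}

print("valid contract problems:", contract_problems(FEATURE_CONTRACT))
print("broken contract problems:", contract_problems(broken))
```

Running a check like this in CI turns the checklist row into a failing test instead of a manual review step.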

Practice: break the release contract

Use the runnable examples as a small release harness. Change one input at a time, predict the result, then rerun the examples.

1. Move hub_scan ingestion from at(2.5) to at(1.75). Which scan should the as-of builder select?
2. Change O-204 backlog age from 12 to 7. Which action replaces manual_review?
3. Replace float("nan") with 1.4 for O-205. Why should the service still refuse the score?
4. Raise threshold from 0.40 to 0.80. Which expedited gate fails?
5. Remove threshold_policy_id from response(). Which release gate catches the incomplete trace?

Mastery check

Evaluation rubric

Artifact	Strong submission demonstrates
model package	time-safe feature contract, baseline comparison, and calibrated warning threshold
service	versioned response trace and safe behavior for missing or stale scans
operations	input monitoring, delayed-label evaluation, candidate promotion, and rollback

Common failures

Symptom	Cause	Fix
Warning appears accurate offline but misses live disruptions	future or late-arriving scans leaked into training	enforce replay tests on event and ingestion time
Customer receives unsupported ETA warning	service trusts score despite stale inputs	gate freshness before action
Missing or impossible freshness reaches threshold logic	comparisons assume valid numeric ages	reject null, non-finite, and negative ages
Customer receives warning from malformed score	serving boundary assumes output is finite and bounded	validate score before thresholding
Team can't reproduce a warning	artifact versions absent from response trace	log full release tuple

Next Step

Continue to Capstone: Product Ranking

You have shipped one prediction service with time-safe features and release gates. Next you'll ship a ranked marketplace surface whose exposures must be measured as carefully as its scores.

References

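[1] Feast Contributors. Feast: Production Feature Store for Machine Learning. 2024. https://feast.dev/

[2] Chen, T. & Guestrin, C. XGBoost: A Scalable Tree Boosting System. KDD 2016. https://www.kdd.org/kdd2016/papers/files/rfp0697-chenAemb.pdf

[3] Google Cloud. MLOps: Continuous Delivery and Automation Pipelines in Machine Learning. Official documentation, 2026. https://docs.cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning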

Discussion

Questions and insights from fellow learners.

Discussion loads when you reach this section.

Back to Topics

LearnPortfolio CapstonesCapstone: Delivery ETA Prediction

⚙️HardMLOps & Deployment

Capstone: Delivery ETA Prediction

Ship a delivery-delay warning service with as-of features, versioned policy gates, baseline evidence, and monitored fallback.

14 min read

Learning path

Step 79 of 158 in the full curriculum

Design an Automated Support Agent Capstone: Product Ranking

Define the Contract Before Choosing a Model

Use one decision moment: two hours after carrier pickup. Use one label: whether delivery occurred after the promised date. A prediction stored without those definitions can't be replayed later.

Contract field	Pinned value
prediction event	two hours after first carrier pickup
label	delivered after promised end-of-day
score output	`late_risk` between zero and one
displayed action	warn only when threshold passes
unavailable data action	route to `manual_review`, no narrow ETA claim

Establish Baseline and Release Evidence

Your repository artifact should contain this layout:

text

eta-prediction/
  data/
    feature_contract.json
    train_snapshot_manifest.json
  training/
    baseline.py
    train_booster.py
    evaluate_slices.py
  artifacts/
    delay_model_v1.json
    threshold_policy_v1.json
    metrics_v1.json
  service/
    api.py
    schemas.py
  monitoring/
    drift_window.py
  tests/
    test_point_in_time_features.py
    test_warning_gate.py

Required release rows:

Gate	Requirement
no feature leakage	replay test excludes post-prediction and late-arriving scans
expedited shipments	no missed warning in required validation slice
expected warning cost	better than rule baseline
feature freshness	stale scan/backlog returns fallback
API schema	model, feature, threshold, and freshness trace emitted

Prove the Snapshot Uses Past Events Only

01-scan-event-contract.py

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from math import isfinite
import json

BASE_TIME = datetime(2026, 5, 1, 10, tzinfo=timezone.utc)

def at(hours: float) -> datetime:
    return BASE_TIME + timedelta(hours=hours)

@dataclass(frozen=True)
class ScanEvent:
    order_id: str
    event_time: datetime
    ingested_at: datetime
    status: str

@dataclass(frozen=True)
class FeatureRow:
    order_id: str
    prediction_at: datetime
    scan_age_hours: float | None
    backlog_age_hours: float | None
    tier: str
    feature_contract_id: str = "eta-features-v1"

@dataclass(frozen=True)
class ScoredRow:
    features: FeatureRow
    late_risk: float

SCAN_EVENTS = [
    ScanEvent("O-201", at(1), at(1.05), "in_transit"),
    ScanEvent("O-201", at(1.5), at(2.5), "hub_scan"),
    ScanEvent("O-201", at(5), at(5.05), "carrier_delay_posted"),
]

print("scan_events:", len(SCAN_EVENTS))
print("statuses:", [event.status for event in SCAN_EVENTS])

Output

scan_events: 3
statuses: ['in_transit', 'hub_scan', 'carrier_delay_posted']

02-as-of-admission-filter.py

def scans_known_by(order_id: str, prediction_at: datetime) -> list[ScanEvent]:
    return [
        event for event in SCAN_EVENTS
        if (
            event.order_id == order_id
            and event.event_time <= prediction_at
            and event.ingested_at <= prediction_at
        )
    ]

def latest_scan_known_by(order_id: str, prediction_at: datetime) -> ScanEvent:
    admitted = scans_known_by(order_id, prediction_at)
    if not admitted:
        raise ValueError("no scan available at prediction time")
    return max(admitted, key=lambda event: (event.event_time, event.ingested_at))

prediction_at = at(2)
admitted_scan_statuses = {event.status for event in scans_known_by("O-201", prediction_at)}
selected_scan = latest_scan_known_by("O-201", prediction_at)
print("prediction_at:", prediction_at.isoformat())
print("admitted:", sorted(admitted_scan_statuses))
print("selected:", selected_scan.status)

03-build-feature-row.py

feature_row = FeatureRow(
    order_id="O-201",
    prediction_at=prediction_at,
    scan_age_hours=(prediction_at - selected_scan.event_time).total_seconds() / 3600,
    backlog_age_hours=2,
    tier="expedited",
)
feature_score = ScoredRow(feature_row, late_risk=0.62)

assert selected_scan.status == "in_transit"
assert admitted_scan_statuses == {"in_transit"}
assert feature_row.scan_age_hours == 1
print("selected_scan:", selected_scan.status, selected_scan.event_time.isoformat(), selected_scan.ingested_at.isoformat())
print("ignored_late_arrival:", SCAN_EVENTS[1].status, SCAN_EVENTS[1].event_time.isoformat(), SCAN_EVENTS[1].ingested_at.isoformat())
print("ignored_future_scan:", SCAN_EVENTS[2].status, SCAN_EVENTS[2].event_time.isoformat(), SCAN_EVENTS[2].ingested_at.isoformat())

Output

selected_scan: in_transit 2026-05-01T11:00:00+00:00 2026-05-01T11:03:00+00:00
ignored_late_arrival: hub_scan 2026-05-01T11:30:00+00:00 2026-05-01T12:30:00+00:00
ignored_future_scan: carrier_delay_posted 2026-05-01T15:00:00+00:00 2026-05-01T15:03:00+00:00

Route Scores Through a Versioned Policy

04-release-policy-contract.py

@dataclass(frozen=True)
class ReleasePolicy:
    threshold: float
    max_scan_age_hours: float
    max_backlog_age_hours: float
    model_id: str
    feature_contract_id: str
    threshold_policy_id: str

POLICY = ReleasePolicy(
    threshold=0.40,
    max_scan_age_hours=24,
    max_backlog_age_hours=8,
    model_id="delay-model-v1",
    feature_contract_id="eta-features-v1",
    threshold_policy_id="eta-threshold-v1",
)

def score(
    order_id: str,
    scan_age_hours: float | None,
    backlog_age_hours: float | None,
    tier: str,
    late_risk: float,
    feature_contract_id: str = "eta-features-v1",
) -> ScoredRow:
    return ScoredRow(
        FeatureRow(order_id, prediction_at, scan_age_hours, backlog_age_hours, tier, feature_contract_id),
        late_risk,
    )

def response(scored: ScoredRow, action: str, reason: str) -> dict[str, object]:
    row = scored.features
    return {
        "order_id": row.order_id,
        "action": action,
        "reason": reason,
        "late_risk": scored.late_risk,
        "prediction_at": row.prediction_at.isoformat(),
        "scan_age_hours": row.scan_age_hours,
        "backlog_age_hours": row.backlog_age_hours,
        "model_id": POLICY.model_id,
        "feature_contract_id": row.feature_contract_id,
        "expected_feature_contract_id": POLICY.feature_contract_id,
        "threshold_policy_id": POLICY.threshold_policy_id,
    }

print("policy:", POLICY.threshold_policy_id, "threshold=", POLICY.threshold)

05-freshness-and-route.py

def invalid_age(value: float | None) -> bool:
    return value is None or not isfinite(value) or value < 0

def input_issue(row: FeatureRow) -> str | None:
    if row.feature_contract_id != POLICY.feature_contract_id:
        return "feature_contract_mismatch"
    if invalid_age(row.scan_age_hours):
        return "invalid_scan_age"
    if invalid_age(row.backlog_age_hours):
        return "invalid_backlog_age"
    if row.scan_age_hours > POLICY.max_scan_age_hours:
        return "stale_scan_features"
    if row.backlog_age_hours > POLICY.max_backlog_age_hours:
        return "stale_backlog_features"
    return None

def route(scored: ScoredRow) -> dict[str, object]:
    issue = input_issue(scored.features)
    if issue is not None:
        return response(scored, "manual_review", issue)
    if not isfinite(scored.late_risk) or not 0 <= scored.late_risk <= 1:
        return response(scored, "manual_review", "invalid_late_risk")
    if scored.late_risk >= POLICY.threshold:
        return response(scored, "warn_customer", "late_risk_threshold")
    return response(scored, "normal_tracking", "below_threshold")

print("O-201 route:", route(feature_score)["action"], route(feature_score)["reason"])

06-policy-case-matrix.py

policy_cases = [
    feature_score,
    score("O-202", 3, 2, "standard", 0.25),
    score("O-203", 31, 2, "standard", 0.81),
    score("O-204", 4, 12, "standard", 0.75),
    score("O-205", 4, 2, "standard", float("nan")),
    score("O-206", 4, None, "standard", 0.75),
    score("O-207", float("nan"), 2, "standard", 0.75),
    score("O-208", -1, 2, "standard", 0.75),
    score("O-209", 4, 2, "standard", 0.75, "eta-features-v0"),
]

for scored in policy_cases:
    result = route(scored)
    print(scored.features.order_id, result["action"], result["reason"])

trace = route(feature_score)
print("release_tuple:", trace.get("feature_contract_id"), trace.get("model_id"), trace.get("threshold_policy_id"))

Output

O-201 warn_customer late_risk_threshold
O-202 normal_tracking below_threshold
O-203 manual_review stale_scan_features
O-204 manual_review stale_backlog_features
O-205 manual_review invalid_late_risk
O-206 manual_review invalid_backlog_age
O-207 manual_review invalid_scan_age
O-208 manual_review invalid_scan_age
O-209 manual_review feature_contract_mismatch
release_tuple: eta-features-v1 delay-model-v1 eta-threshold-v1

Publish Evidence Against the Rule Baseline

07-holdout-warning-cost.py

@dataclass(frozen=True)
class HoldoutCase:
    row: ScoredRow
    delivered_late: bool

holdout = [
    HoldoutCase(score("E-301", 5, 1, "expedited", 0.78), True),
    HoldoutCase(score("E-302", 20, 2, "standard", 0.64), True),
    HoldoutCase(score("E-303", 4, 1, "standard", 0.12), False),
    HoldoutCase(score("E-304", 3, 2, "standard", 0.58), False),
]

def warning_cost(case: HoldoutCase, warn: bool) -> int:
    if warn and not case.delivered_late:
        return 8
    if not warn and case.delivered_late:
        return 150 if case.row.features.tier == "expedited" else 60
    return 0

def baseline_warn(case: HoldoutCase) -> bool:
    row = case.row.features
    return input_issue(row) is None and row.scan_age_hours is not None and row.scan_age_hours >= 18

def candidate_warn(case: HoldoutCase) -> bool:
    return route(case.row)["action"] == "warn_customer"

print("holdout rows:", len(holdout))
print("baseline warns:", [case.row.features.order_id for case in holdout if baseline_warn(case)])

08-baseline-vs-candidate-cost.py

baseline_cost = sum(warning_cost(case, baseline_warn(case)) for case in holdout)
candidate_cost = sum(warning_cost(case, candidate_warn(case)) for case in holdout)
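# Count late expedited orders that the candidate failed to warn about.
expedited_misses = sum(
    case.row.features.tier == "expedited" and case.delivered_late and not candidate_warn(case)
    for case in holdout
)
# policy_cases[2:] are the fallback fixtures O-203 through O-209.
fallback_reasons = {row.features.order_id: route(row)["reason"] for row in policy_cases[2:]}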
print("baseline_cost:", baseline_cost, "candidate_cost:", candidate_cost, "expedited_misses:", expedited_misses)
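
To see where the two totals come from, the per-row costs can be tabulated. The sketch below is self-contained and restates the four holdout rows as plain tuples. The candidate's decisions assume every row passes the input checks and that the threshold is 0.40, which is consistent with the reported candidate cost of 8 but isn't the lesson's own `route` logic.

```python
# Hypothetical restatement of the holdout rows:
# (order_id, scan_age_hours, tier, late_risk, delivered_late)
ROWS = [
    ("E-301", 5, "expedited", 0.78, True),
    ("E-302", 20, "standard", 0.64, True),
    ("E-303", 4, "standard", 0.12, False),
    ("E-304", 3, "standard", 0.58, False),
]

THRESHOLD = 0.40  # assumed warning threshold


def cost(warn, tier, delivered_late):
    """Same cost table as warning_cost, written against plain values."""
    if warn and not delivered_late:
        return 8
    if not warn and delivered_late:
        return 150 if tier == "expedited" else 60
    return 0


breakdown = []
for order_id, scan_age, tier, risk, late in ROWS:
    baseline = cost(scan_age >= 18, tier, late)    # rule baseline: scan at least 18 hours old
    candidate = cost(risk >= THRESHOLD, tier, late)  # assumed: all inputs fresh and valid
    breakdown.append((order_id, baseline, candidate))
    print(order_id, "baseline:", baseline, "candidate:", candidate)

print("totals:", sum(b for _, b, _ in breakdown), sum(c for _, _, c in breakdown))
```

Under these assumptions the baseline's entire cost is the missed expedited order E-301, while the candidate's cost is the unnecessary warning on E-304.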

09-release-gate-checklist.py

trace_keys = {"feature_contract_id", "expected_feature_contract_id", "model_id", "threshold_policy_id"}
freshness_trace_keys = {"prediction_at", "scan_age_hours", "backlog_age_hours"}

release_gates = {
    "replay_excludes_unavailable_scans": admitted_scan_statuses == {"in_transit"},
    "lower_cost_than_rule_baseline": candidate_cost < baseline_cost,
    "zero_expedited_misses": expedited_misses == 0,
    "stale_scan_falls_back": fallback_reasons["O-203"] == "stale_scan_features",
    "stale_backlog_falls_back": fallback_reasons["O-204"] == "stale_backlog_features",
    "invalid_score_falls_back": fallback_reasons["O-205"] == "invalid_late_risk",
    "missing_backlog_falls_back": fallback_reasons["O-206"] == "invalid_backlog_age",
    "invalid_scan_age_falls_back": fallback_reasons["O-207"] == "invalid_scan_age",
    "future_scan_age_falls_back": fallback_reasons["O-208"] == "invalid_scan_age",
    "feature_contract_mismatch_falls_back": fallback_reasons["O-209"] == "feature_contract_mismatch",
    "versioned_trace": all(trace.get(key) for key in trace_keys),
    "freshness_trace": freshness_trace_keys.issubset(trace),
}

print("release_gates_pass:", all(release_gates.values()))
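
A single boolean hides which gate failed. A minimal sketch of naming the blockers when a release is held; `failed_gates` is a hypothetical helper and the gate values are made up for illustration:

```python
def failed_gates(gates: dict[str, bool]) -> list[str]:
    """Return the names of gates that did not pass, in sorted order."""
    return sorted(name for name, passed in gates.items() if not passed)


# Illustrative values only: the cost gate passes but an expedited miss blocks release.
example_gates = {
    "lower_cost_than_rule_baseline": True,
    "zero_expedited_misses": False,
    "versioned_trace": True,
}

blockers = failed_gates(example_gates)
decision = "candidate_for_shadow" if not blockers else "hold"
print(decision, blockers)
```

Recording the blocker names alongside the decision lets a reviewer see why a bundle was held without rerunning the harness.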

10-publish-release-receipt.py

receipt = {
    "bundle_id": "delivery-risk-v1",
    "evaluation_snapshot": "eta-holdout-2026-05",
    "previous_bundle": "delivery-risk-v0",
    "baseline_cost": baseline_cost,
    "candidate_cost": candidate_cost,
    "expedited_misses": expedited_misses,
    "release_gates": release_gates,
    "candidate_decision": "candidate_for_shadow" if all(release_gates.values()) else "hold",
}

print(json.dumps(receipt, indent=2))

Output

{
  "bundle_id": "delivery-risk-v1",
  "evaluation_snapshot": "eta-holdout-2026-05",
  "previous_bundle": "delivery-risk-v0",
  "baseline_cost": 150,
  "candidate_cost": 8,
  "expedited_misses": 0,
  "release_gates": {
    "replay_excludes_unavailable_scans": true,
    "lower_cost_than_rule_baseline": true,
    "zero_expedited_misses": true,
    "stale_scan_falls_back": true,
    "stale_backlog_falls_back": true,
    "invalid_score_falls_back": true,
    "missing_backlog_falls_back": true,
    "invalid_scan_age_falls_back": true,
    "future_scan_age_falls_back": true,
    "feature_contract_mismatch_falls_back": true,
    "versioned_trace": true,
    "freshness_trace": true
  },
  "candidate_decision": "candidate_for_shadow"
}

Operate the Service After Release

Submission checklist

Artifact	Reviewer should verify
feature contract	every field has type, timestamp boundary, and missing policy
training manifest	time split and dataset fingerprint exist
baseline comparison	candidate improves declared cost without required-slice misses
service API	stale inputs fail to a safer route
monitoring plan	input checks and delayed label metrics are distinct
rollback plan	prior artifact and threshold remain deployable
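
The training-manifest row asks for a time split and a dataset fingerprint. One way to produce both with the standard library is sketched below. The row layout, the `fingerprint` helper, and the cutoff date are assumptions for illustration, not the capstone's manifest format.

```python
import hashlib
import json
from datetime import date

# Hypothetical training rows; field names are illustrative.
rows = [
    {"order_id": "T-001", "prediction_date": "2026-03-02", "delivered_late": False},
    {"order_id": "T-002", "prediction_date": "2026-04-11", "delivered_late": True},
    {"order_id": "T-003", "prediction_date": "2026-05-06", "delivered_late": False},
]


def fingerprint(records):
    """SHA-256 over a canonical JSON encoding, independent of row order."""
    ordered = sorted(records, key=lambda r: r["order_id"])
    canonical = json.dumps(ordered, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


cutoff = date(2026, 5, 1)  # assumed split boundary
train_rows = [r for r in rows if date.fromisoformat(r["prediction_date"]) < cutoff]
holdout_rows = [r for r in rows if date.fromisoformat(r["prediction_date"]) >= cutoff]

manifest = {
    "split_cutoff": cutoff.isoformat(),
    "train_rows": len(train_rows),
    "holdout_rows": len(holdout_rows),
    "train_fingerprint": fingerprint(train_rows),
}
print(json.dumps(manifest, indent=2))
```

Splitting on a date rather than at random keeps every holdout row later than every training row, and the fingerprint lets a reviewer confirm the manifest refers to the same rows.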

Practice: break the release contract

Use the runnable examples as a small release harness. Change one input at a time, predict the result, then rerun the examples.

Move hub_scan ingestion from at(2.5) to at(1.75). Which scan should the as-of builder select?
Change O-204 backlog age from 12 to 7. Which action replaces manual_review?
Replace float("nan") with 1.4 for O-205. Why should the service still refuse the score?
Raise threshold from 0.40 to 0.80. Which expedited gate fails?
Remove threshold_policy_id from response(). Which release gate catches the incomplete trace?
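
To try the first exercise outside the lesson harness, the as-of rule can be sketched on its own. The `Scan` record, this version of the `at` helper, the pickup time, and the `carrier_delay` status name are assumptions reconstructed from the O-201 example (scoring at noon, two hours after pickup); the lesson's own builder may differ.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

PICKUP = datetime(2026, 5, 1, 10, 0)  # assumed pickup time


def at(hours: float) -> datetime:
    """Hypothetical helper: a timestamp `hours` after pickup."""
    return PICKUP + timedelta(hours=hours)


@dataclass(frozen=True)
class Scan:
    status: str
    event_time: datetime
    ingested_at: datetime


scans = [
    Scan("in_transit", at(1.0), at(1.05)),
    Scan("hub_scan", at(1.5), at(2.5)),       # occurred in time, arrived too late
    Scan("carrier_delay", at(5.0), at(5.0)),  # future event
]


def latest_as_of(history, prediction_at):
    """Pick the most recent scan that both occurred and was ingested by prediction_at."""
    admitted = [
        s for s in history
        if s.event_time <= prediction_at and s.ingested_at <= prediction_at
    ]
    return max(admitted, key=lambda s: s.event_time) if admitted else None


selected = latest_as_of(scans, at(2.0))
print(selected.status if selected else None)
```

Edit the `hub_scan` ingestion time and rerun to check your prediction.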

Practice answer sketches

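Mastery check

Why does the service return manual_review for a high-risk order whose scan is too stale?
Why does replay filter both event_time and ingested_at?
Why must the trained model be compared with a rule baseline under the warning policy?
Why does the release trace include a threshold-policy identifier separately from model identifier?
What would make retraining unsafe even when more labels are available?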

Evaluation rubric

Artifact	Strong submission demonstrates
model package	time-safe feature contract, baseline comparison, and calibrated warning threshold
service	versioned response trace and safe behavior for missing or stale scans
operations	input monitoring, delayed-label evaluation, candidate promotion, and rollback
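
Input monitoring and delayed-label evaluation run on different clocks: input checks are available at scoring time, while the late-delivery label exists only after the outcome is known. A minimal sketch of keeping the two apart follows. The log fields and helper names are assumptions, not the capstone's schema.

```python
# Hypothetical prediction log; delivered_late stays None until the outcome is known.
log = [
    {"order_id": "P-1", "scan_age_hours": 3.0, "warned": True, "delivered_late": True},
    {"order_id": "P-2", "scan_age_hours": None, "warned": False, "delivered_late": False},
    {"order_id": "P-3", "scan_age_hours": 6.0, "warned": True, "delivered_late": None},
]


def input_null_rate(records):
    """Input check: computable immediately, no labels needed."""
    return sum(r["scan_age_hours"] is None for r in records) / len(records)


def matured(records):
    """Delayed-label slice: only rows whose outcome has been recorded."""
    return [r for r in records if r["delivered_late"] is not None]


def warning_precision(records):
    """Share of labeled warnings that were followed by a late delivery."""
    warned = [r for r in matured(records) if r["warned"]]
    if not warned:
        return None
    return sum(r["delivered_late"] for r in warned) / len(warned)


print("null_rate:", round(input_null_rate(log), 3))
print("labeled_rows:", len(matured(log)), "warning_precision:", warning_precision(log))
```

P-3 counts toward the input check but is excluded from the precision figure until its outcome arrives, so an unlabeled row is never scored as either a hit or a miss.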

Common failures

Symptom	Cause	Fix
Warning appears accurate offline but misses live disruptions	future or late-arriving scans leaked into training	enforce replay tests on event and ingestion time
Customer receives unsupported ETA warning	service trusts score despite stale inputs	gate freshness before action
Missing or impossible freshness reaches threshold logic	comparisons assume valid numeric ages	reject null, non-finite, and negative ages
Customer receives warning from malformed score	serving boundary assumes output is finite and bounded	validate score before thresholding
Team can't reproduce a warning	artifact versions absent from response trace	log full release tuple
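
The two validation rows in the table share one pattern: confirm that a value is present, finite, and in range before any comparison uses it. A sketch with hypothetical helper names follows. The reason strings mirror the fallback reasons in the routing output, but the capstone's actual `input_issue` logic isn't shown here and may order or name its checks differently.

```python
import math
from typing import Optional


def valid_age(hours: Optional[float]) -> bool:
    """An age is usable only if it is present, finite, and non-negative."""
    return hours is not None and math.isfinite(hours) and hours >= 0


def valid_risk(score: Optional[float]) -> bool:
    """A risk score must be a finite value in [0, 1]."""
    return score is not None and math.isfinite(score) and 0.0 <= score <= 1.0


def first_issue(scan_age, backlog_age, late_risk) -> Optional[str]:
    """Return the first validation failure, or None when all inputs are usable."""
    if not valid_age(scan_age):
        return "invalid_scan_age"
    if not valid_age(backlog_age):
        return "invalid_backlog_age"
    if not valid_risk(late_risk):
        return "invalid_late_risk"
    return None


print(first_issue(5, 1, 0.62))
print(first_issue(-2, 1, 0.62))
print(first_issue(5, None, 0.62))
print(first_issue(5, 1, float("nan")))
```

Running the checks before the threshold comparison matters because `float("nan") >= 0.40` is simply `False` in Python, which would otherwise route a malformed score to normal tracking without any signal.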

Next Step

Continue to Capstone: Product Ranking

You have shipped one prediction service with time-safe features and release gates. Next you'll ship a ranked marketplace surface whose exposures must be measured as carefully as its scores.

References

Feast: Production Feature Store for Machine Learning

Feast Contributors · 2024

XGBoost: A Scalable Tree Boosting System.

Chen, T. & Guestrin, C. · 2016 · KDD 2016

MLOps: Continuous Delivery and Automation Pipelines in Machine Learning.

Google Cloud. · 2026 · Official documentation
