Capstone: Demand Forecasting

Ship a demand forecast and capacity-alert artifact with rolling backtests, alert review, and retraining policy.

The ranking capstone influenced which products users could purchase. Warehouse teams now need an operational input: forecast daily parcel volume by fulfillment center so staffing and packing capacity can be planned before demand arrives.

This capstone ships a forecast and alert artifact. It doesn't automatically hire labor, move inventory, or page an operator on every miss. It creates a versioned expectation, detects unusually large forecast errors, and records evidence for a planner's human-in-the-loop decision.

Earlier, you built a same-weekday seasonal baseline and separated forecast errors from anomaly review. This capstone packages those mechanics into a release candidate with shadow evidence and a rollback pointer.

Rolling-origin demand forecast evidence for fulfillment center FC-A. Fourteen immutable daily forecasts across February 2 through February 15 compare a same-weekday baseline, candidate forecast, plus-or-minus-six expected range, and later actual counts. The February 6 seller campaign is the only range breach: candidate 140, expected range 134 to 146, actual 160, error plus 20, known before the February 1 cutoff. The receipt improves MAE from 4.643 to 2.643 and peak underforecast cost from 108 to 72, covers 13 of 14 observations, joins all 14 issued forecasts, records policy-matched alert precision and recall of 0.667, passes 19 of 19 gates, keeps warehouse-demand-v0 as rollback, and advances warehouse-demand-v1 only to planner shadow review. — Two frozen seven-day origins preserve the February 6 range breach while the receipt keeps point accuracy, peak cost, interval coverage, alert review, and rollback evidence separate.

Choose the Series and Decision

Predict daily shipped parcels for each fulfillment center and service tier, one forecast per day for the next seven days. Use an explicit planning contract:

Field	Contract
entity	fulfillment center and shipping service tier
target	parcels shipped per calendar day
horizon	next seven days
decision	planner reviews capacity when forecast or alert requires it
baseline	same weekday from prior week
evaluation	MAE plus underforecast cost by high-volume slice

Demand can change around promotions, holidays, seller campaigns, inventory shortages, and data outages. Those known drivers should appear as features only if they are scheduled and available before the forecast cutoff.
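The known-before-cutoff rule can be enforced with a small filter over the planned-events file. The sketch below is only an illustration: the record fields (`event`, `event_date`, `known_at`) and the `flash-sale` row are assumptions, since the layout of planned_events.json isn't specified here.

```python
from datetime import date

# Hypothetical planned-event records; field names and the flash-sale row are assumptions.
PLANNED_EVENTS = [
    {"event": "seller-campaign", "event_date": date(2026, 2, 6), "known_at": date(2026, 1, 29)},
    {"event": "flash-sale", "event_date": date(2026, 2, 7), "known_at": date(2026, 2, 3)},
]

def events_known_by(cutoff: date, events: list[dict]) -> list[dict]:
    """Keep only events that were already scheduled on or before the forecast cutoff."""
    return [event for event in events if event["known_at"] <= cutoff]

usable = events_known_by(date(2026, 2, 1), PLANNED_EVENTS)
print([event["event"] for event in usable])  # ['seller-campaign']
```

The flash sale falls inside the forecast window but was announced after the cutoff, so using it as a feature for that fold would leak future information.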

Hyndman and Athanasopoulos explain why forecast evaluation must use later observations and rolling forecasting origins rather than random splits [1]. For this project, each backtest run records its training cutoff, horizon, model version, and the actual values that arrived afterward.
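A rolling-origin schedule can be generated before any model is fit. This is a minimal sketch, assuming weekly cutoffs and a seven-day horizon; the `rolling_origins` helper and its parameters are hypothetical names, not part of the capstone repository.

```python
from datetime import date, timedelta

def rolling_origins(first_cutoff: date, n_folds: int, horizon_days: int = 7, step_days: int = 7):
    """Yield (fold name, cutoff, target dates); every target falls strictly after its cutoff."""
    for index in range(n_folds):
        cutoff = first_cutoff + timedelta(days=index * step_days)
        targets = [cutoff + timedelta(days=h) for h in range(1, horizon_days + 1)]
        yield f"fold-{index + 1}", cutoff, targets

folds = list(rolling_origins(date(2026, 2, 1), n_folds=2))
for name, cutoff, targets in folds:
    print(name, "cutoff", cutoff, "window", targets[0], "..", targets[-1])
```

Each fold trains only on rows dated on or before its cutoff, and later folds may reuse the earlier fold's target days as history.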

Diagram showing four stages: history snapshot through cutoff; immutable forecast rows (baseline + candidate); join of later observations (rolling-origin metrics); and alert queue (range + owner + resolution).

Build a Reviewable Artifact

Your repository surface should look like:

demand-forecast/
  data/
    warehouse_daily_counts.parquet
    planned_events.json
    split_manifest.json
  forecasting/
    seasonal_baseline.py
    train_candidate.py
    rolling_backtest.py
  alerts/
    forecast_error_policy.json
    evaluate_alerts.py
  reports/
    backtest_metrics.json
    alert_review.csv
  tests/
    test_future_rows_excluded.py
    test_observation_join.py
    test_alert_contract.py

The candidate can be a tree model over lag features, rolling means, service tier, weekday, and known promotions. It must beat the seasonal baseline on later windows, especially where underforecasting is expensive. A candidate that marginally improves MAE but misses peak-volume days should remain blocked.
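To make the baseline and the lag inputs concrete, here is a minimal sketch in plain Python. The function names and the feature dictionary are assumptions for illustration; the actual contents of seasonal_baseline.py and train_candidate.py aren't shown in this capstone.

```python
def same_weekday_baseline(history: list[int], horizon_day: int) -> int:
    """Seasonal-naive forecast: reuse the count from the same weekday one week earlier.

    history holds daily counts ending on the cutoff day; horizon_day runs from 1 to 7.
    """
    if not 1 <= horizon_day <= 7:
        raise ValueError("horizon_day must be between 1 and 7")
    if len(history) < 7:
        raise ValueError("need at least seven days of history")
    # Target day is cutoff + h, so the same weekday last week is cutoff + h - 7.
    return history[horizon_day - 8]

def lag_features(history: list[int], horizon_day: int) -> dict:
    """Hypothetical feature row built only from values known at the cutoff."""
    return {
        "lag_same_weekday": same_weekday_baseline(history, horizon_day),
        "rolling_mean_7": sum(history[-7:]) / 7,
        "horizon_day": horizon_day,
    }

# Observed FC-A counts for February 2 through February 8 (cutoff day is February 8).
last_week = [104, 110, 119, 116, 160, 84, 78]
features = lag_features(last_week, horizon_day=5)
print(same_weekday_baseline(last_week, 1), features["lag_same_weekday"])  # 104 160
```

With that week as history, the baseline for February 13 is the prior Friday's count of 160, which is the value the candidate has to beat on that day.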

Freeze issued forecasts before outcomes arrive

The runnable receipt below starts after training. It freezes two seven-day forecast windows issued one week apart. Each issued row stores the cutoff, target date, horizon, baseline, candidate, expected range, and known event context before the target day arrives.

Actual parcel counts belong in a separate append-only observation stream. The backtest joins each later observation to the immutable forecast it evaluates. A real pipeline should materialize the same boundary after every rolling-origin fold.

Stream	Stored before replay	Arrival time
issued forecast	center, tier, issue date, target date, horizon, model outputs, range, known-event context	before target day
observation	forecast ID, observed date, actual parcel count	on or after target day
alert	joined forecast ID, range breach, policies, owner, resolution	after observation join
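The boundary between streams can be checked mechanically. The sketch below assumes a trimmed stand-in for an issued row and a hypothetical in-memory `AppendOnlyObservations` store; a real pipeline would more likely rely on storage-level guarantees.

```python
from dataclasses import FrozenInstanceError, dataclass
from datetime import date

@dataclass(frozen=True)
class FrozenForecast:
    """Hypothetical, trimmed stand-in for an issued forecast row."""
    forecast_id: str
    candidate: int

class AppendOnlyObservations:
    """Hypothetical store that accepts new observations but never overwrites one."""

    def __init__(self) -> None:
        self._rows: dict = {}

    def append(self, forecast_id: str, observed_at: date, actual: int) -> None:
        if forecast_id in self._rows:
            raise ValueError(f"observation already recorded for {forecast_id}")
        self._rows[forecast_id] = (observed_at, actual)

    def get(self, forecast_id: str) -> tuple:
        return self._rows[forecast_id]

forecast = FrozenForecast("FC-A:standard:2026-02-01:2026-02-06", 140)
try:
    forecast.candidate = 160  # attempt to rewrite the forecast after the outcome is known
    forecast_was_mutated = True
except FrozenInstanceError:
    forecast_was_mutated = False

store = AppendOnlyObservations()
store.append(forecast.forecast_id, date(2026, 2, 6), 160)
print("forecast mutated:", forecast_was_mutated, "stored:", store.get(forecast.forecast_id))
```

A second `append` for the same forecast ID raises `ValueError`, so a corrected count has to arrive as an explicit, reviewable change rather than a silent overwrite.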

01-freeze-forecast-contract.py

from dataclasses import dataclass
from datetime import date
import json

@dataclass(frozen=True)
class IssuedForecast:
    forecast_id: str
    fold: str
    issued_at: date
    target_date: date
    horizon_day: int
    center: str
    service_tier: str
    baseline: int
    candidate: int
    lower: int
    upper: int
    scheduled_event: str | None = None
    event_known_at: date | None = None

@dataclass(frozen=True)
class Observation:
    forecast_id: str
    observed_at: date
    actual: int

@dataclass(frozen=True)
class BacktestRow:
    issued: IssuedForecast
    observation: Observation

INTERVAL_HALF_WIDTH = 6
INTERVAL_POLICY = "candidate-plus-minus-6-v1"

def issue_forecast(
    fold: str,
    issued_at: date,
    target_date: date,
    center: str,
    service_tier: str,
    baseline: int,
    candidate: int,
    scheduled_event: str | None = None,
    event_known_at: date | None = None,
) -> IssuedForecast:
    horizon_day = (target_date - issued_at).days
    return IssuedForecast(
        forecast_id=f"{center}:{service_tier}:{issued_at}:{target_date}",
        fold=fold,
        issued_at=issued_at,
        target_date=target_date,
        horizon_day=horizon_day,
        center=center,
        service_tier=service_tier,
        baseline=baseline,
        candidate=candidate,
        lower=candidate - INTERVAL_HALF_WIDTH,
        upper=candidate + INTERVAL_HALF_WIDTH,
        scheduled_event=scheduled_event,
        event_known_at=event_known_at,
    )

print("interval_policy:", INTERVAL_POLICY, "half_width:", INTERVAL_HALF_WIDTH)

Output

interval_policy: candidate-plus-minus-6-v1 half_width: 6

02-issue-forecasts-and-observations.py

ISSUED_FORECASTS = [
    issue_forecast("fold-1", date(2026, 2, 1), date(2026, 2, 2), "FC-A", "standard", 100, 103),
    issue_forecast("fold-1", date(2026, 2, 1), date(2026, 2, 3), "FC-A", "standard", 112, 111),
    issue_forecast("fold-1", date(2026, 2, 1), date(2026, 2, 4), "FC-A", "standard", 115, 117),
    issue_forecast("fold-1", date(2026, 2, 1), date(2026, 2, 5), "FC-A", "standard", 118, 117),
    issue_forecast("fold-1", date(2026, 2, 1), date(2026, 2, 6), "FC-A", "standard", 132, 140, "seller-campaign", date(2026, 1, 29)),
    issue_forecast("fold-1", date(2026, 2, 1), date(2026, 2, 7), "FC-A", "standard", 82, 83),
    issue_forecast("fold-1", date(2026, 2, 1), date(2026, 2, 8), "FC-A", "standard", 76, 77),
    issue_forecast("fold-2", date(2026, 2, 8), date(2026, 2, 9), "FC-A", "standard", 104, 105),
    issue_forecast("fold-2", date(2026, 2, 8), date(2026, 2, 10), "FC-A", "standard", 110, 112),
    issue_forecast("fold-2", date(2026, 2, 8), date(2026, 2, 11), "FC-A", "standard", 119, 120),
    issue_forecast("fold-2", date(2026, 2, 8), date(2026, 2, 12), "FC-A", "standard", 116, 119),
    issue_forecast("fold-2", date(2026, 2, 8), date(2026, 2, 13), "FC-A", "standard", 160, 164, "seller-campaign", date(2026, 2, 5)),
    issue_forecast("fold-2", date(2026, 2, 8), date(2026, 2, 14), "FC-A", "standard", 84, 85),
    issue_forecast("fold-2", date(2026, 2, 8), date(2026, 2, 15), "FC-A", "standard", 78, 79),
]

ACTUALS = [104, 110, 119, 116, 160, 84, 78, 106, 113, 121, 118, 168, 86, 80]
OBSERVATIONS = [
    Observation(forecast.forecast_id, forecast.target_date, actual)
    for forecast, actual in zip(ISSUED_FORECASTS, ACTUALS, strict=True)
]
issued_by_id = {row.forecast_id: row for row in ISSUED_FORECASTS}
observations_by_forecast_id = {row.forecast_id: row for row in OBSERVATIONS}

print("issued forecasts:", len(ISSUED_FORECASTS))
print("later observations:", len(OBSERVATIONS))

03-join-backtest-rows.py

ROWS = [
    BacktestRow(forecast, observations_by_forecast_id[forecast.forecast_id])
    for forecast in ISSUED_FORECASTS
    if forecast.forecast_id in observations_by_forecast_id
]

folds = sorted({row.fold for row in ISSUED_FORECASTS})
for fold in folds:
    rows = [row for row in ISSUED_FORECASTS if row.fold == fold]
    print(
        f"{fold}: cutoff={rows[0].issued_at}",
        f"window={rows[0].target_date}..{rows[-1].target_date}",
        f"rows={len(rows)}",
    )
print("first forecast id:", ISSUED_FORECASTS[0].forecast_id)
print("joined backtest rows:", len(ROWS))

Output

fold-1: cutoff=2026-02-01 window=2026-02-02..2026-02-08 rows=7
fold-2: cutoff=2026-02-08 window=2026-02-09..2026-02-15 rows=7
first forecast id: FC-A:standard:2026-02-01:2026-02-02
joined backtest rows: 14

The fixture is intentionally compact: one fulfillment center and one service tier make every row inspectable. Its ID includes center, tier, issue date, and target date so additional series can't collide during replay. A production report should repeat the same contract by center, tier, horizon, and event slice.
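A slice report can start as a plain group-by. This sketch copies the candidate and actual values from the FC-A fixture and groups candidate MAE by horizon day; the tuple layout and the `mae_by_horizon` helper are assumptions for illustration.

```python
from collections import defaultdict

# (horizon_day, candidate, actual) copied from the FC-A fixture, fold-1 then fold-2.
FIXTURE = [
    (1, 103, 104), (2, 111, 110), (3, 117, 119), (4, 117, 116), (5, 140, 160), (6, 83, 84), (7, 77, 78),
    (1, 105, 106), (2, 112, 113), (3, 120, 121), (4, 119, 118), (5, 164, 168), (6, 85, 86), (7, 79, 80),
]

def mae_by_horizon(rows: list[tuple]) -> dict:
    """Average absolute candidate error for each horizon day."""
    errors = defaultdict(list)
    for horizon_day, candidate, actual in rows:
        errors[horizon_day].append(abs(actual - candidate))
    return {horizon: sum(values) / len(values) for horizon, values in sorted(errors.items())}

report = mae_by_horizon(FIXTURE)
print(report)
```

In this fixture horizon 5 carries an MAE of 12.0 while every other horizon stays at 1.5 or below, because both seller-campaign days land on that horizon. The overall MAE of 2.643 hides that concentration.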

Evaluate Forecasts and Alerts Separately

Mean absolute error (MAE) answers how far point forecasts miss on average. Capacity planning needs a second view because a large underforecast on a peak day can cost more than a small overforecast on a routine day. The next cell assigns a local cost of 3 units to each underforecast parcel on a day with at least 150 observed parcels.
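As a worked check of that cost rule, only two fixture days reach 150 observed parcels. The sketch below recomputes the peak cost for those two days in isolation; the constant and helper names are assumptions for illustration.

```python
UNIT_COST = 3         # cost units per underforecast parcel
PEAK_THRESHOLD = 150  # a day counts as peak when observed parcels reach this level

# (target date, actual, baseline, candidate) for the fixture days that reach the threshold.
PEAK_DAYS = [
    ("2026-02-06", 160, 132, 140),
    ("2026-02-13", 168, 160, 164),
]

def underforecast_cost(actual: int, forecast: int) -> int:
    """Charge only the shortfall; overforecasts cost nothing under this rule."""
    return UNIT_COST * max(actual - forecast, 0)

baseline_cost = sum(underforecast_cost(actual, baseline) for _, actual, baseline, _ in PEAK_DAYS)
candidate_cost = sum(underforecast_cost(actual, candidate) for _, actual, _, candidate in PEAK_DAYS)
print(baseline_cost, candidate_cost)  # 108 72
```

The baseline pays 3 × 28 + 3 × 8 = 108 and the candidate pays 3 × 20 + 3 × 4 = 72, so the February 6 miss accounts for most of the cost under either forecast.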

An expected range answers a different question. It should contain a stated share of later observations when measured across enough held-out windows. Hyndman and Athanasopoulos describe prediction intervals as forecast ranges with a specified coverage probability and explain why distributional forecasts need their own accuracy measures [1]. The local candidate ± 6 policy only demonstrates release plumbing; it isn't a calibrated production interval.
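One simple alternative to a fixed half-width is to size the range from earlier errors and then measure coverage on a later window. The sketch below uses an empirical quantile of absolute candidate errors from fold-1 and checks it against fold-2. This is a swapped-in illustration rather than the capstone's interval policy, and seven rows per fold are far too few to call the result calibrated.

```python
import math

def empirical_half_width(abs_errors: list[int], target_coverage: float) -> int:
    """Smallest observed error that would have covered target_coverage of the given errors."""
    ordered = sorted(abs_errors)
    rank = math.ceil(target_coverage * len(ordered))  # 1-based rank of the covering error
    return ordered[max(rank, 1) - 1]

# Absolute candidate errors taken from the FC-A fixture.
fold_1_abs_errors = [1, 1, 2, 1, 20, 1, 1]
fold_2_abs_errors = [1, 1, 1, 1, 4, 1, 1]

half_width = empirical_half_width(fold_1_abs_errors, target_coverage=0.85)
held_out_coverage = sum(error <= half_width for error in fold_2_abs_errors) / len(fold_2_abs_errors)
print(half_width, round(held_out_coverage, 3))  # 2 0.857
```

The fitting fold and the scoring fold stay separate here, which is the property a coverage claim depends on. Coverage measured on the same rows that set the width would look better than it deserves.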

04-evaluate-point-and-range-metrics.py

def mae(field: str) -> float:
    return sum(
        abs(row.observation.actual - getattr(row.issued, field))
        for row in ROWS
    ) / len(ROWS)

def peak_underforecast_cost(field: str) -> int:
    return sum(
        3 * max(row.observation.actual - getattr(row.issued, field), 0)
        for row in ROWS
        if row.observation.actual >= 150
    )

def coverage() -> float:
    return sum(
        row.issued.lower <= row.observation.actual <= row.issued.upper
        for row in ROWS
    ) / len(ROWS)

ALERT_POLICY = "outside-range-review-v1"
ALERT_OWNER = "capacity-ops"
CANDIDATE_FORECAST = "warehouse-demand-v1"
PREVIOUS_FORECAST = "warehouse-demand-v0"

forecast_metrics = {
    "baseline_mae": round(mae("baseline"), 3),
    "candidate_mae": round(mae("candidate"), 3),
    "baseline_peak_underforecast_cost": peak_underforecast_cost("baseline"),
    "candidate_peak_underforecast_cost": peak_underforecast_cost("candidate"),
    "range_coverage": round(coverage(), 3),
    "range_rows": len(ROWS),
}

print(json.dumps(forecast_metrics, indent=2))

05-build-range-alerts.py

new_alerts = [
    {
        "forecast_id": row.issued.forecast_id,
        "forecast_version": CANDIDATE_FORECAST,
        "interval_policy": INTERVAL_POLICY,
        "policy_version": ALERT_POLICY,
        "owner": ALERT_OWNER,
        "issued_at": str(row.issued.issued_at),
        "target_date": str(row.issued.target_date),
        "horizon_day": row.issued.horizon_day,
        "center": row.issued.center,
        "service_tier": row.issued.service_tier,
        "expected_range": [row.issued.lower, row.issued.upper],
        "observed_at": str(row.observation.observed_at),
        "observed": row.observation.actual,
        "error": row.observation.actual - row.issued.candidate,
        "scheduled_event": row.issued.scheduled_event,
        "event_known_at": str(row.issued.event_known_at) if row.issued.event_known_at else None,
        "resolution": None,
    }
    for row in ROWS
    if not row.issued.lower <= row.observation.actual <= row.issued.upper
]

print("new alerts:", json.dumps(new_alerts, indent=2))

Output

new alerts: [
  {
    "forecast_id": "FC-A:standard:2026-02-01:2026-02-06",
    "forecast_version": "warehouse-demand-v1",
    "interval_policy": "candidate-plus-minus-6-v1",
    "policy_version": "outside-range-review-v1",
    "owner": "capacity-ops",
    "issued_at": "2026-02-01",
    "target_date": "2026-02-06",
    "horizon_day": 5,
    "center": "FC-A",
    "service_tier": "standard",
    "expected_range": [
      134,
      146
    ],
    "observed_at": "2026-02-06",
    "observed": 160,
    "error": 20,
    "scheduled_event": "seller-campaign",
    "event_known_at": "2026-01-29",
    "resolution": null
  }
]

The candidate improves average error and peak-day cost, but one seller-campaign day still escapes its expected range. That row belongs in a planner queue. The queue preserves which immutable forecast was issued, what happened later, who owns review, and which interval and alert policies created the alert.

Publish a Shadow-Review Receipt

Forecast quality and alert usefulness require separate evidence. A narrow range can create an exhausting queue. A broad range can hide events a planner needed to see. Measure alert precision and recall on historical alerts after reviewers label whether each event required action. Keep the forecast and alert-policy versions on those review rows; otherwise a candidate could pass using evidence produced by a different model or threshold.
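The width-versus-queue trade-off is easy to see on the fixture itself. This sketch sweeps a few half-widths over the 14 absolute candidate errors and counts how many rows would breach the range; the list of widths is an arbitrary assumption.

```python
# Absolute candidate errors for the 14 FC-A fixture rows, in issue order.
ABS_ERRORS = [1, 1, 2, 1, 20, 1, 1, 1, 1, 1, 1, 4, 1, 1]

def alert_count(half_width: int) -> int:
    """Rows whose actual falls outside candidate ± half_width."""
    return sum(error > half_width for error in ABS_ERRORS)

queue_sizes = {half_width: alert_count(half_width) for half_width in (0, 1, 2, 6, 20)}
for half_width, count in queue_sizes.items():
    print(f"half_width={half_width:>2} alerts={count}")
```

A half-width of 0 flags all 14 rows, 6 flags only the February 6 campaign, and 20 flags nothing. Counts alone can't say which setting is right; that needs reviewer labels on whether each flagged row required action.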

The final cell publishes one candidate receipt. It advances to planner shadow review, not production replacement. The previous forecast alias remains explicit so a later promotion process can roll back cleanly.

06-review-historical-alerts.py

REVIEWED_ALERTS = [
    {
        "alert_id": "A-101",
        "forecast_version": CANDIDATE_FORECAST,
        "policy_version": ALERT_POLICY,
        "triggered": True,
        "actionable": True,
    },
    {
        "alert_id": "A-102",
        "forecast_version": CANDIDATE_FORECAST,
        "policy_version": ALERT_POLICY,
        "triggered": True,
        "actionable": True,
    },
    {
        "alert_id": "A-103",
        "forecast_version": CANDIDATE_FORECAST,
        "policy_version": ALERT_POLICY,
        "triggered": True,
        "actionable": False,
    },
    {
        "alert_id": "A-104",
        "forecast_version": CANDIDATE_FORECAST,
        "policy_version": ALERT_POLICY,
        "triggered": False,
        "actionable": True,
    },
    {
        "alert_id": "A-105",
        "forecast_version": CANDIDATE_FORECAST,
        "policy_version": ALERT_POLICY,
        "triggered": False,
        "actionable": False,
    },
]

true_positives = sum(row["triggered"] and row["actionable"] for row in REVIEWED_ALERTS)
false_positives = sum(row["triggered"] and not row["actionable"] for row in REVIEWED_ALERTS)
false_negatives = sum(not row["triggered"] and row["actionable"] for row in REVIEWED_ALERTS)

def rate_or_none(numerator: int, denominator: int) -> float | None:
    return round(numerator / denominator, 3) if denominator else None

alert_review = {
    "forecast_version": CANDIDATE_FORECAST,
    "policy_version": ALERT_POLICY,
    "precision": rate_or_none(true_positives, true_positives + false_positives),
    "recall": rate_or_none(true_positives, true_positives + false_negatives),
    "reviewed_rows": len(REVIEWED_ALERTS),
}

print("alert_review:", alert_review)

07-backtest-release-gates.py

required_alert_fields = {
    "forecast_id", "forecast_version", "interval_policy", "policy_version", "owner",
    "issued_at", "target_date", "horizon_day", "center", "service_tier",
    "expected_range", "observed_at", "observed", "error", "scheduled_event",
    "event_known_at", "resolution",
}
release_gates = {
    "issued_forecast_ids_unique": len(issued_by_id) == len(ISSUED_FORECASTS),
    "observation_forecast_ids_unique": len(observations_by_forecast_id) == len(OBSERVATIONS),
    "observations_join_issued_forecasts": all(
        row.forecast_id in issued_by_id
        for row in OBSERVATIONS
    ),
    "issued_forecasts_have_observations": len(ROWS) == len(ISSUED_FORECASTS),
    "targets_after_cutoff": all(row.issued.issued_at < row.issued.target_date for row in ROWS),
    "horizons_match_dates": all(
        row.issued.horizon_day == (row.issued.target_date - row.issued.issued_at).days
        for row in ROWS
    ),
    "horizons_within_seven_day_contract": all(
        1 <= row.issued.horizon_day <= 7
        for row in ROWS
    ),
    "observations_arrive_on_or_after_target": all(
        row.observation.observed_at >= row.issued.target_date
        for row in ROWS
    ),
    "scheduled_events_known_by_cutoff": all(
        row.event_known_at is None or row.event_known_at <= row.issued_at
        for row in ISSUED_FORECASTS
    ),
    "multiple_rolling_origins": len({row.issued_at for row in ISSUED_FORECASTS}) >= 2,
    "candidate_beats_baseline_mae": forecast_metrics["candidate_mae"] < forecast_metrics["baseline_mae"],
    "candidate_reduces_peak_underforecast_cost": (
        forecast_metrics["candidate_peak_underforecast_cost"]
        < forecast_metrics["baseline_peak_underforecast_cost"]
    ),
    "local_range_coverage_at_least_0_85": forecast_metrics["range_coverage"] >= 0.85,
    "shadow_alert_count_at_most_3": len(new_alerts) <= 3,
    "alert_review_matches_candidate_policy": all(
        row["forecast_version"] == CANDIDATE_FORECAST
        and row["policy_version"] == ALERT_POLICY
        for row in REVIEWED_ALERTS
    ),
    "alert_precision_evidence_at_least_0_60": (
        alert_review["precision"] is not None
        and alert_review["precision"] >= 0.60
    ),
    "alert_recall_evidence_at_least_0_60": (
        alert_review["recall"] is not None
        and alert_review["recall"] >= 0.60
    ),
    "alert_rows_replayable": all(required_alert_fields <= row.keys() for row in new_alerts),
    "rollback_pointer_recorded": bool(PREVIOUS_FORECAST),
}

print("release_gates_pass:", all(release_gates.values()))

08-assemble-shadow-receipt.py

receipt = {
    "candidate_forecast": CANDIDATE_FORECAST,
    "previous_forecast": PREVIOUS_FORECAST,
    "latest_rolling_origin": str(max(row.issued_at for row in ISSUED_FORECASTS)),
    "interval_policy": INTERVAL_POLICY,
    "alert_policy": ALERT_POLICY,
    "owner": ALERT_OWNER,
    "replay": {
        "issued_forecast_rows": len(ISSUED_FORECASTS),
        "joined_observation_rows": len(ROWS),
    },
    "backtest": forecast_metrics,
    "alert_review": alert_review,
    "release_gates": release_gates,
    "candidate_decision": "candidate_for_planner_shadow_review" if all(release_gates.values()) else "hold",
}

print("candidate_decision:", receipt["candidate_decision"])

09-verify-receipt-fields.py

assert receipt["candidate_forecast"] == CANDIDATE_FORECAST
assert receipt["previous_forecast"] == PREVIOUS_FORECAST
assert receipt["backtest"]["candidate_mae"] < receipt["backtest"]["baseline_mae"]
print("receipt keys:", sorted(receipt))

10-publish-shadow-review-receipt.py

print(json.dumps(receipt, indent=2))

Output

{
  "candidate_forecast": "warehouse-demand-v1",
  "previous_forecast": "warehouse-demand-v0",
  "latest_rolling_origin": "2026-02-08",
  "interval_policy": "candidate-plus-minus-6-v1",
  "alert_policy": "outside-range-review-v1",
  "owner": "capacity-ops",
  "replay": {
    "issued_forecast_rows": 14,
    "joined_observation_rows": 14
  },
  "backtest": {
    "baseline_mae": 4.643,
    "candidate_mae": 2.643,
    "baseline_peak_underforecast_cost": 108,
    "candidate_peak_underforecast_cost": 72,
    "range_coverage": 0.929,
    "range_rows": 14
  },
  "alert_review": {
    "forecast_version": "warehouse-demand-v1",
    "policy_version": "outside-range-review-v1",
    "precision": 0.667,
    "recall": 0.667,
    "reviewed_rows": 5
  },
  "release_gates": {
    "issued_forecast_ids_unique": true,
    "observation_forecast_ids_unique": true,
    "observations_join_issued_forecasts": true,
    "issued_forecasts_have_observations": true,
    "targets_after_cutoff": true,
    "horizons_match_dates": true,
    "horizons_within_seven_day_contract": true,
    "observations_arrive_on_or_after_target": true,
    "scheduled_events_known_by_cutoff": true,
    "multiple_rolling_origins": true,
    "candidate_beats_baseline_mae": true,
    "candidate_reduces_peak_underforecast_cost": true,
    "local_range_coverage_at_least_0_85": true,
    "shadow_alert_count_at_most_3": true,
    "alert_review_matches_candidate_policy": true,
    "alert_precision_evidence_at_least_0_60": true,
    "alert_recall_evidence_at_least_0_60": true,
    "alert_rows_replayable": true,
    "rollback_pointer_recorded": true
  },
  "candidate_decision": "candidate_for_planner_shadow_review"
}

candidate_for_planner_shadow_review is intentionally narrower than launch approval. Frozen forecasts, later observation joins, and reviewed alerts say this bundle deserves planner observation beside current production forecasts. They don't prove that every center, tier, event slice, or future week will behave well. A reviewed-alert window with no triggered or actionable rows reports None, not an invented precision or recall score.
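The `None` behavior can be checked in isolation. This sketch repeats the `rate_or_none` helper from the review cell and adds a hypothetical `evidence_gate` function that mirrors how the precision and recall gates treat missing evidence.

```python
from typing import Optional

def rate_or_none(numerator: int, denominator: int) -> Optional[float]:
    """Return a rounded rate, or None when there is nothing to divide by."""
    return round(numerator / denominator, 3) if denominator else None

def evidence_gate(rate: Optional[float], floor: float) -> bool:
    """Hypothetical gate: missing evidence fails instead of raising or passing."""
    return rate is not None and rate >= floor

empty_window = rate_or_none(0, 0)      # no triggered rows, so precision is undefined
reviewed_window = rate_or_none(2, 3)   # two true positives out of three triggered alerts

print(empty_window, evidence_gate(empty_window, 0.60))        # None False
print(reviewed_window, evidence_gate(reviewed_window, 0.60))  # 0.667 True
```

Comparing `None >= 0.60` directly would raise a `TypeError`, which is why the `is not None` check comes first.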

Plan Refresh and Monitoring

New outcomes arrive daily, but model replacement should happen on a scheduled or triggered review cycle, not after every new observation. Track these operational items:

Operational item	Required decision
daily observation join	append actual count, then join it to immutable issued forecast ID
weekly accuracy report	compare baseline, production, and shadow candidate by slice
range coverage report	measure later-window coverage by center, tier, and horizon
alert resolution review	classify actionable, expected, or data issue
retraining trigger	investigate sustained cost regression before fitting replacement
promotion gate	rerun rolling backtest, protected slices, shadow review, and rollback check
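One way to express "sustained" in the retraining trigger is a streak check on weekly cost. The sketch below is an assumption-heavy illustration: the two-week streak, the weekly cost series, and the `sustained_regression` name are all hypothetical and are not part of forecast_error_policy.json.

```python
def sustained_regression(weekly_costs: list[int], reference_costs: list[int], min_weeks: int = 2) -> bool:
    """True when the most recent weeks exceed the reference cost at least min_weeks in a row."""
    streak = 0
    for cost, reference in zip(weekly_costs, reference_costs):
        # Any week at or below the reference resets the streak.
        streak = streak + 1 if cost > reference else 0
    return streak >= min_weeks

# Hypothetical weekly peak underforecast costs, oldest week first.
production = [72, 80, 95, 110]
baseline = [108, 108, 90, 100]

print(sustained_regression(production, baseline))          # True: the last two weeks regressed
print(sustained_regression([72, 120, 80], [108] * 3))      # False: one bad week, then recovery
```

A `True` result opens an investigation, not an automatic refit, because the regression may come from a data outage or an unscheduled event instead of model decay.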

Practice: break the forecast contract

Use the runnable examples as a release harness. Change one condition at a time, predict the failure, then rerun the examples.

Change every fold-2 issue date from 2026-02-08 to 2026-02-15. Which temporal gates fail?
Change the fold-1 Friday candidate from 140 to 132. Which cost gets worse even though only one row changed?
Set INTERVAL_HALF_WIDTH = 0. Why can MAE stay unchanged while the range and queue gates fail?
Mark A-104 as not actionable. Which alert metric improves, and which stays unchanged?
Set PREVIOUS_FORECAST = "". Which executable gate fails?
Give the first observation the forecast ID missing:standard:2026-02-01:2026-02-02. Which replay gate fails?
Set every reviewed alert's triggered and actionable values to False. Why do the alert-evidence gates fail instead of crashing or passing?
Change every reviewed alert's policy_version to outside-range-review-v0. Which provenance gate fails?

Mastery check

Evaluation rubric

Artifact	Strong submission demonstrates
forecast package	immutable issued rows, later observation joins, time-aware folds, seasonal baseline, expected ranges, and previous alias
forecast evaluation	MAE, peak underforecast cost, later-window range coverage, and slice review
alert workflow	replayable alert rows, reviewed precision and recall, ownership, and resolution logging
operations	shadow comparison, retraining investigation, promotion gates, and rollback plan

Common failures

Symptom	Cause	Fix
Backtest looks precise but live peaks miss	future or event leakage	freeze cutoff and known-in-advance fields
Backtest evidence changes after actuals arrive	issued forecast row was updated in place	append observations separately and join by stable forecast ID
One center's observation joins another center's forecast	forecast ID omits series identity	include center, tier, issue date, and target date in forecast ID
MAE improves while staffing misses stay expensive	peak slice absent from release gate	price high-volume underforecast cost separately
Planner receives noisy alerts	range policy has no reviewed outcomes	measure alert precision, recall, and queue volume
Alert metrics pass using an older threshold	review rows omit candidate and policy versions	bind every reviewed row to the forecast and alert policy under release
Forecast range looks reliable but later coverage fails	interval evidence is too small or in sample	measure held-out coverage by horizon and slice
Candidate advances without rollback target	receipt omits previous alias	publish immutable candidate and rollback pointer together
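The forecast-ID collision in the table above can be reproduced in a few lines. This sketch compares a hypothetical weak ID that omits series identity with the four-part ID used in this capstone; the `FC-B` row is invented for illustration.

```python
def weak_id(issued_at: str, target_date: str) -> str:
    """Hypothetical ID that leaves out center and tier."""
    return f"{issued_at}:{target_date}"

def series_id(center: str, service_tier: str, issued_at: str, target_date: str) -> str:
    """Same shape as the capstone's forecast ID."""
    return f"{center}:{service_tier}:{issued_at}:{target_date}"

# Two forecasts issued on the same day for the same target; FC-B is a made-up second center.
rows = [
    ("FC-A", "standard", "2026-02-01", "2026-02-02"),
    ("FC-B", "standard", "2026-02-01", "2026-02-02"),
]

weak_ids = {weak_id(issued_at, target) for _, _, issued_at, target in rows}
series_ids = {series_id(*row) for row in rows}
print(len(weak_ids), len(series_ids))  # 1 2
```

With the weak ID, a dictionary keyed by forecast ID would silently keep only one center's row, and the other center's observation would join against the wrong forecast.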

Next Step

You can now package time-ordered forecasts, shadow evidence, and reviewed capacity alerts. Next you'll ship a model over pixels, where image quality and human confirmation guard every damage route.

References

[1] Hyndman, R. J. & Athanasopoulos, G. (2021). Forecasting: Principles and Practice, Third Edition. https://otexts.com/fpp3/

LearnPortfolio CapstonesCapstone: Demand Forecasting

⚙️HardMLOps & Deployment

Capstone: Demand Forecasting

Ship a demand forecast and capacity-alert artifact with rolling backtests, alert review, and retraining policy.

16 min read

Learning path

Step 81 of 158 in the full curriculum

Capstone: Product Ranking Capstone: Image Damage Classifier

Choose the Series and Decision

Predict daily shipped parcels for each warehouse seven days ahead. Use an explicit planning contract:

Field	Contract
entity	fulfillment center and shipping service tier
target	parcels shipped per calendar day
horizon	next seven days
decision	planner reviews capacity when forecast or alert requires it
baseline	same weekday from prior week
evaluation	MAE plus underforecast cost by high-volume slice

Build a Reviewable Artifact

Your repository surface should look like:

text

demand-forecast/
  data/
    warehouse_daily_counts.parquet
    planned_events.json
    split_manifest.json
  forecasting/
    seasonal_baseline.py
    train_candidate.py
    rolling_backtest.py
  alerts/
    forecast_error_policy.json
    evaluate_alerts.py
  reports/
    backtest_metrics.json
    alert_review.csv
  tests/
    test_future_rows_excluded.py
    test_observation_join.py
    test_alert_contract.py

Freeze issued forecasts before outcomes arrive

Stream	Stored before replay	Arrival time
issued forecast	center, tier, issue date, target date, horizon, model outputs, range, known-event context	before target day
observation	forecast ID, observed date, actual parcel count	on or after target day
alert	joined forecast ID, range breach, policies, owner, resolution	after observation join

01-freeze-forecast-contract.py

from dataclasses import dataclass
from datetime import date
import json

@dataclass(frozen=True)
class IssuedForecast:
    forecast_id: str
    fold: str
    issued_at: date
    target_date: date
    horizon_day: int
    center: str
    service_tier: str
    baseline: int
    candidate: int
    lower: int
    upper: int
    scheduled_event: str | None = None
    event_known_at: date | None = None

@dataclass(frozen=True)
class Observation:
    forecast_id: str
    observed_at: date
    actual: int

@dataclass(frozen=True)
class BacktestRow:
    issued: IssuedForecast
    observation: Observation

INTERVAL_HALF_WIDTH = 6
INTERVAL_POLICY = "candidate-plus-minus-6-v1"

def issue_forecast(
    fold: str,
    issued_at: date,
    target_date: date,
    center: str,
    service_tier: str,
    baseline: int,
    candidate: int,
    scheduled_event: str | None = None,
    event_known_at: date | None = None,
) -> IssuedForecast:
    horizon_day = (target_date - issued_at).days
    return IssuedForecast(
        forecast_id=f"{center}:{service_tier}:{issued_at}:{target_date}",
        fold=fold,
        issued_at=issued_at,
        target_date=target_date,
        horizon_day=horizon_day,
        center=center,
        service_tier=service_tier,
        baseline=baseline,
        candidate=candidate,
        lower=candidate - INTERVAL_HALF_WIDTH,
        upper=candidate + INTERVAL_HALF_WIDTH,
        scheduled_event=scheduled_event,
        event_known_at=event_known_at,
    )

print("interval_policy:", INTERVAL_POLICY, "half_width:", INTERVAL_HALF_WIDTH)

Output

interval_policy: candidate-plus-minus-6-v1 half_width: 6

02-issue-forecasts-and-observations.py

ISSUED_FORECASTS = [
    issue_forecast("fold-1", date(2026, 2, 1), date(2026, 2, 2), "FC-A", "standard", 100, 103),
    issue_forecast("fold-1", date(2026, 2, 1), date(2026, 2, 3), "FC-A", "standard", 112, 111),
    issue_forecast("fold-1", date(2026, 2, 1), date(2026, 2, 4), "FC-A", "standard", 115, 117),
    issue_forecast("fold-1", date(2026, 2, 1), date(2026, 2, 5), "FC-A", "standard", 118, 117),
    issue_forecast("fold-1", date(2026, 2, 1), date(2026, 2, 6), "FC-A", "standard", 132, 140, "seller-campaign", date(2026, 1, 29)),
    issue_forecast("fold-1", date(2026, 2, 1), date(2026, 2, 7), "FC-A", "standard", 82, 83),
    issue_forecast("fold-1", date(2026, 2, 1), date(2026, 2, 8), "FC-A", "standard", 76, 77),
    issue_forecast("fold-2", date(2026, 2, 8), date(2026, 2, 9), "FC-A", "standard", 104, 105),
    issue_forecast("fold-2", date(2026, 2, 8), date(2026, 2, 10), "FC-A", "standard", 110, 112),
    issue_forecast("fold-2", date(2026, 2, 8), date(2026, 2, 11), "FC-A", "standard", 119, 120),
    issue_forecast("fold-2", date(2026, 2, 8), date(2026, 2, 12), "FC-A", "standard", 116, 119),
    issue_forecast("fold-2", date(2026, 2, 8), date(2026, 2, 13), "FC-A", "standard", 160, 164, "seller-campaign", date(2026, 2, 5)),
    issue_forecast("fold-2", date(2026, 2, 8), date(2026, 2, 14), "FC-A", "standard", 84, 85),
    issue_forecast("fold-2", date(2026, 2, 8), date(2026, 2, 15), "FC-A", "standard", 78, 79),
]

ACTUALS = [104, 110, 119, 116, 160, 84, 78, 106, 113, 121, 118, 168, 86, 80]
OBSERVATIONS = [
    Observation(forecast.forecast_id, forecast.target_date, actual)
    for forecast, actual in zip(ISSUED_FORECASTS, ACTUALS, strict=True)
]
issued_by_id = {row.forecast_id: row for row in ISSUED_FORECASTS}
observations_by_forecast_id = {row.forecast_id: row for row in OBSERVATIONS}

print("issued forecasts:", len(ISSUED_FORECASTS))
print("later observations:", len(OBSERVATIONS))

03-join-backtest-rows.py

ROWS = [
    BacktestRow(forecast, observations_by_forecast_id[forecast.forecast_id])
    for forecast in ISSUED_FORECASTS
    if forecast.forecast_id in observations_by_forecast_id
]

folds = sorted({row.fold for row in ISSUED_FORECASTS})
for fold in folds:
    rows = [row for row in ISSUED_FORECASTS if row.fold == fold]
    print(
        f"{fold}: cutoff={rows[0].issued_at}",
        f"window={rows[0].target_date}..{rows[-1].target_date}",
        f"rows={len(rows)}",
    )
print("first forecast id:", ISSUED_FORECASTS[0].forecast_id)
print("joined backtest rows:", len(ROWS))

Output

fold-1: cutoff=2026-02-01 window=2026-02-02..2026-02-08 rows=7
fold-2: cutoff=2026-02-08 window=2026-02-09..2026-02-15 rows=7
first forecast id: FC-A:standard:2026-02-01:2026-02-02
joined backtest rows: 14
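
The two windows printed above follow from one rule: each cutoff owns the next seven target dates. As a minimal sketch of that rule, assuming a hypothetical helper named `fold_window` and a `HORIZON_DAYS` constant that aren't part of the capstone code:

```python
from datetime import date, timedelta

HORIZON_DAYS = 7  # assumed seven-day contract, matching the printed windows


def fold_window(cutoff: date, horizon_days: int = HORIZON_DAYS) -> list[date]:
    """Return the target dates a forecast issued at `cutoff` must cover."""
    return [cutoff + timedelta(days=h) for h in range(1, horizon_days + 1)]


# The two rolling origins used in this backtest.
cutoffs = [date(2026, 2, 1), date(2026, 2, 8)]
windows = {str(cutoff): fold_window(cutoff) for cutoff in cutoffs}

for cutoff, targets in windows.items():
    print(f"cutoff={cutoff} window={targets[0]}..{targets[-1]} rows={len(targets)}")
```

Every target date falls strictly after its cutoff, so no window can borrow information from the days it is supposed to predict.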

Evaluate Forecasts and Alerts Separately

04-evaluate-point-and-range-metrics.py

def mae(field: str) -> float:
    return sum(
        abs(row.observation.actual - getattr(row.issued, field))
        for row in ROWS
    ) / len(ROWS)

def peak_underforecast_cost(field: str) -> int:
    return sum(
        3 * max(row.observation.actual - getattr(row.issued, field), 0)
        for row in ROWS
        if row.observation.actual >= 150
    )

def coverage() -> float:
    return sum(
        row.issued.lower <= row.observation.actual <= row.issued.upper
        for row in ROWS
    ) / len(ROWS)

ALERT_POLICY = "outside-range-review-v1"
ALERT_OWNER = "capacity-ops"
CANDIDATE_FORECAST = "warehouse-demand-v1"
PREVIOUS_FORECAST = "warehouse-demand-v0"

forecast_metrics = {
    "baseline_mae": round(mae("baseline"), 3),
    "candidate_mae": round(mae("candidate"), 3),
    "baseline_peak_underforecast_cost": peak_underforecast_cost("baseline"),
    "candidate_peak_underforecast_cost": peak_underforecast_cost("candidate"),
    "range_coverage": round(coverage(), 3),
    "range_rows": len(ROWS),
}

print(json.dumps(forecast_metrics, indent=2))
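
Peak underforecast cost only prices rows at or above 150 parcels, and only the shortfall. The sketch below reworks the candidate's figure by hand from the two campaign days in this backtest. The names `PEAK_THRESHOLD`, `UNDERFORECAST_WEIGHT`, and `row_cost` are illustrative, not part of the capstone code.

```python
PEAK_THRESHOLD = 150       # same cutoff used inside peak_underforecast_cost
UNDERFORECAST_WEIGHT = 3   # same per-parcel weight

# (target date, candidate forecast, actual) for the two peak days.
peak_rows = [
    ("2026-02-06", 140, 160),
    ("2026-02-13", 164, 168),
]


def row_cost(candidate: int, actual: int) -> int:
    """Price only the shortfall, and only on peak-volume days."""
    if actual < PEAK_THRESHOLD:
        return 0
    return UNDERFORECAST_WEIGHT * max(actual - candidate, 0)


costs = {target: row_cost(candidate, actual) for target, candidate, actual in peak_rows}
total_cost = sum(costs.values())

print(costs)       # {'2026-02-06': 60, '2026-02-13': 12}
print(total_cost)  # 72
```

Overforecasts and quiet days contribute nothing, which is why this number can move independently of MAE.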

05-build-range-alerts.py

new_alerts = [
    {
        "forecast_id": row.issued.forecast_id,
        "forecast_version": CANDIDATE_FORECAST,
        "interval_policy": INTERVAL_POLICY,
        "policy_version": ALERT_POLICY,
        "owner": ALERT_OWNER,
        "issued_at": str(row.issued.issued_at),
        "target_date": str(row.issued.target_date),
        "horizon_day": row.issued.horizon_day,
        "center": row.issued.center,
        "service_tier": row.issued.service_tier,
        "expected_range": [row.issued.lower, row.issued.upper],
        "observed_at": str(row.observation.observed_at),
        "observed": row.observation.actual,
        "error": row.observation.actual - row.issued.candidate,
        "scheduled_event": row.issued.scheduled_event,
        "event_known_at": str(row.issued.event_known_at) if row.issued.event_known_at else None,
        "resolution": None,
    }
    for row in ROWS
    if not row.issued.lower <= row.observation.actual <= row.issued.upper
]

print("new alerts:", json.dumps(new_alerts, indent=2))

Output

new alerts: [
  {
    "forecast_id": "FC-A:standard:2026-02-01:2026-02-06",
    "forecast_version": "warehouse-demand-v1",
    "interval_policy": "candidate-plus-minus-6-v1",
    "policy_version": "outside-range-review-v1",
    "owner": "capacity-ops",
    "issued_at": "2026-02-01",
    "target_date": "2026-02-06",
    "horizon_day": 5,
    "center": "FC-A",
    "service_tier": "standard",
    "expected_range": [
      134,
      146
    ],
    "observed_at": "2026-02-06",
    "observed": 160,
    "error": 20,
    "scheduled_event": "seller-campaign",
    "event_known_at": "2026-01-29",
    "resolution": null
  }
]
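
The single alert can be re-derived from the range policy alone. This sketch assumes a half-width of 6, inferred from the policy name `candidate-plus-minus-6-v1`. The helpers `expected_range` and `review_row` are hypothetical.

```python
HALF_WIDTH = 6  # assumed from the policy name candidate-plus-minus-6-v1


def expected_range(candidate: int, half_width: int = HALF_WIDTH) -> tuple[int, int]:
    """Symmetric expected range around the candidate point forecast."""
    return candidate - half_width, candidate + half_width


def review_row(candidate: int, actual: int) -> dict[str, object]:
    """Rebuild the alert-relevant fields for one joined row."""
    lower, upper = expected_range(candidate)
    return {
        "expected_range": [lower, upper],
        "error": actual - candidate,
        "alert": not lower <= actual <= upper,
    }


feb_06 = review_row(candidate=140, actual=160)  # fold-1 campaign day
feb_13 = review_row(candidate=164, actual=168)  # fold-2 campaign day

print("2026-02-06:", feb_06)
print("2026-02-13:", feb_13)
```

Both days carry the seller campaign, but only February 6 leaves its range. The fold-2 forecast already priced the campaign in, so its error of 4 stays inside 158 to 170 and raises no alert.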

Publish a Shadow-Review Receipt

06-review-historical-alerts.py

REVIEWED_ALERTS = [
    {
        "alert_id": "A-101",
        "forecast_version": CANDIDATE_FORECAST,
        "policy_version": ALERT_POLICY,
        "triggered": True,
        "actionable": True,
    },
    {
        "alert_id": "A-102",
        "forecast_version": CANDIDATE_FORECAST,
        "policy_version": ALERT_POLICY,
        "triggered": True,
        "actionable": True,
    },
    {
        "alert_id": "A-103",
        "forecast_version": CANDIDATE_FORECAST,
        "policy_version": ALERT_POLICY,
        "triggered": True,
        "actionable": False,
    },
    {
        "alert_id": "A-104",
        "forecast_version": CANDIDATE_FORECAST,
        "policy_version": ALERT_POLICY,
        "triggered": False,
        "actionable": True,
    },
    {
        "alert_id": "A-105",
        "forecast_version": CANDIDATE_FORECAST,
        "policy_version": ALERT_POLICY,
        "triggered": False,
        "actionable": False,
    },
]

true_positives = sum(row["triggered"] and row["actionable"] for row in REVIEWED_ALERTS)
false_positives = sum(row["triggered"] and not row["actionable"] for row in REVIEWED_ALERTS)
false_negatives = sum(not row["triggered"] and row["actionable"] for row in REVIEWED_ALERTS)

def rate_or_none(numerator: int, denominator: int) -> float | None:
    return round(numerator / denominator, 3) if denominator else None

alert_review = {
    "forecast_version": CANDIDATE_FORECAST,
    "policy_version": ALERT_POLICY,
    "precision": rate_or_none(true_positives, true_positives + false_positives),
    "recall": rate_or_none(true_positives, true_positives + false_negatives),
    "reviewed_rows": len(REVIEWED_ALERTS),
}

print("alert_review:", alert_review)

07-backtest-release-gates.py

required_alert_fields = {
    "forecast_id", "forecast_version", "interval_policy", "policy_version", "owner",
    "issued_at", "target_date", "horizon_day", "center", "service_tier",
    "expected_range", "observed_at", "observed", "error", "scheduled_event",
    "event_known_at", "resolution",
}
release_gates = {
    "issued_forecast_ids_unique": len(issued_by_id) == len(ISSUED_FORECASTS),
    "observation_forecast_ids_unique": len(observations_by_forecast_id) == len(OBSERVATIONS),
    "observations_join_issued_forecasts": all(
        row.forecast_id in issued_by_id
        for row in OBSERVATIONS
    ),
    "issued_forecasts_have_observations": len(ROWS) == len(ISSUED_FORECASTS),
    "targets_after_cutoff": all(row.issued.issued_at < row.issued.target_date for row in ROWS),
    "horizons_match_dates": all(
        row.issued.horizon_day == (row.issued.target_date - row.issued.issued_at).days
        for row in ROWS
    ),
    "horizons_within_seven_day_contract": all(
        1 <= row.issued.horizon_day <= 7
        for row in ROWS
    ),
    "observations_arrive_on_or_after_target": all(
        row.observation.observed_at >= row.issued.target_date
        for row in ROWS
    ),
    "scheduled_events_known_by_cutoff": all(
        row.event_known_at is None or row.event_known_at <= row.issued_at
        for row in ISSUED_FORECASTS
    ),
    "multiple_rolling_origins": len({row.issued_at for row in ISSUED_FORECASTS}) >= 2,
    "candidate_beats_baseline_mae": forecast_metrics["candidate_mae"] < forecast_metrics["baseline_mae"],
    "candidate_reduces_peak_underforecast_cost": (
        forecast_metrics["candidate_peak_underforecast_cost"]
        < forecast_metrics["baseline_peak_underforecast_cost"]
    ),
    "local_range_coverage_at_least_0_85": forecast_metrics["range_coverage"] >= 0.85,
    "shadow_alert_count_at_most_3": len(new_alerts) <= 3,
    "alert_review_matches_candidate_policy": all(
        row["forecast_version"] == CANDIDATE_FORECAST
        and row["policy_version"] == ALERT_POLICY
        for row in REVIEWED_ALERTS
    ),
    "alert_precision_evidence_at_least_0_60": (
        alert_review["precision"] is not None
        and alert_review["precision"] >= 0.60
    ),
    "alert_recall_evidence_at_least_0_60": (
        alert_review["recall"] is not None
        and alert_review["recall"] >= 0.60
    ),
    "alert_rows_replayable": all(required_alert_fields <= row.keys() for row in new_alerts),
    "rollback_pointer_recorded": bool(PREVIOUS_FORECAST),
}

print("release_gates_pass:", all(release_gates.values()))

08-assemble-shadow-receipt.py

receipt = {
    "candidate_forecast": CANDIDATE_FORECAST,
    "previous_forecast": PREVIOUS_FORECAST,
    "latest_rolling_origin": str(max(row.issued_at for row in ISSUED_FORECASTS)),
    "interval_policy": INTERVAL_POLICY,
    "alert_policy": ALERT_POLICY,
    "owner": ALERT_OWNER,
    "replay": {
        "issued_forecast_rows": len(ISSUED_FORECASTS),
        "joined_observation_rows": len(ROWS),
    },
    "backtest": forecast_metrics,
    "alert_review": alert_review,
    "release_gates": release_gates,
    "candidate_decision": "candidate_for_planner_shadow_review" if all(release_gates.values()) else "hold",
}

print("candidate_decision:", receipt["candidate_decision"])

09-verify-receipt-fields.py

assert receipt["candidate_forecast"] == CANDIDATE_FORECAST
assert receipt["previous_forecast"] == PREVIOUS_FORECAST
assert receipt["backtest"]["candidate_mae"] < receipt["backtest"]["baseline_mae"]
print("receipt keys:", sorted(receipt))

10-publish-shadow-review-receipt.py

print(json.dumps(receipt, indent=2))

Output

{
  "candidate_forecast": "warehouse-demand-v1",
  "previous_forecast": "warehouse-demand-v0",
  "latest_rolling_origin": "2026-02-08",
  "interval_policy": "candidate-plus-minus-6-v1",
  "alert_policy": "outside-range-review-v1",
  "owner": "capacity-ops",
  "replay": {
    "issued_forecast_rows": 14,
    "joined_observation_rows": 14
  },
  "backtest": {
    "baseline_mae": 4.643,
    "candidate_mae": 2.643,
    "baseline_peak_underforecast_cost": 108,
    "candidate_peak_underforecast_cost": 72,
    "range_coverage": 0.929,
    "range_rows": 14
  },
  "alert_review": {
    "forecast_version": "warehouse-demand-v1",
    "policy_version": "outside-range-review-v1",
    "precision": 0.667,
    "recall": 0.667,
    "reviewed_rows": 5
  },
  "release_gates": {
    "issued_forecast_ids_unique": true,
    "observation_forecast_ids_unique": true,
    "observations_join_issued_forecasts": true,
    "issued_forecasts_have_observations": true,
    "targets_after_cutoff": true,
    "horizons_match_dates": true,
    "horizons_within_seven_day_contract": true,
    "observations_arrive_on_or_after_target": true,
    "scheduled_events_known_by_cutoff": true,
    "multiple_rolling_origins": true,
    "candidate_beats_baseline_mae": true,
    "candidate_reduces_peak_underforecast_cost": true,
    "local_range_coverage_at_least_0_85": true,
    "shadow_alert_count_at_most_3": true,
    "alert_review_matches_candidate_policy": true,
    "alert_precision_evidence_at_least_0_60": true,
    "alert_recall_evidence_at_least_0_60": true,
    "alert_rows_replayable": true,
    "rollback_pointer_recorded": true
  },
  "candidate_decision": "candidate_for_planner_shadow_review"
}
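
When a receipt says `hold`, a reviewer needs the failing gate names, not just a boolean. The sketch below assumes a hypothetical helper, `summarize_gates`, that isn't part of the capstone code. The three-gate dictionary is a trimmed illustration with one gate forced to fail, not the published 19-gate receipt.

```python
PASS_DECISION = "candidate_for_planner_shadow_review"  # decision strings copied from the receipt
HOLD_DECISION = "hold"


def summarize_gates(release_gates: dict[str, bool]) -> dict[str, object]:
    """Count passing gates and name the failures (hypothetical reviewer helper)."""
    failed = sorted(name for name, passed in release_gates.items() if not passed)
    return {
        "passed": len(release_gates) - len(failed),
        "total": len(release_gates),
        "failed": failed,
        "decision": HOLD_DECISION if failed else PASS_DECISION,
    }


# Trimmed illustration: one gate is forced to fail to show the hold path.
example_gates = {
    "candidate_beats_baseline_mae": False,
    "local_range_coverage_at_least_0_85": True,
    "rollback_pointer_recorded": True,
}

summary = summarize_gates(example_gates)
print(summary)
```

Applied to the receipt above, the same helper would report 19 of 19 passed with an empty `failed` list.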

Plan Refresh and Monitoring

New outcomes arrive daily, but model replacement should happen only on a scheduled or triggered review cycle. Give each recurring job an explicit decision:

Operational item	Required decision
daily observation join	append actual count, then join it to immutable issued forecast ID
weekly accuracy report	compare baseline, production, and shadow candidate by slice
range coverage report	measure later-window coverage by center, tier, and horizon
alert resolution review	classify actionable, expected, or data issue
retraining trigger	investigate sustained cost regression before fitting replacement
promotion gate	rerun rolling backtest, protected slices, shadow review, and rollback check

Practice: break the forecast contract

Use the runnable examples as a release harness. Change one condition at a time, predict the failure, then rerun the examples.

Change every fold-2 issue date from 2026-02-08 to 2026-02-15. Which temporal gates fail?
Change fold-1 Friday candidate from 140 to 132. Which cost gets worse even though only one row changed?
Set INTERVAL_HALF_WIDTH = 0. Why can MAE stay unchanged while range and queue gates fail?
Mark A-104 as not actionable. Which alert metric improves, and which stays unchanged?
Set PREVIOUS_FORECAST = "". Which executable gate fails?
Give the first observation the forecast ID missing:standard:2026-02-01:2026-02-02. Which replay gate fails?
Set every reviewed alert's triggered and actionable values to False. Why do alert-evidence gates fail instead of crashing or passing?
Change every reviewed alert's policy_version to outside-range-review-v0. Which provenance gate fails?

Practice answer sketches

Mastery check

Evaluation rubric

Artifact	Strong submission demonstrates
forecast package	immutable issued rows, later observation joins, time-aware folds, seasonal baseline, expected ranges, and previous alias
forecast evaluation	MAE, peak underforecast cost, later-window range coverage, and slice review
alert workflow	replayable alert rows, reviewed precision and recall, ownership, and resolution logging
operations	shadow comparison, retraining investigation, promotion gates, and rollback plan

Common failures

Symptom	Cause	Fix
Backtest looks precise but live peaks miss	future or event leakage	freeze cutoff and known-in-advance fields
Backtest evidence changes after actuals arrive	issued forecast row was updated in place	append observations separately and join by stable forecast ID
One center's observation joins another center's forecast	forecast ID omits series identity	include center, tier, issue date, and target date in forecast ID
MAE improves while staffing misses stay expensive	peak slice absent from release gate	price high-volume underforecast cost separately
Planner receives noisy alerts	range policy has no reviewed outcomes	measure alert precision, recall, and queue volume
Alert metrics pass using an older threshold	review rows omit candidate and policy versions	bind every reviewed row to the forecast and alert policy under release
Forecast range looks reliable but later coverage fails	interval evidence is too small or in sample	measure held-out coverage by horizon and slice
Candidate advances without rollback target	receipt omits previous alias	publish immutable candidate and rollback pointer together
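
Two of these fixes, immutable issued rows and a forecast ID that carries series identity, fit in a few lines. The sketch below uses a hypothetical `IssuedForecastSketch` class, a hypothetical second center `FC-B`, and made-up candidate values. Only the ID layout is copied from the backtest output.

```python
from dataclasses import FrozenInstanceError, dataclass
from datetime import date


@dataclass(frozen=True)  # frozen: issued rows cannot be edited after actuals arrive
class IssuedForecastSketch:
    center: str
    service_tier: str
    issued_at: date
    target_date: date
    candidate: int

    @property
    def forecast_id(self) -> str:
        """Stable join key: center, tier, issue date, and target date."""
        return f"{self.center}:{self.service_tier}:{self.issued_at}:{self.target_date}"


# Candidate values and the FC-B center are illustrative only.
fc_a = IssuedForecastSketch("FC-A", "standard", date(2026, 2, 1), date(2026, 2, 2), 100)
fc_b = IssuedForecastSketch("FC-B", "standard", date(2026, 2, 1), date(2026, 2, 2), 90)

# An ID built from dates alone collides across centers.
weak_ids = {f"{row.issued_at}:{row.target_date}" for row in (fc_a, fc_b)}
strong_ids = {row.forecast_id for row in (fc_a, fc_b)}

try:
    fc_a.candidate = 160  # attempt to rewrite the forecast after seeing the actual
    rewrite_blocked = False
except FrozenInstanceError:
    rewrite_blocked = True

print(fc_a.forecast_id)
print("weak ids:", len(weak_ids), "strong ids:", len(strong_ids))
print("rewrite blocked:", rewrite_blocked)
```

The date-only key maps both centers to one ID, which is how one center's observation ends up joined to another center's forecast.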

Next Step

Continue to Capstone: Image Damage Classifier

You can now package time-ordered forecasts, shadow evidence, and reviewed capacity alerts. Next you'll ship a model over pixels, where image quality and human confirmation guard every damage route.





Practice answer sketches

Which gates fail when fold-2 is issued on 2026-02-15?

What changes when the first Friday candidate falls from 140 to 132?

Why do zero-width expected ranges fail local range and queue gates without changing MAE?

What changes when A-104 becomes not actionable?

Which gate fails when PREVIOUS_FORECAST becomes an empty string?

Which gate fails when an observation points at forecast ID missing:standard:2026-02-01:2026-02-02?

What happens when reviewed history contains no triggered or actionable alerts?

Which gate fails when reviewed alerts come from outside-range-review-v0?

Mastery check

Why are rolling-origin rows stronger evidence than one random holdout?

Why keep MAE, peak underforecast cost, and alert-review metrics separate?

Why append observations separately instead of updating issued forecast rows in place?

Why is candidate_for_planner_shadow_review weaker than production promotion?
