Forecast parcel demand with time-aware evaluation and turn large residuals into reviewable operational alerts.
A ranking model selects which products appear now. Warehouse planners face a future question: how many parcels will leave a fulfillment center tomorrow, and which observed counts signal an unexpected disruption?
Forecasting differs from ordinary random-split prediction because observations are ordered. Tomorrow isn't exchangeable with last month. A useful evaluation must train on the past and predict a later window, repeatedly if possible.
Suppose Saturday parcel volume is normally lower than weekday volume. Predicting tomorrow using yesterday alone will overreact each Friday and Monday. A small baseline can predict each weekday from the same weekday last week.
| Day | Last week parcels | This week parcels | Weekly-lag error |
|---|---|---|---|
| Monday | 100 | 104 | 4 |
| Tuesday | 112 | 110 | -2 |
| Wednesday | 115 | 119 | 4 |
| Thursday | 118 | 116 | -2 |
| Friday | 132 | 160 | 28 |
The Friday miss might be an anomaly, or it might be a promotion planned outside the dataset. A monitoring system shouldn't announce a warehouse incident from one residual alone; it should surface the deviation with context and a review path.
Hyndman and Athanasopoulos emphasize that forecast accuracy must be evaluated on genuinely future observations, with time-series cross-validation expanding or rolling the training origin forward.[1] That rule prevents tomorrow's demand from influencing the model that claims it could have predicted tomorrow.
Mean absolute error (MAE) averages absolute forecast mistakes:
Here is observed volume and is its forecast. MAE remains in parcels, which makes it understandable to operations.
1history = [100, 112, 115, 118, 132, 82, 76]
2observed = [104, 110, 119, 116, 160, 84, 78]
3
4forecasts = history[:] # same weekday last week
5errors = [actual - forecast for actual, forecast in zip(observed, forecasts)]
6mae = sum(abs(error) for error in errors) / len(errors)
7alert_limit = 20
8alerts = [
9 (day, actual, forecast, error)
10 for day, (actual, forecast, error) in enumerate(zip(observed, forecasts, errors), start=1)
11 if abs(error) >= alert_limit
12]
13
14print("MAE parcels:", round(mae, 1))
15for day, actual, forecast, error in alerts:
16 print(f"alert day={day} observed={actual} forecast={forecast} residual={error:+d}")1MAE parcels: 6.3
2alert day=5 observed=160 forecast=132 residual=+28The example gives a baseline and one reviewable alert. In a real system, an alert threshold should reflect historical residual variation and capacity cost. A prediction interval or empirically measured quantile of prior residuals is more defensible than picking 20 without evidence.
A single future week is fragile evidence. Run several rolling-origin evaluations:
| Training history | Validation window | Question |
|---|---|---|
| January | first week of February | works after initial launch? |
| January through February | first week of March | adapts after more data? |
| January through March | first week of April | survives seasonal shift? |
Record MAE by fulfillment center, service tier, weekday, and promotion status. A national average can hide a warehouse that routinely underforecasts peak demand. For inventory decisions, also consider asymmetry: underpredicting demand may cost more than reserving excess capacity.
The target must match the decision. Forecasting shipped parcel count helps staffing; forecasting late parcels helps customer notification; detecting an unusual scan-drop rate helps incident response. Don't combine those labels into one ambiguous "operations score."
An anomaly is a residual large enough to justify inspection, not proof that the model or warehouse failed. The Friday spike may correspond to a campaign, a bulk seller import, duplicated event ingestion, or genuine demand growth.
Log an alert artifact:
| Field | Purpose |
|---|---|
| series and warehouse | route investigation |
| forecast version and training cutoff | reproduce expectation |
| observed value and residual | quantify deviation |
| interval or threshold version | explain alert policy |
| known promotion flag | reduce false alarms |
| owner and resolution | teach future models and policies |
A model that silently swallows anomalies is hard to operate. A model that pages on every seasonal pattern is equally unhelpful. Forecast evaluation and alert evaluation are related but separate: one measures prediction error, the other measures whether the alert led to useful action.
| Evidence | A production-ready answer demonstrates |
|---|---|
| temporal evaluation | uses rolling or future-window backtests rather than shuffled rows |
| forecast use | maps prediction intervals and error costs to inventory or capacity decisions |
| alert policy | detects actionable residual anomalies with escalation and review controls |
| Symptom | Cause | Fix |
|---|---|---|
| Test results collapse at launch | random split leaked future behavior | use rolling-origin validation |
| Alerts fire every Monday | weekly seasonality absent | add seasonal baseline first |
| Operations ignores alerts | no ownership or context | log threshold, residual, and resolution |