Turn delivery events into stable prediction inputs while preventing leakage and training-serving mismatch.
The previous lesson produced versioned datasets with clean splits. A prediction model still can't consume a raw event log directly. To predict whether a parcel will be late, it needs a fixed row of measurements available at the moment the promise is made.
Those measurements are features. A feature such as hours_since_last_scan compresses many warehouse events into one input value. The hard part isn't inventing columns. It's ensuring each value means the same thing during training and while serving live requests.
Suppose the product asks at 2026-05-01 09:00: will order O-204 arrive later than its promised day? Carrier scans after that timestamp aren't available to the prediction service and can't appear in its training row.
| Candidate field | Known at prediction time? | Use as feature? |
|---|---|---|
| carrier code | yes | yes, categorical |
| route distance in km | yes | yes, numeric |
| hours since most recent scan | yes | yes, numeric |
| warehouse backlog at origin | yes | yes, numeric |
| delivered timestamp | no | no, it defines the eventual label |
| customer complaint after delay | no | no, it leaks the outcome |
The label is computed later, once the outcome is known: delivered_after_promise = 1 when the parcel arrives after its promised day. A feature must be computed from history ending at the prediction timestamp. If a training row contains the later complaint, offline accuracy will reward a model for reading the answer.
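The time boundary described above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the event log, the event type names, and the promised_by deadline are all assumptions made up for the example. The feature reads only events at or before the prediction timestamp, while the label is allowed to read the full log.

```python
from datetime import datetime

# Hypothetical event log for order O-204; types and timestamps are illustrative.
events = [
    {"type": "scan", "at": "2026-04-30T22:00:00"},
    {"type": "scan", "at": "2026-05-01T01:00:00"},
    {"type": "scan", "at": "2026-05-02T07:30:00"},       # after prediction time
    {"type": "delivered", "at": "2026-05-03T14:00:00"},  # after prediction time
]


def visible_events(events, prediction_time):
    """Keep only events timestamped at or before the prediction time."""
    return [e for e in events if datetime.fromisoformat(e["at"]) <= prediction_time]


def hours_since_last_scan(events, prediction_time):
    """Feature: computed from visible history only; None if no scan is visible."""
    scans = [
        datetime.fromisoformat(e["at"])
        for e in visible_events(events, prediction_time)
        if e["type"] == "scan"
    ]
    if not scans:
        return None
    return (prediction_time - max(scans)).total_seconds() / 3600


def late_label(events, promised_by):
    """Label: may use the full log, including events after prediction time."""
    delivered = [datetime.fromisoformat(e["at"]) for e in events if e["type"] == "delivered"]
    if not delivered:
        return None  # outcome not known yet, so the row has no label
    return int(min(delivered) > promised_by)


prediction_time = datetime.fromisoformat("2026-05-01T09:00:00")
promised_by = datetime.fromisoformat("2026-05-02T23:59:59")  # assumed promise deadline

feature_value = hours_since_last_scan(events, prediction_time)
label = late_label(events, promised_by)
print(feature_value, label)
```

The scan on 2026-05-02 exists in the log but never reaches the feature, because the filter is applied before any aggregation.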
Sculley et al. describe production ML systems as networks of data and configuration dependencies where hidden feedback and undeclared consumers create technical debt.[1] Feature definitions are one of those dependencies: when their time boundary is unclear, an impressive offline score may not survive deployment.
Build a small contract for late-delivery prediction:
| Feature | Type | Missing rule | Why it can help |
|---|---|---|---|
| distance_km | numeric | reject if absent | longer routes allow more disruption |
| hours_since_last_scan | numeric | cap at 168 | stale movement signals delay risk |
| origin_backlog | numeric | use measured queue only | congestion affects departure time |
| carrier | categorical | map unseen to other | carriers have different networks |
| expedited | boolean | default false only when source guarantees it | service class changes promise |
A missing value is a product decision. Filling missing origin_backlog with zero says "unknown congestion means no congestion," which is rarely defensible. Store an additional origin_backlog_missing indicator or stop scoring until the feed recovers.
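One way to make that decision explicit is to encode it in the feature builder. The sketch below is an assumption-laden illustration: the function name backlog_features, the fail_closed flag, and the FeedUnavailable exception are hypothetical names, not part of any library. It shows both options from the paragraph above: emit a missing indicator alongside a placeholder, or refuse to score.

```python
class FeedUnavailable(Exception):
    """Hypothetical error raised when scoring should stop for lack of data."""


def backlog_features(row, fail_closed=False):
    """Return origin_backlog plus an explicit missing indicator.

    With fail_closed=True, a missing value stops scoring instead of being filled.
    """
    value = row.get("origin_backlog")
    if value is None:
        if fail_closed:
            raise FeedUnavailable("origin_backlog missing; refusing to score")
        # 0.0 is only a placeholder; the indicator tells the model it is not a measurement.
        return {"origin_backlog": 0.0, "origin_backlog_missing": 1}
    return {"origin_backlog": float(value), "origin_backlog_missing": 0}


print(backlog_features({"origin_backlog": 18}))
print(backlog_features({}))
```

Whichever branch is chosen, the same rule has to run in training and in serving, otherwise the indicator itself becomes a source of mismatch.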
Categorical values need a policy too. If a new carrier appears after training, the online encoder can't invent a new model column. A catch-all "other" bucket provides stable behavior while a new model candidate is evaluated.
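A minimal sketch of such a policy, assuming a one-hot encoding and a carrier vocabulary frozen at training time. The names TRAINED_CARRIERS and encode_carrier are hypothetical, and "eastfreight" and "skyparcel" are invented carriers used only for illustration.

```python
# Hypothetical vocabulary frozen when the model was trained; "other" is reserved.
TRAINED_CARRIERS = ("northline", "eastfreight", "other")


def encode_carrier(carrier):
    """One-hot encode against the frozen vocabulary.

    Unseen or missing carriers fall into the reserved "other" column,
    so the number and order of model inputs never change at serving time.
    """
    bucket = carrier if carrier in TRAINED_CARRIERS else "other"
    return {f"carrier_{name}": int(name == bucket) for name in TRAINED_CARRIERS}


print(encode_carrier("northline"))
print(encode_carrier("skyparcel"))  # carrier that appeared after training
```

Because the output always has the same keys, a new carrier changes which column is set, not the shape of the input row.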
The following tiny transformation creates a feature row using only facts visible at prediction time and leaves post-delivery fields out of the model input.
```python
from datetime import datetime

prediction_time = datetime.fromisoformat("2026-05-01T09:00:00")
shipment = {
    "order_id": "O-204",
    "carrier": "northline",
    "distance_km": 620,
    "origin_backlog": 18,
    "expedited": False,
    "last_scan_at": "2026-05-01T01:00:00",
    "delivered_at": "2026-05-03T14:00:00",
    "late_complaint": True,
}

# Fields that only exist after the outcome is known.
future_only = {"delivered_at", "late_complaint"}
# Source fields the feature builder is allowed to read.
feature_sources = {"carrier", "distance_km", "origin_backlog", "expedited", "last_scan_at"}


def make_features(row, at):
    # Leakage gate: fails if a future-only field is ever added to the allowed sources.
    assert future_only.isdisjoint(feature_sources)
    last_scan = datetime.fromisoformat(row["last_scan_at"])
    return {
        "carrier": row.get("carrier", "other"),
        "distance_km": float(row["distance_km"]),
        "hours_since_last_scan": (at - last_scan).total_seconds() / 3600,
        "origin_backlog": float(row["origin_backlog"]),
        "expedited": int(row.get("expedited", False)),
    }


features = make_features(shipment, prediction_time)
print(features)
print("contains future outcome:", any(key in features for key in future_only))
```

```
{'carrier': 'northline', 'distance_km': 620.0, 'hours_since_last_scan': 8.0, 'origin_backlog': 18.0, 'expedited': 0}
contains future outcome: False
```

That assertion looks simple, but it expresses a release requirement: no candidate model may train on fields unavailable to live scoring. A stronger pipeline also records data source, event timestamp, transformation version, allowed range, and missing-value rate for each feature.
An offline notebook might compute backlog by scanning a completed daily table. The service might read an hourly cache. Even when both columns are named origin_backlog, differences in freshness or aggregation can change predictions. This failure is training-serving skew.
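A simple way to surface this kind of disagreement is to compare the two paths on a sample of the same orders. The sketch below is illustrative only: parity_report, the 1% relative tolerance, and the sample rows (including order O-205) are assumptions, not values from a real system.

```python
import math


def parity_report(offline_rows, online_rows, feature, rel_tol=0.01):
    """Return order ids whose feature value differs between the two paths.

    Rows are paired by order_id; orders missing from the online sample are skipped.
    """
    online_by_id = {row["order_id"]: row for row in online_rows}
    mismatches = []
    for row in offline_rows:
        other = online_by_id.get(row["order_id"])
        if other is None:
            continue
        if not math.isclose(row[feature], other[feature], rel_tol=rel_tol):
            mismatches.append(row["order_id"])
    return mismatches


# Hypothetical samples: the offline value comes from a completed daily table,
# the online value from an hourly cache.
offline = [
    {"order_id": "O-204", "origin_backlog": 18.0},
    {"order_id": "O-205", "origin_backlog": 40.0},
]
online = [
    {"order_id": "O-204", "origin_backlog": 18.0},
    {"order_id": "O-205", "origin_backlog": 31.0},
]

skewed = parity_report(offline, online, "origin_backlog")
print(skewed)
```

Both columns share a name here, yet O-205 would receive a different prediction depending on which path produced its row.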
A feature platform can keep definitions versioned and obtain historical features using point-in-time correct joins. Feast documents this separation between historical retrieval for training and online retrieval for serving.[2] The tool isn't the lesson: the contract is. A model release must identify the feature definition and snapshot that produced its score.
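The idea behind a point-in-time correct join can be shown without any feature platform. The following is a plain-Python sketch, not Feast's API: the snapshots list and the backlog_as_of helper are hypothetical, and a real system would key the lookup by warehouse as well as time.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical backlog snapshots for one origin warehouse, sorted by timestamp.
snapshots = [
    (datetime(2026, 5, 1, 6, 0), 12),
    (datetime(2026, 5, 1, 8, 0), 18),
    (datetime(2026, 5, 1, 10, 0), 27),
]


def backlog_as_of(snapshots, at):
    """Return the latest snapshot value recorded at or before `at`, else None."""
    times = [ts for ts, _ in snapshots]
    idx = bisect_right(times, at)  # count of snapshots with timestamp <= at
    if idx == 0:
        return None  # nothing was known yet at that time
    return snapshots[idx - 1][1]


value = backlog_as_of(snapshots, datetime(2026, 5, 1, 9, 0))
print(value)
```

For a 09:00 prediction the lookup returns the 08:00 snapshot and never the 10:00 one, which is the property a training set needs to match what serving would have seen.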
Monitor the contract before monitoring accuracy:
| Production check | Failure it catches | Action |
|---|---|---|
| null rate by feature | upstream feed disappeared | fail closed or fallback |
| unseen-category rate | carrier catalog changed | collect labels and retrain |
| freshness lag | online values are stale | pause promotions |
| offline/online parity sample | transformations disagree | repair feature path |
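The first three checks in the table reduce to small aggregations over a sample of served feature rows. This is a rough sketch under stated assumptions: the feature_updated_at field, the KNOWN_CARRIERS set, and the sample rows are hypothetical, and alert thresholds are left to the team.

```python
from datetime import datetime

KNOWN_CARRIERS = {"northline", "eastfreight"}  # hypothetical training vocabulary


def null_rate(rows, feature):
    """Fraction of rows where the feature is absent or None."""
    return sum(1 for r in rows if r.get(feature) is None) / len(rows)


def unseen_category_rate(rows, feature, known):
    """Fraction of non-null values that were not in the training vocabulary."""
    present = [r[feature] for r in rows if r.get(feature) is not None]
    if not present:
        return 0.0
    return sum(1 for v in present if v not in known) / len(present)


def max_freshness_lag_minutes(rows, now, ts_field="feature_updated_at"):
    """Age in minutes of the stalest feature row in the sample."""
    ages = [(now - datetime.fromisoformat(r[ts_field])).total_seconds() / 60 for r in rows]
    return max(ages)


# Hypothetical sample of rows served around 09:00.
rows = [
    {"carrier": "northline", "origin_backlog": 18, "feature_updated_at": "2026-05-01T08:50:00"},
    {"carrier": "skyparcel", "origin_backlog": None, "feature_updated_at": "2026-05-01T08:30:00"},
    {"carrier": "eastfreight", "origin_backlog": 22, "feature_updated_at": "2026-05-01T08:55:00"},
    {"carrier": "northline", "origin_backlog": 9, "feature_updated_at": "2026-05-01T07:00:00"},
]
now = datetime.fromisoformat("2026-05-01T09:00:00")

print("null rate:", null_rate(rows, "origin_backlog"))
print("unseen-category rate:", unseen_category_rate(rows, "carrier", KNOWN_CARRIERS))
print("max freshness lag (min):", max_freshness_lag_minutes(rows, now))
```

None of these checks needs labels, which is why they can fire long before an accuracy metric would.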
| Evidence | A production-ready answer demonstrates |
|---|---|
| prediction contract | identifies prediction time, allowed fields, labels, and missing-value meaning |
| leakage control | proves future-only events cannot enter feature construction |
| parity plan | versions transformations and monitors online/offline disagreement |
| Symptom | Cause | Fix |
|---|---|---|
| Offline score is excellent, live accuracy collapses | future event entered features | enforce prediction timestamps and leakage gates |
| New carrier causes errors or silent zeros | categorical mapping wasn't versioned | reserve other and monitor its rate |
| Predictions shift after a data job rewrite | feature meaning changed without model release | version transformation and test parity |