Train a boosted ETA-risk baseline from tabular features, evaluate slices, and package deployment evidence.
The feature pipeline now gives you honest rows: distance, scan age, backlog, service class, and a later label saying whether delivery missed its promise. A practical first model for this kind of table is a gradient boosted tree model.
Boosted trees are not a fallback for teams that haven't reached neural networks. For structured data with numeric and categorical business signals, they are often an efficient, inspectable candidate. The production question is not which model family sounds newer. It is which candidate wins on frozen, time-aware evidence while meeting latency and operational constraints.
Suppose late = 1 means a parcel missed its promised delivery date. Before training, publish a split:
| Split | Calendar range | Purpose |
|---|---|---|
| train | January through March | fit model |
| validation | April | select trees, depth, threshold |
| test | May | final evidence after choices freeze |
A random row split would place near-identical traffic patterns from the same disruption into both train and test, which inflates test scores. A time-ordered split better represents a model facing tomorrow's shipments.
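As a minimal sketch of the split table above, each row can be assigned to a split purely by its date. The year, the `promised_date` field name, and the sample rows are assumptions made for illustration; the source only fixes the month ranges.

```python
from datetime import date

# Hypothetical boundaries matching the split table; the year is an assumption.
SPLITS = {
    "train": (date(2024, 1, 1), date(2024, 3, 31)),
    "validation": (date(2024, 4, 1), date(2024, 4, 30)),
    "test": (date(2024, 5, 1), date(2024, 5, 31)),
}

def assign_split(promised_date):
    """Return the split name for a date, or None if it falls outside every range."""
    for name, (start, end) in SPLITS.items():
        if start <= promised_date <= end:
            return name
    return None

# Made-up rows; only the date decides the split, never a random draw.
sample_rows = [
    {"id": "A", "promised_date": date(2024, 2, 14)},
    {"id": "B", "promised_date": date(2024, 4, 3)},
    {"id": "C", "promised_date": date(2024, 5, 20)},
    {"id": "D", "promised_date": date(2024, 6, 1)},
]
assignments = {row["id"]: assign_split(row["promised_date"]) for row in sample_rows}
print(assignments)
```

Rows outside every published range get `None` here, so they drop out of the experiment visibly instead of leaking into a split.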
Your first candidate can be a rule: predict late when hours_since_last_scan >= 18. It is weak, but it makes the boosted model prove its added complexity rather than receiving credit for any nonzero result.
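A rough sketch of scoring that rule follows, so the boosted model has a concrete number to beat. The validation rows are invented for illustration, and `rule_baseline` and `precision_recall` are hypothetical helper names.

```python
# Hypothetical validation rows; the values are made up for illustration.
validation = [
    {"hours_since_last_scan": 25, "late": 1},
    {"hours_since_last_scan": 19, "late": 0},
    {"hours_since_last_scan": 4, "late": 0},
    {"hours_since_last_scan": 12, "late": 1},
    {"hours_since_last_scan": 30, "late": 1},
]

def rule_baseline(row, cutoff_hours=18):
    """Predict late (1) when the last scan is at least cutoff_hours old."""
    return int(row["hours_since_last_scan"] >= cutoff_hours)

def precision_recall(rows, predict):
    """Compute precision and recall for the positive (late) class."""
    tp = sum(1 for r in rows if predict(r) == 1 and r["late"] == 1)
    fp = sum(1 for r in rows if predict(r) == 1 and r["late"] == 0)
    fn = sum(1 for r in rows if predict(r) == 0 and r["late"] == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

precision, recall = precision_recall(validation, rule_baseline)
print(f"baseline precision={precision:.2f} recall={recall:.2f}")
```

Scoring any later candidate with the same function on the same rows keeps the comparison like for like.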
A shallow decision tree partitions rows into a few rules. Gradient boosting adds shallow trees sequentially: each new tree is fit to the errors left by the earlier ensemble, so adding it moves predictions closer to the targets. Friedman describes this process as stage-wise function approximation using loss gradients.[1]
For intuition, imagine predicting delay hours rather than a binary outcome:
| Lane | Actual delay (hours) | First prediction (hours) | Residual (hours) |
|---|---|---|---|
| local standard | 2 | 6 | -4 |
| regional standard | 8 | 6 | +2 |
| cross-country economy | 20 | 6 | +14 |
A small correction tree might add little for local parcels and more for long economy routes. A learning rate applies only part of that correction, so repeated trees refine mistakes without chasing every unusual training row.
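The sketch below runs a few boosting rounds on the three lanes from the table, assuming squared-error loss, where the residual is the negative gradient. The fixed two-leaf grouping and the learning rate of 0.3 are assumptions chosen for illustration; a real booster would search for splits on features.

```python
actual = {
    "local standard": 2.0,
    "regional standard": 8.0,
    "cross-country economy": 20.0,
}
prediction = {lane: 6.0 for lane in actual}  # first prediction from the table

# Hypothetical one-split stump: shorter lanes in one leaf, the long lane in the other.
LEAVES = [["local standard", "regional standard"], ["cross-country economy"]]
LEARNING_RATE = 0.3  # assumed value; applies only part of each correction

def boost_round(current):
    """Fit each leaf to its mean residual and apply a shrunken correction."""
    updated = dict(current)
    for leaf in LEAVES:
        residuals = [actual[lane] - current[lane] for lane in leaf]
        leaf_value = sum(residuals) / len(residuals)
        for lane in leaf:
            updated[lane] = current[lane] + LEARNING_RATE * leaf_value
    return updated

def mse(current):
    """Mean squared error of the current predictions."""
    return sum((actual[lane] - current[lane]) ** 2 for lane in actual) / len(actual)

history = [mse(prediction)]
for _ in range(3):
    prediction = boost_round(prediction)
    history.append(mse(prediction))

print([round(value, 2) for value in history])
```

In the first round the cross-country leaf sees a residual of +14 and moves only 0.3 × 14 = 4.2 hours, from 6 to about 10.2. Later rounds close more of the gap without jumping straight to the training value.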
XGBoost extends tree boosting with a regularized objective, sparse-aware split handling, column blocks, and parallel techniques for scalable training.[2] For an engineer, the important artifact is still the evaluation contract: a fitted booster and its threshold must be linked to feature version, split manifest, metrics, and serving schema.
The classifier outputs a late-risk score. Operations needs a choice: intervene, notify, or leave the parcel on its normal path. A missed delay on expedited shipments may be more expensive than an unnecessary proactive notification.
```python
rows = [
    {"id": "E1", "tier": "expedited", "gold": 1, "score": 0.72},
    {"id": "E2", "tier": "expedited", "gold": 1, "score": 0.46},
    {"id": "S1", "tier": "standard", "gold": 0, "score": 0.41},
    {"id": "S2", "tier": "standard", "gold": 1, "score": 0.62},
    {"id": "S3", "tier": "standard", "gold": 0, "score": 0.21},
]

def cost_at(threshold):
    cost = 0
    expedited_misses = 0
    for row in rows:
        predict_late = row["score"] >= threshold
        if row["gold"] == 1 and not predict_late:
            cost += 150 if row["tier"] == "expedited" else 60
            expedited_misses += row["tier"] == "expedited"
        if row["gold"] == 0 and predict_late:
            cost += 10
    return cost, expedited_misses

for threshold in (0.40, 0.50, 0.70):
    cost, misses = cost_at(threshold)
    print(f"threshold={threshold:.2f} cost={cost} expedited_misses={misses}")
```

```
threshold=0.40 cost=10 expedited_misses=0
threshold=0.50 cost=150 expedited_misses=1
threshold=0.70 cost=210 expedited_misses=1
```

Here 0.40 wins this tiny validation check because it catches both expedited delays and incurs one cheap false alarm. It isn't evidence for broad deployment. It is evidence that business cost and required slices belong beside aggregate metrics.
The candidate should export:
| Artifact | Why it matters |
|---|---|
| feature_contract.json | proves column meanings and time boundary |
| split_manifest.json | proves evaluation wasn't random or leaky |
| booster.json | versioned fitted model |
| threshold_policy.json | turns score into action |
| slice_metrics.json | blocks expedited-lane misses |
| serving_schema.json | validates incoming row shape |
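One way to tie these artifacts together is a manifest that records a content hash for each file and refuses to build when any are missing. This is a sketch under assumptions: `build_manifest` is a hypothetical helper, and the placeholder bytes stand in for real file contents.

```python
import hashlib
import json

# Artifact names taken from the table above.
ARTIFACTS = [
    "feature_contract.json",
    "split_manifest.json",
    "booster.json",
    "threshold_policy.json",
    "slice_metrics.json",
    "serving_schema.json",
]

def build_manifest(artifact_bytes):
    """Map each required artifact name to the SHA-256 digest of its contents."""
    missing = [name for name in ARTIFACTS if name not in artifact_bytes]
    if missing:
        raise ValueError(f"bundle is missing artifacts: {missing}")
    return {
        name: hashlib.sha256(artifact_bytes[name]).hexdigest()
        for name in ARTIFACTS
    }

# Placeholder contents stand in for the real files in this sketch.
example = {name: json.dumps({"artifact": name}).encode() for name in ARTIFACTS}
manifest = build_manifest(example)
print(json.dumps(manifest, indent=2))
```

Because the threshold policy and booster are hashed in the same manifest, a retrained model cannot silently ship with a stale threshold.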
Early stopping on validation loss can prevent unnecessary trees from fitting noise, but it also becomes part of the training decision. Store the chosen round count and metrics. When feature distributions move later, retrain a new candidate instead of mutating production in place.
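The following sketch shows one simple patience-based stopping rule and the record it might leave behind. The loss values, the patience of 2, and the `pick_round` helper are illustrative assumptions, not the behavior of any particular library.

```python
def pick_round(validation_losses, patience=2):
    """Return (best_round, best_loss), stopping once the loss has failed
    to improve for `patience` consecutive rounds."""
    best_round, best_loss = 0, float("inf")
    for round_index, loss in enumerate(validation_losses, start=1):
        if loss < best_loss:
            best_round, best_loss = round_index, loss
        elif round_index - best_round >= patience:
            break
    return best_round, best_loss

# Hypothetical per-round validation losses.
losses = [0.61, 0.52, 0.47, 0.45, 0.46, 0.455, 0.44]
best_round, best_loss = pick_round(losses)

# Store the decision alongside the model so it can be audited later.
training_record = {
    "chosen_rounds": best_round,
    "validation_loss": best_loss,
    "patience": 2,
}
print(training_record)
```

With these numbers the rule stops before ever seeing the final 0.44, which is why the patience setting belongs in the stored record: it shaped the result as much as depth or learning rate did.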
| Evidence | A production-ready answer demonstrates |
|---|---|
| baseline discipline | compares boosted trees with a declared operational baseline on future holdout data |
| decision policy | converts calibrated risk into thresholded actions with explicit costs |
| slice safety | evaluates important routes, carriers, and shipment classes before release |
| Symptom | Cause | Fix |
|---|---|---|
| Great validation result, poor next month | random or stale split | evaluate on later shipments |
| Retraining changes interventions unexpectedly | threshold wasn't versioned with model | publish one scoring bundle |
| Average recall passes while premium parcels fail | no required-slice gate | gate expedited and critical lanes |
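A required-slice gate can be sketched as a per-slice recall check that blocks release when any named slice falls below its floor. The recall floors and the `gate` helper are hypothetical; the rows repeat the tiny validation set used for the threshold check.

```python
from collections import defaultdict

# Hypothetical recall floors per shipment tier.
REQUIRED_RECALL = {"expedited": 0.90, "standard": 0.70}

rows = [
    {"id": "E1", "tier": "expedited", "gold": 1, "score": 0.72},
    {"id": "E2", "tier": "expedited", "gold": 1, "score": 0.46},
    {"id": "S1", "tier": "standard", "gold": 0, "score": 0.41},
    {"id": "S2", "tier": "standard", "gold": 1, "score": 0.62},
    {"id": "S3", "tier": "standard", "gold": 0, "score": 0.21},
]

def recall_by_slice(rows, threshold, slice_key="tier"):
    """Recall on truly late parcels, computed separately for each slice."""
    caught = defaultdict(int)
    positives = defaultdict(int)
    for row in rows:
        if row["gold"] == 1:
            positives[row[slice_key]] += 1
            caught[row[slice_key]] += row["score"] >= threshold
    return {name: caught[name] / positives[name] for name in positives}

def gate(rows, threshold):
    """Return failing slices with their recall; an empty dict means the gate passes."""
    recalls = recall_by_slice(rows, threshold)
    return {
        name: recalls.get(name)
        for name, floor in REQUIRED_RECALL.items()
        if recalls.get(name) is None or recalls[name] < floor
    }

for threshold in (0.40, 0.50):
    print(threshold, gate(rows, threshold))
```

A required slice with no late parcels in the evaluation set is reported as a failure with `None`, since missing evidence should not count as a pass.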