Gradient Boosted Trees in Production

Train a boosted SLA-risk baseline from tabular features, evaluate slices, and package deployment evidence.

The feature pipeline now gives you honest rows: input size, heartbeat age, backlog, priority class, and a later label saying whether a job missed its SLA. The next job is to turn those rows into a useful decision: which jobs need intervention before they miss their SLA?

A gradient boosted tree classifier is a strong candidate for this kind of table. It can learn nonlinear rules without replacing the evaluation discipline from the previous chapters. A fitted model earns promotion only when it beats a declared baseline on later jobs, survives important slices, and travels with enough evidence to reproduce the decision.

Figure: Production boosted-tree evidence dashboard. April validation loss falls from 0.6805 to a minimum of 0.5643 at round 50, then rises to 0.5922 at round 160, and the threshold-cost bars select 0.15. April freezes both model size and action policy: round 50 minimizes validation loss and threshold 0.15 minimizes declared cost. Only then does May provide final evidence, where priority recall 0.944 and large-model recall 0.903 pass their support-aware gates.

Freeze the clock before training

Suppose missed_sla = 1 means a training job missed its SLA. Before training, publish a split:

Split	Calendar range	Purpose
train	January through March	fit model
validation	April	select round count and threshold
test	May	final evidence after choices freeze

A random row split could place near-identical traffic patterns from the same disruption into both train and test, which makes holdout scores look better than they are. Time order better represents a model facing tomorrow's jobs.

Time order doesn't erase every dependency. If one job incident can contribute several rows, document the split unit and keep linked rows together whenever those correlations would make holdout artificially easy.
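One way to keep linked rows together is to assign each incident to a single split based on when it was first seen. The sketch below assumes hypothetical `incident_ids` and day-of-year `timestamps` columns that the lab fixture doesn't have, and it isn't the only reasonable policy.

```python
import numpy as np

def split_by_incident(incident_ids, timestamps, train_end, valid_end):
    """Assign every row of an incident to the split of that incident's first timestamp."""
    incident_ids = np.asarray(incident_ids)
    timestamps = np.asarray(timestamps)
    assignment = np.empty(len(incident_ids), dtype=object)
    for incident in np.unique(incident_ids):
        rows = incident_ids == incident
        first_seen = timestamps[rows].min()
        if first_seen < train_end:
            assignment[rows] = "train"
        elif first_seen < valid_end:
            assignment[rows] = "validation"
        else:
            assignment[rows] = "test"
    return assignment

# Hypothetical rows: incident "b" straddles the March/April boundary (day 91).
incident_ids = ["a", "a", "b", "b", "c", "d"]
timestamps = [10, 12, 89, 92, 100, 130]
splits = split_by_incident(incident_ids, timestamps, train_end=91, valid_end=121)
print(splits.tolist())
```

Here both rows of incident "b" land in train, even though one is dated in April. A stricter variant would drop straddling incidents entirely; whichever you choose, record it in the split manifest.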

Build a deterministic fixture that follows that calendar. The generated labels depend on heartbeat age, backlog, priority class, and whether input size reaches the large-model cutoff of 900 GB. The Bernoulli draw on top of that risk supplies the noise. May also gets a slight drift term added to the logit, so the test month isn't identical to training.

build-monthly-fixture.py

import json
from hashlib import sha256

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss, roc_auc_score

rng = np.random.default_rng(11)
FEATURES = [
    "input_gb",
    "heartbeat_age_hours",
    "backlog",
    "priority",
    "large_model",
]

def sigmoid(value):
    return 1 / (1 + np.exp(-value))

def make_jobs(month, count, drift=0.0):
    input_gb = rng.integers(40, 1601, count)
    heartbeat_age = rng.integers(1, 41, count)
    backlog = rng.integers(0, 31, count)
    priority = rng.integers(0, 2, count)
    large_model = (input_gb >= 900).astype(int)
    logit = (
        -5.2
        + 0.13 * heartbeat_age
        + 0.07 * backlog
        + 0.85 * priority
        + 0.75 * large_model
        + drift
    )
    missed_sla = rng.binomial(1, sigmoid(logit))
    X = np.column_stack([input_gb, heartbeat_age, backlog, priority, large_model])
    return {"month": month, "X": X, "y": missed_sla}

train = make_jobs("Jan-Mar", 360)
valid = make_jobs("April", 140)
test = make_jobs("May", 140, drift=0.18)

for split in (train, valid, test):
    print(f"{split['month']}: rows={len(split['y'])} missed_sla={int(split['y'].sum())}")

Output

Jan-Mar: rows=360 missed_sla=119
April: rows=140 missed_sla=58
May: rows=140 missed_sla=55

The fixture is synthetic so you can rerun the whole lab. In a real training job, X must come from the point-in-time replay contract you built in Batch and Streaming Feature Pipelines. A row created after its decision timestamp doesn't belong in training.
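The point-in-time rule can be expressed as a row filter. This is a minimal sketch, assuming hypothetical `feature_observed_at` and `decision_at` timestamp columns. The real replay contract from the previous chapter may carry more fields.

```python
import numpy as np

def point_in_time_mask(feature_observed_at, decision_at):
    """True where the feature value was already observed when the decision was made."""
    return np.asarray(feature_observed_at) <= np.asarray(decision_at)

# Hypothetical timestamps in arbitrary integer units.
feature_observed_at = np.array([100, 205, 300, 410])
decision_at = np.array([120, 200, 300, 400])

keep = point_in_time_mask(feature_observed_at, decision_at)
print("keep rows:", keep.tolist())
print("dropped:", int((~keep).sum()))
```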

Publish the cheapest baseline

Your first candidate can be a rule: predict an SLA miss when heartbeat_age_hours >= 18. It's weak, but it makes the boosted model prove its added complexity rather than receiving credit for any nonzero result.

Define one cost policy before comparing models:

A missed priority job costs 150.
A missed standard job costs 60.
A false alarm costs 8.

The exact numbers will differ by product. Publishing them matters because model quality is inseparable from the action the score triggers.

score-rule-baseline.py

def confusion(y, predicted):
    return {
        "tp": int(np.sum((y == 1) & (predicted == 1))),
        "fp": int(np.sum((y == 0) & (predicted == 1))),
        "fn": int(np.sum((y == 1) & (predicted == 0))),
        "tn": int(np.sum((y == 0) & (predicted == 0))),
    }

def cost_stats(split, scores, threshold):
    predicted = (scores >= threshold).astype(int)
    priority = split["X"][:, FEATURES.index("priority")] == 1
    missed = (split["y"] == 1) & (predicted == 0)
    false_alarm = (split["y"] == 0) & (predicted == 1)
    return {
        "threshold": threshold,
        "cost": int(
            150 * np.sum(missed & priority)
            + 60 * np.sum(missed & ~priority)
            + 8 * np.sum(false_alarm)
        ),
        "priority_misses": int(np.sum(missed & priority)),
        **confusion(split["y"], predicted),
    }

heartbeat_age_index = FEATURES.index("heartbeat_age_hours")
rule_valid = (valid["X"][:, heartbeat_age_index] >= 18).astype(int)
print("rule validation:", cost_stats(valid, rule_valid.astype(float), threshold=0.50))

Output

rule validation: {'threshold': 0.5, 'cost': 1560, 'priority_misses': 8, 'tp': 48, 'fp': 30, 'fn': 10, 'tn': 52}

This baseline is intentionally simple. It already catches 48 April SLA misses. A boosted model has to improve the decision policy, rather than produce a decimal score that changes no action.

Rebuild one boosting step by hand

A shallow decision tree partitions rows into a few rules. Gradient boosting adds shallow trees sequentially: each new tree moves predictions toward errors left by the earlier ensemble. Friedman describes this process as stage-wise function approximation using loss gradients [1].

For intuition, switch from a binary outcome to delay hours:

Lane	Actual delay	First prediction	Residual
local standard	2	6	-4
regional standard	8	6	+2
large-model economy	20	6	+14

A small correction tree might add little for local jobs and more for long economy workloads. For squared-error regression, its training target is the residual actual - prediction. A learning rate applies only part of that correction, so repeated trees refine mistakes without letting one tree dominate.

apply-one-residual-correction.py

actual_delay = np.array([2.0, 8.0, 20.0])
prediction = np.array([6.0, 6.0, 6.0])
large_model = np.array([0, 0, 1])
residual = actual_delay - prediction

correction = np.where(
    large_model == 1,
    residual[large_model == 1].mean(),
    residual[large_model == 0].mean(),
)
learning_rate = 0.25
updated = prediction + learning_rate * correction

print("residuals:", residual.tolist())
print("tree correction:", correction.tolist())
print("before mae:", round(float(np.mean(np.abs(actual_delay - prediction))), 2))
print("after mae:", round(float(np.mean(np.abs(actual_delay - updated))), 2))

Output

residuals: [-4.0, 2.0, 14.0]
tree correction: [-1.0, -1.0, 14.0]
before mae: 6.67
after mae: 5.5

The correction tree predicts -1 hour for shorter workloads and 14 hours for the large-model workload. Shrinkage applies only one quarter of that proposal. One round lowers mean absolute error without pretending the first correction is perfect.

Pause and predict: If learning rate changed from 0.25 to 1.0, which job would move most? The large-model job would jump by the full 14 hours because its residual-tree leaf has the largest correction.
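To see what repeated shrinkage does, the sketch below runs ten rounds of the same correction. It assumes the tree keeps the same single split on `large_model` every round, which a real booster wouldn't: it would search for a new split each time.

```python
import numpy as np

actual_delay = np.array([2.0, 8.0, 20.0])
large_model = np.array([0, 0, 1])
prediction = np.full(3, 6.0)
learning_rate = 0.25

mae_by_round = [float(np.mean(np.abs(actual_delay - prediction)))]
for _ in range(10):
    residual = actual_delay - prediction
    # Fixed split (an assumption): one leaf per large_model value, leaf value = mean residual.
    correction = np.where(
        large_model == 1,
        residual[large_model == 1].mean(),
        residual[large_model == 0].mean(),
    )
    prediction = prediction + learning_rate * correction
    mae_by_round.append(float(np.mean(np.abs(actual_delay - prediction))))

print("mae by round:", [round(value, 2) for value in mae_by_round])
```

Error keeps falling but levels off above 2.0 hours, because this fixed split can never separate the two shorter jobs that share a leaf. Further progress needs a different split, not more rounds of the same one.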

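Classification needs one extra step. An SLA-risk classifier doesn't fit raw delay-hour residuals. With log loss, each new tree fits the negative gradient of that loss with respect to the current raw score (the log-odds). For binary log loss, that negative gradient is gold - probability at any probability. At an initial probability of 0.50, every SLA miss starts at +0.5 and every on-time job at -0.5.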

inspect-log-loss-gradient.py

gold = np.array([0.0, 1.0, 1.0])
probability = np.full(3, 0.50)
negative_gradient = gold - probability

print("initial probability:", probability.tolist())
print("gold - probability:", negative_gradient.tolist())

Output

initial probability: [0.5, 0.5, 0.5]
gold - probability: [-0.5, 0.5, 0.5]

Negative values push SLA-risk scores down. Positive values push them up. A real classification booster repeats this process across many rows and trees.
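The gold - probability form can be checked numerically. The sketch below compares it with a finite-difference derivative of binary log loss with respect to the raw score. The third row uses an arbitrary raw score of 1.5 to show the check isn't specific to probability 0.50.

```python
import numpy as np

def log_loss_from_raw(gold, raw_score):
    """Per-row binary log loss as a function of the raw (log-odds) score."""
    probability = 1.0 / (1.0 + np.exp(-raw_score))
    return -(gold * np.log(probability) + (1.0 - gold) * np.log(1.0 - probability))

gold = np.array([0.0, 1.0, 1.0])
raw_score = np.array([0.0, 0.0, 1.5])  # raw 0.0 corresponds to probability 0.50
probability = 1.0 / (1.0 + np.exp(-raw_score))

# Central finite difference of the loss with respect to the raw score.
step = 1e-6
numeric_gradient = (
    log_loss_from_raw(gold, raw_score + step) - log_loss_from_raw(gold, raw_score - step)
) / (2 * step)

print("gold - probability:", np.round(gold - probability, 4).tolist())
print("numeric negative gradient:", np.round(-numeric_gradient, 4).tolist())
```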

XGBoost extends tree boosting with a regularized objective, sparse-aware split handling, column blocks, and parallel techniques for scalable training [2]. For an engineer, the important artifact is still the evaluation contract: a fitted booster and its threshold must be linked to feature version, split manifest, metrics, and serving schema.

Diagram: feature snapshot v1 (time split), rule baseline (heartbeat age only), boosted trees (depth + rounds), and validation comparison.

Fit a booster with an explicit stopping rule

The mechanics stay the same across libraries. This lab uses scikit-learn's GradientBoostingClassifier so the code stays small. Its classifier builds an additive model stage by stage and fits trees against loss gradients. A larger production job might swap in XGBoost for its scalable training system, but the evidence contract doesn't change.

Fix tree depth and learning rate first. Train a generous maximum number of rounds on January through March, then inspect April loss after each stage.

fit-round-search.py

round_search = GradientBoostingClassifier(
    loss="log_loss",
    n_estimators=160,
    learning_rate=0.05,
    max_depth=2,
    random_state=11,
)
round_search.fit(train["X"], train["y"])
print("trained stage cap:", round_search.n_estimators_)

Output

trained stage cap: 160

choose-early-stop-round.py

validation_losses = [
    log_loss(valid["y"], probabilities[:, 1], labels=[0, 1])
    for probabilities in round_search.staged_predict_proba(valid["X"])
]
best_round = int(np.argmin(validation_losses) + 1)

print("round 1 validation loss:", round(validation_losses[0], 4))
print("best round:", best_round)
print("best validation loss:", round(validation_losses[best_round - 1], 4))
print("round 160 validation loss:", round(validation_losses[-1], 4))

Output

round 1 validation loss: 0.6805
best round: 50
best validation loss: 0.5643
round 160 validation loss: 0.5922

April loss reaches its minimum at round 50, then gets worse by round 160. More trees improved the training objective but stopped helping future jobs. That's the overfitting symptom early stopping catches.

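Some libraries expose built-in early stopping flags. scikit-learn's n_iter_no_change, for example, monitors a random validation_fraction held out of the training rows, which isn't the later-month validation this chapter requires. A production job still needs to record the exact validation slice, monitored metric, patience policy, and selected round count. Here the stopping rule is visible: choose the stage with minimum April log loss across all 160 stages, then fit the candidate with that frozen count.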
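A patience policy is one alternative to scanning every stage for the minimum. The sketch below is a hypothetical helper, not a scikit-learn API, and the loss values are made up to show how the patience setting changes the selected round.

```python
def select_round_with_patience(validation_losses, patience, min_delta=0.0):
    """Return (selected_round, stopped_round), both 1-indexed.

    Stops once `patience` consecutive rounds fail to beat the best loss by min_delta.
    """
    best_loss = float("inf")
    best_round = 0
    for round_number, loss in enumerate(validation_losses, start=1):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_round = round_number
        elif round_number - best_round >= patience:
            return best_round, round_number
    return best_round, len(validation_losses)

# Hypothetical losses with a shallow dip at round 4 and a deeper one at round 7.
losses = [0.68, 0.63, 0.60, 0.58, 0.585, 0.59, 0.57, 0.60, 0.61, 0.62]
short_patience = select_round_with_patience(losses, patience=2)
long_patience = select_round_with_patience(losses, patience=3)
print("patience=2:", short_patience)
print("patience=3:", long_patience)
```

With patience 2 the search stops at round 6 and selects round 4. With patience 3 it survives the plateau and selects round 7. That sensitivity is why the patience value belongs in the recorded evidence.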

fit-frozen-candidate.py

candidate = GradientBoostingClassifier(
    loss="log_loss",
    n_estimators=best_round,
    learning_rate=0.05,
    max_depth=2,
    random_state=11,
)
candidate.fit(train["X"], train["y"])
valid_scores = candidate.predict_proba(valid["X"])[:, 1]

print("frozen rounds:", candidate.n_estimators_)
print("rule validation auc:", round(roc_auc_score(valid["y"], rule_valid), 3))
print("boosted validation auc:", round(roc_auc_score(valid["y"], valid_scores), 3))

Output

frozen rounds: 50
rule validation auc: 0.731
boosted validation auc: 0.792

The boosted model improves April AUC from 0.731 to 0.792. Area under the receiver operating characteristic curve (AUC) summarizes ranking quality across thresholds. It doesn't choose an operations policy.
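AUC can be read as the chance that a randomly chosen SLA miss scores higher than a randomly chosen on-time job, with ties counted as half. The sketch below computes it by brute-force pairwise comparison on a tiny made-up example. `pairwise_auc` is a hypothetical helper for illustration, not a replacement for roc_auc_score.

```python
import numpy as np

def pairwise_auc(y, scores):
    """Fraction of (positive, negative) pairs ranked correctly, ties counted as half."""
    y = np.asarray(y)
    scores = np.asarray(scores, dtype=float)
    positive_scores = scores[y == 1]
    negative_scores = scores[y == 0]
    if positive_scores.size == 0 or negative_scores.size == 0:
        raise ValueError("AUC needs at least one positive and one negative row")
    # Compare every positive score with every negative score.
    difference = positive_scores[:, None] - negative_scores[None, :]
    wins = np.sum(difference > 0)
    ties = np.sum(difference == 0)
    return float((wins + 0.5 * ties) / difference.size)

y = [0, 0, 1, 1]
scores = np.array([0.1, 0.4, 0.35, 0.8])
auc_raw = pairwise_auc(y, scores)
auc_cubed = pairwise_auc(y, scores ** 3)  # order-preserving transform of the scores
print("auc:", auc_raw)
print("auc after cubing scores:", auc_cubed)
```

Cubing the scores changes every value but not their order, so AUC stays the same. Any fixed threshold would now flag different rows, which is why AUC alone can't pick the operating point.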

Measure threshold cost, not AUC alone

The classifier outputs an SLA-risk score. Operations needs a choice: intervene, notify, or leave the job on its normal path. A missed SLA on priority jobs may be more expensive than an unnecessary proactive notification.

Search candidate thresholds on April only. The sort key minimizes declared cost and uses lower threshold as tie-breaker.

choose-sla-threshold.py

thresholds = [0.15, 0.20, 0.25, 0.30, 0.40]
for threshold in thresholds:
    stats = cost_stats(valid, valid_scores, threshold)
    print(
        f"threshold={threshold:.2f} cost={stats['cost']} "
        f"priority_misses={stats['priority_misses']}"
    )

selected_threshold = min(
    thresholds,
    key=lambda threshold: (cost_stats(valid, valid_scores, threshold)["cost"], threshold),
)
print("selected threshold:", selected_threshold)

Output

threshold=0.15 cost=1316 priority_misses=6
threshold=0.20 cost=1884 priority_misses=10
threshold=0.25 cost=1912 priority_misses=10
threshold=0.30 cost=2038 priority_misses=11
threshold=0.40 cost=2604 priority_misses=14
selected threshold: 0.15

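The selected threshold is 0.15, much lower than the common default of 0.50. Failing to flag a priority SLA miss is expensive in this policy, so the chosen threshold accepts more false alarms. At 0.15 the boosted policy costs 1316 on April against the rule baseline's 1560. At every higher threshold in the grid it costs more than the rule. Because 0.15 is also the lowest value searched, this grid can't rule out a cheaper threshold below it, and a production search would extend the grid on April data before freezing. If alert fatigue mattered more, the cost table would need another term.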
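A back-of-envelope check shows why a low threshold is plausible under these costs. If a score p were a calibrated probability, flagging a job would cost (1 - p) × 8 in expectation and not flagging it would cost p × miss cost, so flagging wins when p exceeds 8 / (8 + miss cost). The sketch assumes calibration, which a boosted model doesn't guarantee, and assumes a flagged true miss costs nothing, as in cost_stats.

```python
def break_even_probability(false_alarm_cost, miss_cost):
    """Probability above which flagging has lower expected cost than not flagging."""
    return false_alarm_cost / (false_alarm_cost + miss_cost)

break_even = {
    name: break_even_probability(8, miss_cost)
    for name, miss_cost in (("priority", 150), ("standard", 60))
}
for name, value in break_even.items():
    print(f"{name}: flag when calibrated risk exceeds {value:.3f}")
```

Both break-even points sit below 0.15, which fits the lowest grid value winning on April. It's a sanity check on the cost policy, not a substitute for the validation search.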

Now freeze the round count, threshold, and cost policy. May can be opened once for final candidate evidence.

score-frozen-test-month.py

test_scores = candidate.predict_proba(test["X"])[:, 1]
test_stats = cost_stats(test, test_scores, selected_threshold)

print("test auc:", round(roc_auc_score(test["y"], test_scores), 3))
print("test policy:", test_stats)

Output

test auc: 0.882
test policy: {'threshold': 0.15, 'cost': 728, 'priority_misses': 2, 'tp': 50, 'fp': 31, 'fn': 5, 'tn': 54}

The holdout result is useful, but aggregate AUC and cost can still hide a serious workload failure.

Gate slices before promotion

Pick required slices before reading test results. A slice gate also needs enough positive cases to mean anything: perfect recall on one SLA-miss job is weak evidence, and an empty slice can't produce recall at all. This lab requires at least 10 SLA-miss jobs per slice, recall of at least 0.90 for priority jobs, and recall of at least 0.85 for large-model jobs.

check-required-slices.py

def slice_gate(split, scores, threshold, name, mask, minimum_recall, minimum_positives):
    predicted = (scores >= threshold).astype(int)
    stats = confusion(split["y"][mask], predicted[mask])
    positives = stats["tp"] + stats["fn"]
    measured_recall = stats["tp"] / positives if positives else 0.0
    enough_support = positives >= minimum_positives
    passed = enough_support and measured_recall >= minimum_recall
    result = {
        "name": name,
        "rows": int(np.sum(mask)),
        "positives": positives,
        "recall": round(measured_recall, 3),
        "minimum_recall": minimum_recall,
        "minimum_positives": minimum_positives,
        "confusion": stats,
        "passed": passed,
    }
    print(
        f"{name}: positives={positives} recall={measured_recall:.3f} "
        f"minimum_recall={minimum_recall:.2f} "
        f"minimum_positives={minimum_positives} pass={passed}"
    )
    return result

priority_index = FEATURES.index("priority")
large_model_index = FEATURES.index("large_model")
required_slice_results = [
    slice_gate(
        test,
        test_scores,
        selected_threshold,
        "priority",
        test["X"][:, priority_index] == 1,
        minimum_recall=0.90,
        minimum_positives=10,
    ),
    slice_gate(
        test,
        test_scores,
        selected_threshold,
        "large_model",
        test["X"][:, large_model_index] == 1,
        minimum_recall=0.85,
        minimum_positives=10,
    ),
]
required_slices_pass = all(result["passed"] for result in required_slice_results)
print("all required slices pass:", required_slices_pass)

Output

priority: positives=36 recall=0.944 minimum_recall=0.90 minimum_positives=10 pass=True
large_model: positives=31 recall=0.903 minimum_recall=0.85 minimum_positives=10 pass=True
all required slices pass: True

Both required slices pass for the frozen candidate. Keep the gate anyway. A future retrain, feature drift, or threshold edit can break one segment while aggregate metrics still look acceptable.
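The minimum-positives rule can be motivated with a confidence bound on recall. The sketch below uses the Wilson score interval's lower bound at roughly 95% confidence (z = 1.96). It's an illustration of why support matters, not the gate this lab applies.

```python
import math

def wilson_lower_bound(successes, trials, z=1.96):
    """Lower end of the Wilson score interval for a proportion."""
    if trials == 0:
        return 0.0
    p = successes / trials
    z2 = z * z
    centre = p + z2 / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials))
    return (centre - margin) / (1 + z2 / trials)

# (caught SLA misses, total SLA misses) for a one-job slice and the two May slices.
cases = {"one job": (1, 1), "priority": (34, 36), "large_model": (28, 31)}
bounds = {name: wilson_lower_bound(tp, positives) for name, (tp, positives) in cases.items()}
for name, bound in bounds.items():
    print(f"{name}: recall lower bound {bound:.3f}")
```

Perfect recall on a single job leaves a lower bound near 0.2, while 34 of 36 supports a much tighter claim. A team that wants stricter evidence could gate on the bound instead of the point estimate, at the price of needing more positives per slice.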

To see the failure symptom, test a careless threshold edit from 0.15 to 0.50.

reproduce-threshold-regression.py

bad_threshold = 0.50
bad_slice_results = [
    slice_gate(
        test,
        test_scores,
        bad_threshold,
        "priority",
        test["X"][:, priority_index] == 1,
        minimum_recall=0.90,
        minimum_positives=10,
    ),
    slice_gate(
        test,
        test_scores,
        bad_threshold,
        "large_model",
        test["X"][:, large_model_index] == 1,
        minimum_recall=0.85,
        minimum_positives=10,
    ),
]
print("release blocked:", not all(result["passed"] for result in bad_slice_results))

Output

priority: positives=36 recall=0.722 minimum_recall=0.90 minimum_positives=10 pass=False
large_model: positives=31 recall=0.645 minimum_recall=0.85 minimum_positives=10 pass=False
release blocked: True

Nothing about the fitted trees changed. The threshold edit alone drops priority recall from 0.944 to 0.722. Versioning only the model file would miss the regression.

Ship a model that can be operated

The candidate should export:

Artifact	Why it matters
`feature_contract.json`	proves column meanings and time boundary
`split_manifest.json`	proves evaluation wasn't random or leaky
`model.skops`	versioned fitted scikit-learn model
`threshold_policy.json`	turns score into action
`slice_metrics.json`	records required-slice support and regressions
`serving_schema.json`	validates incoming row shape

Create a compact receipt for the candidate. The real deployment bundle would serialize the fitted estimator with skops.io, inspect any untrusted types before loading it, and pin Python, scikit-learn, NumPy, SciPy, and skops versions. Loading a persisted estimator across scikit-learn versions isn't supported [3]. The bundle would also include the schema files named above.
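Version pins are easier to trust when they're captured from the training environment, not typed by hand. This is a minimal sketch using the standard library's importlib.metadata. The `runtime_versions` helper is hypothetical, and packages that aren't installed are recorded as None instead of raising.

```python
import platform
from importlib import metadata

def runtime_versions(packages):
    """Record the Python version and installed distribution versions for a receipt."""
    versions = {"python": platform.python_version()}
    for package in packages:
        try:
            versions[package] = metadata.version(package)
        except metadata.PackageNotFoundError:
            versions[package] = None  # not installed in this environment
    return versions

pinned = runtime_versions(["scikit-learn", "numpy", "scipy", "skops"])
print(pinned)
```

The resulting dictionary could sit alongside the model entry in the receipt, so a loader can refuse to deserialize when the serving environment doesn't match.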

publish-candidate-receipt.py

receipt = {
    "artifact": "sla-risk-candidate-v1",
    "feature_contract": "point-in-time-snapshot-v1",
    "model": {
        "family": "gradient_boosted_trees",
        "learning_rate": 0.05,
        "max_depth": 2,
        "rounds": best_round,
    },
    "policy": {
        "sla_risk_threshold": selected_threshold,
        "costs": {
            "false_alarm": 8,
            "standard_miss": 60,
            "priority_miss": 150,
        },
    },
    "split_manifest": {"train": "Jan-Mar", "validation": "April", "test": "May"},
    "test_auc": round(roc_auc_score(test["y"], test_scores), 3),
    "test_policy": test_stats,
    "required_slice_metrics": required_slice_results,
    "required_slices_pass": required_slices_pass,
    "release_status": "candidate_for_shadow" if required_slices_pass else "blocked",
}
payload = json.dumps(receipt, sort_keys=True)
print(json.dumps(receipt, indent=2, sort_keys=True))
print("receipt sha256:", sha256(payload.encode()).hexdigest()[:12])

Output

{
  "artifact": "sla-risk-candidate-v1",
  "feature_contract": "point-in-time-snapshot-v1",
  "model": {
    "family": "gradient_boosted_trees",
    "learning_rate": 0.05,
    "max_depth": 2,
    "rounds": 50
  },
  "policy": {
    "costs": {
      "false_alarm": 8,
      "priority_miss": 150,
      "standard_miss": 60
    },
    "sla_risk_threshold": 0.15
  },
  "release_status": "candidate_for_shadow",
  "required_slice_metrics": [
    {
      "confusion": {
        "fn": 2,
        "fp": 9,
        "tn": 29,
        "tp": 34
      },
      "minimum_positives": 10,
      "minimum_recall": 0.9,
      "name": "priority",
      "passed": true,
      "positives": 36,
      "recall": 0.944,
      "rows": 74
    },
    {
      "confusion": {
        "fn": 3,
        "fp": 18,
        "tn": 21,
        "tp": 28
      },
      "minimum_positives": 10,
      "minimum_recall": 0.85,
      "name": "large_model",
      "passed": true,
      "positives": 31,
      "recall": 0.903,
      "rows": 70
    }
  ],
  "required_slices_pass": true,
  "split_manifest": {
    "test": "May",
    "train": "Jan-Mar",
    "validation": "April"
  },
  "test_auc": 0.882,
  "test_policy": {
    "cost": 728,
    "fn": 5,
    "fp": 31,
    "priority_misses": 2,
    "threshold": 0.15,
    "tn": 54,
    "tp": 50
  }
}
receipt sha256: fd121184d2ff

The receipt binds feature semantics, fitted-tree choices, threshold policy, split dates, measured holdout policy, and support-aware slice evidence. Shadow traffic and production monitoring still come next. Don't mutate the live policy in place when retraining changes those values.
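A digest over the canonical receipt gives a cheap way to detect an in-place policy edit. The sketch below uses a trimmed, hypothetical receipt with only a few of the fields above, and shows that changing the threshold alone changes the digest.

```python
import copy
import json
from hashlib import sha256

def receipt_digest(receipt):
    """SHA-256 of the receipt serialized with sorted keys, so key order can't change it."""
    payload = json.dumps(receipt, sort_keys=True)
    return sha256(payload.encode()).hexdigest()

# Trimmed stand-in for the full receipt.
published = {
    "artifact": "sla-risk-candidate-v1",
    "model": {"rounds": 50},
    "policy": {"sla_risk_threshold": 0.15},
}
expected = receipt_digest(published)

edited = copy.deepcopy(published)
edited["policy"]["sla_risk_threshold"] = 0.50  # the careless edit from the slice-gate section

unchanged_matches = receipt_digest(published) == expected
edited_matches = receipt_digest(edited) == expected
print("unchanged matches:", unchanged_matches)
print("edited matches:", edited_matches)
```

A deployment check could compare the digest of the loaded bundle with the published one and refuse to serve on mismatch, forcing threshold changes through a new versioned receipt.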

Explain the candidate without looking back

Practice

Change drift=0.18 for May to drift=0.80. Rebuild the fixture from the first cell and rerun the lab. Compare test AUC, cost, and slice gates.
Set the false-alarm cost to 30 instead of 8. Predict whether the selected threshold should move up or down, then verify it.
Raise max_depth from 2 to 5 in both fitted models. Compare best round and April loss. Explain why more complex trees need fresh evidence.
Add a required slice for standard jobs with minimum recall 0.75. Decide whether candidate still passes.
Remove threshold from receipt. Explain why rollback is no longer reproducible.
Raise minimum_positives to 40. Confirm required slices fail closed when holdout evidence is too thin.

What strong answers show

Evidence	What a strong answer shows
baseline discipline	compares boosted trees with a declared operational baseline on future holdout data
boosting mechanics	distinguishes regression residual corrections from classification loss gradients
early stopping	records monitored validation month, metric, stage cap, and selected round
decision policy	converts risk scores into thresholded actions with explicit costs
slice safety	evaluates required slices (priority and large-model jobs) with minimum-support checks before release

When promotion breaks

Symptom	Cause	Fix
Great validation result, poor next month	random or stale split	evaluate on later jobs
Training keeps improving while April loss worsens	too many boosting rounds	stop on later validation loss and record selected round
Retraining changes interventions unexpectedly	threshold wasn't versioned with model	publish one scoring bundle
Average recall passes while priority jobs fail	no required-slice gate	gate priority and critical job classes
Slice recall looks perfect on one SLA-miss job	slice gate ignored support	require minimum positive cases or report insufficient evidence

Next Step

Continue to Ranking and Recommendation Systems

You can now promote a tabular risk model only with operational evidence. Next, predict an ordering of documents, where exposure and feedback change the data you later train on.

References

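[1] Friedman, J. H. (2001). Greedy Function Approximation: A Gradient Boosting Machine. The Annals of Statistics. https://projecteuclid.org/journals/annals-of-statistics/volume-29/issue-5/Greedy-Function-Approximation--A-Gradient-Boosting-Machine/10.1214/aos/1013203451.full

[2] Chen, T. & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. KDD 2016. https://www.kdd.org/kdd2016/papers/files/rfp0697-chenAemb.pdf

[3] scikit-learn developers (2026). Model persistence. scikit-learn User Guide. https://scikit-learn.org/stable/model_persistence.html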

Discussion

Questions and insights from fellow learners.

Discussion loads when you reach this section.

Back to Topics

LearnProduction ML SystemsGradient Boosted Trees in Production

⚙️MediumMLOps & Deployment

Gradient Boosted Trees in Production

Train a boosted SLA-risk baseline from tabular features, evaluate slices, and package deployment evidence.

14 min read

Learning path

Step 44 of 158 in the full curriculum

Batch and Streaming Feature Pipelines Ranking and Recommendation Systems

Freeze the clock before training

Suppose missed_sla = 1 means a training job missed its SLA. Before training, publish a split:

Split	Calendar range	Purpose
train	January through March	fit model
validation	April	select round count and threshold
test	May	final evidence after choices freeze

Random rows could place near-identical traffic patterns from the same disruption into train and test. Time order better represents a model facing tomorrow's jobs.

build-monthly-fixture.py

import json
from hashlib import sha256

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss, roc_auc_score

rng = np.random.default_rng(11)
FEATURES = [
    "input_gb",
    "heartbeat_age_hours",
    "backlog",
    "priority",
    "large_model",
]

def sigmoid(value):
    return 1 / (1 + np.exp(-value))

def make_jobs(month, count, drift=0.0):
    input_gb = rng.integers(40, 1601, count)
    heartbeat_age = rng.integers(1, 41, count)
    backlog = rng.integers(0, 31, count)
    priority = rng.integers(0, 2, count)
    large_model = (input_gb >= 900).astype(int)
    logit = (
        -5.2
        + 0.13 * heartbeat_age
        + 0.07 * backlog
        + 0.85 * priority
        + 0.75 * large_model
        + drift
    )
    missed_sla = rng.binomial(1, sigmoid(logit))
    X = np.column_stack([input_gb, heartbeat_age, backlog, priority, large_model])
    return {"month": month, "X": X, "y": missed_sla}

train = make_jobs("Jan-Mar", 360)
valid = make_jobs("April", 140)
test = make_jobs("May", 140, drift=0.18)

for split in (train, valid, test):
    print(f"{split['month']}: rows={len(split['y'])} missed_sla={int(split['y'].sum())}")

Output

Jan-Mar: rows=360 missed_sla=119
April: rows=140 missed_sla=58
May: rows=140 missed_sla=55

Publish the cheapest baseline

Define one cost policy before comparing models:

A missed priority job costs 150.
A missed standard job costs 60.
A false alarm costs 8.

The exact numbers will differ by product. Publishing them matters because model quality is inseparable from the action the score triggers.

score-rule-baseline.py

def confusion(y, predicted):
    return {
        "tp": int(np.sum((y == 1) & (predicted == 1))),
        "fp": int(np.sum((y == 0) & (predicted == 1))),
        "fn": int(np.sum((y == 1) & (predicted == 0))),
        "tn": int(np.sum((y == 0) & (predicted == 0))),
    }

def cost_stats(split, scores, threshold):
    predicted = (scores >= threshold).astype(int)
    priority = split["X"][:, FEATURES.index("priority")] == 1
    missed = (split["y"] == 1) & (predicted == 0)
    false_alarm = (split["y"] == 0) & (predicted == 1)
    return {
        "threshold": threshold,
        "cost": int(
            150 * np.sum(missed & priority)
            + 60 * np.sum(missed & ~priority)
            + 8 * np.sum(false_alarm)
        ),
        "priority_misses": int(np.sum(missed & priority)),
        **confusion(split["y"], predicted),
    }

heartbeat_age_index = FEATURES.index("heartbeat_age_hours")
rule_valid = (valid["X"][:, heartbeat_age_index] >= 18).astype(int)
print("rule validation:", cost_stats(valid, rule_valid.astype(float), threshold=0.50))

Output

rule validation: {'threshold': 0.5, 'cost': 1560, 'priority_misses': 8, 'tp': 48, 'fp': 30, 'fn': 10, 'tn': 52}

This baseline is intentionally simple. It already catches 48 April SLA misses. A boosted model has to improve the decision policy, rather than produce a decimal score that changes no action.

Rebuild one boosting step by hand

For intuition, switch from a binary outcome to delay hours:

Lane	Actual delay	First prediction	Residual
local standard	2	6	-4
regional standard	8	6	+2
large-model economy	20	6	+14

apply-one-residual-correction.py

actual_delay = np.array([2.0, 8.0, 20.0])
prediction = np.array([6.0, 6.0, 6.0])
large_model = np.array([0, 0, 1])
residual = actual_delay - prediction

correction = np.where(
    large_model == 1,
    residual[large_model == 1].mean(),
    residual[large_model == 0].mean(),
)
learning_rate = 0.25
updated = prediction + learning_rate * correction

print("residuals:", residual.tolist())
print("tree correction:", correction.tolist())
print("before mae:", round(float(np.mean(np.abs(actual_delay - prediction))), 2))
print("after mae:", round(float(np.mean(np.abs(actual_delay - updated))), 2))

Output

residuals: [-4.0, 2.0, 14.0]
tree correction: [-1.0, -1.0, 14.0]
before mae: 6.67
after mae: 5.5

Pause and predict: If learning rate changed from 0.25 to 1.0, which job would move most? The large-model job would jump by the full 14 hours because its residual-tree leaf has the largest correction.

inspect-log-loss-gradient.py

gold = np.array([0.0, 1.0, 1.0])
probability = np.full(3, 0.50)
negative_gradient = gold - probability

print("initial probability:", probability.tolist())
print("gold - probability:", negative_gradient.tolist())

Output

initial probability: [0.5, 0.5, 0.5]
gold - probability: [-0.5, 0.5, 0.5]

Negative values push SLA-risk scores down. Positive values push them up. A real classification booster repeats this process across many rows and trees.

Fit a booster with an explicit stopping rule

Fix tree depth and learning rate first. Train a generous maximum number of rounds on January through March, then inspect April loss after each stage.

fit-round-search.py

round_search = GradientBoostingClassifier(
    loss="log_loss",
    n_estimators=160,
    learning_rate=0.05,
    max_depth=2,
    random_state=11,
)
round_search.fit(train["X"], train["y"])
print("trained stage cap:", round_search.n_estimators_)

Output

trained stage cap: 160

choose-early-stop-round.py

validation_losses = [
    log_loss(valid["y"], probabilities[:, 1], labels=[0, 1])
    for probabilities in round_search.staged_predict_proba(valid["X"])
]
best_round = int(np.argmin(validation_losses) + 1)

print("round 1 validation loss:", round(validation_losses[0], 4))
print("best round:", best_round)
print("best validation loss:", round(validation_losses[best_round - 1], 4))
print("round 160 validation loss:", round(validation_losses[-1], 4))

Output

round 1 validation loss: 0.6805
best round: 50
best validation loss: 0.5643
round 160 validation loss: 0.5922

fit-frozen-candidate.py

candidate = GradientBoostingClassifier(
    loss="log_loss",
    n_estimators=best_round,
    learning_rate=0.05,
    max_depth=2,
    random_state=11,
)
candidate.fit(train["X"], train["y"])
valid_scores = candidate.predict_proba(valid["X"])[:, 1]

print("frozen rounds:", candidate.n_estimators_)
print("rule validation auc:", round(roc_auc_score(valid["y"], rule_valid), 3))
print("boosted validation auc:", round(roc_auc_score(valid["y"], valid_scores), 3))

Output

frozen rounds: 50
rule validation auc: 0.731
boosted validation auc: 0.792

Measure threshold cost, not AUC alone

Search candidate thresholds on April only. The sort key minimizes declared cost and uses lower threshold as tie-breaker.

choose-sla-threshold.py

thresholds = [0.15, 0.20, 0.25, 0.30, 0.40]
for threshold in thresholds:
    stats = cost_stats(valid, valid_scores, threshold)
    print(
        f"threshold={threshold:.2f} cost={stats['cost']} "
        f"priority_misses={stats['priority_misses']}"
    )

selected_threshold = min(
    thresholds,
    key=lambda threshold: (cost_stats(valid, valid_scores, threshold)["cost"], threshold),
)
print("selected threshold:", selected_threshold)

Output

threshold=0.15 cost=1316 priority_misses=6
threshold=0.20 cost=1884 priority_misses=10
threshold=0.25 cost=1912 priority_misses=10
threshold=0.30 cost=2038 priority_misses=11
threshold=0.40 cost=2604 priority_misses=14
selected threshold: 0.15
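
The April costs above contain no tie, so the tie-breaker never fires in this run. A small sketch of how the tuple key behaves when it does, using hypothetical costs with a deliberate tie (`toy_costs` is invented for illustration):

```python
# Hypothetical costs with a deliberate tie between 0.15 and 0.20.
toy_costs = {0.15: 1200, 0.20: 1200, 0.25: 1350}

# Tuples compare element by element: cost first, threshold second.
selected = min(toy_costs, key=lambda threshold: (toy_costs[threshold], threshold))
print("selected threshold:", selected)
```

Because a job is flagged when its score is at or above the threshold, the lower of two tied thresholds flags at least as many jobs.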

Now freeze the round count, threshold, and cost policy. May can be opened once for final candidate evidence.

score-frozen-test-month.py

test_scores = candidate.predict_proba(test["X"])[:, 1]
test_stats = cost_stats(test, test_scores, selected_threshold)

print("test auc:", round(roc_auc_score(test["y"], test_scores), 3))
print("test policy:", test_stats)

Output

test auc: 0.882
test policy: {'threshold': 0.15, 'cost': 728, 'priority_misses': 2, 'tp': 50, 'fp': 31, 'fn': 5, 'tn': 54}
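
The reported May cost can be re-derived from the printed counts and the per-error costs the candidate's policy declares (false alarm 8, standard miss 60, priority miss 150). The sketch below assumes a linear cost and that `priority_misses` counts a subset of the false negatives. `policy_cost` is a hypothetical helper, not the lab's `cost_stats`.

```python
# Per-error costs as declared in the candidate's policy.
COSTS = {"false_alarm": 8, "standard_miss": 60, "priority_miss": 150}


def policy_cost(fp, fn, priority_misses, costs=COSTS):
    """Assumed linear cost: false alarms plus misses split by job class."""
    standard_misses = fn - priority_misses
    return (
        fp * costs["false_alarm"]
        + standard_misses * costs["standard_miss"]
        + priority_misses * costs["priority_miss"]
    )


# May counts from the printed test policy: fp=31, fn=5, priority_misses=2.
may_cost = policy_cost(fp=31, fn=5, priority_misses=2)
print("recomputed May cost:", may_cost)
```

The three terms are 248, 180, and 300, which sum to the reported 728.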

The holdout result is useful, but aggregate AUC and cost can still hide a serious workload failure.

Gate slices before promotion

check-required-slices.py

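def slice_gate(split, scores, threshold, name, mask, minimum_recall, minimum_positives):
    """Check recall on one slice; fail closed when the slice has too few positives."""
    predicted = (scores >= threshold).astype(int)
    stats = confusion(split["y"][mask], predicted[mask])
    positives = stats["tp"] + stats["fn"]
    # With zero positives, report 0.0 instead of dividing by zero.
    measured_recall = stats["tp"] / positives if positives else 0.0
    enough_support = positives >= minimum_positives
    # Both conditions must hold: enough positive cases and enough recall.
    passed = enough_support and measured_recall >= minimum_recall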
    result = {
        "name": name,
        "rows": int(np.sum(mask)),
        "positives": positives,
        "recall": round(measured_recall, 3),
        "minimum_recall": minimum_recall,
        "minimum_positives": minimum_positives,
        "confusion": stats,
        "passed": passed,
    }
    print(
        f"{name}: positives={positives} recall={measured_recall:.3f} "
        f"minimum_recall={minimum_recall:.2f} "
        f"minimum_positives={minimum_positives} pass={passed}"
    )
    return result

priority_index = FEATURES.index("priority")
large_model_index = FEATURES.index("large_model")
required_slice_results = [
    slice_gate(
        test,
        test_scores,
        selected_threshold,
        "priority",
        test["X"][:, priority_index] == 1,
        minimum_recall=0.90,
        minimum_positives=10,
    ),
    slice_gate(
        test,
        test_scores,
        selected_threshold,
        "large_model",
        test["X"][:, large_model_index] == 1,
        minimum_recall=0.85,
        minimum_positives=10,
    ),
]
required_slices_pass = all(result["passed"] for result in required_slice_results)
print("all required slices pass:", required_slices_pass)

Output

priority: positives=36 recall=0.944 minimum_recall=0.90 minimum_positives=10 pass=True
large_model: positives=31 recall=0.903 minimum_recall=0.85 minimum_positives=10 pass=True
all required slices pass: True

Both required slices pass for the frozen candidate. Keep the gate anyway. A future retrain, feature drift, or threshold edit can break one segment while aggregate metrics still look acceptable.
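
The support check matters as much as the recall check. A minimal sketch of the same pass rule working directly on counts (`gate` is a hypothetical simplification of `slice_gate`). The first call uses the May priority counts; the thin slice is invented.

```python
def gate(tp, fn, minimum_recall, minimum_positives):
    """Pass only with enough positive cases AND enough recall."""
    positives = tp + fn
    recall = tp / positives if positives else 0.0
    return positives >= minimum_positives and recall >= minimum_recall


# May priority slice: 34 misses caught, 2 missed (36 positives).
priority_ok = gate(34, 2, minimum_recall=0.90, minimum_positives=10)
# Hypothetical thin slice: one positive, caught. Recall is 1.0 but support is 1.
thin_ok = gate(1, 0, minimum_recall=0.90, minimum_positives=10)
# Same priority counts with a stricter support floor of 40.
strict_ok = gate(34, 2, minimum_recall=0.90, minimum_positives=40)

print(priority_ok, thin_ok, strict_ok)
```

The thin slice fails despite perfect recall, and the priority slice fails once the support floor exceeds its 36 positives. That is the fail-closed behavior the gate is meant to provide.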

To see the failure symptom, test a careless threshold edit from 0.15 to 0.50.

reproduce-threshold-regression.py

bad_threshold = 0.50
bad_slice_results = [
    slice_gate(
        test,
        test_scores,
        bad_threshold,
        "priority",
        test["X"][:, priority_index] == 1,
        minimum_recall=0.90,
        minimum_positives=10,
    ),
    slice_gate(
        test,
        test_scores,
        bad_threshold,
        "large_model",
        test["X"][:, large_model_index] == 1,
        minimum_recall=0.85,
        minimum_positives=10,
    ),
]
print("release blocked:", not all(result["passed"] for result in bad_slice_results))

Output

priority: positives=36 recall=0.722 minimum_recall=0.90 minimum_positives=10 pass=False
large_model: positives=31 recall=0.645 minimum_recall=0.85 minimum_positives=10 pass=False
release blocked: True

Nothing about the fitted trees changed. The threshold edit alone drops priority recall from 0.944 to 0.722 and large-model recall from 0.903 to 0.645. Versioning only the model file would miss the regression.
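
The mechanism is easy to see on a handful of fixed scores. The sketch below uses invented scores for six jobs that all missed their SLA. The scores never change; only the threshold applied to them does.

```python
# Invented scores for six jobs that all missed their SLA (label 1).
positive_scores = [0.12, 0.18, 0.33, 0.47, 0.62, 0.91]


def recall_at(threshold, scores):
    """Share of true positives flagged when score >= threshold."""
    flagged = sum(score >= threshold for score in scores)
    return flagged / len(scores)


recall_low = round(recall_at(0.15, positive_scores), 3)
recall_high = round(recall_at(0.50, positive_scores), 3)
print("recall at 0.15:", recall_low)
print("recall at 0.50:", recall_high)
```

With these toy scores, five of six positives clear 0.15 but only two clear 0.50, so recall falls from 0.833 to 0.333 without any change to the model.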

Ship a model that can be operated

The candidate should export:

Artifact	Why it matters
`feature_contract.json`	proves column meanings and time boundary
`split_manifest.json`	proves evaluation wasn't random or leaky
`model.skops`	versioned fitted scikit-learn model
`threshold_policy.json`	turns score into action
`slice_metrics.json`	records required-slice support and regressions
`serving_schema.json`	validates incoming row shape
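
As one illustration of what a serving schema check might do, here is a minimal sketch that validates an incoming row against expected field names and types. The schema format and most field names are assumptions. Only `priority` and `large_model` appear in the lab's `FEATURES`; the other names are placeholders.

```python
# Hypothetical schema: field name -> required Python type.
# "priority" and "large_model" come from the lab; the other names are placeholders.
SERVING_SCHEMA = {
    "input_gb": float,
    "heartbeat_age_s": float,
    "backlog": int,
    "priority": int,
    "large_model": int,
}


def validate_row(row, schema=SERVING_SCHEMA):
    """Return a list of problems; an empty list means the row is acceptable."""
    problems = []
    for name, expected in schema.items():
        if name not in row:
            problems.append(f"missing field: {name}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(row[name], bool) or not isinstance(row[name], expected):
            problems.append(f"wrong type for {name}: {type(row[name]).__name__}")
    for name in row:
        if name not in schema:
            problems.append(f"unexpected field: {name}")
    return problems


good = {"input_gb": 42.0, "heartbeat_age_s": 3.5, "backlog": 7, "priority": 1, "large_model": 0}
bad = {"input_gb": "42", "backlog": 7, "priority": 1, "large_model": 0, "team": "x"}
print(validate_row(good))
print(validate_row(bad))
```

The check is deliberately strict: an integer in a float field is rejected rather than silently converted. A real schema would also need value ranges and the feature order the model expects.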

publish-candidate-receipt.py

receipt = {
    "artifact": "sla-risk-candidate-v1",
    "feature_contract": "point-in-time-snapshot-v1",
    "model": {
        "family": "gradient_boosted_trees",
        "learning_rate": 0.05,
        "max_depth": 2,
        "rounds": best_round,
    },
    "policy": {
        "sla_risk_threshold": selected_threshold,
        "costs": {
            "false_alarm": 8,
            "standard_miss": 60,
            "priority_miss": 150,
        },
    },
    "split_manifest": {"train": "Jan-Mar", "validation": "April", "test": "May"},
    "test_auc": round(roc_auc_score(test["y"], test_scores), 3),
    "test_policy": test_stats,
    "required_slice_metrics": required_slice_results,
    "required_slices_pass": required_slices_pass,
    "release_status": "candidate_for_shadow" if required_slices_pass else "blocked",
}
payload = json.dumps(receipt, sort_keys=True)
print(json.dumps(receipt, indent=2, sort_keys=True))
print("receipt sha256:", sha256(payload.encode()).hexdigest()[:12])

Output

{
  "artifact": "sla-risk-candidate-v1",
  "feature_contract": "point-in-time-snapshot-v1",
  "model": {
    "family": "gradient_boosted_trees",
    "learning_rate": 0.05,
    "max_depth": 2,
    "rounds": 50
  },
  "policy": {
    "costs": {
      "false_alarm": 8,
      "priority_miss": 150,
      "standard_miss": 60
    },
    "sla_risk_threshold": 0.15
  },
  "release_status": "candidate_for_shadow",
  "required_slice_metrics": [
    {
      "confusion": {
        "fn": 2,
        "fp": 9,
        "tn": 29,
        "tp": 34
      },
      "minimum_positives": 10,
      "minimum_recall": 0.9,
      "name": "priority",
      "passed": true,
      "positives": 36,
      "recall": 0.944,
      "rows": 74
    },
    {
      "confusion": {
        "fn": 3,
        "fp": 18,
        "tn": 21,
        "tp": 28
      },
      "minimum_positives": 10,
      "minimum_recall": 0.85,
      "name": "large_model",
      "passed": true,
      "positives": 31,
      "recall": 0.903,
      "rows": 70
    }
  ],
  "required_slices_pass": true,
  "split_manifest": {
    "test": "May",
    "train": "Jan-Mar",
    "validation": "April"
  },
  "test_auc": 0.882,
  "test_policy": {
    "cost": 728,
    "fn": 5,
    "fp": 31,
    "priority_misses": 2,
    "threshold": 0.15,
    "tn": 54,
    "tp": 50
  }
}
receipt sha256: fd121184d2ff
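
Hashing the `sort_keys=True` payload makes the digest depend on the receipt's content rather than on dictionary insertion order. A small sketch with toy receipts (`receipt_digest` is a hypothetical helper that mirrors the two lines above but keeps the full digest):

```python
import json
from hashlib import sha256


def receipt_digest(receipt):
    """Hash the canonical (sorted-key) JSON form of a receipt."""
    payload = json.dumps(receipt, sort_keys=True)
    return sha256(payload.encode()).hexdigest()


# Toy receipts: same content in a different key order, then an edited threshold.
first = {"rounds": 50, "sla_risk_threshold": 0.15}
second = {"sla_risk_threshold": 0.15, "rounds": 50}
edited = {"rounds": 50, "sla_risk_threshold": 0.50}

same_content_matches = receipt_digest(first) == receipt_digest(second)
edit_detected = receipt_digest(first) != receipt_digest(edited)
print(same_content_matches, edit_detected)
```

Reordering keys leaves the digest unchanged, while a threshold edit changes it. A 12-character prefix is convenient to print; storing the full digest alongside it is a safer basis for comparison.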

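Explain the candidate without looking back

Why keep a rule baseline after training a boosted model?
Why can a higher-AUC candidate still fail promotion?
Why isn't the regression residual table the full story for an SLA-risk classifier?
What makes a fitted booster deployable rather than a notebook result?
Why should a slice gate record positive-case count as well as recall?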

Practice

Change May's drift=0.18 to drift=0.80. Rebuild the fixture from the first cell and rerun the lab. Compare test AUC, cost, and slice gates.
Raise the false-alarm cost from 8 to 30. Predict whether the selected threshold should move up or down, then verify it.
Raise max_depth from 2 to 5 in both fitted models. Compare the best round and April loss. Explain why more complex trees need fresh evidence.
Add a required slice for standard jobs with minimum recall 0.75. Decide whether the candidate still passes.
Remove the threshold from the receipt. Explain why rollback is no longer reproducible.
Raise minimum_positives to 40. Confirm that the required slices fail closed when holdout evidence is too thin.

What strong answers show

Evidence	What a strong answer shows
baseline discipline	compares boosted trees with a declared operational baseline on future holdout data
boosting mechanics	distinguishes regression residual corrections from classification loss gradients
early stopping	records monitored validation month, metric, stage cap, and selected round
decision policy	converts risk scores into thresholded actions with explicit costs
slice safety	evaluates required workload and job-class slices with support-aware gates before release

When promotion breaks

Symptom	Cause	Fix
Great validation result, poor next month	random or stale split	evaluate on later jobs
Training keeps improving while April loss worsens	too many boosting rounds	stop on later validation loss and record selected round
Retraining changes interventions unexpectedly	threshold wasn't versioned with model	publish one scoring bundle
Average recall passes while priority jobs fail	no required-slice gate	gate priority and critical job classes
Slice recall looks perfect on one SLA-miss job	slice gate ignored support	require minimum positive cases or report insufficient evidence

Next Step

Continue to Ranking and Recommendation Systems

You can now promote a tabular risk model only with operational evidence. Next, predict an ordering of documents, where exposure and feedback change the data you later train on.
