Model access-change review with decision trees from scratch: compute impurity, test a non-perfect stump on held-out cases, compare forests and boosting, and audit feature explanations.
Logistic regression scored access-change request REQ-10234, then converted that score into a human-review decision using asymmetric costs. A tree answers the same routing question with conditional rules: high ambiguity usually calls for review, unless evidence for automatic handling under policy P-7 changes the case.
We stay with that access-review setting. Eight historical requests form the training data; eight later requests form the validation data. The split into training, validation, and an eventual test set matters because a tree can memorize historical exceptions and still fail on the next batch.
Each access-change request has two scaled evidence features and one target:
| Column | Meaning |
|---|---|
| ambiguity | Higher value means the request text and retrieved evidence conflict more strongly. |
| auto_policy_support | Higher value means policy P-7 supports automatic permission handling. |
| needs_review | Target: 1 means route to a human before permission change; 0 means automatic processing is acceptable. |
The fitting example uses these eight historical requests:
| Request | ambiguity | auto_policy_support | needs_review |
|---|---|---|---|
| R1 | 1.0 | 1 | 0 |
| R2 | 1.5 | 2 | 0 |
| R3 | 2.0 | 3 | 1 |
| R4 | 3.5 | 6 | 1 |
| R5 | 4.0 | 7 | 1 |
| R6 | 4.5 | 5 | 1 |
| R7 | 2.8 | 4 | 0 |
| R8 | 3.2 | 8 | 0 |
Four requests require review and four can be automatic. No single obvious feature perfectly separates them: R3 needs review despite modest ambiguity, while R8 remains safe for automation despite higher ambiguity because the policy evidence is strong.
A decision tree repeatedly splits rows into leaves. A useful split makes each leaf more uniform in the target.
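For a binary leaf with review proportion $p$, Gini impurity is:

$$G(p) = 1 - p^2 - (1 - p)^2 = 2p(1 - p)$$

A pure leaf has $G = 0$. A leaf evenly split between classes has $G = 0.5$.

An alternative is entropy:

$$H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)$$
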
Both metrics reward purer child leaves. A split's impurity decrease is the parent impurity minus weighted child impurity. When entropy is the impurity metric, this decrease is called information gain. Below we use Gini decrease, or Gini gain, because its arithmetic is compact.
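At the root, four of eight rows need review:

$$G_{\text{root}} = 1 - 0.5^2 - 0.5^2 = 0.5$$

Suppose the tree tests ambiguity <= 3.35:

- Left leaf (ambiguity <= 3.35): labels [0, 0, 1, 0, 0], so $p = 0.2$ and $G = 1 - 0.2^2 - 0.8^2 = 0.32$.
- Right leaf (ambiguity > 3.35): labels [1, 1, 1], so $p = 1$ and $G = 0$.

The weighted child impurity is $(5/8)(0.32) + (3/8)(0) = 0.20$, so the Gini gain is $0.50 - 0.20 = 0.30$.

That split is helpful, but it isn't perfect: R3 remains a required-review request inside the mostly automatic leaf.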
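
The same split can also be scored with entropy, which gives information gain rather than Gini gain. The sketch below is a minimal standard-library illustration, not part of the lesson's own listings; the helper name and the hard-coded leaf labels for ambiguity <= 3.35 are assumptions taken from the training table.

```python
import math

def entropy(labels):
    """Binary entropy in bits; an empty or pure leaf has entropy 0."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# needs_review labels for R1..R8, then the two leaves of ambiguity <= 3.35.
root = [0, 0, 1, 1, 1, 1, 0, 0]
left = [0, 0, 1, 0, 0]
right = [1, 1, 1]

# Information gain = parent entropy minus size-weighted child entropy.
weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(root)
information_gain = entropy(root) - weighted

print(f"root entropy={entropy(root):.3f}")
print(f"left entropy={entropy(left):.3f} right entropy={entropy(right):.3f}")
print(f"weighted={weighted:.3f} information gain={information_gain:.3f}")
```

Under these assumptions the information gain is about 0.549 bits. The number differs from the Gini gain because the two metrics use different scales, but both reward the same purer right leaf.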

```python
# impurity-by-hand-in-code.py
import numpy as np

def gini(labels):
    counts = np.bincount(labels, minlength=2)
    probs = counts / counts.sum()
    return float(1.0 - np.sum(probs**2))

def entropy(labels):
    counts = np.bincount(labels, minlength=2)
    # Drop empty classes so log2(0) is never evaluated.
    probs = counts[counts > 0] / counts.sum()
    return float(-np.sum(probs * np.log2(probs)))

root = np.array([0, 0, 1, 1, 1, 1, 0, 0])
left = np.array([0, 0, 1, 0, 0])
right = np.array([1, 1, 1])
weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(root)

print(f"root gini={gini(root):.3f} entropy={entropy(root):.3f}")
print(f"left gini={gini(left):.3f} right gini={gini(right):.3f}")
print(f"weighted={weighted:.3f} gain={gini(root) - weighted:.3f}")
```

```text
root gini=0.500 entropy=1.000
left gini=0.320 right gini=0.000
weighted=0.200 gain=0.300
```

A decision stump is a tree with one split. For each numeric feature, the stump can place a threshold between adjacent observed values. The training algorithm scores each candidate using impurity decrease and keeps the best one.
Testing midpoints matters. A threshold such as 3.35 represents the boundary between observed values 3.2 and 3.5; it avoids inventing an unnecessary boundary through an actual training row.

```python
# rank-splits.py
import numpy as np

def gini(labels):
    if len(labels) == 0:
        return 0.0
    counts = np.bincount(labels, minlength=2)
    probs = counts / counts.sum()
    return float(1.0 - np.sum(probs**2))

def split_gain(values, labels, threshold):
    left = labels[values <= threshold]
    right = labels[values > threshold]
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
    return gini(labels) - weighted

X = np.array([
    [1.0, 1],
    [1.5, 2],
    [2.0, 3],
    [3.5, 6],
    [4.0, 7],
    [4.5, 5],
    [2.8, 4],
    [3.2, 8],
])
y = np.array([0, 0, 1, 1, 1, 1, 0, 0])
names = ["ambiguity", "auto_policy_support"]

for column, name in enumerate(names):
    values = X[:, column]
    unique = np.sort(np.unique(values))
    # Candidate thresholds are midpoints between adjacent observed values.
    thresholds = (unique[:-1] + unique[1:]) / 2
    scored = [(threshold, split_gain(values, y, threshold)) for threshold in thresholds]
    best_threshold, best_gain = max(scored, key=lambda item: item[1])
    print(f"{name:14} best threshold={best_threshold:.2f} gain={best_gain:.3f}")
```

```text
ambiguity      best threshold=3.35 gain=0.300
auto_policy_support best threshold=2.50 gain=0.167
```

The best stump rule is:

$$\text{ambiguity} \le 3.35 \;\Rightarrow\; \text{left leaf}, \qquad \text{ambiguity} > 3.35 \;\Rightarrow\; \text{right leaf}$$

The left leaf predicts review probability $1/5 = 0.20$. The right leaf predicts $3/3 = 1.00$.
The tree produces a score before a decision. A score of 0.20 isn't automatically "safe": the routing threshold must reflect the harm of an automatic permission change that needed review.
This implementation searches every feature and midpoint, stores leaf probabilities, then scores the training rows. That's the core mechanic hidden inside a library tree.

```python
# decision-stump-from-scratch.py
import numpy as np

def gini(labels):
    if len(labels) == 0:
        return 0.0
    p_review = labels.mean()
    return float(1.0 - p_review**2 - (1.0 - p_review) ** 2)

def candidate_thresholds(values):
    unique = np.sort(np.unique(values))
    return (unique[:-1] + unique[1:]) / 2

def fit_stump(X, y):
    root_impurity = gini(y)
    best = None
    for feature in range(X.shape[1]):
        for threshold in candidate_thresholds(X[:, feature]):
            left_mask = X[:, feature] <= threshold
            right_mask = ~left_mask
            # mask.mean() is the fraction of rows that land in each child.
            weighted = (
                left_mask.mean() * gini(y[left_mask])
                + right_mask.mean() * gini(y[right_mask])
            )
            gain = root_impurity - weighted
            if best is None or gain > best["gain"]:
                best = {
                    "feature": feature,
                    "threshold": float(threshold),
                    "gain": float(gain),
                    "left_probability": float(y[left_mask].mean()),
                    "right_probability": float(y[right_mask].mean()),
                }
    return best

def predict_proba(stump, X):
    goes_left = X[:, stump["feature"]] <= stump["threshold"]
    return np.where(
        goes_left,
        stump["left_probability"],
        stump["right_probability"],
    )

X_train = np.array([
    [1.0, 1],
    [1.5, 2],
    [2.0, 3],
    [3.5, 6],
    [4.0, 7],
    [4.5, 5],
    [2.8, 4],
    [3.2, 8],
])
y_train = np.array([0, 0, 1, 1, 1, 1, 0, 0])
names = ["ambiguity", "auto_policy_support"]

stump = fit_stump(X_train, y_train)
scores = predict_proba(stump, X_train)
predictions = (scores >= 0.50).astype(int)

print(
    f"rule: {names[stump['feature']]} <= {stump['threshold']:.2f}; "
    f"gain={stump['gain']:.3f}"
)
print(
    f"leaf probabilities: left={stump['left_probability']:.2f}, "
    f"right={stump['right_probability']:.2f}"
)
print("training predictions:", predictions.tolist())
print(f"training accuracy={np.mean(predictions == y_train):.3f}")
```

```text
rule: ambiguity <= 3.35; gain=0.300
leaf probabilities: left=0.20, right=1.00
training predictions: [0, 0, 0, 1, 1, 1, 0, 0]
training accuracy=0.875
```

The stump misses R3. A deeper tree can create another branch to repair that training mistake, but a repaired training label alone gives no evidence that the extra branch generalizes.
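
To see what such a repair would involve, the sketch below scores every midpoint split inside the stump's left leaf (the five training rows with ambiguity <= 3.35). It is an illustrative standard-library check rather than part of the lesson's listings; the row tuples and helper names are assumptions copied from the training table.

```python
# Rows in the stump's left leaf: (request, ambiguity, auto_policy_support, needs_review).
LEFT_LEAF = [
    ("R1", 1.0, 1, 0),
    ("R2", 1.5, 2, 0),
    ("R3", 2.0, 3, 1),
    ("R7", 2.8, 4, 0),
    ("R8", 3.2, 8, 0),
]
FEATURES = {"ambiguity": 1, "auto_policy_support": 2}

def gini(labels):
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 1.0 - p * p - (1.0 - p) ** 2

labels = [row[3] for row in LEFT_LEAF]
parent = gini(labels)

results = []  # (feature name, threshold, gain, weighted child impurity)
for name, column in FEATURES.items():
    values = sorted({row[column] for row in LEFT_LEAF})
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [row[3] for row in LEFT_LEAF if row[column] <= threshold]
        right = [row[3] for row in LEFT_LEAF if row[column] > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        results.append((name, threshold, parent - weighted, weighted))

for name, threshold, gain, weighted in results:
    print(f"{name} <= {threshold:.2f}: gain={gain:.3f} child impurity={weighted:.3f}")

best_gain = max(gain for _, _, gain, _ in results)
any_pure = any(weighted < 1e-12 for _, _, _, weighted in results)
print(f"best gain={best_gain:.3f} any split with two pure children: {any_pure}")
```

On these five rows no single extra split isolates R3, because it sits in the middle of the leaf on both features. A tree would need two further splits that carve out a one-row leaf, which is exactly the kind of branch that should be judged on held-out requests.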
We now freeze the fitted stump and score later requests that were not used to choose its split:
| Request | ambiguity | auto_policy_support | needs_review |
|---|---|---|---|
| V1 | 1.2 | 2 | 0 |
| V2 | 2.1 | 4 | 0 |
| V3 | 2.4 | 3 | 1 |
| V4 | 3.0 | 5 | 0 |
| V5 | 3.6 | 4 | 1 |
| V6 | 3.2 | 7 | 0 |
| V7 | 4.2 | 5 | 1 |
| V8 | 2.8 | 6 | 1 |
The operational costs match the classification lesson:
| Error | Routing mistake | Cost |
|---|---|---|
| False negative | Automatically approve an access change that needed human review. | $120 |
| False positive | Send an otherwise safe request for unnecessary review. | $18 |
A 0.50 threshold is not, by definition, the choice that minimizes cost or risk. We compare candidate thresholds on validation data, and would then confirm the selected policy on a fresh test period before deployment.
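
The cost table implies a break-even score, sketched below. The sketch assumes the model's score is a calibrated probability that the request needs review, which a leaf estimated from five rows does not guarantee, so it is a sanity check rather than a substitute for the validation comparison. The constant and function names are illustrative.

```python
COST_FALSE_NEGATIVE = 120.0  # automatic approval of a request that needed review
COST_FALSE_POSITIVE = 18.0   # unnecessary human review of a safe request

def expected_costs(p_review):
    """Expected cost of each action if p_review is a calibrated probability."""
    automate = p_review * COST_FALSE_NEGATIVE
    review = (1.0 - p_review) * COST_FALSE_POSITIVE
    return automate, review

# Review is cheaper when p * C_fn >= (1 - p) * C_fp, i.e. p >= C_fp / (C_fp + C_fn).
break_even = COST_FALSE_POSITIVE / (COST_FALSE_POSITIVE + COST_FALSE_NEGATIVE)
print(f"break-even score={break_even:.3f}")

for score in (0.05, 0.20, 1.00):
    automate, review = expected_costs(score)
    action = "review" if review <= automate else "automate"
    print(f"score={score:.2f} automate=${automate:.2f} review=${review:.2f} -> {action}")
```

Under that calibration assumption the break-even score is about 0.13, so a leaf score of 0.20 already favors review: automating carries an expected cost of $24.00 against $14.40 for reviewing.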

```python
# evaluate-stump-on-validation.py
import numpy as np

def metrics(y_true, y_pred):
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # Routing cost: $18 per false positive, $120 per false negative.
    cost = 18 * fp + 120 * fn
    return precision, recall, f1, cost, fp, fn

X_valid = np.array([
    [1.2, 2],
    [2.1, 4],
    [2.4, 3],
    [3.0, 5],
    [3.6, 4],
    [3.2, 7],
    [4.2, 5],
    [2.8, 6],
])
y_valid = np.array([0, 0, 1, 0, 1, 0, 1, 1])
# Frozen stump: ambiguity <= 3.35 scores 0.20, otherwise 1.00.
probabilities = np.where(X_valid[:, 0] <= 3.35, 0.20, 1.00)

for threshold in [0.20, 0.50]:
    predicted = (probabilities >= threshold).astype(int)
    precision, recall, f1, cost, fp, fn = metrics(y_valid, predicted)
    print(
        f"threshold={threshold:.2f} precision={precision:.3f} "
        f"recall={recall:.3f} f1={f1:.3f} fp={fp} fn={fn} cost=${cost}"
    )
```

```text
threshold=0.20 precision=0.500 recall=1.000 f1=0.667 fp=4 fn=0 cost=$72
threshold=0.50 precision=1.000 recall=0.500 f1=0.667 fp=0 fn=2 cost=$240
```

Both thresholds achieve the same F1 score here. They differ sharply in business cost: the lower threshold sends more requests to review but avoids expensive false negatives on this validation batch.
A one-split tree is easy to audit but often underfits. An unconstrained tree can isolate every historical row. That becomes overfitting when the tree fits historical rows too specifically and transfers worse to new requests. The right question is whether extra depth improves held-out behavior, not whether it reaches perfect training accuracy.

```python
# depth-versus-validation.py
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.tree import DecisionTreeClassifier

X_train = np.array([
    [1.0, 1],
    [1.5, 2],
    [2.0, 3],
    [3.5, 6],
    [4.0, 7],
    [4.5, 5],
    [2.8, 4],
    [3.2, 8],
])
y_train = np.array([0, 0, 1, 1, 1, 1, 0, 0])
X_valid = np.array([
    [1.2, 2],
    [2.1, 4],
    [2.4, 3],
    [3.0, 5],
    [3.6, 4],
    [3.2, 7],
    [4.2, 5],
    [2.8, 6],
])
y_valid = np.array([0, 0, 1, 0, 1, 0, 1, 1])

def report(name, tree):
    tree.fit(X_train, y_train)
    train_accuracy = tree.score(X_train, y_train)
    probabilities = tree.predict_proba(X_valid)[:, 1]
    predicted = (probabilities >= 0.50).astype(int)
    fp = int(np.sum((y_valid == 0) & (predicted == 1)))
    fn = int(np.sum((y_valid == 1) & (predicted == 0)))
    cost = 18 * fp + 120 * fn
    print(
        f"{name}: train_accuracy={train_accuracy:.3f} "
        f"val_f1={f1_score(y_valid, predicted):.3f} "
        f"val_auc={roc_auc_score(y_valid, probabilities):.3f} cost=${cost}"
    )

report("stump", DecisionTreeClassifier(max_depth=1, random_state=0))
report("deep tree", DecisionTreeClassifier(random_state=0))
```

```text
stump: train_accuracy=0.875 val_f1=0.667 val_auc=0.750 cost=$240
deep tree: train_accuracy=1.000 val_f1=0.571 val_auc=0.625 cost=$258
```

In this small slice, the deep tree fixes every training mistake and performs worse on validation. Real projects tune constraints such as max_depth, min_samples_leaf, and pruning using validation or cross-validation rather than a single training score.[1][2]
A random forest trains many trees on perturbed views of the training set. Each tree receives a bootstrap sample of rows and, at each split, a random subset of candidate features. Averaging their class-probability scores reduces the instability of one fully grown tree while preserving nonlinear splits.[3]
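
The mechanism can be sketched from scratch with bagged stumps. This is a simplified illustration under stated assumptions, not scikit-learn's implementation: each "tree" is a single split, and the random feature subset has size one, which for a one-split tree is the same as drawing it per split. All helper names are hypothetical.

```python
import random

# Training rows from the table above: (ambiguity, auto_policy_support, needs_review).
TRAIN = [
    (1.0, 1, 0), (1.5, 2, 0), (2.0, 3, 1), (3.5, 6, 1),
    (4.0, 7, 1), (4.5, 5, 1), (2.8, 4, 0), (3.2, 8, 0),
]

def gini(labels):
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 1.0 - p * p - (1.0 - p) ** 2

def fit_stump(rows, feature):
    """Best Gini split on one feature; a constant leaf if no split helps."""
    labels = [row[2] for row in rows]
    rate = sum(labels) / len(labels)
    # Fallback: threshold=inf sends every row left, predicting the sample rate.
    best = {"feature": feature, "threshold": float("inf"),
            "left": rate, "right": rate, "gain": 0.0}
    values = sorted({row[feature] for row in rows})
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [row[2] for row in rows if row[feature] <= threshold]
        right = [row[2] for row in rows if row[feature] > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        gain = gini(labels) - weighted
        if gain > best["gain"]:
            best = {"feature": feature, "threshold": threshold,
                    "left": sum(left) / len(left),
                    "right": sum(right) / len(right), "gain": gain}
    return best

def predict(stump, row):
    goes_left = row[stump["feature"]] <= stump["threshold"]
    return stump["left"] if goes_left else stump["right"]

def fit_bagged_stumps(rows, n_trees=200, seed=0):
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]  # bootstrap: draw rows with replacement
        feature = rng.randrange(2)                 # random feature for this stump's split
        stumps.append(fit_stump(sample, feature))
    return stumps

def forest_score(stumps, row):
    """Average the per-stump review probabilities."""
    return sum(predict(stump, row) for stump in stumps) / len(stumps)

stumps = fit_bagged_stumps(TRAIN)
v1_score = forest_score(stumps, (1.2, 2))  # validation request V1
v7_score = forest_score(stumps, (4.2, 5))  # validation request V7
print(f"V1 score={v1_score:.3f} V7 score={v7_score:.3f}")
```

Because every stump sees a different resample and sometimes a different feature, the averaged score takes many intermediate values instead of the two leaf values a single stump can emit.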

```python
# random-forest-validation.py
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

X_train = np.array([
    [1.0, 1],
    [1.5, 2],
    [2.0, 3],
    [3.5, 6],
    [4.0, 7],
    [4.5, 5],
    [2.8, 4],
    [3.2, 8],
])
y_train = np.array([0, 0, 1, 1, 1, 1, 0, 0])
X_valid = np.array([
    [1.2, 2],
    [2.1, 4],
    [2.4, 3],
    [3.0, 5],
    [3.6, 4],
    [3.2, 7],
    [4.2, 5],
    [2.8, 6],
])
y_valid = np.array([0, 0, 1, 0, 1, 0, 1, 1])
names = ["ambiguity", "auto_policy_support"]

forest = RandomForestClassifier(
    n_estimators=200,
    max_depth=2,
    random_state=0,
)
forest.fit(X_train, y_train)
probabilities = forest.predict_proba(X_valid)[:, 1]

print("validation probabilities:", np.round(probabilities, 3).tolist())
for threshold in [0.20, 0.50]:
    predicted = (probabilities >= threshold).astype(int)
    fp = int(np.sum((y_valid == 0) & (predicted == 1)))
    fn = int(np.sum((y_valid == 1) & (predicted == 0)))
    cost = 18 * fp + 120 * fn
    print(
        f"threshold={threshold:.2f} "
        f"f1={f1_score(y_valid, predicted):.3f} cost=${cost}"
    )
print("MDI importance:", dict(zip(names, np.round(forest.feature_importances_, 3))))
```

```text
validation probabilities: [0.149, 0.464, 0.433, 0.463, 0.653, 0.588, 0.879, 0.496]
threshold=0.20 f1=0.727 cost=$54
threshold=0.50 f1=0.571 cost=$258
MDI importance: {'ambiguity': np.float64(0.521), 'auto_policy_support': np.float64(0.479)}
```

The forest emits more varied scores than the stump. As before, its threshold belongs to the routing policy and must be evaluated with the same costs and fresh evidence.
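
With varied scores, the two thresholds tried above are only a sample of the options. The sketch below sweeps every distinct validation score as a candidate threshold. It hard-codes the rounded forest probabilities printed above as an assumption, so it runs without scikit-learn; the function name is illustrative.

```python
# Rounded forest scores for V1..V8, copied from the printed output above.
probabilities = [0.149, 0.464, 0.433, 0.463, 0.653, 0.588, 0.879, 0.496]
y_valid = [0, 0, 1, 0, 1, 0, 1, 1]

def routing_cost(threshold):
    """Cost of routing to review every request scored at or above threshold."""
    fp = sum(1 for p, y in zip(probabilities, y_valid) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probabilities, y_valid) if p < threshold and y == 1)
    return 18 * fp + 120 * fn, fp, fn

sweep = []
for threshold in sorted(set(probabilities)):
    cost, fp, fn = routing_cost(threshold)
    sweep.append((threshold, cost, fp, fn))
    print(f"threshold={threshold:.3f} fp={fp} fn={fn} cost=${cost}")

best_threshold, best_cost, _, _ = min(sweep, key=lambda row: row[1])
print(f"lowest validation cost=${best_cost} at threshold={best_threshold:.3f}")
```

On these eight rows the cheapest candidate is the threshold at 0.433, with a cost of $54. A threshold picked this way is tuned to the validation batch, so its cost estimate is optimistic until it is confirmed on a later period.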
Gradient boosting builds trees sequentially. Each new tree learns a correction to predictions made so far. For squared-error regression, that correction target is the residual $r_i = y_i - \hat{y}_i$; Friedman generalized the idea to differentiable losses through negative gradients.[4]
Before returning to binary routing, consider a side task: predict review-handling minutes for three access changes. Their ambiguity values are [1, 2, 5], and actual handling times are [10, 20, 60] minutes.

```python
# one-boosting-correction.py
import numpy as np

ambiguity = np.array([1.0, 2.0, 5.0])
minutes = np.array([10.0, 20.0, 60.0])
# Round 0: predict the mean for every row, then fit the residuals.
prediction = np.full(len(minutes), minutes.mean())
residual = minutes - prediction

threshold = 3.5
goes_left = ambiguity <= threshold
correction = np.where(
    goes_left,
    residual[goes_left].mean(),
    residual[~goes_left].mean(),
)
learning_rate = 0.10
updated = prediction + learning_rate * correction

print("base prediction:", np.round(prediction, 1).tolist())
print("residual targets:", np.round(residual, 1).tolist())
print("stump correction:", np.round(correction, 1).tolist())
print("updated prediction:", np.round(updated, 1).tolist())
```

```text
base prediction: [30.0, 30.0, 30.0]
residual targets: [-20.0, -10.0, 30.0]
stump correction: [-15.0, -15.0, 30.0]
updated prediction: [28.5, 28.5, 33.0]
```

Shrinkage reduces the impact of any single correction. More rounds can accumulate useful structure, but too many rounds or overly complex trees can still overfit.

```python
# boosting-several-rounds.py
import numpy as np

def best_residual_stump(x, residual):
    values = np.sort(np.unique(x))
    thresholds = (values[:-1] + values[1:]) / 2
    best = None
    for threshold in thresholds:
        left = x <= threshold
        # Each leaf predicts the mean residual of its rows.
        prediction = np.where(
            left,
            residual[left].mean(),
            residual[~left].mean(),
        )
        squared_error = float(np.mean((residual - prediction) ** 2))
        if best is None or squared_error < best["error"]:
            best = {
                "threshold": float(threshold),
                "prediction": prediction,
                "error": squared_error,
            }
    return best

x = np.array([1.0, 2.0, 5.0])
y = np.array([10.0, 20.0, 60.0])
prediction = np.full(len(y), y.mean())
learning_rate = 0.10

print(f"round=0 predictions={np.round(prediction, 2).tolist()} mse={np.mean((y - prediction) ** 2):.2f}")
for round_number in range(1, 5):
    residual = y - prediction
    stump = best_residual_stump(x, residual)
    prediction = prediction + learning_rate * stump["prediction"]
    mse = np.mean((y - prediction) ** 2)
    print(
        f"round={round_number} threshold={stump['threshold']:.2f} "
        f"predictions={np.round(prediction, 2).tolist()} mse={mse:.2f}"
    )
```

```text
round=0 predictions=[30.0, 30.0, 30.0] mse=466.67
round=1 threshold=3.50 predictions=[28.5, 28.5, 33.0] mse=381.17
round=2 threshold=3.50 predictions=[27.15, 27.15, 35.7] mse=311.91
round=3 threshold=3.50 predictions=[25.94, 25.94, 38.13] mse=255.82
round=4 threshold=3.50 predictions=[24.84, 24.84, 40.32] mse=210.38
```

For needs_review, a classification boosting model uses a classification loss rather than raw minute residuals. The operational comparison remains the same: score the held-out requests, apply a candidate routing threshold, and count costly mistakes.
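
For log loss, the negative gradient with respect to the log-odds is the label minus the current predicted probability, so the pseudo-residual plays the role the minute residual played above. The sketch below runs one such round by hand on the training rows. It is a simplified illustration: the leaf value is the plain mean of the pseudo-residuals, whereas library implementations typically apply a second-order leaf update, and the fixed split at 3.35 is borrowed from the stump rather than searched for.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

ambiguity = [1.0, 1.5, 2.0, 3.5, 4.0, 4.5, 2.8, 3.2]
needs_review = [0, 0, 1, 1, 1, 1, 0, 0]

# Round 0: a constant log-odds equal to the logit of the base review rate.
base_rate = sum(needs_review) / len(needs_review)
log_odds = [math.log(base_rate / (1.0 - base_rate))] * len(needs_review)
probabilities = [sigmoid(z) for z in log_odds]

# Negative gradient of log loss with respect to the log-odds: y - p.
residual = [y - p for y, p in zip(needs_review, probabilities)]

threshold = 3.35       # assumed split, reused from the stump
learning_rate = 0.10
left = [r for x, r in zip(ambiguity, residual) if x <= threshold]
right = [r for x, r in zip(ambiguity, residual) if x > threshold]
left_value = sum(left) / len(left)     # simplified leaf value: mean pseudo-residual
right_value = sum(right) / len(right)

updated_log_odds = [
    z + learning_rate * (left_value if x <= threshold else right_value)
    for z, x in zip(log_odds, ambiguity)
]
updated = [sigmoid(z) for z in updated_log_odds]

print("pseudo-residuals:", [round(r, 2) for r in residual])
print(f"leaf values: left={left_value:.2f} right={right_value:.2f}")
print("updated probabilities:", [round(p, 4) for p in updated])
```

One shrunken round moves the left-leaf rows from 0.5 to roughly 0.4925 and the right-leaf rows to roughly 0.5125, which shows why many rounds are needed before scores separate.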

```python
# gradient-boosting-validation.py
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score, roc_auc_score

X_train = np.array([
    [1.0, 1],
    [1.5, 2],
    [2.0, 3],
    [3.5, 6],
    [4.0, 7],
    [4.5, 5],
    [2.8, 4],
    [3.2, 8],
])
y_train = np.array([0, 0, 1, 1, 1, 1, 0, 0])
X_valid = np.array([
    [1.2, 2],
    [2.1, 4],
    [2.4, 3],
    [3.0, 5],
    [3.6, 4],
    [3.2, 7],
    [4.2, 5],
    [2.8, 6],
])
y_valid = np.array([0, 0, 1, 0, 1, 0, 1, 1])

boosted = GradientBoostingClassifier(
    n_estimators=20,
    max_depth=1,
    learning_rate=0.10,
    random_state=0,
)
boosted.fit(X_train, y_train)
probabilities = boosted.predict_proba(X_valid)[:, 1]

print("validation probabilities:", np.round(probabilities, 3).tolist())
for threshold in [0.20, 0.50]:
    predicted = (probabilities >= threshold).astype(int)
    fp = int(np.sum((y_valid == 0) & (predicted == 1)))
    fn = int(np.sum((y_valid == 1) & (predicted == 0)))
    cost = 18 * fp + 120 * fn
    print(
        f"threshold={threshold:.2f} f1={f1_score(y_valid, predicted):.3f} "
        f"auc={roc_auc_score(y_valid, probabilities):.3f} cost=${cost}"
    )
```

```text
validation probabilities: [0.17, 0.379, 0.379, 0.379, 0.879, 0.379, 0.879, 0.379]
threshold=0.20 f1=0.727 auc=0.812 cost=$54
threshold=0.50 f1=0.667 auc=0.812 cost=$240
```

On this validation slice, boosting ranks review cases better than the stump (AUC 0.812 versus 0.750) but still needs a cost-aware threshold. With eight rows, treat that result as a mechanics exercise and a reason to gather more evaluation data, not as a production winner.
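Tree libraries often report mean decrease in impurity (MDI): how much the fitted splits on each feature reduced impurity. MDI is fast, but it is computed from the training data, so it describes the trained forest rather than held-out behavior, and it can over-credit features that the split search favors, such as features with many distinct values.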
Permutation importance asks a held-out question: how much does the chosen metric drop when one feature is shuffled? It's often more useful for validation diagnostics, though correlated features can share or mask importance.[2]

```python
# validation-permutation-importance.py
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X_train = np.array([
    [1.0, 1],
    [1.5, 2],
    [2.0, 3],
    [3.5, 6],
    [4.0, 7],
    [4.5, 5],
    [2.8, 4],
    [3.2, 8],
])
y_train = np.array([0, 0, 1, 1, 1, 1, 0, 0])
X_valid = np.array([
    [1.2, 2],
    [2.1, 4],
    [2.4, 3],
    [3.0, 5],
    [3.6, 4],
    [3.2, 7],
    [4.2, 5],
    [2.8, 6],
])
y_valid = np.array([0, 0, 1, 0, 1, 0, 1, 1])
names = ["ambiguity", "auto_policy_support"]

forest = RandomForestClassifier(
    n_estimators=200,
    max_depth=2,
    random_state=0,
).fit(X_train, y_train)
# Shuffle one validation column at a time and measure the drop in F1.
importance = permutation_importance(
    forest,
    X_valid,
    y_valid,
    scoring="f1",
    n_repeats=30,
    random_state=0,
)

for index, name in enumerate(names):
    print(
        f"{name}: mdi={forest.feature_importances_[index]:.3f} "
        f"permutation_mean={importance.importances_mean[index]:.3f} "
        f"permutation_std={importance.importances_std[index]:.3f}"
    )
```

```text
ambiguity: mdi=0.521 permutation_mean=0.126 permutation_std=0.237
auto_policy_support: mdi=0.479 permutation_mean=0.023 permutation_std=0.133
```

The wide permutation uncertainty is useful information: eight validation rows are too few for confident claims about which evidence field drives general behavior. Here, scoring="f1" uses the forest's default class predictions. A deployment audit should also define a scorer for documented routing cost at the selected threshold.
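
A cost-based version of that check can be sketched without any library. The example below applies it to the frozen stump rather than the forest, with the routing threshold fixed at 0.50, and reports how much the validation cost rises when one column is shuffled. The function names, the repeat count, and the seed are illustrative assumptions.

```python
import random

# Validation rows: (ambiguity, auto_policy_support, needs_review).
VALID = [
    (1.2, 2, 0), (2.1, 4, 0), (2.4, 3, 1), (3.0, 5, 0),
    (3.6, 4, 1), (3.2, 7, 0), (4.2, 5, 1), (2.8, 6, 1),
]

def stump_score(ambiguity, auto_policy_support):
    """Frozen stump from the training rows; it ignores policy support."""
    return 0.20 if ambiguity <= 3.35 else 1.00

def routing_cost(rows, threshold=0.50):
    cost = 0
    for ambiguity, policy, label in rows:
        flagged = stump_score(ambiguity, policy) >= threshold
        if flagged and label == 0:
            cost += 18    # false positive: unnecessary review
        elif not flagged and label == 1:
            cost += 120   # false negative: missed review
    return cost

def permutation_cost_increase(rows, column, n_repeats=30, seed=0):
    """Mean rise in routing cost after shuffling one feature column."""
    rng = random.Random(seed)
    baseline = routing_cost(rows)
    increases = []
    for _ in range(n_repeats):
        shuffled = [row[column] for row in rows]
        rng.shuffle(shuffled)
        permuted = [
            row[:column] + (value,) + row[column + 1:]
            for row, value in zip(rows, shuffled)
        ]
        increases.append(routing_cost(permuted) - baseline)
    return sum(increases) / n_repeats

baseline_cost = routing_cost(VALID)
ambiguity_increase = permutation_cost_increase(VALID, column=0)
policy_increase = permutation_cost_increase(VALID, column=1)
print(f"baseline cost=${baseline_cost}")
print(f"ambiguity shuffled: mean cost increase=${ambiguity_increase:.2f}")
print(f"auto_policy_support shuffled: mean cost increase=${policy_increase:.2f}")
```

Shuffling auto_policy_support changes nothing because the stump never reads it, while shuffling ambiguity raises the cost. Expressing importance in dollars ties the diagnostic to the routing policy instead of to F1.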
SHAP represents a model prediction as a baseline value plus additive feature contributions.[5] For our one-feature stump, using the training rows as background, the calculation is exact without any library:

$$f(x) = \phi_0 + \phi_{\text{ambiguity}} + \phi_{\text{policy}}, \qquad \phi_0 = \frac{5(0.20) + 3(1.00)}{8} = 0.50, \qquad \phi_{\text{ambiguity}} = f(x) - \phi_0$$

auto_policy_support contributes 0.00 because this stump never split on it.

```python
# exact-stump-attribution.py
# Baseline: mean stump prediction over the eight training rows.
base_score = 0.50
examples = {
    "R5 high ambiguity": 1.00,
    "R3 missed exception": 0.20,
}

for name, prediction in examples.items():
    ambiguity_contribution = prediction - base_score
    policy_contribution = 0.00
    rebuilt = base_score + ambiguity_contribution + policy_contribution
    print(
        f"{name}: base={base_score:.2f} ambiguity={ambiguity_contribution:+.2f} "
        f"policy={policy_contribution:+.2f} prediction={rebuilt:.2f}"
    )
```

```text
R5 high ambiguity: base=0.50 ambiguity=+0.50 policy=+0.00 prediction=1.00
R3 missed exception: base=0.50 ambiguity=-0.30 policy=+0.00 prediction=0.20
```

R3 shows the limit. The explanation faithfully reports why the stump scored the request low; it doesn't prove that low ambiguity made automatic processing correct. Explanations audit model behavior, while labels, policy review, and held-out monitoring audit decision quality.
A tree model earns deployment consideration only after the same later-request evaluation used for simpler baselines. Start with a regularized tree or ensemble, compare it against logistic regression under identical splits and costs, and inspect whether the added flexibility repairs meaningful failures.
| Model symptom | What to inspect | Practical response |
|---|---|---|
| Stump misses costly review exceptions | False negatives and affected request slices | Add evidence features or evaluate a constrained ensemble. |
| Deep tree reaches perfect training accuracy | Validation cost and leaf sample counts | Limit depth, increase minimum leaf size, or prune. |
| Forest scores improve but threshold fails | Cost curve and calibration on later requests | Select threshold on validation evidence, then retest on a later period. |
| Boosting degrades after new policy wording | Feature distributions and error slices | Retrain or revise features after confirming policy drift. |
| Importance changes sharply across batches | Permutation uncertainty and correlations | Treat explanations as diagnostics, not causal conclusions. |
Trees give transparent conditional splits; forests reduce single-tree instability; boosting accumulates targeted corrections. None of those mechanisms replaces held-out evaluation.
- … 1. Recompute the best stump split and explain why policy support may become more useful.
- … photo_damage_score, and extend fit_stump to consider it. Report whether validation cost improves.
- … max_depth and min_samples_leaf for the forest using only the validation data above. Then write down what separate data you would reserve for an unbiased final check.
- … $120 to $40. Recompute threshold choices for stump, forest, and boosting.

gini impurity · entropy and information gain · Gini gain / impurity decrease · decision stump · axis-aligned splits · overfitting in trees · bagging vs boosting · gradient boosting residual fitting · feature importance · SHAP values · validation-based model comparison

You're ready to continue when you can:

… 0.50 despite asymmetric errors. Fix: sweep thresholds on validation data under documented cost assumptions.

[1] The Elements of Statistical Learning. Hastie, T., Tibshirani, R., Friedman, J. · 2009
[2] Scikit-learn: Machine Learning in Python. Pedregosa, F., et al. · 2011 · JMLR
[3] Random Forests. Breiman, L. · 2001 · Machine Learning
[4] Greedy Function Approximation: A Gradient Boosting Machine. Friedman, J. H. · 2001 · The Annals of Statistics
[5] A Unified Approach to Interpreting Model Predictions. Lundberg, S. M., Lee, S. I. · 2017 · NeurIPS