LeetLLM
LearnTracksPracticeBlog
LeetLLM

Your go-to resource for mastering AI & LLM systems.

Product

  • Learn
  • Tracks
  • Practice
  • Blog
  • RSS

Legal

  • Terms of Service
  • Privacy Policy

© 2026 LeetLLM. All rights reserved.

All Topics
Your Progress
0%

0 of 158 articles completed

🛠️Computing Foundations0/9
Git, Shell, Linux for AIDocker for Reproducible AIPython for AI EngineeringNumPy and Tensor ShapesCUDA for ML TrainingMPS & Metal for ML on MacData Structures for AISQL and Data ModelingAlgorithms for ML Engineers
📊Math & Statistics0/8
Gradients and BackpropVectors, Matrices & TensorsLinear Algebra for MLAdam, Momentum, SchedulersProbability for Machine LearningStatistics and UncertaintyDistributions and SamplingHypothesis Tests, Intervals, and pass@k
📚Preparation & Prerequisites0/13
Neural Networks from ScratchCNNs from ScratchTraining & BackpropagationSoftmax, Cross-Entropy & OptimizationRNNs, LSTMs, GRUs, and Sequence ModelingAutoencoders and VAEsThe Transformer Architecture End-to-EndLanguage Modeling & Next TokensFrom GPT to Modern LLMsPrompt Engineering FundamentalsCalling LLM APIs in ProductionFirst AI App End-to-EndThe LLM Lifecycle
🧮ML Algorithms & Evaluation0/11
Linear Regression from ScratchLogistic Regression and MetricsDecision Trees, Forests, and BoostingReinforcement Learning BasicsValidation and LeakageClustering and PCACore Retrieval AlgorithmsDecoding AlgorithmsExperiment Design and A/B TestingPyTorch Training LoopsDataset Pipelines and Data Quality
📦Production ML Systems0/6
Feature Engineering for Production MLBatch and Streaming Feature PipelinesGradient Boosted Trees in ProductionRanking and Recommendation SystemsForecasting and Anomaly DetectionMonitoring Predictive Models
🧪Core LLM Foundations0/8
The Bitter Lesson & ComputeBPE, WordPiece, and SentencePieceStatic to Contextual EmbeddingsPerplexity & Model EvaluationFile Ingestion for AIChunking StrategiesLLM Benchmarks & LimitationsInstruction Tuning & Chat Templates
🧰Applied LLM Engineering0/23
Dimensionality Reduction for EmbeddingsCoT, ToT & Self-Consistency PromptingFunction Calling & Tool UseMCP & Tool Protocol StandardsPrompt Injection DefenseResponsible AI GovernanceData Labeling and Human FeedbackEvaluating AI AgentsProduction RAG PipelinesHybrid Search: Dense + SparseReranking and Cross-Encoders for RAGRAG Evaluation for Reliable AnswersLLM-as-a-Judge EvaluationBias & Fairness in LLMsHallucination Detection & MitigationLLM Observability & MonitoringExperiment Tracking with MLflow and W&BMixed Precision TrainingModel Versioning & DeploymentSemantic Caching & Cost OptimizationLLM Cost Engineering & Token EconomicsModel Gateways, Routing, and FallbacksDesign an Automated Support Agent
🎓Portfolio Capstones0/9
Capstone: Delivery ETA PredictionCapstone: Product RankingCapstone: Demand ForecastingCapstone: Image Damage ClassifierCapstone: Production ML PipelineCapstone: Document QACapstone: Eval DashboardCapstone: Fine-Tuned ClassifierCapstone: Production Agent
🧠Transformer Deep Dives0/8
Sentence Embeddings & Contrastive LossEmbedding Similarity & QuantizationScaled Dot-Product AttentionVision Transformers and Image EncodersPositional Encoding: RoPE & ALiBiLayer Normalization: Pre-LN vs Post-LNMechanistic InterpretabilityDecoding Strategies: Greedy to Nucleus
🧬Advanced Training & Adaptation0/16
Scaling Laws & Compute-Optimal TrainingPre-training Data at ScaleBuild GPT from Scratch LabContinued Pretraining for Domain ShiftSynthetic Data PipelinesSupervised Fine-Tuning PipelineDistributed Training: FSDP & ZeROLoRA & Parameter-Efficient TuningReward Modeling from Preference DataRLHF & DPO AlignmentConstitutional AI & Red TeamingRLVR & Verifiable RewardsKnowledge Distillation for LLMsModel Merging and Weight InterpolationPrompt Optimization with DSPyRecursive Language Models (RLM)
🤖Advanced Agents & Retrieval0/14
Vector DB Internals: HNSW & IVFAdvanced RAG: HyDE & Self-RAGGraphRAG & Knowledge GraphsRAG Security & Access ControlStructured Output GenerationReAct & Plan-and-ExecuteGuardrails & Safety FiltersCode Generation & SandboxingComputer-Use / GUI / Browser AgentsHuman-in-the-Loop Agent ArchitectureAI Coding Workflow with AgentsAgent Memory & PersistenceAgent Failure & RecoveryMulti-Agent Orchestration
⚡Inference & Production Scale0/20
Inference: TTFT, TPS & KV CacheMulti-Query & Grouped-Query AttentionKV Cache & PagedAttentionPrefix Caching and Prompt CachingFlashAttention & Memory EfficiencyContinuous Batching & SchedulingScaling LLM InferenceModel Parallelism for LLM InferenceModel Quantization: GPTQ, AWQ & GGUFLocal LLM DeploymentSLM Specialization & Edge DeploymentSpeculative DecodingLong Context Window ManagementContext EngineeringMixture of Experts ArchitectureMamba & State Space ModelsReasoning & Test-Time ComputeAdvanced MLOps & DevOps for AIGPU Serving & AutoscalingA/B Testing for LLMs
🏗️System Design Capstones0/9
Content Moderation SystemCode Completion SystemMulti-Tenant LLM PlatformLLM-Powered Search EngineVision-Language Models & CLIPMultimodal LLM ArchitectureDiffusion Models: Images & TextReal-Time Voice AI AgentReasoning & Test-Time Compute
🎤AI Lab Interviewing0/4
AI Lab Coding Interview: Python SystemsAI Lab System Design InterviewAI Lab Behavioral InterviewAI Lab Technical Presentation
Back to Topics
LearnML Algorithms & EvaluationDecision Trees, Forests, and Boosting
📊MediumEvaluation & Benchmarks

Decision Trees, Forests, and Boosting

Model access-change review with decision trees from scratch: compute impurity, test a non-perfect stump on held-out cases, compare forests and boosting, and audit feature explanations.

17 min read
Learning path
Step 33 of 158 in the full curriculum
Logistic Regression and MetricsReinforcement Learning Basics

Logistic regression scored access-change request REQ-10234, then converted that score into a human-review decision using asymmetric costs. A tree answers the same routing question with conditional rules: high ambiguity may usually require review, unless evidence for automatic handling under policy P-7 changes the case.

That access-review setting stays with us. Eight historical requests form training data; eight later requests form validation data. That train, validation, and eventual test split matters because a tree can memorize historical exceptions while failing on the next batch.

Two-dimensional access-request feature plane split at ambiguity 3.35, with five requests in a mixed 20-percent-review leaf, three in a pure review leaf, and R3 highlighted as the stump exception. Two-dimensional access-request feature plane split at ambiguity 3.35, with five requests in a mixed 20-percent-review leaf, three in a pure review leaf, and R3 highlighted as the stump exception.
The split sends all three high-ambiguity requests to a pure review leaf. Its five-row left leaf still contains R3, making the stump's limitation visible before richer trees or ensembles are considered.

One Access-Review Table, Now With Nonlinear Rules

Each access-change request has two scaled evidence features:

FeatureMeaning
ambiguityHigher value means the request text and retrieved evidence conflict more strongly.
auto_policy_supportHigher value means policy P-7 supports automatic permission handling.
needs_reviewTarget: 1 means route to a human before permission change; 0 means automatic processing is acceptable.

The fitting example uses these eight historical requests:

Requestambiguityauto_policy_supportneeds_review
R11.010
R21.520
R32.031
R43.561
R54.071
R64.551
R72.840
R83.280

Four requests require review and four can be automatic. No single obvious feature perfectly separates them: R3 needs review despite modest ambiguity, while R8 remains safe for automation despite higher ambiguity because the policy evidence is strong.

Impurity measures mixed leaves

A decision tree repeatedly splits rows into leaves. A useful split makes each leaf more uniform in the target.

For a binary leaf with review proportion ppp, Gini impurity is:

G=1−p2−(1−p)2G = 1 - p^2 - (1-p)^2G=1−p2−(1−p)2

A pure leaf has G=0G=0G=0. A leaf evenly split between classes has G=0.5G=0.5G=0.5.

An alternative is entropy:

H=−plog⁡2(p)−(1−p)log⁡2(1−p)H = -p\log_2(p) - (1-p)\log_2(1-p)H=−plog2​(p)−(1−p)log2​(1−p)

Both metrics reward purer child leaves. A split's impurity decrease is the parent impurity minus weighted child impurity. When entropy is the impurity metric, this decrease is called information gain. Below we use Gini decrease, or Gini gain, because its arithmetic is compact.

At the root, four of eight rows need review:

Groot=1−(4/8)2−(4/8)2=0.500G_{\text{root}} = 1 - (4/8)^2 - (4/8)^2 = 0.500Groot​=1−(4/8)2−(4/8)2=0.500

Suppose the tree tests ambiguity <= 3.35:

  • Left leaf: labels [0, 0, 1, 0, 0], so Gleft=0.320G_{\text{left}} = 0.320Gleft​=0.320.
  • Right leaf: labels [1, 1, 1], so Gright=0.000G_{\text{right}} = 0.000Gright​=0.000.
  • Weighted impurity: (5/8)(0.320)+(3/8)(0.000)=0.200(5/8)(0.320) + (3/8)(0.000) = 0.200(5/8)(0.320)+(3/8)(0.000)=0.200.
  • Gini gain: 0.500−0.200=0.3000.500 - 0.200 = 0.3000.500−0.200=0.300.

That split is helpful, but it isn't perfect: R3 remains a required-review request inside the mostly automatic leaf.

impurity-by-hand-in-code.py
1# impurity-by-hand-in-code.py 2import numpy as np 3 4def gini(labels): 5 counts = np.bincount(labels, minlength=2) 6 probs = counts / counts.sum() 7 return float(1.0 - np.sum(probs**2)) 8 9def entropy(labels): 10 counts = np.bincount(labels, minlength=2) 11 probs = counts[counts > 0] / counts.sum() 12 return float(-np.sum(probs * np.log2(probs))) 13 14root = np.array([0, 0, 1, 1, 1, 1, 0, 0]) 15left = np.array([0, 0, 1, 0, 0]) 16right = np.array([1, 1, 1]) 17weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(root) 18 19print(f"root gini={gini(root):.3f} entropy={entropy(root):.3f}") 20print(f"left gini={gini(left):.3f} right gini={gini(right):.3f}") 21print(f"weighted={weighted:.3f} gain={gini(root) - weighted:.3f}")
Output
1root gini=0.500 entropy=1.000 2left gini=0.320 right gini=0.000 3weighted=0.200 gain=0.300

A stump tries every midpoint

A decision stump is a tree with one split. For each numeric feature, the stump can place a threshold between adjacent observed values. The training algorithm scores each candidate using impurity decrease and keeps the best one.

Testing midpoints matters. A threshold such as 3.35 represents the boundary between observed values 3.2 and 3.5; it avoids inventing an unnecessary boundary through an actual training row.

rank-splits.py
1# rank-splits.py 2import numpy as np 3 4def gini(labels): 5 if len(labels) == 0: 6 return 0.0 7 counts = np.bincount(labels, minlength=2) 8 probs = counts / counts.sum() 9 return float(1.0 - np.sum(probs**2)) 10 11def split_gain(values, labels, threshold): 12 left = labels[values <= threshold] 13 right = labels[values > threshold] 14 weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels) 15 return gini(labels) - weighted 16 17X = np.array([ 18 [1.0, 1], 19 [1.5, 2], 20 [2.0, 3], 21 [3.5, 6], 22 [4.0, 7], 23 [4.5, 5], 24 [2.8, 4], 25 [3.2, 8], 26]) 27y = np.array([0, 0, 1, 1, 1, 1, 0, 0]) 28names = ["ambiguity", "auto_policy_support"] 29 30for column, name in enumerate(names): 31 values = X[:, column] 32 unique = np.sort(np.unique(values)) 33 thresholds = (unique[:-1] + unique[1:]) / 2 34 scored = [(threshold, split_gain(values, y, threshold)) for threshold in thresholds] 35 best_threshold, best_gain = max(scored, key=lambda item: item[1]) 36 print(f"{name:14} best threshold={best_threshold:.2f} gain={best_gain:.3f}")
Output
1ambiguity best threshold=3.35 gain=0.300 2auto_policy_support best threshold=2.50 gain=0.167

The best stump rule is:

ambiguity≤3.35\texttt{ambiguity} \le 3.35ambiguity≤3.35

The left leaf predicts review probability 1/5=0.201/5 = 0.201/5=0.20. The right leaf predicts 3/3=1.003/3 = 1.003/3=1.00.

Diagram showing Eight historical access requests root Gini = 0.500, ambiguity <= 3.35?, yes: 5 rows, and automatic leaf p(review) = 0.20. Diagram showing Eight historical access requests root Gini = 0.500, ambiguity <= 3.35?, yes: 5 rows, and automatic leaf p(review) = 0.20.
Eight historical access requests root Gini = 0.500, ambiguity <= 3.35?, yes: 5 rows, and automatic leaf p(review) = 0.20.

The tree produces a score before a decision. A score of 0.20 isn't automatically "safe": the routing threshold must reflect the harm of an automatic permission change that needed review.

Build the Stump in NumPy

This implementation searches every feature and midpoint, stores leaf probabilities, then scores the training rows. That's the core mechanic hidden inside a library tree.

decision-stump-from-scratch.py
1# decision-stump-from-scratch.py 2import numpy as np 3 4def gini(labels): 5 if len(labels) == 0: 6 return 0.0 7 p_review = labels.mean() 8 return float(1.0 - p_review**2 - (1.0 - p_review) ** 2) 9 10def candidate_thresholds(values): 11 unique = np.sort(np.unique(values)) 12 return (unique[:-1] + unique[1:]) / 2 13 14def fit_stump(X, y): 15 root_impurity = gini(y) 16 best = None 17 for feature in range(X.shape[1]): 18 for threshold in candidate_thresholds(X[:, feature]): 19 left_mask = X[:, feature] <= threshold 20 right_mask = ~left_mask 21 weighted = ( 22 left_mask.mean() * gini(y[left_mask]) 23 + right_mask.mean() * gini(y[right_mask]) 24 ) 25 gain = root_impurity - weighted 26 if best is None or gain > best["gain"]: 27 best = { 28 "feature": feature, 29 "threshold": float(threshold), 30 "gain": float(gain), 31 "left_probability": float(y[left_mask].mean()), 32 "right_probability": float(y[right_mask].mean()), 33 } 34 return best 35 36def predict_proba(stump, X): 37 goes_left = X[:, stump["feature"]] <= stump["threshold"] 38 return np.where( 39 goes_left, 40 stump["left_probability"], 41 stump["right_probability"], 42 ) 43 44X_train = np.array([ 45 [1.0, 1], 46 [1.5, 2], 47 [2.0, 3], 48 [3.5, 6], 49 [4.0, 7], 50 [4.5, 5], 51 [2.8, 4], 52 [3.2, 8], 53]) 54y_train = np.array([0, 0, 1, 1, 1, 1, 0, 0]) 55names = ["ambiguity", "auto_policy_support"] 56 57stump = fit_stump(X_train, y_train) 58scores = predict_proba(stump, X_train) 59predictions = (scores >= 0.50).astype(int) 60 61print( 62 f"rule: {names[stump['feature']]} <= {stump['threshold']:.2f}; " 63 f"gain={stump['gain']:.3f}" 64) 65print( 66 f"leaf probabilities: left={stump['left_probability']:.2f}, " 67 f"right={stump['right_probability']:.2f}" 68) 69print("training predictions:", predictions.tolist()) 70print(f"training accuracy={np.mean(predictions == y_train):.3f}")
Output
1rule: ambiguity <= 3.35; gain=0.300 2leaf probabilities: left=0.20, right=1.00 3training predictions: [0, 0, 0, 1, 1, 1, 0, 0] 4training accuracy=0.875

The stump misses R3. A deeper tree can create another branch to repair that training mistake, but a repaired training label alone gives no evidence that the extra branch generalizes.

Validation Turns a Fitted Rule Into Evidence

We now freeze the fitted stump and score later requests that were not used to choose its split:

Requestambiguityauto_policy_supportneeds_review
V11.220
V22.140
V32.431
V43.050
V53.641
V63.270
V74.251
V82.861

The operational costs match the classification lesson:

ErrorRouting mistakeCost
False negativeAutomatically approve an access change that needed human review.$120
False positiveSend an otherwise safe request for unnecessary review.$18

A 0.50 threshold minimizes neither cost nor risk by definition. We compare candidate thresholds on validation data, then would confirm a selected policy on a fresh test period before deployment.

evaluate-stump-on-validation.py
1# evaluate-stump-on-validation.py 2import numpy as np 3 4def metrics(y_true, y_pred): 5 tp = int(np.sum((y_true == 1) & (y_pred == 1))) 6 fp = int(np.sum((y_true == 0) & (y_pred == 1))) 7 fn = int(np.sum((y_true == 1) & (y_pred == 0))) 8 precision = tp / (tp + fp) if tp + fp else 0.0 9 recall = tp / (tp + fn) if tp + fn else 0.0 10 f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0 11 cost = 18 * fp + 120 * fn 12 return precision, recall, f1, cost, fp, fn 13 14X_valid = np.array([ 15 [1.2, 2], 16 [2.1, 4], 17 [2.4, 3], 18 [3.0, 5], 19 [3.6, 4], 20 [3.2, 7], 21 [4.2, 5], 22 [2.8, 6], 23]) 24y_valid = np.array([0, 0, 1, 0, 1, 0, 1, 1]) 25probabilities = np.where(X_valid[:, 0] <= 3.35, 0.20, 1.00) 26 27for threshold in [0.20, 0.50]: 28 predicted = (probabilities >= threshold).astype(int) 29 precision, recall, f1, cost, fp, fn = metrics(y_valid, predicted) 30 print( 31 f"threshold={threshold:.2f} precision={precision:.3f} " 32 f"recall={recall:.3f} f1={f1:.3f} fp={fp} fn={fn} cost=${cost}" 33 )
Output
1threshold=0.20 precision=0.500 recall=1.000 f1=0.667 fp=4 fn=0 cost=$72 2threshold=0.50 precision=1.000 recall=0.500 f1=0.667 fp=0 fn=2 cost=$240

Both thresholds achieve the same F1 score here. They differ sharply in business cost: the lower threshold sends more requests to review but avoids expensive false negatives on this validation batch.

Depth Can Memorize the Exception

A one-split tree is easy to audit but often underfits. An unconstrained tree can isolate every historical row. That becomes overfitting when the tree fits historical rows too specifically and transfers worse to new requests. The right question is whether extra depth improves held-out behavior, not whether it reaches perfect training accuracy.

depth-versus-validation.py
1# depth-versus-validation.py 2import numpy as np 3from sklearn.metrics import f1_score, roc_auc_score 4from sklearn.tree import DecisionTreeClassifier 5 6X_train = np.array([ 7 [1.0, 1], 8 [1.5, 2], 9 [2.0, 3], 10 [3.5, 6], 11 [4.0, 7], 12 [4.5, 5], 13 [2.8, 4], 14 [3.2, 8], 15]) 16y_train = np.array([0, 0, 1, 1, 1, 1, 0, 0]) 17X_valid = np.array([ 18 [1.2, 2], 19 [2.1, 4], 20 [2.4, 3], 21 [3.0, 5], 22 [3.6, 4], 23 [3.2, 7], 24 [4.2, 5], 25 [2.8, 6], 26]) 27y_valid = np.array([0, 0, 1, 0, 1, 0, 1, 1]) 28 29def report(name, tree): 30 tree.fit(X_train, y_train) 31 train_accuracy = tree.score(X_train, y_train) 32 probabilities = tree.predict_proba(X_valid)[:, 1] 33 predicted = (probabilities >= 0.50).astype(int) 34 fp = int(np.sum((y_valid == 0) & (predicted == 1))) 35 fn = int(np.sum((y_valid == 1) & (predicted == 0))) 36 cost = 18 * fp + 120 * fn 37 print( 38 f"{name}: train_accuracy={train_accuracy:.3f} " 39 f"val_f1={f1_score(y_valid, predicted):.3f} " 40 f"val_auc={roc_auc_score(y_valid, probabilities):.3f} cost=${cost}" 41 ) 42 43report("stump", DecisionTreeClassifier(max_depth=1, random_state=0)) 44report("deep tree", DecisionTreeClassifier(random_state=0))
Output
1stump: train_accuracy=0.875 val_f1=0.667 val_auc=0.750 cost=$240 2deep tree: train_accuracy=1.000 val_f1=0.571 val_auc=0.625 cost=$258

In this small slice, the deep tree fixes every training mistake and performs worse on validation. Real projects tune constraints such as max_depth, min_samples_leaf, and pruning using validation or cross-validation rather than a single training score.[1]Reference 1The Elements of Statistical Learning.https://hastie.su.domains/ElemStatLearn/[2]Reference 2Scikit-learn: Machine Learning in Python.https://jmlr.org/papers/v12/pedregosa11a.html

Random forests average across perturbed trees

A random forest trains many trees on perturbed views of the training set. Each tree receives a bootstrap sample of rows and, at each split, a random subset of candidate features. Averaging their class-probability scores reduces the instability of one fully grown tree while preserving nonlinear splits.[3]Reference 3Random Forestshttps://doi.org/10.1023/A:1010933404324

Conceptual ensemble comparison where three parallel tree probabilities 0.20, 0.60, and 0.80 average to 0.53, while boosting moves a base score from 0.40 to 0.58 and then 0.53 through sequential corrections. Conceptual ensemble comparison where three parallel tree probabilities 0.20, 0.60, and 0.80 average to 0.53, while boosting moves a base score from 0.40 to 0.58 and then 0.53 through sequential corrections.
A forest averages independently fitted tree probabilities. Boosting starts from the current score and adds each new tree's correction, so later trees depend on earlier errors.
random-forest-validation.py
1# random-forest-validation.py 2import numpy as np 3from sklearn.ensemble import RandomForestClassifier 4from sklearn.metrics import f1_score 5 6X_train = np.array([ 7 [1.0, 1], 8 [1.5, 2], 9 [2.0, 3], 10 [3.5, 6], 11 [4.0, 7], 12 [4.5, 5], 13 [2.8, 4], 14 [3.2, 8], 15]) 16y_train = np.array([0, 0, 1, 1, 1, 1, 0, 0]) 17X_valid = np.array([ 18 [1.2, 2], 19 [2.1, 4], 20 [2.4, 3], 21 [3.0, 5], 22 [3.6, 4], 23 [3.2, 7], 24 [4.2, 5], 25 [2.8, 6], 26]) 27y_valid = np.array([0, 0, 1, 0, 1, 0, 1, 1]) 28names = ["ambiguity", "auto_policy_support"] 29 30forest = RandomForestClassifier( 31 n_estimators=200, 32 max_depth=2, 33 random_state=0, 34) 35forest.fit(X_train, y_train) 36probabilities = forest.predict_proba(X_valid)[:, 1] 37 38print("validation probabilities:", np.round(probabilities, 3).tolist()) 39for threshold in [0.20, 0.50]: 40 predicted = (probabilities >= threshold).astype(int) 41 fp = int(np.sum((y_valid == 0) & (predicted == 1))) 42 fn = int(np.sum((y_valid == 1) & (predicted == 0))) 43 cost = 18 * fp + 120 * fn 44 print( 45 f"threshold={threshold:.2f} " 46 f"f1={f1_score(y_valid, predicted):.3f} cost=${cost}" 47 ) 48print("MDI importance:", dict(zip(names, np.round(forest.feature_importances_, 3))))
Output
1validation probabilities: [0.149, 0.464, 0.433, 0.463, 0.653, 0.588, 0.879, 0.496] 2threshold=0.20 f1=0.727 cost=$54 3threshold=0.50 f1=0.571 cost=$258 4MDI importance: {'ambiguity': np.float64(0.521), 'auto_policy_support': np.float64(0.479)}

The forest emits more varied scores than the stump. As before, its threshold belongs to the routing policy and must be evaluated with the same costs and fresh evidence.

Boosting adds small corrections

Gradient boosting builds trees sequentially. Each new tree learns a correction to predictions made so far. For squared-error regression, that correction target is the residual y−y^y - \hat{y}y−y^​; Friedman generalized the idea to differentiable losses through negative gradients.[4]Reference 4Greedy Function Approximation: A Gradient Boosting Machinehttps://projecteuclid.org/journals/annals-of-statistics/volume-29/issue-5/Greedy-Function-Approximation--A-Gradient-Boosting-Machine/10.1214/aos/1013203451.full

Before returning to binary routing, consider a side task: predict review-handling minutes for three access changes. Their ambiguity values are [1, 2, 5], and actual handling times are [10, 20, 60] minutes.

one-boosting-correction.py
1# one-boosting-correction.py 2import numpy as np 3 4ambiguity = np.array([1.0, 2.0, 5.0]) 5minutes = np.array([10.0, 20.0, 60.0]) 6prediction = np.full(len(minutes), minutes.mean()) 7residual = minutes - prediction 8 9threshold = 3.5 10goes_left = ambiguity <= threshold 11correction = np.where( 12 goes_left, 13 residual[goes_left].mean(), 14 residual[~goes_left].mean(), 15) 16learning_rate = 0.10 17updated = prediction + learning_rate * correction 18 19print("base prediction:", np.round(prediction, 1).tolist()) 20print("residual targets:", np.round(residual, 1).tolist()) 21print("stump correction:", np.round(correction, 1).tolist()) 22print("updated prediction:", np.round(updated, 1).tolist())
Output
1base prediction: [30.0, 30.0, 30.0] 2residual targets: [-20.0, -10.0, 30.0] 3stump correction: [-15.0, -15.0, 30.0] 4updated prediction: [28.5, 28.5, 33.0]
One boosting round for review-time targets 10, 20, and 60 minutes, showing baseline predictions at 30, residual-stump outputs minus 15 and plus 30, and learning-rate-scaled updates to 28.5, 28.5, and 33. One boosting round for review-time targets 10, 20, and 60 minutes, showing baseline predictions at 30, residual-stump outputs minus 15 and plus 30, and learning-rate-scaled updates to 28.5, 28.5, and 33.
The stump proposes residual corrections `-15` and `+30`, but learning rate `0.10` applies only `-1.5` and `+3.0`. The next round starts from predictions `28.5`, `28.5`, and `33.0` and recomputes what remains.

Shrinkage reduces the impact of any single correction. More rounds can accumulate useful structure, but too many rounds or overly complex trees can still overfit.

boosting-several-rounds.py
1# boosting-several-rounds.py 2import numpy as np 3 4def best_residual_stump(x, residual): 5 values = np.sort(np.unique(x)) 6 thresholds = (values[:-1] + values[1:]) / 2 7 best = None 8 for threshold in thresholds: 9 left = x <= threshold 10 prediction = np.where( 11 left, 12 residual[left].mean(), 13 residual[~left].mean(), 14 ) 15 squared_error = float(np.mean((residual - prediction) ** 2)) 16 if best is None or squared_error < best["error"]: 17 best = { 18 "threshold": float(threshold), 19 "prediction": prediction, 20 "error": squared_error, 21 } 22 return best 23 24x = np.array([1.0, 2.0, 5.0]) 25y = np.array([10.0, 20.0, 60.0]) 26prediction = np.full(len(y), y.mean()) 27learning_rate = 0.10 28 29print(f"round=0 predictions={np.round(prediction, 2).tolist()} mse={np.mean((y - prediction) ** 2):.2f}") 30for round_number in range(1, 5): 31 residual = y - prediction 32 stump = best_residual_stump(x, residual) 33 prediction = prediction + learning_rate * stump["prediction"] 34 mse = np.mean((y - prediction) ** 2) 35 print( 36 f"round={round_number} threshold={stump['threshold']:.2f} " 37 f"predictions={np.round(prediction, 2).tolist()} mse={mse:.2f}" 38 )
Output
1round=0 predictions=[30.0, 30.0, 30.0] mse=466.67 2round=1 threshold=3.50 predictions=[28.5, 28.5, 33.0] mse=381.17 3round=2 threshold=3.50 predictions=[27.15, 27.15, 35.7] mse=311.91 4round=3 threshold=3.50 predictions=[25.94, 25.94, 38.13] mse=255.82 5round=4 threshold=3.50 predictions=[24.84, 24.84, 40.32] mse=210.38

Classification Boosting on Access Routing

For needs_review, a classification boosting model uses a classification loss rather than raw minute residuals. The operational comparison remains the same: score the held-out requests, apply a candidate routing threshold, and count costly mistakes.

gradient-boosting-validation.py
1# gradient-boosting-validation.py 2import numpy as np 3from sklearn.ensemble import GradientBoostingClassifier 4from sklearn.metrics import f1_score, roc_auc_score 5 6X_train = np.array([ 7 [1.0, 1], 8 [1.5, 2], 9 [2.0, 3], 10 [3.5, 6], 11 [4.0, 7], 12 [4.5, 5], 13 [2.8, 4], 14 [3.2, 8], 15]) 16y_train = np.array([0, 0, 1, 1, 1, 1, 0, 0]) 17X_valid = np.array([ 18 [1.2, 2], 19 [2.1, 4], 20 [2.4, 3], 21 [3.0, 5], 22 [3.6, 4], 23 [3.2, 7], 24 [4.2, 5], 25 [2.8, 6], 26]) 27y_valid = np.array([0, 0, 1, 0, 1, 0, 1, 1]) 28 29boosted = GradientBoostingClassifier( 30 n_estimators=20, 31 max_depth=1, 32 learning_rate=0.10, 33 random_state=0, 34) 35boosted.fit(X_train, y_train) 36probabilities = boosted.predict_proba(X_valid)[:, 1] 37 38print("validation probabilities:", np.round(probabilities, 3).tolist()) 39for threshold in [0.20, 0.50]: 40 predicted = (probabilities >= threshold).astype(int) 41 fp = int(np.sum((y_valid == 0) & (predicted == 1))) 42 fn = int(np.sum((y_valid == 1) & (predicted == 0))) 43 cost = 18 * fp + 120 * fn 44 print( 45 f"threshold={threshold:.2f} f1={f1_score(y_valid, predicted):.3f} " 46 f"auc={roc_auc_score(y_valid, probabilities):.3f} cost=${cost}" 47 )
Output
1validation probabilities: [0.17, 0.379, 0.379, 0.379, 0.879, 0.379, 0.879, 0.379] 2threshold=0.20 f1=0.727 auc=0.812 cost=$54 3threshold=0.50 f1=0.667 auc=0.812 cost=$240

On this validation slice, boosting ranks review cases better than the stump but still needs a cost-aware threshold. With eight rows, treat that result as a mechanics exercise and a reason to gather more evaluation data, not as a production winner.

Importance Is a Diagnostic, Not a Cause

Tree libraries often report mean decrease in impurity (MDI): how much fitted splits on each feature reduced impurity. MDI is fast, but it describes the trained forest and can over-credit features favored by its split search.

Permutation importance asks a held-out question: how much does the chosen metric drop when one feature is shuffled? It's often more useful for validation diagnostics, though correlated features can share or mask importance.[2]Reference 2Scikit-learn: Machine Learning in Python.https://jmlr.org/papers/v12/pedregosa11a.html

validation-permutation-importance.py
1# validation-permutation-importance.py 2import numpy as np 3from sklearn.ensemble import RandomForestClassifier 4from sklearn.inspection import permutation_importance 5 6X_train = np.array([ 7 [1.0, 1], 8 [1.5, 2], 9 [2.0, 3], 10 [3.5, 6], 11 [4.0, 7], 12 [4.5, 5], 13 [2.8, 4], 14 [3.2, 8], 15]) 16y_train = np.array([0, 0, 1, 1, 1, 1, 0, 0]) 17X_valid = np.array([ 18 [1.2, 2], 19 [2.1, 4], 20 [2.4, 3], 21 [3.0, 5], 22 [3.6, 4], 23 [3.2, 7], 24 [4.2, 5], 25 [2.8, 6], 26]) 27y_valid = np.array([0, 0, 1, 0, 1, 0, 1, 1]) 28names = ["ambiguity", "auto_policy_support"] 29 30forest = RandomForestClassifier( 31 n_estimators=200, 32 max_depth=2, 33 random_state=0, 34).fit(X_train, y_train) 35importance = permutation_importance( 36 forest, 37 X_valid, 38 y_valid, 39 scoring="f1", 40 n_repeats=30, 41 random_state=0, 42) 43 44for index, name in enumerate(names): 45 print( 46 f"{name}: mdi={forest.feature_importances_[index]:.3f} " 47 f"permutation_mean={importance.importances_mean[index]:.3f} " 48 f"permutation_std={importance.importances_std[index]:.3f}" 49 )
Output
1ambiguity: mdi=0.521 permutation_mean=0.126 permutation_std=0.237 2auto_policy_support: mdi=0.479 permutation_mean=0.023 permutation_std=0.133

The wide permutation uncertainty is useful information: eight validation rows are too few for confident claims about which evidence field drives general behavior. Here, scoring="f1" uses the forest's default class predictions. A deployment audit should also define a scorer for documented routing cost at the selected threshold.

A Local Explanation for a One-Feature Stump

SHAP represents a model prediction as a baseline value plus additive feature contributions.[5]Reference 5A Unified Approach to Interpreting Model Predictionshttps://arxiv.org/abs/1705.07874 For our one-feature stump, using the training rows as background, the calculation is exact without any library:

  • Baseline review score across the training background: 0.500.500.50.
  • Left-leaf score: 0.200.200.20, so ambiguity contributes −0.30-0.30−0.30.
  • Right-leaf score: 1.001.001.00, so ambiguity contributes +0.50+0.50+0.50.
  • auto_policy_support contributes 0.00 because this stump never split on it.
exact-stump-attribution.py
1# exact-stump-attribution.py 2base_score = 0.50 3examples = { 4 "R5 high ambiguity": 1.00, 5 "R3 missed exception": 0.20, 6} 7 8for name, prediction in examples.items(): 9 ambiguity_contribution = prediction - base_score 10 policy_contribution = 0.00 11 rebuilt = base_score + ambiguity_contribution + policy_contribution 12 print( 13 f"{name}: base={base_score:.2f} ambiguity={ambiguity_contribution:+.2f} " 14 f"policy={policy_contribution:+.2f} prediction={rebuilt:.2f}" 15 )
Output
1R5 high ambiguity: base=0.50 ambiguity=+0.50 policy=+0.00 prediction=1.00 2R3 missed exception: base=0.50 ambiguity=-0.30 policy=+0.00 prediction=0.20

R3 shows the limit. The explanation faithfully reports why the stump scored the request low; it doesn't prove that low ambiguity made automatic processing correct. Explanations audit model behavior, while labels, policy review, and held-out monitoring audit decision quality.

Choosing a Model on Later Requests

A tree model earns deployment consideration only after the same later-request evaluation used for simpler baselines. Start with a regularized tree or ensemble, compare it against logistic regression under identical splits and costs, and inspect whether the added flexibility repairs meaningful failures.

Model symptomWhat to inspectPractical response
Stump misses costly review exceptionsFalse negatives and affected request slicesAdd evidence features or evaluate a constrained ensemble.
Deep tree reaches perfect training accuracyValidation cost and leaf sample countsLimit depth, increase minimum leaf size, or prune.
Forest scores improve but threshold failsCost curve and calibration on later requestsSelect threshold on validation evidence, then retest on a later period.
Boosting degrades after new policy wordingFeature distributions and error slicesRetrain or revise features after confirming policy drift.
Importance changes sharply across batchesPermutation uncertainty and correlationsTreat explanations as diagnostics, not causal conclusions.

Trees give transparent conditional splits; forests reduce single-tree instability; boosting accumulates targeted corrections. None of those mechanisms replaces held-out evaluation.

Practice tasks

  1. Change R8's target to 1. Recompute the best stump split and explain why policy support may become more useful.
  2. Add a third feature, photo_damage_score, and extend fit_stump to consider it. Report whether validation cost improves.
  3. Tune max_depth and min_samples_leaf for the forest using only the validation data above. Then write down what separate data you would reserve for an unbiased final check.
  4. Modify the false-negative cost from $120 to $40. Recompute threshold choices for stump, forest, and boosting.
  5. Implement one more boosting regression round with a smaller learning rate. Compare training MSE after four rounds.

Key concepts

  • gini impurity
  • entropy and information gain
  • Gini gain / impurity decrease
  • decision stump
  • axis-aligned splits
  • overfitting in trees
  • bagging vs boosting
  • gradient boosting residual fitting
  • feature importance
  • SHAP values
  • validation-based model comparison

Evaluation rubric

You're ready to continue when you can:

  • Compute Gini impurity and Gini gain for the access-review stump by hand, then explain how entropy-based information gain follows the same weighted-child pattern.
  • Implement a numeric decision stump that searches feature midpoints.
  • Explain why R3 exposes underfitting and why a perfect deep-tree training score isn't sufficient evidence.
  • Compare a stump, forest, and boosted classifier with held-out F1, ranking quality, and business cost.
  • Distinguish model attribution from causal or policy justification.

Follow-up questions

Common pitfalls

  • Symptom: a tree reaches perfect training accuracy, yet later costly exceptions are missed. Cause: depth memorized historical rows. Fix: tune tree constraints using held-out cost and inspect leaf sample counts.
  • Symptom: a forest or boosted model has acceptable F1, but permission-rollback cost remains high. Cause: threshold stayed at 0.50 despite asymmetric errors. Fix: sweep thresholds on validation data under documented cost assumptions.
  • Symptom: a feature attribution is presented as policy evidence. Cause: model explanation was mistaken for a causal justification. Fix: pair attributions with labels, policy review, and later-request monitoring.
Complete the lesson

Mastery Check

Answer every question, then check your score. Score above 75% to mark this lesson complete.

1.At a root with four reviews in eight rows, Gini is 0.500. A split sends labels [0, 0, 1, 0, 0] left and [1, 1, 1] right. Why is the Gini gain 0.300 rather than 0.500?
2.A stump implementation sorts ambiguity values as [1.0, 1.5, 2.0, 2.8, 3.2, 3.5, 4.0, 4.5]. Which thresholds should it score for that feature?
3.On a validation batch, threshold 0.20 gives tp=4, fp=4, fn=0; threshold 0.50 gives tp=2, fp=0, fn=2. False positives cost 18andfalsenegativescost18 and false negatives cost 18andfalsenegativescost120. Why can the F1 scores tie while costs differ?
4.A depth-unconstrained tree gets training accuracy 1.000, validation F1 0.571, and validation cost 258.Astumpgetstrainingaccuracy0.875,validationF10.667,andvalidationcost258. A stump gets training accuracy 0.875, validation F1 0.667, and validation cost 258.Astumpgetstrainingaccuracy0.875,validationF10.667,andvalidationcost240 at the same threshold. What conclusion follows before deployment?
5.How does a random forest's ensemble mechanism differ from gradient boosting's ensemble mechanism?
6.For handling-time regression, current predictions are [30.0, 30.0, 30.0]. A residual stump predicts corrections [-15.0, -15.0, 30.0]. With learning_rate 0.10, what are the updated predictions?
7.A forest reports MDI importance 0.521 for ambiguity and 0.479 for policy support. On eight validation rows, permutation importance for F1 has means 0.126 and 0.023 with large standard deviations. What conclusion is warranted?
8.For a one-feature stump, the training-background baseline score is 0.50. R3 falls in the low-ambiguity leaf with prediction 0.20; the stump never splits on policy support, and the true label is needs_review=1. What does the local attribution establish?

8 questions remaining.

Next Step
Continue to Introduction to Reinforcement Learning Fundamentals

Trees choose a review action from labeled examples; reinforcement learning asks how action sequences earn reward over time.

PreviousLogistic Regression and Metrics
Share this article
XFacebookLinkedInBlueskyRedditHacker NewsEmail
References

The Elements of Statistical Learning.

Hastie, T., Tibshirani, R., Friedman, J. · 2009

Scikit-learn: Machine Learning in Python.

Pedregosa, F., et al. · 2011 · JMLR

Random Forests

Breiman, L. · 2001 · Machine Learning

Greedy Function Approximation: A Gradient Boosting Machine

Friedman, J. H. · 2001 · The Annals of Statistics

A Unified Approach to Interpreting Model Predictions

Lundberg, S. M., Lee, S. I. · 2017 · NeurIPS