
统计学习 / Statistical Learning

树模型

Tree Methods

本页结构

核心概念

  • 划分准则与树深控制 Split criteria and tree depth control
  • 用 bagging 和随机森林降低方差 Bagging and random forests for variance reduction
  • 用 boosting 逐步修正误差 Boosting for sequential error correction

学习顺序

  1. 用偏差-方差框架比较不同树模型。 Use bias-variance language to compare tree methods.
  2. 理解随机森林为什么要降低单棵树之间的相关性。 Know why random forests decorrelate individual trees.
  3. 讨论可解释性、数据泄漏和过拟合控制。 Mention interpretability, leakage and overfitting controls.

概览

Overview

Tree-based methods are powerful, non-linear machine learning techniques widely used for their ability to capture complex interactions and non-linear relationships in data, which are often missed by traditional linear models.

基于树的方法是强大的非线性机器学习技术,因其能够捕获数据中复杂的交互和非线性关系而被广泛使用,而传统的线性模型经常会忽略这些。

一、决策树(单棵树)

I. Decision Trees (Single Trees)

A Decision Tree partitions the feature space into a set of non-overlapping regions. For any given observation, the prediction is the mean of the response values (for regression) or the most frequent class (for classification) of the training observations that fall into that region.

决策树将特征空间划分为一组互不重叠的区域。对于任一观测,其预测值由落入同一区域的训练观测决定:回归时取这些观测响应值的平均值,分类时取其中最常见的类别。
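The prediction rule above can be written compactly. As a sketch, assume the fitted tree has $J$ regions $R_1, \dots, R_J$ and that $N_j$ training observations fall in $R_j$; the symbols $J$, $R_j$, $N_j$ and $\hat{c}_j$ are notation introduced here for illustration, not taken from the text above.

```latex
\hat{f}(x) = \sum_{j=1}^{J} \hat{c}_j \, \mathbf{1}\{x \in R_j\},
\qquad
\hat{c}_j =
\begin{cases}
\dfrac{1}{N_j} \displaystyle\sum_{i:\, x_i \in R_j} y_i
  & \text{(regression: mean response in } R_j\text{)} \\[2ex]
\displaystyle\arg\max_{k} \; \#\{\, i : x_i \in R_j,\ y_i = k \,\}
  & \text{(classification: most frequent class in } R_j\text{)}
\end{cases}
```

Because the regions do not overlap, exactly one indicator is nonzero for any $x$, so the prediction is simply the constant attached to the region that contains $x$.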

Splitting Criteria

分裂标准

The process of building a tree involves recursively splitting the data based on the feature and split point that maximizes the "purity" of the resulting nodes.

构建树的过程涉及根据特征和分割点递归地分割数据,从而最大化结果节点的“纯度”。
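To make the split search concrete, here is a minimal sketch for a single numeric feature in a regression setting. It assumes squared-error loss and tries midpoints between consecutive distinct feature values as thresholds; the helper names `rss` and `best_split` are hypothetical, and real libraries use faster incremental updates.

```python
def rss(values):
    """Residual sum of squares around the mean of `values` (0.0 if empty)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, total_rss) for the best single split of one feature.

    Candidate thresholds are midpoints between consecutive distinct x values.
    """
    pairs = sorted(zip(x, y))
    best_threshold, best_rss = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [yi for xi, yi in pairs if xi <= threshold]
        right = [yi for xi, yi in pairs if xi > threshold]
        total = rss(left) + rss(right)
        if total < best_rss:
            best_threshold, best_rss = threshold, total
    return best_threshold, best_rss


# Toy data (made up for illustration): the response jumps between x = 3 and x = 4.
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
threshold, total_rss = best_split(x, y)
print(threshold, round(total_rss, 4))
```

On this toy data the search picks the threshold 3.5, which separates the low and high responses. A full tree would apply the same search to every feature and then recurse into each child node until a stopping rule such as a maximum depth is reached.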

| Task | Splitting criterion (impurity measure) | Goal |
|---|---|---|
| Classification | Gini index or entropy (information gain) | Maximize the reduction in class impurity (heterogeneity) within the resulting nodes. |
| Regression | Residual sum of squares (RSS) or mean squared error (MSE) | Minimize the variance of the response variable within the resulting nodes. |

| 任务 | 分裂准则(不纯度度量) | 目标 |
|---|---|---|
| 分类 | 基尼指数或熵(信息增益) | 最大化结果节点中类别不纯度(异质性)的下降。 |
| 回归 | 残差平方和 (RSS) 或均方误差 (MSE) | 最小化结果节点内响应变量的方差。 |
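The classification criteria above can be computed in a few lines. The sketch below uses the standard definitions (Gini index $1 - \sum_k p_k^2$ and entropy $-\sum_k p_k \log_2 p_k$) and scores a split by the size-weighted drop in impurity; the function names and the toy labels are illustrative assumptions.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy of the class proportions, in bits."""
    n = len(labels)
    return sum(-(count / n) * log2(count / n) for count in Counter(labels).values())


def impurity_decrease(parent, left, right, measure):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    children = (len(left) / n) * measure(left) + (len(right) / n) * measure(right)
    return measure(parent) - children


# Toy node with a 5/5 class mix, split into two mostly pure children.
parent = ["a"] * 5 + ["b"] * 5
left, right = ["a"] * 4 + ["b"], ["a"] + ["b"] * 4

print(gini(parent), entropy(parent))
print(impurity_decrease(parent, left, right, gini))
print(impurity_decrease(parent, left, right, entropy))
```

A perfectly mixed two-class node has Gini index 0.5 and entropy 1 bit, and a pure node scores 0 under both measures. The tree-growing algorithm picks the split with the largest impurity decrease.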

Advantages and Disadvantages

优点和缺点

  • Pros: Easy to interpret (white-box model), can handle non-linear relationships, and naturally handles categorical predictors.
  • Cons: High variance (small changes in data can lead to a very different tree), prone to overfitting, and generally lower predictive accuracy than ensemble methods.
  • 优点:易于解释(白盒模型),可以处理非线性关系,并且自然地处理分类预测变量。
  • 缺点:方差高(数据的微小变化可能导致生成非常不同的树),容易过度拟合,并且预测精度通常低于集成方法。

二、集成方法(降低方差与偏差)

II. Ensemble Methods (Reducing Variance and Bias)

Ensemble methods combine multiple individual decision trees to improve overall predictive performance and robustness.

集成方法结合了多个单独的决策树,以提高整体预测性能和鲁棒性。

1. Bagging (Bootstrap Aggregating)

1. 装袋(Bootstrap Aggregating)

Bagging is a general-purpose procedure for reducing the variance of a statistical learning method.

Bagging 是一种通用方法,用于降低统计学习方法的方差。

  • Mechanism: Generate $B$ bootstrap samples (sampling with replacement) from the original training data. Train a full, unpruned decision tree on each bootstrap sample. Aggregate the predictions: average them (regression) or take a majority vote (classification).
  • Out-of-Bag (OOB) Error: Each bootstrap sample contains, on average, only about $2/3$ of the distinct training observations (more precisely $1 - 1/e \approx 0.632$ for large samples). The remaining $1/3$ (the OOB observations) were never seen by that tree, so they can serve as a validation set for estimating the test error without separate cross-validation.
  • 机制:从原始训练数据中生成 $B$ 个自助(bootstrap)样本(有放回抽样)。在每个自助样本上训练一棵完整的、未剪枝的决策树。聚合预测:对预测取平均(回归)或采用多数投票(分类)。
  • 袋外 (OOB) 误差:每个自助样本平均只包含约 $2/3$ 的不同训练观测(样本量较大时更精确地说是 $1 - 1/e \approx 0.632$)。剩余约 $1/3$ 的观测(OOB 观测值)未被该树使用,因此可以作为验证集来估计测试误差,而无需另做交叉验证。
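The "about two thirds in, one third out" figure can be checked with a short simulation. This sketch assumes a bootstrap sample of the same size $n$ as the training set; the function name `oob_fraction`, the seed and the sizes are arbitrary illustrative choices.

```python
import random


def oob_fraction(n, rng):
    """Fraction of n observations that never appear in one bootstrap sample of size n."""
    in_bag = {rng.randrange(n) for _ in range(n)}  # indices drawn with replacement
    return 1 - len(in_bag) / n


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n, n_trees = 1000, 200

# Average out-of-bag share across many bootstrap samples.
average_oob = sum(oob_fraction(n, rng) for _ in range(n_trees)) / n_trees

# Probability that one given observation is missed by all n draws.
expected_oob = (1 - 1 / n) ** n

print(round(average_oob, 3), round(expected_oob, 3))
```

Both numbers should land near $1/e \approx 0.368$, which is why roughly one third of the training observations are available as out-of-bag validation data for each tree.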

2. Random Forests

2. 随机森林

Random Forests are an improvement over bagging that aims to decorrelate the trees, further reducing variance.

随机森林是对 bagging 的改进,旨在“去相关”树,进一步减少方差。

  • Mechanism: Use the bagging procedure (bootstrap samples). At each split in the tree-building process, only a random subset of $m$ predictors is considered as split candidates, where $m \ll p$ and $p$ is the total number of predictors.
  • Hyperparameter $m$: Typically set to about $\sqrt{p}$ for classification and $p/3$ for regression. Because the strongest predictor is left out of the candidate set at many splits, the resulting trees are less correlated, so averaging them reduces variance more.
  • Feature Importance: Random forests provide a measure of variable importance: the total decrease in node impurity (e.g., Gini index) from splits on a given predictor, averaged over all trees.
  • 机制:使用装袋程序(自助样本)。在建树过程的每次分裂中,只从 $m$ 个随机抽取的预测变量中选择分裂候选,其中 $m \ll p$,$p$ 为预测变量总数。
  • 超参数 $m$:分类时通常取约 $\sqrt{p}$,回归时取 $p/3$。由于最强的预测变量在许多分裂中不在候选集合内,生成的树之间相关性较低,平均后方差下降得更多。
  • 特征重要性:随机森林提供一种变量重要性度量:某个预测变量的分裂带来的节点不纯度(例如基尼指数)总下降量,在所有树上取平均。
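Why decorrelation matters can be seen from the variance of an average of correlated predictors. The sketch below assumes an idealized setting that is not stated in the text above: $B$ tree predictions with a common variance $\sigma^2$ and a common pairwise correlation $\rho$, for which the variance of the average is $\rho\sigma^2 + (1-\rho)\sigma^2/B$. The function names are hypothetical.

```python
def variance_of_average(sigma2, rho, n_trees):
    """Variance of the mean of n_trees predictions with variance sigma2 and pairwise correlation rho."""
    return rho * sigma2 + (1 - rho) * sigma2 / n_trees


def prob_feature_excluded(p, m):
    """Probability that one particular predictor is not among the m candidates at a split."""
    return (p - m) / p


# The first term, rho * sigma2, does not shrink as trees are added.
for rho in (0.9, 0.5, 0.1):
    path = [round(variance_of_average(1.0, rho, b), 3) for b in (1, 10, 100, 1000)]
    print(rho, path)

# With p = 100 predictors and m = 10 candidates per split,
# a dominant predictor is unavailable at 90% of splits.
print(prob_feature_excluded(100, 10))
```

Adding trees only shrinks the second term, so the floor on the ensemble variance is set by $\rho$. Restricting each split to $m$ random candidates is one way to lower that correlation, at the cost of somewhat weaker individual trees.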

3. Boosting

3. 提升

Boosting is an ensemble technique that focuses on sequentially building trees to reduce bias.

Boosting 是一种集成技术,通过顺序构建树来降低偏差。

  • Mechanism: Start with a simple model (e.g., a single tree). Sequentially fit new trees to the residuals (or pseudo-residuals in generalized boosting) of the previous step, so that each new tree attempts to correct the errors of the current ensemble. Each new tree's contribution is scaled by a small learning rate $\lambda$, which slows down learning and usually improves generalization.
  • Key Algorithms: AdaBoost (Adaptive Boosting) and Gradient Boosting Machines (GBM), including modern implementations such as XGBoost and LightGBM.
  • Tradeoff: Boosting often achieves higher predictive accuracy than bagging or random forests, but it is more prone to overfitting if the learning rate is too high or the number of trees is too large.
  • 机制:从一个简单的模型(例如,一棵树)开始。依次将新树拟合到上一步的残差(或广义提升中的伪残差),使每棵新树都尝试纠正当前集成的错误。每棵新树的贡献都会乘以一个较小的学习率 $\lambda$,以减慢学习过程,这通常能提高泛化能力。
  • 关键算法:AdaBoost(自适应提升)和梯度提升机 (GBM),包括 XGBoost 和 LightGBM 等现代实现。
  • 权衡:Boosting 往往比 bagging 或随机森林取得更高的预测精度,但如果学习率太高或树的数量太大,则更容易过拟合。
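The residual-fitting loop described above can be sketched in plain Python for squared-error regression on one feature. Several choices here are assumptions made for illustration: the initial model is the mean of the response, each tree is a depth-1 stump, and the names `fit_stump`, `predict_stump` and `boost` are hypothetical. Libraries such as XGBoost and LightGBM add regularization and many optimizations that this sketch omits.

```python
def fit_stump(x, residuals):
    """Fit a depth-1 regression tree; return (threshold, left_mean, right_mean)."""
    best = None
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean


def boost(x, y, n_trees=200, learning_rate=0.1):
    """Each stump is fit to the current residuals and added with a shrunken weight."""
    base = sum(y) / len(y)  # assumed starting model: a constant prediction
    predictions = [base] * len(y)
    stumps, mse_path = [], []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [pi + learning_rate * predict_stump(stump, xi)
                       for xi, pi in zip(x, predictions)]
        mse_path.append(sum((yi - pi) ** 2 for yi, pi in zip(y, predictions)) / len(y))
    return base, stumps, mse_path


# Toy data (made up): a smooth nonlinear target without noise.
x = [i / 10 for i in range(40)]
y = [xi ** 2 for xi in x]
base, stumps, mse_path = boost(x, y)
print(round(mse_path[0], 3), round(mse_path[-1], 5))
```

The training error keeps falling as trees are added, which is exactly why the number of trees and the learning rate need to be chosen on held-out data rather than on the training set.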

补充讲解

树模型是在切分特征空间

Trees partition feature space

决策树通过递归切分特征空间来近似非线性函数。这带来可解释性,也让单棵树对数据扰动较敏感。

Decision trees approximate nonlinear functions by recursive partitions. This makes them interpretable, but also sensitive to small data changes.

Bagging 与 boosting 目标不同

Bagging and boosting solve different problems

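Bagging 通过对多棵在自助样本上训练的树取平均来降低方差,树之间相关性越低,效果越好;boosting 通过顺序关注残差或难样本来降低偏差。

Bagging reduces variance by averaging many trees fit to bootstrap samples, and the gain is larger when those trees are less correlated. Boosting reduces bias by sequentially focusing on residual errors or difficult observations.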

量化数据必须检查泄漏

Quant data needs leakage checks

只有确认时间顺序、幸存者偏差处理和目标信息没有泄漏到特征中后,特征重要性才有解释价值。

Feature importance is only meaningful after confirming time ordering, survivorship treatment, and that target information does not leak into features.
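One of these checks, time ordering, can be enforced directly in the validation splits. The sketch below is a hypothetical helper, not a prescribed method: it assumes rows are already sorted by time and yields expanding training windows that always end before the test window starts, with an optional `gap` for labels built from overlapping future returns.

```python
def walk_forward_splits(n_samples, n_folds, gap=0):
    """Yield (train_indices, test_indices) with all training rows before all test rows.

    Assumes rows are sorted by time. `gap` rows between train and test are dropped.
    """
    fold_size = n_samples // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train_end = k * fold_size
        test_start = train_end + gap
        test_end = min(test_start + fold_size, n_samples)
        if test_start >= test_end:
            break  # not enough rows left for another test window
        yield list(range(train_end)), list(range(test_start, test_end))


for train, test in walk_forward_splits(n_samples=12, n_folds=3, gap=1):
    assert max(train) < min(test)  # no future rows inside the training window
    print(train[-1], test)
```

A random shuffle of the same rows would mix future observations into training and could make both accuracy and feature importance look better than they are. Survivorship and target leakage need separate checks on how the dataset and features were constructed.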