Reading the sklearn GBDT Source Code (the Big Picture)


2017/01/09 22:05 V0.1 — First version; it does not dig into source-code details, focusing instead on an overall grasp of the code. Later versions will cover the details.
2017/01/11 01:25 V0.2 — Same scope: an overall grasp of the code rather than the details; a detailed walkthrough will follow in later versions.
I have been doing a lot of data mining lately and have used GBDT a bit, so I wanted to see how it is implemented in the source code.
When we train a GBDT model:

gbdt = sklearn.ensemble.GradientBoostingClassifier(param)
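For concreteness, a minimal runnable sketch might look like the following (the toy dataset and the hyper-parameter values are placeholders I picked for illustration, not taken from the original post):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# toy data and placeholder hyper-parameters, purely for illustration
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
gbdt = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                  max_depth=3, random_state=0)
gbdt.fit(X, y)
print(gbdt.predict(X[:5]))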

So we go to the corresponding file in the sklearn package and find the class definition:

class GradientBoostingClassifier(BaseGradientBoosting, ClassifierMixin):

    _SUPPORTED_LOSS = ('deviance', 'exponential')

    def __init__(self, loss='deviance', learning_rate=0.1, n_estimators=100,
                 subsample=1.0, min_samples_split=2,
                 min_samples_leaf=1, min_weight_fraction_leaf=0.,
                 max_depth=3, init=None, random_state=None,
                 max_features=None, verbose=0,
                 max_leaf_nodes=None, warm_start=False,
                 presort='auto'):
        super(GradientBoostingClassifier, self).__init__(
            loss=loss, learning_rate=learning_rate, n_estimators=n_estimators,
            min_samples_split=min_samples_split,
            min_samples_leaf=min_samples_leaf,
            min_weight_fraction_leaf=min_weight_fraction_leaf,
            max_depth=max_depth, init=init, subsample=subsample,
            max_features=max_features,
            random_state=random_state, verbose=verbose,
            max_leaf_nodes=max_leaf_nodes, warm_start=warm_start,
            presort=presort)

We can see that GradientBoostingClassifier inherits from a base class, BaseGradientBoosting, and GradientBoostingRegressor inherits from the same base class. In that base class we find the following snippet:

class BaseGradientBoosting(six.with_metaclass(ABCMeta, BaseEnsemble,
                                              _LearntSelectorMixin)):
    ...

    def fit(self, X, y, sample_weight=None, monitor=None):
        """Fit the gradient boosting model."""

Clearly, this fit method is what gets called when we train the model:

clf = clf.fit(train[predictors], train[target])

Now let's dig into this fit method and see how GBDT actually trains the model. Things like warm_start, check_X_y and check_random_state are obviously just basic validity checks on the inputs, so we skip them and go straight to the most important snippet:

        # fit the boosting stages
        n_stages = self._fit_stages(X, y, y_pred, sample_weight, random_state,
                                    begin_at_stage, monitor, X_idx_sorted)

        # change shape of arrays after fit (early-stopping or additional ests)
        if n_stages != self.estimators_.shape[0]:
            self.estimators_ = self.estimators_[:n_stages]
            self.train_score_ = self.train_score_[:n_stages]
            if hasattr(self, 'oob_improvement_'):
                self.oob_improvement_ = self.oob_improvement_[:n_stages]

        return self

First, a quick explanation of the snippet above. It calls _fit_stages, which trains the boosted model, and n_stages is the number of boosting iterations actually performed: if you open _fit_stages and scroll to the bottom you will see return i + 1, so n_stages is indeed the loop counter. Note that return i + 1 only makes sense if the iterations run sequentially, which matches what we know about GBDT: building the t-th tree depends on the result of the previous t-1 trees.
Following the call in fit, we now look inside _fit_stages:

    def _fit_stages(self, X, y, y_pred, sample_weight, random_state,
                    begin_at_stage=0, monitor=None, X_idx_sorted=None):
        ...
        # perform boosting iterations
        i = begin_at_stage
        for i in range(begin_at_stage, self.n_estimators):
            # subsampling
            if do_oob:
                ...
            # fit next stage of trees
            y_pred = self._fit_stage(i, X, y, y_pred, sample_weight,
                                     sample_mask, random_state, X_idx_sorted,
                                     X_csc, X_csr)
            if do_oob:
                ...
            else:
                # no need to fancy index w/ no subsampling
                self.train_score_[i] = loss_(y, y_pred, sample_weight)

The actual boosting iterations happen inside this for loop. You will notice do_oob here; readers who are not familiar with it can look up out-of-bag (OOB) estimation, which many blog posts explain in detail, so I will not repeat it here. The key line is y_pred = self._fit_stage(...). Unlike _fit_stages (plural), this call fits the next single stage, i.e. the next tree. Trees are built sequentially, not in parallel. The same is true of xgboost: it also builds trees sequentially, and its parallelism is fine-grained over features while constructing one tree, not coarse-grained over trees. Now let's look in detail at how a single tree is built in GBDT.

    def _fit_stage(self, i, X, y, y_pred, sample_weight, sample_mask,
                   random_state, X_idx_sorted, X_csc=None, X_csr=None):
        """Fit another stage of ``n_classes_`` trees to the boosting model."""

        assert sample_mask.dtype == np.bool

        loss = self.loss_
        original_y = y

        for k in range(loss.K):
            if loss.is_multi_class:
                y = np.array(original_y == k, dtype=np.float64)

            residual = loss.negative_gradient(y, y_pred, k=k,
                                              sample_weight=sample_weight)

            # induce regression tree on residuals
            tree = DecisionTreeRegressor(
                criterion='friedman_mse',
                splitter='best',
                max_depth=self.max_depth,
                min_samples_split=self.min_samples_split,
                min_samples_leaf=self.min_samples_leaf,
                min_weight_fraction_leaf=self.min_weight_fraction_leaf,
                max_features=self.max_features,
                max_leaf_nodes=self.max_leaf_nodes,
                random_state=random_state,
                presort=self.presort)

            if self.subsample < 1.0:
                # no inplace multiplication!
                sample_weight = sample_weight * sample_mask.astype(np.float64)

            if X_csc is not None:
                tree.fit(X_csc, residual, sample_weight=sample_weight,
                         check_input=False, X_idx_sorted=X_idx_sorted)
            else:
                tree.fit(X, residual, sample_weight=sample_weight,
                         check_input=False, X_idx_sorted=X_idx_sorted)

            # update tree leaves
            if X_csr is not None:
                loss.update_terminal_regions(tree.tree_, X_csr, y, residual, y_pred,
                                             sample_weight, sample_mask,
                                             self.learning_rate, k=k)
            else:
                loss.update_terminal_regions(tree.tree_, X, y, residual, y_pred,
                                             sample_weight, sample_mask,
                                             self.learning_rate, k=k)

            # add tree to ensemble
            self.estimators_[i, k] = tree

        return y_pred

Why do we say GBDT is built from regression trees? And why do we say each tree is built with the goal of reducing the residual? I think these lines from the snippet above answer both questions:

            if loss.is_multi_class:
                y = np.array(original_y == k, dtype=np.float64)

            residual = loss.negative_gradient(y, y_pred, k=k,
                                              sample_weight=sample_weight)

            # induce regression tree on residuals
            tree = DecisionTreeRegressor(
                criterion='friedman_mse',
                splitter='best',
                max_depth=self.max_depth,
                min_samples_split=self.min_samples_split,
                min_samples_leaf=self.min_samples_leaf,
                min_weight_fraction_leaf=self.min_weight_fraction_leaf,
                max_features=self.max_features,
                max_leaf_nodes=self.max_leaf_nodes,
                random_state=random_state,
                presort=self.presort)

Note the call to loss.negative_gradient(); the name says it all (two standard examples of what the negative gradient looks like are given after the next snippet). A regression tree is then trained on this quantity, and its training target is precisely to fit, and thereby shrink, this residual. That is the idea of residual regression:

            if X_csc is not None:
                tree.fit(X_csc, residual, sample_weight=sample_weight,
                         check_input=False, X_idx_sorted=X_idx_sorted)
            else:
                tree.fit(X, residual, sample_weight=sample_weight,
                         check_input=False, X_idx_sorted=X_idx_sorted)
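As a reminder of what negative_gradient computes, here are the two standard textbook results (written by hand, not copied from the sklearn source): for squared error the negative gradient is simply the ordinary residual y - y_pred, and for the two-class 'deviance' (log) loss it is y - sigmoid(y_pred):

import numpy as np

def neg_gradient_squared_error(y, y_pred):
    # negative gradient of 0.5 * (y - f)^2 with respect to f: the plain residual
    return y - y_pred

def neg_gradient_binomial_deviance(y, y_pred):
    # labels y in {0, 1}, y_pred is the raw (log-odds) score;
    # negative gradient of the log loss with respect to f: y - sigmoid(f)
    return y - 1.0 / (1.0 + np.exp(-y_pred))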

OK, let's first find where tree.fit is defined. If you are developing on Windows, I would suggest installing a Linux environment first.

# from ..tree.tree import DecisionTreeRegressor
cd ..
ls
cd anaconda2
find -name sklearn
cd anaconda2/pkgs/scikit-learn-0.17.1-np111py27_2/lib/python2.7/site-packages/sklearn
cd ensemble
ls
vi gradient_boosting.py    # this is how we locate gradient_boosting.py
cd ..
ls                         # find the tree directory
cd tree
vi tree.py                 # open tree.py and find DecisionTreeRegressor
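If you do not feel like cd-ing through directories, an alternative (just a convenience trick, not part of the original walkthrough) is to ask Python itself where the classes are defined:

import inspect
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeRegressor

# prints .../sklearn/ensemble/gradient_boosting.py on versions of that era
# (the module has been renamed in newer sklearn releases)
print(inspect.getsourcefile(GradientBoostingClassifier))
print(inspect.getsourcefile(DecisionTreeRegressor))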

Either way, we find DecisionTreeRegressor:

class DecisionTreeRegressor(BaseDecisionTree, RegressorMixin):
    """A decision tree regressor.
    ...
    """

    def __init__(self,
                 criterion="mse",
                 splitter="best",
                 max_depth=None,
                 min_samples_split=2,
                 min_samples_leaf=1,
                 min_weight_fraction_leaf=0.,
                 max_features=None,
                 random_state=None,
                 max_leaf_nodes=None,
                 presort=False):
        super(DecisionTreeRegressor, self).__init__(
            criterion=criterion,
            splitter=splitter,
            max_depth=max_depth,
            min_samples_split=min_samples_split,
            min_samples_leaf=min_samples_leaf,
            min_weight_fraction_leaf=min_weight_fraction_leaf,
            max_features=max_features,
            max_leaf_nodes=max_leaf_nodes,
            random_state=random_state,
            presort=presort)

Similarly, we find the base class it inherits from, BaseDecisionTree:

class BaseDecisionTree(six.with_metaclass(ABCMeta, BaseEstimator,
                                          _LearntSelectorMixin)):
    ...

    def fit(self, X, y, sample_weight=None, check_input=True,
            X_idx_sorted=None):
        """Build a decision tree from the training set (X, y)."""

At this point we have found the tree.fit method we were looking for. GBDT, too, is built on top of basic decision trees (in sklearn these are CART-style regression trees). RF and GBDT are two different strategies: RF uses bagging, while GBDT fits regression trees to the residuals of the labels to reduce them step by step. Both kinds of forest are built from the same underlying decision tree. Overall, we will not worry about detailed implementation here, for example the following code:

        if is_classification:
            ...
        else:
            self.classes_ = [None] * self.n_outputs_
            self.n_classes_ = [1] * self.n_outputs_

Obviously this is just parameter assignment; we can come back and study these details after we have gone through the GBDT code as a whole. For now let's look only at how the tree itself is built:

        # Build tree
        # CRITERIA_CLF = {"gini": _criterion.Gini, "entropy": _criterion.Entropy}
        # CRITERIA_REG = {"mse": _criterion.MSE, "friedman_mse": _criterion.FriedmanMSE}
        criterion = self.criterion
        if not isinstance(criterion, Criterion):
            if is_classification:
                criterion = CRITERIA_CLF[self.criterion](self.n_outputs_,
                                                         self.n_classes_)
            else:
                criterion = CRITERIA_REG[self.criterion](self.n_outputs_)

Here the split criterion is chosen: Gini or entropy gain for classification trees, MSE or Friedman MSE for regression trees; the splitter then picks the most discriminative feature and threshold at each node. If you know the basics of decision trees this should pose no difficulty; if not, it is worth reading up on decision trees first.
A tree can be grown depth-first or best-first; in GBDT this choice is controlled by max_leaf_nodes (a small experiment after the following snippet illustrates the difference):

        # Use BestFirst if max_leaf_nodes given; use DepthFirst otherwise
        if max_leaf_nodes < 0:
            builder = DepthFirstTreeBuilder(...)
        else:
            builder = BestFirstTreeBuilder(...)

        builder.build(self.tree_, X, y, sample_weight, X_idx_sorted)
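A quick way to see the two growth modes from the outside (an illustrative sketch on random data; the exact node counts are not meaningful):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X, y = rng.rand(200, 5), rng.rand(200)

# max_depth only: the tree is grown depth-first
depth_first = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
# max_leaf_nodes set: the tree is grown best-first, expanding the most promising leaf
best_first = DecisionTreeRegressor(max_leaf_nodes=8, random_state=0).fit(X, y)

print(depth_first.tree_.node_count, best_first.tree_.node_count)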

Then comes return self. If you remember the earlier code, this self should be the (now fitted) BaseDecisionTree instance. So what does GBDT do once the first tree of the forest has been built? It updates the predicted labels, of course:

            if X_csc is not None:
                tree.fit(X_csc, residual, sample_weight=sample_weight,
                         check_input=False, X_idx_sorted=X_idx_sorted)
            else:
                tree.fit(X, residual, sample_weight=sample_weight,
                         check_input=False, X_idx_sorted=X_idx_sorted)

            # update tree leaves
            if X_csr is not None:
                loss.update_terminal_regions(tree.tree_, X_csr, y, residual, y_pred,
                                             sample_weight, sample_mask,
                                             self.learning_rate, k=k)
            else:
                loss.update_terminal_regions(tree.tree_, X, y, residual, y_pred,
                                             sample_weight, sample_mask,
                                             self.learning_rate, k=k)

The upper half of this snippet trains one tree; once that tree is trained, we need to update the state shared by the whole forest, including the residual and the predicted labels. The loss in loss.update_terminal_regions is the loss function you selected via the model's loss parameter. From here on, the process simply builds the remaining trees one by one, which is exactly what the following snippet does:

def _fit_stages(...):
    ...
    i = begin_at_stage
    for i in range(begin_at_stage, self.n_estimators):
        ...
        y_pred = self._fit_stage(...)
    return i + 1
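To tie the whole loop together, here is a toy, deliberately simplified re-implementation of this stage-by-stage loop for squared-error regression. It is only a sketch of the idea behind _fit_stages/_fit_stage, under my own simplifying assumptions: no subsampling, no per-leaf re-estimation as done by update_terminal_regions, and the parameter names merely mirror the sklearn ones.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def toy_gbdt_fit(X, y, n_estimators=100, learning_rate=0.1, max_depth=3):
    """Sequentially fit regression trees on the current residuals (toy sketch)."""
    y_pred = np.full(len(y), y.mean())      # simple constant initial model
    trees = []
    for i in range(n_estimators):
        # for squared error the negative gradient is just the residual
        residual = y - y_pred
        tree = DecisionTreeRegressor(criterion='friedman_mse', max_depth=max_depth)
        tree.fit(X, residual)
        # additive update: the next tree depends on the previous predictions
        y_pred = y_pred + learning_rate * tree.predict(X)
        trees.append(tree)
    return trees, y_pred

On a real fitted GradientBoostingClassifier you can see the same one-tree-per-stage structure from the outside: clf.estimators_ has one row per boosting stage, each holding the fitted DecisionTreeRegressor objects for that stage.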

That is the big picture of the GBDT source code. The next part will go through the details of the code more carefully.
