sklearn GBDT Source Code Walkthrough (The Big Picture)
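2017/01/09 22:05 V.0.1 The first version does not dwell on source-level details; it focuses on the overall structure of the code. Later versions will add the details.
2017/01/11 01:25 V.0.2 Same scope as the first version: overall structure first, with source-level details to follow in later versions.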
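I have been doing a lot of data mining lately and have used GBDT a little, so I wanted to see how it is implemented in the source.
Training a GBDT model starts with
gbdt = sklearn.ensemble.GradientBoostingClassifier(param)
so we look up the code in the corresponding folder: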
class GradientBoostingClassifier(BaseGradientBoosting, ClassifierMixin):
    _SUPPORTED_LOSS = ('deviance', 'exponential')
    def __init__(self, loss='deviance', learning_rate=0.1, n_estimators=100,
                 subsample=1.0, min_samples_split=2,
                 min_samples_leaf=1, min_weight_fraction_leaf=0.,
                 max_depth=3, init=None, random_state=None,
                 max_features=None, verbose=0,
                 max_leaf_nodes=None, warm_start=False,
                 presort='auto'):
        super(GradientBoostingClassifier, self).__init__(
            loss=loss, learning_rate=learning_rate, n_estimators=n_estimators,
            min_samples_split=min_samples_split,
            min_samples_leaf=min_samples_leaf,
            min_weight_fraction_leaf=min_weight_fraction_leaf,
            max_depth=max_depth, init=init, subsample=subsample,
            max_features=max_features,
            random_state=random_state, verbose=verbose,
            max_leaf_nodes=max_leaf_nodes, warm_start=warm_start,
            presort=presort)
We can see that GradientBoostingClassifier inherits from a parent class, BaseGradientBoosting, and that GradientBoostingRegressor inherits from the same parent. In that parent class we find the following code:
class BaseGradientBoosting(six.with_metaclass(ABCMeta, BaseEnsemble,
                                              _LearntSelectorMixin)):
    ....
    def fit(self, X, y, sample_weight=None, monitor=None):
        """Fit the gradient boosting model.
        . . .
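The inheritance layout described above can be sketched in a few lines. This is only an illustration, not sklearn's code: BaseBoosting, ToyClassifier and ToyRegressor are hypothetical names, the fit body is a placeholder, and the regressor's loss names are listed as I recall them from that sklearn version.

```python
class BaseBoosting(object):
    """Hypothetical stand-in for BaseGradientBoosting: owns the shared fit."""

    _SUPPORTED_LOSS = ()

    def __init__(self, loss):
        self.loss = loss

    def fit(self, X, y):
        # Shared by every subclass: validate the loss, then do the work.
        if self.loss not in self._SUPPORTED_LOSS:
            raise ValueError("Loss %r not supported." % self.loss)
        self.n_samples_ = len(X)  # placeholder for the real training
        return self


class ToyClassifier(BaseBoosting):
    _SUPPORTED_LOSS = ('deviance', 'exponential')


class ToyRegressor(BaseBoosting):
    _SUPPORTED_LOSS = ('ls', 'lad', 'huber', 'quantile')


# Both subclasses reuse the single fit defined on the parent.
clf = ToyClassifier(loss='deviance').fit([[0.0], [1.0]], [0, 1])
```

The point of the design is that the subclasses only declare what differs (here, the supported losses), while the training loop lives in one place.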
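Clearly, this fit function is what gets called when we train the model:
clf = clf.fit(train[predictors], train[target])
Now let's dig into this fit code and see how GBDT actually trains a model. Things like warm_start, check_X_y and check_random_state are, at a glance, basic checks and setup for the inputs, so we skip them and go straight to the most important snippet: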
# fit the boosting stages
n_stages = self._fit_stages(X, y, y_pred, sample_weight, random_state,
                            begin_at_stage, monitor, X_idx_sorted)
# change shape of arrays after fit (early-stopping or additional ests)
if n_stages != self.estimators_.shape[0]:
    self.estimators_ = self.estimators_[:n_stages]
    self.train_score_ = self.train_score_[:n_stages]
    if hasattr(self, 'oob_improvement_'):
        self.oob_improvement_ = self.oob_improvement_[:n_stages]
return self
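First, a word on the snippet above. It contains _fit_stages, which trains the model, and n_stages, the number of iterations. Step into def _fit_stages and scroll to the end, and you will see return i + 1, so n_stages is clearly the number of boosting iterations that were run. Note that return i + 1 sits at the end of a loop that executes serially, which matches what we know about GBDT: building the t-th tree depends on the result of the previous t-1 trees.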
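The relationship between return i + 1 and the trimming of estimators_ can be sketched as follows. This is a simplified illustration, not sklearn's code: the stage fitting is left out, and the one-argument monitor(i) callback is an assumption made for brevity (the real callback receives more arguments).

```python
def fit_stages(n_estimators, begin_at_stage=0, monitor=None):
    """Run boosting stages one after another and return how many exist."""
    i = begin_at_stage
    for i in range(begin_at_stage, n_estimators):
        # ... fit stage i here; it needs the predictions of stages 0..i-1 ...
        if monitor is not None and monitor(i):
            break  # early stopping requested by the callback
    return i + 1


estimators = [None] * 100  # pre-allocated slots, one per planned stage
n_stages = fit_stages(100, monitor=lambda i: i == 9)
if n_stages != len(estimators):
    estimators = estimators[:n_stages]  # drop the slots that were never used
```

With the callback stopping at index 9, ten stages exist, so the pre-allocated list is cut down to ten entries; without a callback the loop runs to the end and nothing is trimmed.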
Following the call made from fit, we find:
def _fit_stages(self, X, y, y_pred, sample_weight, random_state,
                begin_at_stage=0, monitor=None, X_idx_sorted=None):
    . . .
    # perform boosting iterations
    i = begin_at_stage
    for i in range(begin_at_stage, self.n_estimators):
        # subsampling
        if do_oob:
            . . .
        # fit next stage of trees
        y_pred = self._fit_stage(i, X, y, y_pred, sample_weight,
                                 sample_mask, random_state, X_idx_sorted,
                                 X_csc, X_csr)
        if do_oob:
            . . .
        else:
            # no need to fancy index w/ no subsampling
            self.train_score_[i] = loss_(y, y_pred, sample_weight)
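The actual iteration happens inside this for loop. Here we see do_oob; readers unfamiliar with it can search for out-of-bag, which many blogs explain in detail, so I will not repeat it here. Let's focus on the statement y_pred = self._fit_stage. Note that, unlike _fit_stages, this call trains the next tree only (for multi-class problems, one tree per class, as the loop over loss.K shows). Trees are built serially, not in parallel. xgboost also builds its trees serially; its parallelism is fine-grained, across features while a single tree is being built, not coarse-grained across trees. Now let's look in detail at how a single tree in GBDT is built.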
def _fit_stage(self, i, X, y, y_pred, sample_weight, sample_mask,
               random_state, X_idx_sorted, X_csc=None, X_csr=None):
    """Fit another stage of ``n_classes_`` trees to the boosting model. """
    assert sample_mask.dtype == np.bool
    loss = self.loss_
    original_y = y
    for k in range(loss.K):
        if loss.is_multi_class:
            y = np.array(original_y == k, dtype=np.float64)
        residual = loss.negative_gradient(y, y_pred, k=k,
                                          sample_weight=sample_weight)
        # induce regression tree on residuals
        tree = DecisionTreeRegressor(
            criterion='friedman_mse',
            splitter='best',
            max_depth=self.max_depth,
            min_samples_split=self.min_samples_split,
            min_samples_leaf=self.min_samples_leaf,
            min_weight_fraction_leaf=self.min_weight_fraction_leaf,
            max_features=self.max_features,
            max_leaf_nodes=self.max_leaf_nodes,
            random_state=random_state,
            presort=self.presort)
        if self.subsample < 1.0:
            # no inplace multiplication!
            sample_weight = sample_weight * sample_mask.astype(np.float64)
        if X_csc is not None:
            tree.fit(X_csc, residual, sample_weight=sample_weight,
                     check_input=False, X_idx_sorted=X_idx_sorted)
        else:
            tree.fit(X, residual, sample_weight=sample_weight,
                     check_input=False, X_idx_sorted=X_idx_sorted)
        # update tree leaves
        if X_csr is not None:
            loss.update_terminal_regions(tree.tree_, X_csr, y, residual, y_pred,
                                         sample_weight, sample_mask,
                                         self.learning_rate, k=k)
        else:
            loss.update_terminal_regions(tree.tree_, X, y, residual, y_pred,
                                         sample_weight, sample_mask,
                                         self.learning_rate, k=k)
        # add tree to ensemble
        self.estimators_[i, k] = tree
    return y_pred
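To make the serial dependence between stages concrete, here is a minimal pure-Python sketch of the same loop. It is not sklearn's implementation: it assumes squared-error loss (so the negative gradient is simply y - y_pred), a single feature, and a hypothetical fit_stump helper (a depth-1 tree) in place of DecisionTreeRegressor.

```python
def fit_stump(x, residual):
    """Fit a depth-1 regression tree: one threshold, two leaf means."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residual) if xi <= threshold]
        right = [r for xi, r in zip(x, residual) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def boost(x, y, n_estimators=20, learning_rate=0.1):
    """Each stage fits a stump to the residual left by all earlier stages."""
    y_pred = [sum(y) / len(y)] * len(y)  # constant initial prediction
    losses = []
    for _ in range(n_estimators):
        residual = [yi - pi for yi, pi in zip(y, y_pred)]  # negative gradient
        stump = fit_stump(x, residual)
        y_pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, y_pred)]
        losses.append(sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred)) / len(y))
    return y_pred, losses


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
y_pred, losses = boost(x, y)
```

Because stage t needs the y_pred produced by stage t-1, the loop cannot be run in parallel across trees; the training loss recorded in losses never increases from one stage to the next.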
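Why do we say GBDT is built from regression trees? Why do we say GBDT builds its trees with the goal of reducing the residual? I think this part of the snippet above answers both questions: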
if loss.is_multi_class:
    y = np.array(original_y == k, dtype=np.float64)
residual = loss.negative_gradient(y, y_pred, k=k,
                                  sample_weight=sample_weight)
# induce regression tree on residuals
tree = DecisionTreeRegressor(
    criterion='friedman_mse',
    splitter='best',
    max_depth=self.max_depth,
    min_samples_split=self.min_samples_split,
    min_samples_leaf=self.min_samples_leaf,
    min_weight_fraction_leaf=self.min_weight_fraction_leaf,
    max_features=self.max_features,
    max_leaf_nodes=self.max_leaf_nodes,
    random_state=random_state,
    presort=self.presort)
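Note the function loss.negative_gradient(): the name says it all, and it answers the question nicely. The tree that is fitted at each stage is always a DecisionTreeRegressor, even for classification, and its regression target is the negative gradient of the loss, which the code calls residual. Next comes the training of the regression tree that was just constructed, and the goal of that training is of course to fit, and so shrink, this residual. This is the idea of residual regression: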
if X_csc is not None:
    tree.fit(X_csc, residual, sample_weight=sample_weight,
             check_input=False, X_idx_sorted=X_idx_sorted)
else:
    tree.fit(X, residual, sample_weight=sample_weight,
             check_input=False, X_idx_sorted=X_idx_sorted)
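What the residual passed to tree.fit looks like can be sketched for two common losses. The functions below are simplified stand-ins written for this post, not sklearn's methods: they ignore k and sample_weight, and one_vs_rest_target is a hypothetical helper mirroring the original_y == k line.

```python
import math


def negative_gradient_ls(y, y_pred):
    """Squared error: the negative gradient is the plain residual."""
    return [yi - pi for yi, pi in zip(y, y_pred)]


def negative_gradient_deviance(y, raw_pred):
    """Binomial deviance: label minus the predicted probability."""
    return [yi - 1.0 / (1.0 + math.exp(-fi)) for yi, fi in zip(y, raw_pred)]


def one_vs_rest_target(labels, k):
    """Multi-class case: class k becomes 1.0, every other class 0.0."""
    return [1.0 if label == k else 0.0 for label in labels]


ls_residual = negative_gradient_ls([3.0, 1.0], [2.5, 1.5])
dev_residual = negative_gradient_deviance([1.0, 0.0], [0.0, 0.0])
target_for_class_2 = one_vs_rest_target([0, 2, 1, 2], k=2)
```

In both cases the result is a real-valued vector, which is why a regression tree is the natural learner even when the overall task is classification.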
OK, let's first find where this tree.fit function lives. If you are developing on Windows, I suggest installing Linux first.
# from ..tree.tree import DecisionTreeRegressor
cd ..
ls
cd anaconda2
find . -name sklearn
cd anaconda2/pkgs/scikit-learn-0.17.1-np111py27_2/lib/python2.7/site-packages/sklearn
cd ensemble
ls
vi gradient_boosting.py
# The steps above locate gradient_boosting.py.
cd ..
ls    # find the tree folder
cd tree
vi tree.py    # open tree.py and find DecisionTreeRegressor
That is how we find DecisionTreeRegressor:
class DecisionTreeRegressor(BaseDecisionTree, RegressorMixin):
    """A decision tree regressor.
    . . .
    def __init__(self,
                 criterion="mse",
                 splitter="best",
                 max_depth=None,
                 min_samples_split=2,
                 min_samples_leaf=1,
                 min_weight_fraction_leaf=0.,
                 max_features=None,
                 random_state=None,
                 max_leaf_nodes=None,
                 presort=False):
        super(DecisionTreeRegressor, self).__init__(
            criterion=criterion,
            splitter=splitter,
            max_depth=max_depth,
            min_samples_split=min_samples_split,
            min_samples_leaf=min_samples_leaf,
            min_weight_fraction_leaf=min_weight_fraction_leaf,
            max_features=max_features,
            max_leaf_nodes=max_leaf_nodes,
            random_state=random_state,
            presort=presort)
In the same way, we find the parent class it inherits from, BaseDecisionTree:
class BaseDecisionTree(six.with_metaclass(ABCMeta, BaseEstimator,
                                          _LearntSelectorMixin)):
    . . .
    def fit(self, X, y, sample_weight=None, check_input=True,
            X_idx_sorted=None):
        """Build a decision tree from the training set (X, y).
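At this point we have found the tree.fit function from earlier. We know that GBDT is also built on decision trees (the ID3/C4.5/CART family; scikit-learn's trees use an optimized version of CART). RF and GBDT are two different strategies: RF uses bagging, while GBDT fits regression trees stage by stage to reduce the residual of the labels. Both kinds of forest are built on the same basic decision tree. Overall, we are not concerned with implementation details at this stage, for example code like the following: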
if is_classification:
    . . .
else:
    self.classes_ = [None] * self.n_outputs_
    self.n_classes_ = [1] * self.n_outputs_
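Clearly this is just parameter assignment; we will study this kind of code carefully after we have skimmed the whole of the GBDT code. For now we only look at how the tree is built: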
# Build tree
# CRITERIA_CLF = {"gini": _criterion.Gini, "entropy": _criterion.Entropy}
# CRITERIA_REG = {"mse": _criterion.MSE, "friedman_mse": _criterion.FriedmanMSE}
criterion = self.criterion
if not isinstance(criterion, Criterion):
    if is_classification:
        criterion = CRITERIA_CLF[self.criterion](self.n_outputs_,
                                                 self.n_classes_)
    else:
        criterion = CRITERIA_REG[self.criterion](self.n_outputs_)
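This is where the split criterion is chosen. For a classification tree that means gini or entropy; for the regression trees that GBDT builds, the CRITERIA_REG branch is taken, and _fit_stage passes criterion='friedman_mse'. The criterion is what lets the splitter pick the most discriminative feature, the one whose split gives the largest gain. If you know the basics of decision trees this should not be difficult; if not, I suggest reading up on decision trees first.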
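For readers who want the criteria named above in concrete form, here is a small sketch of the textbook impurity measures. These are the standard definitions written in plain Python, not sklearn's Cython implementations, and friedman_mse (a variant of the mse-based improvement score) is not reproduced here.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def mse(values):
    """Regression impurity: variance of the targets around their mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


mixed_gini = gini([0, 0, 1, 1])
mixed_entropy = entropy([0, 0, 1, 1])
spread = mse([1.0, 3.0])
```

A split is scored by how much it lowers the chosen impurity from the parent node to its children; a pure node scores zero under all three measures.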
A tree can be grown in two ways, depth-first or best-first; in GBDT the choice is controlled by max_leaf_nodes:
# Use BestFirst if max_leaf_nodes given; use DepthFirst otherwise
if max_leaf_nodes < 0:
    builder = DepthFirstTreeBuilder(...)
else:
    builder = BestFirstTreeBuilder(...)
builder.build(self.tree_, X, y, sample_weight, X_idx_sorted)
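The difference between the two builders can be illustrated on a toy tree. This is a sketch of the two expansion orders only, not of sklearn's builders: TOY_TREE and its improvement numbers are made up, and the functions just report which nodes get split and in what order.

```python
import heapq

# Hypothetical toy tree: node -> (improvement gained by splitting it, children)
TOY_TREE = {
    "root": (10.0, ["A", "B"]),
    "A": (1.0, ["A1", "A2"]),
    "B": (5.0, ["B1", "B2"]),
    "A1": (0.0, []), "A2": (0.0, []),
    "B1": (3.0, ["B1a", "B1b"]), "B2": (0.0, []),
    "B1a": (0.0, []), "B1b": (0.0, []),
}


def depth_first_order(tree, root="root"):
    """Split nodes in stack order, following one branch down before the next."""
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        _, children = tree[node]
        if children:
            order.append(node)
            stack.extend(reversed(children))
    return order


def best_first_order(tree, max_leaf_nodes, root="root"):
    """Always split the frontier node with the largest improvement."""
    order, n_leaves = [], 1
    frontier = [(-tree[root][0], root)]  # max-heap via negated improvement
    while frontier and n_leaves < max_leaf_nodes:
        _, node = heapq.heappop(frontier)
        _, children = tree[node]
        if not children:
            continue
        order.append(node)
        n_leaves += len(children) - 1
        for child in children:
            heapq.heappush(frontier, (-tree[child][0], child))
    return order


dfs = depth_first_order(TOY_TREE)
bfs3 = best_first_order(TOY_TREE, max_leaf_nodes=3)
```

Depth-first splits root, A, B, B1 in that order, while best-first with a budget of three leaves splits only root and B, spending the budget where the gain is largest. That is why a leaf budget goes together with the best-first builder.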
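Then comes return self; if you remember the earlier code, this self is the BaseDecisionTree. Now let's see what GBDT does once the first tree in the forest has been built: it updates the predicted labels, of course: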
if X_csc is not None:
    tree.fit(X_csc, residual, sample_weight=sample_weight,
             check_input=False, X_idx_sorted=X_idx_sorted)
else:
    tree.fit(X, residual, sample_weight=sample_weight,
             check_input=False, X_idx_sorted=X_idx_sorted)
# update tree leaves
if X_csr is not None:
    loss.update_terminal_regions(tree.tree_, X_csr, y, residual, y_pred,
                                 sample_weight, sample_mask,
                                 self.learning_rate, k=k)
else:
    loss.update_terminal_regions(tree.tree_, X, y, residual, y_pred,
                                 sample_weight, sample_mask,
                                 self.learning_rate, k=k)
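The first half of this code trains the first tree. Once that tree is trained, the state shared across the whole forest has to be updated, including the residual and the predicted labels (y_pred); this is the job of loss.update_terminal_regions. The loss here is self.loss_, the loss object chosen by the loss parameter you pass to the model (for example 'deviance'). Put simply, the rest of training builds each remaining tree in turn, which is what the following snippet does: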
def _fit_stages():
    . . .
    i = begin_at_stage
    for i in range(begin_at_stage, self.n_estimators):
        . . .
        y_pred = self._fit_stage()
    return i + 1
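Going back to the leaf update that follows each tree.fit, here is a rough sketch of what such a step can look like for an absolute-error (LAD-style) loss: every leaf is reset to the median of y - y_pred over the samples that fall in it, and the predictions are then moved by learning_rate times that leaf value. The function below is an assumption-laden illustration with a simplified signature (leaf ids are passed in directly and sample weights are ignored); it is not sklearn's update_terminal_regions.

```python
from statistics import median


def update_terminal_regions(leaf_ids, y, y_pred, learning_rate):
    """LAD-style update: each leaf predicts the median of y - y_pred in it."""
    leaf_values = {}
    for leaf in set(leaf_ids):
        diffs = [yi - pi
                 for l, yi, pi in zip(leaf_ids, y, y_pred) if l == leaf]
        leaf_values[leaf] = median(diffs)
    # Move every prediction towards its leaf value, damped by the learning rate.
    new_pred = [pi + learning_rate * leaf_values[l]
                for l, pi in zip(leaf_ids, y_pred)]
    return leaf_values, new_pred


leaf_values, new_pred = update_terminal_regions(
    leaf_ids=[0, 0, 0, 1, 1],
    y=[1.0, 2.0, 9.0, 10.0, 12.0],
    y_pred=[0.0, 0.0, 0.0, 0.0, 0.0],
    learning_rate=0.5,
)
```

The updated predictions are what the next stage sees as y_pred, so this step is the link between one tree and the next.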
That covers the big picture of the GBDT source code. The next part will go through the details of the code carefully.