Random Forest: Principles and Applications


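Put simply, a random forest (RF) combines the Bagging algorithm with decision tree (DT) classifiers, and it can handle both classification and regression tasks. Another basic form of model ensemble built on decision trees is the gradient boosting decision tree (GBDT, Gradient Boosting Decision Tree). One strength of random forests is that they can handle data with a very large number of features, such as gene chip (microarray) data.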

1. Random forest algorithm

(1) Draw n samples at random, with replacement, from the original dataset to build a sub-dataset (row sampling).

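Note: Bootstrapping takes a limited sample and resamples it repeatedly, with replacement, to build new samples that approximate the distribution of the underlying population.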

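(2) Randomly select k of the features and build a decision tree on the sub-dataset using them (column selection). Note that Breiman's original algorithm and sklearn draw a fresh random feature subset at each node split rather than once per tree.

(3) Repeat the steps above m times to grow m decision trees, which together form the random forest.

(4) Pass each new data point through every tree and combine the results: majority vote for classification, averaging for regression.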
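The four steps above can be sketched in plain Python. This is only a minimal illustration under stated assumptions, not a reference implementation: the helper names (build_tree, fit_forest, predict_forest), the toy data, and the depth limit are all hypothetical, labels are assumed to be integers, and the k features are drawn once per tree, exactly as step (2) describes.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(X, y, feature_ids, max_depth):
    """Grow a small binary tree that may only split on the features in feature_ids."""
    if max_depth == 0 or len(set(y)) == 1:
        return Counter(y).most_common(1)[0][0]           # leaf: majority class
    best = None
    for f in feature_ids:
        for t in sorted(set(row[f] for row in X)):
            left = [i for i, row in enumerate(X) if row[f] <= t]
            right = [i for i, row in enumerate(X) if row[f] > t]
            if not left or not right:
                continue
            # Weighted Gini impurity of the two children (lower is better)
            score = (len(left) * gini([y[i] for i in left])
                     + len(right) * gini([y[i] for i in right])) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:                                     # no valid split exists
        return Counter(y).most_common(1)[0][0]
    _, f, t, left, right = best
    return (f, t,
            build_tree([X[i] for i in left], [y[i] for i in left], feature_ids, max_depth - 1),
            build_tree([X[i] for i in right], [y[i] for i in right], feature_ids, max_depth - 1))

def predict_tree(tree, row):
    """Walk from the root to a leaf; internal nodes are tuples, leaves are labels."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

def fit_forest(X, y, m, k, max_depth=3, seed=0):
    rng = random.Random(seed)
    n, n_features = len(X), len(X[0])
    forest = []
    for _ in range(m):                                   # step (3): repeat m times
        rows = [rng.randrange(n) for _ in range(n)]      # step (1): sample rows with replacement
        feats = rng.sample(range(n_features), k)         # step (2): pick k feature columns
        forest.append(build_tree([X[i] for i in rows], [y[i] for i in rows], feats, max_depth))
    return forest

def predict_forest(forest, row):
    votes = Counter(predict_tree(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]                    # step (4): majority vote

# Synthetic toy data (an assumption): class 0 values lie in [0, 5), class 1 in [10, 15)
data_rng = random.Random(42)
y = [i % 2 for i in range(40)]
X = [[label * 10 + data_rng.uniform(0, 5) for _ in range(3)] for label in y]

forest = fit_forest(X, y, m=15, k=2)
print(predict_forest(forest, [1.0, 1.0, 1.0]), predict_forest(forest, [12.0, 12.0, 12.0]))
```

For regression, the same structure would apply with leaves that store a mean target value and with predict_forest averaging the tree outputs instead of voting.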

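2. Random forest applications

(1) The RandomForestClassifier class

In sklearn, the RandomForestClassifier class has the signature below. It comes from an older release; several defaults have changed since.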

sklearn.ensemble.RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False, class_weight=None)

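Parameters:

1) n_estimators: the number of trees in the forest. The default is 10 in the signature above; since scikit-learn 0.22 the default is 100.

2) criterion: the function used to measure the quality of a split, either gini (Gini impurity) or entropy (information gain). The default is gini.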

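3) max_depth: the maximum depth of a tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

4) min_samples_split: the minimum number of samples required to split an internal node.

5) min_samples_leaf: the minimum number of samples required at a leaf node.

6) min_weight_fraction_leaf: the minimum weighted fraction of the total sample weight required at a leaf node.

7) max_features: the number of features to consider when looking for the best split. The "auto" option listed here was deprecated in scikit-learn 1.1 and removed in 1.3; "sqrt" is the equivalent setting.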

If "auto", then max_features=sqrt(n_features).
If "sqrt", then max_features=sqrt(n_features) (same as "auto").
If "log2", then max_features=log2(n_features).
If None, then max_features=n_features.

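8) max_leaf_nodes: the maximum number of leaf nodes per tree; trees are grown in best-first fashion. If None, the number of leaf nodes is unlimited.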
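To show how the parameters above are passed in practice, here is a minimal sketch using RandomForestClassifier. The synthetic dataset from make_classification, the train/test split, and the specific parameter values are assumptions chosen for illustration; "sqrt" is used for max_features because "auto" is not accepted by recent scikit-learn releases.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption): 500 samples, 20 features, 5 of them informative
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = RandomForestClassifier(
    n_estimators=100,        # number of trees in the forest
    criterion="gini",        # split quality measure
    max_depth=None,          # grow each tree until its leaves are pure
    min_samples_split=2,
    min_samples_leaf=1,
    max_features="sqrt",     # features considered at each split
    max_leaf_nodes=None,     # no limit on the number of leaves
    random_state=0,          # fixed seed for reproducibility
)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)     # mean accuracy on the held-out split
print(accuracy)
print(clf.feature_importances_.shape)    # one importance value per feature
```

The fitted trees are available as clf.estimators_, and clf.predict_proba returns the class probabilities averaged over the trees.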

(2) Titanic: Machine Learning from Disaster [8][13]

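3. Advantages and disadvantages of random forests

(1) Advantages

It can handle very high-dimensional data without a separate feature selection step.
Because the rows are sampled at random, each decision tree is trained on a different training set, which helps to reduce overfitting to some degree.
It is well suited to parallel computation and is relatively simple to implement.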

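(2) Disadvantages

Random forests have been shown to overfit on some classification or regression problems with noisy data.
For data whose categorical attributes have different numbers of levels, attributes with more levels have a larger influence on the random forest, so the attribute weights it produces on such data are not reliable.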

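For the more involved mathematical derivations of the random forest convergence theorem, the generalization error bound, and the out-of-bag (OOB) estimate, see "A Brief Analysis of Random Forest Theory". [11]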
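As a practical complement to the OOB theory, sklearn can compute the out-of-bag estimate directly: each sample is scored only by the trees whose bootstrap sample did not contain it. The sketch below assumes a synthetic dataset and an arbitrary choice of 200 trees.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption), used only to demonstrate the API
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# oob_score=True requires bootstrap=True, which is the default
clf = RandomForestClassifier(n_estimators=200, bootstrap=True, oob_score=True, random_state=0)
clf.fit(X, y)

print(clf.oob_score_)                      # accuracy estimated from out-of-bag samples
print(clf.oob_decision_function_.shape)    # per-sample class probabilities from OOB trees
```

Because the OOB samples are never seen by the trees that score them, oob_score_ serves as a validation estimate without setting aside a separate hold-out set.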

References:

[1] Algorithms in machine learning: decision tree model ensembles, random forests and GBDT: http://www.36dsj.com/archives/21036

[2] Random forests implemented in Python: http://www.oschina.net/translate/random-forests-in-python?cmp

[3] Random forest: http://baike.baidu.com/link?url=vtMu4505ng0mVCfK1c8erLzm1AqDw4j26TDL1BT4MFd75y1Pu1aavsiLDG3ZLy-ATtZFmE4MhKGnGTqVwPwFV_

[4] Random forest algorithm: http://blog.jasonding.top/2015/07/23/Machine%20Learning/【机器学习基础】随机森林算法/

[5] Gradient boosting decision tree: http://blog.csdn.net/jasonding1354/article/details/47066929

[6] A Complete Tutorial on Tree Based Modeling from Scratch (in R & Python): https://www.analyticsvidhya.com/blog/2016/04/complete-tutorial-tree-based-modeling-scratch-in-python/

[7] Random forest: http://www.cnblogs.com/wentingtu/archive/2011/12/22/2297405.html

[8] Random forest (principles / sample implementation / parameter tuning): http://blog.csdn.net/y0367/article/details/51501780

[9] The beauty of randomness: random forests: http://blog.jobbole.com/99536/

[10] A beginner's guide to random forests: http://developer.51cto.com/art/201509/491308.htm

[11] A Brief Analysis of Random Forest Theory: http://xueshu.baidu.com/s?wd=paperuri%3A%282d68174af73cb80dc58ebcad296cf68b%29&filter=sc_long_sign&tn=SE_xueshusource_2kduw22v&sc_vurl=http%3A%2F%2Fwww.doc88.com%2Fp-4857326901187.html&ie=utf-8&sc_us=17088827434296157960

[12] Replacing common regression and classification models with random forest models: http://blog.sciencenet.cn/blog-661364-728330.html

[13] Kaggle competition Titanic: Machine Learning from Disaster: http://www.cnblogs.com/tosouth/p/4889599.html

1 0