Using Random Forests in scikit-learn

Source: Internet. Editor: 程序博客网. Date: 2024/04/30 11:14

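A random forest is a widely used ensemble classifier that uses decision trees as its base classifiers and predicts a sample's class by averaging the predictions of those trees. An implementation is available in scikit-learn, the Python machine learning package.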

Basic usage is as follows:

    from sklearn.ensemble import RandomForestClassifier

    # train_x: training feature matrix; train_y: the corresponding labels
    model = RandomForestClassifier(n_estimators=10)
    model.fit(train_x, train_y)

Here train_x is the feature set of the training samples and train_y holds the corresponding sample labels.
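The snippet above can be made runnable end to end. The following is a minimal sketch that assumes synthetic data from make_classification in place of a real dataset; the split sizes and random seeds are arbitrary choices for illustration and do not come from the original post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real dataset (an assumption for illustration).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(train_x, train_y)

pred_y = model.predict(test_x)        # predicted class labels
proba = model.predict_proba(test_x)   # class probabilities averaged over the trees
print("test accuracy:", model.score(test_x, test_y))
```

predict returns one label per test sample, while predict_proba returns one row of class probabilities per sample.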
The constructor parameters of RandomForestClassifier are listed below:

    sklearn.ensemble.RandomForestClassifier(
        n_estimators=10, criterion='gini', max_depth=None,
        min_samples_split=2, min_samples_leaf=1,
        min_weight_fraction_leaf=0.0, max_features='auto',
        max_leaf_nodes=None, min_impurity_split=1e-07,
        bootstrap=True, oob_score=False, n_jobs=1, random_state=None,
        verbose=0, warm_start=False, class_weight=None)
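Parameter defaults differ between scikit-learn releases, so it can help to inspect the ones actually installed. A small sketch using get_params, which every scikit-learn estimator provides:

```python
from sklearn.ensemble import RandomForestClassifier

# get_params() returns the constructor parameters and their current values.
params = RandomForestClassifier().get_params()
for name in sorted(params):
    print(f"{name} = {params[name]!r}")
```

The printed list may not match the signature above exactly, depending on the installed version.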

The main parameters are:

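n_estimators : the number of trees in the forest. The default is 10 in the release documented here (100 since scikit-learn 0.22).
criterion : the measure used to evaluate the quality of a split at a node. Either “gini” (Gini impurity) or “entropy” (information gain); the default is “gini”. This parameter is tree-specific.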
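To make the two criteria concrete, here is a small sketch that computes Gini impurity and entropy for a node's class distribution. The helper names gini and entropy are hypothetical, and the base-2 logarithm is a choice made for this sketch rather than something stated in the post.

```python
import numpy as np

def gini(p):
    """Gini impurity 1 - sum(p_k^2) for class proportions p."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Entropy -sum(p_k * log2(p_k)), treating 0 * log(0) as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

balanced = [0.5, 0.5]   # a maximally mixed two-class node
pure = [1.0, 0.0]       # a node containing a single class
print("balanced:", gini(balanced), entropy(balanced))
print("pure:    ", gini(pure), entropy(pure))
```

Both measures are zero for a pure node and largest for an evenly mixed one, which is why a split that lowers them is preferred.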
max_features : the number of features to consider when looking for the best split.
If int, then consider max_features features at each split.
If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.
If “auto”, then max_features=sqrt(n_features). (This value was deprecated and later removed in recent releases; use “sqrt” instead.)
If “sqrt”, then max_features=sqrt(n_features) (same as “auto”).
If “log2”, then max_features=log2(n_features).
If None, then max_features=n_features.
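The rules above can be checked on a fitted forest, since each fitted tree exposes the resolved integer as max_features_. A minimal sketch, assuming synthetic data with 16 features; “auto” is left out because recent releases no longer accept it.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# 16 features keeps sqrt and log2 at whole numbers (an illustrative choice).
X, y = make_classification(n_samples=100, n_features=16, random_state=0)

resolved = {}
for setting in (3, 0.5, "sqrt", "log2", None):
    forest = RandomForestClassifier(
        n_estimators=2, max_features=setting, random_state=0
    )
    forest.fit(X, y)
    # Each fitted tree stores the resolved feature count in max_features_.
    resolved[setting] = forest.estimators_[0].max_features_
    print(repr(setting), "->", resolved[setting])
```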
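max_depth : the maximum depth of a tree. With the default max_depth=None, nodes are expanded until all leaves are pure (every sample in a leaf belongs to the same class) or until all leaves contain fewer than min_samples_split samples.
min_samples_split : (default=2) the minimum number of samples required to split an internal node. If int, a node holding fewer than this many samples is not split. If float, min_samples_split is a fraction and the minimum number of samples is ceil(min_samples_split * n_samples).
min_samples_leaf : the minimum number of samples required at a leaf node.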
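The effect of max_depth, min_samples_split and min_samples_leaf can be verified by inspecting the fitted trees. The sketch below assumes synthetic data and sets bootstrap=False so that node sample counts refer directly to the original training samples; the specific limits are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

forest = RandomForestClassifier(
    n_estimators=5, max_depth=4, min_samples_split=20, min_samples_leaf=5,
    bootstrap=False, random_state=0,
).fit(X, y)

depths, smallest_split_node, smallest_leaf = [], [], []
for est in forest.estimators_:
    tree = est.tree_
    is_leaf = tree.children_left == -1   # -1 marks a node without children
    depths.append(tree.max_depth)
    # Smallest node that was actually split, and smallest leaf, in this tree.
    smallest_split_node.append(tree.n_node_samples[~is_leaf].min())
    smallest_leaf.append(tree.n_node_samples[is_leaf].min())

print("depths:", depths)
print("smallest split node per tree:", smallest_split_node)
print("smallest leaf per tree:", smallest_leaf)
```

Every node that was split holds at least min_samples_split samples, and every leaf holds at least min_samples_leaf.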
min_weight_fraction_leaf : The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.
max_leaf_nodes : the maximum number of leaf nodes in a tree. If None, the number of leaf nodes is unlimited.
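A short sketch of the max_leaf_nodes limit, again on synthetic data (an assumption for illustration); leaves are counted as the nodes without children in each fitted tree.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

forest = RandomForestClassifier(
    n_estimators=5, max_leaf_nodes=8, random_state=0
).fit(X, y)

# A node whose children_left entry is -1 has no children, i.e. it is a leaf.
leaf_counts = [
    int((est.tree_.children_left == -1).sum()) for est in forest.estimators_
]
print("leaves per tree:", leaf_counts)
```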
min_impurity_split : float, optional (default=1e-7)
Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold, otherwise it is a leaf. (Later releases deprecated this parameter in favor of min_impurity_decrease.)
bootstrap : boolean, optional (default=True)
Whether bootstrap samples are used when building trees.
oob_score : bool (default=False)
Whether to use out-of-bag samples to estimate the generalization accuracy.
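With bootstrap sampling, each tree leaves out part of the training set, and those out-of-bag samples give an accuracy estimate without a separate validation set. A minimal sketch on synthetic data; the use of 100 trees is an illustrative choice so that every sample is very likely to be out of bag for at least one tree.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100, bootstrap=True, oob_score=True, random_state=0
)
forest.fit(X, y)

# oob_score_ is only available when oob_score=True (which requires bootstrap=True).
print("out-of-bag accuracy estimate:", forest.oob_score_)
```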
n_jobs : integer, optional (default=1)
The number of cores used for parallel computation. If -1, all cores are used.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
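Fixing random_state to an int makes training reproducible. The sketch below, on synthetic data, fits two forests with the same seed and compares their outputs; the seed value 42 is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

a = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)
b = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)

# Same seed and same data: both forests draw the same bootstrap samples
# and the same candidate features, so their outputs match.
same_proba = np.array_equal(a.predict_proba(X), b.predict_proba(X))
same_importances = np.allclose(a.feature_importances_, b.feature_importances_)
print(same_proba, same_importances)
```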
verbose : int, optional (default=0)
Controls the verbosity of the tree building process.
warm_start : bool, optional (default=False)
When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest.
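A minimal sketch of warm_start on synthetic data: the forest is first fitted with 10 trees, then n_estimators is raised to 20 and fit is called again, which adds 10 new trees while keeping the existing ones.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=10, warm_start=True, random_state=0)
forest.fit(X, y)
first_trees = list(forest.estimators_)   # copy of the list of the first 10 trees

# Raise the target size and fit again: only the missing trees are trained.
forest.set_params(n_estimators=20)
forest.fit(X, y)

kept = all(old is new for old, new in zip(first_trees, forest.estimators_))
print(len(forest.estimators_), kept)
```

Without warm_start=True, the second call to fit would discard the first 10 trees and train all 20 from scratch.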
class_weight : dict, list of dicts, “balanced”, “balanced_subsample” or None, optional (default=None)
Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one. For multi-output problems, a list of dicts can be provided in the same order as the columns of y.
The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))
The “balanced_subsample” mode is the same as “balanced” except that weights are computed based on the bootstrap sample for every tree grown.
For multi-output, the weights of each column of y will be multiplied.
Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.
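The “balanced” formula can be worked through by hand. The sketch below assumes a made-up label vector with 90 samples of class 0 and 10 of class 1, and compares the manual result with scikit-learn's compute_class_weight utility.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Imbalanced labels invented for illustration: 90 of class 0, 10 of class 1.
y = np.array([0] * 90 + [1] * 10)

n_samples = len(y)
n_classes = len(np.unique(y))
manual = n_samples / (n_classes * np.bincount(y))

from_sklearn = compute_class_weight(
    class_weight="balanced", classes=np.unique(y), y=y
)
print("manual:      ", manual)
print("from sklearn:", from_sklearn)
```

The rare class receives the larger weight, so misclassifying it costs more during training.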
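Some of the more important model attributes:
feature_importances_ : array of shape = [n_features]
The feature importances; the larger the value, the more important the feature is relative to the others.
n_features_ : int, the number of features used when the model was fitted. (Recent releases expose this as n_features_in_ instead.)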
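A short sketch of reading feature_importances_ after fitting. It assumes synthetic data in which, because shuffle=False, the first two columns are the informative ones; the number of trees and the seeds are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# With shuffle=False the informative features come first (columns 0 and 1).
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]   # feature indices, most important first
for idx in ranking:
    print(f"feature {idx}: {importances[idx]:.3f}")
```

The importances are normalized, so they are non-negative and sum to 1.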
Reference: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier
