Using Random Forests in scikit-learn

Source: Internet  Editor: 程序博客网  Time: 2024/04/30 11:14

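A random forest is a widely used ensemble classifier that uses decision trees as its base classifiers and combines them by averaging to predict the class of a sample. The Python machine learning package scikit-learn provides an implementation.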
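To make the averaging step concrete, the sketch below fits a small forest and checks that its class probabilities equal the mean of the per-tree probabilities. The synthetic dataset from make_classification and the variable names are assumptions chosen for illustration; they are not part of the original text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used here only for illustration.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Average the per-tree class probabilities by hand.
tree_probas = np.stack([tree.predict_proba(X) for tree in forest.estimators_])
manual_proba = tree_probas.mean(axis=0)

# Compare the manual average with the forest's own probabilities.
matches = np.allclose(manual_proba, forest.predict_proba(X))
print(matches)

# The predicted class is the one with the highest averaged probability.
manual_pred = forest.classes_[manual_proba.argmax(axis=1)]
```

Averaging probabilities, rather than counting one hard vote per tree, lets a confident tree weigh more than an uncertain one.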

Basic usage is as follows:

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=10)
model.fit(train_x, train_y)

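Here train_x is the feature matrix of the training samples and train_y holds the corresponding labels.
The constructor parameters of RandomForestClassifier are listed below. The signature reflects an older scikit-learn release; some defaults, such as n_estimators, have changed since then.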
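For a version that runs end to end, here is a minimal sketch that also evaluates the model on held-out data. The Iris dataset, the 70/30 split, and the names test_x, test_y, pred_y, and accuracy are assumptions added for illustration; only train_x, train_y, and model come from the snippet above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Iris is only a stand-in dataset; it is not part of the original text.
X, y = load_iris(return_X_y=True)
train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(train_x, train_y)

pred_y = model.predict(test_x)          # predicted class labels
accuracy = model.score(test_x, test_y)  # mean accuracy on the held-out data
print(pred_y[:5], accuracy)
```

Fixing random_state makes both the split and the forest reproducible between runs.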

sklearn.ensemble.RandomForestClassifier(n_estimators=10,
    criterion='gini', max_depth=None, min_samples_split=2,
    min_samples_leaf=1, min_weight_fraction_leaf=0.0,
    max_features='auto', max_leaf_nodes=None, min_impurity_split=1e-07,
    bootstrap=True, oob_score=False, n_jobs=1, random_state=None,
    verbose=0, warm_start=False, class_weight=None)
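Because constructor defaults differ between scikit-learn releases, it can help to check what the installed version actually uses. The sketch below relies on get_params(), which every scikit-learn estimator provides; the variable name params is an assumption.

```python
import sklearn
from sklearn.ensemble import RandomForestClassifier

# get_params() returns the constructor arguments of an estimator as a dict.
params = RandomForestClassifier().get_params()

print("scikit-learn version:", sklearn.__version__)
for name in sorted(params):
    print(f"{name} = {params[name]!r}")
```

The printed list may not match the signature above exactly, since parameters have been added, renamed, or removed over time.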

The main parameters are:

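  1. n_estimators : the number of trees in the forest. The default is 10 in the version shown here (100 since scikit-learn 0.22).
  2. criterion : the criterion used to measure the quality of a split. Either “gini” (Gini impurity) or “entropy” (information gain); the default is “gini”. This parameter is tree-specific.
  3. max_features: the number of features to consider when looking for the best split.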
    If int, then consider max_features features at each split.
    If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.
    If “auto”, then max_features=sqrt(n_features).
    If “sqrt”, then max_features=sqrt(n_features) (same as “auto”).
    If “log2”, then max_features=log2(n_features).
    If None, then max_features=n_features.
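  4. max_depth : the maximum depth of a tree. With the default max_depth=None, nodes are expanded until all leaves are pure (every sample in a leaf belongs to the same class) or until all leaves contain fewer than min_samples_split samples.
  5. min_samples_split : (default=2) the minimum number of samples required to split an internal node. If int, a node holding fewer samples than this value is not split further. If float, min_samples_split is a fraction and the minimum number of samples is ceil(min_samples_split * n_samples).
  6. min_samples_leaf : the minimum number of samples required at a leaf node.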
  7. min_weight_fraction_leaf : The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.
  8. max_leaf_nodes : the maximum number of leaf nodes per tree; if None, the number is unlimited.
  9. min_impurity_split : float, optional (default=1e-7)
    Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold, otherwise it is a leaf. Later scikit-learn releases deprecated this parameter in favor of min_impurity_decrease.
  10. bootstrap : boolean, optional (default=True)
    Whether bootstrap samples are used when building trees.
  11. oob_score : bool (default=False)
    Whether to use out-of-bag samples to estimate the generalization accuracy.
  12. n_jobs : integer, optional (default=1)
    The number of cores used for parallel computation. If -1, all cores are used.
  13. random_state : int, RandomState instance or None, optional (default=None)
    If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
  14. verbose : int, optional (default=0)
    Controls the verbosity of the tree building process.
  15. warm_start : bool, optional (default=False)
    When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest.
  16. class_weight : dict, list of dicts, “balanced”, “balanced_subsample” or None, optional (default=None)
    Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one. For multi-output problems, a list of dicts can be provided in the same order as the columns of y.
    The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))
    The “balanced_subsample” mode is the same as “balanced” except that weights are computed based on the bootstrap sample for every tree grown.
    For multi-output, the weights of each column of y will be multiplied.
    Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.
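Several of the parameters above can be tried together. The sketch below computes the “balanced” class weights by hand from the formula quoted in item 16, compares them with scikit-learn's compute_class_weight helper, and then fits a forest with out-of-bag scoring enabled. The imbalanced synthetic dataset and the specific parameter values are assumptions chosen for illustration, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# Imbalanced synthetic data (roughly 90% / 10%), for illustration only.
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0
)

# The "balanced" formula: n_samples / (n_classes * np.bincount(y)).
manual_weights = len(y) / (len(np.unique(y)) * np.bincount(y))
# The same weights as computed by scikit-learn's helper.
sk_weights = compute_class_weight("balanced", classes=np.unique(y), y=y)

model = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",      # sqrt(n_features) candidate features per split
    min_samples_split=0.02,   # float: ceil(0.02 * n_samples) samples
    bootstrap=True,           # out-of-bag scoring needs bootstrap samples
    oob_score=True,           # estimate accuracy from out-of-bag samples
    class_weight="balanced",
    n_jobs=-1,                # use all available cores
    random_state=0,           # fixed seed for reproducibility
)
model.fit(X, y)

print(manual_weights, sk_weights)
print(model.oob_score_)
```

The out-of-bag score gives an accuracy estimate without a separate validation split, because each tree is evaluated only on the samples left out of its bootstrap sample.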
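Some of the more important model attributes are:

  1. feature_importances_ : array of shape = [n_features]
    The feature importances; the larger the value, the more important the feature is relative to the others.
  2. n_features_ : int
    The number of features used when the model was fitted (renamed n_features_in_ in later scikit-learn releases).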
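As a rough illustration of how the importances can be read after fitting, the sketch below ranks the Iris features by importance. The dataset and the variable names are assumptions; the feature count is taken from the shape of the importance array so that the example does not depend on which attribute name the installed version exposes.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Iris is only a stand-in dataset for this example.
data = load_iris()
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(data.data, data.target)

importances = model.feature_importances_
n_features = importances.shape[0]  # one importance value per input feature

# Rank features from most to least important.
for idx in np.argsort(importances)[::-1]:
    print(f"{data.feature_names[idx]}: {importances[idx]:.3f}")
```

The importances are normalized to sum to 1, so each value can be read as a share of the total.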

Reference: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier
