2017-5-2决策树sklearn库学习

来源：互联网发布：淘宝开店卖虚拟编辑：程序博客网时间：2024/06/06 17:47

决策树sklearn库学习—
title: 2017-5-2
tags: 新建,模板,小书匠

grammar_cjkRuby: true

一、参数

class sklearn.tree.DecisionTreeClassifier(criterion=’gini’, splitter=’best’, max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_split=1e-07, class_weight=None, presort=False)
class sklearn.tree.DecisionTreeRegressor(criterion=’mse’, splitter=’best’, max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_split=1e-07, presort=False)[source]

1.criterion : string, optional (default=”gini”)
The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.
节点分割标准，默认按照GINI基尼系数分割，也可以设置成“entropy”，按照信息增益分割。一般说使用默认的基尼系数”gini”就可以了，即CART算法。除非你更喜欢类似ID3, C4.5的最优特征选择方法。
回归树：默认是MSE最小二乘方差，可以选择MAE绝对均值误差。推荐使用默认的”mse”。一般来说”mse”比”mae”更加精确。
2. splitter : string, optional (default=”best”)
The strategy used to choose the split at each node. Supported strategies are “best” to choose the best split and “random” to choose the best random split.
可以使用”best”或者”random”。前者在特征的所有划分点中找出最优的划分点。后者是随机的在部分划分点中找局部最优的划分点。

默认的”best”适合样本量不大的时候，而如果样本数据量非常大，此时决策树构建推荐”random”
3. max_features : int, float, string or None, optional (default=None)
The number of features to consider when looking for the best split:
• If int, then consider max_features features at each split.
• If float, then max_features is a percentage and int(max_features * n_features) features are considered at each split.
• If “auto”, then max_features=sqrt(n_features).
• If “sqrt”, then max_features=sqrt(n_features).
• If “log2”, then max_features=log2(n_features).
• If None, then max_features=n_features.
Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features.
划分时考虑的最大特征数（样本特征太多，可以不用全部考虑）。
默认是”None”,意味着划分时考虑所有的特征数；如果是”log2”意味着划分时最多考虑log2N个特征；如果是”sqrt”或者”auto”意味着划分时最多考虑N−−√N个特征。如果是整数，代表考虑的特征绝对数。如果是浮点数，代表考虑特征百分比，即考虑（百分比xN）取整后的特征数。其中N为样本总特征数。
一般来说，如果样本特征数不多，比如小于50，我们用默认的”None”就可以了，如果特征数非常多，我们可以灵活使用刚才描述的其他取值来控制划分时考虑的最大特征数，以控制决策树的生成时间。
4. max_depth : int or None, optional (default=None)
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
决策树的最大深度，默认是None，节点会一直扩展直到全部分开，或者小于min_samples_split samples设定值
一般来说，数据少或者特征少的时候可以不管这个值。如果模型样本量多，特征也多的情况下，推荐限制这个最大深度，具体的取值取决于数据的分布。常用的可以取值10-100之间。
5. min_samples_split : int, float, optional (default=2)
The minimum number of samples required to split an internal node:
• If int, then consider min_samples_split as the minimum number.
• If float, then min_samples_split is a percentage and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.
Changed in version 0.18: Added float values for percentages.
内部节点再划分所需样本数目，默认是2。除了整数外，也可以设置成样本数目的百分比
如果样本量不大，不需要管这个值。如果样本量数量级非常大，则推荐增大这个值。
6. min_samples_leaf : int, float, optional (default=1)
The minimum number of samples required to be at a leaf node:
• If int, then consider min_samples_leaf as the minimum number.
• If float, then min_samples_leaf is a percentage and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.
Changed in version 0.18: Added float values for percentages.
叶子节点最少样本数。如果某子节点进行划分后的叶子节点所含样本数少于该给定值，则停止该子节点的划分，否则继续划分（注意与5的区别）。
如果样本量不大，不需要管这个值。如果样本量数量级非常大，则推荐增大这个值。
7. min_weight_fraction_leaf : float, optional (default=0.)
The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.
这个值限制了叶子节点所有样本权重和的最小值，如果小于这个值，则会和兄弟节点一起分割。默认是0，就是不考虑权重问题。一般来说，如果我们有较多样本有缺失值，或者分类树样本的分布类别偏差很大，就会引入样本权重，这时我们就要注意这个值了。
8. max_leaf_nodes : int or None, optional (default=None)
Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.

最大叶子节点数
默认是”None”，即不限制最大的叶子节点数。如果加了限制，算法会建立在最大叶子节点数内最优的决策树。如果特征不多，可以不考虑这个值，但是如果特征多的话，可以加以限制，具体的值可以通过交叉验证得到。
9. class_weight : dict, list of dicts, “balanced” or None, optional (default=None)
Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one. For multi-output problems, a list of dicts can be provided in the same order as the columns of y.
The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))
For multi-output, the weights of each column of y will be multiplied.
Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.
类别权重
指定样本各类别的的权重，主要是为了防止训练集某些类别的样本过多，导致训练的决策树过于偏向这些类别。这里可以自己指定各个样本的权重，或者用“balanced”，如果使用“balanced”，则算法会自己计算权重，样本量少的类别所对应的样本权重会高。当然，如果你的样本类别分布没有明显的偏倚，则可以不管这个参数，选择默认的”None”
回归树：不适用回归树
10. random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
如果是int,random_state 是随机数字发生器的种子；如果是RandomState，random_state是随机数字发生器，如果是None，随机数字发生器是np.random使用的RandomState instance.
11. min_impurity_split : float, optional (default=1e-7)
Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold, otherwise it is a leaf.
New in version 0.18.
最小不纯度值分割，提前停止的应用策略，一个节点只有其不纯度值大于该阈值时才会被分割，否则该节点为叶子结点，不再分割。

决策树有两种节点：中间节点和叶子节点。

每个中间节点有4个参数：
a) 判定函数。是一个特征的取值。当特征小于等于这个值得时候决策路径走左边，当特征大于这个值得时候决策树走右边。
b) 不纯度值(impurity value). 是当前节点的不纯度值. 关于不纯度值得意义后面会讲到. 他反应了当前节点的预测能力. 信息熵(info entropy)和基尼指数(gini index)就是两种常用的不纯度函数
c) 覆盖样本个数(n_samples). 是指参与此节点决策的样本个数. 父亲节点(p)和两个孩子节点(l,r)的样本个数的关系为: n_samples(p) = n_samples(l) + n_samples(r) 覆盖样本个数越多, 说明判定函数越稳定. 实际上很容易看出来, 所有的叶子节点所对应的样本是总样本的一个划分.
d) 节点取值(node value). 节点取值是一个数组. 数组的长度为类目个数. value = [997, 1154] 表示在2151个样本数中, 有997个属于class1, 1154属于class2. (这是分类树的意义, 回归数不同.)

每个叶子节点有3个参数. 除了没有决策函数之外, 其他参数意义一样.
12. presort : bool, optional (default=False)
Whether to presort the data to speed up the finding of best splits in fitting. For the default settings of a decision tree on large datasets, setting this to true may slow down the training process. When using either a smaller dataset or a restricted depth, this may speed up the training.
样本是否预排序
这个值是布尔值，默认是False不排序。一般来说，如果样本量少或者限制了一个深度很小的决策树，设置为true可以让划分点选择更加快，决策树建立的更加快。如果样本量太大的话，反而没有什么好处。问题是样本量少的时候，我速度本来就不慢。所以这个值一般懒得理它就可以了。

二. 属性:

classes_ : array of shape = [n_classes] or a list of such arrays
The classes labels (single output problem), or a list of arrays of class labels (multi-output problem).
预测的类列表
feature_importances_ : array of shape = [n_features]
The feature importances. The higher, the more important the feature. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance
特征重要度
max_features_ : int,
The inferred value of max_features.
最大特征值
n_classes_ : int or list
The number of classes (for single output problems), or a list containing the number of classes for each output (for multi-output problems).
类的数目
n_features_ : int
The number of features when fit is performed.
Fit函数产生的特征数目
n_outputs_ : int
The number of outputs when fit is performed.
Fit函数输出结果数目
tree_ : Tree object
The underlying Tree object.
产生的树对象

三、调参注意点

除了这些参数要注意以外，其他在调参时的注意点有：

　　　　1）当样本少数量但是样本特征非常多的时候，决策树很容易过拟合，一般来说，样本数比特征数多一些会比较容易建立健壮的模型

　　　　2）如果样本数量少但是样本特征非常多，在拟合决策树模型前，推荐先做维度规约，比如主成分分析（PCA），特征选择（Losso）或者独立成分分析（ICA）。这样特征的维度会大大减小。再来拟合决策树模型效果会好。

　　　　3）推荐多用决策树的可视化，同时先限制决策树的深度（比如最多3层），这样可以先观察下生成的决策树里数据的初步拟合情况，然后再决定是否要增加深度。

　　　　4）在训练模型先，注意观察样本的类别情况（主要指分类树），如果类别分布非常不均匀，就要考虑用class_weight来限制模型过于偏向样本多的类别。

　　　　5）决策树的数组使用的是numpy的float32类型，如果训练数据不是这样的格式，算法会先做copy再运行。

　　　　6）如果输入的样本矩阵是稀疏的，推荐在拟合前调用csc_matrix稀疏化，在预测前调用csr_matrix稀疏化。

0 0