Decision Trees


These notes are organized from coursera material; comments and reposts are welcome.

1 A Decision Tree Example

  For example, suppose we want to evaluate a person's loan risk rating; the decision could be made as follows:
[Figure: an example decision tree for rating loan risk]

2 Decision Tree Learning: the Greedy Algorithm

First, a definition:

$Error = \dfrac{\text{number of incorrect predictions}}{\text{total number of data points}}$

Algorithm steps:

step 1: start with an empty tree
step 2: select a feature to split the data on
for each split of the tree:
    step 3: if there is nothing more to split on, stop
    step 4: otherwise, go back to step 2 and repeat
There are three questions we need to answer:

  1. How do we choose the feature to split on?
  2. When do we stop?
  3. How do we recurse?

We address each of these questions below.

3 Selecting the Best Feature to Split On

  We want each split to minimize the classification error.
  Say we have 40 data points: 22 are safe and 18 are risky. We label each group by majority vote: if a group contains more safe points than risky ones, we predict every point in it as safe; otherwise we predict every point in it as risky.
  Under this rule we would initially predict all of the data as safe, giving an error rate of
  $\dfrac{18}{18+22} = 0.45$
  Suppose at the first level we split on credit, which takes three values. The excellent branch gets 9 safe points; the fair branch gets 9 safe and 4 risky points; the risky branch gets 4 safe and 14 risky points, as in the figure below:
  [Figure: the 40 points split into excellent, fair, and risky credit branches]
  With majority-vote predictions in each branch we now make 8 mistakes, so the error rate drops to $\dfrac{8}{40} = 0.2$.
  To find the most suitable feature, we try every feature, compute the error of the resulting split, and pick the feature with the lowest error; at each later split we again pick the feature with the lowest error. A small sketch of this computation follows.
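To make the numbers above concrete, here is a minimal sketch in plain Python that reproduces the worked example; the helper names node_mistakes and split_error are only illustrative, not from the course code.

```python
from collections import Counter

def node_mistakes(labels):
    # Mistakes made by predicting the majority class for every point in the node.
    if not labels:
        return 0
    counts = Counter(labels)
    return len(labels) - max(counts.values())

def split_error(branches):
    # Classification error of a split: total mistakes / total number of points.
    total = sum(len(b) for b in branches)
    errors = sum(node_mistakes(b) for b in branches)
    return errors / float(total)

# The worked example above: +1 = safe, -1 = risky.
root = [+1] * 22 + [-1] * 18
print split_error([root])                    # 18 / 40 = 0.45

excellent = [+1] * 9
fair      = [+1] * 9 + [-1] * 4
risky     = [+1] * 4 + [-1] * 14
print split_error([excellent, fair, risky])  # 8 / 40 = 0.2
```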

4 Stopping Conditions

  When should we stop recursing?
  There are three situations in which we terminate the algorithm:
     1. All the data points in the current node belong to the same class
     2. There are no features left to split on
     3. The tree has reached the maximum depth allowed by the algorithm

5 Using the Model to Make Predictions

The basic flow of prediction:
The idea is a recursive traversal of the tree:

predict(tree_node, input)
    if current tree_node is a leaf:
        return class of this leaf
    else:
        next_node = child of tree_node whose feature value agrees with input
        return predict(next_node, input)
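As a concrete sketch of this recursion (the dictionary layout mirrors the node structure used by the code in section 7, but the tiny hand-built tree and the feature name credit_excellent are made up for illustration):

```python
def predict(node, x):
    # Leaf node: return the stored prediction.
    if node['is_leaf']:
        return node['prediction']
    # Internal node: follow the branch whose feature value matches the input.
    if x[node['splitting_feature']] == 0:
        return predict(node['left'], x)
    else:
        return predict(node['right'], x)

# A tiny hand-built stump: split on 'credit_excellent', predict +1 when it is 1.
toy_tree = {'is_leaf': False, 'splitting_feature': 'credit_excellent',
            'left':  {'is_leaf': True, 'prediction': -1},
            'right': {'is_leaf': True, 'prediction': +1}}

print predict(toy_tree, {'credit_excellent': 1})   # prints 1
```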

6 Working with Real Data

  In practice our data is usually not just 0 or 1, good or bad. Even when a feature is discrete it can take many different values; a salary, for example, could be 6578 yuan, 12346 yuan, and so on. Clearly we cannot split the data into thousands of branches, so what should we do?
  The simplest approach is to use a threshold to bin the values into two or three categories, so that each feature has only a few distinct values, which makes building the decision tree easier.
  How do we choose the threshold for each feature? We can use an algorithm along the lines of the one in the figure below (a code sketch follows):
  [Figure: algorithm for choosing a split threshold on a numeric feature]
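The course slide is not reproduced here, but a common recipe, which I assume is roughly what the figure describes, is to sort the observed values of the feature, treat the midpoint between each pair of consecutive distinct values as a candidate threshold, and keep the threshold whose two-way split has the lowest classification error. A minimal sketch in plain Python, with made-up data and hypothetical helper names:

```python
def mistakes(labels):
    # Mistakes made by a majority-vote prediction on this list of +1/-1 labels.
    if not labels:
        return 0
    num_safe = labels.count(+1)
    return min(num_safe, len(labels) - num_safe)

def best_threshold(values, labels):
    # Try the midpoint between every pair of consecutive distinct sorted values
    # and keep the threshold that minimizes the error of the two-way split.
    pairs = sorted(zip(values, labels))
    best_t, best_err = None, float('inf')
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue
        t = (pairs[i][0] + pairs[i + 1][0]) / 2.0
        left  = [l for v, l in pairs if v < t]
        right = [l for v, l in pairs if v >= t]
        err = (mistakes(left) + mistakes(right)) / float(len(pairs))
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err

# Example: a handful of salaries labeled safe (+1) or risky (-1).
salaries = [6578, 12346, 30000, 4500, 52000, 8800]
labels   = [  -1,    +1,    +1,   -1,    +1,   -1]
print best_threshold(salaries, labels)   # (10573.0, 0.0) on this toy data
```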

7 Implementing Decision Tree Construction Yourself

The data and code files can be downloaded here.
The code below is somewhat challenging and requires some understanding of recursion; this introduction to recursion covers the basics.

import graphlab

loans = graphlab.SFrame('lending-club-data.gl/')

# Like the previous assignment, reassign the labels to have +1 for a safe loan
# and -1 for a risky (bad) loan.
loans['safe_loans'] = loans['bad_loans'].apply(lambda x : +1 if x == 0 else -1)
loans = loans.remove_column('bad_loans')

# Choose the features we will use; for simplicity we keep only a few.
features = ['grade',              # grade of the loan
            'term',               # the term of the loan
            'home_ownership',     # home_ownership status: own, mortgage or rent
            'emp_length',         # number of years of employment
           ]
target = 'safe_loans'
loans = loans[features + [target]]

# Balance the data so that the numbers of safe and risky loans are roughly equal.
safe_loans_raw = loans[loans[target] == 1]
risky_loans_raw = loans[loans[target] == -1]

# Since there are fewer risky loans than safe loans, find the ratio of the sizes
# and use that percentage to undersample the safe loans.
percentage = len(risky_loans_raw) / float(len(safe_loans_raw))
safe_loans = safe_loans_raw.sample(percentage, seed=1)
risky_loans = risky_loans_raw
loans_data = risky_loans.append(safe_loans)

print "Percentage of safe loans                 :", len(safe_loans) / float(len(loans_data))
print "Percentage of risky loans                :", len(risky_loans) / float(len(loans_data))
print "Total number of loans in our new dataset :", len(loans_data)

# As described above, turn each categorical feature into binary 0/1 columns
# (one-hot encoding), so every feature used for splitting is binary.
# This code uses some python/graphlab tricks; you can use it without fully understanding it.
loans_data = risky_loans.append(safe_loans)
for feature in features:
    loans_data_one_hot_encoded = loans_data[feature].apply(lambda x: {x: 1})
    loans_data_unpacked = loans_data_one_hot_encoded.unpack(column_name_prefix=feature)
    # Change None's to 0's
    for column in loans_data_unpacked.column_names():
        loans_data_unpacked[column] = loans_data_unpacked[column].fillna(0)
    loans_data.remove_column(feature)
    loans_data.add_columns(loans_data_unpacked)

features = loans_data.column_names()
features.remove('safe_loans')  # Remove the response variable

train_data, test_data = loans_data.random_split(.8, seed=1)

# Function that counts the number of misclassified examples in a node
# when it predicts the majority class.
def intermediate_node_num_mistakes(labels_in_node):
    # Corner case: if labels_in_node is empty, return 0
    if len(labels_in_node) == 0:
        return 0
    # Count the number of +1's (safe loans)
    num_safe = sum(labels_in_node.apply(lambda x: 1 if x == 1 else 0))
    # Count the number of -1's (risky loans)
    num_risky = sum(labels_in_node.apply(lambda x: 1 if x == -1 else 0))
    # Return the number of mistakes that the majority classifier makes:
    # the count of the minority class.
    if num_safe > num_risky:
        return num_risky
    else:
        return num_safe

# Find the best feature to split on for the current node.
def best_splitting_feature(data, features, target):
    best_feature = None # Keep track of the best feature
    best_error = 10     # Keep track of the best error so far
    # Note: Since error is always <= 1, we should initialize it with something larger than 1.

    # Convert to float to make sure error gets computed correctly.
    num_data_points = float(len(data))

    # Loop through each feature to consider splitting on that feature
    for feature in features:
        # The left split will have all data points where the feature value is 0
        left_split = data[data[feature] == 0]
        # The right split will have all data points where the feature value is 1
        right_split = data[data[feature] == 1]

        # Calculate the number of misclassified examples in each split,
        # using intermediate_node_num_mistakes defined above.
        left_mistakes = intermediate_node_num_mistakes(left_split[target])
        right_mistakes = intermediate_node_num_mistakes(right_split[target])

        # Compute the classification error of this split.
        # Error = (# of mistakes (left) + # of mistakes (right)) / (# of data points)
        error = (left_mistakes + right_mistakes + 0.0) / num_data_points

        # If this is the best error found so far, remember the feature and the error.
        if error < best_error:
            best_error = error
            best_feature = feature

    return best_feature # Return the best feature we found

# Function that builds a leaf node.
'''A node dictionary has 5 fields:
   'is_leaf'            : True/False.
   'prediction'         : Prediction at the leaf node.
   'left'               : (dictionary corresponding to the left tree).
   'right'              : (dictionary corresponding to the right tree).
   'splitting_feature'  : The feature that this node splits on.
'''
def create_leaf(target_values):
    # Create a leaf node
    leaf = {'splitting_feature' : None,
            'left' : None,
            'right' : None,
            'is_leaf': True}

    # Count the number of data points that are +1 and -1 in this node.
    num_ones = len(target_values[target_values == +1])
    num_minus_ones = len(target_values[target_values == -1])

    # For the leaf node, set the prediction to be the majority class.
    # Store the predicted class (+1 or -1) in leaf['prediction'].
    if num_ones > num_minus_ones:
        leaf['prediction'] = +1
    else:
        leaf['prediction'] = -1

    # Return the leaf node
    return leaf

# Build the decision tree.
'''
remaining_features stores the features still available at this level.
If one of the three stopping conditions is met, create a leaf node.
Otherwise:
  pick the best splitting feature (using the function written above),
  split the data on that feature into left and right,
  if either side contains all of the data the split is "perfect", so create a leaf,
  otherwise recurse on left and right.
Returns a dictionary representing the node structure.
'''
def decision_tree_create(data, features, target, current_depth=0, max_depth=10):
    remaining_features = features[:] # Make a copy of the features.
    target_values = data[target]
    print "--------------------------------------------------------------------"
    print "Subtree, depth = %s (%s data points)." % (current_depth, len(target_values))

    # Stopping condition 1: no mistakes at the current node
    # (all data points here already have the same label).
    if intermediate_node_num_mistakes(target_values) == 0:
        print "Stopping condition 1 reached."
        return create_leaf(target_values)

    # Stopping condition 2: no remaining features to consider splitting on.
    if not remaining_features:
        print "Stopping condition 2 reached."
        return create_leaf(target_values)

    # Additional stopping condition: limit tree depth.
    if current_depth >= max_depth:
        print "Reached maximum depth. Stopping for now."
        return create_leaf(target_values)

    # Find the best splitting feature using best_splitting_feature implemented above.
    splitting_feature = best_splitting_feature(data, remaining_features, target)

    # Split on the best feature that we found.
    left_split = data[data[splitting_feature] == 0]
    right_split = data[data[splitting_feature] == 1]
    remaining_features.remove(splitting_feature)
    print "Split on feature %s. (%s, %s)" % (
        splitting_feature, len(left_split), len(right_split))

    # Create a leaf node if the split is "perfect"
    if len(left_split) == len(data):
        print "Creating leaf node."
        return create_leaf(left_split[target])
    if len(right_split) == len(data):
        print "Creating leaf node."
        return create_leaf(right_split[target])

    # Repeat (recurse) on left and right subtrees
    left_tree = decision_tree_create(left_split, remaining_features, target, current_depth + 1, max_depth)
    right_tree = decision_tree_create(right_split, remaining_features, target, current_depth + 1, max_depth)

    return {'is_leaf'          : False,
            'prediction'       : None,
            'splitting_feature': splitting_feature,
            'left'             : left_tree,
            'right'            : right_tree}

# Count the number of nodes in a tree.
def count_nodes(tree):
    if tree['is_leaf']:
        return 1
    return 1 + count_nodes(tree['left']) + count_nodes(tree['right'])

# Train a model.
my_decision_tree = decision_tree_create(train_data, features, target, current_depth=0, max_depth=6)

# Prediction function.
def classify(tree, x, annotate=False):
    # If the node is a leaf node, return its prediction.
    if tree['is_leaf']:
        if annotate:
            print "At leaf, predicting %s" % tree['prediction']
        return tree['prediction']
    else:
        # Otherwise follow the branch matching this data point's feature value.
        split_feature_value = x[tree['splitting_feature']]
        if annotate:
            print "Split on %s = %s" % (tree['splitting_feature'], split_feature_value)
        if split_feature_value == 0:
            return classify(tree['left'], x, annotate)
        else:
            return classify(tree['right'], x, annotate)

# Compute the classification error on a dataset.
def evaluate_classification_error(tree, data, target):
    # Apply classify(tree, x) to each row of the data.
    prediction = data.apply(lambda x: classify(tree, x))
    real = data[target]
    # Labels are +1/-1, so |prediction - real| is 2 for each mistake.
    err = sum(abs(prediction - real)) / 2
    # Return the fraction of mistakes.
    return err / (len(data) + 0.0)

# Print the split at a node as a simple ASCII diagram.
def print_stump(tree, name='root'):
    split_name = tree['splitting_feature'] # split_name is something like 'term. 36 months'
    if split_name is None:
        print "(leaf, label: %s)" % tree['prediction']
        return None
    split_feature, split_value = split_name.split('.')
    print '                       %s' % name
    print '         |---------------|----------------|'
    print '         |                                |'
    print '         |                                |'
    print '         |                                |'
    print '  [{0} == 0]               [{0} == 1]    '.format(split_name)
    print '         |                                |'
    print '         |                                |'
    print '         |                                |'
    print '    (%s)                         (%s)' \
        % (('leaf, label: ' + str(tree['left']['prediction']) if tree['left']['is_leaf'] else 'subtree'),
           ('leaf, label: ' + str(tree['right']['prediction']) if tree['right']['is_leaf'] else 'subtree'))
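If the GraphLab Create environment and the lending-club-data.gl dataset from the course are available, the trained tree can then be inspected and evaluated with the functions defined above, for example:

```python
# Classify one held-out example and show the path taken through the tree.
print test_data[0]
print 'Predicted class: %s' % classify(my_decision_tree, test_data[0], annotate=True)

# Classification error on the held-out test set.
print 'Test error:', evaluate_classification_error(my_decision_tree, test_data, target)

# Size of the tree and the split made at the root.
print 'Number of nodes:', count_nodes(my_decision_tree)
print_stump(my_decision_tree)
```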