Decision Trees
This content is adapted from a Coursera course; comments and reposts are welcome.
1 A Decision Tree Example
For example, suppose we want to assess the risk level of a loan applicant. A decision tree for this task asks a sequence of questions about the applicant (credit history, loan term, income, and so on), follows a branch at each answer, and finally arrives at a prediction of "safe" or "risky".
2 Decision Tree Learning: the Greedy Algorithm
First, a definition: a greedy algorithm builds the tree top-down by making the locally best choice at each step. At every node it picks the single split that most reduces the classification error, and it never revisits or undoes an earlier choice.

The algorithm proceeds as follows: start with an empty tree and all of the data at the root; pick a feature to split on; then, for each branch of the split, either stop and make a prediction or recurse on the subset of data that reaches that branch.
This leaves three questions to answer:
- How do we choose the feature to split on?
- When do we stop?
- How do we recurse?
Below we address each of these in turn; the sketch after this paragraph previews how the pieces fit together.
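As a preview, here is a minimal runnable sketch of the whole greedy procedure on plain Python lists (0/1 features, +1/-1 labels). All names here are illustrative, not the course's code; the full GraphLab version is developed in Section 7.

```python
# Greedy decision tree construction: at each node, pick the single best
# feature, split, and recurse, never revisiting earlier choices.
def mistakes(labels):
    # Errors of the majority-class prediction = size of the minority class.
    return min(labels.count(+1), labels.count(-1))

def build_tree(rows, labels, features, depth=0, max_depth=3):
    # The three stopping conditions discussed below.
    if mistakes(labels) == 0 or not features or depth >= max_depth:
        return {'is_leaf': True,
                'prediction': +1 if labels.count(+1) >= labels.count(-1) else -1}

    # Greedy step: the feature whose binary split misclassifies the fewest points.
    def split_error(f):
        left  = [y for r, y in zip(rows, labels) if r[f] == 0]
        right = [y for r, y in zip(rows, labels) if r[f] == 1]
        return mistakes(left) + mistakes(right)
    best = min(features, key=split_error)

    rest = [f for f in features if f != best]
    left_part  = [(r, y) for r, y in zip(rows, labels) if r[best] == 0]
    right_part = [(r, y) for r, y in zip(rows, labels) if r[best] == 1]
    return {'is_leaf': False, 'feature': best,
            'left':  build_tree([r for r, _ in left_part],
                                [y for _, y in left_part],  rest, depth + 1, max_depth),
            'right': build_tree([r for r, _ in right_part],
                                [y for _, y in right_part], rest, depth + 1, max_depth)}

# Tiny made-up example: one binary feature separates the labels perfectly.
rows   = [{'good_credit': 1}, {'good_credit': 1}, {'good_credit': 0}]
labels = [+1, +1, -1]
print build_tree(rows, labels, ['good_credit'])
```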
3 Choosing the Best Feature to Split On
At each split we want the resulting classification error to be as low as possible.
Say we have 40 data points: 22 safe and 18 risky. Each node predicts by majority vote: if a node contains more safe points than risky ones, we label everything in it safe; otherwise we label everything in it risky.
Under this rule, the root node predicts that every loan is safe, giving an error rate of 18 / 40 = 0.45.
Suppose at the first level we split on credit quality into three groups: excellent (9 safe, 0 risky), fair (9 safe, 4 risky), and risky (4 safe, 14 risky).
Majority vote now makes 8 mistakes (0 in excellent, 4 in fair, 4 in risky), so the error rate drops to 8 / 40 = 0.2.
To find the most suitable feature, we try every available feature, compute the error of the resulting split, and keep the feature with the lowest error; we then repeat this search at every subsequent split.
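A quick check of the toy numbers above (counts taken directly from this section):

```python
def node_mistakes(num_safe, num_risky):
    # Majority-vote prediction errs on every point in the minority class.
    return min(num_safe, num_risky)

# Before splitting: 22 safe, 18 risky out of 40.
print node_mistakes(22, 18) / 40.0                        # 0.45

# After splitting on credit: excellent (9, 0), fair (9, 4), risky (4, 14).
groups = [(9, 0), (9, 4), (4, 14)]
print sum(node_mistakes(s, r) for s, r in groups) / 40.0  # 8 / 40 = 0.2
```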
4 Stopping Conditions
When should the recursion stop?
There are three situations in which we terminate the algorithm:
1. All data points at the node belong to the same class
2. There are no features left to split on
3. The tree has reached the maximum depth allowed by the algorithm
5 Making Predictions with the Model
The basic flow is again recursive: start at the root, look up the data point's value for the node's splitting feature, follow the matching branch, and repeat until a leaf is reached. The leaf's majority-class label is the prediction.
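A minimal sketch of this recursion, using the same node-dictionary layout as the code in Section 7 (the tiny tree here is made up purely for illustration):

```python
# Walk from the root to a leaf, following the branch that matches the
# data point's value for each splitting feature.
def classify(tree, x):
    if tree['is_leaf']:
        return tree['prediction']
    if x[tree['splitting_feature']] == 0:
        return classify(tree['left'], x)
    return classify(tree['right'], x)

# A hand-built one-level tree, for illustration only.
leaf_safe  = {'is_leaf': True, 'prediction': +1,
              'splitting_feature': None, 'left': None, 'right': None}
leaf_risky = {'is_leaf': True, 'prediction': -1,
              'splitting_feature': None, 'left': None, 'right': None}
toy_tree = {'is_leaf': False, 'prediction': None,
            'splitting_feature': 'grade.A',
            'left': leaf_risky, 'right': leaf_safe}

print classify(toy_tree, {'grade.A': 1})   # +1 (safe)
print classify(toy_tree, {'grade.A': 0})   # -1 (risky)
```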
6 Using the Algorithm on Real Data
In practice our data are rarely just 0/1 or good/bad. Even when a feature is discrete, it can take many distinct values: a salary might be 6578 yuan, 12346 yuan, and so on. Clearly we cannot split the data into thousands of branches, so what should we do?
The simplest fix is to pick a threshold and bucket the values into two or three groups, which keeps the number of branches small and makes the tree manageable.
How do we choose the threshold for each feature? A standard approach: sort the observed values of the feature, consider a candidate threshold between each pair of adjacent values, compute the classification error of the resulting two-way split, and keep the threshold with the lowest error.
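A sketch of that idea on toy salary data (this is one common version of the procedure, not necessarily the course's exact code):

```python
# Pick the threshold t for a numeric feature that minimizes classification
# error when splitting into (< t) and (>= t). Labels are +1/-1.
def node_mistakes(labels):
    return min(labels.count(+1), labels.count(-1))

def best_threshold(values, labels):
    pairs = sorted(zip(values, labels))
    best_t, best_err = None, len(labels) + 1
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2.0   # midpoint candidate
        left  = [y for v, y in pairs if v <  t]
        right = [y for v, y in pairs if v >= t]
        err = node_mistakes(left) + node_mistakes(right)
        if err < best_err:
            best_t, best_err = t, err
    return best_t

salaries = [6578, 12346, 8000, 30000, 25000, 9500]
labels   = [  -1,    +1,   -1,    +1,    +1,   -1]
print best_threshold(salaries, labels)   # separates low from high salaries
```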
7 Implementing Decision Tree Construction Yourself
The data and code files can be downloaded here.
The code below is somewhat involved and assumes you are comfortable with recursion; see this introduction to recursion for the basics.
```python
import graphlab
loans = graphlab.SFrame('lending-club-data.gl/')

# Like the previous assignment, we reassign the labels to have +1 for a safe
# loan, and -1 for a risky (bad) loan.
loans['safe_loans'] = loans['bad_loans'].apply(lambda x : +1 if x == 0 else -1)
loans = loans.remove_column('bad_loans')

# Pick the features we will use; to keep things simple we use only a few.
features = ['grade',          # grade of the loan
            'term',           # the term of the loan
            'home_ownership', # home_ownership status: own, mortgage or rent
            'emp_length',     # number of years of employment
           ]
target = 'safe_loans'
loans = loans[features + [target]]

# Balance the data so that safe and risky loans appear in roughly equal numbers.
safe_loans_raw = loans[loans[target] == 1]
risky_loans_raw = loans[loans[target] == -1]

# Since there are fewer risky loans than safe loans, find the ratio of the sizes
# and use that percentage to undersample the safe loans.
percentage = len(risky_loans_raw) / float(len(safe_loans_raw))
safe_loans = safe_loans_raw.sample(percentage, seed = 1)
risky_loans = risky_loans_raw
loans_data = risky_loans.append(safe_loans)

print "Percentage of safe loans  :", len(safe_loans) / float(len(loans_data))
print "Percentage of risky loans :", len(risky_loans) / float(len(loans_data))
print "Total number of loans in our new dataset :", len(loans_data)

# As described in Section 6, turn each feature into binary 0/1 columns
# (one-hot encoding). This step uses some Python/GraphLab tricks; you can use
# it as-is without studying the details.
for feature in features:
    loans_data_one_hot_encoded = loans_data[feature].apply(lambda x: {x: 1})
    loans_data_unpacked = loans_data_one_hot_encoded.unpack(column_name_prefix=feature)
    # Change None's to 0's
    for column in loans_data_unpacked.column_names():
        loans_data_unpacked[column] = loans_data_unpacked[column].fillna(0)
    loans_data.remove_column(feature)
    loans_data.add_columns(loans_data_unpacked)

features = loans_data.column_names()
features.remove('safe_loans')  # Remove the response variable

train_data, test_data = loans_data.random_split(.8, seed = 1)

# Count the mistakes the majority classifier makes in a single node.
def intermediate_node_num_mistakes(labels_in_node):
    # Corner case: if labels_in_node is empty, return 0.
    if len(labels_in_node) == 0:
        return 0
    # Count the number of +1's (safe loans) and -1's (risky loans).
    num_safe = sum(labels_in_node.apply(lambda x: 1 if x == 1 else 0))
    num_risky = sum(labels_in_node.apply(lambda x: 1 if x == -1 else 0))
    # The majority classifier errs on every point in the minority class.
    return min(num_safe, num_risky)

# Find the best feature to split on in this iteration of the algorithm.
def best_splitting_feature(data, features, target):
    best_feature = None  # Keep track of the best feature
    best_error = 10      # Keep track of the best error so far
    # Note: since the error is always <= 1, we initialize it with something
    # larger than 1.

    # Convert to float to make sure the error gets computed correctly.
    num_data_points = float(len(data))

    # Loop through each feature to consider splitting on it.
    for feature in features:
        # The left split has all data points where the feature value is 0;
        # the right split has all data points where the feature value is 1.
        left_split = data[data[feature] == 0]
        right_split = data[data[feature] == 1]

        # Count the misclassified examples in each split, using
        # intermediate_node_num_mistakes implemented above.
        left_mistakes = intermediate_node_num_mistakes(left_split[target])
        right_mistakes = intermediate_node_num_mistakes(right_split[target])

        # Classification error of this split:
        # error = (# mistakes (left) + # mistakes (right)) / (# data points)
        error = (left_mistakes + right_mistakes) / num_data_points

        # If this is the best error found so far, remember the feature and error.
        if error < best_error:
            best_error = error
            best_feature = feature

    return best_feature  # Return the best feature we found.

# Build a leaf node. Every node is a dictionary with 5 keys:
#   'is_leaf'           : True/False
#   'prediction'        : prediction at the leaf node
#   'left'              : dictionary corresponding to the left subtree
#   'right'             : dictionary corresponding to the right subtree
#   'splitting_feature' : the feature that this node splits on
def create_leaf(target_values):
    # Create a leaf node.
    leaf = {'splitting_feature' : None,
            'left'    : None,
            'right'   : None,
            'is_leaf' : True}

    # Count the number of data points that are +1 and -1 in this node.
    num_ones = len(target_values[target_values == +1])
    num_minus_ones = len(target_values[target_values == -1])

    # For the leaf node, set the prediction to the majority class and store
    # the predicted class (+1 or -1) in leaf['prediction'].
    if num_ones > num_minus_ones:
        leaf['prediction'] = +1
    else:
        leaf['prediction'] = -1

    return leaf

# Build the decision tree recursively:
#   - remaining_features holds the features still available at this level
#   - each of the three stopping conditions creates a leaf node
#   - otherwise, pick the best splitting feature (using the function written
#     above) and split the data into left and right
#   - if one side receives all the data, the split is "perfect": create a
#     leaf and stop
#   - otherwise recurse on left and right
#   - the return value is a dictionary describing the node
def decision_tree_create(data, features, target, current_depth = 0, max_depth = 10):
    remaining_features = features[:]  # Make a copy of the features.
    target_values = data[target]
    print "--------------------------------------------------------------------"
    print "Subtree, depth = %s (%s data points)." % (current_depth, len(target_values))

    # Stopping condition 1: all points at this node have the same label, so
    # the majority classifier makes no mistakes here.
    if intermediate_node_num_mistakes(target_values) == 0:
        print "Stopping condition 1 reached."
        return create_leaf(target_values)

    # Stopping condition 2: no remaining features to consider splitting on.
    # (Note: the check must be against the empty list, not None.)
    if remaining_features == []:
        print "Stopping condition 2 reached."
        return create_leaf(target_values)

    # Additional stopping condition: maximum tree depth reached.
    if current_depth >= max_depth:
        print "Reached maximum depth. Stopping for now."
        return create_leaf(target_values)

    # Find the best splitting feature among the remaining features.
    splitting_feature = best_splitting_feature(data, remaining_features, target)

    # Split on the best feature that we found.
    left_split = data[data[splitting_feature] == 0]
    right_split = data[data[splitting_feature] == 1]
    remaining_features.remove(splitting_feature)
    print "Split on feature %s. (%s, %s)" % (
        splitting_feature, len(left_split), len(right_split))

    # Create a leaf node if the split is "perfect".
    if len(left_split) == len(data):
        print "Creating leaf node."
        return create_leaf(left_split[target])
    if len(right_split) == len(data):
        print "Creating leaf node."
        return create_leaf(right_split[target])

    # Repeat (recurse) on the left and right subtrees.
    left_tree = decision_tree_create(left_split, remaining_features, target,
                                     current_depth + 1, max_depth)
    right_tree = decision_tree_create(right_split, remaining_features, target,
                                      current_depth + 1, max_depth)

    return {'is_leaf'          : False,
            'prediction'       : None,
            'splitting_feature': splitting_feature,
            'left'             : left_tree,
            'right'            : right_tree}

# Count the number of nodes in a tree.
def count_nodes(tree):
    if tree['is_leaf']:
        return 1
    return 1 + count_nodes(tree['left']) + count_nodes(tree['right'])

# Train a model.
my_decision_tree = decision_tree_create(train_data, features, target,
                                        current_depth = 0, max_depth = 6)

# Predict by walking down the tree until a leaf is reached.
def classify(tree, x, annotate = False):
    # If the node is a leaf node, return its prediction.
    if tree['is_leaf']:
        if annotate:
            print "At leaf, predicting %s" % tree['prediction']
        return tree['prediction']
    else:
        # Split on the node's feature.
        split_feature_value = x[tree['splitting_feature']]
        if annotate:
            print "Split on %s = %s" % (tree['splitting_feature'], split_feature_value)
        if split_feature_value == 0:
            return classify(tree['left'], x, annotate)
        else:
            return classify(tree['right'], x, annotate)

# Compute the classification error on a dataset.
def evaluate_classification_error(tree, data, target):
    # Apply classify(tree, x) to each row in the data.
    prediction = data.apply(lambda x: classify(tree, x))
    real = data[target]
    # Labels are +1/-1, so |prediction - real| is 0 when correct and 2 when
    # wrong; dividing the sum by 2 gives the number of mistakes.
    num_mistakes = sum(abs(prediction - real)) / 2
    return num_mistakes / float(len(data))

# Print a one-level "stump" view of a node.
def print_stump(tree, name = 'root'):
    split_name = tree['splitting_feature']  # e.g. 'term. 36 months'
    if split_name is None:
        print "(leaf, label: %s)" % tree['prediction']
        return None
    split_feature, split_value = split_name.split('.')
    print '                       %s' % name
    print '         |---------------|----------------|'
    print '         |                                |'
    print '         |                                |'
    print '         |                                |'
    print '  [{0} == 0]               [{0} == 1]    '.format(split_name)
    print '         |                                |'
    print '         |                                |'
    print '         |                                |'
    print '    (%s)                         (%s)' \
        % (('leaf, label: ' + str(tree['left']['prediction'])
            if tree['left']['is_leaf'] else 'subtree'),
           ('leaf, label: ' + str(tree['right']['prediction'])
            if tree['right']['is_leaf'] else 'subtree'))
```
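With everything above defined, a typical session might look like the following (exact output depends on the random split):

```python
# Error of the trained tree on the held-out data.
print evaluate_classification_error(my_decision_tree, test_data, target)

# Inspect the split at the root of the tree.
print_stump(my_decision_tree)

# Trace one prediction step by step.
print classify(my_decision_tree, test_data[0], annotate = True)
```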