Machine Learning in Action, Decision Trees: Picking Glasses for the Glasses Guy


Welcome to follow my personal blog: blog.timene.com

A decision tree is an extremely easy-to-understand algorithm: once the model is built, it is just a chain of nested if...else... statements (or a nested switch).

Pros: low computational cost; output that is easy to interpret; insensitivity to missing intermediate values; ability to handle irrelevant features.

Cons: prone to overfitting.

Applicable data types: numeric and nominal.


A Python implementation of decision trees:

(I) First, a few utility functions: computing entropy, splitting the data set, and finding the majority class.

(1) Computing entropy: entropy measures how disordered a set is; the more disordered the set, the higher the entropy.

def entropy(dataset):
    from math import log
    log2 = lambda x: log(x) / log(2)
    # count occurrences of each class label (last column of each row)
    results = {}
    for row in dataset:
        r = row[-1]
        results[r] = results.get(r, 0) + 1
    ent = 0.0
    for r in results:
        p = float(results[r]) / len(dataset)
        ent -= p * log2(p)
    return ent
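As a quick sanity check (a toy example, not from the original data set): a set split evenly between two classes has entropy 1 bit, while a pure set has entropy 0.

```python
from math import log

def entropy(dataset):
    # same function as above: class label is the last column of each row
    log2 = lambda x: log(x) / log(2)
    results = {}
    for row in dataset:
        results[row[-1]] = results.get(row[-1], 0) + 1
    ent = 0.0
    for r in results:
        p = float(results[r]) / len(dataset)
        ent -= p * log2(p)
    return ent

mixed = [[1, 'yes'], [1, 'yes'], [0, 'no'], [0, 'no']]  # 50/50 split
pure = [[1, 'yes'], [0, 'yes']]                         # single class
print(entropy(mixed))  # 1.0
print(entropy(pure))   # 0.0
```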

(2) Fetching the subset of the data matching a feature value:

def fetch_subdataset(dataset, k, v):
    return [d[:k] + d[k+1:] for d in dataset if d[k] == v]

This function is a single line. What it does: take the rows of dataset whose k-th column equals v, and drop the k-th column from each returned row. Python's simple elegance is on full display here.
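A small usage sketch (toy rows, not from the post's data) makes the two effects visible: rows are filtered on column k, and that column is removed from the result.

```python
def fetch_subdataset(dataset, k, v):
    # rows where column k == v, with column k removed
    return [d[:k] + d[k+1:] for d in dataset if d[k] == v]

rows = [['a', 1, 'x'], ['b', 1, 'y'], ['a', 2, 'z']]
print(fetch_subdataset(rows, 1, 1))  # [['a', 'x'], ['b', 'y']]
```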

(3) Finding the majority class. While building the tree, if all decision features have been consumed and the data still cannot be uniquely classified, we fall back to majority vote to pick the final class:

def get_max_feature(class_list):
    class_count = {}
    for cla in class_list:
        class_count[cla] = class_count.get(cla, 0) + 1
    # sort classes by count, descending, and return the most frequent one
    sorted_class_count = sorted(class_count.items(), key=lambda d: d[1], reverse=True)
    return sorted_class_count[0][0]
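For example (toy labels, not from the post): given a list of class labels, the function returns the label that appears most often.

```python
def get_max_feature(class_list):
    # majority vote over a flat list of class labels
    class_count = {}
    for cla in class_list:
        class_count[cla] = class_count.get(cla, 0) + 1
    return sorted(class_count.items(), key=lambda d: d[1], reverse=True)[0][0]

print(get_max_feature(['soft', 'hard', 'soft', 'no lenses']))  # soft
```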

(II) Choosing the best way to split the data:

Which column's values should we split the set on to obtain the largest information gain?

def choose_decision_feature(dataset):
    best_ent, feature = float('inf'), -1
    # try each feature column (the last column is the class label)
    for i in range(len(dataset[0]) - 1):
        feat_list = [e[i] for e in dataset]
        unq_feat_list = set(feat_list)
        ent_t = 0.0
        for f in unq_feat_list:
            sub_data = fetch_subdataset(dataset, i, f)
            # weighted entropy of the subsets after splitting on column i
            ent_t += entropy(sub_data) * len(sub_data) / len(dataset)
        if ent_t < best_ent:
            best_ent, feature = ent_t, i
    return feature
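Minimizing the weighted entropy of the children is the same as maximizing information gain, since the parent's entropy is fixed. A worked toy example (data invented for illustration, not from the post): column 0 separates the classes better than column 1, so it is chosen.

```python
from math import log

def entropy(dataset):
    log2 = lambda x: log(x) / log(2)
    results = {}
    for row in dataset:
        results[row[-1]] = results.get(row[-1], 0) + 1
    return -sum(float(c) / len(dataset) * log2(float(c) / len(dataset))
                for c in results.values())

def fetch_subdataset(dataset, k, v):
    return [d[:k] + d[k+1:] for d in dataset if d[k] == v]

def choose_decision_feature(dataset):
    best_ent, feature = float('inf'), -1
    for i in range(len(dataset[0]) - 1):
        ent_t = 0.0
        for f in set(e[i] for e in dataset):
            sub = fetch_subdataset(dataset, i, f)
            ent_t += entropy(sub) * len(sub) / len(dataset)
        if ent_t < best_ent:
            best_ent, feature = ent_t, i
    return feature

# toy data: two binary features, last column is the class
data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print(choose_decision_feature(data))  # 0 -- splitting on column 0 leaves less entropy
```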

(III) Recursively building the decision tree:

def build_decision_tree(dataset, datalabel):
    cla = [c[-1] for c in dataset]
    # all samples share one class: return it as a leaf
    if len(cla) == cla.count(cla[0]):
        return cla[0]
    # no features left: fall back to majority vote over the class list
    if len(dataset[0]) == 1:
        return get_max_feature(cla)
    feature = choose_decision_feature(dataset)
    feature_label = datalabel[feature]
    decision_tree = {feature_label: {}}
    del datalabel[feature]
    feat_value = [d[feature] for d in dataset]
    unique_feat_value = set(feat_value)
    for value in unique_feat_value:
        sub_label = datalabel[:]
        decision_tree[feature_label][value] = build_decision_tree(
            fetch_subdataset(dataset, feature, value), sub_label)
    return decision_tree

(Note that the majority-vote fallback must be passed the class list cla, not the whole dataset.)


(IV) Using the decision tree

def classify(decision_tree, feat_labels, testVec):
    # the root key names the feature to test at this node
    label = next(iter(decision_tree))
    next_dict = decision_tree[label]
    feat_index = feat_labels.index(label)
    for key in next_dict:
        if testVec[feat_index] == key:
            if isinstance(next_dict[key], dict):
                # internal node: recurse into the subtree
                return classify(next_dict[key], feat_labels, testVec)
            # leaf node: return the class label
            return next_dict[key]
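A usage sketch with a hand-built tree fragment in the shape build_decision_tree produces (the values here are illustrative, not the tree the post actually learns):

```python
def classify(decision_tree, feat_labels, testVec):
    # walk the nested dict: each level tests one feature of testVec
    label = next(iter(decision_tree))
    next_dict = decision_tree[label]
    feat_index = feat_labels.index(label)
    for key in next_dict:
        if testVec[feat_index] == key:
            if isinstance(next_dict[key], dict):
                return classify(next_dict[key], feat_labels, testVec)
            return next_dict[key]

tree = {'tearRate': {'reduced': 'no lenses',
                     'normal': {'astigmatic': {'no': 'soft', 'yes': 'hard'}}}}
labels = ['age', 'prescript', 'astigmatic', 'tearRate']
print(classify(tree, labels, ['young', 'myope', 'no', 'normal']))   # soft
print(classify(tree, labels, ['young', 'myope', 'yes', 'reduced'])) # no lenses
```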

(V) Persisting the decision tree

(1) Saving

def store_decision_tree(tree, filename):
    import pickle
    with open(filename, 'wb') as f:  # pickle requires binary mode
        pickle.dump(tree, f)

(2) Loading

def load_decision_tree(filename):
    import pickle
    with open(filename, 'rb') as f:
        return pickle.load(f)
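A round-trip sketch (the temp-file path is invented for the demo): the loaded tree compares equal to the one that was stored.

```python
import os
import pickle
import tempfile

def store_decision_tree(tree, filename):
    with open(filename, 'wb') as f:  # binary mode for pickle
        pickle.dump(tree, f)

def load_decision_tree(filename):
    with open(filename, 'rb') as f:
        return pickle.load(f)

tree = {'tearRate': {'reduced': 'no lenses', 'normal': 'soft'}}
path = os.path.join(tempfile.mkdtemp(), 'tree.pkl')
store_decision_tree(tree, path)
print(load_decision_tree(path) == tree)  # True
```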

(VI) Finally, back to the topic: fitting the glasses guy with glasses.

The contact-lens data set below comes from the UCI repository. It records observations of patients' eye conditions together with the lens type the doctor recommended: hard material, soft material, or not suited for contact lenses.

The data:

age         prescript  astigmatic  tearRate  lenses
young       myope      no          reduced   no lenses
young       myope      no          normal    soft
young       myope      yes         reduced   no lenses
young       myope      yes         normal    hard
young       hyper      no          reduced   no lenses
young       hyper      no          normal    soft
young       hyper      yes         reduced   no lenses
young       hyper      yes         normal    hard
pre         myope      no          reduced   no lenses
pre         myope      no          normal    soft
pre         myope      yes         reduced   no lenses
pre         myope      yes         normal    hard
pre         hyper      no          reduced   no lenses
pre         hyper      no          normal    soft
pre         hyper      yes         reduced   no lenses
pre         hyper      yes         normal    no lenses
presbyopic  myope      no          reduced   no lenses
presbyopic  myope      no          normal    no lenses
presbyopic  myope      yes         reduced   no lenses
presbyopic  myope      yes         normal    hard
presbyopic  hyper      no          reduced   no lenses
presbyopic  hyper      no          normal    soft
presbyopic  hyper      yes         reduced   no lenses
presbyopic  hyper      yes         normal    no lenses


The test program:

def test():
    with open('lenses.txt') as f:
        lense_data = [line.strip().split('\t') for line in f]
    lense_label = ['age', 'prescript', 'astigmatic', 'tearRate']
    return build_decision_tree(lense_data, lense_label)
Running test() here prints the resulting tree as a nested dictionary (the screenshot of my output is omitted).


And the glasses guy can finally buy the right glasses...


All the code is pasted below:

def entropy(dataset):
    from math import log
    log2 = lambda x: log(x) / log(2)
    results = {}
    for row in dataset:
        r = row[-1]
        results[r] = results.get(r, 0) + 1
    ent = 0.0
    for r in results:
        p = float(results[r]) / len(dataset)
        ent -= p * log2(p)
    return ent

def fetch_subdataset(dataset, k, v):
    return [d[:k] + d[k+1:] for d in dataset if d[k] == v]

def get_max_feature(class_list):
    class_count = {}
    for cla in class_list:
        class_count[cla] = class_count.get(cla, 0) + 1
    sorted_class_count = sorted(class_count.items(), key=lambda d: d[1], reverse=True)
    return sorted_class_count[0][0]

def choose_decision_feature(dataset):
    best_ent, feature = float('inf'), -1
    for i in range(len(dataset[0]) - 1):
        unq_feat_list = set(e[i] for e in dataset)
        ent_t = 0.0
        for f in unq_feat_list:
            sub_data = fetch_subdataset(dataset, i, f)
            ent_t += entropy(sub_data) * len(sub_data) / len(dataset)
        if ent_t < best_ent:
            best_ent, feature = ent_t, i
    return feature

def build_decision_tree(dataset, datalabel):
    cla = [c[-1] for c in dataset]
    if len(cla) == cla.count(cla[0]):
        return cla[0]
    if len(dataset[0]) == 1:
        return get_max_feature(cla)
    feature = choose_decision_feature(dataset)
    feature_label = datalabel[feature]
    decision_tree = {feature_label: {}}
    del datalabel[feature]
    for value in set(d[feature] for d in dataset):
        sub_label = datalabel[:]
        decision_tree[feature_label][value] = build_decision_tree(
            fetch_subdataset(dataset, feature, value), sub_label)
    return decision_tree

def store_decision_tree(tree, filename):
    import pickle
    with open(filename, 'wb') as f:
        pickle.dump(tree, f)

def load_decision_tree(filename):
    import pickle
    with open(filename, 'rb') as f:
        return pickle.load(f)

def classify(decision_tree, feat_labels, testVec):
    label = next(iter(decision_tree))
    next_dict = decision_tree[label]
    feat_index = feat_labels.index(label)
    for key in next_dict:
        if testVec[feat_index] == key:
            if isinstance(next_dict[key], dict):
                return classify(next_dict[key], feat_labels, testVec)
            return next_dict[key]

def test():
    with open('lenses.txt') as f:
        lense_data = [line.strip().split('\t') for line in f]
    lense_label = ['age', 'prescript', 'astigmatic', 'tearRate']
    return build_decision_tree(lense_data, lense_label)

if __name__ == "__main__":
    print(test())





