Machine Learning in Action, Decision Trees: Buying Glasses for the Glasses Guy
Source: Internet | Editor: 程序博客网 | Time: 2024/05/02 19:38
Welcome to follow my personal blog: blog.timene.com
A decision tree is an extremely easy algorithm to understand: once the model is built, it is just a chain of nested if...else... or nested switch statements.
Advantages: low computational complexity; output that is easy to interpret; insensitivity to missing intermediate values; able to handle irrelevant features.
Disadvantages: prone to overfitting.
Applicable data types: numeric and nominal.
A Python implementation of decision trees:
(I) First, a few utility functions: computing entropy, splitting a dataset, and picking the most probable class.
(1) Computing entropy: entropy measures the disorder of a set; the more disordered the set, the higher its entropy.
def entropy(dataset):
    from math import log
    # count how often each class label (the last column) occurs
    results = {}
    for row in dataset:
        r = row[-1]
        results[r] = results.get(r, 0) + 1
    # Shannon entropy: -sum(p * log2(p)) over the class proportions
    ent = 0.0
    for r in results:
        p = float(results[r]) / len(dataset)
        ent -= p * log(p, 2)
    return ent
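To sanity-check the entropy function, here is a minimal sketch (the function is restated so the snippet runs on its own; the toy rows are invented for illustration):

```python
from math import log

def entropy(dataset):
    # count each class label (last column), then apply -sum(p * log2(p))
    results = {}
    for row in dataset:
        results[row[-1]] = results.get(row[-1], 0) + 1
    ent = 0.0
    for count in results.values():
        p = float(count) / len(dataset)
        ent -= p * log(p, 2)
    return ent

# a perfectly mixed two-class set carries 1 bit of entropy
print(entropy([['x', 'yes'], ['x', 'no']]))   # 1.0
# a pure set carries none
print(entropy([['x', 'yes'], ['y', 'yes']]))  # 0.0
```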
(2) Extracting a subset by attribute and value:
def fetch_subdataset(dataset, k, v):
    return [d[:k] + d[k+1:] for d in dataset if d[k] == v]

This function is only one line long. What it does: take the subset of rows in dataset whose k-th column equals v, and drop the k-th column from that subset. Python's simplicity and elegance on full display.
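A quick usage sketch of that one-liner (the rows below are made up for illustration):

```python
def fetch_subdataset(dataset, k, v):
    # keep rows whose k-th column equals v, with that column removed
    return [d[:k] + d[k+1:] for d in dataset if d[k] == v]

rows = [['young', 'myope', 'soft'],
        ['young', 'hyper', 'hard'],
        ['pre',   'myope', 'soft']]

# select the 'young' rows by column 0; column 0 disappears from the result
print(fetch_subdataset(rows, 0, 'young'))  # [['myope', 'soft'], ['hyper', 'hard']]
```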
(3) Picking the most probable class. While building the tree, if all decision attributes have been used up and the data still cannot be uniquely separated, we fall back to majority voting to choose the final class:
def get_max_feature(class_list):
    # count each class label
    class_count = {}
    for cla in class_list:
        class_count[cla] = class_count.get(cla, 0) + 1
    # sort by count, descending, and return the most frequent class
    sorted_class_count = sorted(class_count.items(), key=lambda d: d[1], reverse=True)
    return sorted_class_count[0][0]
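The majority vote in action, on a made-up list of leaf labels:

```python
def get_max_feature(class_list):
    # count each class, then return the most frequent one
    class_count = {}
    for cla in class_list:
        class_count[cla] = class_count.get(cla, 0) + 1
    return sorted(class_count.items(), key=lambda d: d[1], reverse=True)[0][0]

print(get_max_feature(['soft', 'hard', 'soft', 'no lenses']))  # soft
```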
(II) Choosing the best way to split the data:
Which column should we split on so that the split yields the largest information gain?
def choose_decision_feature(dataset):
    # minimizing the weighted entropy of the split is equivalent to
    # maximizing the information gain
    ent, feature = float('inf'), -1
    for i in range(len(dataset[0]) - 1):
        feat_list = [e[i] for e in dataset]
        unq_feat_list = set(feat_list)
        ent_t = 0.0
        for f in unq_feat_list:
            sub_data = fetch_subdataset(dataset, i, f)
            ent_t += entropy(sub_data) * len(sub_data) / len(dataset)
        if ent_t < ent:
            ent, feature = ent_t, i
    return feature
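A self-contained sketch of the feature selection on a tiny made-up dataset with two boolean attributes and a yes/no class. Attribute 0 separates the classes better, so it should be chosen:

```python
from math import log

def entropy(dataset):
    # Shannon entropy over the class labels in the last column
    results = {}
    for row in dataset:
        results[row[-1]] = results.get(row[-1], 0) + 1
    ent = 0.0
    for count in results.values():
        p = float(count) / len(dataset)
        ent -= p * log(p, 2)
    return ent

def fetch_subdataset(dataset, k, v):
    return [d[:k] + d[k+1:] for d in dataset if d[k] == v]

def choose_decision_feature(dataset):
    # pick the column whose split minimizes the weighted entropy
    ent, feature = float('inf'), -1
    for i in range(len(dataset[0]) - 1):
        ent_t = 0.0
        for f in set(e[i] for e in dataset):
            sub_data = fetch_subdataset(dataset, i, f)
            ent_t += entropy(sub_data) * len(sub_data) / len(dataset)
        if ent_t < ent:
            ent, feature = ent_t, i
    return feature

data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print(choose_decision_feature(data))  # 0
```

Splitting on attribute 0 leaves one pure branch and one nearly pure branch, while splitting on attribute 1 leaves a fifty-fifty branch, so attribute 0 wins.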
(III) Recursively building the decision tree:
def build_decision_tree(dataset, datalabel):
    cla = [c[-1] for c in dataset]
    if len(cla) == cla.count(cla[0]):
        # all remaining samples share one class: this branch is a leaf
        return cla[0]
    if len(dataset[0]) == 1:
        # no attributes left: fall back to a majority vote over the classes
        return get_max_feature(cla)
    feature = choose_decision_feature(dataset)
    feature_label = datalabel[feature]
    decision_tree = {feature_label: {}}
    del datalabel[feature]
    feat_value = [d[feature] for d in dataset]
    unique_feat_value = set(feat_value)
    for value in unique_feat_value:
        sub_label = datalabel[:]
        decision_tree[feature_label][value] = build_decision_tree(
            fetch_subdataset(dataset, feature, value), sub_label)
    return decision_tree
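Putting the pieces together on the same toy data (everything is restated so the sketch runs standalone; the feature names 'no surfacing' and 'flippers' are invented for illustration):

```python
from math import log

def entropy(dataset):
    results = {}
    for row in dataset:
        results[row[-1]] = results.get(row[-1], 0) + 1
    ent = 0.0
    for count in results.values():
        p = float(count) / len(dataset)
        ent -= p * log(p, 2)
    return ent

def fetch_subdataset(dataset, k, v):
    return [d[:k] + d[k+1:] for d in dataset if d[k] == v]

def get_max_feature(class_list):
    class_count = {}
    for cla in class_list:
        class_count[cla] = class_count.get(cla, 0) + 1
    return sorted(class_count.items(), key=lambda d: d[1], reverse=True)[0][0]

def choose_decision_feature(dataset):
    ent, feature = float('inf'), -1
    for i in range(len(dataset[0]) - 1):
        ent_t = 0.0
        for f in set(e[i] for e in dataset):
            sub_data = fetch_subdataset(dataset, i, f)
            ent_t += entropy(sub_data) * len(sub_data) / len(dataset)
        if ent_t < ent:
            ent, feature = ent_t, i
    return feature

def build_decision_tree(dataset, datalabel):
    cla = [c[-1] for c in dataset]
    if len(cla) == cla.count(cla[0]):
        return cla[0]                      # pure branch: leaf
    if len(dataset[0]) == 1:
        return get_max_feature(cla)        # attributes exhausted: majority vote
    feature = choose_decision_feature(dataset)
    feature_label = datalabel[feature]
    decision_tree = {feature_label: {}}
    del datalabel[feature]
    for value in set(d[feature] for d in dataset):
        decision_tree[feature_label][value] = build_decision_tree(
            fetch_subdataset(dataset, feature, value), datalabel[:])
    return decision_tree

data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
labels = ['no surfacing', 'flippers']
print(build_decision_tree(data, labels))
# {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
```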
(IV) Using the decision tree
def classify(decision_tree, feat_labels, testVec):
    # the root node's single key is the feature to test at this level
    label = list(decision_tree.keys())[0]
    next_dict = decision_tree[label]
    feat_index = feat_labels.index(label)
    c_label = None
    for key in next_dict.keys():
        if testVec[feat_index] == key:
            if isinstance(next_dict[key], dict):
                # internal node: keep descending
                c_label = classify(next_dict[key], feat_labels, testVec)
            else:
                # leaf node: this is the class
                c_label = next_dict[key]
    return c_label
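Classification only needs a tree dictionary, so a hand-written toy tree is enough to demonstrate it (the tree and labels below are made up for illustration, not the tree learned from the full dataset):

```python
def classify(decision_tree, feat_labels, test_vec):
    # the single key at this node names the feature to test
    label = list(decision_tree.keys())[0]
    next_dict = decision_tree[label]
    feat_index = feat_labels.index(label)
    for key, subtree in next_dict.items():
        if test_vec[feat_index] == key:
            if isinstance(subtree, dict):
                return classify(subtree, feat_labels, test_vec)  # descend
            return subtree                                       # leaf
    return None  # unseen attribute value

tree = {'tearRate': {'reduced': 'no lenses',
                     'normal': {'astigmatic': {'yes': 'hard', 'no': 'soft'}}}}
labels = ['age', 'prescript', 'astigmatic', 'tearRate']
print(classify(tree, labels, ['young', 'myope', 'no', 'normal']))  # soft
```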
(V) Persisting the decision tree
(1) Saving
def store_decision_tree(tree, filename):
    import pickle
    # pickle requires binary mode
    with open(filename, 'wb') as f:
        pickle.dump(tree, f)
(2) Loading
def load_decision_tree(filename):
    import pickle
    with open(filename, 'rb') as f:
        return pickle.load(f)
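A round-trip sketch of the two persistence helpers, using a temporary file (binary file mode is what makes pickle work under Python 3):

```python
import os
import pickle
import tempfile

def store_decision_tree(tree, filename):
    with open(filename, 'wb') as f:  # pickle needs binary mode
        pickle.dump(tree, f)

def load_decision_tree(filename):
    with open(filename, 'rb') as f:
        return pickle.load(f)

tree = {'tearRate': {'reduced': 'no lenses', 'normal': 'soft'}}
path = os.path.join(tempfile.mkdtemp(), 'tree.pkl')
store_decision_tree(tree, path)
print(load_decision_tree(path) == tree)  # True
```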
(VI) Finally, back to the main topic: fitting the glasses guy with glasses.
The contact-lens dataset below comes from the UCI repository. It records observations of patients' eye conditions together with the lens type the doctor recommended; the lens types are hard, soft, and "no lenses" (not suited to contact lenses).
The data:
young       myope  no   reduced  no lenses
young       myope  no   normal   soft
young       myope  yes  reduced  no lenses
young       myope  yes  normal   hard
young       hyper  no   reduced  no lenses
young       hyper  no   normal   soft
young       hyper  yes  reduced  no lenses
young       hyper  yes  normal   hard
pre         myope  no   reduced  no lenses
pre         myope  no   normal   soft
pre         myope  yes  reduced  no lenses
pre         myope  yes  normal   hard
pre         hyper  no   reduced  no lenses
pre         hyper  no   normal   soft
pre         hyper  yes  reduced  no lenses
pre         hyper  yes  normal   no lenses
presbyopic  myope  no   reduced  no lenses
presbyopic  myope  no   normal   no lenses
presbyopic  myope  yes  reduced  no lenses
presbyopic  myope  yes  normal   hard
presbyopic  hyper  no   reduced  no lenses
presbyopic  hyper  no   normal   soft
presbyopic  hyper  yes  reduced  no lenses
presbyopic  hyper  yes  normal   no lenses
def test():
    # lenses.txt holds the tab-separated table above
    f = open('lenses.txt')
    lense_data = [inst.strip().split('\t') for inst in f.readlines()]
    lense_label = ['age', 'prescript', 'astigmatic', 'tearRate']
    lense_tree = build_decision_tree(lense_data, lense_label)
    return lense_tree

The result I get here is as follows:
The glasses guy can finally buy the right glasses...
All the code, pasted together below:
def entropy(dataset):
    from math import log
    results = {}
    for row in dataset:
        r = row[-1]
        results[r] = results.get(r, 0) + 1
    ent = 0.0
    for r in results:
        p = float(results[r]) / len(dataset)
        ent -= p * log(p, 2)
    return ent

def fetch_subdataset(dataset, k, v):
    return [d[:k] + d[k+1:] for d in dataset if d[k] == v]

def get_max_feature(class_list):
    class_count = {}
    for cla in class_list:
        class_count[cla] = class_count.get(cla, 0) + 1
    sorted_class_count = sorted(class_count.items(), key=lambda d: d[1], reverse=True)
    return sorted_class_count[0][0]

def choose_decision_feature(dataset):
    ent, feature = float('inf'), -1
    for i in range(len(dataset[0]) - 1):
        feat_list = [e[i] for e in dataset]
        unq_feat_list = set(feat_list)
        ent_t = 0.0
        for f in unq_feat_list:
            sub_data = fetch_subdataset(dataset, i, f)
            ent_t += entropy(sub_data) * len(sub_data) / len(dataset)
        if ent_t < ent:
            ent, feature = ent_t, i
    return feature

def build_decision_tree(dataset, datalabel):
    cla = [c[-1] for c in dataset]
    if len(cla) == cla.count(cla[0]):
        return cla[0]
    if len(dataset[0]) == 1:
        return get_max_feature(cla)
    feature = choose_decision_feature(dataset)
    feature_label = datalabel[feature]
    decision_tree = {feature_label: {}}
    del datalabel[feature]
    feat_value = [d[feature] for d in dataset]
    unique_feat_value = set(feat_value)
    for value in unique_feat_value:
        sub_label = datalabel[:]
        decision_tree[feature_label][value] = build_decision_tree(
            fetch_subdataset(dataset, feature, value), sub_label)
    return decision_tree

def store_decision_tree(tree, filename):
    import pickle
    with open(filename, 'wb') as f:
        pickle.dump(tree, f)

def load_decision_tree(filename):
    import pickle
    with open(filename, 'rb') as f:
        return pickle.load(f)

def classify(decision_tree, feat_labels, testVec):
    label = list(decision_tree.keys())[0]
    next_dict = decision_tree[label]
    feat_index = feat_labels.index(label)
    c_label = None
    for key in next_dict.keys():
        if testVec[feat_index] == key:
            if isinstance(next_dict[key], dict):
                c_label = classify(next_dict[key], feat_labels, testVec)
            else:
                c_label = next_dict[key]
    return c_label

def test():
    f = open('lenses.txt')
    lense_data = [inst.strip().split('\t') for inst in f.readlines()]
    lense_label = ['age', 'prescript', 'astigmatic', 'tearRate']
    lense_tree = build_decision_tree(lense_data, lense_label)
    return lense_tree

if __name__ == "__main__":
    tree = test()
    print(tree)