Decision Trees
We use decision trees to distill data into knowledge.
Pseudo code:

Check if every item in the dataset is in the same class:
    if so:
        return the class label
    else:
        find the best feature to split the data
        split the dataset
        create a branch node
        for each split:
            call createBranch and add the result to the branch node
        return branch node
First, we define a function that calculates the Shannon entropy of any dataset we pass in. It is simple.
#!/usr/bin/env python
from math import log
import operator

def f1(dataset):
    # Calculate the Shannon entropy of a dataset, where the
    # last element of each row is the class label
    num = len(dataset)
    dic = {}
    for i in dataset:
        if i[-1] not in dic:
            dic[i[-1]] = 0
        dic[i[-1]] += 1
    entropy = 0.0
    for label in dic:
        # Convert to float so integer division does not truncate
        # the probability to zero (log(0) raises a math domain error)
        pr = float(dic[label]) / num
        entropy -= pr * log(pr, 2)
    return entropy
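As a quick sanity check of the entropy calculation, here is a self-contained sketch; the helper name calc_entropy and the toy dataset are illustrative, not from the original:

```python
from math import log

def calc_entropy(dataset):
    # Shannon entropy of the class labels (the last column) of a dataset
    counts = {}
    for row in dataset:
        counts[row[-1]] = counts.get(row[-1], 0) + 1
    entropy = 0.0
    for label in counts:
        pr = float(counts[label]) / len(dataset)
        entropy -= pr * log(pr, 2)
    return entropy

# A toy dataset: two 'yes' rows and one 'no' row
data = [[1, 'yes'], [1, 'yes'], [0, 'no']]
print(calc_entropy(data))  # about 0.918
```

With a 2:1 class split the entropy is -(2/3)log2(2/3) - (1/3)log2(1/3), about 0.918 bits; a dataset with only one class would give exactly 0.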
Next, given a feature on which to split the dataset, we need to be able to return a subset of the original dataset. How do we do this? Call our dataset D: an n*m matrix holding n pieces of data. Each row D[i] is an m-dimensional vector whose last element is the classification of D[i]; the remaining elements are its features. If we choose the k-th feature to split on, D[i][k] may differ from D[j][k] when i != j, and we use these differences in the k-th feature to split the dataset. For a given value of the k-th feature, we collect every row with that value, delete the k-th element from each, and put the resulting (m-1)-dimensional vectors into the returned list.
def f2(dataset, axis, value):
    # Return the rows whose feature at position 'axis' equals 'value',
    # with that feature removed
    ret = []
    for i in dataset:
        if i[axis] == value:
            temp = i[:axis]
            temp.extend(i[axis+1:])
            ret.append(temp)
    return ret
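A short self-contained demonstration of the splitting step; the name split_dataset and the toy rows are illustrative:

```python
def split_dataset(dataset, axis, value):
    # Keep rows whose feature at 'axis' equals 'value', dropping that feature
    ret = []
    for row in dataset:
        if row[axis] == value:
            ret.append(row[:axis] + row[axis+1:])
    return ret

data = [[1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no']]
# Split on feature 0 with value 1: two rows match, and feature 0 is removed
print(split_dataset(data, 0, 1))  # [[1, 'yes'], [0, 'no']]
```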
Third, once we have a dataset, we must choose the best feature on which to split it. To do this we use Shannon entropy: we select the feature whose split yields the lowest weighted entropy (equivalently, the highest information gain).
def f3(dataset):
    # Choose the feature whose split gives the lowest weighted entropy
    bestEntropy = f1(dataset)
    bestFeature = -1
    numOfFeatures = len(dataset[0]) - 1
    for i in range(numOfFeatures):
        li = [example[i] for example in dataset]
        v = set(li)
        # Reset the accumulator for each feature; accumulating across
        # features would make every comparison after the first one wrong
        en = 0.0
        for value in v:
            sub = f2(dataset, i, value)
            en += float(len(sub)) / len(dataset) * f1(sub)
        if en < bestEntropy:
            bestEntropy = en
            bestFeature = i
    return bestFeature
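The feature-selection step can be verified end to end on a tiny dataset. The sketch below is self-contained (helper names and the toy data are illustrative); splitting on the first feature separates the classes better, so it should be chosen:

```python
from math import log

def calc_entropy(dataset):
    counts = {}
    for row in dataset:
        counts[row[-1]] = counts.get(row[-1], 0) + 1
    return -sum(float(c) / len(dataset) * log(float(c) / len(dataset), 2)
                for c in counts.values())

def split_dataset(dataset, axis, value):
    return [row[:axis] + row[axis+1:] for row in dataset if row[axis] == value]

def choose_best_feature(dataset):
    # Pick the feature whose split minimizes the weighted entropy
    best_entropy = calc_entropy(dataset)
    best_feature = -1
    for i in range(len(dataset[0]) - 1):
        weighted = 0.0  # reset per feature
        for value in set(row[i] for row in dataset):
            sub = split_dataset(dataset, i, value)
            weighted += float(len(sub)) / len(dataset) * calc_entropy(sub)
        if weighted < best_entropy:
            best_entropy = weighted
            best_feature = i
    return best_feature

# Toy data: feature 0 splits into {3 rows: 2 yes/1 no} and {2 rows: all no}
# (weighted entropy ~0.551); feature 1 gives ~0.8, so feature 0 wins
data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print(choose_best_feature(data))  # 0
```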
Furthermore, consider this: ideally each leaf of the decision tree contains members of only one class, but what if a leaf's members are not all the same (i.e. they have different classifications)? We choose the leaf's majority classification.
def f4(classList):
    # Return the majority class label; note classList is a list, not a
    # dict, so we iterate over it directly rather than over .keys()
    c = {}
    for i in classList:
        if i not in c:
            c[i] = 0
        c[i] += 1
    sortedClassCount = sorted(c.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
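A minimal self-contained version of the majority vote (the name majority_class is illustrative):

```python
def majority_class(class_list):
    # Count each label and return the most frequent one
    counts = {}
    for label in class_list:
        counts[label] = counts.get(label, 0) + 1
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[0][0]

print(majority_class(['yes', 'no', 'yes']))  # 'yes'
```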
Finally, we could build our decision tree.
def f5(dataset, o_labels):
    # Recursively build the decision tree as nested dictionaries
    labels = o_labels[:]
    classList = [i[-1] for i in dataset]
    # Stop if every remaining item belongs to the same class
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # Stop if no features are left; fall back to the majority class
    if len(dataset[0]) == 1:
        return f4(classList)
    bestFeature = f3(dataset)
    bestFeatLabel = labels[bestFeature]
    mytree = {bestFeatLabel: {}}
    del labels[bestFeature]
    vlist = [i[bestFeature] for i in dataset]
    v = set(vlist)
    for value in v:
        subLabels = labels[:]
        sub = f2(dataset, bestFeature, value)
        mytree[bestFeatLabel][value] = f5(sub, subLabels)
    return mytree
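Putting the pieces together, the whole pipeline can be exercised on a toy dataset. This is a self-contained sketch with illustrative names (create_tree and friends stand in for f1 through f5 above):

```python
from math import log

def calc_entropy(dataset):
    counts = {}
    for row in dataset:
        counts[row[-1]] = counts.get(row[-1], 0) + 1
    return -sum(float(c) / len(dataset) * log(float(c) / len(dataset), 2)
                for c in counts.values())

def split_dataset(dataset, axis, value):
    return [row[:axis] + row[axis+1:] for row in dataset if row[axis] == value]

def choose_best_feature(dataset):
    best_entropy, best_feature = calc_entropy(dataset), -1
    for i in range(len(dataset[0]) - 1):
        weighted = 0.0
        for value in set(row[i] for row in dataset):
            sub = split_dataset(dataset, i, value)
            weighted += float(len(sub)) / len(dataset) * calc_entropy(sub)
        if weighted < best_entropy:
            best_entropy, best_feature = weighted, i
    return best_feature

def majority_class(class_list):
    counts = {}
    for label in class_list:
        counts[label] = counts.get(label, 0) + 1
    return max(counts.items(), key=lambda kv: kv[1])[0]

def create_tree(dataset, labels):
    labels = labels[:]
    class_list = [row[-1] for row in dataset]
    if class_list.count(class_list[0]) == len(class_list):
        return class_list[0]               # every item has the same class
    if len(dataset[0]) == 1:
        return majority_class(class_list)  # no features left to split on
    best = choose_best_feature(dataset)
    best_label = labels[best]
    tree = {best_label: {}}
    del labels[best]
    for value in set(row[best] for row in dataset):
        tree[best_label][value] = create_tree(
            split_dataset(dataset, best, value), labels)
    return tree

data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
tree = create_tree(data, ['no surfacing', 'flippers'])
print(tree)
# {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
```

The root splits on the first feature because it has the lowest weighted entropy; the branch for value 1 still mixes classes, so it splits again on the remaining feature.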
After building the decision tree, we can easily determine a new vector's classification. Suppose we get a new data point Di whose classification is unknown; it is then an (m-1)-dimensional vector (the m-th element would be the classification, which we do not have). Starting from the root, we look up the feature's value in the current dictionary: if the result is another dictionary we repeat the search, and if it is a string we have reached a leaf, which is the classification.
def classify(inputTree, featLabels, testVec):
    # inputTree.keys() is a view; wrapping it in list() and indexing
    # with '[0]' yields the root feature name as a string
    firstStr = list(inputTree.keys())[0]
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)
    classLabel = ''
    for key in secondDict:
        # key is a string if we read a file to build up our dataset,
        # so cast it to int before comparing
        if testVec[featIndex] == int(key):
            # The value is either a 'dict' -- requiring further
            # recursion -- or a leaf holding the class label
            if isinstance(secondDict[key], dict):
                classLabel = classify(secondDict[key], featLabels, testVec)
            else:
                classLabel = secondDict[key]
    return classLabel
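Here is a self-contained classification sketch over a hard-coded tree. Because the toy tree's keys are ints rather than strings read from a file, this variant compares keys directly instead of casting with int():

```python
def classify(tree, feat_labels, test_vec):
    # Walk the nested-dictionary tree until a leaf (a plain label) is reached
    root = list(tree.keys())[0]
    branches = tree[root]
    idx = feat_labels.index(root)
    for key, subtree in branches.items():
        if test_vec[idx] == key:  # keys here are ints, so compare directly
            if isinstance(subtree, dict):
                return classify(subtree, feat_labels, test_vec)
            return subtree
    return None  # feature value never seen during training

tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
labels = ['no surfacing', 'flippers']
print(classify(tree, labels, [1, 1]))  # 'yes'
print(classify(tree, labels, [1, 0]))  # 'no'
```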
We can use the following functions to store and reload the tree.
def storeTree(inputTree, filename):
    import pickle
    # Open in binary mode: pickle reads and writes bytes
    fw = open(filename, 'wb')
    pickle.dump(inputTree, fw)
    fw.close()

def grabTree(filename):
    import pickle
    fr = open(filename, 'rb')
    return pickle.load(fr)
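A quick round-trip check of the persistence step; the with-statement variant below closes the files automatically, and the temporary path is illustrative:

```python
import os
import pickle
import tempfile

def store_tree(tree, filename):
    # 'wb': pickle writes bytes, so the file must be opened in binary mode
    with open(filename, 'wb') as fw:
        pickle.dump(tree, fw)

def grab_tree(filename):
    with open(filename, 'rb') as fr:
        return pickle.load(fr)

tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
path = os.path.join(tempfile.mkdtemp(), 'tree.pkl')
store_tree(tree, path)
print(grab_tree(path) == tree)  # True
```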
Here we use the Lenses dataset, retrieved from the UCI Machine Learning Repository, to test our decision tree.
>>> import trees
>>> fr = open('lenses.txt')
>>> lenses = [i.strip().split(' ') for i in fr.readlines()]
>>> lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']
>>> lensesTree = trees.f5(lenses, lensesLabels)
>>> lensesTree
{'tearRate': {'1': '3',
'2': {'astigmatic': {'1': {'age': {'1': '2',
'2': '2',
'3': {'prescript': {'1': '3', '2': '2'}}}},
'2': {'prescript': {'1': '1',
'2': {'age': {'1': '1', '2': '3', '3': '3'}}}}}}}}
>>> trees.classify(lensesTree, lensesLabels, [3, 2, 2, 2])
'3'
We can see that the tree classifies this example correctly.