Decision Trees


The goal is to use decision trees to distill data into knowledge.

Pseudocode for createBranch:

Check if every item in the dataset is in the same class:
    if so: return the class label
    else:
        find the best feature to split the data
        split the dataset
        create a branch node
        for each split:
            call createBranch and add the result to the branch node
        return branch node

First, we define a function that calculates the Shannon entropy of a dataset, using the class label stored in the last column of each record.

#!/usr/bin/env python
from math import log
import operator

def f1(dataset):
    # Shannon entropy of the class labels (last column of each record)
    num=len(dataset)
    dic={}
    for i in dataset:
        if i[-1] not in dic:
            dic[i[-1]]=0
        dic[i[-1]]+=1
    entropy=0.0
    for i in dic:
        # Convert to float first: under Python 2, integer division would
        #   truncate pr to zero and log(0) raises a math domain error
        pr=float(dic[i])/num
        entropy-=pr*log(pr,2)
    return entropy
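As a quick check of the entropy calculation, here is a minimal sketch assuming Python 3. The name `shannon_entropy` and the toy dataset are hypothetical and only serve as an illustration; the function computes the sum of -p*log2(p) over the class proportions p, which is the same quantity f1 returns.

```python
from collections import Counter
from math import log2

def shannon_entropy(rows):
    """Entropy in bits of the class labels stored in the last column."""
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return -sum((n / total) * log2(n / total) for n in counts.values())

# Hypothetical toy data: two binary features followed by a class label.
toy = [
    [1, 1, 'yes'],
    [1, 1, 'yes'],
    [1, 0, 'no'],
    [0, 1, 'no'],
    [0, 1, 'no'],
]

# Two 'yes' and three 'no' labels give roughly 0.971 bits.
print(shannon_entropy(toy))
```

A dataset whose records all share one class has entropy 0, which is why the tree stops splitting at such nodes.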

Next, given a feature and one of its values, we need to return the matching subset of the original dataset. How do we do this? Call the dataset D: it is an n*m matrix holding n records. Each record D[i] is an m-dimensional vector (indexed from 0) whose last element D[i][m-1] is its class label and whose other elements D[i][0] to D[i][m-2] are its features. If we split on the k-th feature, D[i][k] may differ from D[j][k] for i != j, and these differing values define the partitions. For a chosen value, we collect every record whose k-th feature equals that value, delete the k-th element from each, and append the resulting (m-1)-dimensional vector to the list we return.

def f2(dataset, axis, value):
    ret=[]
    temp=[]
    for i in dataset:
        if i[axis]==value:
            temp=i[:axis]
            temp.extend(i[axis+1: ])
            ret.append(temp)
            temp=[]
    return ret
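To make the splitting step concrete, the sketch below applies the same idea to a small made-up dataset. It assumes each record is a Python list; `split_dataset` and the toy data are hypothetical names used only here.

```python
def split_dataset(rows, axis, value):
    """Keep rows whose feature at `axis` equals `value`, dropping that column."""
    return [row[:axis] + row[axis + 1:] for row in rows if row[axis] == value]

# Hypothetical toy data: two binary features followed by a class label.
toy = [
    [1, 1, 'yes'],
    [1, 1, 'yes'],
    [1, 0, 'no'],
    [0, 1, 'no'],
    [0, 1, 'no'],
]

# Rows whose first feature is 1, with that feature removed.
print(split_dataset(toy, 0, 1))
# Rows whose first feature is 0, with that feature removed.
print(split_dataset(toy, 0, 0))
```

Slicing builds new lists, so the original records are left unchanged.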

Third, given a dataset, we need to choose the best feature to split on. For each feature we compute the Shannon entropy of the subsets it produces, weighted by subset size, and select the feature with the lowest weighted entropy (equivalently, the largest information gain).

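def f3(dataset):
    # Start from the entropy of the unsplit dataset
    bestEntropy=f1(dataset)
    bestFeature=-1
    numOfFeatures=len(dataset[0])-1
    for i in range(numOfFeatures):
        # Reset for every feature; otherwise the sum carries over
        #   from the previous feature
        en=0.0
        li=[example[i] for example in dataset]
        v=set(li)
        for value in v:
            sub=f2(dataset, i, value)
            # Weight each subset's entropy by its share of the records
            en+=float(len(sub))/len(dataset)*f1(sub)
        if en<bestEntropy:
            bestEntropy=en
            bestFeature=i
    return bestFeature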
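The feature selection can be checked by hand on a small example. The following is a minimal sketch assuming Python 3; the helper names (`entropy`, `split`, `weighted_entropy`) and the toy data are hypothetical, and the weighted entropy is computed separately for every feature.

```python
from collections import Counter
from math import log2

def entropy(rows):
    """Entropy in bits of the class labels in the last column."""
    counts = Counter(row[-1] for row in rows)
    return -sum(n / len(rows) * log2(n / len(rows)) for n in counts.values())

def split(rows, axis, value):
    """Rows matching `value` at `axis`, with that column removed."""
    return [row[:axis] + row[axis + 1:] for row in rows if row[axis] == value]

def weighted_entropy(rows, axis):
    """Entropy after splitting on `axis`, weighted by subset size."""
    total = 0.0
    for value in {row[axis] for row in rows}:
        sub = split(rows, axis, value)
        total += len(sub) / len(rows) * entropy(sub)
    return total

# Hypothetical toy data: two binary features followed by a class label.
toy = [
    [1, 1, 'yes'],
    [1, 1, 'yes'],
    [1, 0, 'no'],
    [0, 1, 'no'],
    [0, 1, 'no'],
]

scores = [weighted_entropy(toy, axis) for axis in range(len(toy[0]) - 1)]
best = scores.index(min(scores))
print(scores, best)
```

On this data the first feature leaves about 0.551 bits of uncertainty and the second leaves 0.8 bits, so the first feature is the better split.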

One more case needs handling. Ideally every leaf of the decision tree contains records of a single class, but what if the features run out while a leaf still holds records of different classes? In that case we label the leaf with its majority class.

def f4(classList):
    # Return the most common class label; classList is a plain list
    c={}
    for i in classList:
        if i not in c:
            c[i]=0
        c[i]+=1
    # items() works under both Python 2 and Python 3
    sortedClassCount=sorted(c.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
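The same majority vote can be written more compactly with the standard library. This is only an alternative sketch, not the code used in the rest of this post; `majority_class` is a hypothetical name.

```python
from collections import Counter

def majority_class(class_list):
    """Return the label that occurs most often in class_list."""
    return Counter(class_list).most_common(1)[0][0]

print(majority_class(['no', 'yes', 'no', 'no', 'yes']))
```

`Counter.most_common(1)` returns a one-element list of (label, count) pairs, so `[0][0]` extracts the label.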

Finally, we can build the decision tree recursively.

def f5(dataset, o_labels):
    # Work on a copy so the caller's label list is left untouched
    labels=o_labels[:]
    classList=[i[-1] for i in dataset]
    # Stop when every record has the same class
    if classList.count(classList[0])==len(classList):
        return classList[0]
    # Stop when no features are left; fall back to the majority class
    if len(dataset[0])==1:
        return f4(classList)
    bestFeature=f3(dataset)
    bestFeatLabel=labels[bestFeature]
    mytree={bestFeatLabel: {}}
    del(labels[bestFeature])
    vlist=[i[bestFeature] for i in dataset]
    v=set(vlist)
    for value in v:
        subLabels=labels[:]
        sub=f2(dataset, bestFeature, value)
        mytree[bestFeatLabel][value]=f5(sub, subLabels)
    return mytree
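To see the recursion end to end, here is a compact, self-contained sketch assuming Python 3. All names (`build_tree`, `best_feature`, the labels 'f0' and 'f1', the toy data) are hypothetical. One difference from f3 is an assumption of this sketch: `best_feature` always returns the feature with the lowest weighted entropy, even when no split improves on the unsplit dataset.

```python
from collections import Counter
from math import log2

def entropy(rows):
    counts = Counter(row[-1] for row in rows)
    return -sum(n / len(rows) * log2(n / len(rows)) for n in counts.values())

def split(rows, axis, value):
    return [row[:axis] + row[axis + 1:] for row in rows if row[axis] == value]

def best_feature(rows):
    """Index of the feature with the lowest weighted entropy."""
    def weighted(axis):
        values = {row[axis] for row in rows}
        subsets = [split(rows, axis, v) for v in values]
        return sum(len(s) / len(rows) * entropy(s) for s in subsets)
    return min(range(len(rows[0]) - 1), key=weighted)

def build_tree(rows, labels):
    classes = [row[-1] for row in rows]
    # Pure node: every record has the same class.
    if classes.count(classes[0]) == len(classes):
        return classes[0]
    # No features left: use the majority class.
    if len(rows[0]) == 1:
        return Counter(classes).most_common(1)[0][0]
    axis = best_feature(rows)
    rest = labels[:axis] + labels[axis + 1:]
    return {labels[axis]: {
        value: build_tree(split(rows, axis, value), rest)
        for value in {row[axis] for row in rows}
    }}

# Hypothetical toy data: two binary features followed by a class label.
toy = [
    [1, 1, 'yes'],
    [1, 1, 'yes'],
    [1, 0, 'no'],
    [0, 1, 'no'],
    [0, 1, 'no'],
]

tree = build_tree(toy, ['f0', 'f1'])
print(tree)
```

The result is a nested dictionary: each inner dictionary maps a feature value either to a class label (a leaf) or to another single-key dictionary (a subtree).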

After building the decision tree, we can classify a new vector. Suppose we get a new record Di whose class is unknown, so it is an (m-1)-dimensional vector holding only the features. Starting at the root, we look up the value Di has for the feature named at the current node and follow the matching branch. If the branch leads to a dictionary, we repeat the lookup on that subtree; if it leads to a string, we have reached a leaf and that string is the class label.

def classify(inputTree, featLabels, testVec):
    # inputTree has a single key, the feature label of this node;
    #   list(...)[0] extracts it under both Python 2 and Python 3
    firstStr=list(inputTree.keys())[0]
    secondDict=inputTree[firstStr]
    featIndex=featLabels.index(firstStr)
    classLabel=''
    for key in secondDict.keys():
        # Here key is a string if the dataset was read from a file,
        #   so convert it before comparing with the integer test vector
        if testVec[featIndex]==int(key):
            # A 'dict' is a subtree that needs another lookup;
            #   anything else (a 'str') is a leaf
            if isinstance(secondDict[key], dict):
                classLabel=classify(secondDict[key], featLabels, testVec)
            else:
                classLabel=secondDict[key]
    return classLabel
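The lookup can also be written as a loop over a hand-built tree. The sketch below assumes the branch keys have the same type as the test vector's values, so no int() conversion is done; `classify_sketch`, the labels and the tree are hypothetical.

```python
def classify_sketch(tree, feat_labels, vec):
    """Follow branches until a leaf (any non-dict value) is reached."""
    while isinstance(tree, dict):
        label = next(iter(tree))              # the single feature label at this node
        branches = tree[label]
        value = vec[feat_labels.index(label)]
        tree = branches[value]                # raises KeyError for an unseen value
    return tree

# Hypothetical tree over two binary features named 'f0' and 'f1'.
toy_tree = {'f0': {0: 'no', 1: {'f1': {0: 'no', 1: 'yes'}}}}
labels = ['f0', 'f1']

print(classify_sketch(toy_tree, labels, [1, 1]))
print(classify_sketch(toy_tree, labels, [1, 0]))
```

Unlike classify, which returns an empty string when no branch matches, this version raises a KeyError for a feature value that never appeared in the training data.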

The following functions store the tree on disk and load it back.

def storeTree(inputTree, filename):
    import pickle
    # pickle writes bytes, so open the file in binary mode
    fw=open(filename, 'wb')
    pickle.dump(inputTree, fw)
    fw.close()

def grabTree(filename):
    import pickle
    fr=open(filename, 'rb')
    tree=pickle.load(fr)
    fr.close()
    return tree
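A short round trip shows that a nested-dictionary tree survives pickling unchanged. This sketch assumes Python 3 and writes to a temporary directory; the tree and the file name 'tree.pkl' are made up for the example.

```python
import os
import pickle
import tempfile

# Hypothetical tree; any nested dict of picklable values works the same way.
tree = {'f0': {0: 'no', 1: {'f1': {0: 'no', 1: 'yes'}}}}

path = os.path.join(tempfile.mkdtemp(), 'tree.pkl')

# Binary mode is required because pickle reads and writes bytes.
with open(path, 'wb') as fw:
    pickle.dump(tree, fw)

with open(path, 'rb') as fr:
    restored = pickle.load(fr)

print(restored == tree)
```

The `with` blocks close the files even if pickling fails. Only unpickle files you trust, since loading a pickle can execute arbitrary code.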

Here we use the Lenses dataset from the UCI Machine Learning Repository to test the decision tree.

>>> import trees
>>> fr=open('lenses.txt')  # assumed file name; the original session does not show how fr was opened
>>> lenses=[i.strip().split(' ') for i in fr.readlines()]
>>> lensesLabels=['age', 'prescript', 'astigmatic', 'tearRate']
>>> lensesTree=trees.f5(lenses, lensesLabels)  # f5 is the tree-building function defined above
>>> lensesTree
{'tearRate': {'1': '3',
              '2': {'astigmatic': {'1': {'age': {'1': '2',
                                                 '2': '2',
                                                 '3': {'prescript': {'1': '3', '2': '2'}}}},
                                   '2': {'prescript': {'1': '1',
                                                       '2': {'age': {'1': '1', '2': '3', '3': '3'}}}}}}}}
>>> trees.classify(lensesTree, lensesLabels, [3,2,2,2])

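For the test vector [3,2,2,2], the tree above follows tearRate '2', astigmatic '2', prescript '2' and age '3', ending at the leaf '3'.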
