Data Mining Algorithms (2): Decision Trees

Source: Internet | Posted by: map添加数据 | Editor: 程序博客网 | Time: 2024/06/11 09:56

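Data mining algorithm study notes: index
Data Mining Algorithms (1): K-Nearest Neighbors (KNN)
Data Mining Algorithms (2): Decision Trees
Data Mining Algorithms (3): Logistic Regression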

Algorithm overview

Let's start with a small data set:

Age     Education     Income   Status    Purchase?
36-55   master's      high     single    will buy
18-35   high school   low      single    will buy
36-55   master's      low      single    won't buy
18-35   bachelor's    high     single    won't buy
gt 18   high school   low      single    will buy
18-35   bachelor's    high     married   won't buy

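Suppose we want to classify this data so that we can predict, from age, education, income and marital status, whether a person will buy. Suppose we first split the rows above by marital status: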

# Single
Age     Education     Income   Status    Purchase?
36-55   master's      high     single    will buy
18-35   high school   low      single    will buy
36-55   master's      low      single    won't buy
18-35   bachelor's    high     single    won't buy
gt 18   high school   low      single    will buy

# Married
Age     Education     Income   Status    Purchase?
18-35   bachelor's    high     married   won't buy

Next, the single group can be split further by income (high income first, then low income):

Age     Education     Income   Status    Purchase?
18-35   bachelor's    high     single    won't buy
36-55   master's      high     single    will buy

Age     Education     Income   Status    Purchase?
18-35   high school   low      single    will buy
36-55   master's      low      single    won't buy
gt 18   high school   low      single    will buy
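The two splits above can be reproduced with a few lines of plain Python. This is only a sketch: `partition`, `ROWS` and `COLUMNS` are names made up for this illustration and do not appear in the implementation later in the post.

```python
from collections import defaultdict

COLUMNS = ["Age", "Education", "Income", "Status", "Purchase?"]

# The six rows of the table above, copied as they appear there.
ROWS = [
    ("36-55", "master's", "high", "single", "will buy"),
    ("18-35", "high school", "low", "single", "will buy"),
    ("36-55", "master's", "low", "single", "won't buy"),
    ("18-35", "bachelor's", "high", "single", "won't buy"),
    ("gt 18", "high school", "low", "single", "will buy"),
    ("18-35", "bachelor's", "high", "married", "won't buy"),
]


def partition(rows, column):
    """Group rows by the value they take in `column` (hypothetical helper)."""
    index = COLUMNS.index(column)
    groups = defaultdict(list)
    for row in rows:
        groups[row[index]].append(row)
    return dict(groups)


by_status = partition(ROWS, "Status")
single_by_income = partition(by_status["single"], "Income")

print({value: len(rows) for value, rows in by_status.items()})
# {'single': 5, 'married': 1}
print({value: len(rows) for value, rows in single_by_income.items()})
# {'high': 2, 'low': 3}
```

Each call to `partition` corresponds to one level of the tree: the keys are the branches and the row lists are the data that reaches each branch.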

We then keep splitting on the remaining features (such as age or education) until every group of rows has been classified. The result is a tree-shaped structure.

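This process is exactly how a decision tree is built. At every split we have to pick one feature (age, education, income or marital status) to split on, so which one should we pick? This is where entropy comes in as the selection criterion.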

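Information entropy measures the uncertainty of a random variable. If a discrete random variable X has probability mass function p(x), the entropy of X is defined as
$H(X) = -\sum_{x} p(x) \log_2 p(x)$
With base-2 logarithms, entropy is measured in bits; with base-e logarithms, the unit is the nat. On average, it is the number of bits needed to describe the random variable.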
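As a quick check of the formula, here is a minimal sketch that computes the entropy of the Purchase? column of the six-row table (three "will buy", three "won't buy"). The function name `entropy` is chosen for this illustration only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # Sum of -p(x) * log2 p(x) over the distinct labels.
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


# Purchase? column of the six-row table.
labels = ["will buy", "will buy", "won't buy", "won't buy", "will buy", "won't buy"]

print(entropy(labels))            # 1.0
print(entropy(["will buy"] * 4))  # 0.0
```

An even 50/50 split gives the maximum of 1 bit for two classes, while a group that contains a single class has entropy 0, which is why splits that produce purer groups are preferred.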

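Entropy is put to use by the ID3 algorithm.

ID3 starts with the original data set S as the root node. On each iteration it goes through every unused feature of S and computes the entropy H(S) of the split on that feature (or, equivalently, the information gain IG(S)). It then selects the attribute with the smallest entropy (or the largest information gain) and splits S on that attribute (for example, income high versus low) to produce subsets of the data. The algorithm then recurses on each subset, considering only attributes that have not been selected before.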

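Algorithm steps:
1. Compute the entropy of every attribute on the data set S.
2. Split S into subsets using the attribute whose split gives the smallest entropy (equivalently, the largest information gain).
3. Create a decision tree node containing that attribute.
4. Recurse on each subset using the remaining attributes.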
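Steps 1 and 2 can be sketched on the six-row table from the overview. The names `entropy`, `information_gain`, `ROWS` and `COLUMNS` are illustrative and are not the ones used in the implementation further down.

```python
import math
from collections import Counter, defaultdict

COLUMNS = ["Age", "Education", "Income", "Status", "Purchase?"]
ROWS = [
    ("36-55", "master's", "high", "single", "will buy"),
    ("18-35", "high school", "low", "single", "will buy"),
    ("36-55", "master's", "low", "single", "won't buy"),
    ("18-35", "bachelor's", "high", "single", "won't buy"),
    ("gt 18", "high school", "low", "single", "will buy"),
    ("18-35", "bachelor's", "high", "married", "won't buy"),
]


def entropy(rows):
    """Entropy (bits) of the label, taken from the last field of each row."""
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def information_gain(rows, column):
    """Entropy before the split minus the weighted entropy after it."""
    index = COLUMNS.index(column)
    groups = defaultdict(list)
    for row in rows:
        groups[row[index]].append(row)
    after = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(rows) - after


gains = {column: information_gain(ROWS, column) for column in COLUMNS[:-1]}
best = max(gains, key=gains.get)

for column, gain in gains.items():
    print(f"{column}: {gain:.3f}")
# Age: 0.208
# Education: 0.667
# Income: 0.082
# Status: 0.191
print("best:", best)  # best: Education
```

On these six rows alone, Education has the largest gain, so ID3 would pick it for the root. The Status-first split in the overview was only a way of illustrating what a split looks like.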

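ID3 does not guarantee an optimal solution; it can get stuck in a local optimum. It is a greedy method that picks the locally best attribute to split the data set on each iteration.
ID3 can overfit the training data. To avoid overfitting, smaller decision trees should be preferred over larger ones. The algorithm usually produces small trees, but not always the smallest possible tree.
ID3 is harder to use on continuous data. If the values of an attribute are continuous, there are many more places where the data can be split on that attribute, and searching for the best split value can be time-consuming.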
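To make the cost concrete, here is a sketch of a threshold search for one numeric attribute. This is the thresholding idea used by later algorithms such as C4.5, not part of ID3 itself, and the numeric ages and labels below are hypothetical values invented for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())


def best_threshold(values, labels):
    """Try every midpoint between adjacent distinct values.

    Returns (threshold, weighted_entropy) for the binary split
    `value <= threshold` with the lowest weighted entropy.
    """
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no boundary between equal values
        threshold = (low + high) / 2
        left = [label for value, label in pairs if value <= threshold]
        right = [label for value, label in pairs if value > threshold]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


# Hypothetical numeric ages and labels, for illustration only.
ages = [17, 22, 25, 31, 40, 52]
labels = ["will buy", "will buy", "will buy", "won't buy", "won't buy", "won't buy"]

print(best_threshold(ages, labels))  # (28.0, 0.0)
```

With n distinct values there are up to n - 1 candidate thresholds per attribute, and each one requires an entropy computation, which is the extra work the paragraph above refers to.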

Implementation

The code in this post was run with:
python: 3.5.1
pandas: 0.19.2
sklearn: 0.18.1
Other environments may behave slightly differently.

Step 1: read the data

# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
import operator
import math

# Read the data
dataSet = pd.read_csv("./tree.txt", sep='\t')
labels = list(dataSet)  # all column names
labels = labels[0:-1]   # feature names (every column except the last)

Compute the entropy using the Shannon entropy formula given earlier.

def calcShannonEnt(dataFrame):
    numEntries = len(dataFrame)
    labelCounts = {}
    # Count how often each class label (last column) occurs
    for index, row in dataFrame.iterrows():
        currentLabel = row.iloc[-1]
        labelCounts[currentLabel] = labelCounts.get(currentLabel, 0) + 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * math.log(prob, 2)  # log base 2
    return shannonEnt

Split the DataFrame

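def splitDataSet(dataFrame, column_name, value):
    # Keep the rows whose column_name column equals value
    retDataSet = dataFrame[dataFrame[column_name] == value]
    # Drop the column that was just used for the split (axis=1: columns, axis=0: rows)
    retDataSet = retDataSet.drop(column_name, axis=1)
    return retDataSet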

Choose the best feature to split on and return its column name.

def chooseBestFeatureToSplit(dataFrame):
    baseEntropy = calcShannonEnt(dataFrame)
    bestInfoGain = float("-inf")  # lower than any attainable gain
    bestFeature = ""
    featureNames = list(dataFrame)
    for featureName in featureNames[0:-1]:  # the last column is the label
        uniqueVals = set(dataFrame[featureName])  # distinct values of this feature
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataFrame, featureName, value)
            prob = subDataSet.shape[0] / dataFrame.shape[0]
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:  # compare this to the best gain so far
            bestInfoGain = infoGain
            bestFeature = featureName
    return bestFeature  # column name of the best feature

def majorityCnt(classList):
    # Return the class label that occurs most often
    classCount = {}
    for vote in classList:
        classCount[vote] = classCount.get(vote, 0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

# Build the decision tree
def createTree(dataFrame):
    classList = dataFrame.iloc[:, -1]  # label column
    if len(classList.unique()) == 1:  # only one label value left
        return classList.unique()[0]  # stop splitting when all of the classes are equal
    if len(list(dataFrame)) == 2:  # only one feature left
        return majorityCnt(classList)
    bestFeatLabel = chooseBestFeatureToSplit(dataFrame)
    myTree = {bestFeatLabel: {}}
    featValues = dataFrame[bestFeatLabel]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        dataSubset = splitDataSet(dataFrame, bestFeatLabel, value)
        myTree[bestFeatLabel][value] = createTree(dataSubset)
    return myTree

def classify(inputTree, featLabels, testVec):
    keys_list = list(inputTree.keys())
    firstStr = keys_list[0]  # feature tested at this node
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)
    key = testVec[featIndex]
    valueOfFeat = secondDict[key]
    if isinstance(valueOfFeat, dict):  # subtree: keep descending
        classLabel = classify(valueOfFeat, featLabels, testVec)
    else:  # leaf: this is the class label
        classLabel = valueOfFeat
    return classLabel

tree = createTree(dataSet)
print(classify(tree, labels, ["36-55", "master's", "high", "single"]))

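The resulting tree is stored in a dictionary. Below is the output recorded from the original run, formatted for readability; the exact tree depends on the contents of tree.txt.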

{
    'Status': {
        'married': {
            'Income': {
                'high': {
                    'Education': {
                        'high school': 'will buy',
                        "master's": 'will buy',
                        "bachelor's": 'will buy'
                    }
                },
                'low': {
                    'Education': {
                        'high school': 'will buy',
                        "master's": "won't buy",
                        "bachelor's": 'will buy'
                    }
                }
            }
        },
        'single': {
            'Income': {
                'high': {
                    'Age': {
                        '36-55': 'will buy',
                        'gt55': "won't buy",
                        'lt18': "won't buy"
                    }
                },
                'low': {
                    'Education': {
                        'high school': "won't buy",
                        "bachelor's": 'will buy'
                    }
                }
            }
        }
    }
}
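A nested dictionary like this can be pretty-printed with the standard library instead of being formatted by hand. The sketch below uses a small hypothetical tree in the same layout (not the output above), and its `classify` takes a feature-to-value mapping, which is a simplification compared with the list-based version in the post.

```python
import json

# A small hypothetical tree in the same nested-dict layout, for illustration only.
tree = {
    "Income": {
        "high": {"Age": {"36-55": "will buy", "18-35": "won't buy"}},
        "low": "will buy",
    }
}


def classify(tree, sample):
    """Follow the branches matching `sample` (feature -> value) down to a leaf."""
    node = tree
    while isinstance(node, dict):
        feature = next(iter(node))             # the feature tested at this node
        node = node[feature][sample[feature]]  # move to the matching branch
    return node


# indent=4 prints one key per line with nested indentation.
print(json.dumps(tree, indent=4, ensure_ascii=False))
print(classify(tree, {"Income": "high", "Age": "36-55"}))  # will buy
```

As in the original `classify`, a value that was never seen during training raises a `KeyError`, because the tree has no branch for it.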

Code and test data download:
Link: http://pan.baidu.com/s/1hsOTpfm  Password: en6g

References:
1. 《机器学习实战》 (Machine Learning in Action)
2. http://www.onlamp.com/pub/a/python/2006/02/09/ai_decision_trees.html?page=1
