Machine Learning Notes, Part 1

Source: Internet | Published by: 王鹏博士杨浦云计算 | Editor: 程序博客网 | Date: 2024/05/22 13:07

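Main algorithms covered:

k-nearest neighbors (k-NN)
Linear regression
Naive Bayes
Locally weighted linear regression
Support vector machines
Ridge regression
Decision trees
Lasso minimum regression coefficient estimation
k-means
Expectation-maximization (EM) algorithm
DBSCAN
Parzen window design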

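Main steps in machine learning:

Collect the data
Prepare the input data
Analyze the input data
Train the algorithm
Test the algorithm
Use the algorithm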

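k-nearest neighbors (k-NN)

Idea:

The algorithm is straightforward. Compare the input sample against the existing labeled samples, take the k samples that match best under the chosen distance measure, and assign the input the class that occurs most often among those k.

In short: find the ones that look most alike.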

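Steps:

Compute the distance between the current point and every point in the labeled dataset
Sort by increasing distance
Take the k points closest to the current point
Count how often each class occurs among those k points
Return the most frequent class as the prediction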
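
The five steps above can be sketched in a few lines of Python. This is a minimal illustration, not the book's implementation: the function name `knn_classify`, the use of Euclidean distance via `math.dist` (Python 3.8+), and the sample data in the usage lines are all assumptions made for the example.

```python
import math
from collections import Counter


def knn_classify(query, samples, labels, k):
    """Classify `query` by majority vote among its k nearest samples."""
    # Step 1: distance from the query to every labeled sample (Euclidean).
    distances = [math.dist(query, sample) for sample in samples]
    # Step 2: sort sample indices by increasing distance.
    order = sorted(range(len(samples)), key=lambda i: distances[i])
    # Steps 3-4: take the k nearest and count how often each class occurs.
    votes = Counter(labels[i] for i in order[:k])
    # Step 5: return the most frequent class.
    return votes.most_common(1)[0][0]


# Hypothetical data: two clusters labeled "x" and "y".
samples = [(0, 0), (0, 1), (5, 5), (6, 5)]
labels = ["x", "x", "y", "y"]
print(knn_classify((5, 6), samples, labels, k=3))  # y
```

Computing every distance on each query is what makes the method expensive on large datasets, since nothing is learned ahead of time.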

Example:

Data point    Class
(1, 1)        A
(1.1, 1)      A
(2, 2)        B
(1.9, 2.1)    B
(1.1, 1.1)    ?

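The distance measure here is the Euclidean distance between points in the 2D plane, with k = 2 (k is usually no larger than 20).

Distance from the unknown point (1.1, 1.1) to the first point (1, 1): $l_1 = \sqrt{0.1^2 + 0.1^2} \approx 0.141$

Sorting by $l$ gives (1.1, 1) at 0.100, (1, 1) at about 0.141, (2, 2) at about 1.273, and (1.9, 2.1) at about 1.281. The two nearest points are both class A, so the unknown point is predicted to be class A.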
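
To check the arithmetic in this example, here is a short sketch that computes all four distances and takes the majority class among the k = 2 nearest points. The variable names are illustrative only, and `math.dist` requires Python 3.8 or later.

```python
import math

# Labeled points from the example and the query point to classify.
points = [((1.0, 1.0), "A"), ((1.1, 1.0), "A"), ((2.0, 2.0), "B"), ((1.9, 2.1), "B")]
query = (1.1, 1.1)
k = 2

# Pair each distance with its class and sort by increasing distance.
ranked = sorted((math.dist(query, p), label) for p, label in points)
for d, label in ranked:
    print(f"{label}: {d:.3f}")

# Majority vote among the k nearest neighbors.
nearest_labels = [label for _, label in ranked[:k]]
prediction = max(set(nearest_labels), key=nearest_labels.count)
print("prediction:", prediction)
```

This prints the distances 0.100, 0.141, 1.273, and 1.281 in that order, followed by `prediction: A`.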

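Characteristics

Pros: high accuracy, insensitive to outliers, no assumptions about the input data
Cons: high computational cost and high memory cost. It cannot "understand" the nature of the data: it gives no information about the underlying structure, so we never learn what an average or typical sample of a class looks like.
Works with: numeric and nominal values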

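Decision trees

Idea:

Split the training data according to its features, using entropy (how inconsistent the data in a set is) to decide which split comes first. The result is a tree, similar to a flowchart with terminating blocks, that is read from the top down.

General principle for splitting a dataset: make unordered data more ordered.

Information gain: the change in information before and after the dataset is split.

Entropy: the measure of information in a set is called Shannon entropy, or entropy for short. It is the expected value of the information.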

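$l(x_i) = -\log_2 p(x_i)$ defines the information of $x_i$, where $p(x_i)$ is the probability of choosing that class.

$H = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$ is the entropy over all $n$ classes. The $-\log_2 p(x_i)$ factor is the information $l(x_i)$ defined above, so $H$ is simply the expected value of that information. This is the measure Shannon came up with for information entropy; the higher the entropy, the more mixed up the data.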
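
As a quick numerical check of the entropy formula, the sketch below computes $H$ directly from a list of class labels. The function name `shannon_entropy` is made up for this illustration; the first label list matches the class column of the sample dataset used later in this article (two "yes", three "no").

```python
from collections import Counter
from math import log2


def shannon_entropy(labels):
    """Return H = -sum(p * log2(p)) over the class proportions in `labels`."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * log2(n / total) for n in counts.values())


print(round(shannon_entropy(["yes", "yes", "no", "no", "no"]), 3))  # 0.971
print(shannon_entropy(["a", "b"]))  # 1.0
```

An even two-way split gives exactly 1 bit, the maximum for two classes, while a set containing a single class has entropy zero.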

Gini impurity: measures the probability that an item is misclassified into another group.
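
The notes do not give a formula for Gini impurity. The sketch below assumes the standard definition, $1 - \sum_i p(x_i)^2$, which is the probability of mislabeling a randomly drawn item when labels are assigned at random according to the class proportions. The function name `gini_impurity` is illustrative.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 - sum(p^2) over the class proportions in `labels`."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


print(f"{gini_impurity(['yes', 'yes', 'no', 'no', 'no']):.2f}")  # 0.48
print(gini_impurity(["yes", "yes"]))  # 0.0
```

Like entropy, the value is zero for a pure set and grows as the classes become more evenly mixed.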

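Steps:

Split the dataset and build the decision tree according to entropy; at each level, prefer the split that leaves the lowest entropy, i.e. the largest information gain
Then walk the finished tree with a new sample to obtain its class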
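
The second step, walking the tree to classify a sample, is not covered by the code in these notes. Here is a minimal sketch, assuming the tree is stored as nested dictionaries of the form `{feature_name: {feature_value: subtree_or_class}}`. The function name `classify` and the hand-written tree literal are illustrative assumptions.

```python
def classify(tree, feature_names, sample):
    """Walk a nested-dict decision tree from the root down to a leaf."""
    # A leaf is any non-dict value: it is the predicted class.
    if not isinstance(tree, dict):
        return tree
    # An internal node is {feature_name: {feature_value: subtree, ...}}.
    feature = next(iter(tree))
    branches = tree[feature]
    value = sample[feature_names.index(feature)]
    # Raises KeyError if the sample has a value never seen during training.
    return classify(branches[value], feature_names, sample)


# Hand-written tree for illustration, in the nested-dict form assumed above.
tree = {"no surfacing": {0: "no", 1: {"flippers": {0: "no", 1: "yes"}}}}
names = ["no surfacing", "flippers"]
print(classify(tree, names, [1, 1]))  # yes
print(classify(tree, names, [1, 0]))  # no
print(classify(tree, names, [0, 1]))  # no
```

The lookup follows one branch per level, so classifying a sample costs at most the depth of the tree.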

Example:

from math import log
import operator
import matplotlib.pyplot as plt


def calcShannonEnt(dataSet):
    # Compute the Shannon entropy of the class labels (last column)
    numberEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        curLabel = featVec[-1]
        if curLabel not in labelCounts:
            labelCounts[curLabel] = 0
        labelCounts[curLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numberEntries
        shannonEnt -= prob * log(prob, 2)  # H = -sum(p * log2(p))
    return shannonEnt


def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels


# axis is the index of the feature to split on, values is the feature value to keep
def splitDataSet(dataSet, axis, values):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == values:
            # Keep the row, minus the feature that was split on
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis + 1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet


def chooseBestFeatureToSplit(dataSet):
    # Pick the feature whose split yields the largest information gain
    numberFeatures = len(dataSet[0]) - 1
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numberFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy  # reduction in entropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature


def majorityCnt(classList):
    # Return the class label that occurs most often
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]


def createTree(dataSet, labels):
    # Build the decision tree recursively
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):  # all samples share one class
        return classList[0]
    if len(dataSet[0]) == 1:  # no features left to split on
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    # Remove the used feature from a copy so the caller's list stays intact
    subLabels = labels[:]
    del subLabels[bestFeat]
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        myTree[bestFeatLabel][value] = createTree(
            splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree


# The code below draws the decision tree. I do not fully understand it myself,
# but it can be copied as a template.
def getNumLeafs(myTree):
    numLeafs = 0
    firstStr = list(myTree)[0]
    secondDict = myTree[firstStr]
    for key in secondDict:
        if isinstance(secondDict[key], dict):  # the value is a subtree
            numLeafs += getNumLeafs(secondDict[key])
        else:
            numLeafs += 1
    return numLeafs


def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    createPlot.ax1.annotate(nodeTxt, xy=parentPt, xycoords="axes fraction",
                            xytext=centerPt, textcoords="axes fraction",
                            va="center", ha="center", bbox=nodeType,
                            arrowprops=arrow_args)


def getTreeDepth(myTree):
    maxDepth = 0
    firstStr = list(myTree)[0]
    secondDict = myTree[firstStr]
    for key in secondDict:
        if isinstance(secondDict[key], dict):
            thisDepth = 1 + getTreeDepth(secondDict[key])
        else:
            thisDepth = 1
        if thisDepth > maxDepth:
            maxDepth = thisDepth
    return maxDepth


def plotMidText(cntrPt, parentPt, txtString):
    # Label the edge halfway between parent and child
    xMid = (parentPt[0] - cntrPt[0]) / 2.0 + cntrPt[0]
    yMid = (parentPt[1] - cntrPt[1]) / 2.0 + cntrPt[1]
    createPlot.ax1.text(xMid, yMid, txtString)


def plotTree(myTree, parentPt, nodeTxt):
    numLeafs = getNumLeafs(myTree)
    firstStr = list(myTree)[0]
    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs)) / 2.0 / plotTree.totalW,
              plotTree.yOff)
    plotMidText(cntrPt, parentPt, nodeTxt)
    plotNode(firstStr, cntrPt, parentPt, decisionNode)
    secondDict = myTree[firstStr]
    plotTree.yOff = plotTree.yOff - 1.0 / plotTree.totalD
    for key in secondDict:
        if isinstance(secondDict[key], dict):
            plotTree(secondDict[key], cntrPt, str(key))
        else:
            plotTree.xOff = plotTree.xOff + 1.0 / plotTree.totalW
            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff),
                     cntrPt, leafNode)
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
    plotTree.yOff = plotTree.yOff + 1.0 / plotTree.totalD


def createPlot(inTree):
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)
    plotTree.totalW = float(getNumLeafs(inTree))
    plotTree.totalD = float(getTreeDepth(inTree))
    # Start half a leaf-width left of the axes so the leaves end up centered,
    # and place the root at the top center so it has no visible incoming arrow
    plotTree.xOff = -0.5 / plotTree.totalW
    plotTree.yOff = 1.0
    plotTree(inTree, (0.5, 1.0), '')
    plt.show()


decisionNode = dict(boxstyle="sawtooth", fc="0.8")
leafNode = dict(boxstyle="round4", fc="0.8")
arrow_args = dict(arrowstyle="<-")

dataSet, labels = createDataSet()
# print(calcShannonEnt(dataSet))
# print(chooseBestFeatureToSplit(dataSet))
myTree = createTree(dataSet, labels)
print(myTree)
createPlot(myTree)

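Characteristics:

Pros: the representation of the data is very easy to understand. Computational cost is low, the output is easy to interpret, the method is insensitive to missing intermediate values, and it can handle irrelevant features.
Cons: prone to overfitting, which produces a large number of nodes and makes the classification overly complicated.
Works with: numeric and nominal values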

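The essence of the algorithm:

As I understand it, the essence of this algorithm is splitting sets by entropy, so that the resulting subsets are separated along some feature. It is simple and easy to understand.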
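
To make "splitting sets by entropy" concrete, the sketch below computes the information gain of each feature on the five-row sample dataset from this article. The helper names `entropy` and `information_gain` are illustrative and separate from the functions in the example code.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, feature_index):
    """Entropy of the class column minus the weighted entropy after splitting."""
    base = entropy([row[-1] for row in rows])
    after = 0.0
    for value in {row[feature_index] for row in rows}:
        subset = [row[-1] for row in rows if row[feature_index] == value]
        after += len(subset) / len(rows) * entropy(subset)
    return base - after


rows = [[1, 1, "yes"], [1, 1, "yes"], [1, 0, "no"], [0, 1, "no"], [0, 1, "no"]]
for i, name in enumerate(["no surfacing", "flippers"]):
    print(f"{name}: {information_gain(rows, i):.3f}")
```

This prints a gain of 0.420 for "no surfacing" and 0.171 for "flippers", so "no surfacing" is the feature to split on first.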

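A main task of machine learning is classification.

Note: this article is my personal notes and excerpts from studying 《机器学习实战》 (人民邮电出版社). It serves as my own notebook and as a reference for others working through the same material.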
