Learning Python - Machine Learning in Action - ch04 Bayes

Source: Internet · Published by: 人参淘宝 · Editor: 程序博客网 · Time: 2024/06/07 16:19

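I couldn't make progress on my thesis, so I escaped into studying this instead.

The first step is always the hardest, so be brave and take it.

Keep going!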

========================================================================================

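I won't go over the theory behind Bayes here; there is plenty of material online.

First, create a dataset. The book uses document classification as its running example.

def loadDataSet():
    # Six short posts, each already split into word tokens
    postingList = [['my', 'dog', 'has', 'flea', 'problem', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    # One label per post; 1 is the "abusive" class (see pAbusive in trainNB0)
    classVec = [0, 1, 0, 1, 0, 1]
    return postingList, classVec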

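The function above creates a small dataset of six documents, each with its own class label (this example has only two classes, 0 and 1).

def createVocabList(dataset):
    vocabSet = set()
    # Loop over every document in the dataset; set() removes duplicate words
    for document in dataset:
        vocabSet = vocabSet | set(document)  # set union
    return list(vocabSet)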

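This function turns the document collection into a vocabulary: a list of every distinct word that appears anywhere in the collection.

Bayesian document classification converts each document into a (feature) vector based on this vocabulary. Each entry is 0 or 1, meaning the word is absent or present.

def setOfWords2Vec(vocabList, inputSet):
    # Start from an all-zero vector as long as the vocabulary
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec

# The function first creates a vector as long as the vocabulary,
# then marks whether each vocabulary word appears in the document,
# which turns the document into a word vector.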
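To see what the vocabulary and the 0/1 vector look like, here is a minimal sketch on a two-document toy input. The snake_case helper names and the toy documents are my own for illustration, and the vocabulary is sorted only so that the indices come out the same on every run (the functions above return it in arbitrary set order).

```python
def create_vocab_list(dataset):
    """Union of all words in all documents, sorted for reproducible indices."""
    vocab = set()
    for document in dataset:
        vocab |= set(document)
    return sorted(vocab)


def set_of_words_to_vec(vocab_list, input_set):
    """Return a 0/1 vector: 1 where the vocabulary word occurs in the document."""
    index = {word: i for i, word in enumerate(vocab_list)}
    vec = [0] * len(vocab_list)
    for word in input_set:
        if word in index:
            vec[index[word]] = 1  # presence only; repeats do not add up
    return vec


docs = [['my', 'dog', 'has', 'flea'], ['stupid', 'dog']]
vocab = create_vocab_list(docs)
vec = set_of_words_to_vec(vocab, ['stupid', 'dog', 'dog'])
print(vocab)  # ['dog', 'flea', 'has', 'my', 'stupid']
print(vec)    # [1, 0, 0, 0, 1]
```

Note that 'dog' occurs twice in the input but still maps to 1, which is what makes this a set-of-words model.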

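The naive Bayes classifier training function:

from numpy import *  # zeros, ones, log, array and random all come from NumPy

def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)   # number of documents in the training set
    numWords = len(trainMatrix[0])    # vocabulary length, taken from the first row
    # Probability of class 1; only valid when the classes are exactly 0 and 1
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    # pXNum is a vector as long as the vocabulary that counts each word's occurrences
    p0Num = zeros(numWords)
    p1Num = zeros(numWords)
    # pXDenom is the total number of words seen in class X
    p0Denom = 0.0
    p1Denom = 0.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # Frequency of each vocabulary word within each class
    p1Vec = p1Num / p1Denom
    p0Vec = p0Num / p0Denom
    return p0Vec, p1Vec, pAbusive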

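# First, get clear on what the parameters mean.
# Putting the earlier functions together: postingList is the collection of documents, one document per row, so the number of rows is the number of documents.
# classVec has as many values as there are documents and gives each document's class.
# createVocabList merges the documents into a vocabulary with no repeated words.
# setOfWords2Vec maps the words of one document onto the vocabulary and produces a vector.
# The first parameter of this function (trainMatrix) holds that vector for every document: an n*m matrix, where n is the number of documents and m is the vocabulary length.
# trainCategory is a vector holding the class of each document.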

Test code:

from numpy import *
import bayes

listPost, listClass = bayes.loadDataSet()
myVoc = bayes.createVocabList(listPost)
# Convert every post into its vocabulary vector
trainMat = []
for postinDoc in listPost:
    trainMat.append(bayes.setOfWords2Vec(myVoc, postinDoc))
p0V, p1V, pAb = bayes.trainNB0(trainMat, listClass)
print(myVoc)
print(p0V)
print(p1V)
print(pAb)

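The naive Bayes training function outputs:

the probability of each vocabulary word within each class (a conditional probability, P(word | class), not a prior)

the probability of each class (the prior probability)

In this example pAb comes out as 0.5, meaning classes 0 and 1 are equally likely.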
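As a sanity check, both kinds of output can be recomputed by hand for the six sample posts. The sketch below is plain Python rather than the NumPy code above, and the variable names are my own; it follows the unsmoothed counting in trainNB0, where each post contributes its set of distinct words.

```python
posting_list = [
    ['my', 'dog', 'has', 'flea', 'problem', 'help', 'please'],
    ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
    ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
    ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
    ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
    ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid'],
]
class_vec = [0, 1, 0, 1, 0, 1]

# Class prior P(class = 1): the fraction of posts labelled 1
p_abusive = sum(class_vec) / len(class_vec)

# Unsmoothed P('stupid' | class = 1):
# (class-1 posts containing the word) / (total distinct words over class-1 posts)
class1_docs = [set(doc) for doc, label in zip(posting_list, class_vec) if label == 1]
numerator = sum('stupid' in doc for doc in class1_docs)
denominator = sum(len(doc) for doc in class1_docs)
p_stupid_given_1 = numerator / denominator

print(p_abusive)               # 0.5
print(numerator, denominator)  # 3 19
print(p_stupid_given_1)        # 3/19, roughly 0.158
```

The entry for 'stupid' in p1V should match that 3/19 value, wherever 'stupid' happens to land in the vocabulary.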

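Adjustments for real-world use:

1. Initialization

To classify a document, Bayes needs the product of many probabilities to get the probability that the document belongs to a given class.

That is, within each class, multiply together the probabilities of every word in the document to get the probability of the whole document under that class.

But if any one of those probabilities is 0, the whole product is 0 as well. So the book initializes every word count to 1 and each denominator to 2.

    # every entry of pXNum now starts at 1 instead of 0
    p0Num = ones(numWords)
    p1Num = ones(numWords)
    p0Denom = 2.0
    p1Denom = 2.0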
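A small worked example of why this initialization matters, using counts from the sample data: class 0 has 24 words in total, 'my' occurs in 3 of its posts and 'stupid' in none. The helper word_probability is hypothetical, written only to contrast the two initializations.

```python
def word_probability(count, total, smoothed):
    """Estimate P(word | class) with or without the 1 / 2 initialization."""
    if smoothed:
        return (count + 1) / (total + 2)  # counts start at 1, denominator at 2
    return count / total


total_class0 = 24                  # words seen in the three class-0 posts
counts = {'my': 3, 'stupid': 0}    # occurrences within class 0

unsmoothed = 1.0
smoothed = 1.0
for word in ['my', 'stupid']:
    unsmoothed *= word_probability(counts[word], total_class0, smoothed=False)
    smoothed *= word_probability(counts[word], total_class0, smoothed=True)

print(unsmoothed)  # 0.0, because 'stupid' was never seen in class 0
print(smoothed)    # (4/26) * (1/26), small but not zero
```

With the unsmoothed estimate, a single unseen word wipes out the evidence from every other word in the document.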

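2. Underflow

Multiplying many very small numbers together easily causes underflow: the result eventually rounds to 0.

The fix is to take the logarithm of the product:

ln(a*b)=ln(a)+ln(b)

In the code this becomes:

    p1Vec = log(p1Num / p1Denom)
    p0Vec = log(p0Num / p0Denom)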
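The underflow is easy to reproduce. In this sketch the 100 probabilities of 1e-5 are made-up numbers, chosen only because their product (1e-500) is far below what a double can represent.

```python
import math

probs = [1e-5] * 100  # illustrative values, not taken from the dataset

# Direct product: underflows to exactly 0.0
product = 1.0
for p in probs:
    product *= p

# Sum of logs: the same quantity on a log scale, and perfectly representable
log_sum = sum(math.log(p) for p in probs)

print(product)  # 0.0
print(log_sum)  # about -1151.29
```

Because the logarithm is monotonic, comparing log-probabilities between classes picks the same winner as comparing the raw products would.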

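Finally, put the steps above together to do the classification.

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # vec2Classify must be a NumPy array so that * is element-wise;
    # summing log-probabilities corresponds to multiplying probabilities
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0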

For the vector being tested, compute the (log) probability for each class; the class with the larger value is the classification result.

def testingNB():
    listPost, listClass = loadDataSet()
    myVoc = createVocabList(listPost)
    trainMat = []
    for postinDoc in listPost:
        trainMat.append(setOfWords2Vec(myVoc, postinDoc))
    p0V, p1V, pAb = trainNB0(trainMat, listClass)
    # First test case
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVoc, testEntry))
    print(testEntry, ' classified as ', classifyNB(thisDoc, p0V, p1V, pAb))
    # Second test case
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVoc, testEntry))
    print(testEntry, ' classified as ', classifyNB(thisDoc, p0V, p1V, pAb))

This ties all the steps above together and runs two test cases.

Test results:
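As a cross-check that does not depend on NumPy, here is a self-contained pure-Python sketch of the same pipeline, with the 1 / 2 initialization and the log transform applied. The snake_case names are mine, and this is a re-implementation for illustration rather than the bayes.py code itself.

```python
import math


def load_data_set():
    posting_list = [
        ['my', 'dog', 'has', 'flea', 'problem', 'help', 'please'],
        ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
        ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
        ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
        ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
        ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid'],
    ]
    return posting_list, [0, 1, 0, 1, 0, 1]


def to_vec(vocab, words):
    """Set-of-words vector: 1 if the vocabulary word occurs in the document."""
    present = set(words)
    return [1 if word in present else 0 for word in vocab]


def train(train_matrix, labels):
    """Return per-class log P(word | class) vectors and the prior of class 1."""
    num_words = len(train_matrix[0])
    p_class1 = sum(labels) / len(labels)
    num = {0: [1.0] * num_words, 1: [1.0] * num_words}  # counts start at 1
    denom = {0: 2.0, 1: 2.0}                            # denominators start at 2
    for vec, label in zip(train_matrix, labels):
        num[label] = [n + x for n, x in zip(num[label], vec)]
        denom[label] += sum(vec)
    p0_vec = [math.log(n / denom[0]) for n in num[0]]
    p1_vec = [math.log(n / denom[1]) for n in num[1]]
    return p0_vec, p1_vec, p_class1


def classify(vec, p0_vec, p1_vec, p_class1):
    """Pick the class with the larger log-posterior (up to a shared constant)."""
    p1 = sum(x * w for x, w in zip(vec, p1_vec)) + math.log(p_class1)
    p0 = sum(x * w for x, w in zip(vec, p0_vec)) + math.log(1.0 - p_class1)
    return 1 if p1 > p0 else 0


posts, labels = load_data_set()
vocab = sorted({word for post in posts for word in post})
p0_vec, p1_vec, p_class1 = train([to_vec(vocab, post) for post in posts], labels)

result_a = classify(to_vec(vocab, ['love', 'my', 'dalmation']), p0_vec, p1_vec, p_class1)
result_b = classify(to_vec(vocab, ['stupid', 'garbage']), p0_vec, p1_vec, p_class1)
print(result_a)  # 0
print(result_b)  # 1
```

The first entry lands in class 0 because none of its words ever occur in a class-1 post, while 'stupid' and 'garbage' occur only in class-1 posts.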

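Filtering spam email with naive Bayes

def textParse(bigString):
    import re
    # Split on runs of non-word characters (uppercase \W) using a regular expression.
    # \W+ rather than \W*: on Python 3.7+ a pattern that can match the empty
    # string splits between every character.
    listOfTokens = re.split(r'\W+', bigString)
    # Lowercase everything and drop tokens of two characters or fewer
    return [token.lower() for token in listOfTokens if len(token) > 2]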
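The case of that one letter matters a great deal. A minimal sketch with a made-up sentence shows what each pattern leaves after the length filter; the behaviour of `\W*` shown here is that of Python 3.7 and later.

```python
import re

text = 'Hello, World! Buy cheap meds now.'  # made-up sample text


def tokens(pattern):
    """Split on the pattern, then keep lowercased pieces longer than 2 chars."""
    return [t.lower() for t in re.split(pattern, text) if len(t) > 2]


upper_plus = tokens(r'\W+')  # split on runs of non-word characters
lower_plus = tokens(r'\w+')  # split on the words themselves: only separators remain
upper_star = tokens(r'\W*')  # can match the empty string: splits between characters

print(upper_plus)  # ['hello', 'world', 'buy', 'cheap', 'meds', 'now']
print(lower_plus)  # []
print(upper_star)  # []
```

With a lowercase `\w` the words are treated as the delimiters, so only punctuation and spaces survive the split and the length filter then discards all of it.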

Here I made a mistake: in the regular expression I typed a lowercase w, so the results were wrong. How did I make such a silly mistake?

def spamTest():
    docList = []; classList = []; fullText = []
    for i in range(1, 26):
        # Forward slashes avoid invalid escape sequences such as '\s' in the path
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)  # positive example (spam)
        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)  # negative example (ham)
    vocabulary = createVocabList(docList)
    trainingSet = list(range(50))
    testSet = []
    # Randomly pick 10 documents as the test set; the rest form the training set
    for i in range(10):
        # random here is numpy.random (from numpy import *);
        # random.uniform(a, b) returns a random float in the given range
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del trainingSet[randIndex]
    trainMat = []; trainClasses = []
    # Collect the vectors and labels of the selected training documents
    for docIndex in trainingSet:
        trainMat.append(setOfWords2Vec(vocabulary, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNB0(trainMat, trainClasses)
    errorCount = 0
    for docIndex in testSet:
        wordVector = setOfWords2Vec(vocabulary, docList[docIndex])
        # If the predicted class differs from the true class, count an error
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
    print('the error rate is:', float(errorCount) / len(testSet))

I changed one thing. Following the book's code as written raises an error at trainingSet:

del(trainingSet[randIndex])
TypeError: 'range' object doesn't support item deletion

So I changed it to a list when initializing it.

def calcMostFreq(vocabulary, fulltext):
    import operator
    freqDict = {}
    # Count how often each vocabulary word occurs in the full text
    for token in vocabulary:
        freqDict[token] = fulltext.count(token)
    sortedFreq = sorted(freqDict.items(), key=operator.itemgetter(1), reverse=True)
    return sortedFreq[:30]  # the 30 most frequent words

def localWords(feed1, feed0):
    import feedparser
    docList = []; classList = []; fullText = []
    minlen = min(len(feed1['entries']), len(feed0['entries']))
    # The two RSS feeds serve as the positive and negative examples
    for i in range(minlen):
        wordList = textParse(feed1['entries'][i]['summary'])
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(feed0['entries'][i]['summary'])
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabulary = createVocabList(docList)              # build the vocabulary
    top30Words = calcMostFreq(vocabulary, fullText)    # the 30 most frequent words
    # Remove those top-30 words from the vocabulary
    for pairW in top30Words:
        if pairW[0] in vocabulary:
            vocabulary.remove(pairW[0])
    trainingSet = list(range(2 * minlen)); testSet = []
    # Randomly split into training and test sets; the test set has 20 documents
    for i in range(20):
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat = []; trainClass = []
    # Convert the training documents into word-count features
    for docIndex in trainingSet:
        trainMat.append(bagOfWords2VecMN(vocabulary, docList[docIndex]))
        trainClass.append(classList[docIndex])
    p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClass))
    errorCount = 0
    for docIndex in testSet:
        wordVector = bagOfWords2VecMN(vocabulary, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
    print('the error rate is: ', float(errorCount) / len(testSet))
    return vocabulary, p0V, p1V
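localWords calls bagOfWords2VecMN, which is not listed anywhere in this post. Judging by its name and by the "word-count features" comment, it is the bag-of-words counterpart of setOfWords2Vec, so a minimal sketch under that assumption would be:

```python
def bagOfWords2VecMN(vocabList, inputSet):
    """Bag-of-words vector: each entry counts how often the word occurs.

    Assumed to mirror setOfWords2Vec, except that it increments instead of
    setting the entry to 1; words outside the vocabulary are skipped silently.
    """
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec


example = bagOfWords2VecMN(['dog', 'stupid', 'my'], ['dog', 'dog', 'my', 'cat'])
print(example)  # [2, 0, 1]
```

Unlike the set-of-words vector, repeated words now carry more weight, which suits longer texts such as RSS summaries.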

Here again I changed one line:

 trainingSet=list(range(2*minlen))

I wonder whether others studying this ran into the same problem. Is this the right way to handle it?
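For what it's worth, the error is expected on Python 3: there `range()` returns an immutable range object instead of a list, so wrapping it in `list()` is a reasonable fix. A minimal reproduction, with variable names of my own:

```python
training_set = range(10)  # on Python 3 this is a range object, not a list
try:
    del training_set[3]
    error_message = None
except TypeError as exc:
    error_message = str(exc)

training_list = list(range(10))  # an actual list supports item deletion
del training_list[3]

print(error_message)
print(training_list)  # [0, 1, 2, 4, 5, 6, 7, 8, 9]
```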

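Test code:

import feedparser
import bayes

ny = feedparser.parse('http://newyork.craigslist.org/stp/index.rss')
sf = feedparser.parse('http://sfbay.craigslist.org/stp/index.rss')
# ny is passed as feed1 (class 1) and sf as feed0 (class 0),
# so the returned p0V belongs to SF and p1V to NY
vocabulary, pSF, pNY = bayes.localWords(ny, sf)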

The result changes from run to run, because the randomly drawn test and training sets are different each time.
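If repeatable numbers are wanted while debugging, one option is to seed the split. The sketch below uses the standard-library random module, not the numpy.random used in the code above, and split_indices is a hypothetical helper of mine.

```python
import random


def split_indices(n_docs, n_test, seed=None):
    """Randomly hold out n_test of n_docs indices; same seed, same split."""
    rng = random.Random(seed)
    test = rng.sample(range(n_docs), n_test)  # sampling without replacement
    held_out = set(test)
    train = [i for i in range(n_docs) if i not in held_out]
    return train, test


train_idx, test_idx = split_indices(50, 10, seed=0)
again_train, again_test = split_indices(50, 10, seed=0)
print(len(train_idx), len(test_idx))  # 40 10
print(test_idx == again_test)         # True
```

Averaging the error rate over several different seeds gives a steadier estimate than any single split.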

Displaying the most indicative words:

def getTopWord(ny, sf):
    import operator
    vocabulary, p0V, p1V = localWords(ny, sf)
    topNY = []; topSF = []
    # Keep the words whose log-probability is above the -6.0 threshold
    for i in range(len(p0V)):
        if p0V[i] > -6.0: topSF.append((vocabulary[i], p0V[i]))
        if p1V[i] > -6.0: topNY.append((vocabulary[i], p1V[i]))
    # key=lambda pair: pair[1] sorts by the second item of each tuple
    sortedSF = sorted(topSF, key=lambda pair: pair[1], reverse=True)
    print("SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF")
    for item in sortedSF:
        print(item[0])
    sortedNY = sorted(topNY, key=lambda pair: pair[1], reverse=True)
    print("NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY")
    for item in sortedNY:
        print(item[0])

=========================================================================================

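Downloading and installing feedparser

Download link: (the link is not preserved here)

To install: first change into the downloaded folder,

then run the command python setup.py install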

0 0