Naive Bayes: Computing Probabilities from Word Vectors


Pseudocode for the training function:

Count the number of documents in each class
For each training document:
    For each class:
        If a token appears in the document → increment the count for that token
        Increment the total token count
For each class:
    For each token:
        Divide that token's count by the total token count to get the conditional probability
Return the conditional probabilities for each class
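
In symbols, the pseudocode estimates the class prior and the per-word conditional probabilities as follows (notation added here, not part of the original post):

P(c = 1) = \frac{\#\ \text{abusive documents}}{\#\ \text{documents}},
\qquad
\hat{P}(w_i \mid c) = \frac{\text{count of } w_i \text{ in class-}c \text{ documents}}{\text{total word count in class-}c \text{ documents}}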

Full code:

# -*- coding: utf-8 -*-
from numpy import *

def loadDataSet():
    """Return the toy posting list and its class labels (1 = abusive, 0 = normal)."""
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]
    return postingList, classVec

def createVocaList(dataSet):
    """Build the vocabulary as the union of all words in the dataset."""
    vocabSet = set([])
    for document in dataSet:
        vocabSet = vocabSet | set(document)
    return list(vocabSet)

def setOfWordds2Vec(vocabList, inputSet):
    """Convert a document into a 0/1 vector over the vocabulary (set-of-words model)."""
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print "The word: %s is not in my Vocabulary!" % word
    return returnVec

def trainNB0(trainMatrix, trainCategory):
    # The matrix has 6 rows, one per training document
    numTrainDocs = len(trainMatrix)
    # Each row has 32 elements, one per vocabulary word
    numWords = len(trainMatrix[0])
    # Fraction of abusive documents among all documents: 0.5
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    # 1-D arrays of 32 elements holding the per-word counts for each class
    p0Num = zeros(numWords)
    p1Num = zeros(numWords)
    p0Denom = 0.0; p1Denom = 0.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            # Class 1 (abusive): accumulate the document's word vector
            p1Num += trainMatrix[i]
            # Add up all words in the vector, i.e. the total number of abusive-class words
            p1Denom += sum(trainMatrix[i])
        else:
            # Class 0 (normal): accumulate the document's word vector
            p0Num += trainMatrix[i]
            # Add up all words in the vector, i.e. the total number of normal-class words
            p0Denom += sum(trainMatrix[i])
    p1Vect = p1Num / p1Denom
    p0Vect = p0Num / p0Denom
    # Return the probability of each vocabulary word given the document class
    return p0Vect, p1Vect, pAbusive

listOPost, listClasses = loadDataSet()
myVocaList = createVocaList(listOPost)
returnVec = setOfWordds2Vec(myVocaList, listOPost[0])
trainMat = []
for postinDoc in listOPost:
    trainMat.append(setOfWordds2Vec(myVocaList, postinDoc))
p0Vect, p1Vect, pAbusive = trainNB0(trainMat, listClasses)
print p0Vect
print p1Vect
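
Before reading the output, it helps to hand-check the denominators that trainNB0 computes. With set-of-words vectors each document contributes one count per distinct word, and since no word repeats inside any of these documents, the class totals are simply the document lengths. A quick sketch of that check (the class0_total / class1_total names are mine; this is meant to be appended after the script above):

# Sanity check of the per-class word totals used as denominators (illustration only)
class0_total = sum(len(doc) for doc, c in zip(listOPost, listClasses) if c == 0)
class1_total = sum(len(doc) for doc, c in zip(listOPost, listClasses) if c == 1)
print(class0_total)   # 24 = 7 + 8 + 9, so one occurrence gives 1/24 = 0.04166667
print(class1_total)   # 19 = 8 + 5 + 6, so one occurrence gives 1/19 = 0.05263158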
Looking at the results: these are the probabilities of each vocabulary word appearing, conditioned on the document class.

p0Vect:

[ 0.04166667  0.04166667  0.04166667  0.          0.          0.04166667
  0.04166667  0.04166667  0.          0.04166667  0.04166667  0.04166667
  0.04166667  0.          0.          0.08333333  0.          0.
  0.04166667  0.          0.04166667  0.04166667  0.          0.04166667
  0.04166667  0.04166667  0.          0.04166667  0.          0.04166667
  0.04166667  0.125     ]
p1Vect:

[ 0.          0.          0.          0.05263158  0.05263158  0.          0.
  0.          0.05263158  0.05263158  0.          0.          0.
  0.05263158  0.05263158  0.05263158  0.05263158  0.05263158  0.
  0.10526316  0.          0.05263158  0.05263158  0.          0.10526316
  0.          0.15789474  0.          0.05263158  0.          0.          0.        ]

From the results, we can see that the first word in the vocabulary is cute; it appears once in class 0 and never in class 1, so its conditional probabilities are 0.04166667 and 0, respectively.
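
Because createVocaList builds the vocabulary from a Python set, the word order can differ between runs, so cute is not guaranteed to land in the first position; looking the word up by index is more reliable. A small check, reusing the variables computed by the script above:

idx = myVocaList.index('cute')   # position of 'cute' in this run's vocabulary
print(p0Vect[idx])   # 1/24 ≈ 0.0417: 'cute' occurs once among the 24 class-0 words
print(p1Vect[idx])   # 0.0: 'cute' never occurs in a class-1 (abusive) document

The zeros in these vectors are also why the usual next refinement initializes the counts to ones and the denominators to 2.0 (Laplace smoothing) and works with log probabilities, so that a single unseen word does not drive the whole product of probabilities to zero.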