Naive Bayes


Characteristics of the Naive Bayes Algorithm

Pros: remains effective when training data is scarce, and handles multi-class problems.
Cons: sensitive to how the input data is prepared.
Data type: nominal (categorical) data.

Core Idea of Naive Bayes

  1. Bayes' theorem: P(c|x) = P(c) P(x|c) / P(x)
  2. The "attribute conditional independence assumption": given the class, all attributes are assumed mutually independent.
  3. Combining 1 and 2: P(c|x) = P(c) P(x|c) / P(x) = (P(c) / P(x)) ∏_{i=1}^{d} P(x_i|c)
  4. Decision rule: y = argmax_{c ∈ Y} P(c) ∏_{i=1}^{d} P(x_i|c)
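The decision rule above can be illustrated with a tiny numeric sketch. The priors and conditional probabilities below are made-up values for illustration, not taken from the email data:

```python
# Toy illustration of the Naive Bayes decision rule (made-up probabilities).
# Two classes c in {0, 1}, two binary attributes x1, x2.
priors = {0: 0.6, 1: 0.4}          # P(c)
cond = {                           # cond[c][i] = P(x_i = 1 | c)
    0: [0.2, 0.5],
    1: [0.8, 0.3],
}

def posterior_score(c, x):
    """P(c) * prod_i P(x_i|c): the quantity the argmax rule compares."""
    score = priors[c]
    for i, xi in enumerate(x):
        p = cond[c][i]
        score *= p if xi == 1 else (1 - p)
    return score

x = [1, 0]  # observed attribute vector
# class 0: 0.6 * 0.2 * 0.5 = 0.06; class 1: 0.4 * 0.8 * 0.7 = 0.224
prediction = max(priors, key=lambda c: posterior_score(c, x))  # → 1
```

P(x) is the same for every class, so it can be dropped from the comparison, which is exactly why it disappears between step 3 and step 4.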

Task

Use Naive Bayes to filter spam email.

Dataset

Ham (non-spam) emails are stored in the ham folder, 25 in total.
Spam emails are stored in the spam folder, 25 in total.

Features: each distinct word across all emails is one feature.
Labels: spam = 1; ham = 0.

Python Implementation of Spam Classification with Naive Bayes

bayes.spamTest

Notes:
wordList: the list of words in one email; (rows: 1, columns: n)
docList: the list of all per-email word lists; (rows: 50, variable-length rows)
fullText: all words from both ham and spam emails; (duplicates included)
classList: the label of each email; (spam: 1, ham: 0)
vocabList: the set of all distinct words across ham and spam emails; (no duplicates)

import random
from numpy import array

def spamTest():
    docList = []; classList = []; fullText = []
    for i in range(1, 26):
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)  # create vocabulary
    # create test set: randomly hold out 10 of the 50 emails
    trainingSet = list(range(50)); testSet = []
    for i in range(10):  # size of test set is 10
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    # create training set and train the classifier (get probs) with trainNB0
    trainMat = []; trainClasses = []
    for docIndex in trainingSet:
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))
    errorCount = 0
    for docIndex in testSet:  # classify the held-out emails
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
            print("classification error", docList[docIndex])
    print('the error rate is:', float(errorCount) / len(testSet))
    #return vocabList, fullText

bayes.textParse

# input: big string
# output: word list
def textParse(bigString):
    import re
    # regular expressions; see http://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html
    # note: r'\W*' can match the empty string, which Python 3's re.split rejects; use r'\W+'
    listOfTokens = re.split(r'\W+', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]
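A quick self-contained check of the tokenizer, written for Python 3 with `r'\W+'` in place of `r'\W*'` (the empty-match pattern raises an error in Python 3's `re.split`); the sample sentence is made up for illustration:

```python
import re

def textParse(bigString):
    # split on runs of non-word characters, keep tokens longer than 2 chars, lowercase
    listOfTokens = re.split(r'\W+', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]

tokens = textParse("Hi!! This is a TEST of the parser, v2.0.")
# short tokens ("Hi", "is", "a", "of", "v2", "0") are dropped
# → ['this', 'test', 'the', 'parser']
```

Dropping tokens of length ≤ 2 removes most punctuation fragments and stop-word noise before the vocabulary is built.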

bayes.createVocabList

# input: dataSet, a list of word lists
# output: the unique vocabulary, as a list
# structure: set
def createVocabList(dataSet):
    vocabSet = set([])  # create empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # union of the two sets
    return list(vocabSet)
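A small usage sketch of the vocabulary builder on made-up word lists (set order is unspecified, so only the contents are fixed):

```python
def createVocabList(dataSet):
    vocabSet = set()
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # union with this document's words
    return list(vocabSet)

docs = [['spam', 'cheap', 'buy'], ['buy', 'now'], ['cheap', 'now']]
vocab = createVocabList(docs)
# contains exactly the 4 unique words: buy, cheap, now, spam (in set order)
```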

bayes.bagOfWords2VecMN

Note: bag-of-words model.

def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec
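A usage sketch with a made-up vocabulary, showing that the bag-of-words vector counts occurrences (unlike a set-of-words vector, which would only record presence) and silently ignores out-of-vocabulary words:

```python
def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1  # count, don't just flag
    return returnVec

vocab = ['buy', 'cheap', 'now', 'spam']
vec = bagOfWords2VecMN(vocab, ['cheap', 'cheap', 'buy', 'unknown'])
# → [1, 2, 0, 0]; 'unknown' is not in the vocabulary, so it is skipped
```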

bayes.trainNB0

Notes:
p1Vect: [p(x1|c=1), p(x2|c=1), …, p(xn|c=1)];
p0Vect: [p(x1|c=0), p(x2|c=0), …, p(xn|c=0)];
pAbusive: the prior P(c=1), i.e. the fraction of spam in the training set.

# input:  trainMatrix, the matrix with the word vector of one email in each row
#         trainCategory, the labels of the emails
# output: p0Vect, p1Vect, pAbusive
# algorithm: naive bayes
def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory)/float(numTrainDocs)
    p0Num = ones(numWords); p1Num = ones(numWords)  # counts start at 1, not 0
    p0Denom = 2.0; p1Denom = 2.0                    # so no conditional probability is 0, which would zero the whole product
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = log(p1Num/p1Denom)  # take logs so products become sums, avoiding underflow
    p0Vect = log(p0Num/p0Denom)
    return p0Vect, p1Vect, pAbusive
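A self-contained toy run of the trainer (numpy-based, with made-up count vectors: a 3-word vocabulary, 2 spam and 2 ham documents). With the add-one counts and the +2.0 denominators, the smoothed estimates are easy to verify by hand:

```python
import numpy as np

def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    p0Num = np.ones(numWords); p1Num = np.ones(numWords)  # Laplace-style smoothing
    p0Denom = 2.0; p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = np.log(p1Num / p1Denom)  # log-probabilities, to avoid underflow later
    p0Vect = np.log(p0Num / p0Denom)
    return p0Vect, p1Vect, pAbusive

# rows: word-count vectors; first two are spam (1), last two ham (0)
trainMat = np.array([[2, 0, 0], [1, 1, 0], [0, 0, 2], [0, 1, 1]])
labels = np.array([1, 1, 0, 0])
p0V, p1V, pSpam = trainNB0(trainMat, labels)
# pSpam = 2/4 = 0.5
# exp(p1V) = [(1+3)/6, (1+1)/6, (1+0)/6] = [4/6, 2/6, 1/6]
# exp(p0V) = [1/6, 2/6, 4/6]
```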

bayes.classifyNB

Notes:
y = argmax_{c ∈ Y} P(c) ∏_{i=1}^{d} P(x_i|c)
The score is compared in log space: log P(c) + ∑_{i=1}^{d} log P(x_i|c)

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)  # element-wise mult
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0
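A self-contained sketch of the classifier with hand-built log-probability vectors for a 3-word vocabulary (the probabilities are made up for illustration): word 0 is far more likely under class 1, word 2 under class 0. Multiplying the count vector by the log-probability vector and summing computes ∑ x_i·log P(x_i|c), the log of the smoothed product:

```python
import numpy as np

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # log P(c=1) + sum_i x_i * log P(x_i|c=1), and likewise for class 0
    p1 = np.sum(vec2Classify * p1Vec) + np.log(pClass1)
    p0 = np.sum(vec2Classify * p0Vec) + np.log(1.0 - pClass1)
    return 1 if p1 > p0 else 0

p1V = np.log(np.array([0.6, 0.3, 0.1]))  # log P(x_i|c=1)
p0V = np.log(np.array([0.1, 0.3, 0.6]))  # log P(x_i|c=0)
pSpam = 0.5

pred_spammy = classifyNB(np.array([2, 1, 0]), p0V, p1V, pSpam)  # heavy on word 0 → 1
pred_hammy  = classifyNB(np.array([0, 0, 3]), p0V, p1V, pSpam)  # heavy on word 2 → 0
```

With equal priors the log-prior terms cancel, so the decision here rests entirely on the word counts.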