Machine Learning Notes [3] --- Using Naive Bayes to Identify Junk Email

Source: Internet  Editor: 程序博客网  Time: 2024/05/22 03:20

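This post is mainly based on the naive Bayes chapter of Machine Learning in Action (《机器学习实战》).

Problem: given an email, how do we decide whether it is junk? Assume we already have several emails as training material, each of them already labeled as junk or not junk.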


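Solution: by Bayes' rule,

P(junk email | word set) * P(word set)
= P(word set | junk email) * P(junk email)
= P(word1 | junk email) * P(word2 | junk email) * ... * P(wordN | junk email) * P(junk email)

The last step relies on the "naive" assumption that, given the class, the words occur independently of one another.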
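To make the equation concrete, here is a minimal sketch that evaluates the right-hand side for both classes and normalises the two scores. Because P(word set) is the same for both classes, it cancels out when the scores are compared. All probability values and the names class_score and junk_posterior are made up for illustration; they do not come from the book or from any trained model.

```python
# Hypothetical per-word conditional probabilities (illustrative values only).
p_word_given_junk = {"stupid": 0.20, "dog": 0.10, "love": 0.02}
p_word_given_good = {"stupid": 0.02, "dog": 0.10, "love": 0.15}
p_junk = 0.5  # assumed prior P(junk email)


def class_score(words, p_word_given_class, p_class):
    """Return P(word1|class) * ... * P(wordN|class) * P(class)."""
    score = p_class
    for word in words:
        score *= p_word_given_class[word]
    return score


def junk_posterior(words):
    """Normalise the two class scores to get P(junk email | word set)."""
    junk = class_score(words, p_word_given_junk, p_junk)
    good = class_score(words, p_word_given_good, 1.0 - p_junk)
    return junk / (junk + good)


posterior = junk_posterior(["stupid", "dog"])
print(posterior)
```

With these assumed numbers the junk score is ten times the normal score, so the message would be labeled junk.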

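The equation above has two main parts:
P(word1 | junk email): using the junk emails in the training set, we compute how often word1 occurs among all the words of those junk emails. That frequency is the value we need.
P(junk email): the fraction of junk emails in the whole training set. For example, if the training set has 5 emails and 3 of them are junk, the junk probability is 3/5 = 0.6.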
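The two estimates can be sketched in a few lines. The toy messages, the labels, and the function names estimate_prior and estimate_word_frequencies are hypothetical and exist only for this illustration. The sketch uses raw frequencies without smoothing, so a word never seen in a class would get probability 0.

```python
def estimate_prior(labels):
    """Fraction of training emails labeled junk (1 = junk, 0 = normal)."""
    return sum(labels) / float(len(labels))


def estimate_word_frequencies(messages, labels, target_label):
    """Relative frequency of each word among all words of one class."""
    counts = {}
    total = 0
    for message, label in zip(messages, labels):
        if label != target_label:
            continue
        for word in message:
            counts[word] = counts.get(word, 0) + 1
            total += 1
    return {word: count / float(total) for word, count in counts.items()}


# Hypothetical toy training set: 5 emails, 3 of them junk.
toy_messages = [
    ["cheap", "pills", "cheap"],
    ["meeting", "at", "noon"],
    ["cheap", "offer"],
    ["lunch", "at", "noon"],
    ["offer", "pills", "now"],
]
toy_labels = [1, 0, 1, 0, 1]

prior_junk = estimate_prior(toy_labels)
junk_freq = estimate_word_frequencies(toy_messages, toy_labels, 1)
print(prior_junk)
print(junk_freq["cheap"])
```

Here the three junk emails contain 8 words in total and "cheap" appears 3 times, so its estimated frequency is 3/8.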


For the email example, the procedure is implemented as follows.


import numpy as np

# Load the training data set: six short messages and, for each one,
# its label (0 = normal message, 1 = junk message).
def loadDataSet():
    trainMessages = [
        ['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
        ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
        ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
        ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
        ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
        ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']
    ]
    trainClassify = [0, 1, 0, 1, 0, 1]
    return trainMessages, trainClassify

# Build the vocabulary (a list of unique words) from the training messages.
def createWordSet(trainMessages):
    wordSet = set()
    for message in trainMessages:
        wordSet = wordSet | set(message)
    return list(wordSet)

# Train on the messages and the vocabulary to obtain:
#   the frequency of each word in junk messages   -- P(word | junk email)
#   the frequency of each word in normal messages -- P(word | normal email)
#   the probability that a message is junk        -- P(junk email)
def createProbSet(trainMessages, trainClassify, wordSet):
    # Counts start at 1 and totals at 2.0 so that no probability is ever 0.
    goodWordProbList = np.ones(len(wordSet))
    junkWordProbList = np.ones(len(wordSet))
    junkPercent = sum(trainClassify) / float(len(trainClassify))
    goodSum = 2.0
    junkSum = 2.0
    numMessage = len(trainMessages)
    for i in range(numMessage):
        judgeList = message2JudgeList(trainMessages[i], wordSet)
        if trainClassify[i] == 0:
            goodWordProbList += judgeList
            goodSum += sum(judgeList)
        else:
            junkWordProbList += judgeList
            junkSum += sum(judgeList)
    goodWordProbList = goodWordProbList / goodSum
    junkWordProbList = junkWordProbList / junkSum
    return goodWordProbList, junkWordProbList, junkPercent

# Count how many times each vocabulary word occurs in the message.
# Words that are not in the vocabulary are ignored.
def message2JudgeList(testMessage, wordSet):
    messageJudgeList = len(wordSet) * [0]
    for word in testMessage:
        if word in wordSet:
            messageJudgeList[wordSet.index(word)] += 1
    return messageJudgeList

# Classify the test message using the vocabulary and the trained probabilities.
# Returns 0 for a normal message and 1 for a junk message.
def messageClassifier(testMessage, wordSet, goodProbSet, junkProbSet, junkPercent):
    messageJudgeList = message2JudgeList(testMessage, wordSet)
    probGood = np.sum(np.array(messageJudgeList) * np.log(goodProbSet)) + np.log(1 - junkPercent)
    probJunk = np.sum(np.array(messageJudgeList) * np.log(junkProbSet)) + np.log(junkPercent)
    if probGood > probJunk:
        return 0
    else:
        return 1

# Main entry point: train on the data set, then classify the test message.
def run(testMessage):
    # prepare and train
    trainMessages, trainClassify = loadDataSet()
    wordSet = createWordSet(trainMessages)
    goodProbSet, junkProbSet, junkPercent = createProbSet(trainMessages, trainClassify, wordSet)
    # classify the test message
    result = messageClassifier(testMessage, wordSet, goodProbSet, junkProbSet, junkPercent)
    print(result)
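The classifier above adds up logarithms of the probabilities instead of multiplying the probabilities themselves. One likely reason is numerical: a product of many small numbers underflows to zero in floating point, while the sum of their logarithms stays finite, and since log is increasing the comparison between the two classes is unchanged. The sketch below illustrates this with made-up values (1000 words, each with an assumed probability of 0.01).

```python
import numpy as np

# Hypothetical example: 1000 words, each with conditional probability 0.01.
word_probs = np.full(1000, 0.01)

# Multiplying directly gives 1e-2000, which float64 cannot represent.
direct_product = np.prod(word_probs)

# Summing logarithms keeps the score in a comfortable numeric range.
log_score = np.sum(np.log(word_probs))

print(direct_product)
print(log_score)
```

With a direct product both class scores could collapse to 0.0 for a long message and become impossible to compare, whereas the log scores remain distinct.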


