Naive Bayes -- Source Code Walkthrough
Source: Internet · Editor: 程序博客网 · Time: 2024/06/07 20:26
For the underlying probability theory and Bayesian decision theory, please consult the relevant textbooks; here we focus on the source code and its walkthrough.
1. Text classification with Python
```python
# -*- coding: utf-8 -*-
"""
Created on Mon Aug 14 21:40:38 2017
@author: LiLong
"""
from numpy import *

# create the sample data set
def loadDataSet():
    # token lists, one per document
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    # class labels, manually annotated: 1 = abusive, 0 = not abusive
    classVec = [0, 1, 0, 1, 0, 1]
    return postingList, classVec

def createVocabList(dataSet):
    vocabSet = set([])  # empty set; set() guarantees element uniqueness
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # union adds the new words
    print 'vocabSet', vocabSet  # this is a set
    return list(vocabSet)       # convert it to a list before returning

# convert an input word list to a vector (set-of-words model)
def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)  # list vector as long as the vocabulary
    for word in inputSet:
        if word in vocabList:         # membership test with "if ... in ..."
            returnVec[vocabList.index(word)] = 1  # mark presence only
        else:
            print "the word: %s is not in my Vocabulary!" % word
    return returnVec  # word vectors for all documents have the same length

# naive Bayes training function
def trainNB0(trainMatrix, trainCategory):
    # trainCategory: the vector of class labels, one per document
    numTrainDocs = len(trainMatrix)   # number of training documents
    numWords = len(trainMatrix[0])    # length of each document's word vector
    pAbusive = sum(trainCategory) / float(numTrainDocs)  # frequency of abusive docs
    # Laplace smoothing: initialize counts to 1 and denominators to 2
    # so no conditional probability is ever zero
    p0Num = ones(numWords); p1Num = ones(numWords)
    p0Denom = 2.0; p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:            # this document is labeled 1
            p1Num += trainMatrix[i]          # accumulate per-word counts for class 1
            p1Denom += sum(trainMatrix[i])   # total number of words seen in class 1
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # take logs to avoid underflow; this also prepares for classification
    p1Vect = log(p1Num / p1Denom)
    p0Vect = log(p0Num / p0Denom)
    return p0Vect, p1Vect, pAbusive

# naive Bayes classification function
# vec2Classify is the word vector to classify
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # p(Ci|W) ∝ p(W|Ci)p(Ci)  --->  log p(Ci|W) ∝ log p(W|Ci) + log p(Ci)
    # elementwise multiply, then sum()
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

def testingNB():
    listOPosts, listClasses = loadDataSet()    # load documents and labels
    myVocabList = createVocabList(listOPosts)  # vocabulary: the unique words
    trainMat = []
    for postinDoc in listOPosts:               # word vector for every document
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    # per-class conditional word probabilities and the class prior
    p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb)
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb)

# entry point
testingNB()
```
Output:

```
runfile('C:/Users/LiLong/Desktop/Bayesian/bayesian.py', wdir='C:/Users/LiLong/Desktop/Bayesian')
vocabSet set(['cute', 'love', 'help', 'garbage', 'quit', 'I', 'problems', 'is', 'park', 'stop', 'flea', 'dalmation', 'licks', 'food', 'not', 'him', 'buying', 'posting', 'has', 'worthless', 'ate', 'to', 'maybe', 'please', 'dog', 'how', 'stupid', 'so', 'take', 'mr', 'steak', 'my'])
['love', 'my', 'dalmation'] classified as:  0
['stupid', 'garbage'] classified as:  1
```
A few points to note:

- All words are assumed to be mutually independent: the conditional independence assumption.
- Each word's presence or absence (not its count) is used as a feature: the set-of-words model.
- In `p1 = sum(vec2Classify * p1Vec) + log(pClass1)`, the element-wise product followed by sum() adds up the log-probabilities of exactly the words present in the document, which resembles computing an expectation.
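The training and classification steps above can be re-sketched compactly in vectorized form (a minimal Python 3 sketch with NumPy; the tiny three-word vocabulary and all variable names are mine, for illustration only):

```python
import numpy as np

def train(train_mat, labels):
    """Mirror of trainNB0: Laplace-smoothed log-probabilities per class."""
    train_mat = np.asarray(train_mat, dtype=float)
    labels = np.asarray(labels)
    p_class1 = labels.mean()  # the prior p(C=1)
    # start counts at 1 and denominators at 2, as trainNB0 does
    num1 = 1.0 + train_mat[labels == 1].sum(axis=0)
    num0 = 1.0 + train_mat[labels == 0].sum(axis=0)
    p1_vec = np.log(num1 / (2.0 + train_mat[labels == 1].sum()))
    p0_vec = np.log(num0 / (2.0 + train_mat[labels == 0].sum()))
    return p0_vec, p1_vec, p_class1

def classify(vec, p0_vec, p1_vec, p_class1):
    """Mirror of classifyNB: compare log p(W|C) + log p(C) per class."""
    p1 = np.dot(vec, p1_vec) + np.log(p_class1)
    p0 = np.dot(vec, p0_vec) + np.log(1.0 - p_class1)
    return 1 if p1 > p0 else 0

# toy vocabulary: ['stupid', 'my', 'dog']
train_mat = [[1, 0, 0],   # doc 0: 'stupid'  -> class 1
             [0, 1, 1]]   # doc 1: 'my dog'  -> class 0
labels = [1, 0]
p0_vec, p1_vec, p_class1 = train(train_mat, labels)
print(classify([1, 0, 0], p0_vec, p1_vec, p_class1))  # -> 1
print(classify([0, 1, 1], p0_vec, p1_vec, p_class1))  # -> 0
```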
2. Filtering spam with naive Bayes
```python
# -*- coding: utf-8 -*-
"""
Created on Mon Aug 14 21:40:38 2017
@author: LiLong
"""
from numpy import *

# createVocabList, setOfWords2Vec, trainNB0, classifyNB and testingNB
# are identical to the section 1 listing and are omitted here.

# convert an input word list to a vector (bag-of-words model)
def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0] * len(vocabList)  # list vector as long as the vocabulary
    for word in inputSet:
        if word in vocabList:
            # count every occurrence (note: the original "= +1" was a typo;
            # it must be "+= 1" for a true bag-of-words model)
            returnVec[vocabList.index(word)] += 1
        else:
            print "the word: %s is not in my Vocabulary!" % word
    return returnVec

# tokenize a text
def textParse(bigString):
    import re
    # split on runs of non-word characters
    # (\W+ instead of the original \W*, which also matched empty strings)
    listOfTokens = re.split(r'\W+', bigString)
    # drop tokens shorter than three characters, lowercase the rest
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]

# spam filtering test
def spamTest():
    docList = []; classList = []; fullText = []
    for i in range(1, 26):
        # open and read one spam message from the spam/ directory
        wordList = textParse(open('spam/%d.txt' % i).read())
        docList.append(wordList)    # list of token lists: [[], [], ...]
        fullText.extend(wordList)   # flat list of all tokens: [...]
        classList.append(1)         # class 1: spam
        # read the matching ham message
        wordList = textParse(open('ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)         # class 0: ham
    # vocabulary over all 50 documents: every unique token
    vocabList = createVocabList(docList)
    trainingSet = range(50); testSet = []
    for i in range(10):  # randomly hold out a test set
        randIndex = int(random.uniform(0, len(trainingSet)))  # random index
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    print 'testSet:', testSet          # the 10 held-out documents
    print 'trainingSet:', trainingSet  # the remaining 40 documents
    # build the training word vectors
    trainMat = []; trainClasses = []
    for docIndex in trainingSet:
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])  # corresponding labels
    # naive Bayes training; the inputs must be numpy arrays
    p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))
    errorCount = 0
    # test
    for docIndex in testSet:
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
            print "classification error", docList[docIndex]  # misclassified tokens
    print 'the error rate is:', float(errorCount) / len(testSet)
    #return vocabList, fullText

# entry point
#testingNB()
spamTest()
```
Output:

```
runfile('C:/Users/LiLong/Desktop/Bayesian/debug.py', wdir='C:/Users/LiLong/Desktop/Bayesian')
testSet: [34, 23, 8, 10, 40, 13, 21, 14, 2, 20]
trainingSet: [0, 1, 3, 4, 5, 6, 7, 9, 11, 12, 15, 16, 17, 18, 19, 22, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 35, 36, 37, 38, 39, 41, 42, 43, 44, 45, 46, 47, 48, 49]
the error rate is: 0.0
runfile('C:/Users/LiLong/Desktop/Bayesian/debug.py', wdir='C:/Users/LiLong/Desktop/Bayesian')
testSet: [31, 15, 23, 8, 12, 27, 10, 3, 13, 1]
trainingSet: [0, 2, 4, 5, 6, 7, 9, 11, 14, 16, 17, 18, 19, 20, 21, 22, 24, 25, 26, 28, 29, 30, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
classification error ['oem', 'adobe', 'microsoft', 'softwares', 'fast', 'order', 'and', 'download', 'microsoft', 'office', 'professional', 'plus', '2007', '2010', '129', 'microsoft', 'windows', 'ultimate', '119', 'adobe', 'photoshop', 'cs5', 'extended', 'adobe', 'acrobat', 'pro', 'extended', 'windows', 'professional', 'thousand', 'more', 'titles']
the error rate is: 0.1
```
The output above shows two runs. Because the test emails are chosen at random, the result differs from run to run; repeating the test many times and averaging the error rates gives a more reliable estimate.
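The repeat-and-average procedure just mentioned can be sketched like this (Python 3; `train_and_eval` is a hypothetical callback standing in for the body of spamTest, and the function name is mine):

```python
import random

def averaged_holdout_error(n_docs, train_and_eval, test_size=10, runs=50, seed=0):
    """Average the hold-out error rate over many random splits."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        idx = list(range(n_docs))
        rng.shuffle(idx)
        # hold out test_size documents, train on the rest
        test_set, training_set = idx[:test_size], idx[test_size:]
        total += train_and_eval(training_set, test_set)  # one spamTest-style run
    return total / runs

# a dummy evaluator that always reports a 50% error rate
print(averaged_holdout_error(50, lambda tr, te: 0.5))  # -> 0.5
```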
Note:

- The bag-of-words model is used here.
- The data split is hold-out cross-validation.
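The difference between the set-of-words and bag-of-words models is easiest to see side by side (a small sketch with a made-up three-word vocabulary; note that the bag model must use `+= 1`, since writing `= +1`, an easy typo, just re-assigns 1 and collapses the bag model back into the set model):

```python
def set_of_words(vocab, doc):
    vec = [0] * len(vocab)
    for w in doc:
        if w in vocab:
            vec[vocab.index(w)] = 1   # presence only
    return vec

def bag_of_words(vocab, doc):
    vec = [0] * len(vocab)
    for w in doc:
        if w in vocab:
            vec[vocab.index(w)] += 1  # count every occurrence
    return vec

vocab = ['stupid', 'dog', 'my']
doc = ['stupid', 'stupid', 'dog']
print(set_of_words(vocab, doc))  # -> [1, 1, 0]
print(bag_of_words(vocab, doc))  # -> [2, 1, 0]
```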
Aside: list vs. NumPy array arithmetic:

```
>>> [1,0,1] + [0,0,0]
[1, 0, 1, 0, 0, 0]
>>> t = array([1,0,1])
>>> m = array([0,0,1])
>>> t + m
array([1, 0, 2])
>>> t = array([1,0,2])
>>> t * t
array([1, 0, 4])
>>> t = array([[1,0,2],[1,0,2]])
>>> m = array([2,0,2])
>>> t * m
array([[2, 0, 4],
       [2, 0, 4]])
```
3. Using naive Bayes to reveal regional word preferences in personal ads
I use Spyder, which does not bundle feedparser, so install it first:

Download the package: feedparser-5.2.1

Install it from within Spyder: Tools -> Open command prompt, then in the console cd into the unpacked package directory and run `python setup.py install`. Afterwards run `pip list` at the command line to confirm the package is installed (on older pip versions use `pip freeze`).
```python
# -*- coding: utf-8 -*-
"""
Created on Mon Aug 14 21:40:38 2017
@author: LiLong
"""
from numpy import *
import feedparser

# loadDataSet, createVocabList, setOfWords2Vec, bagOfWords2VecMN, trainNB0,
# classifyNB, testingNB, textParse and spamTest are identical to the listings
# above and are omitted here (the only change: bagOfWords2VecMN now silences
# the "not in my Vocabulary" message with pass instead of printing it).

# count word frequencies and return the 30 most frequent words
def calcMostFreq(vocabList, fullText):
    import operator
    freqDict = {}
    for token in vocabList:
        freqDict[token] = fullText.count(token)  # occurrences of token in fullText
    sortedFreq = sorted(freqDict.iteritems(), key=operator.itemgetter(1),
                        reverse=True)
    return sortedFreq[:30]

# same structure as spamTest(), but fed from two RSS feeds
def localWords(feed1, feed0):
    import feedparser
    docList = []; classList = []; fullText = []
    minLen = min(len(feed1['entries']), len(feed0['entries']))  # 'entries' is a list
    print 'minLen:', minLen
    for i in range(minLen):
        wordList = textParse(feed1['entries'][i]['summary'])
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)  # NY is class 1
        wordList = textParse(feed0['entries'][i]['summary'])
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)  # SF is class 0
    vocabList = createVocabList(docList)  # create the vocabulary
    # top30Words has the form [(u'and', 91), ...]
    top30Words = calcMostFreq(vocabList, fullText)
    print 'top30Words:', top30Words
    for pairW in top30Words:  # remove the 30 most frequent words
        if pairW[0] in vocabList:
            vocabList.remove(pairW[0])
    trainingSet = range(2 * minLen); testSet = []  # build the test set
    for i in range(20):
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat = []; trainClasses = []
    for docIndex in trainingSet:
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))
    errorCount = 0
    for docIndex in testSet:
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
    print 'the error rate is: ', float(errorCount) / len(testSet)
    return vocabList, p0V, p1V

# entry point -- feedparser must be imported outside the functions
#testingNB()
#spamTest()
ny = feedparser.parse('http://newyork.craigslist.org/stp/index.rss')
sf = feedparser.parse('http://sfbay.craigslist.org/stp/index.rss')
vocabList, pSF, pNY = localWords(ny, sf)
```
Output of four runs (each run prints the same minLen and top30Words, shown here only once; the four error rates were 0.45, 0.35, 0.15 and 0.15):

```
runfile('C:/Users/LiLong/Desktop/Bayesian/bayesian.py', wdir='C:/Users/LiLong/Desktop/Bayesian')
minLen: 25
top30Words: [(u'and', 90), (u'you', 54), (u'for', 51), (u'indian', 35), (u'looking', 32), (u'who', 32), (u'the', 29), (u'with', 28), (u'have', 25), (u'can', 21), (u'male', 19), (u'female', 17), (u'your', 17), (u'that', 14), (u'not', 13), (u'just', 13), (u'like', 13), (u'here', 11), (u'out', 11), (u'are', 11), (u'good', 10), (u'married', 10), (u'but', 10), (u'single', 10), (u'area', 10), (u'woman', 9), (u'want', 9), (u'friend', 9), (u'bay', 9), (u'about', 9)]
the error rate is: 0.45
the error rate is: 0.35
the error rate is: 0.15
the error rate is: 0.15
```
If you comment out the few lines that remove the high-frequency words, the error rate changes noticeably. This shows how much weight the most representative words carry in the vocabulary, i.e. how important the choice of features is.
```
runfile('C:/Users/LiLong/Desktop/Bayesian/bayesian.py', wdir='C:/Users/LiLong/Desktop/Bayesian')
minLen: 25
the error rate is: 0.3
runfile('C:/Users/LiLong/Desktop/Bayesian/bayesian.py', wdir='C:/Users/LiLong/Desktop/Bayesian')
minLen: 25
660
the error rate is: 0.35
runfile('C:/Users/LiLong/Desktop/Bayesian/bayesian.py', wdir='C:/Users/LiLong/Desktop/Bayesian')
minLen: 25
660
the error rate is: 0.3
```
The error rate here is also much higher than in the spam-filtering test, but since the goal is the word probabilities rather than the classification itself, this is not a serious problem.
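As an aside on efficiency: calcMostFreq builds its frequency table with list.count, which rescans fullText once per vocabulary word; collections.Counter does the same job in a single pass (a Python 3 sketch; the function name and toy data are mine, and the 30-word cutoff follows the code above):

```python
from collections import Counter

def remove_top_words(vocab, full_text, n=30):
    """Drop the n most frequent tokens of full_text from the vocabulary."""
    counts = Counter(full_text)                      # one pass over full_text
    top = [word for word, _ in counts.most_common(n)]
    return [w for w in vocab if w not in top], top

vocab = ['and', 'dog', 'park', 'the']
full_text = ['and', 'the', 'and', 'dog', 'the', 'and']
pruned, top = remove_top_words(vocab, full_text, n=2)
print(top)     # -> ['and', 'the']
print(pruned)  # -> ['dog', 'park']
```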
4. Displaying the most representative words
```python
# display the most representative words for each city
def getTopWords(ny, sf):
    vocabList, p0V, p1V = localWords(ny, sf)
    topNY = []; topSF = []
    for i in range(len(p0V)):
        # keep words whose log-probability clears the threshold
        if p0V[i] > -4.0:
            topSF.append((vocabList[i], p0V[i]))
        if p1V[i] > -4.0:
            topNY.append((vocabList[i], p1V[i]))
    sortedSF = sorted(topSF, key=lambda pair: pair[1], reverse=True)
    print "SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**"
    for item in sortedSF:
        print item[0]
    sortedNY = sorted(topNY, key=lambda pair: pair[1], reverse=True)
    print "NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**"
    for item in sortedNY:
        print item[0]
```
Output:

```
runfile('C:/Users/LiLong/Desktop/Bayesian/bayesian.py', wdir='C:/Users/LiLong/Desktop/Bayesian')
minLen: 25
695
the error rate is: 0.45
SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**
and
for
the
NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**
and
for
have
the
you
```
Here I raised the threshold to -4.0: fewer words qualify, but the ones that remain are the more representative ones.
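For intuition about the threshold: the entries of p0V and p1V are log-probabilities, so a cutoff of -4.0 keeps only words whose conditional probability exceeds exp(-4.0), roughly 1.8%, while a looser cutoff such as -6.0 would keep anything above about 0.25%:

```python
import math

# probability mass implied by a log-probability cutoff
for cutoff in (-4.0, -6.0):
    print(cutoff, round(math.exp(cutoff), 6))
# -4.0 0.018316
# -6.0 0.002479
```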
Appendix: working with tuples stored in a list
```
>>> tt = [('and', 91), ('for', 60)]
>>> tt[0]
('and', 91)
>>> for i in tt:
...     print i[0]
...
and
for
```
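This list-of-tuples pattern is what calcMostFreq relies on when sorting by count; with operator.itemgetter the sort reads:

```python
from operator import itemgetter

tt = [('for', 60), ('bay', 9), ('and', 91)]
# sort descending by the count in position 1, as calcMostFreq does
print(sorted(tt, key=itemgetter(1), reverse=True))
# -> [('and', 91), ('for', 60), ('bay', 9)]
```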