Machine Learning in Action -- Naive Bayes (errata fixed)
Source: Internet  Published: 蒜泥白肉 知乎  Editor: 程序博客网  Time: 2024/06/05 16:04
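I have recently been teaching myself machine learning. At my advisor's request I first worked through "Machine Learning with R" hands-on, and came away feeling that R hardly counts as a programming language; it feels more like a feature-rich calculator. So this time I decided to use the classic textbook "Machine Learning in Action". I have to switch workstations when the term starts and am afraid of losing my code again, so I am simply starting a blog and uploading the code here.
The book's original code contains many errors, and many blog posts online reproduce it without fixing them, so here I am posting a corrected version.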
Version: Python 3.5
Talk is cheap. Show me the code.
Function definitions
#coding=utf-8
import re
from numpy import *
#from math import log

def loadDataSet():
    # Data format: each inner list is one tokenized post
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]  # 1 = abusive text, 0 = normal speech
    return postingList, classVec

def createVocabList(dataSet):
    # Start from an empty set and union in the words of every document
    vocabSet = set([])
    for document in dataSet:
        vocabSet = vocabSet | set(document)
    return list(vocabSet)

def setOfWords2Vec(vocabList, inputSet):
    # Create a vector of length len(vocabList) with every element set to 0
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word %s is not in Vocabulary" % word)
    return returnVec

def trainNBO(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)   # number of rows (documents)
    numWords = len(trainMatrix[0])    # number of columns (vocabulary size)
    # sum(trainCategory) is the number of documents labeled 1
    pAbusive = sum(trainCategory) / float(numTrainDocs)  # prior probability p(c1)
    # Counts start at 1 and denominators at 2.0 so that no word probability is 0
    p0Num = ones(numWords)
    p1Num = ones(numWords)
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):     # one row per document
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # Work with logs so that multiplying many small probabilities does not underflow
    p1Vect = log(p1Num / p1Denom)
    p0Vect = log(p0Num / p0Denom)
    return p0Vect, p1Vect, pAbusive

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # Log-likelihood of the words present plus the log-prior, for each class
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

def textParse(bigString):
    # Split on runs of non-word characters; r'\W+' avoids empty matches,
    # which r'\W*' produces (and which split every character on Python 3.7+)
    listOfTokens = re.split(r'\W+', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]

def spamTest():
    docList = []
    classList = []
    fullText = []
    for i in range(1, 26):
        wordList = textParse(open('email/spam/%d.txt'%i, encoding='gbk', errors='ignore').read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(open('email/ham/%d.txt'%i, encoding='gbk', errors='ignore').read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)
    trainingSet = list(range(50))  # list(), because a range object does not support del
    testSet = []
    # Randomly move 10 documents out of the training set to form the test set
    for i in range(10):
        # random is numpy.random here; its randint excludes the upper bound
        randIndex = int(random.randint(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat = []
    trainClasses = []
    for docIndex in trainingSet:
        trainMat.append(setOfWords2Vec(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNBO(array(trainMat), array(trainClasses))
    errorCount = 0
    # Classify the test set
    for docIndex in testSet:
        wordVector = setOfWords2Vec(vocabList, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
    print('the error rate is :', float(errorCount)/len(testSet))
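The textParse function above tokenizes an email by splitting on non-word characters and keeping lowercase tokens longer than two characters. Here is a minimal standalone sketch of that step. The name text_parse and the sample sentence are illustrative only and do not come from the book's data. It uses r'\W+', because a pattern that can match the empty string, such as r'\W*', splits between every character on Python 3.7 and later, so no token longer than two characters survives.

```python
import re

def text_parse(big_string):
    """Split on runs of non-word characters; keep lowercase tokens longer than 2 chars."""
    # r'\W+' requires at least one separator character per split point.
    tokens = re.split(r'\W+', big_string)
    return [tok.lower() for tok in tokens if len(tok) > 2]

# Illustrative input, not taken from the email data set.
sample = "Hi Peter, the Python book is AWESOME!"
parsed = text_parse(sample)
print(parsed)  # ['peter', 'the', 'python', 'book', 'awesome']
```

Short tokens such as "Hi" and "is" are dropped by the length filter, which also removes the empty string left after the final "!".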
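In spamTest(), the main errors are the following:
1. 'range' object doesn't support item deletion --> In Python 3, range no longer returns a list; it returns a range object, which does not support del.
Fix: wrap it in list(), i.e. trainingSet = list(range(50)). See http://blog.csdn.net/dillon2015/article/details/52987792
2. UnicodeDecodeError: 'gbk' codec can't decode byte 0xae in position 199: illegal multibyte sequence --> I was at a loss about the exact cause at first. After some searching: the files are being decoded as gbk and contain bytes that are not valid gbk, so I changed the call to the following, where errors='ignore' skips the undecodable bytes: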
wordList = textParse(open('email/spam/%d.txt'%i, encoding='gbk', errors='ignore').read())
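Both errors can be reproduced in a few lines without the email files. The sketch below is illustrative only: the variable names and the sample bytes are made up, and bytes.decode stands in for open(..., encoding='gbk', errors='ignore'), which accepts the same errors argument.

```python
# 1. A range object is an immutable sequence, so deleting an item raises TypeError.
indices = range(5)
try:
    del indices[2]
    range_error = None
except TypeError as exc:
    range_error = str(exc)
print(range_error)

# Converting to a list first gives a mutable copy that supports del.
training_set = list(range(5))
del training_set[2]
print(training_set)  # [0, 1, 3, 4]

# 2. Made-up sample bytes: 0xae followed by a space is not a valid GBK sequence,
#    so strict decoding raises UnicodeDecodeError.
raw = b"buy now\xae today"
try:
    raw.decode("gbk")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
print(strict_failed)  # True

# errors="ignore" drops the undecodable bytes instead of raising.
text = raw.decode("gbk", errors="ignore")
print(text)
```

Note that errors='ignore' silently discards data. That is acceptable here because only the word tokens matter, but it would hide real encoding problems in other settings.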
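The code block above only defines the main functions; it still needs a little more before it runs. The book runs everything from the IPython command line, but I am lazy, so I simply dropped that approach.
Enough talk, straight to the code.
Experiment 1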
if __name__=="__main__":
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    print(sum(listClasses))
    print(listClasses)
    print(myVocabList)
    vec1 = setOfWords2Vec(myVocabList, listOPosts[0])
    vec2 = setOfWords2Vec(myVocabList, listOPosts[3])
    print(vec1)
    print(vec2)
Experiment 2:
if __name__ == "__main__":
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNBO(trainMat, listClasses)
    print(p0V)
    print(p1V)
    print(pAb)
Experiment 3:
if __name__ == "__main__":
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNBO(trainMat, listClasses)
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as :', classifyNB(thisDoc, p0V, p1V, pAb))
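classifyNB compares sums of log-probabilities rather than products of probabilities. The sketch below shows why that matters, using made-up numbers (500 word likelihoods of 0.01 each, not taken from the post's data). The plain product underflows to zero in floating point, while the sum of logs stays finite. Because log is monotonic, comparing the log scores picks the same class as comparing the products would.

```python
import math

# Hypothetical likelihoods, chosen only to trigger underflow.
probs = [0.01] * 500

product = 1.0
for p in probs:
    product *= p  # mathematically 1e-1000, far below the smallest positive float

log_sum = sum(math.log(p) for p in probs)  # about 500 * log(0.01)

print(product)  # 0.0
print(log_sum)
```

With the product at 0.0 for both classes, the comparison p1 > p0 would carry no information, which is why the training function returns log vectors.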
For more, see GitHub:
https://github.com/Edgis/Machine-learning-in-action/blob/master/bayes.py