Naive Bayes (NaiveBayes) for Chinese Text Classification on Small Datasets
Reposted from the blog of 相国大人:
http://blog.csdn.net/github_36326955/article/details/54891204
Kept here as a personal note.
Just run the code in the order 1, 2, 3, 4:
1.py (corpus_segment.py)
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
"""
@version: python2.7.8
@author: XiangguoSun
@contact: sunxiangguodut@qq.com
@file: corpus_segment.py
@time: 2017/2/5 15:28
@software: PyCharm
"""
import sys
import os
import jieba

# Set up a UTF-8 default encoding
reload(sys)
sys.setdefaultencoding('utf-8')


# Save content to a file
def savefile(savepath, content):
    with open(savepath, "wb") as fp:
        fp.write(content)
    # The "with" statement (Python 2.6+) spares us the tedious file close
    # and try bookkeeping; Python 2.5 needs
    # "from __future__ import with_statement". Beginners can consult
    # http://zhoutall.com/archives/325


# Read a file
def readfile(path):
    with open(path, "rb") as fp:
        content = fp.read()
    return content


def corpus_segment(corpus_path, seg_path):
    '''
    corpus_path: path of the unsegmented corpus
    seg_path:    path where the segmented corpus is written
    '''
    catelist = os.listdir(corpus_path)  # all subdirectories of corpus_path
    # The subdirectory names are the class names: in train_corpus/art/21.txt,
    # 'train_corpus/' is corpus_path and 'art' is one member of catelist.

    # Process every directory (i.e. every class)
    for mydir in catelist:
        # mydir is the 'art' part of train_corpus/art/21.txt
        class_path = corpus_path + mydir + "/"  # class subdirectory, e.g. train_corpus/art/
        seg_dir = seg_path + mydir + "/"        # output directory, e.g. train_corpus_seg/art/
        if not os.path.exists(seg_dir):         # create the output directory if it does not exist
            os.makedirs(seg_dir)
        file_list = os.listdir(class_path)      # all texts of this class, e.g. ['21.txt', '22.txt', ...]
        for file_path in file_list:             # iterate over the files in the class directory
            fullname = class_path + file_path   # full path, e.g. train_corpus/art/21.txt
            content = readfile(fullname)        # raw text, including stray spaces, blank lines, CR/LF
            # Strip those irrelevant characters so that only punctuation
            # separates the remaining compact text.
            content = content.replace("\r\n", "")  # remove line breaks
            content = content.replace(" ", "")     # remove blank lines and extra spaces
            content_seg = jieba.cut(content)       # segment the content into words
            savefile(seg_dir + file_path, " ".join(content_seg))  # write the segmented file

    print "Chinese corpus segmentation finished!!!"


# If the line if __name__ == "__main__": puzzles you, see
# http://imoyao.lofter.com/post/3492bc_bd0c4ce
# In short: when another Python file imports this one as a module, the code
# below does not run; it only runs when you execute this file directly from
# the command line or from an IDE such as PyCharm. In other words, this part
# acts as a functional test.
if __name__ == "__main__":
    # Segment the training set
    corpus_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/train/"  # unsegmented corpus, input
    seg_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/train_corpus_seg/"  # segmented corpus, output of this program
    corpus_segment(corpus_path, seg_path)

    # Segment the test set
    corpus_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/answer/"  # unsegmented corpus, input
    seg_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/test_corpus_seg/"  # segmented corpus, output of this program
    corpus_segment(corpus_path, seg_path)
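For intuition about what the segmentation step produces, here is a minimal standalone sketch (my addition, not part of the original scripts) of jieba.cut on a made-up sentence:

# -*- coding: UTF-8 -*-
import jieba

sentence = u"小明硕士毕业于中国科学院计算所"  # made-up example sentence
words = jieba.cut(sentence)   # returns a generator of tokens
print " ".join(words)         # e.g.: 小明 硕士 毕业 于 中国科学院 计算所

The script above writes exactly this kind of space-joined output, one file per document.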
2.py (corpus2Bunch.py)
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
"""
@version: python2.7.8
@author: XiangguoSun
@contact: sunxiangguodut@qq.com
@file: corpus2Bunch.py
@time: 2017/2/7 7:41
@software: PyCharm
"""
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import os  # built-in module for file and directory work; we use os.listdir
import cPickle as pickle  # import cPickle under the alias pickle
# Python also ships a pure-Python module literally named pickle; the name
# clash here is harmless. On cPickle versus pickle, see the author's other
# post: http://blog.csdn.net/github_36326955/article/details/54882506
# Below we use cPickle's dump function.
from sklearn.datasets.base import Bunch
# No need to dig into this import; just remember that this is how the Bunch
# data structure is imported. Later posts cover sklearn in more depth.


def _readfile(path):
    '''Read a file.'''
    # The leading underscore marks the function as private by convention
    # only; it can still be called from outside. It just aids readability.
    with open(path, "rb") as fp:
        content = fp.read()
    return content


def corpus2Bunch(wordbag_path, seg_path):
    catelist = os.listdir(seg_path)  # subdirectories of seg_path, i.e. the class labels
    # Create a Bunch instance
    bunch = Bunch(target_name=[], label=[], filenames=[], contents=[])
    bunch.target_name.extend(catelist)
    # extend(addlist) is a Python list method that extends the original
    # list with the elements of another list (addlist).

    # Collect every file under every class directory
    for mydir in catelist:
        class_path = seg_path + mydir + "/"  # class subdirectory path
        file_list = os.listdir(class_path)   # all files under class_path
        for file_path in file_list:          # iterate over files in the class directory
            fullname = class_path + file_path  # full file path
            bunch.label.append(mydir)
            bunch.filenames.append(fullname)
            bunch.contents.append(_readfile(fullname))  # read the file content
            # append(element) adds a single element to a list; note the
            # difference from extend().

    # Store the bunch at wordbag_path
    with open(wordbag_path, "wb") as file_obj:
        pickle.dump(bunch, file_obj)
    print "Bunch construction finished!!!"


if __name__ == "__main__":
    # Build the Bunch for the training set:
    wordbag_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/train_word_bag/train_set.dat"  # Bunch storage path, output
    seg_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/train_corpus_seg/"  # segmented corpus, input
    corpus2Bunch(wordbag_path, seg_path)

    # Build the Bunch for the test set:
    wordbag_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/test_word_bag/test_set.dat"  # Bunch storage path, output
    seg_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/test_corpus_seg/"  # segmented corpus, input
    corpus2Bunch(wordbag_path, seg_path)
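If you want to sanity-check the generated .dat file, a quick sketch (my addition; the relative path is assumed to match the output above, adjust it to your layout) that loads the Bunch back and prints a few fields:

import cPickle as pickle

# Path assumed to match the training-set output above
with open("train_word_bag/train_set.dat", "rb") as f:
    bunch = pickle.load(f)

print bunch.target_name                   # class list, e.g. ['art', ...]
print len(bunch.contents)                 # number of documents collected
print bunch.label[0], bunch.filenames[0]  # label and path of the first document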
3.py (TFIDF_space.py)
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
"""
@version: python2.7.8
@author: XiangguoSun
@contact: sunxiangguodut@qq.com
@file: TFIDF_space.py
@time: 2017/2/8 11:39
@software: PyCharm
"""
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
from sklearn.datasets.base import Bunch
import cPickle as pickle
from sklearn.feature_extraction.text import TfidfVectorizer


def _readfile(path):
    with open(path, "rb") as fp:
        content = fp.read()
    return content


def _readbunchobj(path):
    with open(path, "rb") as file_obj:
        bunch = pickle.load(file_obj)
    return bunch


def _writebunchobj(path, bunchobj):
    with open(path, "wb") as file_obj:
        pickle.dump(bunchobj, file_obj)


def vector_space(stopword_path, bunch_path, space_path, train_tfidf_path=None):
    stpwrdlst = _readfile(stopword_path).splitlines()  # stop-word list, one word per line
    bunch = _readbunchobj(bunch_path)
    tfidfspace = Bunch(target_name=bunch.target_name, label=bunch.label,
                       filenames=bunch.filenames, tdm=[], vocabulary={})

    if train_tfidf_path is not None:
        # Test set: reuse the training vocabulary so that the test matrix
        # has exactly the same columns as the training matrix.
        trainbunch = _readbunchobj(train_tfidf_path)
        tfidfspace.vocabulary = trainbunch.vocabulary
        vectorizer = TfidfVectorizer(stop_words=stpwrdlst, sublinear_tf=True,
                                     max_df=0.5, vocabulary=trainbunch.vocabulary)
        tfidfspace.tdm = vectorizer.fit_transform(bunch.contents)
    else:
        # Training set: learn the vocabulary from the corpus itself.
        vectorizer = TfidfVectorizer(stop_words=stpwrdlst, sublinear_tf=True, max_df=0.5)
        tfidfspace.tdm = vectorizer.fit_transform(bunch.contents)
        tfidfspace.vocabulary = vectorizer.vocabulary_

    _writebunchobj(space_path, tfidfspace)
    print "TF-IDF vector space created!!!"


if __name__ == '__main__':
    stopword_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/train_word_bag/hlt_stop_words.txt"  # input
    train_bunch_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/train_word_bag/train_set.dat"  # input
    space_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/train_word_bag/tfidfspace.dat"  # output
    vector_space(stopword_path, train_bunch_path, space_path)

    train_tfidf_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/train_word_bag/tfidfspace.dat"  # input, generated above
    test_bunch_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/test_word_bag/test_set.dat"  # input
    test_space_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/test_word_bag/testspace.dat"  # output
    vector_space(stopword_path, test_bunch_path, test_space_path, train_tfidf_path)
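The reason the test branch passes vocabulary=trainbunch.vocabulary is that the classifier's feature indices are defined by the training matrix: the test matrix must have the same columns, and words unseen in training are simply dropped. A toy sketch with made-up documents (my addition) illustrating this:

from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["apple banana", "banana cherry"]  # made-up documents
test_docs = ["cherry durian apple"]             # 'durian' never seen in training

train_vec = TfidfVectorizer()
train_tdm = train_vec.fit_transform(train_docs)

# Reusing the training vocabulary gives the test matrix identical columns;
# the unseen word 'durian' is silently ignored.
test_vec = TfidfVectorizer(vocabulary=train_vec.vocabulary_)
test_tdm = test_vec.fit_transform(test_docs)

print train_tdm.shape[1] == test_tdm.shape[1]  # True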
4.py (NBayes_Predict.py)
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
"""
@version: python2.7.8
@author: XiangguoSun
@contact: sunxiangguodut@qq.com
@file: NBayes_Predict.py
@time: 2017/2/8 12:21
@software: PyCharm
"""
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import cPickle as pickle
from sklearn.naive_bayes import MultinomialNB  # multinomial Naive Bayes


# Read a bunch object
def _readbunchobj(path):
    with open(path, "rb") as file_obj:
        bunch = pickle.load(file_obj)
    return bunch


# Load the training set
trainpath = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/train_word_bag/tfidfspace.dat"
train_set = _readbunchobj(trainpath)

# Load the test set
testpath = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/test_word_bag/testspace.dat"
test_set = _readbunchobj(testpath)

# Train the classifier on the TF-IDF matrix and the class labels.
# alpha is the additive (Laplace/Lidstone) smoothing parameter, not an
# iteration count: smaller alpha means less smoothing, which often fits the
# training data more tightly but can overfit small corpora.
clf = MultinomialNB(alpha=0.01).fit(train_set.tdm, train_set.label)

# Predict the test set and list the misclassified documents
predicted = clf.predict(test_set.tdm)
for flabel, file_name, expct_cate in zip(test_set.label, test_set.filenames, predicted):
    if flabel != expct_cate:
        print file_name, ": actual:", flabel, " --> predicted:", expct_cate

print "Prediction finished!!!"

# Compute classification metrics:
from sklearn import metrics


def metrics_result(actual, predict):
    print 'precision: {0:.3f}'.format(metrics.precision_score(actual, predict, average='weighted'))
    print 'recall: {0:.3f}'.format(metrics.recall_score(actual, predict, average='weighted'))
    print 'f1-score: {0:.3f}'.format(metrics.f1_score(actual, predict, average='weighted'))


metrics_result(test_set.label, predicted)
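Beyond the weighted precision/recall/F1 printed above, sklearn can also break the results down per class. A sketch that could be appended to the script (my addition; it assumes the test_set and predicted variables defined above):

from sklearn import metrics

# Per-class precision, recall, and F1, one row per category
print metrics.classification_report(test_set.label, predicted)

# Confusion matrix: rows are actual classes, columns are predicted classes
print metrics.confusion_matrix(test_set.label, predicted)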
A rough guide to usage:
First, run the four scripts above in order.
Second, the data must be laid out exactly as in the original blog: each folder name is the class name (for example, a file under train/art/ gets the label art), and the code picks the classes up from the directory names automatically.
Third, after each complete run, delete the train_corpus_seg and test_corpus_seg folders before starting the next run, otherwise residue from the previous run will distort the new prediction. The same applies whenever you switch to a different Chinese dataset. In short, step one before running this code is to check that those two folders are empty; a cleanup sketch follows below. (If this is your very first run and the folders do not exist yet, there is of course nothing to check.)
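A small sketch for that cleanup (my addition; the relative paths are assumed to sit next to the scripts, so adjust them to your own layout):

import os
import shutil

for d in ["train_corpus_seg/", "test_corpus_seg/"]:
    if os.path.exists(d):
        shutil.rmtree(d)  # delete the directory and everything inside it
        print "removed", d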
One more point: the strength of the original post is that even on a small dataset (fewer than 1,000 documents, evaluated with ten-fold cross-validation) the prediction accuracy reaches 60%~70%; see the sketch below.
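A sketch of how such a ten-fold check might be reproduced on the training matrix (my addition; in sklearn versions contemporary with this post, cross_val_score lived in sklearn.cross_validation, while newer versions have it in sklearn.model_selection):

from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer versions
from sklearn.naive_bayes import MultinomialNB

# train_set as loaded in NBayes_Predict.py above
clf = MultinomialNB(alpha=0.01)
scores = cross_val_score(clf, train_set.tdm, train_set.label, cv=10)
print "10-fold accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std())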
[Figure: input/output relationships between the four programs; diagram not reproduced in this repost]