NLP12 - A Discussion of Bayes and Text Classification


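Abstract: This post covers the basics of Bayes (the formula and the principle) and a small example of applying Bayes to text classification. After working through a hand-computed example, it uses scikit-learn to explore classifying Chinese text. The experiment uses a little over 200 records in three categories. Accuracy is 83% when all three categories are classified together, and above 90% when distinguishing any two categories.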

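0. Definition of Bayes

There are many definitions of Bayes online. One good reference is "From Bayesian Methods to Bayesian Networks":
http://blog.csdn.net/v_july_v/article/details/40984699
The key idea to take away: prior distribution f(a) + sample information x ==> posterior distribution f(a|x)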
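The "prior + sample information ==> posterior" idea can be sketched for a discrete case. This is a minimal illustration, not taken from the referenced article. The hypothesis names and probabilities are made-up numbers, and `posterior` is a hypothetical helper.

```python
def posterior(prior, likelihood):
    """Return P(a | x) for each hypothesis a, given P(a) and P(x | a)."""
    unnormalized = {a: prior[a] * likelihood[a] for a in prior}
    evidence = sum(unnormalized.values())  # P(x), the normalizing constant
    return {a: p / evidence for a, p in unnormalized.items()}


# Made-up numbers: 30% of documents are "news"; the observed word appears
# in 60% of news documents and in 10% of all other documents.
prior = {"news": 0.3, "other": 0.7}
likelihood = {"news": 0.6, "other": 0.1}

post = posterior(prior, likelihood)
print(post)
```

Observing the word raises the probability of "news" from the prior 0.3 to 0.18 / (0.18 + 0.07) = 0.72.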

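1. Example

Once the Bayes formula and principle are clear, it helps to see how they are used in text classification by computing a simple example by hand. The example below comes from http://blog.csdn.net/jteng/article/details/51499363:
[Figure: hand-worked Bayes text classification example]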
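A hand calculation of this kind can be reproduced in a few lines. The sketch below is a multinomial naive Bayes classifier with add-one (Laplace) smoothing. The toy corpus is an assumption chosen for illustration and is not the example from the figure. `train` and `log_scores` are hypothetical helper names.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """docs: list of (tokens, label). Returns priors, per-class word counts, vocabulary."""
    class_docs = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocab = {w for tokens, _ in docs for w in tokens}
    priors = {c: n / len(docs) for c, n in class_docs.items()}
    return priors, word_counts, vocab


def log_scores(tokens, priors, word_counts, vocab):
    """Return log P(c) + sum of log P(w | c) for each class c."""
    scores = {}
    for c, prior in priors.items():
        total = sum(word_counts[c].values())
        score = math.log(prior)
        for w in tokens:
            # Add-one smoothing keeps unseen words from zeroing the product
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = score
    return scores


# Toy corpus (illustrative assumption): class "c" vs. class "j"
docs = [
    ("Chinese Beijing Chinese".split(), "c"),
    ("Chinese Chinese Shanghai".split(), "c"),
    ("Chinese Macao".split(), "c"),
    ("Tokyo Japan Chinese".split(), "j"),
]
priors, word_counts, vocab = train(docs)
scores = log_scores("Chinese Chinese Chinese Tokyo Japan".split(),
                    priors, word_counts, vocab)
predicted = max(scores, key=scores.get)
print(predicted, {c: math.exp(s) for c, s in scores.items()})
```

By hand, the score for "c" is 3/4 * (3/7)^3 * (1/14)^2 and the score for "j" is 1/4 * (2/9)^5, so the test document is assigned to "c".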

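2. Practice

After learning the definition and seeing how Bayes applies to text, the next question is how the computation is implemented. The scikit-learn user guide has a section on naive Bayes (http://scikit-learn.org/stable/modules/naive_bayes.html),
which explains it clearly:
[Figure: naive Bayes description from the scikit-learn user guide]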
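For reference, the naive Bayes rule can be written out as follows. This is the standard textbook form under the conditional independence assumption. The notation (class y, features x_1 ... x_n, per-class mean and variance for the Gaussian case) is chosen here and may differ from the figure.

```latex
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y)

\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)

% Gaussian Naive Bayes models each feature likelihood as a normal density:
P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_{y,i}^{2}}}
                \exp\!\left( -\frac{(x_i - \mu_{y,i})^{2}}{2\sigma_{y,i}^{2}} \right)
```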

2.1 sklearn

Naive Bayes comes in three models: Gaussian Naive Bayes, Multinomial Naive Bayes, and Bernoulli Naive Bayes.
For large datasets, all three models provide a partial_fit function for incremental training.
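A minimal sketch of incremental training with partial_fit, assuming synthetic data (two Gaussian blobs) and a batch size of 50, both chosen only for illustration. It compares the per-class means learned batch by batch with those from a single fit call.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.RandomState(0)
# Synthetic data (assumption): two well-separated Gaussian blobs
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(5.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)
order = rng.permutation(len(y))
X, y = X[order], y[order]

incremental = GaussianNB()
classes = np.unique(y)
for start in range(0, len(y), 50):
    batch = slice(start, start + 50)
    # The full list of classes must be supplied on the first partial_fit call
    incremental.partial_fit(X[batch], y[batch], classes=classes)

full = GaussianNB().fit(X, y)
print(np.allclose(incremental.theta_, full.theta_))
```

In a real large-data setting the batches would be read from disk or a stream rather than sliced from an in-memory array.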

2.2 Constructor

    def __init__(self, priors=None)

A prior can be passed in. If no prior is given, it is computed from the data like this:

    # Update if only no priors is provided
    if self.priors is None:
        # Empirical prior, with sample_weight taken into account
        self.class_prior_ = self.class_count_ / self.class_count_.sum()
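The effect of the priors argument can be checked on a small example. This is a sketch with made-up one-dimensional data: without priors, class_prior_ is the class frequency; with priors, the supplied values are used.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Made-up data: 3 samples of class 0 and 5 samples of class 1
X = np.array([[0.0], [0.2], [0.1], [5.0], [5.2], [4.9], [5.1], [5.3]])
y = np.array([0, 0, 0, 1, 1, 1, 1, 1])

# No priors given: the empirical class frequencies 3/8 and 5/8 are used
empirical = GaussianNB().fit(X, y)
print(empirical.class_prior_)

# Priors given: they are used as-is instead of the class frequencies
fixed = GaussianNB(priors=[0.5, 0.5]).fit(X, y)
print(fixed.class_prior_)
```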

2.3 Two training interfaces

    def fit(self, X, y, sample_weight=None)
    def partial_fit(self, X, y, classes=None, sample_weight=None)

Both training functions call the following function to do the actual training:

    def _partial_fit(self, X, y, classes=None, _refit=False,
                     sample_weight=None)

Parameter updates: for Gaussian Naive Bayes the mean and variance can be computed online. The related paper is "Updating Formulae and a Pairwise Algorithm for Computing Sample Variances", http://i.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf

    def _update_mean_variance(n_past, mu, var, X, sample_weight=None)
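The online update can be sketched with NumPy. This is an illustrative reimplementation of the pairwise merge idea without sample weights, not scikit-learn's own code. `update_mean_variance` is a hypothetical name, and the data is random.

```python
import numpy as np


def update_mean_variance(n_past, mu, var, X):
    """Merge running statistics (n_past, mu, var) with a new batch X (rows = samples)."""
    n_new = X.shape[0]
    if n_new == 0:
        return n_past, mu, var
    new_mu = X.mean(axis=0)
    new_var = X.var(axis=0)  # population variance (ddof=0)
    if n_past == 0:
        return n_new, new_mu, new_var
    n_total = n_past + n_new
    total_mu = (n_past * mu + n_new * new_mu) / n_total
    # Combine the two sums of squared deviations, plus a correction term
    # for the difference between the old mean and the batch mean
    old_ssd = n_past * var
    new_ssd = n_new * new_var
    total_ssd = old_ssd + new_ssd + (n_past * n_new / n_total) * (mu - new_mu) ** 2
    return n_total, total_mu, total_ssd / n_total


rng = np.random.RandomState(0)
X = rng.normal(loc=2.0, scale=3.0, size=(300, 3))

n, mu, var = 0, np.zeros(3), np.zeros(3)
for start in range(0, 300, 100):
    n, mu, var = update_mean_variance(n, mu, var, X[start:start + 100])

print(np.allclose(mu, X.mean(axis=0)), np.allclose(var, X.var(axis=0)))
```

Only the count, mean, and variance are carried between batches, so earlier batches never need to be revisited.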

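3. Data

Three categories of data were crawled (chronic disease prevention, maternal and infant care, pharmaceutical industry news). The distribution of texts across the categories:
[Figure: distribution of records]

The three categories are labeled 0, 1, and 3 respectively. To see how far apart any two categories are, the three categories were split into pairs, giving three groups to evaluate. As in the previous post, LSI was used to generate the vector matrix, and classification was trained on that matrix. Gaussian Naive Bayes was used here, although one question remains open: are the samples normally distributed after LSI dimensionality reduction? If you know, please let me know.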
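Splitting a three-class dataset into its three pairwise groups could look like the sketch below. `pairwise_subsets` is a hypothetical helper and the random matrix stands in for the LSI vectors. Relabeling each pair to 0/1 is a design choice made here so that binary metrics such as ROC have an unambiguous positive class.

```python
from itertools import combinations

import numpy as np


def pairwise_subsets(data, target):
    """Yield ((label_a, label_b), X_pair, y_pair) for every pair of labels."""
    for a, b in combinations(np.unique(target), 2):
        mask = (target == a) | (target == b)
        # Relabel to 0/1: label_b becomes the positive class
        yield (a, b), data[mask], (target[mask] == b).astype(int)


# Stand-in for the LSI matrix (assumption): 9 documents, 4 dimensions
rng = np.random.RandomState(0)
data = rng.normal(size=(9, 4))
target = np.array([0, 0, 0, 1, 1, 1, 3, 3, 3])

pairs = list(pairwise_subsets(data, target))
for (a, b), X_pair, y_pair in pairs:
    print(a, b, X_pair.shape, y_pair.tolist())
```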

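4. Results

Classifying maternal-and-infant articles against chronic-disease articles gives a mean ROC area of 0.95, which is a fairly good result.
[Figure: ROC curves, maternal and infant care vs. chronic disease prevention]
Maternal and infant care vs. news

Chronic disease prevention vs. news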
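The abstract also quotes a three-class accuracy figure, while the code in this post only plots two-class ROC curves. One way such an accuracy could be measured is cross-validated scoring, sketched below. This is an assumption about the procedure, and the synthetic three-blob data stands in for the real LSI vectors.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.RandomState(0)
# Synthetic stand-in (assumption): three classes, 70 documents each, 5 dimensions
data = np.vstack([rng.normal(center, 1.0, size=(70, 5)) for center in (0.0, 3.0, 6.0)])
target = np.repeat([0, 1, 3], 70)

# With an integer cv and a classifier, scikit-learn uses stratified folds
scores = cross_val_score(GaussianNB(), data, target, cv=6, scoring="accuracy")
print(scores.mean())
```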

5. Code

    # -*- coding:utf-8 -*-
    import re
    import string

    import jieba
    import jieba.analyse
    import matplotlib.pyplot as plt
    import numpy as np
    from bs4 import BeautifulSoup
    from gensim import corpora, models, matutils
    from sklearn.metrics import roc_curve, auc
    # sklearn.cross_validation was removed in later releases; use model_selection
    from sklearn.model_selection import StratifiedKFold
    from sklearn.naive_bayes import GaussianNB


    # Check whether a token is a number (integer or decimal)
    def isXiaoShu(word):
        rs = False
        a = re.search(r'^\d*\.?\d*$', word)
        if a:
            if a.group(0) == '':
                pass
            else:
                rs = True
        else:
            pass
        return rs


    # Word segmentation
    def cutPhase(inFile, outFile):
        # jieba.load_userdict("dict_all.txt")
        stoplist = {}.fromkeys([line.strip() for line in open('config/stopwords.txt', 'r', encoding='utf-8')])
        f1 = open(inFile, 'r', encoding='utf-8')
        f2 = open(outFile, 'a', encoding='utf-8')
        line = f1.readline()
        count = 0
        while line:
            # Strip any HTML markup from the line
            b = BeautifulSoup(line, "lxml")
            line = b.text
            # line.replace('\u3000', '').replace('\t', '').replace(' ', '')
            segs = jieba.cut(line, cut_all=False)
            # Drop whitespace-only tokens, stop words, punctuation and numbers
            segs = [word for word in list(segs)
                    if word.strip()
                    and word.lstrip() not in stoplist
                    and word.lstrip() not in string.punctuation
                    and not isXiaoShu(word.lstrip())
                    ]
            f2.write(" ".join(segs))
            f2.write('\n')
            line = f1.readline()
            count += 1
            if count % 100 == 0:
                print(count)
        f1.close()
        f2.close()


    # Streams the segmented file as bag-of-words vectors, one document per line
    class MyNews(object):
        def __init__(self, dictionary, in_file):
            self.dictionary = dictionary
            self.in_file = in_file

        def __iter__(self):
            for line in open(self.in_file, encoding='utf-8'):
                # Lowercase to match how the dictionary was built
                yield self.dictionary.doc2bow(line.lower().split())

        def __len__(self):
            return 0


    def trainBayes():
        # Load the bag-of-words corpus and the LSI model
        print('loading bows')
        bows = corpora.MmCorpus(u'data/资讯文章数据.mm')
        print('loading LSI model')
        lsi = models.LsiModel.load(u'data/资讯文章数据.lsi')
        # Note: the LSI model was trained on TF-IDF vectors, while raw
        # bag-of-words vectors are projected here
        bow_lsi = lsi[bows]
        # Convert the corpus to a dense numpy array of shape (num_docs, 100)
        data = np.transpose(matutils.corpus2dense(bow_lsi, 100))
        target = np.loadtxt("data/资讯文章数据_f.txt")
        print('data.shape:', data.shape)
        print('target.shape:', target.shape)
        classifier = GaussianNB()
        params = classifier.get_params()
        print(params)
        n_splits = 6
        cv = StratifiedKFold(n_splits=n_splits)
        mean_tpr = 0.0
        mean_fpr = np.linspace(0, 1, 100)
        # Use a font that can render the Chinese plot title
        plt.rcParams["font.family"] = "SimHei"
        for i, (train, test) in enumerate(cv.split(data, target)):
            probas_ = classifier.fit(data[train], target[train]).predict_proba(data[test])
            fpr, tpr, thresholds = roc_curve(target[test], probas_[:, 1])
            # Interpolate this fold's TPR onto the mean_fpr grid
            # (np.interp replaces the removed scipy.interp)
            mean_tpr += np.interp(mean_fpr, fpr, tpr)
            mean_tpr[0] = 0.0  # the curve starts at 0
            roc_auc = auc(fpr, tpr)
            # Plot this fold's ROC curve; roc_auc is only used for the legend
            plt.plot(fpr, tpr, lw=1, label='ROC fold %d (area = %0.2f)' % (i, roc_auc))
        # Draw the diagonal
        plt.plot([0, 1], [0, 1], '--', color=(0.6, 0.6, 0.6), label='Luck')
        mean_tpr /= n_splits  # average the interpolated values over the folds
        mean_tpr[-1] = 1.0  # the last point is (1, 1)
        mean_auc = auc(mean_fpr, mean_tpr)  # mean AUC
        # Plot the mean ROC curve
        plt.plot(mean_fpr, mean_tpr, 'k--',
                 label='Mean ROC (area = %0.2f)' % mean_auc, lw=2)
        plt.xlim([-0.05, 1.05])
        plt.ylim([-0.05, 1.05])
        plt.xlabel('False Positive Rate')
        plt.ylabel('True Positive Rate')
        plt.title('两类ROC-慢病预防&&新闻')  # "Two-class ROC: chronic disease prevention vs. news"
        plt.legend(loc="lower right")
        plt.show()


    if __name__ == '__main__':
        is_train = True
        # Train and build the models
        if is_train:
            print("***segmenting***")
            cutPhase(inFile=u'data/资讯文章数据.txt', outFile=u"data/资讯文章数据.cut")
            print("***building dictionary***")
            dictionary = corpora.Dictionary(line.lower().split() for line in open(u'data/资讯文章数据.cut', encoding='utf-8'))
            dictionary.save('data/资讯文章数据.dic')
            # Load the dictionary and build the bag-of-words corpus
            # if is_load:
            #     dictionary = corpora.Dictionary.load(u'data/资讯文章数据.dic')
            print('=================dictionary info=============')
            print('number of terms:', len(dictionary.keys()))
            print('documents processed (num_docs):', dictionary.num_docs)
            print('total tokens before deduplication (num_pos):', dictionary.num_pos)
            print('=================dictionary=============')
            bows = MyNews(dictionary, in_file=u'data/资讯文章数据.cut')
            print("***saving bag-of-words corpus***")
            corpora.MmCorpus.serialize('data/资讯文章数据.mm', bows)
            print("***computing tf-idf***")
            tfidf = models.TfidfModel(dictionary=dictionary)
            corpus_tfidf = tfidf[bows]
            tfidf.save(u'data/资讯文章数据.tfidf')
            print("***computing and saving the LSI model***")
            lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=100)
            lsi.save(u'data/资讯文章数据.lsi')
            # Project the whole corpus into LSI space
            corpus_lsi = lsi[corpus_tfidf]
            # Train
            trainBayes()

[Author: happyprince, http://blog.csdn.net/ld326/article/details/78524486]
