Scikit-learn: Machine Learning in Python: Bayesian Learning

Source: Internet · Published by: java代码库 · Editor: 程序博客网 · Time: 2024/05/04 17:44

Chapter 2: Naive Bayes

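Naive Bayes is a simple yet powerful classifier: a probabilistic model based on Bayes' theorem. In essence, it uses the probability of each feature value to determine the probability that an instance belongs to a given class, under the (naive) assumption that the features are independent of one another given the class. One very successful application of Naive Bayes is natural language processing (NLP). NLP problems typically come with large amounts of labeled data (usually text files), which serve as the training set for the algorithm.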
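
Written out, the idea in the paragraph above looks roughly like this. The symbols are my own notation, not the book's: y is a class and x_1, ..., x_n are the feature values of one instance. The second expression is where the independence assumption is used.

```latex
P(y \mid x_1, \dots, x_n) = \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}
\quad\Longrightarrow\quad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

The denominator is the same for every class, so it can be dropped when only the most probable class is needed.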

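In this chapter we use Naive Bayes for text classification. The dataset is a collection of text documents, each labeled with its category; we train a Naive Bayes classifier on it to predict the category of a new, unseen document. The dataset provided through scikit-learn contains roughly 19,000 newsgroup messages on 20 different topics, ranging from politics and religion to sports and science.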

from sklearn.datasets import fetch_20newsgroups

news = fetch_20newsgroups(subset='all') # download the data and bind it to a name

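Note that the data is stored as a list of text contents rather than as a matrix. Also, the book uses Python 2 while I am using Python 3, so the code differs slightly from the book's.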

print (type(news.data),type(news.target),type(news.target_names))

Downloading dataset from http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz (14 MB)

print (news.target_names)

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']

print(len(news.data))
print(len(news.target))

18846

print(news.data[0])

From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>

Subject: Pens fans reactions

Organization: Post Office, Carnegie Mellon, Pittsburgh, PA

Lines: 12

NNTP-Posting-Host: po4.andrew.cmu.edu

I am sure some bashers of Pens fans are pretty confused about the lack

of any kind of posts about the recent Pens massacre of the Devils. Actually,

I am bit puzzled too and a bit relieved. However, I am going to put an end

to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they

are killing those Devils worse than I thought. Jagr just showed you why

he is much better than his regular season stats. He is also a lot

fo fun to watch in the playoffs. Bowman should let JAgr have a lot of

fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final

regular season game. PENS RULE!!!

print(news.target[0],news.target_names[news.target[0]]) # target is the index into target_names

10 rec.sport.hockey # indices start at 0

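Preprocessing the data:

The machine learning algorithms in this book only work on numeric data, so the text must first be converted into numeric features.

At the moment there is only one feature, the text content itself, so we need functions that turn a text into a meaningful set of numeric features. Intuitively, we look at which words (more precisely, tokens, which may include numbers or punctuation) occur in each text category, and then try to characterize each category by the frequency distribution of those words. sklearn.feature_extraction.text provides utilities for building numeric feature vectors from text documents.

Before transforming the data, we split it into training and test sets. The instances are already in random order, so the first 75% are used for training and the remaining 25% for testing.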

SPLIT_PERC = 0.75
split_size = int(len(news.data) * SPLIT_PERC)
x_train = news.data[:split_size]
x_test = news.data[split_size:]
y_train = news.target[:split_size]
y_test = news.target[split_size:]

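There are three ways to turn text into numeric features: CountVectorizer, HashingVectorizer, and TfidfVectorizer. (They differ in how the numeric features are computed.)

CountVectorizer builds a dictionary of words from the text corpus, then transforms each instance into a vector of numeric features in which each element is the number of times a particular word occurs in the document.

HashingVectorizer, instead of building a dictionary, applies a hashing function that maps tokens to feature indices, and then computes counts as CountVectorizer does.

TfidfVectorizer works like CountVectorizer but uses a more refined calculation, Term Frequency-Inverse Document Frequency (TF-IDF): a statistic that measures how important a word is in a document or corpus. It looks for words that are frequent in the current document relative to their frequency across the whole corpus; this normalizes the result and keeps words that are simply common everywhere from dominating.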
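
To make the three ideas concrete, here is a minimal sketch in plain Python on a toy corpus. It is not scikit-learn's implementation: the three documents, the bucket count N_BUCKETS, the CRC32 hash, and the textbook tf * log(N / df) weighting are all assumptions chosen for illustration. scikit-learn uses a different hash function and a smoothed, normalized TF-IDF variant.

```python
import math
import zlib
from collections import Counter

# Toy corpus (made up for this sketch), already lowercased.
docs = [
    "the pens beat the devils",
    "the devils lost the game",
    "nasa launched the probe",
]
tokenized = [doc.split() for doc in docs]

# 1) Count vectors: build a vocabulary, then count each word per document.
vocab = sorted({word for doc in tokenized for word in doc})
count_vectors = [[Counter(doc)[word] for word in vocab] for doc in tokenized]

# 2) Hashing trick: no vocabulary is stored; a hash of the word picks the column.
N_BUCKETS = 8  # hypothetical size; different words may collide in one bucket

def hashed_counts(tokens, n_buckets=N_BUCKETS):
    vector = [0] * n_buckets
    for word in tokens:
        vector[zlib.crc32(word.encode("utf-8")) % n_buckets] += 1
    return vector

hash_vectors = [hashed_counts(doc) for doc in tokenized]

# 3) TF-IDF (textbook form): term frequency times log(N / document frequency).
n_docs = len(tokenized)
doc_freq = Counter(word for doc in tokenized for word in set(doc))

def tfidf(tokens):
    counts = Counter(tokens)
    return {
        word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
        for word, count in counts.items()
    }

tfidf_vectors = [tfidf(doc) for doc in tokenized]

print(vocab)
print(count_vectors[0])
print(hash_vectors[0])
print(tfidf_vectors[0])
```

In the first document, "the" appears in every document of the corpus and so gets a TF-IDF weight of zero, while the rarer "pens" outweighs "devils". That is the down-weighting of overly frequent words described above.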

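Training a Naive Bayes classifier:

We build a Naive Bayes classifier composed of a feature vectorizer and the actual Bayes classifier. The classifier is MultinomialNB from the sklearn.naive_bayes module, and Pipeline from the sklearn.pipeline module chains the vectorizer and the classifier together. Here we combine MultinomialNB with each of the three text vectorizers mentioned above to create three classifiers, and compare which performs better with default parameters.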

from sklearn.naive_bayes import MultinomialNB

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer,HashingVectorizer,CountVectorizer

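clf_1 = Pipeline([('vect',CountVectorizer()),('clf',MultinomialNB()),])
# older scikit-learn releases spelled this non_negative=True; alternate_sign=False
# keeps the hashed counts non-negative, which MultinomialNB requires
clf_2 = Pipeline([('vect',HashingVectorizer(alternate_sign=False)),('clf',MultinomialNB()),])
clf_3 = Pipeline([('vect',TfidfVectorizer()),('clf',MultinomialNB()),])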

Define a function that takes a classifier and performs K-fold cross-validation on the given x and y values:

# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import cross_val_score,KFold
import numpy as np

from scipy.stats import sem

def evaluate_cross_validation(clf,x,y,K):
    # create a K-fold cross-validation iterator (K=5 in the calls below)
    cv = KFold(n_splits=K,shuffle=True,random_state=0)
    # by default the score used is the one returned by the estimator's score method (accuracy)
    scores = cross_val_score(clf,x,y,cv=cv)
    print(scores)

    print(("Mean score:{0:.3f} (+/-{1:.3f})").format(np.mean(scores),sem(scores)))
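
For readers who want to see what a shuffled K-fold split does, here is a minimal plain-Python sketch. It only illustrates the idea and is not how scikit-learn implements KFold; the helper name k_fold_indices and the sample size of 10 are made up for this example.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in folds:
    print("train:", sorted(train_idx), "test:", sorted(test_idx))
```

Every sample lands in exactly one test fold, so each of the K scores is computed on data the model did not see during that round of training.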

Then run 5-fold cross-validation for each classifier:

clfs = [clf_1,clf_2,clf_3]
for clf in clfs:
    evaluate_cross_validation(clf,news.data,news.target,5)

Results:

[ 0.85782493 0.85725657 0.84664367 0.85911382 0.8458477 ]

Mean score:0.853 (+/-0.003)

[ 0.75543767 0.77659857 0.77049615 0.78508888 0.76200584]

Mean score:0.770 (+/-0.005)

[ 0.84482759 0.85990979 0.84558238 0.85990979 0.84213319]

Mean score:0.850 (+/-0.004)
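
The "Mean score" line can be reproduced by hand. The sketch below takes the five fold scores printed for the first classifier and computes the mean and the standard error of the mean (sample standard deviation divided by the square root of the number of folds, which is what scipy.stats.sem computes by default). It uses only the standard library, so the variable names are mine.

```python
import math
import statistics

# Fold scores copied from the first result above (CountVectorizer).
scores = [0.85782493, 0.85725657, 0.84664367, 0.85911382, 0.8458477]

mean_score = statistics.mean(scores)
# statistics.stdev is the sample standard deviation (divides by n - 1).
std_error = statistics.stdev(scores) / math.sqrt(len(scores))

print("Mean score:{0:.3f} (+/-{1:.3f})".format(mean_score, std_error))
# prints: Mean score:0.853 (+/-0.003)
```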

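CountVectorizer and TfidfVectorizer clearly perform better than HashingVectorizer. We continue with TfidfVectorizer and try to improve the results by tokenizing the documents with a different regular expression.

The default regular expression, ur"\b\w\w+\b", matches alphanumeric characters and the underscore. Also allowing the hyphen and the dot may improve tokenization, so that tokens such as Wi-Fi and site.com are kept whole.

The new regular expression is ur"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b":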

clf_4 = Pipeline([('vect',TfidfVectorizer(token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",)),('clf',MultinomialNB()),]) # Python 3 does not support the ur prefix

evaluate_cross_validation(clf_4,news.data,news.target,5)

Results:

[ 0.86100796 0.8718493 0.86203237 0.87291059 0.8588485 ]

Mean score:0.865 (+/-0.003)

The score improves from 0.850 to 0.865.
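
To see what the two token patterns actually do, here is a small sketch with the standard re module. The sample sentence is made up, and it is lowercased by hand because the new pattern only lists lowercase letters (TfidfVectorizer lowercases its input by default).

```python
import re

DEFAULT_PATTERN = r"\b\w\w+\b"
NEW_PATTERN = r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b"

text = "Free Wi-Fi at site.com".lower()

default_tokens = re.findall(DEFAULT_PATTERN, text)
new_tokens = re.findall(NEW_PATTERN, text)

print(default_tokens)  # ['free', 'wi', 'fi', 'at', 'site', 'com']
print(new_tokens)      # ['free', 'wi-fi', 'site.com']
```

The new pattern keeps "wi-fi" and "site.com" as single tokens. As a side effect it needs at least three characters with at least one letter, so the two-letter word "at" is dropped.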

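There is another parameter, stop_words, which lets us pass a list of words that should be left out of the calculation, such as words that are too frequent or words we do not expect a priori to carry information about a particular topic.

Define a function that loads the stop words: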

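def get_stop_words():
    # stopwords_en.txt holds one stop word per line
    result = set()
    with open('stopwords_en.txt','r') as f:
        for line in f:
            result.add(line.strip())
    # return a list, which TfidfVectorizer accepts for stop_words
    return sorted(result)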

Then build a new classifier:

clf_5 = Pipeline([('vect',TfidfVectorizer(stop_words= get_stop_words(),token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",)),('clf',MultinomialNB()),])
evaluate_cross_validation(clf_5,news.data,news.target,5)

Results:

[ 0.88222812 0.89625895 0.88591138 0.89599363 0.88485009]

Mean score:0.889 (+/-0.003)

The score improves from 0.865 to 0.889.
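
The effect of a stop-word list is easy to see on a single sentence. In this sketch the stop-word set is a tiny made-up example, not the contents of stopwords_en.txt.

```python
# Hypothetical stop-word set for illustration only.
stop_words = {"the", "of", "a", "is", "to"}

tokens = "the pens are going to beat the pulp out of jersey".split()
kept = [token for token in tokens if token not in stop_words]

print(kept)  # ['pens', 'are', 'going', 'beat', 'pulp', 'out', 'jersey']
```

The words that remain are the ones more likely to say something about the topic.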

Now for the parameters of MultinomialNB. The most important one is alpha, the smoothing parameter. Its default value is 1.0; let us set it to 0.1:

clf_6 = Pipeline([('vect',TfidfVectorizer(stop_words= get_stop_words(),token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",)),('clf',MultinomialNB(alpha=0.1)),])
evaluate_cross_validation(clf_6,news.data,news.target,5)

Results:

[ 0.91405836 0.91589281 0.91085168 0.91721942 0.91509684]

Mean score:0.915 (+/-0.001)

The score improves from 0.889 to 0.915. From here, one could test how different alpha values affect the result and choose the best one.
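
Why does alpha matter? It controls additive smoothing of the per-class word probabilities: each word's probability is estimated as (count + alpha) / (total + alpha * V), where V is the vocabulary size. The sketch below applies that formula to made-up counts for a single class; the function name and the numbers are assumptions for illustration.

```python
def smoothed_word_probs(word_counts, alpha):
    """P(word | class) with additive smoothing: (count + alpha) / (total + alpha * V)."""
    total = sum(word_counts.values())
    vocab_size = len(word_counts)
    return {
        word: (count + alpha) / (total + alpha * vocab_size)
        for word, count in word_counts.items()
    }

# Hypothetical word counts observed in one class (say, hockey posts).
hockey_counts = {"puck": 3, "goal": 1, "orbit": 0}

probs_default = smoothed_word_probs(hockey_counts, alpha=1.0)
probs_small = smoothed_word_probs(hockey_counts, alpha=0.1)

for name, probs in (("alpha=1.0", probs_default), ("alpha=0.1", probs_small)):
    print(name, {word: round(p, 3) for word, p in probs.items()})
```

With a smaller alpha, the unseen word "orbit" receives much less probability mass and the estimates stay closer to the observed counts. Whether that helps depends on the data, which is why alpha is worth tuning.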

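Model evaluation:

Define a function that trains the model on the entire training set and evaluates its accuracy on both the training set and the test set.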

from sklearn import metrics

def train_and_evaluate(clf,x_train,x_test,y_train,y_test):
    clf.fit(x_train,y_train)
    print("Accuracy on training set:")
    print(clf.score(x_train,y_train))
    print("Accuracy on testing set:")
    print(clf.score(x_test,y_test))
    # compare the true labels against the model's predictions on the test set
    y_pred = clf.predict(x_test)
    print("Classification Report:")
    print(metrics.classification_report(y_test,y_pred))
    print("Confusion Matrix:")
    print(metrics.confusion_matrix(y_test,y_pred))

train_and_evaluate(clf_6,x_train,x_test,y_train,y_test)

Results:

Accuracy on training set:

0.98776001132

Accuracy on testing set:

0.909592529711

As shown above, the results are quite good: accuracy on the test set also comes close to 0.91.
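
For reference, here is a minimal plain-Python sketch of how a confusion matrix and per-class precision and recall are derived from true and predicted labels. The labels are made up, and the helper names are my own; rows are true classes and columns are predicted classes, the same layout scikit-learn uses.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, pred_label in zip(y_true, y_pred):
        matrix[true_label][pred_label] += 1
    return matrix

def precision_recall(matrix, cls):
    true_pos = matrix[cls][cls]
    predicted = sum(row[cls] for row in matrix)  # column sum: predicted as cls
    actual = sum(matrix[cls])                    # row sum: truly cls
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall

# Hypothetical labels for six documents in three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(y_true, y_pred, 3)
accuracy = sum(cm[i][i] for i in range(3)) / len(y_true)

print(cm)        # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
print(accuracy)
print(precision_recall(cm, 1))
```

Note that these metrics only mean something when y_pred comes from the model. Comparing y_test with itself would always give a perfect report.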

0 0