Scikit-learn: Machine Learning in Python, Bayesian Learning
Source: Internet  Published by: java代码库  Editor: 程序博客网  Time: 2024/05/04 17:44
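Chapter 2: Naive Bayes
Naive Bayes is a simple yet powerful classifier, a probabilistic model based on Bayes' theorem. In essence, it uses the probability of each feature value to determine the probability that an instance belongs to a given class, under the assumption that the features are independent of one another given the class. One very successful application of Naive Bayes is natural language processing (NLP): NLP problems typically come with large amounts of labeled data (usually text documents), which serve as the training set for the algorithm.
This chapter shows how to use Naive Bayes for text classification. The dataset is a set of text documents labeled with their categories; we train a Naive Bayes algorithm on it to predict the category of a new, unseen document. The dataset provided through scikit-learn contains about 19,000 newsgroup messages on 20 different topics, ranging from politics and religion to sports and science.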
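The independence assumption can be written compactly. As a sketch, with notation introduced here for illustration (a class $c$ and feature values $x_1, \dots, x_n$, neither of which appears in the original text), the posterior factorizes and the predicted class is the one that maximizes it:

```latex
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c),
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

The product over features is exactly where the "naive" assumption enters: each $P(x_i \mid c)$ is estimated separately instead of modeling the features jointly.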
from sklearn.datasets import fetch_20newsgroups
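news = fetch_20newsgroups(subset='all') # load the dataset and assign it to news
Note that the data is stored as a list of text contents, not as a matrix. Also, the book uses Python 2 while I am using Python 3, so the code differs slightly from the book.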
print (type(news.data),type(news.target),type(news.target_names))
Downloading dataset from http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz (14 MB)
<class 'list'> <class 'numpy.ndarray'> <class 'list'>
print (news.target_names)
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
print(len(news.data))
print(len(news.target))
18846
18846
print(news.data[0])
From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>
Subject: Pens fans reactions
Organization: Post Office, Carnegie Mellon, Pittsburgh, PA
Lines: 12
NNTP-Posting-Host: po4.andrew.cmu.edu
I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game. PENS RULE!!!
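print(news.target[0],news.target_names[news.target[0]]) # target is used as an index into target_names
10 rec.sport.hockey # indices start at 0
Preprocessing the data:
The machine learning algorithms in this book work only on numeric data, so the text data has to be converted into numeric data.
At this point there is only one feature, the text content, so we need functions that turn the text into a meaningful set of numeric features. Intuitively, we look at which words (more precisely, tokens, which may include numbers or punctuation) appear in each text category, and then try to characterize each category by the frequency distribution of those words. sklearn.feature_extraction.text provides utilities for building numeric feature vectors from text documents.
Before transforming the data, split it into a training set and a test set. Because the instances are already in random order, the first 75% are used for training and the remaining 25% for testing.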
SPLIT_PETC = 0.75
split_size = int(len(news.data) * SPLIT_PETC)
x_train = news.data[:split_size]
x_test = news.data[split_size:]
y_train = news.target[:split_size]
y_test = news.target[split_size:]
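The slicing above relies on the instances already being in random order. When that is not guaranteed, one option is to shuffle explicitly before cutting. The following is only a minimal standard-library sketch; the helper name shuffled_split, its parameters, and the toy data are assumptions made for illustration and are not part of the original code:

```python
import random

def shuffled_split(data, target, train_fraction=0.75, seed=0):
    """Shuffle paired samples and labels, then cut at train_fraction."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    cut = int(len(indices) * train_fraction)
    train_idx, test_idx = indices[:cut], indices[cut:]
    x_train = [data[i] for i in train_idx]
    x_test = [data[i] for i in test_idx]
    y_train = [target[i] for i in train_idx]
    y_test = [target[i] for i in test_idx]
    return x_train, x_test, y_train, y_test

# Toy data standing in for news.data / news.target
docs = ["doc%d" % i for i in range(8)]
labels = [i % 2 for i in range(8)]
x_tr, x_te, y_tr, y_te = shuffled_split(docs, labels)
print(len(x_tr), len(x_te))
```

Shuffling indices rather than the lists themselves keeps each document paired with its own label.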
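There are three ways to turn text into numeric features: CountVectorizer, HashingVectorizer, and TfidfVectorizer. (They differ in how the numeric features are computed.)
CountVectorizer builds a dictionary of words from the text corpus and then turns each instance into a numeric feature vector in which each element is the number of times a particular word occurs in the document.
HashingVectorizer applies a hashing function that maps tokens to feature indices, and then computes counts as CountVectorizer does.
TfidfVectorizer works like CountVectorizer but uses a more refined calculation, term frequency-inverse document frequency (TF-IDF): a statistic that measures how important a word is to a document within a collection. It looks for words that are frequent in the current document relative to how often they occur across the whole collection, which normalizes the result and keeps very common words from dominating.
Training the Naive Bayes classifier:
Build a Naive Bayes classifier composed of a feature vectorizer and the actual Bayes classifier: use MultinomialNB from the sklearn.naive_bayes module, and Pipeline from the sklearn.pipeline module to chain the vectorizer and the classifier together. Here we combine MultinomialNB with each of the three text vectorizers mentioned above to build three classifiers, then compare which one performs better with default parameters.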
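To make the difference between the three approaches concrete, here is a small standard-library sketch of what each one computes on a toy corpus. It is not how scikit-learn implements them: the two sentences, the bucket count, the CRC32 hash, and the unsmoothed idf formula ln(N / df) + 1 are all assumptions chosen for illustration (TfidfVectorizer by default also smooths the idf and L2-normalizes each row).

```python
import math
import zlib
from collections import Counter

docs = ["the puck hit the net", "the rocket reached orbit"]
tokenized = [d.split() for d in docs]

# 1) Count features: one column per distinct word in the corpus.
vocab = sorted({w for doc in tokenized for w in doc})
counts = [[Counter(doc)[w] for w in vocab] for doc in tokenized]

# 2) Hashed features: no vocabulary is stored; a hash picks the column.
N_BUCKETS = 8  # illustrative size; different words may collide

def hashed_counts(doc):
    row = [0] * N_BUCKETS
    for w in doc:
        row[zlib.crc32(w.encode("utf-8")) % N_BUCKETS] += 1
    return row

hashed = [hashed_counts(doc) for doc in tokenized]

# 3) TF-IDF features: counts scaled by idf = ln(N / df) + 1,
#    so words found in every document get the smallest weight.
n_docs = len(tokenized)
df = {w: sum(w in doc for doc in tokenized) for w in vocab}
tfidf = [
    [Counter(doc)[w] * (math.log(n_docs / df[w]) + 1.0) for w in vocab]
    for doc in tokenized
]

print(vocab)
print(counts)
print(hashed)
print([[round(v, 3) for v in row] for row in tfidf])
```

In the second document, "the" and "rocket" both occur once, but "the" appears in every document and so ends up with the lower TF-IDF weight.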
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer,HashingVectorizer,CountVectorizer
clf_1 = Pipeline([('vect',CountVectorizer()),('clf',MultinomialNB()),])
# alternate_sign=False keeps the hashed feature values non-negative, as MultinomialNB requires
# (older scikit-learn releases spelled this non_negative=True)
clf_2 = Pipeline([('vect',HashingVectorizer(alternate_sign=False)),('clf',MultinomialNB()),])
clf_3 = Pipeline([('vect',TfidfVectorizer()),('clf',MultinomialNB()),])
Define a function that takes a classifier and performs cross-validation on the given x and y values:
from sklearn.model_selection import cross_val_score,KFold # sklearn.cross_validation in older releases
import numpy as np
from scipy.stats import sem
def evaluate_cross_validation(clf,x,y,K):
    # create a K-fold cross-validation iterator (K=5 in the calls below)
    cv = KFold(n_splits=K,shuffle=True,random_state=0)
    # by default the score used is the one returned by the estimator's score method (accuracy)
    scores = cross_val_score(clf,x,y,cv=cv)
    print(scores)
    print(("Mean score:{0:.3f} (+/-{1:.3f})").format(np.mean(scores),sem(scores)))
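For readers unfamiliar with K-fold cross-validation, the splitting step can be sketched in plain Python. This is an illustrative stand-in, not scikit-learn's KFold: the function name k_fold_indices and the way the remainder is distributed are assumptions made here.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for k shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(10, 5))
for train, test in folds:
    print(len(train), len(test))
```

Each sample lands in exactly one test fold, so every instance is used for testing once and for training K-1 times; the classifier is refit from scratch on each training part.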
Then run 5-fold cross-validation on each classifier:
clfs = [clf_1,clf_2,clf_3]
for clf in clfs:
    evaluate_cross_validation(clf,news.data,news.target,5)
The results are as follows:
[ 0.85782493 0.85725657 0.84664367 0.85911382 0.8458477 ]
Mean score:0.853 (+/-0.003)
[ 0.75543767 0.77659857 0.77049615 0.78508888 0.76200584]
Mean score:0.770 (+/-0.005)
[ 0.84482759 0.85990979 0.84558238 0.85990979 0.84213319]
Mean score:0.850 (+/-0.004)
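The summary line printed after each score array is the mean of the five fold scores followed by the standard error of the mean. As a check, the first summary can be reproduced from the first score array with the standard library alone; this sketch assumes the sample standard deviation (n - 1 in the denominator), which is the default of scipy.stats.sem:

```python
import math

# Fold scores reported for the CountVectorizer pipeline
scores = [0.85782493, 0.85725657, 0.84664367, 0.85911382, 0.8458477]

mean = sum(scores) / len(scores)
# Sample variance: divide by n - 1 rather than n
variance = sum((s - mean) ** 2 for s in scores) / (len(scores) - 1)
standard_error = math.sqrt(variance) / math.sqrt(len(scores))

print("Mean score:{0:.3f} (+/-{1:.3f})".format(mean, standard_error))
# Mean score:0.853 (+/-0.003)
```

The standard error shrinks with the square root of the number of folds, so it describes how stable the mean is rather than how far individual folds spread.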
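CountVectorizer and TfidfVectorizer clearly give better results than HashingVectorizer. We continue with TfidfVectorizer and try to improve the results by changing the regular expression used to split the documents into tokens.
The default regular expression, ur"\b\w\w+\b", matches alphanumeric characters and the underscore. Also allowing the hyphen and the dot might improve the tokenization, so that strings such as Wi-Fi and site.com are treated as single tokens.
The new regular expression: ur"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b":
clf_4 = Pipeline([('vect',TfidfVectorizer(token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",)),('clf',MultinomialNB()),]) # Python 3 does not support the ur prefix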
evaluate_cross_validation(clf_4,news.data,news.target,5)
The results are as follows:
[ 0.86100796 0.8718493 0.86203237 0.87291059 0.8588485 ]
Mean score:0.865 (+/-0.003)
The mean score improves from 0.850 to 0.865.
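The effect of the two token patterns can be seen directly with Python's re module. This is a small sketch on a made-up sentence; the text is lowercased first because the vectorizers lowercase their input by default before tokenizing.

```python
import re

DEFAULT_PATTERN = r"\b\w\w+\b"
NEW_PATTERN = r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b"

text = "Wi-Fi works at site.com".lower()

default_tokens = re.findall(DEFAULT_PATTERN, text)
new_tokens = re.findall(NEW_PATTERN, text)

print(default_tokens)  # ['wi', 'fi', 'works', 'at', 'site', 'com']
print(new_tokens)      # ['wi-fi', 'works', 'site.com']
```

Besides keeping "wi-fi" and "site.com" whole, the new pattern needs at least three characters, one of them a letter, so the two-letter word "at" is dropped.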
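There is also another parameter, stop_words, which lets us pass a list of words to leave out of the calculation, such as words that are too frequent or words we do not expect, a priori, to carry information about a particular topic.
Define a function that loads the stop words: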
def get_stop_words():
    result = set()
    # stopwords_en.txt holds one stop word per line
    with open('stopwords_en.txt','r') as f:
        for line in f:
            result.add(line.strip())
    # return a list, which scikit-learn accepts for stop_words in old and new releases alike
    return sorted(result)
Then build a new classifier:
clf_5 = Pipeline([('vect',TfidfVectorizer(stop_words= get_stop_words(),token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",)),('clf',MultinomialNB()),])
evaluate_cross_validation(clf_5,news.data,news.target,5)
The results are as follows:
[ 0.88222812 0.89625895 0.88591138 0.89599363 0.88485009]
Mean score:0.889 (+/-0.003)
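The mean score improves from 0.865 to 0.889.
Now consider the parameters of MultinomialNB. The most important one is alpha, also called the smoothing parameter, whose default value is 1.0. Suppose we set it to 0.1: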
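What a stop list does to the token stream is easy to see in isolation. In this sketch the five-word stop list and the sentence are made up for illustration; the list used for the results above comes from stopwords_en.txt.

```python
from collections import Counter

# Hypothetical stop list, standing in for the contents of stopwords_en.txt
stop_words = {"the", "a", "of", "is", "to"}

tokens = "the goalie is the star of the game".split()
kept = [t for t in tokens if t not in stop_words]

print(Counter(tokens).most_common(1))  # [('the', 3)]
print(kept)                            # ['goalie', 'star', 'game']
```

The most frequent token carries no topical information, and after filtering only the content words remain to be counted.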
clf_6 = Pipeline([('vect',TfidfVectorizer(stop_words= get_stop_words(),token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",)),('clf',MultinomialNB(alpha=0.1)),])
evaluate_cross_validation(clf_6,news.data,news.target,5)
The results are as follows:
[ 0.91405836 0.91589281 0.91085168 0.91721942 0.91509684]
Mean score:0.915 (+/-0.001)
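The mean score improves from 0.889 to 0.915. A natural next step is to test how different alpha values affect the results and then choose the best one.
Model evaluation:
Define a function that trains the model on the whole training set and evaluates its accuracy on both the training set and the test set.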
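To see why alpha matters, it helps to look at the usual additive smoothing estimate, P(word | class) = (count + alpha) / (total + alpha * V), where V is the vocabulary size. The sketch below uses made-up counts for a single class; the function name and the three-word vocabulary are assumptions for illustration only.

```python
def smoothed_likelihoods(word_counts, vocabulary, alpha):
    """Estimate P(word | class) with additive smoothing."""
    total = sum(word_counts.get(w, 0) for w in vocabulary)
    denom = total + alpha * len(vocabulary)
    return {w: (word_counts.get(w, 0) + alpha) / denom for w in vocabulary}

vocabulary = ["puck", "goal", "orbit"]
hockey_counts = {"puck": 6, "goal": 4}  # toy counts; "orbit" was never seen

results = {}
for alpha in (1.0, 0.1):
    results[alpha] = smoothed_likelihoods(hockey_counts, vocabulary, alpha)
    print(alpha, {w: round(p, 3) for w, p in results[alpha].items()})
```

With alpha=1.0 the unseen word "orbit" receives about 0.077 of the probability mass; with alpha=0.1 it receives about 0.01, so the estimates stay closer to the observed counts while still avoiding zero probabilities.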
from sklearn import metrics
def train_and_evaluate(clf,x_train,x_test,y_train,y_test):
    clf.fit(x_train,y_train)
    print("Accuracy on training set:")
    print(clf.score(x_train,y_train))
    print("Accuracy on testing set:")
    print(clf.score(x_test,y_test))
    # predict on the test set so the report compares predictions with the true labels
    y_pred = clf.predict(x_test)
    print("Classification Report:")
    print(metrics.classification_report(y_test,y_pred))
    print("Confusion Matrix:")
    print(metrics.confusion_matrix(y_test,y_pred))
train_and_evaluate(clf_6,x_train,x_test,y_train,y_test)
Results:
Accuracy on training set:
0.98776001132
Accuracy on testing set:
0.909592529711
As shown above, the results are reasonably good: accuracy on the test set is also close to 0.91.
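The classification report and confusion matrix are not reproduced above, so here is a small standard-library sketch of what they contain, computed on made-up labels for three classes. The helper names are assumptions for illustration; the layout follows the common convention of true classes in rows and predicted classes in columns.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def precision_recall(matrix, cls):
    """Precision and recall for one class, read off the confusion matrix."""
    tp = matrix[cls][cls]
    predicted = sum(row[cls] for row in matrix)  # column sum
    actual = sum(matrix[cls])                    # row sum
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

# Toy labels: true classes and a classifier's predictions
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(y_true, y_pred, 3)
accuracy = sum(cm[i][i] for i in range(3)) / len(y_true)
print(cm)        # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
print(accuracy)
print(precision_recall(cm, 1))
```

Accuracy is the share of the diagonal, while per-class precision and recall show which classes are being confused with one another, something a single accuracy figure hides.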