Text Classification for Sentiment Analysis – Naive Bayes Classifier

Source: Internet. Published by: 中国农业银行软件. Editor: 程序博客网. Time: 2024/06/05 17:39

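Sentiment analysis is becoming a popular area of research and social media analysis, especially for user reviews and microblog posts. It is a special case of text mining that generally focuses on identifying positive and negative opinions, and although it is often not very accurate, it can still be useful. For simplicity (and because the training data is easy to obtain), I'll focus on 2 possible sentiment classes: positive and negative.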

NLTK Naive Bayes Classification

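NLTK comes with everything you need to get started on sentiment analysis: a movie review corpus with reviews categorized as pos and neg, and a number of trainable classifiers. We'll start with a simple NaiveBayesClassifier as a baseline, using boolean word feature extraction.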

Bag of Words Feature Extraction

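All NLTK classifiers work with feature structures, which can be simple dictionaries mapping a feature name to a feature value. For text, we'll use a simplified bag of words model in which every word is a feature name with the value True. Here is the feature extraction method: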

def word_feats(words):
    return dict([(word, True) for word in words])
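To make the shape of the feature dictionary concrete, here is a minimal sketch that applies the same function to a hand-written token list. The sample tokens are made up for illustration and do not come from the corpus. Note that repeated words collapse into one key, because only the presence of a word is recorded, not how often it occurs.

```python
def word_feats(words):
    # Map every token to True; repeated tokens collapse into a single key.
    return dict([(word, True) for word in words])

# Hypothetical tokens, not taken from the movie review corpus.
tokens = ['a', 'great', 'great', 'movie']
feats = word_feats(tokens)
print(feats)  # {'a': True, 'great': True, 'movie': True}
```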

Training Set vs Test Set and Accuracy

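The movie review corpus has 1000 positive files and 1000 negative files. We'll use 3/4 of them as the training set and the rest as the test set, which gives us 1500 training instances and 500 test instances. The classifier's training method expects a list of items of the form [(feats, label)], where feats is a feature dictionary and label is the classification label. In our case, feats will be of the form {word: True} and label will be either 'pos' or 'neg'. To evaluate accuracy, we can use nltk.classify.util.accuracy with the test set as the gold standard.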
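The split and the accuracy measure can be sketched without NLTK. The snippet below is only an illustration: split_three_quarters, simple_accuracy, the placeholder featuresets, and the always_pos stand-in classifier are hypothetical names introduced here. The accuracy function simply returns the fraction of test instances whose predicted label matches the gold label, which is the quantity the accuracy figure in this article is assumed to represent.

```python
def split_three_quarters(items):
    # First 3/4 for training, the rest for testing (integer cutoff).
    cutoff = len(items) * 3 // 4
    return items[:cutoff], items[cutoff:]

def simple_accuracy(classify, gold):
    # Fraction of (feats, label) pairs whose predicted label equals the gold label.
    correct = sum(1 for feats, label in gold if classify(feats) == label)
    return correct / len(gold)

# Hypothetical placeholders standing in for the 1000 neg and 1000 pos featuresets.
negfeats = [({'bad': True}, 'neg')] * 1000
posfeats = [({'good': True}, 'pos')] * 1000

neg_train, neg_test = split_three_quarters(negfeats)
pos_train, pos_test = split_three_quarters(posfeats)
trainfeats = neg_train + pos_train
testfeats = neg_test + pos_test
print(len(trainfeats), len(testfeats))  # 1500 500

# A toy classifier that always answers 'pos' is right on half of this test set.
always_pos = lambda feats: 'pos'
print(simple_accuracy(always_pos, testfeats))  # 0.5
```

Taking the first 3/4 of each class separately keeps the training and test sets balanced between the two labels.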

Training and Testing the Naive Bayes Classifier

Here is the complete Python code for training and testing a Naive Bayes classifier on the movie review corpus.

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    # Bag of words: every token becomes a feature with the value True.
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

# One (feats, label) pair per review file.
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

# Integer division so the cutoffs can be used as slice indices.
negcutoff = len(negfeats) * 3 // 4
poscutoff = len(posfeats) * 3 // 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print('train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats)))

classifier = NaiveBayesClassifier.train(trainfeats)
print('accuracy:', nltk.classify.util.accuracy(classifier, testfeats))
classifier.show_most_informative_features()

The output is:

train on 1500 instances, test on 500 instances
accuracy: 0.728
Most Informative Features
         magnificent = True              pos : neg    =     15.0 : 1.0
         outstanding = True              pos : neg    =     13.6 : 1.0
           insulting = True              neg : pos    =     13.0 : 1.0
          vulnerable = True              pos : neg    =     12.3 : 1.0
           ludicrous = True              neg : pos    =     11.8 : 1.0
              avoids = True              pos : neg    =     11.7 : 1.0
         uninvolving = True              neg : pos    =     11.7 : 1.0
          astounding = True              pos : neg    =     10.3 : 1.0
         fascination = True              pos : neg    =     10.3 : 1.0
             idiotic = True              neg : pos    =      9.8 : 1.0

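As you can see, the 10 most informative features are, for the most part, highly descriptive adjectives. The only 2 words that seem a bit odd are "vulnerable" and "avoids". Perhaps these words point to important plot points or plot developments that signal a good movie. Whatever the case, with simple assumptions and very little code we are able to get almost 73% accuracy. This is somewhat close to human accuracy, since people apparently agree on sentiment only about 80% of the time. Later articles in this series will cover precision and recall metrics, alternative classifiers, and techniques for improving accuracy.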
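The ratios in the output compare how likely a feature is under one label versus the other. As a rough sketch, and assuming the classifier smooths its counts by adding 0.5 to each of two outcomes (feature present or absent), a ratio of this kind can be reproduced from document counts as shown below. The counts are invented for illustration and are not read from the corpus; smoothed_prob and likelihood_ratio are hypothetical helper names, not NLTK functions.

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    # Add-gamma estimate of P(feature present | label); the smoothing scheme is an assumption.
    return (count + gamma) / (total + gamma * bins)

def likelihood_ratio(count_a, count_b, total_a, total_b):
    # How many times more likely the feature is under label a than under label b.
    return smoothed_prob(count_a, total_a) / smoothed_prob(count_b, total_b)

# Hypothetical counts: a word seen in 37 of 750 pos and 2 of 750 neg training reviews.
ratio = likelihood_ratio(37, 2, 750, 750)
print('pos : neg = %.1f : 1.0' % ratio)  # pos : neg = 15.0 : 1.0
```

Under these assumptions, a word does not need to be frequent to rank highly; it only needs to be much more common in one class than in the other, which would explain why rare but one-sided words such as "vulnerable" and "avoids" appear in the list.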

Original: http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/

0 0

文本分类之情感分析 – 朴素贝叶斯分类器

NLTK 朴素贝叶斯分类

词袋特征提取

训练集 VS 测试集 和 准确率

训练和测试朴素贝叶斯分类

训练集 VS 测试集和准确率