Sentiment Classification: Chinese Corpus


title: Sentiment Classification: Chinese Corpus

date: 2017-03-04

tags: NLTK

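After several days of trial and error, I finally got classification of a Chinese corpus working with NLTK this morning. This post records the whole process.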

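Chinese Corpus

I used the hotel review corpus compiled by Tan Songbo. It comes in four versions: 2000 (balanced), 4000 (balanced), 6000 (balanced), and 10000 (unbalanced). The corpus is laid out as follows: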

-ChnSentiCorp_htl_ba_2000
 |-neg
   |-neg.0.txt ~ neg.999.txt
 |-pos
   |-pos.0.txt ~ pos.999.txt

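Encoding and Chinese Word Segmentation

The corpus is encoded in GB2312. To make it easier to handle in Python and NLTK later on, I converted it to UTF-8 with a small GB2312<->UTF-8 conversion tool. Once every file was converted, the next step was Chinese word segmentation, for which I used jieba.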

    import glob
    import jieba

    src_pattern = r"C:\Users\rumusan\Desktop\ChnSentiCorp_htl_ba_2000\pos\*.txt"
    for i, path in enumerate(glob.glob(src_pattern)):
        with open(path, "r", encoding="utf-8") as f1:
            # The original files contain many blank lines, so drop all newlines.
            text = f1.read().replace("\n", "")
        # Segment the text and join the tokens with spaces.
        seg_text = " ".join(jieba.cut(text))
        # print(seg_text)  # show the segmentation result
        # Write the segmented text to a numbered output file.
        with open(r"C:\Users\rumusan\Desktop\2\%d.txt" % i, "w", encoding="utf-8") as f2:
            f2.write(seg_text)
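The GB2312 to UTF-8 conversion mentioned earlier was done with an external tool that is not shown here. As a rough alternative, the same step could be scripted in Python. The sketch below is only an illustration: the function name `convert_to_utf8` and its parameters are hypothetical, and it assumes every `.txt` file in the source folder decodes cleanly with the given encoding.

```python
from pathlib import Path


def convert_to_utf8(src_dir, dst_dir, src_encoding="gb2312"):
    """Re-encode every .txt file in src_dir as UTF-8 under dst_dir.

    Hypothetical helper, not the tool used in this post. Returns the
    number of files written.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    converted = 0
    for src in sorted(src_dir.glob("*.txt")):
        # Decoding is strict, so a file with bytes outside the source
        # encoding raises UnicodeDecodeError instead of being silently damaged.
        text = src.read_text(encoding=src_encoding)
        (dst_dir / src.name).write_text(text, encoding="utf-8")
        converted += 1
    return converted
```

If some files turn out to contain characters outside GB2312, passing `src_encoding="gb18030"` (a superset of GB2312) may be worth trying.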

I converted and segmented the 1000 files in each of neg and pos, then stored the results in the following layout:

-hotel_reviews
 |-neg
   |-0.txt ~ 999.txt
 |-pos
   |-0.txt ~ 999.txt

The segmentation results are quite good:

硬件 设施 太旧 , 和 房价 不 相符 , 价格 还是 贵 了……

Loading Your Own Corpus

Now that the text has been preprocessed, it can be used as a corpus. NLTK offers two ways to load your own corpus.

The first:

    from nltk.corpus import PlaintextCorpusReader

    corpus_root = r"C:\Users\rumusan\Desktop\hotel_reviews"
    hotel_reviews = PlaintextCorpusReader(corpus_root, '.*')
    hotel_reviews.fileids()  # list all files

The second:

    from nltk.corpus import BracketParseCorpusReader

    corpus_root = r"C:\Users\rumusan\Desktop\hotel_reviews"
    file_pattern = r".*/.*\.txt"
    hotel_reviews = BracketParseCorpusReader(corpus_root, file_pattern)
    hotel_reviews.fileids()  # list all files

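I went with the first approach. To make processing easier, the corpus is restructured into the same form used in the "Sentiment Analysis: example" post, a list of (word list, label) pairs: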

    import nltk
    import random
    # Load our own corpus
    from nltk.corpus import PlaintextCorpusReader

    # Paths
    # Whole corpus (loaded again because a later step processes the entire corpus)
    corpus_root_reviews = r"C:\Users\rumusan\Desktop\hotel_reviews"
    corpus_root_neg = r"C:\Users\rumusan\Desktop\hotel_reviews\neg"  # neg
    corpus_root_pos = r"C:\Users\rumusan\Desktop\hotel_reviews\pos"  # pos

    # Load
    reviews = PlaintextCorpusReader(corpus_root_reviews, '.*')  # whole corpus
    neg = PlaintextCorpusReader(corpus_root_neg, '.*')  # neg
    pos = PlaintextCorpusReader(corpus_root_pos, '.*')  # pos

    documents_neg = [(list(neg.words(fileid)), 0)  # label 0
                     for fileid in neg.fileids()]
    documents_pos = [(list(pos.words(fileid)), 1)  # label 1
                     for fileid in pos.fileids()]

    # Combine documents_neg and documents_pos into a single list named documents
    documents_neg.extend(documents_pos)
    documents = documents_neg
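To make the resulting structure concrete, here is a tiny stand-alone sketch that builds the same kind of list without touching the disk. The review strings are made up for illustration, `label_documents` is a hypothetical helper, and `str.split` stands in for the corpus reader's tokenization.

```python
# Hypothetical, already-segmented reviews standing in for the files on disk.
neg_texts = ["硬件 设施 太旧 , 价格 还是 贵 了", "服务 很 差"]
pos_texts = ["房间 很 干净", "位置 不错 , 服务 也 好"]


def label_documents(texts, label):
    """Turn space-separated reviews into (word list, label) pairs."""
    return [(text.split(), label) for text in texts]


# Negative reviews get label 0, positive reviews get label 1.
documents = label_documents(neg_texts, 0) + label_documents(pos_texts, 1)
print(documents[0])
```

Each entry pairs one review's token list with its label, which is the shape the classification step expects.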

Classification

From here on, the process is largely the same as classifying an English corpus:

    random.shuffle(documents)

    # Use the 3000 most frequent words in the whole corpus as features.
    all_words = nltk.FreqDist(w for w in reviews.words())
    word_features = [word for (word, freq) in all_words.most_common(3000)]

    def document_features(document):
        # Map each feature word to whether it occurs in the document.
        document_words = set(document)
        features = {}
        for word in word_features:
            features[word] = (word in document_words)
        return features

    featuresets = [(document_features(d), c) for (d, c) in documents]
    # The first 500 documents are the test set; the rest are the training set.
    train_set, test_set = featuresets[500:], featuresets[:500]
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    print(nltk.classify.accuracy(classifier, test_set))

    # Which features did the classifier find most informative?
    classifier.show_most_informative_features(10)
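The feature extraction step can be seen in isolation on a toy example. The sketch below is only an illustration: it uses `collections.Counter` in place of `nltk.FreqDist`, three made-up documents, and a vocabulary of two words instead of 3000. It also passes the vocabulary in as a parameter rather than reading a global.

```python
from collections import Counter

# Made-up (word list, label) pairs; 1 = positive, 0 = negative.
documents = [
    (["房间", "很", "干净"], 1),
    (["服务", "很", "差"], 0),
    (["房间", "太旧"], 0),
]

# Count every word across all documents, then keep the most frequent ones.
word_counts = Counter(w for words, _ in documents for w in words)
word_features = [w for w, _ in word_counts.most_common(2)]


def document_features(document, word_features):
    """Map each vocabulary word to True/False: does it occur in the document?"""
    document_words = set(document)
    return {word: (word in document_words) for word in word_features}


features = document_features(["房间", "太旧"], word_features)
print(features)
```

Words outside the vocabulary, such as "太旧" here, are simply ignored, so the size of the vocabulary directly limits what the classifier can see.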

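Next Steps

The overall pipeline now works. The next step is to refine the details, such as the features, the choice of classifier, and how the training and test sets are split.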
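As one possible refinement of the train/test split, the single fixed split could be replaced by k-fold cross-validation. This is not something done in this post. The sketch below is a minimal illustration in plain Python, and `k_fold_splits` is a hypothetical helper name.

```python
def k_fold_splits(items, k):
    """Yield (train, test) pairs, using each of the k folds as the test set once."""
    if not 2 <= k <= len(items):
        raise ValueError("k must be between 2 and len(items)")
    # Fold i takes every k-th item starting at index i.
    folds = [items[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        # The training set is every item that is not in the current test fold.
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# Example with ten placeholder items and five folds.
for train, test in k_fold_splits(list(range(10)), 5):
    print(len(train), len(test))
```

Applied to the feature sets above, each `train` list would be passed to `nltk.NaiveBayesClassifier.train` and each `test` list to `nltk.classify.accuracy`, and the accuracies averaged over the folds.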

0 0