Sentiment Analysis of iPhone Reviews
Source: Internet  Editor: 程序博客网  Date: 2024/04/30 09:35
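This follows the approach described here: http://www.jianshu.com/p/4cfcf1610a73
First scrape the review data from the web page, ten reviews per page, and save each page as a txt file. Link: http://pan.baidu.com/s/1jINsNlG password: vumv
The approach below relies on existing dictionaries:
Prepare four dictionaries: stop words, negation words, degree adverbs, and sentiment words. Download link: http://pan.baidu.com/s/1i5LYN8h password: tfh8. Other versions can be found online.
Read the dictionaries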
    # Python 2 code: lines read from the files are byte strings and are decoded to unicode
    from collections import defaultdict

    # Stop words
    f = open(r'C:/Users/user/Desktop/stopword.dic')
    stopwords = f.readlines()
    stopwords = [i.replace("\n", "").decode("utf-8") for i in stopwords]

    # (1) Sentiment words: one "word score" pair per line
    f1 = open(r"C:\Users\user\Desktop\BosonNLP_sentiment_score.txt")
    senList = f1.readlines()
    senDict = defaultdict()
    for s in senList:
        s = s.decode("utf-8").replace("\n", "")
        senDict[s.split(' ')[0]] = float(s.split(' ')[1])

    # (2) Negation words
    f2 = open(r"C:\Users\user\Desktop\notDict.txt")
    notList = f2.readlines()
    notList = [x.decode("utf-8").replace("\n", "") for x in notList if x != '']

    # (3) Degree adverbs: one "word,weight" pair per line
    f3 = open(r"C:\Users\user\Desktop\degreeDict.txt")
    degreeList = f3.readlines()
    degreeDict = defaultdict()
    for d in degreeList:
        d = d.decode("utf-8")
        degreeDict[d.split(',')[0]] = float(d.split(',')[1])
Load the data and segment it into words
    import os
    import jieba

    def sent2word(sentence):
        """
        Segment a sentence to words
        Delete stopwords
        """
        segList = jieba.cut(sentence)
        # keep every token that is not a stop word
        newSent = [word for word in segList if word not in stopwords]
        return newSent

    path = u"C:/Users/user/Desktop/comments/"
    listdir = os.listdir(path)
    t = []  # one list of tokens per review
    for i in listdir:
        f = open(path + i).readlines()
        for j in f:
            t.append(sent2word(j))
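Now compute the score. Note three shortcomings of this method: first, a degree adverb or negation word only modifies the sentiment word that follows it; second, it cannot tell when a nominally negative word is actually used as praise; third, longer sentences are more likely to get higher scores, so the score should probably be divided by the total number of words.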
    def class_score(word_lists):
        # Keep only the words found in one of the dictionaries, tagged by type:
        # 1 = sentiment word, 2 = negation word, 3 = degree adverb
        id = []
        word_nake = []
        for i in word_lists:
            if i in senDict:
                id.append(1)
                word_nake.append(i)
            elif i in notList:
                id.append(2)
                word_nake.append(i)
            elif i in degreeDict:
                id.append(3)
                word_nake.append(i)
        score = 0
        w = 1  # weight applied to the next sentiment word
        for i in range(len(id)):
            if id[i] == 1:
                score = score + w * senDict[word_nake[i]]
                w = 1  # reset after each sentiment word
            elif id[i] == 2:
                # negation; note that this overwrites any degree weight collected so far
                w = -1
            elif id[i] == 3:
                w = w * degreeDict[word_nake[i]]
        return score
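    import xlwt

    wb = xlwt.Workbook()
    sheet = wb.add_sheet('score')
    num = 390  # 1-based number of the first review scored in this run
    for i in t[389:]:
        s = class_score(i[:-1])  # i[:-1] drops the last token of each line
        print "Review", num, "score:", s
        sheet.write(num - 1, 0, s)
        num = num + 1
    # xlwt writes the legacy .xls format, so the file is saved with that extension
    wb.save(r'C:/Users/userg/Desktop/result.xls')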
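The third shortcoming mentioned above (longer reviews collect larger scores) can be reduced by dividing by the number of tokens. The following is only a minimal sketch of that idea, not the code used for the results in this post. It assumes the same rule as class_score (a negation or degree word only affects the next sentiment word); the tiny dictionaries and the name normalized_score are made up for illustration.

```python
# Toy dictionaries, made up for illustration; the real ones are loaded from files.
sen_dict = {"good": 2.0, "bad": -2.0, "slow": -1.0}
not_list = ["not"]
degree_dict = {"very": 1.5}


def normalized_score(words):
    """Lexicon score divided by the number of tokens in the review."""
    if not words:
        return 0.0
    score = 0.0
    weight = 1.0
    for word in words:
        if word in sen_dict:
            score += weight * sen_dict[word]
            weight = 1.0  # modifiers only apply to the next sentiment word
        elif word in not_list:
            weight = -1.0  # same rule as class_score
        elif word in degree_dict:
            weight *= degree_dict[word]
    return score / float(len(words))


short_review = ["very", "good"]
long_review = ["very", "good", "phone", "very", "good", "screen"]
print(normalized_score(short_review))  # raw score 3.0 over 2 tokens
print(normalized_score(long_review))   # raw score 6.0 over 6 tokens
```

Whether to divide by all tokens or only by the matched sentiment words is a design choice; dividing by all tokens, as suggested in the text, also penalizes reviews that contain many neutral words.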
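Plotting the sorted scores shows that positive scores dominate and negative ones are few, which matches the ratings on the original page. However, half of the 1-star reviews get a positive score, and a quarter of the 5-star reviews get a negative score. A dictionary-based method depends heavily on the quality of the dictionaries, and the shortcomings listed above can also bias the scores, so next I plan to try word2vec.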
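A comparison like the one above (how often the sign of the lexicon score agrees with the star rating) can be computed along these lines. This is a hedged sketch: the data in rated_scores is made up for illustration and is not the real result set, and positive_share_by_star is a hypothetical helper name.

```python
from collections import defaultdict

# Made-up example data: (star rating, lexicon score) pairs.
rated_scores = [(1, 0.8), (1, -1.2), (5, 3.1), (5, 2.4), (5, -0.3), (5, 1.0)]


def positive_share_by_star(pairs):
    """Return {star: fraction of reviews whose score is above zero}."""
    counts = defaultdict(lambda: [0, 0])  # star -> [positive, total]
    for star, score in pairs:
        counts[star][1] += 1
        if score > 0:
            counts[star][0] += 1
    return {star: pos / float(total) for star, (pos, total) in counts.items()}


shares = positive_share_by_star(rated_scores)
for star in sorted(shares):
    print(star, shares[star])
```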
The words are converted to vectors as follows:
    import logging
    from gensim.models import word2vec

    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

    sentences = word2vec.Text8Corpus("corpus.csv")  # load the corpus
    # Train a skip-gram model (predict the surrounding words from a word).
    # sg=1 is set explicitly because the default is not the same in every gensim version.
    model = word2vec.Word2Vec(sentences, size=400, sg=1)

    # Save the model for reuse
    model.save("corpus.model")
    # Load it back with:
    # model = word2vec.Word2Vec.load("corpus.model")

    # Export to and reload from the binary word2vec format
    model = word2vec.Word2Vec.load("corpus.model")
    model.save_word2vec_format("corpus.model.bin", binary=True)
    model = word2vec.Word2Vec.load_word2vec_format("corpus.model.bin", binary=True)
Load the star ratings
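    stars = open(r"C:\Users\user\Desktop\stars\stars.txt").readlines()
    stars = [int(i.split(".")[0]) for i in stars]

    # Three classes: 1-2 stars -> -1 (negative), 3 stars -> 0 (neutral), 4-5 stars -> 1 (positive)
    y = []
    for i in stars:
        if i == 1 or i == 2:
            y.append(-1)
        elif i == 3:
            y.append(0)
        elif i == 4 or i == 5:
            y.append(1)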
Convert the reviews to word vectors. Two of them failed (reviews 327 and 408), so their labels are deleted.
    import sys
    import numpy as np

    # Python 2 only: make utf-8 the default encoding for implicit str/unicode conversion
    reload(sys)
    sys.setdefaultencoding("utf-8")

    def getWordVecs(wordList):
        vecs = []
        for word in wordList:
            try:
                vecs.append(model[word])
            except KeyError:
                continue  # word is not in the word2vec vocabulary
        return np.array(vecs, dtype='float')

    def buildVecs(lines):
        posInput = []
        for line in lines:
            resultList = getWordVecs(line)
            # for each sentence, the mean vector of all its vectors is used to represent this sentence
            if len(resultList) != 0:
                resultArray = sum(np.array(resultList)) / len(resultList)
                posInput.append(resultArray)
            else:
                pass  # or print the line to inspect it
        return posInput

    X = np.array(buildVecs(t))
    # Reviews 327 and 408 (1-based) failed, so drop their labels.
    # Delete the higher index first so that the first deletion does not shift the second.
    del(y[407])
    del(y[326])
    y = np.array(y)
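Deleting labels by hard-coded index is easy to get wrong, because the indices have to be found by hand and change whenever the data changes. One alternative, sketched here under simplifying assumptions, is to drop a review and its label together while the vectors are being built. The embedding table toy_vectors and the helper names mean_vector and build_aligned are made up for illustration; plain lists stand in for the word2vec vectors.

```python
# Toy 3-dimensional embedding table, made up for illustration.
toy_vectors = {
    "good": [1.0, 0.0, 2.0],
    "phone": [3.0, 2.0, 0.0],
}


def mean_vector(words, vectors):
    """Average the vectors of the known words; None if no word is known."""
    found = [vectors[w] for w in words if w in vectors]
    if not found:
        return None
    dim = len(found[0])
    return [sum(v[k] for v in found) / float(len(found)) for k in range(dim)]


def build_aligned(reviews, labels, vectors):
    """Drop a review and its label together when no vector can be built."""
    X, y = [], []
    for words, label in zip(reviews, labels):
        vec = mean_vector(words, vectors)
        if vec is not None:
            X.append(vec)
            y.append(label)
    return X, y


reviews = [["good", "phone"], ["???"], ["good"]]
labels = [1, 0, 1]
X_demo, y_demo = build_aligned(reviews, labels, toy_vectors)
print(X_demo)
print(y_demo)
```

The second review has no known word, so it is skipped together with its label and the two lists stay the same length.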
Reduce the dimensionality with PCA and classify with an SVM
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    # Plot the PCA spectrum
    pca = PCA(n_components=400)
    pca.fit(X)
    plt.figure(1, figsize=(4, 3))
    plt.clf()
    plt.axes([.2, .2, .7, .7])
    plt.plot(pca.explained_variance_, linewidth=2)
    plt.axis('tight')
    plt.xlabel('n_components')
    plt.ylabel('explained_variance_')

    X_reduced = PCA(n_components=100).fit_transform(X)

    # sklearn.cross_validation was replaced by sklearn.model_selection in scikit-learn 0.18
    from sklearn.cross_validation import train_test_split
    # Note: the split is taken from X (400 dimensions), not from X_reduced
    X_reduced_train, X_reduced_test, y_reduced_train, y_reduced_test = train_test_split(
        X, y, test_size=0.33, random_state=42)

    from sklearn.svm import SVC
    from sklearn import metrics

    # Accuracy
    clf = SVC(C=2, probability=True)
    clf.fit(X_reduced_train, y_reduced_train)
    pred = clf.predict(X_reduced_test)  # predicted class labels
    scores = []
    scores.append(metrics.accuracy_score(y_reduced_test, pred))
    print scores
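The SVM reaches an accuracy of 0.83 (the code computes accuracy_score, not AUC, and the train/test split is taken from the 400-dimensional X rather than from X_reduced). This is about the same as the 0.823 accuracy of the MLP neural network, whose code follows. With word2vec, the result depends on the vocabulary size of the corpus. I printed some of the words that failed: they were not found in the corpus, so part of the information is missing from the vector representation.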
    from keras.models import Sequential
    from keras.layers import Dense, Dropout

    mlp = Sequential()
    mlp.add(Dense(512, input_dim=400, init='uniform', activation='tanh'))
    # Dropout randomly zeroes a fraction of the inputs to the next layer during training
    # only (not at prediction time); it is used to reduce overfitting.
    mlp.add(Dropout(0.7))
    mlp.add(Dense(256, activation='relu'))
    mlp.add(Dropout(0.7))
    mlp.add(Dense(128, activation='relu'))
    mlp.add(Dropout(0.7))
    mlp.add(Dense(64, activation='relu'))
    mlp.add(Dropout(0.7))
    mlp.add(Dense(32, activation='relu'))
    mlp.add(Dropout(0.7))
    mlp.add(Dense(16, activation='relu'))
    mlp.add(Dropout(0.7))
    # Note: y holds the labels -1/0/1 built above, while a sigmoid output with
    # binary_crossentropy expects 0/1 targets.
    mlp.add(Dense(1, activation='sigmoid'))
    mlp.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    mlp.fit(X_reduced_train, y_reduced_train, nb_epoch=20, batch_size=16)
    score = mlp.evaluate(X_reduced_test, y_reduced_test, batch_size=16)
    print ('Test accuracy: ', score[1])
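Since the word2vec results depend on how many review words are present in the corpus vocabulary, it can help to measure the share of missing tokens before training a classifier. The following is a small sketch with made-up data; oov_report is a hypothetical helper, and how the vocabulary is obtained from the trained model depends on the gensim version and is not shown here.

```python
def oov_report(reviews, vocabulary):
    """Return (share of tokens missing from the vocabulary, sorted missing words)."""
    total = 0
    missing_count = 0
    missing = set()
    for words in reviews:
        for word in words:
            total += 1
            if word not in vocabulary:
                missing_count += 1
                missing.add(word)
    rate = missing_count / float(total) if total else 0.0
    return rate, sorted(missing)


# Made-up vocabulary and tokenized reviews for illustration.
vocab = {"good", "phone", "screen"}
demo_reviews = [["good", "phone"], ["retina", "screen"], ["good", "retina"]]
oov_rate, oov_words = oov_report(demo_reviews, vocab)
print(oov_rate)
print(oov_words)
```

A high missing rate suggests that the averaged vectors carry little of the review's content, which may explain part of the classification errors.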