Feature Selection: plain high-information unigrams compared with sentiment-lexicon vocabulary

Source: the Internet. Editor: 程序博客网. Time: 2024/06/04 19:36

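In my earlier experiments I used the chi-square statistic to extract the highest-information words from the corpus, taking the top 1000, 2000, or 3000.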
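As a rough illustration of chi-square word ranking, here is a minimal sketch in plain Python. It assumes each document is a list of tokens and scores a word by the 2x2 table of (word present or absent) against (pos or neg). The function names and the toy corpus are hypothetical, and this is not necessarily how the scores in my experiments were computed.

```python
from collections import Counter

def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 word/class contingency table.

    n11: pos docs containing the word, n10: neg docs containing it,
    n01: pos docs without it,          n00: neg docs without it.
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom

def top_k_words(pos_docs, neg_docs, k):
    """Rank words by chi-square against the pos/neg label, return the top k."""
    # Document frequency per class: count each word once per document
    pos_df = Counter(w for doc in pos_docs for w in set(doc))
    neg_df = Counter(w for doc in neg_docs for w in set(doc))
    scores = {}
    for w in set(pos_df) | set(neg_df):
        n11, n10 = pos_df[w], neg_df[w]
        scores[w] = chi_square(n11, n10, len(pos_docs) - n11, len(neg_docs) - n10)
    # Highest score first; ties broken alphabetically for a stable result
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]

# Toy corpus (hypothetical data, for illustration only)
pos_docs = [["great", "movie"], ["great", "plot"], ["good", "movie"]]
neg_docs = [["bad", "movie"], ["bad", "plot"], ["awful", "movie"]]
print(top_k_words(pos_docs, neg_docs, 2))
```

Words that appear equally often in both classes, such as "movie" in the toy corpus, score 0 and fall to the bottom of the ranking.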

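I then found that most algorithms need a certain number of features: within a range (say 1000 to 3000), the more words extracted, the better the classifier performs. Past a certain peak, performance drops.

Only KNN behaved differently: the more features it was given, the worse it tended to perform. Its performance was not very good to begin with, so I did not pay it further attention.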

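Later I wanted to improve performance and decided to start with the existing features. My supervisor suggested that I review the extracted unigrams to see what they actually contain. A quick look showed that many of them are nouns with no sentiment coloring. Such words seem unhelpful for sentiment analysis and may even hurt it, so I wanted to try purifying the features. My supervisor recommended using a sentiment lexicon as a reference.

So I applied a filter: a selected feature is kept if it appears in the sentiment lexicon and discarded otherwise.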

////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

# Keep only the extracted features that also appear in the sentiment lexicon.

def read_words(path):
    """Return the stripped, non-empty lines of a one-word-per-line file."""
    with open(path, "r") as f:
        return [line.strip() for line in f if line.strip()]

# Extracted features (best_words); it helps to extract generously, e.g. 8000
my_features = read_words("my_features.txt")

# Sentiment lexicon, stored as a set so membership tests are fast
lexicon = set(read_words("lexicon.txt"))

# Keep a feature only if the lexicon contains it, preserving the original order
shared_words = [w for w in my_features if w in lexicon]

# Write the result to a file (1662 shared features in this example)
with open("shared_words.txt", "w") as f:
    for word in shared_words:
        f.write(word + "\n")

////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

After this filtering I ran the data through Weka again, and the results are interesting.

Cross-validation (10), top 1662 features

Classifier                               Accuracy
NaiveBayesMultinomial                    81.21%
NaiveBayes                               67.80%
IBK (KNN)                                65.92%
J48 (Decision Tree)                      64.34%
Logistic                                 79.71%
SMO (SVM)                                79.39%
MultilayerPerceptron (Neural Network)    -
Voting (NB+NBM+J48)                      -
Voting (NBM+Logistic+SMO)                -

Cross-validation (10), 1662 features shared with the 6800-word sentiment dictionary

Classifier                               Accuracy
NaiveBayesMultinomial                    71.82%
NaiveBayes                               70.06%
IBK (KNN)                                68.31%
J48 (Decision Tree)                      59.35%
Logistic                                 72.64%
SMO (SVM)                                71.74%
MultilayerPerceptron (Neural Network)    -
Voting (NB+NBM+J48)                      71.58%

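The overall accuracy shows that most classifiers got worse. NB and KNN improved a little (by a bit over 2 percentage points each), but not by much.

A closer look, however: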

Top 1662 features, NBM

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
               0.862    0.238    0.784      0.862   0.821      0.627  0.900     0.899     pos
               0.762    0.138    0.847      0.762   0.802      0.627  0.900     0.898     neg
Weighted Avg.  0.812    0.188    0.815      0.812   0.812      0.627  0.900     0.898

=== Confusion Matrix ===

    a    b   <-- classified as
 4309  691 |    a = pos
 1188 3812 |    b = neg

1662 features shared with sentiment dict, NBM

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
               0.895    0.459    0.661      0.895   0.761      0.467  0.818     0.800     pos
               0.541    0.105    0.838      0.541   0.658      0.467  0.818     0.798     neg
Weighted Avg.  0.718    0.282    0.750      0.718   0.709      0.467  0.818     0.799

=== Confusion Matrix ===

    a    b   <-- classified as
 4477  523 |    a = pos
 2295 2705 |    b = neg

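With the purified features, the share of pos instances that NBM classifies correctly rises by about 3 percentage points over the plain top 1662 unigrams (0.862 to 0.895). The error rate on the neg class becomes very high, though (neg recall falls from 0.762 to 0.541), and that pulls overall performance down.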
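The per-class figures can be checked directly from the two NBM confusion matrices. The sketch below recomputes per-class recall and overall accuracy; the helper names are hypothetical, and the matrices are copied from the Weka output with rows and columns ordered (pos, neg).

```python
def per_class_recall(matrix):
    """matrix[i][j] = number of class-i instances classified as class j."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]

def accuracy(matrix):
    """Fraction of all instances that lie on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Rows are the true class, columns the predicted class, both ordered (pos, neg)
nbm_top1662 = [[4309, 691], [1188, 3812]]
nbm_shared = [[4477, 523], [2295, 2705]]

for name, m in [("top 1662", nbm_top1662), ("lexicon-filtered", nbm_shared)]:
    pos_recall, neg_recall = per_class_recall(m)
    print(f"{name}: pos={pos_recall:.3f} neg={neg_recall:.3f} acc={accuracy(m):.4f}")
```

The pos recall gain (4309 to 4477 correct out of 5000) is far smaller than the neg recall loss (3812 to 2705 correct out of 5000), which accounts for the drop in overall accuracy.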

Now look at SMO (SVM):

Top 1662 features, SMO

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
               0.777    0.189    0.804      0.777   0.790      0.588  0.794     0.736     pos
               0.811    0.223    0.784      0.811   0.797      0.588  0.794     0.731     neg
Weighted Avg.  0.794    0.206    0.794      0.794   0.794      0.588  0.794     0.733

=== Confusion Matrix ===

    a    b   <-- classified as
 3885 1115 |    a = pos
  946 4054 |    b = neg

1662 features shared with sentiment dict, SMO

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
               0.594    0.159    0.789      0.594   0.678      0.449  0.717     0.671     pos
               0.841    0.406    0.674      0.841   0.748      0.449  0.717     0.647     neg
Weighted Avg.  0.717    0.283    0.732      0.717   0.713      0.449  0.717     0.659

=== Confusion Matrix ===

    a    b   <-- classified as
 2970 2030 |    a = pos
  796 4204 |    b = neg

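With the purified features, SMO classifies 84% of the neg class correctly, about 3 percentage points above the 81% for the plain top 1662 words. The pos class fares badly, however (recall falls from 0.777 to 0.594), so overall performance drops.

The other algorithms show a similar pattern: predictions lean more heavily toward one class.

How to select features probably still needs careful thought.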
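One simple way to put a number on this lean is the share of all instances that a classifier labels pos, which would be 0.5 for an unbiased classifier on this balanced 5000/5000 data set. This measure is only an illustration and not something Weka reports directly; the helper name is hypothetical, and the matrices are the four confusion matrices shown earlier.

```python
def predicted_pos_share(matrix):
    """Fraction of all instances labelled pos (the first column)."""
    total = sum(sum(row) for row in matrix)
    return sum(row[0] for row in matrix) / total

# Rows are the true class, columns the predicted class, both ordered (pos, neg)
matrices = {
    "NBM, top 1662": [[4309, 691], [1188, 3812]],
    "NBM, lexicon-filtered": [[4477, 523], [2295, 2705]],
    "SMO, top 1662": [[3885, 1115], [946, 4054]],
    "SMO, lexicon-filtered": [[2970, 2030], [796, 4204]],
}

for name, m in matrices.items():
    print(f"{name}: predicted pos share = {predicted_pos_share(m):.4f}")
```

With the lexicon-filtered features NBM moves further toward pos and SMO further toward neg, in both cases away from the 0.5 balance of the data.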

To be continued...

0 0