
Lexicon Creation
1.1、Off-the-shelf Lexicons

1.1.1、NRC Emotion Lexicon (Mohammad & Turney, 2010): annotated for eight emotions (joy, sadness, anger, fear, disgust, surprise, trust, and anticipation) as well as for positive and negative sentiment.
1.1.2、Bing Liu's Lexicon (Hu & Liu, 2004): provides a list of positive and negative words manually extracted from customer reviews.
1.1.3、MPQA Subjectivity Lexicon (Wilson et al., 2005): contains words marked with their prior polarity (positive or negative) and a discrete strength of evaluative intensity (strong or weak).
Entries in these lexicons do not come with a real-valued score indicating fine-grained evaluative intensity; that is, none of them provides sentiment intensity scores.

1.2、Lexicons Created by the Authors

1.2.1、Hashtag Sentiment Lexicon
Some tweets contain hashtags of the form "#word", where the word after the # indicates the topic or sentiment of the tweet. Tweets can therefore be collected from Twitter by searching for such hashtags, and each collected tweet is labeled with the polarity of its hashtag word. The authors used a seed set of 74 words (30 positive and 43 negative) to collect tweets; the resulting corpus is the Hashtag Sentiment Corpus. This corpus is then used to compute a sentiment score for every term that occurs in it.
The sentiment score for a term w was calculated as:
SentimentScore(w) = PMI(w, positive) - PMI(w, negative)
where
PMI(w, positive) = log2( (freq(w, positive) * N) / (freq(w) * freq(positive)) )

where freq(w, positive) is the number of times the term w occurs in positive tweets, freq(w) is the total frequency of the term w in the corpus, freq(positive) is the total number of tokens in positive tweets, and N is the total number of tokens in the corpus. PMI(w, negative) is calculated in a similar way. Terms that occur fewer than five times in both the positive and the negative tweets are ignored. A positive score indicates positive sentiment, and a negative score indicates negative sentiment.
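
A minimal sketch of this computation, assuming the corpus is available as a list of (tokens, label) pairs with labels "positive"/"negative"; the function name, the add-one smoothing, and the threshold handling are illustrative choices, not details from the paper.

```python
import math
from collections import Counter

def build_sentiment_scores(tweets, min_count=5):
    """tweets: iterable of (tokens, label) with label in {"positive", "negative"}."""
    pos_freq, neg_freq, total_freq = Counter(), Counter(), Counter()
    for tokens, label in tweets:
        for w in tokens:
            total_freq[w] += 1
            (pos_freq if label == "positive" else neg_freq)[w] += 1

    n_pos = sum(pos_freq.values())    # freq(positive): number of tokens in positive tweets
    n_neg = sum(neg_freq.values())    # freq(negative): number of tokens in negative tweets
    n_all = sum(total_freq.values())  # N: number of tokens in the whole corpus

    scores = {}
    for w, f_w in total_freq.items():
        # Ignore terms occurring fewer than five times in both the positive and negative tweets.
        if pos_freq[w] < min_count and neg_freq[w] < min_count:
            continue
        # Add-one smoothing is an assumption here; it avoids log(0) for one-sided terms.
        pmi_pos = math.log2(((pos_freq[w] + 1) * n_all) / (f_w * (n_pos + 1)))
        pmi_neg = math.log2(((neg_freq[w] + 1) * n_all) / (f_w * (n_neg + 1)))
        scores[w] = pmi_pos - pmi_neg  # SentimentScore(w) = PMI(w, positive) - PMI(w, negative)
    return scores
```
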
The final lexicon, which we will refer to as the Hashtag Sentiment Base Lexicon (HS Base), has entries for 39,413 unigrams and 178,851 bigrams. Entries were also generated for unigram-unigram, unigram-bigram, and bigram-bigram pairs that were not necessarily contiguous in the tweets corpus (the two terms of such a pair need not be adjacent in the tweet). Pairs where at least one of the terms is punctuation (e.g., ",", "?", "."), a user mention, a URL, or a function word (e.g., "a", "the", "and") were removed. The lexicon has entries for 308,808 non-contiguous pairs.
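
As a rough illustration of how such pair entries could be enumerated for a single tweet, the sketch below builds unigrams and bigrams with their positions and pairs up non-overlapping units; the helper names (is_punct, is_mention, is_url, keep) and the small stop-word set are assumptions for illustration, not part of the paper.

```python
import re
from itertools import combinations

STOPWORDS = {"a", "the", "and"}                  # example function words from the text

def is_punct(term):
    return all(not ch.isalnum() for ch in term)

def is_mention(term):
    return term.startswith("@")

def is_url(term):
    return bool(re.match(r"https?://", term))

def keep(word):
    # A unit is dropped if any of its words is punctuation, a user mention, a URL, or a function word.
    return not (is_punct(word) or is_mention(word) or is_url(word) or word.lower() in STOPWORDS)

def noncontiguous_pairs(tokens):
    # Units are (start, end, text) triples for unigrams and bigrams.
    unigrams = [(i, i, tokens[i]) for i in range(len(tokens))]
    bigrams = [(i, i + 1, " ".join(tokens[i:i + 2])) for i in range(len(tokens) - 1)]
    units = [u for u in unigrams + bigrams if all(keep(w) for w in u[2].split())]
    pairs = set()
    for a, b in combinations(units, 2):
        first, second = (a, b) if a[0] <= b[0] else (b, a)
        if first[1] < second[0]:                 # non-overlapping; gaps between the units are allowed
            pairs.add((first[2], second[2]))
    return pairs
```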

1.2.2、Sentiment140 Lexicon

The Sentiment140 Corpus (Go et al., 2009) is a collection of 1.6 million tweets that contain emoticons. The tweets are labeled positive or negative according to the emoticon. The Sentiment140 Lexicon is generated from this corpus using the same method as described above.

1.2.3、Affirmative Context and Negated Context Lexicons
A word in a negated context has a different evaluative nature than the same word in an affirmative (non-negated) context. This difference may include a change in polarity (positive becomes negative or vice versa), a change in evaluative intensity, or both. We therefore create separate lexicons for affirmative and negated contexts, so that each word gets two scores: one for affirmative contexts and one for negated contexts.
The Hashtag Sentiment Corpus is split into two parts: the Affirmative Context Corpus and the Negated Context Corpus. We define a negated context as a segment of a tweet that starts with a negation word and ends with one of the punctuation marks ',', '.', ':', ';', '!', '?'. The list of negation words was adopted from Christopher Potts' sentiment tutorial. The part of a tweet that is marked as negated is included in the Negated Context Corpus, while the rest of the tweet becomes part of the Affirmative Context Corpus. The sentiment label of the tweet is kept unchanged in both corpora.
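
A minimal sketch of this split, assuming whitespace-tokenized tweets; the negation word set below is only a small illustrative subset of Christopher Potts' list, and whether the closing punctuation belongs to the negated segment is a guess.

```python
import re

NEGATION_WORDS = {"no", "not", "never", "cannot", "don't", "won't", "isn't", "didn't"}
CONTEXT_END = re.compile(r"[,.:;!?]")             # punctuation marks that end a negated context

def split_contexts(tokens):
    affirmative, negated = [], []
    in_negated = False
    for tok in tokens:
        low = tok.lower()
        if not in_negated and (low in NEGATION_WORDS or low.endswith("n't")):
            in_negated = True                     # a negated context starts with a negation word
        if in_negated:
            negated.append(tok)
            if CONTEXT_END.fullmatch(tok):
                in_negated = False                # ... and ends at the next punctuation mark
        else:
            affirmative.append(tok)
    return affirmative, negated

# Example: "not ... ," is the negated context; everything else is affirmative.
aff, neg = split_contexts("i did not like the plot , but the acting was great".split())
```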

Then, we generate the Affirmative Context Lexicon (HS AffLex) from the Affirmative Context Corpus and the Negated Context Lexicon (HS NegLex) from the Negated Context Corpus, using the technique described above. We will refer to the sentiment score calculated from the Affirmative Context Corpus as scoreAffLex(w), and the score calculated from the Negated Context Corpus as scoreNegLex(w).

Similarly, the Sentiment140 Affirmative Context Lexicon (S140 AffLex) and the Sentiment140 Negated Context Lexicon (S140 NegLex) are built from the Affirmative Context and Negated Context parts of the Sentiment140 tweet corpus.

When computing the sentiment score of a sentence, the sentence is split into affirmative contexts and negated contexts. Terms in affirmative contexts are looked up in the affirmative context lexicon, and terms in negated contexts are looked up in the negated context lexicon.
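
A sketch of this lookup, reusing the split_contexts helper sketched above and assuming hs_aff_lex / hs_neg_lex are plain dicts mapping terms to scores (the variable names are illustrative).

```python
def tweet_term_scores(tokens, hs_aff_lex, hs_neg_lex):
    affirmative, negated = split_contexts(tokens)
    scores = []
    # Terms in affirmative contexts are scored with the affirmative context lexicon ...
    scores += [hs_aff_lex[w] for w in affirmative if w in hs_aff_lex]
    # ... and terms in negated contexts with the negated context lexicon.
    scores += [hs_neg_lex[w] for w in negated if w in hs_neg_lex]
    return scores
```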

1.2.4、Negated Context (Positional) Lexicons
This is a refinement of the way the Negated Context Lexicon described above is built: a negated context is split into two parts, the immediate context consisting of the single token that directly follows a negation word, and the distant context consisting of the rest of the tokens in the negated context. Each word in the resulting negated-context lexicon therefore has two scores: one for when the word directly follows a negation word, and one for when it appears in a negated context but not directly after the negation word.

When applying this lexicon, the negated context of a sentence is likewise split into an immediate context and a distant context. A term in the immediate context is scored with its immediate-context score from the lexicon; if the lexicon has no immediate-context score for that term, its distant-context score is used instead. A term in the distant context is scored with its distant-context score from the lexicon.
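
A sketch of this positional lookup, assuming the negated tokens start with the negation word itself (as in the split sketched earlier) and that immediate_lex / distant_lex are dicts; whether the negation word itself is scored is not specified in the text, so it is skipped here.

```python
def negated_context_scores(negated_tokens, immediate_lex, distant_lex):
    scores = []
    for position, w in enumerate(negated_tokens):
        if position == 0:
            continue                              # the negation word itself (assumption: not scored)
        if position == 1:
            # The single token directly after the negation word: use its immediate-context
            # score if the lexicon has one, otherwise fall back to the distant-context score.
            score = immediate_lex.get(w, distant_lex.get(w))
        else:
            score = distant_lex.get(w)            # the rest of the negated context is "distant"
        if score is not None:
            scores.append(score)
    return scores
```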

Features from the automatically created lexicons (a sketch follows the list):
1、the number of positive terms in the sentence, and the sum of their scores
2、the number of negative terms in the sentence, and the sum of their scores
3、the sum of scores for each part-of-speech (POS) tag
4、the score of the last token
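
A minimal sketch of these features for one lexicon, assuming scores is the ordered list of lexicon scores for the tweet's tokens; the per-POS sums (item 3) are omitted for brevity and the feature names are illustrative.

```python
def lexicon_features(scores):
    positives = [s for s in scores if s > 0]
    negatives = [s for s in scores if s < 0]
    return {
        "pos_count": len(positives),                   # number of positive terms
        "pos_sum": sum(positives),                     # sum of their scores
        "neg_count": len(negatives),                   # number of negative terms
        "neg_sum": sum(negatives),                     # sum of their scores
        "last_score": scores[-1] if scores else 0.0,   # score of the last token
    }
```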

Negation features (a sketch follows the list):
1、the number of negation words in the tweet
2、for the n-gram features, every token that follows a negation word is marked with a negation tag before the n-grams are generated
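
A sketch of the negation tagging before n-gram extraction; the "_NEG" suffix, the small negation-word set, and the punctuation rule are illustrative assumptions.

```python
import re

NEGATION_WORDS = {"no", "not", "never", "don't", "won't", "isn't"}   # illustrative subset
CONTEXT_END = re.compile(r"[,.:;!?]")

def add_negation_tags(tokens):
    tagged, in_negated = [], False
    for tok in tokens:
        if CONTEXT_END.fullmatch(tok):
            in_negated = False                    # punctuation ends the negated span
            tagged.append(tok)
            continue
        tagged.append(tok + "_NEG" if in_negated else tok)
        if tok.lower() in NEGATION_WORDS or tok.lower().endswith("n't"):
            in_negated = True                     # every following token gets the tag
    return tagged

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Example: bigrams are generated from the tagged tokens, e.g. "like_NEG this_NEG".
bigrams = ngrams(add_negation_tags("i do not like this movie .".split()), 2)
```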

Cluster features:
the presence or absence of tokens from each of the 1000 clusters. If a token belongs to a cluster, say cluster 1111, a binary feature for cluster 1111 is added; 1000 clusters are used in total.
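
A sketch of the binary cluster features, assuming word2cluster maps a token to its cluster id (an illustrative dict standing in for the 1000-cluster word clustering).

```python
def cluster_features(tokens, word2cluster):
    present = {word2cluster[w] for w in tokens if w in word2cluster}
    # One binary feature per cluster id that occurs in the tweet.
    return {"cluster_%s" % cid: 1 for cid in present}
```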

Elongated-word feature: the number of words with one character repeated more than two times, for example, "soooo".
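
A one-line check for this feature; the regex simply looks for any character repeated more than twice in a row.

```python
import re

def elongated_count(tokens):
    # Count tokens with some character repeated more than two times, e.g. "soooo".
    return sum(1 for t in tokens if re.search(r"(.)\1\1", t))
```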
