Mining Twitter Data with Python Part 6: Sentiment Analysis Basics
http://www.kdnuggets.com/2016/07/mining-twitter-data-python-part-6.html
Part 6 of this series builds on the previous installments by exploring the basics of sentiment analysis on Twitter data.
By Marco Bonzanini, Independent Data Science Consultant.
Sentiment Analysis is one of the interesting applications of text analytics. Although the term is often associated with sentiment classification of documents, broadly speaking it refers to the use of text analytics approaches applied to the set of problems related to identifying and extracting subjective material in text sources.
This article continues the series on mining Twitter data with Python, describing a simple approach for Sentiment Analysis and applying it to the rugby data set (see Part 4).
A Simple Approach for Sentiment Analysis
The technique we’re discussing in this post has been elaborated from the traditional approach proposed by Peter Turney in his paper Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews. A lot of work has been done in Sentiment Analysis since then, but the approach still has an interesting educational value. In particular, it is intuitive, simple to understand and to test, and most of all unsupervised, so it doesn’t require any labelled data for training.
Firstly, we define the Semantic Orientation (SO) of a word as the difference between its associations with positive and negative words. In practice, we want to calculate “how close” a word is to terms like good and bad. The chosen measure of “closeness” is Pointwise Mutual Information (PMI), calculated as follows (t1 and t2 are terms):
$$\mathrm{PMI}(t_1, t_2) = \log\left(\frac{P(t_1 \wedge t_2)}{P(t_1) \cdot P(t_2)}\right)$$
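To make the measure concrete: PMI compares how often two terms are seen together with how often we would expect to see them together if they were independent. The sketch below is a minimal illustration, not the code used later in the article. The function name pmi_score, the toy probabilities and the choice to return 0.0 for pairs that never co-occur are all assumptions made for this example.

```python
import math


def pmi_score(p_joint, p_t1, p_t2):
    """Pointwise Mutual Information (base 2) of two terms.

    p_joint is the probability of observing both terms together;
    p_t1 and p_t2 are the probabilities of observing each term alone.
    """
    if p_joint == 0:
        # Assumption: a pair that never co-occurs gives no evidence (0.0)
        # instead of minus infinity.
        return 0.0
    return math.log2(p_joint / (p_t1 * p_t2))


# Toy numbers, made up for illustration: 64 tweets,
# 'match' occurs in 16 of them and 'good' in 8.
p_match = 16 / 64
p_good = 8 / 64

# Both terms together in 8 tweets: 4 times more often than chance
print(pmi_score(8 / 64, p_match, p_good))   # 2.0
# Both terms together in 2 tweets: exactly what independence predicts
print(pmi_score(2 / 64, p_match, p_good))   # 0.0
```

A positive score means the two terms attract each other, a score of zero means they look independent, and a negative score means they tend to avoid each other.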
In Turney’s paper, the SO of a word was calculated against excellent and poor, but of course we can extend the vocabulary of positive and negative terms. Using $V^{+}$ for a vocabulary of positive terms and $V^{-}$ for the negative ones, the Semantic Orientation of a term t is hence defined as:
$$\mathrm{SO}(t) = \sum_{t' \in V^{+}} \mathrm{PMI}(t, t') - \sum_{t' \in V^{-}} \mathrm{PMI}(t, t')$$
We can build our own list of positive and negative terms, or we can use one of the many resources available on-line, for example the opinion lexicon by Bing Liu.
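If we go for an external lexicon, loading it is usually a matter of reading one word per line. The following is a rough sketch under stated assumptions: the helper names parse_lexicon and load_lexicon are hypothetical, and the file format (one word per line, comment lines starting with a semicolon, Latin-1 encoding) is assumed here and should be checked against the copy you download.

```python
def parse_lexicon(lines):
    """Return the set of words found in an iterable of lexicon lines.

    Assumed format: one word per line; blank lines and lines
    starting with ';' (comments) are ignored.
    """
    words = set()
    for line in lines:
        line = line.strip()
        if line and not line.startswith(';'):
            words.add(line.lower())
    return words


def load_lexicon(path, encoding='latin-1'):
    # The path and the encoding are assumptions: adjust them to your file.
    with open(path, encoding=encoding) as f:
        return parse_lexicon(f)


# In-memory sample standing in for a real lexicon file
sample = ["; comment header", "", "good", "Great", "love"]
print(sorted(parse_lexicon(sample)))  # ['good', 'great', 'love']
```

With real files, the two vocabularies would then be built with two calls to load_lexicon, one for the positive list and one for the negative list.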
Computing Term Probabilities
In order to compute $P(t)$ (the probability of observing the term t) and $P(t_1 \wedge t_2)$ (the probability of observing the terms t1 and t2 occurring together) we can re-use some previous code to calculate term frequencies and term co-occurrences. Given the set of documents (tweets) D, we define the Document Frequency (DF) of a term as the number of documents where the term occurs. The same definition can be applied to co-occurring terms. Hence, we can define our probabilities as:
$$P(t) = \frac{DF(t)}{|D|}$$
$$P(t_1 \wedge t_2) = \frac{DF(t_1 \wedge t_2)}{|D|}$$
In the previous articles, the document frequency for single terms was stored in the dictionaries count_single and count_stop_single (the latter doesn’t store stop-words), while the document frequency for the co-occurrences was stored in the co-occurrence matrix com.
This is how we can compute the probabilities:
from collections import defaultdict

# n_docs is the total number of tweets
p_t = {}
p_t_com = defaultdict(lambda: defaultdict(int))

for term, n in count_stop_single.items():
    p_t[term] = n / n_docs
    for t2 in com[term]:
        p_t_com[term][t2] = com[term][t2] / n_docs
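The dictionaries used above come from earlier parts of the series, which are not reproduced here. For readers who want to try the idea in isolation, this is a minimal self-contained sketch of how document frequencies and co-occurrence counts could be built from tweets that are already tokenised. The toy tweets and the names doc_freq and co_freq are made up for this example, and storing each co-occurrence in both directions is a design choice of this sketch, not necessarily how the com matrix from the series is laid out.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Toy, already-tokenised tweets (made up for illustration)
tweets = [
    ['great', 'match', '#wal'],
    ['bad', 'match', '#eng'],
    ['great', 'try', '#wal'],
    ['match', 'today'],
]

doc_freq = Counter()                              # term -> n. of tweets containing it
co_freq = defaultdict(lambda: defaultdict(int))   # term -> term -> n. of tweets with both

for tokens in tweets:
    terms = set(tokens)                 # count each term once per tweet
    doc_freq.update(terms)
    for t1, t2 in combinations(sorted(terms), 2):
        co_freq[t1][t2] += 1            # store both directions so that lookups
        co_freq[t2][t1] += 1            # work regardless of term order

n_docs = len(tweets)
p_t = {t: n / n_docs for t, n in doc_freq.items()}
p_t_com = {t1: {t2: n / n_docs for t2, n in row.items()}
           for t1, row in co_freq.items()}

print(p_t['match'])               # 0.75
print(p_t_com['great']['#wal'])   # 0.5
```

Converting each tweet to a set before counting is what turns a term frequency into a document frequency: a term repeated within one tweet still counts once.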
Computing the Semantic Orientation
Given two vocabularies for positive and negative terms:
positive_vocab = [
    'good', 'nice', 'great', 'awesome', 'outstanding',
    'fantastic', 'terrific', ':)', ':-)', 'like', 'love',
    # shall we also include game-specific terms?
    # 'triumph', 'triumphal', 'triumphant', 'victory', etc.
]
negative_vocab = [
    'bad', 'terrible', 'crap', 'useless', 'hate', ':(', ':-(',
    # 'defeat', etc.
]
We can compute the PMI of each pair of terms, and then compute the Semantic Orientation as described above:
import math

pmi = defaultdict(lambda: defaultdict(int))
for t1 in p_t:
    for t2 in com[t1]:
        denom = p_t[t1] * p_t[t2]
        pmi[t1][t2] = math.log2(p_t_com[t1][t2] / denom)

semantic_orientation = {}
for term, n in p_t.items():
    positive_assoc = sum(pmi[term][tx] for tx in positive_vocab)
    negative_assoc = sum(pmi[term][tx] for tx in negative_vocab)
    semantic_orientation[term] = positive_assoc - negative_assoc
The Semantic Orientation of a term will have a positive (negative) value if the term is often associated with terms in the positive (negative) vocabulary. The value will be zero for neutral terms, e.g. when the PMIs for the positive and negative vocabularies balance out or, more likely, when a term is never observed together with any term in the positive/negative vocabularies.
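The three cases (positive, negative and zero) can be checked on a handful of numbers. The sketch below wraps the same sum-and-subtract logic in a hypothetical helper, so_score, and feeds it PMI values that are invented purely for illustration; they are not taken from the rugby data set.

```python
def so_score(term, pmi_table, positive_vocab, negative_vocab):
    """Sum of PMI with positive terms minus sum of PMI with negative terms.

    pmi_table is a dict of dicts; pairs that are missing count as 0,
    i.e. the two terms were never observed together.
    """
    row = pmi_table.get(term, {})
    positive_assoc = sum(row.get(tx, 0.0) for tx in positive_vocab)
    negative_assoc = sum(row.get(tx, 0.0) for tx in negative_vocab)
    return positive_assoc - negative_assoc


# Made-up PMI values, for illustration only
toy_pmi = {
    '#wal':  {'great': 2.0, 'love': 1.0, 'bad': 0.5},
    'ball':  {'bad': 3.0},
    'scrum': {'great': 1.5, 'bad': 1.5},
}
toy_positive = ['great', 'love']
toy_negative = ['bad', 'hate']

for term in ['#wal', 'ball', 'scrum', 'today']:
    print(term, so_score(term, toy_pmi, toy_positive, toy_negative))
# #wal 2.5    (mostly seen in positive company)
# ball -3.0   (only seen with a negative term)
# scrum 0.0   (positive and negative balance out)
# today 0.0   (never seen with any vocabulary term)
```

Note that the last two terms get the same score for very different reasons, which is worth keeping in mind when reading a list of "neutral" terms.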
We can print out the semantic orientation for some terms:
import operator

semantic_sorted = sorted(semantic_orientation.items(),
                         key=operator.itemgetter(1),
                         reverse=True)
top_pos = semantic_sorted[:10]
top_neg = semantic_sorted[-10:]

print(top_pos)
print(top_neg)
print("ITA v WAL: %f" % semantic_orientation['#itavwal'])
print("SCO v IRE: %f" % semantic_orientation['#scovire'])
print("ENG v FRA: %f" % semantic_orientation['#engvfra'])
print("#ITA: %f" % semantic_orientation['#ita'])
print("#FRA: %f" % semantic_orientation['#fra'])
print("#SCO: %f" % semantic_orientation['#sco'])
print("#ENG: %f" % semantic_orientation['#eng'])
print("#WAL: %f" % semantic_orientation['#wal'])
print("#IRE: %f" % semantic_orientation['#ire'])
Different vocabularies will produce different scores. Using the opinion lexicon from Bing Liu, this is what we can observe on the rugby data set:
# the top positive terms
[('fantastic', 91.39950482011552), ('@dai_bach', 90.48767241244532), ('hoping', 80.50247748725415), ('#it', 71.28333427277785), ('days', 67.4394844955977), ('@nigelrefowens', 64.86112716005566), ('afternoon', 64.05064208341855), ('breathtaking', 62.86591435212975), ('#wal', 60.07283361352875), ('annual', 58.95378954406133)]
# the top negative terms
[('#england', -74.83306534609066), ('6', -76.0687215594536), ('#itavwal', -78.4558633116863), ('@rbs_6_nations', -80.89363516601993), ("can't", -81.75379628180468), ('like', -83.9319149443813), ('10', -85.93073078165587), ('italy', -86.94465165178258), ('#engvfra', -113.26188957010228), ('ball', -161.82146824640125)]
# Matches
ITA v WAL: -78.455863
SCO v IRE: -73.487661
ENG v FRA: -113.261890
# Individual team
#ITA: 53.033824
#FRA: 14.099372
#SCO: 4.426723
#ENG: -0.462845
#WAL: 60.072834
#IRE: 19.231722
Some Limitations
The PMI-based approach has been introduced as simple and intuitive, but of course it has some limitations. The semantic scores are calculated on terms, meaning that there is no notion of “entity” or “concept” or “event”. For example, it would be nice to aggregate and normalise all the references to the team names, e.g. #ita, Italy and Italia should all contribute to the semantic orientation of the same entity. Moreover, do the opinions on the individual teams also contribute to the overall opinion on a match?
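One possible way to approach the aggregation problem is to map the different surface forms to a single label before counting anything, so that all the references feed the same document frequency. The sketch below only illustrates that idea: the alias table and the function normalise_entities are hypothetical, and a real table would have to be curated by hand or derived from some external resource.

```python
# Hypothetical alias table: surface form (lower case) -> canonical entity label
ALIASES = {
    '#ita': 'ITALY', 'italy': 'ITALY', 'italia': 'ITALY',
    '#wal': 'WALES', 'wales': 'WALES',
}


def normalise_entities(tokens, aliases=ALIASES):
    """Replace known aliases with their canonical label; keep other tokens as they are."""
    return [aliases.get(tok.lower(), tok) for tok in tokens]


tweet = ['Forza', 'Italia', '#ITA', 'great', 'game']
print(normalise_entities(tweet))
# ['Forza', 'ITALY', 'ITALY', 'great', 'game']
```

Applying this step before computing the document frequencies matters: if each tweet is reduced to a set of terms, a tweet mentioning both Italia and #ITA still counts once for the entity.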
Some aspects of natural language are also not captured by this approach, most notably modifiers and negation: how do we deal with phrases like not bad (this is the opposite of just bad) or very good (this is stronger than just good)?
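As a pointer for further experiments, a common heuristic in the sentiment classification literature is to mark the tokens that follow a negation word, so that bad and a negated bad are counted as two different terms. This is not part of the approach described in this article; the sketch below is a rough illustration, and the list of negation words, the NOT_ prefix and the rule of stopping at the next punctuation token are all assumptions.

```python
import re

# Assumed, deliberately short list of negation words
NEGATIONS = {'not', 'no', 'never', "can't", "don't", "isn't", "won't"}
# A token made only of these punctuation marks ends the negated span
PUNCT = re.compile(r'^[.,:;!?]+$')


def mark_negation(tokens):
    """Prefix with 'NOT_' every token that follows a negation word,
    up to the next punctuation token."""
    marked, negating = [], False
    for tok in tokens:
        if PUNCT.match(tok):
            negating = False
            marked.append(tok)
        elif tok.lower() in NEGATIONS:
            negating = True
            marked.append(tok)
        elif negating:
            marked.append('NOT_' + tok)
        else:
            marked.append(tok)
    return marked


print(mark_negation(['not', 'bad', 'at', 'all', ',', 'very', 'good']))
# ['not', 'NOT_bad', 'NOT_at', 'NOT_all', ',', 'very', 'good']
```

With this pre-processing step, NOT_bad gets its own co-occurrence counts and hence its own semantic orientation. Modifiers such as very are left untouched by this sketch and would need separate treatment.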
Summary
This article has continued the tutorial on mining Twitter data with Python introducing a simple approach for Sentiment Analysis, based on the computation of a semantic orientation score which tells us whether a term is more closely related to a positive or negative vocabulary. The intuition behind this approach is fairly simple, and it can be implemented using Pointwise Mutual Information as a measure of association. The approach has of course some limitations, but it’s a good starting point to get familiar with Sentiment Analysis.
Bio: Marco Bonzanini is a Data Scientist based in London, UK. Active in the PyData community, he enjoys working in text analytics and data mining applications. He's the author of "Mastering Social Media Mining with Python" (Packt Publishing, July 2016).
Original. Reposted with permission.