Mining Twitter Data with Python Part 3: Term Frequencies
http://www.kdnuggets.com/2016/06/mining-twitter-data-python-part-3.html
Part 3 of this 7-part series on mining Twitter data discusses the analysis of term frequencies to extract meaningful terms from tweets.
By Marco Bonzanini, Independent Data Science Consultant.
This is the third part in a series of articles about data mining on Twitter. After collecting data and pre-processing some text, we are ready for some basic analysis. In this article, we’ll discuss the analysis of term frequencies to extract meaningful terms from our tweets.
Counting Terms
Assuming we have collected a list of tweets (see Part 1 of the tutorial), the first exploratory analysis that we can perform is a simple word count. In this way, we can observe which terms are most commonly used in the data set. In this example, I’ll use the set of my own tweets, so the most frequent words should correspond to the topics I discuss (not necessarily, but bear with me for a couple of paragraphs).
We can use a custom tokeniser to split the tweets into a list of terms. The following code uses the preprocess() function described in Part 2 of the tutorial, in order to capture Twitter-specific aspects of the text, such as #hashtags, @-mentions, emoticons and URLs. To keep track of the frequencies while we are processing the tweets, we can use collections.Counter(), which internally is a dictionary (term: count) with some useful methods like most_common():
import operator
import json
from collections import Counter

fname = 'mytweets.json'
with open(fname, 'r') as f:
    count_all = Counter()
    for line in f:
        tweet = json.loads(line)
        # Create a list with all the terms
        terms_all = [term for term in preprocess(tweet['text'])]
        # Update the counter
        count_all.update(terms_all)
    # Print the first 5 most frequent words
    print(count_all.most_common(5))
The above code will produce some unimpressive results:
[(':', 44), ('rt', 26), ('to', 26), ('and', 25), ('on', 22)]
As you can see, the most frequent words (or should I say, tokens), are not exactly meaningful.
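The snippets above rely on the preprocess() function from Part 2 of the tutorial. If you don’t have it handy, a minimal regex-based stand-in (a simplification for illustration, not the full tokeniser from Part 2) could look like this:

```python
import re

# Simplified, hypothetical stand-in for the preprocess() function from
# Part 2: it keeps @-mentions, #hashtags and URLs intact as single
# tokens, and treats leftover punctuation as individual tokens.
tokens_re = re.compile(r"""
    (?:@[\w_]+)                    # @-mentions
  | (?:\#+[\w_]+[\w'_\-]*[\w_]+)   # hashtags
  | (?:https?://\S+)               # URLs
  | (?:[\w_]+)                     # other words
  | (?:\S)                         # anything else (punctuation)
""", re.VERBOSE)

def preprocess(s, lowercase=True):
    tokens = tokens_re.findall(s)
    if lowercase:
        tokens = [t.lower() for t in tokens]
    return tokens

print(preprocess('RT @marcobonzanini: just an example! #NLP'))
# → ['rt', '@marcobonzanini', ':', 'just', 'an', 'example', '!', '#nlp']
```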
Removing stop-words
In every language, some words are particularly common. While their use in the language is crucial, they don’t usually convey a particular meaning, especially if taken out of context. This is the case for articles, conjunctions, some adverbs, etc., which are commonly called stop-words. In the example above, we can see three common stop-words: to, and and on. Stop-word removal is one important step that should be considered during the pre-processing stages. One can build a custom list of stop-words, or use available lists (e.g. NLTK provides a simple list for English stop-words).
Given the nature of our data and our tokenisation, we should also be careful with all the punctuation marks and with terms like RT (used for re-tweets) and via (used to mention the original author of an article or a re-tweet), which are not in the default stop-word list.
from nltk.corpus import stopwords
import string

# The stop-word corpus has to be fetched once with nltk.download('stopwords')
punctuation = list(string.punctuation)
stop = stopwords.words('english') + punctuation + ['rt', 'via']
We can now substitute the variable terms_all in the first example with something like:
terms_stop = [term for term in preprocess(tweet['text']) if term not in stop]
After counting, sorting the terms and printing the top 5, this is the result:
[('python', 11), ('@miguelmalvarez', 9), ('#python', 9), ('data', 8), ('@danielasfregola', 7)]
So apparently I mostly tweet about Python and data, and the users I re-tweet most often are @miguelmalvarez and @danielasfregola. It sounds about right.
More term filters
Besides stop-word removal, we can further customise the list of terms/tokens we are interested in. Here are some examples that you can embed in the first fragment of code:
# Count terms only once, equivalent to Document Frequency
terms_single = set(terms_all)
# Count hashtags only
terms_hash = [term for term in preprocess(tweet['text'])
              if term.startswith('#')]
# Count terms only (no hashtags, no mentions)
terms_only = [term for term in preprocess(tweet['text'])
              if term not in stop and
              not term.startswith(('#', '@'))]
              # mind the ((double brackets)):
              # startswith() takes a tuple (not a list) if
              # we pass several prefixes
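For instance, the terms_single variant can feed a separate counter to obtain document frequencies, i.e. the number of tweets in which each term appears at least once. A minimal sketch (the pre-tokenised tweets here are made up for illustration):

```python
from collections import Counter

# Hypothetical pre-tokenised tweets, for illustration only
tweets_terms = [
    ['python', 'data', 'python'],
    ['python', 'nlp'],
    ['python', 'data'],
]

count_single = Counter()
for terms_all in tweets_terms:
    # set() keeps each term at most once per tweet, so the counter
    # ends up holding document frequencies rather than raw counts
    count_single.update(set(terms_all))

print(count_single.most_common(3))
# → [('python', 3), ('data', 2), ('nlp', 1)]
```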
After counting and sorting, these are my most commonly used hashtags:
[('#python', 9), ('#scala', 6), ('#nosql', 4), ('#bigdata', 3), ('#nlp', 3)]
and these are my most commonly used terms:
[('python', 11), ('data', 8), ('summarisation', 6), ('twitter', 5), ('nice', 5)]
“nice”?
While the other frequent terms represent a clear topic, more often than not simple term frequencies don’t give us a deep explanation of what the text is about. To put things in context, let’s consider sequences of two terms (a.k.a. bigrams).
from nltk import bigrams

terms_bigram = bigrams(terms_stop)
The bigrams() function from NLTK will take a list of tokens and produce a list of tuples of adjacent tokens. Notice that we could use terms_all to compute the bigrams, but we would probably end up with a lot of garbage. In case we decide to analyse longer n-grams (sequences of n tokens), it could make sense to keep the stop-words, just in case we want to capture phrases like “to be or not to be”.
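The counting itself follows the same Counter pattern as before. A minimal sketch, using zip() over adjacent positions, which on a list of tokens yields the same pairs as NLTK's bigrams() (the pre-tokenised tweets are made up for illustration):

```python
from collections import Counter

# Hypothetical pre-tokenised, stop-word-filtered tweets
tweets_terms = [
    ['nice', 'article', 'extractive', 'summarisation'],
    ['nice', 'article', 'python'],
]

count_bigrams = Counter()
for terms_stop in tweets_terms:
    # zip(terms, terms[1:]) pairs each token with its successor,
    # equivalent to nltk.bigrams() when the input is a list
    count_bigrams.update(zip(terms_stop, terms_stop[1:]))

print(count_bigrams.most_common(1))
# → [(('nice', 'article'), 2)]
```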
So after counting and sorting the bigrams, this is the result:
[(('nice', 'article'), 4), (('extractive', 'summarisation'), 4), (('summarisation', 'sentence'), 3), (('short', 'paper'), 3), (('paper', 'extractive'), 2)]
So apparently I tweet about nice articles (I wouldn't bother sharing the boring ones) and extractive summarisation (the topic of my PhD dissertation). This also sounds about right.
Summary
This article has built on top of the previous ones to discuss some basics of extracting interesting terms from a data set of tweets, using simple term frequencies, stop-word removal and n-grams. While these approaches are extremely simple to implement, they are quite useful for getting a bird’s-eye view of the data. We have used some components of NLTK (introduced in a previous article), so we don’t have to re-invent the wheel.
Bio: Marco Bonzanini is a Data Scientist based in London, UK. Active in the PyData community, he enjoys working in text analytics and data mining applications. He's the author of "Mastering Social Media Mining with Python" (Packt Publishing, July 2016).
Original. Reposted with permission.