Applying Machine Learning to Sentiment Analysis
Source: Internet | Published by: 编程用的app | Editor: 程序博客网 | Date: 2024/05/18 15:08
1. Obtaining the IMDb movie review dataset
A compressed archive of the movie review dataset is available at http://ai.stanford.edu/~amaas/data/sentiment/
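The post reads a ready-made movie_data.csv but does not show how that file is produced from the archive. The sketch below is one possible way to build it, assuming the archive has been extracted to a folder laid out as aclImdb/{train,test}/{pos,neg}/*.txt. The function name build_review_csv and the fixed shuffle seed are illustrative choices, not part of the original post. Shuffling matters here because the later train/test split simply takes the first 25,000 rows.

```python
import csv
import os
import random


def build_review_csv(root, out_path, seed=0):
    """Collect labeled reviews from an extracted aclImdb tree into one CSV.

    Assumes the layout root/{train,test}/{pos,neg}/*.txt (an assumption
    about the extracted archive, not something shown in the post).
    """
    labels = {'pos': 1, 'neg': 0}
    rows = []
    for split in ('test', 'train'):
        for label_name in ('pos', 'neg'):
            folder = os.path.join(root, split, label_name)
            if not os.path.isdir(folder):
                continue  # tolerate a partially extracted archive
            for name in sorted(os.listdir(folder)):
                path = os.path.join(folder, name)
                with open(path, encoding='utf-8') as f:
                    rows.append((f.read(), labels[label_name]))
    # Shuffle so positive and negative reviews are mixed throughout the file
    random.Random(seed).shuffle(rows)
    with open(out_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['review', 'sentiment'])
        writer.writerows(rows)
    return len(rows)
```

Under these assumptions, calling build_review_csv('./aclImdb', './datasets/movie/movie_data.csv') would write a file with the two columns, review and sentiment, that the loading code expects.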
import pandas as pd

df = pd.read_csv('./datasets/movie/movie_data.csv')
print('Excerpt of the movie dataset', df.head(3))

Output:

('Excerpt of the movie dataset',                                               review  sentiment
0  In 1974, the teenager Martha Moxley (Maggie Gr...          1
1  OK... so... I really like Kris Kristofferson a...          0
2  ***SPOILER*** Do not read this, if you think a...          0)
2. Introducing the bag-of-words model
2.1 Transforming words into feature vectors
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
docs = np.array(['The sun is shining',
                 'The weather is sweet',
                 'The sun is shining and the weather is sweet'])
bag = count.fit_transform(docs)
print('Vocabulary', count.vocabulary_)
print('bag.toarray()', bag.toarray())

Output:

('Vocabulary', {u'and': 0, u'weather': 6, u'sweet': 4, u'sun': 3, u'is': 1, u'the': 5, u'shining': 2})
('bag.toarray()', array([[0, 1, 1, 1, 0, 1, 0],
       [0, 1, 0, 0, 1, 1, 1],
       [1, 2, 1, 1, 1, 2, 1]], dtype=int64))
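To make the count matrix above less of a black box, here is a minimal standard-library sketch that rebuilds it by hand. It assumes the default CountVectorizer behavior of lowercasing the text, keeping tokens of two or more word characters, and indexing the vocabulary alphabetically. The helper names tokenize and to_counts are illustrative only.

```python
import re
from collections import Counter

docs = ['The sun is shining',
        'The weather is sweet',
        'The sun is shining and the weather is sweet']


def tokenize(doc):
    # Lowercase, then keep tokens made of two or more word characters
    return re.findall(r'\b\w\w+\b', doc.lower())


# Map each distinct term to a column index, in alphabetical order
terms = sorted({token for doc in docs for token in tokenize(doc)})
vocabulary = {term: index for index, term in enumerate(terms)}


def to_counts(doc):
    # One row of the bag-of-words matrix: raw term counts per column
    counts = Counter(tokenize(doc))
    return [counts[term] for term in terms]


bag = [to_counts(doc) for doc in docs]
print(vocabulary)
for row in bag:
    print(row)
```

Each row is one document and each column is one vocabulary term, so the 2 entries in the last row reflect that 'is' and 'the' each occur twice in the third sentence.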
2.2 Assessing word relevancy via term frequency-inverse document frequency
from sklearn.feature_extraction.text import TfidfTransformer

np.set_printoptions(precision=2)
tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

Output:

[[ 0.    0.43  0.56  0.56  0.    0.43  0.  ]
 [ 0.    0.43  0.    0.    0.56  0.43  0.56]
 [ 0.4   0.48  0.31  0.31  0.31  0.48  0.31]]
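The numbers above can be reproduced by hand, which helps show what the transformer does. The sketch below assumes the smoothed inverse document frequency used with smooth_idf=True, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. It starts from the count matrix printed in section 2.1; the helper name tfidf_row is illustrative.

```python
import math

# Raw term counts from the bag-of-words example (rows = documents)
counts = [[0, 1, 1, 1, 0, 1, 0],
          [0, 1, 0, 0, 1, 1, 1],
          [1, 2, 1, 1, 1, 2, 1]]

n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents does each term occur?
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = [math.log((1 + n_docs) / (1 + df_j)) + 1 for df_j in doc_freq]


def tfidf_row(row):
    # Weight raw counts by idf, then scale the row to unit L2 length
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted]


tfidf_matrix = [tfidf_row(row) for row in counts]
for row in tfidf_matrix:
    print(['%.2f' % v for v in row])
```

Terms that occur in every document ('is', 'the') get the minimum idf of 1, which is why they score lower than rarer terms with the same count, such as 'sun' and 'shining' in the first row.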
2.3 Cleaning text data
print('Excerpt:\n\n', df.loc[0, 'review'][-50:])

Output:

('Excerpt:\n\n', 'is seven.<br /><br />Title (Brazil): Not Available')
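import re


def preprocessor(text):
    # Strip HTML markup such as <br />
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, replace non-word characters with spaces, then append
    # the emoticons with their "nose" character (-) removed
    text = re.sub(r'[\W]+', ' ', text.lower()) +\
        ' '.join(emoticons).replace('-', '')
    return text


print('Preprocessor on Excerpt:\n\n', preprocessor(df.loc[0, 'review'][-50:]))

Output:

('Preprocessor on Excerpt:\n\n', 'is seven title brazil not available')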
res = preprocessor("</a>This :) is :( a test :-)!")
print('Preprocessor on "</a>This :) is :( a test :-)!":\n\n', res)
df['review'] = df['review'].apply(preprocessor)

Output:

('Preprocessor on "</a>This :) is :( a test :-)!":\n\n', 'this is a test :) :( :)')
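The emoticon pattern is the least obvious part of the preprocessor, so here is a small sketch that isolates it. It reads as: eyes (one of : ; =), an optional nose (-), then a mouth (one of ) ( D P). The sample string is made up for illustration.

```python
import re

# eyes: ':' ';' or '='   nose (optional): '-'   mouth: ')' '(' 'D' or 'P'
EMOTICON_PATTERN = r'(?::|;|=)(?:-)?(?:\)|\(|D|P)'

sample = 'Loved it :) but the ending :-( was odd ;D =P'

# All groups are non-capturing, so findall returns the full matches
found = re.findall(EMOTICON_PATTERN, sample)

# Dropping the nose maps variants such as ':-(' and ':(' to one token
normalized = [e.replace('-', '') for e in found]

print(found)
print(normalized)
```

Extracting the emoticons first matters because the later substitution of non-word characters would otherwise delete them along with ordinary punctuation.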
2.4 Processing documents into tokens
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()


def tokenizer(text):
    # Split on whitespace only
    return text.split()


def tokenizer_porter(text):
    # Split on whitespace, then reduce each word to its Porter stem
    return [porter.stem(word) for word in text.split()]


t1 = tokenizer('runners like running and thus they run')
print("Tokenize: 'runners like running and thus they run'")
print(t1)

t2 = tokenizer_porter('runners like running and thus they run')
print("\nPorter-Tokenize: 'runners like running and thus they run'")
print(t2)

Output:

Tokenize: 'runners like running and thus they run'
['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

Porter-Tokenize: 'runners like running and thus they run'
[u'runner', 'like', u'run', 'and', u'thu', 'they', 'run']
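The grid search in the next section also toggles stop-word removal, which this section does not demonstrate. The sketch below shows the idea on the same sentence. The stop list here is a tiny hand-picked set for illustration only (the NLTK English list is much longer), and remove_stop_words is a hypothetical helper name.

```python
# Hand-picked sample for illustration; NOT the NLTK English stop-word list
stop_sample = {'a', 'and', 'the', 'is', 'they', 'thus'}


def tokenizer(text):
    # Same whitespace tokenizer as in the section above
    return text.split()


def remove_stop_words(tokens, stop_words):
    # Keep only tokens that carry content, preserving their order
    return [token for token in tokens if token not in stop_words]


tokens = tokenizer('runners like running and thus they run')
filtered = remove_stop_words(tokens, stop_sample)
print(filtered)
```

Whether removing stop words helps is an empirical question, which is why the grid search tries both the stop list and None.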
3. Training a logistic regression model for document classification
import nltk
from nltk.corpus import stopwords
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('stopwords')
stop = stopwords.words('english')

# Note: .loc slicing is label-based and inclusive on both ends,
# so row 25000 ends up in both the training and the test split.
X_train = df.loc[:25000, 'review'].values
y_train = df.loc[:25000, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values

# Lowercasing and cleaning were already done by preprocessor() above
tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)

# First grid: tf-idf features; second grid: raw term frequencies
param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf': [False],
               'vect__norm': [None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              ]

# solver='liblinear' supports both l1 and l2; it was the default in older
# scikit-learn releases and must be set explicitly in recent ones.
lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0,
                                                solver='liblinear'))])

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=-1)
gs_lr_tfidf.fit(X_train, y_train)
print('CV Accuracy: %.3f' % gs_lr_tfidf.best_score_)

Output:

CV Accuracy: 0.897
clf = gs_lr_tfidf.best_estimator_
print('Test Accuracy: %.3f' % clf.score(X_test, y_test))

Output:

Test Accuracy: 0.899
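The grid search is slow, and it helps to know why before starting it. The sketch below counts how many parameter combinations the two grids describe and how many cross-validation fits that implies with cv=5. Placeholder strings stand in for the real stop list and tokenizer functions, since only the number of options per parameter matters here; count_candidates is an illustrative helper name.

```python
from itertools import product

# Placeholder strings stand in for the real objects; only the number
# of options per parameter affects the count.
param_grid = [
    {'vect__ngram_range': [(1, 1)],
     'vect__stop_words': ['stop', None],
     'vect__tokenizer': ['tokenizer', 'tokenizer_porter'],
     'clf__penalty': ['l1', 'l2'],
     'clf__C': [1.0, 10.0, 100.0]},
    {'vect__ngram_range': [(1, 1)],
     'vect__stop_words': ['stop', None],
     'vect__tokenizer': ['tokenizer', 'tokenizer_porter'],
     'vect__use_idf': [False],
     'vect__norm': [None],
     'clf__penalty': ['l1', 'l2'],
     'clf__C': [1.0, 10.0, 100.0]},
]


def count_candidates(grid):
    # Each dict is searched exhaustively; the dicts are independent
    return sum(len(list(product(*sub_grid.values()))) for sub_grid in grid)


n_candidates = count_candidates(param_grid)
n_cv_fits = n_candidates * 5  # cv=5 folds per candidate
print(n_candidates, n_cv_fits)
```

Each dictionary contributes 24 combinations, so the search evaluates 48 candidates with 240 cross-validation fits on 25,000 reviews. Half of those also run the Porter stemmer over every document, which is the main reason the search can take a long time.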
Reference: "Python Machine Learning"