[Text_Mining]notes_3
Classification
Given a set of classes
Classification:Assign the correct class label to the given input
Examples of Text Classification:
Topic identification
Spam Detection
Sentiment analysis
Spelling correction
Supervised learning
Supervised Classification
Learn a classification model on properties ('features') and their importance ('weights') from labeled instances.
Apply the model on new instances to predict the label
Supervised classification:Phases and Datasets
Classification paradigms
Binary Classification : when there are only two possible classes.
Multi-class Classification : when there are more than two possible classes.
Multi-label Classification : when data instances can have two or more labels.
Questions to ask in Supervised Learning
Training phase:
What are the features? How do you represent them?
What is the classification model/algorithm?
What are the model parameters?
Inference phase:
What is the expected performance? What is a good measure?
Why is textual data unique?
Textual data presents a unique set of challenges
All the information you need is in the text
But features can be pulled out from text at different granularities
Types of textual features
Words:
By far the most common class of features
Handling commonly-occurring words: Stop words
Normalization: Make lower case vs. Leave as-is
Stemming/Lemmatization
Characteristics of words: Capitalization
Parts of speech of words in a sentence
Grammatical structure, sentence parsing
Grouping words of similar meaning, semantics
Depending on the classification task, features may come from inside words as well as from word sequences; a small feature-extraction sketch follows.
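The sketch below is not from the notes: the sentence is made up, and it assumes NLTK's punkt and averaged_perceptron_tagger resources are already downloaded. It only illustrates a few of the granularities listed above.
import nltk
from nltk.stem import PorterStemmer
sentence = "The Pythons were running FAST"              # hypothetical example sentence
tokens = nltk.word_tokenize(sentence)                   # word features
lowered = [t.lower() for t in tokens]                    # normalization: make lower case
stems = [PorterStemmer().stem(t) for t in lowered]       # stemming: 'running' -> 'run'
pos_tags = nltk.pos_tag(tokens)                          # parts of speech of words in the sentence
print(tokens, lowered, stems, pos_tags, sep='\n')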
Naive Bayes Classifiers
Case study: Classifying text search queries
Probabilistic model
Update the likelihood of the class given new information
Prior Probability: Pr(y = Entertainment), Pr(y = CS), Pr(y = Zoology)
When I have new information:
Posterior probability: Pr(y = Entertainment|x = ‘Python’)
Bayes’ Rule:
Posterior probability = (Prior probability × Likelihood) / Evidence, i.e. Pr(y|X) = Pr(y) × Pr(X|y) / Pr(X)
y* = argmax_y Pr(y|X) = argmax_y Pr(y) × Pr(X|y)  (the evidence Pr(X) does not depend on y, so it can be dropped from the argmax)
Naive assumption: given the class label, features are assumed to be independent of each other.
Example:
Query: ‘Python download’
y* = argmax_y Pr(y) × Pr('Python'|y) × Pr('download'|y)
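A toy numeric version of this argmax, with made-up priors and likelihoods (none of these numbers come from the notes):
priors = {'Entertainment': 0.5, 'CS': 0.3, 'Zoology': 0.2}       # Pr(y), hypothetical
likelihood = {                                                    # Pr(word | y), hypothetical
    'Entertainment': {'python': 0.01, 'download': 0.10},
    'CS':            {'python': 0.30, 'download': 0.40},
    'Zoology':       {'python': 0.05, 'download': 0.01},
}
query = ['python', 'download']
scores = {}
for y in priors:
    score = priors[y]
    for word in query:
        score *= likelihood[y][word]      # naive independence assumption
    scores[y] = score
y_star = max(scores, key=scores.get)      # argmax_y Pr(y) * prod_i Pr(x_i | y)
print(scores, y_star)                     # 'CS' wins for this choice of numbers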
Naive Bayes: What are the parameters?
Prior probabilities: Pr(y) for all y in Y
Likelihood: Pr(xi|y) for all features xi and labels y in Y
Q: You are training a naive Bayes classifier where the number of possible labels is |Y| = 3 and the dimension of each data element is |x| = 100, with every feature (dimension) binary. How many parameters does the naive Bayes classification model have? 603 (3 priors Pr(y) plus 3 × 100 × 2 = 600 likelihoods Pr(xi = 0|y) and Pr(xi = 1|y)).
Naive Bayes: Learning parameters
Prior probabilities: Pr(y) for all y in Y
-Remember the training data?
-Count the number of instances in each class
-If there are N instances in all, and n of those are labeled as class y ---> Pr(y) = n/N
Likelihood: Pr(xi|y) for all features xi and labels y in Y
-Count how many times feature xi appears in instances labeled as class y
-If there are p instances of class y, and xi appears in k of those, Pr(xi|y) = k/p
Naive Bayes: Smoothing
What happens if Pr(xi|y) = 0?
-Feature xi never occurs in documents labeled y
-But then, the posterior probability Pr(y|xi) will be 0!
Instead, smooth the parameters
Laplace smoothing or Additive smoothing: Add a dummy count
- Pr(xi|y) = (k+1)/(p+n), where n is the number of features
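A minimal sketch of estimating these parameters with Laplace smoothing, on a made-up three-document corpus (the documents, labels, and helper names are assumptions, not from the notes):
from collections import Counter, defaultdict
docs = [(['python', 'download'], 'CS'),
        (['python', 'snake'],    'Zoology'),
        (['python', 'code'],     'CS')]
N = len(docs)
class_counts = Counter(label for _, label in docs)
priors = {y: count / N for y, count in class_counts.items()}    # Pr(y) = n / N
feature_counts = defaultdict(Counter)                           # feature_counts[y][xi] = k
for words, y in docs:
    feature_counts[y].update(set(words))
n_features = len({w for words, _ in docs for w in words})
def likelihood(xi, y):
    # Laplace smoothing: Pr(xi | y) = (k + 1) / (p + n), n = number of features
    k = feature_counts[y][xi]
    p = class_counts[y]
    return (k + 1) / (p + n_features)
print(priors, likelihood('snake', 'CS'))    # smoothed, so never exactly 0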
Take Home Concept
Naive Bayes is a probabilistic model
Naive, because it assumes features are independent of each other, given the class label.
For text classification problems, naive Bayes models typically provide very strong baselines.
Simple model, easy to learn parameters
Two Classic Naive Bayes Variants for Text
Multinomial Naive Bayes
-Data follows a multinomial distribution
-Each feature value is a count (word occurrence counts, TF-IDF weights, ...)
Bernoulli Naive Bayes
-Data follows a multivariate Bernoulli distribution
-Each feature is binary (word is present/absent)
-It does not matter how many times that word was present (a short sklearn sketch contrasting the two variants follows)
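A short scikit-learn sketch of the two variants on a made-up four-review corpus (texts and labels are assumptions):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
texts = ['great great phone', 'terrible battery', 'great screen', 'terrible terrible']
labels = [1, 0, 1, 0]
counts = CountVectorizer().fit_transform(texts)              # multinomial: how often a word occurs matters
binary = CountVectorizer(binary=True).fit_transform(texts)   # Bernoulli: only presence/absence matters
MultinomialNB().fit(counts, labels)
BernoulliNB().fit(binary, labels)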
Case study: Sentiment analysis
Words that you might find in typical reviews
Classifier = Function on input data
Decision Boundaries
Classification function is represented by decision surfaces
Choosing a Decision Boundary
Data overfitting: a decision boundary learned over the training data doesn't generalize to test data.
Linear Boundaries
-Easy to find
-Easy to evaluate
-More generalizable: ‘Occam’s razor’
Finding a Linear Boundary
-Find the linear boundary = Find w or the slope of the line
Many methods
-Perceptron
-Linear Discriminant Analysis
-Linear least squares
Problem: if the data are linearly separable, there are infinitely many linear boundaries.
What is a reasonable boundary? Maximum margin
Support Vector Machines are maximum-margin classifiers
Support Vector Machine(SVM)
SVMs are linear classifiers that find a hyperplane to separate two classes of data: positive and negative.
They use optimization techniques to find the maximum-margin boundary.
SVM: Multi-class classification
SVMs themselves handle only binary classification problems; multi-class problems are handled by combining binary SVMs (see the sketch after this list):
One vs Rest
-an n-class problem trains n binary classifiers
One vs One
-an n-class problem trains C(n,2) = n(n-1)/2 binary classifiers (e.g. 4 classes → 6 classifiers)
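A minimal sketch of the two strategies using scikit-learn's wrappers around a binary linear SVM (the tiny numeric dataset is made up):
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import LinearSVC
X = [[0, 1], [1, 0], [2, 2], [3, 1], [0, 3], [4, 0]]    # toy 3-class data
y = [0, 0, 1, 1, 2, 2]
ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)
ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)
print(len(ovr.estimators_), len(ovo.estimators_))       # n = 3 and C(3, 2) = 3 classifiers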
SVM Parameters (I): Parameter C
Regularization: how much importance should you give individual data points compared to a better generalized model
Regularization parameter C
-Larger values of C = less regularization
-Fit the training data as well as possible; every individual data point is important
-Smaller values of C = more regularization
-More tolerant to errors on individual data points
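One common way to choose C is cross-validation; a sketch is below (train_data and train_labels are the placeholder names used elsewhere in these notes, so the fit call is left commented out):
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}              # larger C = less regularization
search = GridSearchCV(SVC(kernel='linear'), param_grid, cv=5)
# search.fit(train_data, train_labels)
# print(search.best_params_, search.best_score_)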
SVM Parameters (II): Other parameters
Linear kernels usually work best for text data
-Other kernels include rbf, polynomial
multi_class: ovr (one-vs-rest)
class_weight: different classes can get different weights
-For example, spam detection has a skewed class distribution: spam might be around 80% of the e-mails somebody gets and the other class only 20%, so you would want to give different weights to these two classes (a short sketch follows).
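A brief sketch of weighting classes in an SVM for such a skewed distribution (the specific weights shown are illustrative only):
from sklearn.svm import SVC
# Give the minority class a higher weight by hand...
clf = SVC(kernel='linear', class_weight={0: 1, 1: 4})
# ...or let scikit-learn choose weights inversely proportional to class frequencies.
clf_balanced = SVC(kernel='linear', class_weight='balanced')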
Take Home Messages
-Support Vector Machines tend to be among the most accurate classifiers, especially on high-dimensional data.
-Strong theoretical foundation
-Handles only numeric features
-Convert categorical features to numeric features
-Normalization
-Hyperplane hard to interpret
Toolkits for Supervised Text Classification
-Scikit-learn
-NLTK
-Interfaces with sklearn and other ML toolkits (like Weka)!
Using Sklearn's naive Bayes classifier
from sklearn import naive_bayes, metrics
clfrNB = naive_bayes.MultinomialNB()
clfrNB.fit(train_data, train_labels)
predicted_labels = clfrNB.predict(test_data)
metrics.f1_score(test_labels, predicted_labels, average='micro')
Micro averaging (aggregate the counts over all instances) vs. macro averaging (average the per-class scores); a small illustration follows.
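A tiny illustration of the difference (the labels are made up):
from sklearn.metrics import f1_score
y_true = [0, 0, 0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0, 2, 1]
print(f1_score(y_true, y_pred, average='micro'))   # 0.75: every instance counts equally
print(f1_score(y_true, y_pred, average='macro'))   # ~0.33: every class counts equally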
Using Sklearn’s SVM classifier
from sklearn import svm
clfrSVM = svm.SVC(kernel='linear', C=0.1)  # a linear kernel usually works best for text classification
# C is the parameter for the soft margin
clfrSVM.fit(train_data, train_labels)
predicted_labels = clfrSVM.predict(test_data)
Model Selection
Recall the discussion on multiple phases in a supervised learning task
Model Selection in Scikit-learn
from sklearn import model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split(train_data, train_labels, test_size=0.333, random_state=0)
predicted_labels = model_selection.cross_val_predict(clfrSVM, train_data, train_labels, cv = 5)
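A possible follow-up sketch, scoring the cross-validated predictions (train_labels and predicted_labels are the variables from the snippet above):
from sklearn import metrics
print(metrics.f1_score(train_labels, predicted_labels, average='micro'))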
Supervised Text Classification in NLTK
NLTK has some classification algorithms
-NaiveBayesClassifier
-DecisionTreeClassifier
-ConditionalExponentialClassifier
-MaxentClassifier
-WekaClassifier
-SklearnClassifier
Using NLTK’s NaiveBayesClassifier
from nltk.classify import NaiveBayesClassifier
classifier = NaiveBayesClassifier.train(train_set)
classifier.classify(unlabeled_instance)
classifier.classify_many(unlabeled_instances)
nltk.classify.util.accuracy(classifier, test_set)
classifier.labels()
classifier.show_most_informative_features()
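Unlike sklearn, NLTK classifiers expect each instance as a dictionary of features; a minimal sketch of that format, with made-up documents, labels, and a hypothetical word_features helper:
from nltk.classify import NaiveBayesClassifier
def word_features(text):
    return {word: True for word in text.lower().split()}   # binary word features
train_set = [(word_features('great phone love it'), 'pos'),
             (word_features('terrible battery awful'), 'neg')]
classifier = NaiveBayesClassifier.train(train_set)
print(classifier.classify(word_features('love this great screen')))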
Using NLTK’s SklearnClassifier
from nltk.classify import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
clfrNB = SklearnClassifier(MultinomialNB()).train(train_set)
clfrSVM = SklearnClassifier(SVC(kernel='linear')).train(train_set)
Demonstration:Case study - Sentiment Analysis
import pandas as pd
import numpy as np
df = pd.read_csv('Amazon_Unlocked_Mobile.csv')
df.head()
df.dropna(inplace=True)
df = df[df['Rating'] != 3]  # drop neutral (3-star) reviews
df['Positively Rated'] = np.where(df['Rating'] > 3, 1, 0)
df.head(10)
df['Positively Rated'].mean()
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['Reviews'], df['Positively Rated'], random_state=0)
# CountVectorizer allows us to use the bag-of-words approach by converting a collection of text documents into a matrix of token counts.
# It only counts how often each word occurs.
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer().fit(X_train)
vect.get_feature_names()[::2000]
len(vect.get_feature_names())
X_train_vectorized = vect.transform(X_train)
X_train_vectorized  # the entries in this matrix are the number of times each word appears in each document
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()  # LogisticRegression works well for high-dimensional sparse data
model.fit(X_train_vectorized, y_train)
from sklearn.metrics import roc_auc_score
predictions = model.predict(vect.transform(X_test))
print('AUC: ', roc_auc_score(y_test, predictions))  # AUC score
#Note that any words in X_test that didn’t appear in X_train will just be ignored.
TF-IDF: Term frequency - inverse document frequency
from sklearn.feature_extraction.text import TfidfVectorizer
# TF-IDF allows us to weight terms based on how important they are to a document.
# High weight is given to terms that appear often in a particular document but don't appear often in the corpus.
vect = TfidfVectorizer(min_df=5).fit(X_train)
len(vect.get_feature_names())
#Features with high tf-idf are frequently used within specific documents, but rarely used across all documents.
'''
CountVectorizer and TfidfVectorizer both take an argument, min_df, which allows us to specify
a minimum number of documents in which a token needs to appear to become part of the vocabulary.
This helps us remove words that appear in only a few documents and are unlikely to be useful predictors.
For example, here we'll pass in min_df = 5, which will remove any words from our vocabulary that appear in fewer than five documents.
'''
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(min_df = 5).fit(X_train)
len(vect.get_feature_names())
X_train_vectorized = vect.transform(X_train)
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)
predictions = model.predict(vect.transform(X_test))
print('AUC: ', roc_auc_score(y_test, predictions))
#see notes3_1
vect = CountVectorizer(min_df=5, ngram_range=(1, 2)).fit(X_train)  # add bigram features
X_train_vectorized = vect.transform(X_train)
len(vect.get_feature_names())
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)
predictions = model.predict(vect.transform(X_test))
print('AUC: ', roc_auc_score(y_test, predictions))
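A possible follow-up sketch (assuming the fitted vectorizer vect and logistic regression model from above): inspect which words get the largest negative and positive coefficients, i.e. the strongest sentiment cues the model learned.
feature_names = np.array(vect.get_feature_names())
sorted_coef_index = model.coef_[0].argsort()            # ascending order of coefficients
print('Smallest coefs (negative cues):', feature_names[sorted_coef_index[:10]])
print('Largest coefs (positive cues):', feature_names[sorted_coef_index[:-11:-1]])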