[Text_Mining]notes_3


Classification

Given a set of classes

Classification: Assign the correct class label to the given input

Examples of Text Classification:

Topic identification

Spam Detection

Sentiment analysis

Spelling correction

Supervised learning

Supervised Classification

Learn a classification model on properties ('features') and their importance ('weights') from labeled instances.

Apply the model to new instances to predict the label

 

Supervised classification: Phases and Datasets

Classification paradigms

Binary Classification: when there are only two possible classes.

Multi-class Classification: when there are more than two possible classes.

Multi-label Classification: when data instances can have two or more labels.

Questions to ask in Supervised Learning

Training phase:

What are the features? How do you represent them?

What is the classification model/algorithm?

What are the model parameters?

Inference phase:

What is the expected performance? What is a good measure?

 

Why is textual data unique?

Textual data presents a unique set of challenges

All the information you need is in the text

But features can be pulled out from text at different granularities

 

Types of textual features

Words:

By far the most common class of features

Handling commonly-occurring words: Stop words

Normalization: Make lower case vs. Leave as-is

Stemming/Lemmatization

Characteristics of words: Capitalization

Parts of speech of words in a sentence

Grammatical structure, sentence parsing

Grouping words of similar meaning, semantics

Depending on the classification task, features may also come from inside words and from word sequences (a small preprocessing sketch follows below).
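A minimal NLTK sketch of a few of these feature-extraction steps (the toy sentence is mine, not from the notes; it assumes the punkt, stopwords, wordnet and averaged_perceptron_tagger resources have been downloaded):

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "The quick brown foxes are jumping over the lazy dogs"          # toy sentence
tokens = nltk.word_tokenize(text.lower())                              # normalization: lower-case, then tokenize
tokens = [t for t in tokens if t not in stopwords.words('english')]    # drop common stop words
stems = [PorterStemmer().stem(t) for t in tokens]                      # stemming: 'jumping' -> 'jump'
lemmas = [WordNetLemmatizer().lemmatize(t) for t in tokens]            # lemmatization: 'foxes' -> 'fox'
pos_tags = nltk.pos_tag(tokens)                                        # part-of-speech tag per token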


Naive Bayes Classifiers

Case study: Classifying text search queries

 

Probabilistic model

Update the likelihood of the class given new information

Prior Probability: Pr(y = Entertainment), Pr(y = CS), Pr(y = Zoology)

When I have new information:

Posterior probability: Pr(y = Entertainment|x = ‘Python’)

Bayes’ Rule

Posterior probability = Prior probability * Likelihood / Evidence

Y* = argmax_y Pr(y|X) = argmax_y Pr(y) × Pr(X|y)

Naive assumption: Given the class label, features are assumed to be independent of each other.

Example:

Query: ‘Python download’

Y* = argmax_y Pr(y) × Pr('Python'|y) × Pr('download'|y)
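As a sanity check on this decision rule, here is a tiny worked example with made-up priors and likelihoods (the lecture gives no concrete numbers):

# hypothetical priors and per-class word likelihoods -- illustrative values only
priors = {'Entertainment': 0.5, 'CS': 0.3, 'Zoology': 0.2}
likelihood = {
    'Entertainment': {'python': 0.01, 'download': 0.02},
    'CS':            {'python': 0.10, 'download': 0.05},
    'Zoology':       {'python': 0.05, 'download': 0.001},
}

query = ['python', 'download']
scores = {}
for y in priors:
    score = priors[y]
    for word in query:
        score *= likelihood[y][word]     # naive assumption: multiply per-word likelihoods
    scores[y] = score
y_star = max(scores, key=scores.get)     # 'CS' wins for these made-up numbers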

 

Naive Bayes: What are the parameters?

Prior probabilities: Pr(y) for all y in Y

Likelihood: Pr(xi|y) for all features xi and labels y in Y

 

Q: You are training a naive Bayes classifier where the number of possible labels is |Y| = 3 and the dimension of each data element is |x| = 100, with every feature (dimension) binary. How many parameters does the naive Bayes classification model have? Answer: 603 (3 prior probabilities Pr(y), plus 3 × 100 × 2 = 600 likelihood parameters, counting Pr(xi = 0|y) and Pr(xi = 1|y) for each feature and label).

 

Naive Bayes: Learning parameters

Prior probabilities: Pr(y) for all y in Y

  - Remember the training data?

  -Count the number of instances in each class

  - If there are N instances in all, and n of those are labeled as class y, then Pr(y) = n/N

Likelihood: Pr(xi|y) for all features xi and labels y in Y

  -Count how many times feature xi appears in instances labeled as class y

  - If there are p instances of class y, and xi appears in k of those, then Pr(xi|y) = k/p (a counting sketch follows below)
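A rough sketch of these counting estimates; it assumes the training data is a list of (tokens, label) pairs, a format chosen here just for illustration:

from collections import Counter, defaultdict

def estimate_parameters(train_data):
    # train_data: list of (tokens, label) pairs -- assumed format
    N = len(train_data)
    class_counts = Counter(label for _, label in train_data)
    priors = {y: n / N for y, n in class_counts.items()}            # Pr(y) = n/N

    feature_counts = defaultdict(Counter)
    for tokens, y in train_data:
        feature_counts[y].update(set(tokens))                       # count instances of class y containing xi
    likelihoods = {y: {xi: k / class_counts[y] for xi, k in feature_counts[y].items()}
                   for y in class_counts}                           # Pr(xi|y) = k/p
    return priors, likelihoods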

 

Naive Bayes: Smoothing

What happens if Pr(xi|y) = 0?

  - Feature xi never occurs in documents labeled y

  -But then, the posterior probability Pr(y|xi) will be 0!

Instead, smooth the parameters

Laplace smoothing or Additive smoothing: Add a dummy count

  - Pr(xi|y) = (k+1)/(p+n), where n is the number of features (a small sketch follows below)
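The smoothed estimate as a one-line helper (variable names follow the slide; the example call uses made-up counts):

def smoothed_likelihood(k, p, n):
    # k: instances of class y containing feature xi; p: instances of class y; n: number of features
    return (k + 1) / (p + n)

smoothed_likelihood(0, 50, 1000)   # an unseen feature still gets a small non-zero probability (~0.00095)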

 

Take Home Concept

Naive Bayes is a probabilistic model

Naive, because it assumes features are independent of each other, given the class label.

For text classification problems, naive Bayes models typically provide very strong baselines.

Simple model, easy to learn parameters

 

Two Classic Naive Bayes Variants for Text

Multinomial Naive Bayes

   - Data follows a multinomial distribution

   - Each feature value is a count (word occurrence counts, TF-IDF weighting, ...)

Bernoulli Naive Bayes

   - Data follows a multivariate Bernoulli distribution

   - Each feature is binary (word present / absent)

   - It does not matter how many times the word was present (see the sklearn sketch below)
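A small sklearn sketch contrasting the two variants (the toy documents and labels are mine):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

docs = ['python download free', 'python snake habitat', 'download free movie tickets']   # toy corpus
labels = ['CS', 'Zoology', 'Entertainment']

X = CountVectorizer().fit_transform(docs)      # word-count features
MultinomialNB().fit(X, labels)                 # uses how often each word occurs
BernoulliNB(binarize=0.0).fit(X, labels)       # only uses whether each word is present or absent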

 

Case study: Sentiment analysis

Words that you might find in typical reviews

Classifier = Function on input data

 

Decision Boundaries

Classification function is represented by decision surfaces

 

Choosing a Decision Boundary

Overfitting: a decision boundary learned over the training data doesn't generalize to the test data.

 

Linear Boundaries

   -Easy to find

   -Easy to evaluate

   -More generalizable: ‘Occam’s razor’

 

Finding a Linear Boundary

   -Find the linear boundary = Find w or the slope of the line

Many methods

   -Perceptron

   - Linear Discriminant Analysis

   -Linear least squares

Problem: if the data is linearly separable, there are infinitely many linear boundaries.

What is a reasonable boundary? Maximum margin

Support Vector Machines are maximum-margin classifiers

 

Support Vector Machine(SVM)

Uses optimization techniques to find the maximum-margin boundary

SVMs are linear classifiers that find a hyperplane to separate two classes of data: positive and negative.

 

 

 

SVM:Multi-class classification

SVMs work only for binary classification problems

One vs Rest: an n-class SVM uses n binary classifiers (each class vs. the rest)

One vs One: an n-class SVM uses C(n,2) = n(n-1)/2 binary classifiers, one per pair of classes (see the sketch below)
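A minimal sklearn sketch of the two strategies; the wrapper classes are sklearn's, the synthetic data is just for illustration:

from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=90, n_features=10, n_informative=5, n_classes=3, random_state=0)
ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)   # one classifier per class: 3 in total
ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)    # one classifier per pair: C(3,2) = 3 in total
len(ovr.estimators_), len(ovo.estimators_)         # (3, 3)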

 

SVM Parameters (I): Parameter C

Regularization: how much importance to give to individual data points, compared to learning a model that generalizes better

Regularization parameter C

  - Larger values of C = less regularization

    - Fit the training data as well as possible; every data point is important

  - Smaller values of C = more regularization

    - More tolerant to errors on individual data points

SVM Parameters (II): Other params

Linear kernels usually work best for text data

  - Other kernels include rbf and polynomial

multi_class: ovr (one-vs-rest)

class_weight: different classes can get different weights

  - For example, in spam vs. not-spam classification, spam might make up around 80% of the e-mails someone receives; with such a skewed distribution (80% one class, 20% the other) you would want to give the two classes different weights (see the sketch below).
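A hedged sketch of how these parameters appear in sklearn; the specific values are illustrative, not from the lecture:

from sklearn.svm import SVC, LinearSVC

clfrSVM = SVC(kernel='linear',                     # linear kernels usually work best for text
              C=0.1,                               # smaller C = more regularization
              class_weight={0: 1, 1: 4})           # give the rarer class more weight (illustrative weights)

clfrLinear = LinearSVC(C=0.1,
                       multi_class='ovr',          # one-vs-rest for multi-class problems
                       class_weight='balanced')    # reweight classes inversely to their frequency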

 

Take Home Messages

-Support Vector Machines tend to be among the most accurate classifiers, especially on high-dimensional data.

-Strong theoretical foundation

-Handles only numeric features

   -Convert categorical features to numeric features

   -Normalization

-Hyperplane hard to interpret

Toolkits for Supervised Text Classification

-Scikit-learn

-NLTK

  - Interfaces with sklearn and other ML toolkits (like Weka)!

Using sklearn's naive Bayes classifier

from sklearn import naive_bayes
from sklearn import metrics

clfrNB = naive_bayes.MultinomialNB()
clfrNB.fit(train_data, train_labels)
predicted_labels = clfrNB.predict(test_data)
metrics.f1_score(test_labels, predicted_labels, average='micro')

F1 can be computed with micro averaging or macro averaging (see the brief sketch below).
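A brief sketch of the difference between the two averages, on made-up labels:

from sklearn import metrics

y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 2, 1]
metrics.f1_score(y_true, y_pred, average='micro')   # pools TP/FP/FN over all classes -> 0.667
metrics.f1_score(y_true, y_pred, average='macro')   # unweighted mean of per-class F1 -> 0.333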

 

Using sklearn's SVM classifier

from sklearn import svm

clfrSVM = svm.SVC(kernel='linear', C=0.1)   # a linear kernel is the usual choice for text; C controls the soft margin
clfrSVM.fit(train_data, train_labels)
predicted_labels = clfrSVM.predict(test_data)

 

Model Selection

Recall the discussion on multiple phases in a supervised learning task

Model Selection in Scikit-learn

from sklearn import model_selection

X_train, X_test, y_train, y_test = model_selection.train_test_split(train_data, train_labels, test_size=0.333, random_state=0)

predicted_labels = model_selection.cross_val_predict(clfrSVM, train_data, train_labels, cv=5)
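The out-of-fold predictions can then be scored against the training labels; a minimal sketch:

from sklearn import metrics

print(metrics.f1_score(train_labels, predicted_labels, average='micro'))
# or compute per-fold scores directly:
scores = model_selection.cross_val_score(clfrSVM, train_data, train_labels, cv=5, scoring='f1_micro')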

 

Supervised Text Classification in NLTK

NLTK has some classification algorithms

  -NaiveBayesClassifier

  -DecisionTreeClassifier

  -ConditionalExponentialClassifier

  -MaxentClassifier

  -WekaClassifier

  -SklearnClassifier

 

Using NLTK's NaiveBayesClassifier

import nltk
from nltk.classify import NaiveBayesClassifier

classifier = NaiveBayesClassifier.train(train_set)
classifier.classify(unlabeled_instance)
classifier.classify_many(unlabeled_instances)
nltk.classify.util.accuracy(classifier, test_set)
classifier.labels()
classifier.show_most_informative_features()
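NLTK classifiers expect each instance as a (feature dictionary, label) pair; a toy sketch of that format (the data is mine):

from nltk.classify import NaiveBayesClassifier

train_set = [
    ({'contains(free)': True,  'contains(meeting)': False}, 'spam'),
    ({'contains(free)': False, 'contains(meeting)': True},  'ham'),
    ({'contains(free)': True,  'contains(meeting)': True},  'spam'),
    ({'contains(free)': False, 'contains(meeting)': False}, 'ham'),
]
classifier = NaiveBayesClassifier.train(train_set)
classifier.classify({'contains(free)': True, 'contains(meeting)': False})   # classified as 'spam' on this toy data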

 

Using NLTK’s SklearnClassifier

from nltk.classify import SklearnClassifier

from sklearn.naive_bayes import MultinomialNB

from sklearn.svm import SVC

clfrNB = SklearnClassifier(MultinomialNB()).train(train_set)

clfrSVM = SklearnClassifier(SVC(kernel='linear')).train(train_set)

 

 

 

Demonstration: Case study - Sentiment Analysis

import pandas as pd

import numpy as np

df = pd.read_csv('Amazon_Unlocked_Mobile.csv')
df.head()
df.dropna(inplace=True)
df = df[df['Rating'] != 3]                                   # drop neutral (3-star) reviews
df['Positively Rated'] = np.where(df['Rating'] > 3, 1, 0)    # 4-5 stars -> 1, 1-2 stars -> 0
df.head(10)
df['Positively Rated'].mean()

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['Reviews'], df['Positively Rated'], random_state=0)

# CountVectorizer implements the bag-of-words approach by converting a collection of text
# documents into a matrix of token counts; it only counts how often each word occurs.

 

from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer().fit(X_train)

vect.get_feature_names()[::2000]   # look at every 2000th feature in the vocabulary

len(vect.get_feature_names())

X_train_vectorized = vect.transform(X_train)

X_train_vectorized   # the entries in this matrix are the number of times each word appears in each document

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()   # logistic regression works well for high-dimensional sparse data

model.fit(X_train_vectorized, y_train)

 

from sklearn.metrics import roc_auc_score

predictions = model.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))   # area under the ROC curve

#Note that any words in X_test that didn’t appear in X_train will just be ignored.

 

 

TF-IDF: Term frequency-inverse document frequency

from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF allows us to weight terms based on how important they are to a document.

# High weight is given to terms that appear often in a particular document but don't appear often in the corpus.

vect = TfidfVectorizer(min_df = 5).fit(X_train)

len(vect.get_feature_names())

# Features with high tf-idf are frequently used within specific documents, but rarely used across all documents.
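One way (mine, not from the notes) to peek at this is via the fitted vectorizer's idf_ values, which are low for terms that appear in many documents and high for rare ones:

import numpy as np

feature_names = np.array(vect.get_feature_names())
sorted_by_idf = vect.idf_.argsort()
feature_names[sorted_by_idf[:5]]     # lowest idf: terms that occur in many documents
feature_names[sorted_by_idf[-5:]]    # highest idf: terms that occur in very few documents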

 

'''
CountVectorizer and TfidfVectorizer both take an argument, min_df, which allows us to specify
a minimum number of documents in which a token needs to appear to become part of the vocabulary.
This helps us remove words that appear in only a few documents and are therefore unlikely to be
useful predictors. For example, here we pass min_df = 5, which removes any words from our
vocabulary that appear in fewer than five documents.
'''

from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(min_df = 5).fit(X_train)

len(vect.get_feature_names())

X_train_vectorized = vect.transform(X_train)

model = LogisticRegression()

model.fit(X_train_vectorized, y_train)

predictions = model.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))

 

#see notes3_1

vect = CountVectorizer(min_df = 5, ngram_range = (1,2)).fit(X_train)   # include unigrams and bigrams

X_train_vectorized = vect.transform(X_train)

len(vect.get_feature_names())

 

model = LogisticRegression()

model.fit(X_train_vectorized, y_train)

predictions = model.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))