[Text_Mining]notes_3


Classification

Given a set of classes

Classification: Assign the correct class label to the given input

Examples of Text Classification:

Topic identification

Spam Detection

Sentiment analysis

Spelling correction

Supervised learning

Supervised Classification

Learn a classification model on properties ('features') and their importance ('weights') from labeled instances.

Apply the model to new instances to predict the label

 

Supervised classification: Phases and Datasets

Classification paradigms

Binary Classification: when there are only two possible classes.

Multi-class Classification: when there are more than two possible classes.

Multi-label Classification: when data instances can have two or more labels.

Questions to ask in Supervised Learning

Training phase:

What are the features? How do you represent them?

What is the classification model/algorithm?

What are the model parameters?

Inference phase:

What is the expected performance? What is a good measure?

 

Why is textual data unique?

Textual data presents a unique set of challenges

All the information you need is in the text

But features can be pulled out from text at different granularities

 

Types of textual features

Words:

By far the most common class of features

Handling commonly-occurring words: Stop words

Normalization: Make lower case vs. Leave as-is

Stemming/Lemmatization

Characteristics of words: Capitalization

Parts of speech of words in a sentence

Grammatical structure, sentence parsing

Grouping words of similar meaning, semantics

Depending on the classification task, features may also come from inside words and from word sequences (a small preprocessing sketch follows below).
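A minimal NLTK sketch of a few of these feature-extraction steps (the toy sentence is mine, not from the notes; it assumes the punkt, stopwords, wordnet and averaged_perceptron_tagger resources have been downloaded):

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "The quick brown foxes are jumping over the lazy dogs"          # toy sentence
tokens = nltk.word_tokenize(text.lower())                              # normalization: lower-case, then tokenize
tokens = [t for t in tokens if t not in stopwords.words('english')]    # drop common stop words
stems = [PorterStemmer().stem(t) for t in tokens]                      # stemming: 'jumping' -> 'jump'
lemmas = [WordNetLemmatizer().lemmatize(t) for t in tokens]            # lemmatization: 'foxes' -> 'fox'
pos_tags = nltk.pos_tag(tokens)                                        # part-of-speech tag per token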


Naive Bayes Classifiers

Case study: Classifying text search queries

 

Probabilistic model

Update the likelihood of the class given new information

Prior Probability: Pr(y = Entertainment), Pr(y = CS), Pr(y = Zoology)

When I have new information:

Posterior probability: Pr(y = Entertainment|x = ‘Python’)

Bayes’ Rule

Posterior probability = Prior probability * Likelihood / Evidence

Y* = argmax_y Pr(y|X) = argmax_y Pr(y) × Pr(X|y)

Naive assumption: Given the class label, features are assumed to be independent of each other.

Example:

Query: ‘Python download’

Y* = argmax_y Pr(y) × Pr('Python'|y) × Pr('download'|y)
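As a sanity check on this decision rule, here is a tiny worked example with made-up priors and likelihoods (the lecture gives no concrete numbers):

# hypothetical priors and per-class word likelihoods -- illustrative values only
priors = {'Entertainment': 0.5, 'CS': 0.3, 'Zoology': 0.2}
likelihood = {
    'Entertainment': {'python': 0.01, 'download': 0.02},
    'CS':            {'python': 0.10, 'download': 0.05},
    'Zoology':       {'python': 0.05, 'download': 0.001},
}

query = ['python', 'download']
scores = {}
for y in priors:
    score = priors[y]
    for word in query:
        score *= likelihood[y][word]     # naive assumption: multiply per-word likelihoods
    scores[y] = score
y_star = max(scores, key=scores.get)     # 'CS' wins for these made-up numbers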

 

Naive Bayes: What are the parameters?

Prior probabilities: Pr(y) for all y in Y

Likelihood: Pr(xi|y) for all features xi and labels y in Y

 

Q: You are training a naive Bayes classifier where the number of possible labels is |Y| = 3 and the dimension of each data element is |x| = 100, with every feature (dimension) binary. How many parameters does the naive Bayes classification model have? Answer: 603 (3 prior probabilities Pr(y), plus 3 × 100 × 2 = 600 likelihood parameters, counting Pr(xi = 0|y) and Pr(xi = 1|y) for each feature and label).

 

Naive Bayes: Learning parameters

Prior probabilities: Pr(y) for all y in Y

  - Remember the training data?

  -Count the number of instances in each class

  - If there are N instances in all, and n of those are labeled as class y, then Pr(y) = n/N

Likelihood: Pr(xi|y) for all features xi and labels y in Y

  -Count how many times feature xi appears in instances labeled as class y

  - If there are p instances of class y, and xi appears in k of those, then Pr(xi|y) = k/p (a counting sketch follows below)
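A rough sketch of these counting estimates; it assumes the training data is a list of (tokens, label) pairs, a format chosen here just for illustration:

from collections import Counter, defaultdict

def estimate_parameters(train_data):
    # train_data: list of (tokens, label) pairs -- assumed format
    N = len(train_data)
    class_counts = Counter(label for _, label in train_data)
    priors = {y: n / N for y, n in class_counts.items()}            # Pr(y) = n/N

    feature_counts = defaultdict(Counter)
    for tokens, y in train_data:
        feature_counts[y].update(set(tokens))                       # count instances of class y containing xi
    likelihoods = {y: {xi: k / class_counts[y] for xi, k in feature_counts[y].items()}
                   for y in class_counts}                           # Pr(xi|y) = k/p
    return priors, likelihoods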

 

Naive Bayes: Smoothing

What happens if Pr(xi|y) = 0?

  - Feature xi never occurs in documents labeled y

  -But then, the posterior probability Pr(y|xi) will be 0!

Instead, smooth the parameters

Laplace smoothing or Additive smoothing: Add a dummy count

  - Pr(xi|y) = (k+1)/(p+n), where n is the number of features (a small sketch follows below)
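The smoothed estimate as a one-line helper (variable names follow the slide; the example call uses made-up counts):

def smoothed_likelihood(k, p, n):
    # k: instances of class y containing feature xi; p: instances of class y; n: number of features
    return (k + 1) / (p + n)

smoothed_likelihood(0, 50, 1000)   # an unseen feature still gets a small non-zero probability (~0.00095)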

 

Take Home Concept

Naive Bayes is a probabilistic model

Naive, because it assumes features are independent of each other, given the class label.

For text classification problems, naive Bayes models typically provide very strong baselines.

Simple model, easy to learn parameters

 

Two Classic Naive Bayes Variants for Text

Multinomial Naive Bayes

   - Data follows a multinomial distribution

   - Each feature value is a count (word occurrence counts, TF-IDF weighting, ...)

Bernoulli Naive Bayes

   - Data follows a multivariate Bernoulli distribution

   - Each feature is binary (word present / absent)

   - It does not matter how many times the word was present (see the sklearn sketch below)
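A small sklearn sketch contrasting the two variants (the toy documents and labels are mine):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

docs = ['python download free', 'python snake habitat', 'download free movie tickets']   # toy corpus
labels = ['CS', 'Zoology', 'Entertainment']

X = CountVectorizer().fit_transform(docs)      # word-count features
MultinomialNB().fit(X, labels)                 # uses how often each word occurs
BernoulliNB(binarize=0.0).fit(X, labels)       # only uses whether each word is present or absent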

 

Case study: Sentiment analysis

Words that you might find in typical reviews

Classifier = Function on input data

 

Decision Boundaries

Classification function is represented by decision surfaces

 

Choosing a Decision Boundary

Overfitting: a decision boundary learned over the training data doesn't generalize to the test data.

 

Linear Boundaries

   -Easy to find

   -Easy to evaluate

   -More generalizable: ‘Occam’s razor’

 

Finding a Linear Boundary

   -Find the linear boundary = Find w or the slope of the line

Many methods

   -Perceptron

   - Linear Discriminant Analysis

   -Linear least squares

Problem: if the data is linearly separable, there are infinitely many linear boundaries.

What is a reasonable boundary? Maximum margin

Support Vector Machines are maximum-margin classifiers

 

Support Vector Machine(SVM)

Uses optimization techniques to find the maximum-margin boundary

SVMs are linear classifiers that find a hyperplane to separate two classes of data: positive and negative.

 

 

 

SVM:Multi-class classification

SVMs work only for binary classification problems

One vs Rest: an n-class SVM uses n binary classifiers (each class vs. the rest)

One vs One: an n-class SVM uses C(n,2) = n(n-1)/2 binary classifiers, one per pair of classes (see the sketch below)
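A minimal sklearn sketch of the two strategies; the wrapper classes are sklearn's, the synthetic data is just for illustration:

from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=90, n_features=10, n_informative=5, n_classes=3, random_state=0)
ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)   # one classifier per class: 3 in total
ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)    # one classifier per pair: C(3,2) = 3 in total
len(ovr.estimators_), len(ovo.estimators_)         # (3, 3)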

 

SVM Parameters (I): Parameter C

Regularization: how much importance to give to individual data points, compared to learning a model that generalizes better

Regularization parameter C

  - Larger values of C = less regularization

    - Fit the training data as well as possible; every data point is important

  - Smaller values of C = more regularization

    - More tolerant to errors on individual data points

SVM Parameters (II): Other params

Linear kernels usually work best for text data

  - Other kernels include rbf and polynomial

multi_class: ovr (one-vs-rest)

class_weight: different classes can get different weights

  - For example, in spam vs. not-spam classification, spam might make up around 80% of the e-mails someone receives; with such a skewed distribution (80% one class, 20% the other) you would want to give the two classes different weights (see the sketch below).
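A hedged sketch of how these parameters appear in sklearn; the specific values are illustrative, not from the lecture:

from sklearn.svm import SVC, LinearSVC

clfrSVM = SVC(kernel='linear',                     # linear kernels usually work best for text
              C=0.1,                               # smaller C = more regularization
              class_weight={0: 1, 1: 4})           # give the rarer class more weight (illustrative weights)

clfrLinear = LinearSVC(C=0.1,
                       multi_class='ovr',          # one-vs-rest for multi-class problems
                       class_weight='balanced')    # reweight classes inversely to their frequency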

 

Take Home Messages

-Support Vector Machines tend to be among the most accurate classifiers, especially on high-dimensional data.

-Strong theoretical foundation

-Handles only numeric features

   -Convert categorical features to numeric features

   -Normalization

-Hyperplane hard to interpret

Toolkits for Supervised Text Classification

-Scikit-learn

-NLTK

  - Interfaces with sklearn and other ML toolkits (like Weka)!

Using sklearn's naive Bayes classifier

from sklearn import naive_bayes
from sklearn import metrics

clfrNB = naive_bayes.MultinomialNB()
clfrNB.fit(train_data, train_labels)
predicted_labels = clfrNB.predict(test_data)
metrics.f1_score(test_labels, predicted_labels, average='micro')

F1 can be computed with micro averaging or macro averaging (see the brief sketch below).
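A brief sketch of the difference between the two averages, on made-up labels:

from sklearn import metrics

y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 2, 1]
metrics.f1_score(y_true, y_pred, average='micro')   # pools TP/FP/FN over all classes -> 0.667
metrics.f1_score(y_true, y_pred, average='macro')   # unweighted mean of per-class F1 -> 0.333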

 

Using sklearn's SVM classifier

from sklearn import svm

clfrSVM = svm.SVC(kernel='linear', C=0.1)   # a linear kernel is the usual choice for text; C controls the soft margin
clfrSVM.fit(train_data, train_labels)
predicted_labels = clfrSVM.predict(test_data)

 

Model Selection

Recall the discussion on multiple phases in a supervised learning task

Model Selection in Scikit-learn

from sklearn import model_selection

X_train, X_test, y_train, y_test = model_selection.train_test_split(train_data, train_labels, test_size=0.333, random_state=0)

predicted_labels = model_selection.cross_val_predict(clfrSVM, train_data, train_labels, cv=5)
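The out-of-fold predictions can then be scored against the training labels; a minimal sketch:

from sklearn import metrics

print(metrics.f1_score(train_labels, predicted_labels, average='micro'))
# or compute per-fold scores directly:
scores = model_selection.cross_val_score(clfrSVM, train_data, train_labels, cv=5, scoring='f1_micro')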

 

Supervised Text Classification in NLTK

NLTK has some classification algorithms

  -NaiveBayesClassifier

  -DecisionTreeClassifier

  -ConditionalExponentialClassifier

  -MaxentClassifier

  -WekaClassifier

  -SklearnClassifier

 

Using NLTK's NaiveBayesClassifier

import nltk
from nltk.classify import NaiveBayesClassifier

classifier = NaiveBayesClassifier.train(train_set)
classifier.classify(unlabeled_instance)
classifier.classify_many(unlabeled_instances)
nltk.classify.util.accuracy(classifier, test_set)
classifier.labels()
classifier.show_most_informative_features()
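NLTK classifiers expect each instance as a (feature dictionary, label) pair; a toy sketch of that format (the data is mine):

from nltk.classify import NaiveBayesClassifier

train_set = [
    ({'contains(free)': True,  'contains(meeting)': False}, 'spam'),
    ({'contains(free)': False, 'contains(meeting)': True},  'ham'),
    ({'contains(free)': True,  'contains(meeting)': True},  'spam'),
    ({'contains(free)': False, 'contains(meeting)': False}, 'ham'),
]
classifier = NaiveBayesClassifier.train(train_set)
classifier.classify({'contains(free)': True, 'contains(meeting)': False})   # classified as 'spam' on this toy data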

 

Using NLTK’s SklearnClassifier

from nltk.classify import SklearnClassifier

from sklearn.naive_bayes import MultinomialNB

from sklearn.svm import SVC

clfrNB = SklearnClassifier(MultinomialNB()).train(train_set)

clfrSVM = SklearnClassifier(SVC(kernel='linear')).train(train_set)

 

 

 

Demonstration: Case study - Sentiment Analysis

import pandas as pd

import numpy as np

df = pd.read_csv('Amazon_Unlocked_Mobile.csv')
df.head()
df.dropna(inplace=True)
df = df[df['Rating'] != 3]                                   # drop neutral (3-star) reviews
df['Positively Rated'] = np.where(df['Rating'] > 3, 1, 0)    # 4-5 stars -> 1, 1-2 stars -> 0
df.head(10)
df['Positively Rated'].mean()

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['Reviews'], df['Positively Rated'], random_state=0)

# CountVectorizer implements the bag-of-words approach by converting a collection of text
# documents into a matrix of token counts; it only counts how often each word occurs.

 

from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer().fit(X_train)

vect.get_feature_names()[::2000]   # look at every 2000th feature in the vocabulary

len(vect.get_feature_names())

X_train_vectorized = vect.transform(X_train)

X_train_vectorized   # the entries in this matrix are the number of times each word appears in each document

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()   # logistic regression works well for high-dimensional sparse data

model.fit(X_train_vectorized, y_train)

 

from sklearn.metrics import roc_auc_score

predictions = model.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))   # area under the ROC curve

#Note that any words in X_test that didn’t appear in X_train will just be ignored.

 

 

TF-IDF: Term frequency-inverse document frequency

from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF allows us to weight terms based on how important they are to a document.

# High weight is given to terms that appear often in a particular document but don't appear often in the corpus.

vect = TfidfVectorizer(min_df = 5).fit(X_train)

len(vect.get_feature_names())

# Features with high tf-idf are frequently used within specific documents, but rarely used across all documents.
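One way (mine, not from the notes) to peek at this is via the fitted vectorizer's idf_ values, which are low for terms that appear in many documents and high for rare ones:

import numpy as np

feature_names = np.array(vect.get_feature_names())
sorted_by_idf = vect.idf_.argsort()
feature_names[sorted_by_idf[:5]]     # lowest idf: terms that occur in many documents
feature_names[sorted_by_idf[-5:]]    # highest idf: terms that occur in very few documents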

 

'''
CountVectorizer and TfidfVectorizer both take an argument, min_df, which allows us to specify
a minimum number of documents in which a token needs to appear to become part of the vocabulary.
This helps us remove words that appear in only a few documents and are therefore unlikely to be
useful predictors. For example, here we pass min_df = 5, which removes any words from our
vocabulary that appear in fewer than five documents.
'''

from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(min_df = 5).fit(X_train)

len(vect.get_feature_names())

X_train_vectorized = vect.transform(X_train)

model = LogisticRegression()

model.fit(X_train_vectorized, y_train)

predictions = model.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))

 

#see notes3_1

vect = CountVectorizer(min_df = 5, ngram_range = (1,2)).fit(X_train)   # include unigrams and bigrams

X_train_vectorized = vect.transform(X_train)

len(vect.get_feature_names())

 

model = LogisticRegression()

model.fit(X_train_vectorized, y_train)

predictions = model.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))