Detecting Spam Emails


We have a collection of emails downloaded from the Internet, each labeled as spam or ham.

This article demonstrates the theory behind the naive Bayes spam filter.

Email Data

There are $n$ labeled emails with corresponding labels $l_i$, where each $l_i$ is either spam or ham.

We also have a dictionary of $J$ words.

The indicator $y_{ij}$ records whether word $j$ appears in the $i$-th email: $y_{ij} = 1$ if it does, and $y_{ij} = 0$ otherwise.
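For concreteness, here is a minimal sketch of how the indicator matrix can be built. The dictionary and the two toy emails are made up for illustration:

```python
# A minimal sketch of building the indicator matrix y_ij.
# The dictionary and the two toy emails are made-up examples.
dictionary = ["free", "money", "meeting", "tomorrow"]   # J = 4 words

emails = [
    "free money free money",      # a spam-looking email
    "meeting tomorrow at noon",   # a ham-looking email
]

# y[i][j] = 1 if word j of the dictionary occurs in email i, else 0
y = [[1 if word in set(email.split()) else 0 for word in dictionary]
     for email in emails]

print(y)  # [[1, 1, 0, 0], [0, 0, 1, 1]]
```

Note that only presence or absence is recorded; how many times a word occurs is ignored.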

Naive Bayes Model

  1. Set $\Pr(L=\text{spam}) = s$ and $\Pr(L=\text{ham}) = 1-s$.
  2. For $j = 1, \dots, J$:
    if $L = \text{ham}$, set $\Pr(Y_j=1) = p_j$, $\Pr(Y_j=0) = 1-p_j$;
    if $L = \text{spam}$, set $\Pr(Y_j=1) = q_j$, $\Pr(Y_j=0) = 1-q_j$.

We shall assume that the training pairs $(l_i, y_i)$ for each $i$ are i.i.d. according to the above description, and that, given $L$, all the $Y_j$ are conditionally independent.
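To make the generative story concrete, the following sketch samples one email's label and word indicators from the model. The parameter values are illustrative assumptions, not learned from data:

```python
import random

# Illustrative parameter values (assumptions for this sketch)
s = 0.3                     # Pr(L = spam)
p = [0.05, 0.02, 0.4, 0.3]  # Pr(Y_j = 1 | L = ham),  j = 1..J
q = [0.6, 0.5, 0.05, 0.1]   # Pr(Y_j = 1 | L = spam), j = 1..J

def sample_email():
    """Draw (label, y_1..y_J) from the naive Bayes generative model."""
    label = 'spam' if random.random() < s else 'ham'
    params = q if label == 'spam' else p
    # Given the label, each word indicator is an independent coin flip
    y = [1 if random.random() < prob else 0 for prob in params]
    return label, y

print(sample_email())  # e.g. ('ham', [0, 0, 1, 0])
```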

Parameter Learning

We need to estimate the parameters

$$\theta = \{s, p_1, \dots, p_J, q_1, \dots, q_J\}$$

To learn the parameters, we find the $\hat{\theta}$ that maximizes the likelihood of the training data:

$$\hat{\theta} = \arg\max_{\theta} P(l_1, y_{11}, \dots, y_{1J}, \dots, l_n, y_{n1}, \dots, y_{nJ}; \theta)$$

Because the samples are i.i.d., the joint likelihood factorizes over the samples, and the conditional independence of the $Y_j$ given $L$ further factorizes each sample's term over the words:

$$P(l_1, y_{11}, \dots, l_n, \dots, y_{nJ}; \theta) = \prod_{i=1}^{n} P_{L,Y_1,\dots,Y_J}(l_i, y_{i1}, \dots, y_{iJ}; \theta) = \prod_{i=1}^{n} \left[ P_L(l_i; \theta) \prod_{j=1}^{J} P_{Y_j|L}(y_{ij} \mid l_i; \theta) \right]$$

Given $\theta$, the individual factors are:

$$P_L(l; \theta) = s^{\mathbf{1}\{l=\text{spam}\}} (1-s)^{\mathbf{1}\{l=\text{ham}\}}$$

If $L = \text{ham}$:

$$P_{Y_j|L}(y_j \mid l; \theta) = p_j^{y_j} (1-p_j)^{1-y_j}$$

If $L = \text{spam}$:

$$P_{Y_j|L}(y_j \mid l; \theta) = q_j^{y_j} (1-q_j)^{1-y_j}$$

Taking the logarithm turns the products into sums:

$$\log \left( \prod_{i=1}^{n} \left[ P_L(l_i; \theta) \prod_{j=1}^{J} P_{Y_j|L}(y_{ij} \mid l_i; \theta) \right] \right) = \sum_{i=1}^{n} \log P_L(l_i; \theta) + \sum_{i=1}^{n} \sum_{j=1}^{J} \log P_{Y_j|L}(y_{ij} \mid l_i; \theta)$$

In the above equation, the first term expands as:

$$\sum_{i=1}^{n} \log P_L(l_i; \theta) = \left( \sum_{i=1}^{n} \mathbf{1}\{l_i=\text{spam}\} \right) \log(s) + \left( \sum_{i=1}^{n} \mathbf{1}\{l_i=\text{ham}\} \right) \log(1-s) = A \log(s) + B \log(1-s)$$

where, for simplification, we write $A = \sum_{i=1}^{n} \mathbf{1}\{l_i=\text{spam}\}$ and $B = \sum_{i=1}^{n} \mathbf{1}\{l_i=\text{ham}\}$.

Then, we simplify the second term. Here the superscript $(i)$ marks the $i$-th sample; it is not an exponent:

$$\sum_{i=1}^{n} \sum_{j=1}^{J} \log P_{Y_j|L}(y_j^{(i)} \mid l^{(i)}; \theta)$$

$$= \sum_{i=1}^{n} \mathbf{1}\{l^{(i)}=\text{ham}\} \sum_{j=1}^{J} \log \left( p_j^{y_j^{(i)}} (1-p_j)^{1-y_j^{(i)}} \right) + \sum_{i=1}^{n} \mathbf{1}\{l^{(i)}=\text{spam}\} \sum_{j=1}^{J} \log \left( q_j^{y_j^{(i)}} (1-q_j)^{1-y_j^{(i)}} \right)$$

$$= \sum_{i=1}^{n} \sum_{j=1}^{J} \mathbf{1}\{l^{(i)}=\text{ham}\} \left[ y_j^{(i)} \log(p_j) + (1-y_j^{(i)}) \log(1-p_j) \right] + \sum_{i=1}^{n} \sum_{j=1}^{J} \mathbf{1}\{l^{(i)}=\text{spam}\} \left[ y_j^{(i)} \log(q_j) + (1-y_j^{(i)}) \log(1-q_j) \right]$$

Swapping the order of summation:

$$= \sum_{j=1}^{J} \left[ \left( \sum_{i=1}^{n} \mathbf{1}\{l^{(i)}=\text{ham}\}\, y_j^{(i)} \right) \log(p_j) + \left( \sum_{i=1}^{n} \mathbf{1}\{l^{(i)}=\text{ham}\}\, (1-y_j^{(i)}) \right) \log(1-p_j) + \left( \sum_{i=1}^{n} \mathbf{1}\{l^{(i)}=\text{spam}\}\, y_j^{(i)} \right) \log(q_j) + \left( \sum_{i=1}^{n} \mathbf{1}\{l^{(i)}=\text{spam}\}\, (1-y_j^{(i)}) \right) \log(1-q_j) \right]$$

For simplification, define:

$$A_j = \sum_{i=1}^{n} \mathbf{1}\{l^{(i)}=\text{ham}\}\, y_j^{(i)}, \qquad B_j = \sum_{i=1}^{n} \mathbf{1}\{l^{(i)}=\text{ham}\}\, (1-y_j^{(i)})$$

$$C_j = \sum_{i=1}^{n} \mathbf{1}\{l^{(i)}=\text{spam}\}\, y_j^{(i)}, \qquad D_j = \sum_{i=1}^{n} \mathbf{1}\{l^{(i)}=\text{spam}\}\, (1-y_j^{(i)})$$

Then, we have:

$$A_j + B_j = \sum_{i=1}^{n} \mathbf{1}\{l^{(i)}=\text{ham}\} = B, \qquad C_j + D_j = \sum_{i=1}^{n} \mathbf{1}\{l^{(i)}=\text{spam}\} = A$$

Finding the Optimal Value

Each group of terms in the log-likelihood has the form

$$f(x) = M \ln(x) + N \ln(1-x)$$

whose first derivative is

$$\frac{\partial f}{\partial x} = \frac{M}{x} - \frac{N}{1-x} = \frac{M - Mx - Nx}{x(1-x)}$$

To find the maximizer of $f$, we set

$$\frac{\partial f}{\partial x} = 0$$

which gives

$$\hat{x} = \frac{M}{M+N}$$
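As a quick numerical sanity check, take arbitrary values $M = 3$ and $N = 2$, so the predicted maximizer is $3/5 = 0.6$; a grid search over $f$ recovers the same value:

```python
import math

M, N = 3, 2   # arbitrary example values
f = lambda x: M * math.log(x) + N * math.log(1 - x)

# Grid search over (0, 1); the maximizer should be M / (M + N) = 0.6
grid = [k / 10000 for k in range(1, 10000)]
x_best = max(grid, key=f)
print(x_best, M / (M + N))  # 0.6 0.6
```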

Applying this result to each group of terms in the log-likelihood yields the maximum-likelihood estimates of $\theta$:

$$\hat{s} = \frac{A}{A+B}, \qquad \hat{p}_j = \frac{A_j}{A_j+B_j}, \qquad \hat{q}_j = \frac{C_j}{C_j+D_j}$$
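Putting the pieces together, here is a minimal sketch that computes these counts and the resulting estimates on a tiny made-up training set (three emails, four dictionary words):

```python
# Toy training data: labels and indicator matrix (made-up example)
labels = ['spam', 'ham', 'ham']
y = [[1, 1, 0, 0],    # email 1 (spam)
     [0, 0, 1, 1],    # email 2 (ham)
     [0, 1, 1, 0]]    # email 3 (ham)
n, J = len(labels), len(y[0])

A = sum(1 for l in labels if l == 'spam')   # number of spam emails
B = n - A                                   # number of ham emails
s_hat = A / (A + B)

p_hat, q_hat = [], []
for j in range(J):
    Aj = sum(y[i][j] for i in range(n) if labels[i] == 'ham')
    Bj = sum(1 - y[i][j] for i in range(n) if labels[i] == 'ham')
    Cj = sum(y[i][j] for i in range(n) if labels[i] == 'spam')
    Dj = sum(1 - y[i][j] for i in range(n) if labels[i] == 'spam')
    p_hat.append(Aj / (Aj + Bj))   # fraction of ham emails containing word j
    q_hat.append(Cj / (Cj + Dj))   # fraction of spam emails containing word j

print(s_hat)   # 0.333...
print(p_hat)   # [0.0, 0.5, 1.0, 0.5]
print(q_hat)   # [1.0, 1.0, 0.0, 0.0]
```

Notice that on such a small sample several estimates come out exactly 0 or 1; this is the corner case that Laplace smoothing (below) fixes.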

Prediction

Given a new email with word indicators $y_1, y_2, \dots, y_J$, the probability that it is spam follows from Bayes' rule:

$$P_{L|Y_1,\dots,Y_J}(l \mid y_1, \dots, y_J) = \frac{P_L(l)\, P_{Y_1,\dots,Y_J|L}(y_1, \dots, y_J \mid l)}{P_{Y_1,\dots,Y_J}(y_1, \dots, y_J)}$$

$$P_{L|Y_1,\dots,Y_J}(\text{spam} \mid y_1, \dots, y_J) = \frac{s \prod_{j=1}^{J} q_j^{y_j} (1-q_j)^{1-y_j}}{K}$$

$$P_{L|Y_1,\dots,Y_J}(\text{ham} \mid y_1, \dots, y_J) = \frac{(1-s) \prod_{j=1}^{J} p_j^{y_j} (1-p_j)^{1-y_j}}{K}$$

where $K = s \prod_{j=1}^{J} q_j^{y_j} (1-q_j)^{1-y_j} + (1-s) \prod_{j=1}^{J} p_j^{y_j} (1-p_j)^{1-y_j}$.

So, we can determine whether an email is spam or ham from the posterior ratio:

$$Z = \frac{s}{1-s} \cdot \frac{\prod_{j=1}^{J} q_j^{y_j} (1-q_j)^{1-y_j}}{\prod_{j=1}^{J} p_j^{y_j} (1-p_j)^{1-y_j}}$$

If $Z \geq 1$, the email is spam.
However, in numerical calculation we should work in log space, because the product of many small probabilities underflows:

$$\ln(Z) = \ln(s) - \ln(1-s) + \sum_{j=1}^{J} \left[ y_j (\ln(q_j) - \ln(p_j)) + (1-y_j)(\ln(1-q_j) - \ln(1-p_j)) \right]$$

If $\ln(Z) \geq 0$, the email is spam. Otherwise, it is ham.
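Here is a minimal sketch of this log-space decision rule, assuming the parameters have already been estimated and smoothed so that no entry is exactly 0 or 1 (the values below are illustrative):

```python
import math

def predict(y, s, p, q):
    """Return 'spam' if ln(Z) >= 0 for indicator vector y, else 'ham'."""
    log_z = math.log(s) - math.log(1 - s)
    for yj, pj, qj in zip(y, p, q):
        if yj:
            log_z += math.log(qj) - math.log(pj)
        else:
            log_z += math.log(1 - qj) - math.log(1 - pj)
    return 'spam' if log_z >= 0 else 'ham'

# Illustrative smoothed parameters
s = 0.3
p = [0.1, 0.2, 0.7, 0.6]   # Pr(Y_j = 1 | ham)
q = [0.8, 0.7, 0.1, 0.2]   # Pr(Y_j = 1 | spam)
print(predict([1, 1, 0, 0], s, p, q))  # spam
print(predict([0, 0, 1, 1], s, p, q))  # ham
```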

Laplace Smoothing

Have we finished? Not quite: there is a corner case where some word never appears in the training samples of one class, which makes its estimated probability exactly 0 (or 1) and zeroes out the entire product. To handle this problem, we use a Laplace smoothing coefficient $\text{lap}$.
The parameters then become:

$$\hat{s} = \frac{A + \text{lap}}{A + B + 2\,\text{lap}}, \qquad \hat{p}_j = \frac{A_j + \text{lap}}{A_j + B_j + 2\,\text{lap}}, \qquad \hat{q}_j = \frac{C_j + \text{lap}}{C_j + D_j + 2\,\text{lap}}$$
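In code, the change is a one-line tweak wherever a ratio of counts is taken. A sketch with $\text{lap} = 1$, which is also the value effectively used by naivebayes.py below (word frequencies there are smoothed as $(\text{count}+1)/(\text{num\_files}+2)$):

```python
lap = 1.0   # Laplace smoothing coefficient

def smoothed(numerator, denominator):
    """Smoothed estimate (numerator + lap) / (denominator + 2*lap)."""
    return (numerator + lap) / (denominator + 2 * lap)

# Even a word never seen in any ham email gets a nonzero probability:
Aj, Bj = 0, 50                     # word j absent from all 50 ham emails
print(smoothed(Aj, Aj + Bj))       # 1/52 ≈ 0.0192 instead of 0.0
```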

File Arrangement

There are two .py files and one data directory in the current workspace.

In the data directory, there are 3 subdirectories: ham, spam, and testing, each containing the email text files.

(The original post shows screenshots here of the directory layout, the file listings of ham, spam, and testing, and a sample email.)

All words in an email are separated by a space, even punctuation.

Code

naivebayes.py

```python
import sys
import os.path
import collections
import math
import util
import numpy as np

USAGE = "%s <test data folder> <spam folder> <ham folder>"


def get_counts(file_list):
    """
    Computes counts for each word that occurs in the files in file_list.

    Inputs
    ------
    file_list : a list of filenames, suitable for use with open() or
                util.get_words_in_file()

    Output
    ------
    A dict whose keys are words, and whose values are the number of files
    the key occurred in.
    """
    words = []
    for filename in file_list:
        # set() ensures each word is counted at most once per file
        words.extend(list(set(util.get_words_in_file(filename))))
    counter = collections.Counter(words)
    return counter


def get_log_probabilities(file_list):
    """
    Computes log-frequencies for each word that occurs in the files in
    file_list.

    Input
    -----
    file_list : a list of filenames, suitable for use with open() or
                util.get_words_in_file()

    Output
    ------
    A dict whose keys are words, and whose values are the log of the
    smoothed estimate of the fraction of files the key occurred in.
    """
    counter = get_counts(file_list)
    num_files = len(file_list)
    for key in counter:
        # Laplace smoothing with lap = 1: (count + 1) / (num_files + 2)
        counter[key] = math.log((counter[key] + 1) / (num_files + 2))
    return counter


def learn_distributions(file_lists_by_category):
    """
    Input
    -----
    A two-element list. The first element is a list of spam files,
    and the second element is a list of ham (non-spam) files.

    Output
    ------
    (log_probabilities_by_category, log_prior_by_category)

    log_probabilities_by_category : A list whose first element is a smoothed
                                    estimate for log P(y=w_j|c=spam) (as a
                                    dict, just as in get_log_probabilities
                                    above), and whose second element is the
                                    same for c=ham.
    log_prior_by_category : A list of estimates for the log-probabilities for
                            each class:
                            [est. for log P(c=spam), est. for log P(c=ham)]
    """
    spam_file_list, ham_file_list = file_lists_by_category
    spam_counter = get_log_probabilities(spam_file_list)
    ham_counter = get_log_probabilities(ham_file_list)
    length_spam = len(spam_file_list)
    length_ham = len(ham_file_list)
    # Words seen in only one category still need a (smoothed) probability
    # under the other category.
    all_set = spam_counter.keys() | ham_counter.keys()
    for word in all_set:
        if word not in spam_counter:
            spam_counter[word] = math.log(1.0 / (length_spam + 2))  # smooth
        if word not in ham_counter:
            ham_counter[word] = math.log(1.0 / (length_ham + 2))    # smooth
    n_total = length_spam + length_ham
    return ([spam_counter, ham_counter],
            [math.log(length_spam * 1.0 / n_total),
             math.log(length_ham * 1.0 / n_total)])


def classify_email(email_filename,
                   log_probabilities_by_category,
                   log_prior_by_category):
    """
    Uses Naive Bayes classification to classify the email in the given file.

    Inputs
    ------
    email_filename : name of the file containing the email to be classified
    log_probabilities_by_category : See output of learn_distributions
    log_prior_by_category : See output of learn_distributions

    Output
    ------
    One of the labels 'spam' or 'ham'.
    """
    words = set(util.get_words_in_file(email_filename))
    spam_counter, ham_counter = log_probabilities_by_category
    # NOTE: the learned prior is deliberately overridden here with a fixed
    # prior log(s) = -9.0; uncomment the next line to use the learned one.
    # prob_log_spam, prob_log_ham = log_prior_by_category
    prob_log_spam = -9.0
    prob_log_ham = math.log(1 - math.exp(prob_log_spam))
    spam_log_sum = prob_log_spam
    ham_log_sum = prob_log_ham
    print("log(s) = {0}, log(1-s) = {1}".format(prob_log_spam, prob_log_ham))
    # Accumulate log P(Y_j = y_j | L) over the whole dictionary: a word
    # contributes log q_j (or log p_j) if present, log(1 - q_j) if absent.
    for word in spam_counter:
        if word in words:
            spam_log_sum += spam_counter[word]
        else:
            spam_log_sum += math.log(1 - math.exp(spam_counter[word]))
    for word in ham_counter:
        if word in words:
            ham_log_sum += ham_counter[word]
        else:
            ham_log_sum += math.log(1 - math.exp(ham_counter[word]))
    if spam_log_sum >= ham_log_sum:
        return 'spam'
    else:
        return 'ham'


def classify_emails(spam_files, ham_files, test_files):
    """
    Compute the label of each email in test_files.
    Return value: a list such as ['spam', 'ham', 'spam', 'ham', ...]
    """
    log_probabilities_by_category, log_prior = \
        learn_distributions([spam_files, ham_files])
    estimated_labels = []
    for test_file in test_files:
        estimated_label = \
            classify_email(test_file, log_probabilities_by_category, log_prior)
        estimated_labels.append(estimated_label)
    return estimated_labels


def main():
    """
    usage:
    $ python naivebayes.py data/testing/ data/spam/ data/ham/
    """
    ### Read arguments
    if len(sys.argv) != 4:
        print(USAGE % sys.argv[0])
        sys.exit(1)
    testing_folder = sys.argv[1]
    (spam_folder, ham_folder) = sys.argv[2:4]

    ### Learn the distributions
    file_lists = []
    for folder in (spam_folder, ham_folder):
        file_lists.append(util.get_files_in_folder(folder))
    (log_probabilities_by_category, log_priors_by_category) = \
        learn_distributions(file_lists)

    # Here, columns and rows are indexed by 0 = 'spam' and 1 = 'ham';
    # rows correspond to true labels, columns to guessed labels.
    performance_measures = np.zeros([2, 2])

    ### Classify and measure performance
    for filename in util.get_files_in_folder(testing_folder):
        ## Classify
        label = classify_email(filename,
                               log_probabilities_by_category,
                               log_priors_by_category)
        ## Measure performance:
        # use the filename to determine the true label
        base = os.path.basename(filename)
        true_index = ('ham' in base)
        guessed_index = (label == 'ham')
        performance_measures[true_index, guessed_index] += 1

        # Uncomment this line to see which files your classifier
        # gets right/wrong:
        # print("%s : %s" % (label, filename))

    template = ("You correctly classified %d out of %d spam emails, "
                "and %d out of %d ham emails.")
    # Correct counts are on the diagonal
    correct = np.diag(performance_measures)
    # Totals are obtained by summing across guessed labels
    totals = np.sum(performance_measures, 1)
    print(template % (correct[0], totals[0], correct[1], totals[1]))


if __name__ == '__main__':
    main()
```

util.py

```python
import os


def get_words_in_file(filename):
    """ Returns a list of all words in the file at filename. """
    with open(filename, 'r', encoding='utf-8', errors='ignore') as f:
        # read() reads in a string from a file pointer, and split() splits a
        # string into words based on whitespace
        words = f.read().split()
    return words


def get_files_in_folder(folder):
    """ Returns a list of files in folder (including the path to the file) """
    filenames = os.listdir(folder)
    # os.path.join combines paths while dealing with /s and \s appropriately
    full_filenames = [os.path.join(folder, filename) for filename in filenames]
    return full_filenames
```

