Naive Bayes algorithm for spam classification (MATLAB implementation)

The materials, data, and algorithm come from Problem Set 2 (Q3) of Andrew Ng's Stanford Machine Learning course.

1. Preprocessing

(1) Keep only the subject and body of each email in the dataset.

(2) Convert all words to lowercase.

(3) Replace each email address with the word EMAILADDR; similarly, replace web addresses with HTTPADDR, currency with DOLLAR, and numbers with NUMBER.

(4) Build the vocabulary. Stem the words with a standard stemming algorithm, then keep only medium-frequency tokens in the vocabulary (tokens that occur very often or very rarely are both dropped).

(5) Build the document-word matrix. The i-th row represents the i-th document (email), and the j-th column represents the j-th distinct token. Thus the (i, j) entry of this matrix is the number of occurrences of the j-th token in the i-th document.
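Steps (2), (3), and (5) can be sketched in a few lines of MATLAB. This is only an illustration under assumptions: the toy emails, the regular expressions, and the variable names (emails, tokens, vocab, X) are made up here and are not the course's actual preprocessing script. Stemming and the frequency filter from step (4) are left out.

```matlab
% Toy emails (hypothetical data, for illustration only).
emails = { ...
    'Win $100 NOW at http://example.com', ...
    'Meeting at 10, mail bob@example.com', ...
    'win win win'};

numDocs = numel(emails);
tokens  = cell(numDocs, 1);
for d = 1:numDocs
    s = lower(emails{d});                            % step (2): lowercase
    s = regexprep(s, '\S+@\S+', ' EMAILADDR ');      % step (3): email addresses
    s = regexprep(s, 'https?://\S+', ' HTTPADDR ');  %           web addresses
    s = regexprep(s, '\$', ' DOLLAR ');              %           currency sign
    s = regexprep(s, '\d+', ' NUMBER ');             %           numbers
    tokens{d} = regexp(s, '[A-Za-z]+', 'match');     % split into word tokens
end

% Vocabulary: every distinct token seen in the toy corpus.
% (Assumption: no stemming and no frequency filtering, unlike step (4).)
vocab = unique([tokens{:}]);

% Step (5): X(i, j) = number of occurrences of token j in email i.
X = zeros(numDocs, numel(vocab));
for d = 1:numDocs
    [~, col] = ismember(tokens{d}, vocab);           % column index of each token
    X(d, :)  = accumarray(col(:), 1, [numel(vocab), 1])';
end

disp(vocab);
disp(X);
```

For a real corpus the matrix is mostly zeros, so storing it with sparse(X) is the natural choice; the programs below convert back with full().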

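With the data in this form, the classifier can be implemented in MATLAB. (Note: the programs below use the second method described in another post, Naive Bayes Classifier.)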

nb_train.m

clear; clc;

[spmatrix, tokenlist, trainCategory] = readMatrix('MATRIX.TRAIN');

trainMatrix = full(spmatrix); % rows are documents, columns are tokens; each entry is the number of times the token occurs in the document
numTrainDocs = size(trainMatrix, 1);
numTokens = size(trainMatrix, 2);

% trainMatrix is now a (numTrainDocs x numTokens) matrix.
% Each row represents a unique document (email).
% The j-th column of row i represents the number of times the j-th
% token appeared in email i.

% tokenlist is a long string containing the list of all tokens (words).
% These tokens are easily known by position in the file TOKENS_LIST

% trainCategory is a (1 x numTrainDocs) vector containing the true
% classifications for the documents just read in. The i-th entry gives the
% correct class for the i-th email (which corresponds to the i-th row in
% the document word matrix).

% Spam documents are indicated as class 1, and non-spam as class 0.
% Note that for the SVM, you would want to convert these to +1 and -1.

%-----------------------------
V = size(trainMatrix, 2); % total number of tokens in the vocabulary

neg = trainMatrix(find(trainCategory == 0), :); % non-spam examples
pos = trainMatrix(find(trainCategory == 1), :); % spam examples

% Total number of vocabulary-token occurrences in the non-spam documents.
% Only tokens in the vocabulary are counted, not every word of the
% original emails as in the course notes.
neg_words = sum(sum(neg));
pos_words = sum(sum(pos));

neg_log_prior = log(size(neg,1) / numTrainDocs); % prior = number of non-spam examples / total number of examples
pos_log_prior = log(size(pos,1) / numTrainDocs); % prior = number of spam examples / total number of examples

neg_log_phi = zeros(1, V);
pos_log_phi = zeros(1, V);
for k = 1:V
    % Column k holds the counts of token k in every document, so a column
    % sum gives the numerator: the total number of times token k occurs in
    % the non-spam documents (plus 1 for Laplace smoothing).
    % The denominator neg_words + V is the total number of vocabulary-token
    % occurrences in the non-spam documents plus the vocabulary size.
    neg_log_phi(k) = log((sum(neg(:,k)) + 1) / (neg_words + V));
    pos_log_phi(k) = log((sum(pos(:,k)) + 1) / (pos_words + V));
end

%----------------------------
% Get an informal sense of how indicative token i is for the SPAM class.
% The difference of logs equals log(exp(pos_log_phi) ./ exp(neg_log_phi)).
compare_log = pos_log_phi - neg_log_phi;
[sorted_vals, order] = sort(compare_log); % sorted_vals is in ascending order; order holds each value's position in the original vector
% The last 5 entries of order are the positions of the 5 largest values of
% compare_log. Look them up in the token list: these 5 tokens are the most
% indicative of spam.
top5 = order(end-4:end)
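The per-token loop in nb_train.m can also be written without a loop. The following is a minimal sketch on a made-up 4 x 3 count matrix (the numbers are hypothetical and not from MATRIX.TRAIN). It uses the same Laplace-smoothed estimate, (count of token k in the class + 1) / (total token count in the class + V), and checks that each smoothed distribution sums to 1 over the vocabulary.

```matlab
% Toy document-word matrix: 4 emails x 3 tokens (hypothetical counts).
trainMatrix   = [2 0 1; 0 3 0; 1 1 0; 0 0 4];
trainCategory = [1 0 0 1];            % 1 = spam, 0 = non-spam

V   = size(trainMatrix, 2);           % vocabulary size
pos = trainMatrix(trainCategory == 1, :);
neg = trainMatrix(trainCategory == 0, :);

% sum(..., 1) gives per-token counts within a class. Adding 1 to each count
% and V to the denominator is Laplace smoothing.
pos_log_phi = log((sum(pos, 1) + 1) / (sum(pos(:)) + V));
neg_log_phi = log((sum(neg, 1) + 1) / (sum(neg(:)) + V));

% Each smoothed distribution sums to 1 (up to rounding error).
disp(sum(exp(pos_log_phi)));
disp(sum(exp(neg_log_phi)));

% Rank tokens from most to least indicative of spam.
compare_log = pos_log_phi - neg_log_phi;
[~, order] = sort(compare_log, 'descend');
disp(order);
```

Because of the smoothing, a token that never appears in one class (token 2 in the toy spam emails) still gets a nonzero probability, so its logarithm stays finite.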

nb_test.m (run immediately after nb_train.m)

[spmatrix, tokenlist, category] = readMatrix('MATRIX.TEST');

testMatrix = full(spmatrix);
numTestDocs = size(testMatrix, 1);
numTokens = size(testMatrix, 2);

% Assume nb_train.m has just been executed, and all the parameters computed/needed
% by your classifier are in memory through that execution. You can also assume
% that the columns in the test set are arranged in exactly the same way as for the
% training set (i.e., the j-th column represents the same token in the test data
% matrix as in the original training data matrix).

% Write code below to classify each document in the test set (ie, each row
% in the current document word matrix) as 1 for SPAM and 0 for NON-SPAM.

% Construct the (numTestDocs x 1) vector 'output' such that the i-th entry
% of this vector is the predicted class (1/0) for the i-th email (i-th row
% in testMatrix) in the test set.
output = zeros(numTestDocs, 1);

%---------------
for k = 1:numTestDocs
    % Find the nonzero entries of row k: j holds the column (token)
    % indices and v the corresponding counts.
    [~, j, v] = find(testMatrix(k,:));
    % p(y=1|x) and p(y=0|x) share the same denominator, so only the
    % numerators need to be compared. The probabilities were stored as
    % logs during training, so the products become sums here.
    neg_posterior = sum(v .* neg_log_phi(j)) + neg_log_prior;
    pos_posterior = sum(v .* pos_log_phi(j)) + pos_log_prior;
    if (neg_posterior > pos_posterior)
        output(k) = 0;
    else
        output(k) = 1;
    end
end

%---------------
% Compute the error on the test set
% (numErrors rather than error, to avoid shadowing the built-in error function)
numErrors = 0;
for i = 1:numTestDocs
    if (category(i) ~= output(i))
        numErrors = numErrors + 1;
    end
end

% Print out the classification error on the test set
numErrors / numTestDocs
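The classification loop can likewise be collapsed into two matrix products, since a token with count zero contributes nothing to the sum of log-probabilities. Below is a small sketch with made-up parameters standing in for what nb_train.m would leave in memory; all values are hypothetical.

```matlab
% Hypothetical parameters, standing in for the results of nb_train.m.
pos_log_phi   = log([0.3 0.1 0.6]);
neg_log_phi   = log([0.25 0.625 0.125]);
pos_log_prior = log(0.5);
neg_log_prior = log(0.5);

% Toy test set: 3 emails x 3 tokens (the last email has no known tokens).
testMatrix = [0 2 0; 1 0 3; 0 0 0];

% Row k of each score is sum_j count(k, j) * log phi_j + log prior,
% i.e. the log of the numerator of the posterior for that class.
pos_score = testMatrix * pos_log_phi' + pos_log_prior;
neg_score = testMatrix * neg_log_phi' + neg_log_prior;

% Same tie-breaking as the loop: predict 0 only when non-spam scores
% strictly higher, otherwise predict 1.
output = double(pos_score >= neg_score);
disp(output');
```

Working in log space is what makes this practical: multiplying many small probabilities directly would underflow for long emails, while adding their logarithms does not.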

0 0