Spam Email Binary Classification: Naive Bayes vs. SVM (MATLAB)


  • Preprocess
    • ReadFile
    • ProcessEmail
  • NaiveBayes
    • Train
    • Classify
    • Example
  • SVM
    • Train
    • Classify
    • Example
  • Summary

This post compares Naive Bayes and SVM on the binary classification of spam email.
Given an email, the classifier decides whether it is spam (1) or not (0).

Preprocess

Preprocessing of the emails.

ReadFile

First, read in an email and return its contents.

function file_contents = readFile(filename)
% Load File
    fid = fopen(filename);
    if fid ~= -1   % fopen returns -1 on failure, which is truthy, so test explicitly
        file_contents = fscanf(fid, '%c', inf);
        fclose(fid);
    else
        file_contents = '';
        fprintf('Unable to open %s\n', filename);
    end
end

ProcessEmail

The email is preprocessed in the following ways:

  • Convert the whole email to lowercase.
  • Strip HTML tags.
  • Replace numbers with 'number'.
  • Replace URLs with 'httpaddr'.
  • Replace email addresses with 'emailaddr'.
  • Replace money symbols with 'dollar', and so on.
  • Stem each word, e.g., "discount, discounts, discounted" -> "discount"; "include, including, includes" -> "includ".

These transformations are implemented with regular expressions.
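
As a quick illustration, here is a toy string of my own (not from the dataset) run through the same substitutions, in the same order as the processEmail function below:

% Toy demonstration of the regexprep substitutions (sketch)
s = lower('Visit http://example.com now, only $100 off!');
s = regexprep(s, '<[^<>]+>', ' ');                     % strip HTML tags
s = regexprep(s, '[0-9]+', 'number');                  % numbers
s = regexprep(s, '(http|https)://[^\s]*', 'httpaddr'); % URLs
s = regexprep(s, '[^\s]+@[^\s]+', 'emailaddr');        % email addresses
s = regexprep(s, '[$]+', 'dollar');                    % dollar signs
% s is now: 'visit httpaddr now, only dollarnumber off!'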

After preprocessing, the email is mapped onto a vocabulary list, a dataset made up of words that occur with high frequency in spam. The vocabulary I use contains 1899 words.

The getVocabList function

function vocabList = getVocabList()
%% Read the fixed vocabulary list
    fid = fopen('vocab.txt');
    % Store all dictionary words in cell array vocab{}
    n = 1899;  % Total number of words in the dictionary
    vocabList = cell(n, 1);
    for i = 1:n
        % Word Index (can ignore since it will be = i)
        fscanf(fid, '%d', 1);
        % Actual Word
        vocabList{i} = fscanf(fid, '%s', 1);
    end
    fclose(fid);
end
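
For reference, getVocabList expects vocab.txt to contain one index-word pair per line, matching the two fscanf calls above. A hypothetical excerpt (the actual file is not reproduced in this post; the entries shown are assumed):

1    aa
2    ab
...
1899 zip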

The processEmail function

function word_indices = processEmail(email_contents)
% Load Vocabulary
    vocabList = getVocabList();
% Init return value
    word_indices = [];
% Lower case
    email_contents = lower(email_contents);
% Strip all HTML
% Looks for any expression that starts with < and ends with >,
% does not contain any < or > inside the tag, and replaces it with a space
    email_contents = regexprep(email_contents, '<[^<>]+>', ' ');
% Handle Numbers
% Look for one or more characters between 0-9
    email_contents = regexprep(email_contents, '[0-9]+', 'number');
% Handle URLS
% Look for strings starting with http:// or https://
    email_contents = regexprep(email_contents, ...
                           '(http|https)://[^\s]*', 'httpaddr');
% Handle Email Addresses
% Look for strings with @ in the middle
    email_contents = regexprep(email_contents, '[^\s]+@[^\s]+', 'emailaddr');
% Handle $ sign
    email_contents = regexprep(email_contents, '[$]+', 'dollar');

    while ~isempty(email_contents)
        % Tokenize and also get rid of any punctuation
        [str, email_contents] = ...
           strtok(email_contents, ...
                  [' @$/#.-:&*+=[]?!(){},''">_<;%' char(10) char(13)]);
        % Remove any non alphanumeric characters
        str = regexprep(str, '[^a-zA-Z0-9]', '');
        % Stem the word
        % (the porterStemmer sometimes has issues, so we use a try catch block)
        try
            str = porterStemmer(strtrim(str));
        catch
            str = '';
            continue;
        end
        % Skip the word if it is too short
        if length(str) < 1
           continue;
        end
        % Look up the word in the vocabulary and record its index
        for i = 1 : length(vocabList)
            if strcmp(str, vocabList{i})
                word_indices = [word_indices i];
            end
        end
    end
end

The emailFeatures function

function x = emailFeatures(word_indices)
% Map the word indices to a 0/1 feature vector over the 1899-word vocabulary
    n = 1899;
    x = zeros(n, 1);
    x(word_indices) = 1;
end
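
Putting the three functions together, a single email is turned into a feature vector like this ('emailSample1.txt' is just a placeholder filename):

% Usage sketch: from raw email text to a 1899-by-1 0/1 feature vector
file_contents = readFile('emailSample1.txt');
word_indices  = processEmail(file_contents);
x = emailFeatures(word_indices);
fprintf('Non-zero feature entries: %d\n', sum(x > 0));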

NaiveBayes

Train

By Bayes' theorem,

$$p(c_i \mid \vec{w}) = \frac{p(\vec{w} \mid c_i)\, p(c_i)}{p(\vec{w})}$$

where $p(c_i \mid \vec{w})$ is the probability that a given vector $\vec{w}$ belongs to class $c_i$; the final prediction is the class for which this probability is largest. Since $p(\vec{w})$ is the same for every class, it does not affect the comparison and need not be computed.

$p(\vec{w} \mid c_i)$ is the probability of observing the vocabulary vector $\vec{w}$ given class $c_i$. Assuming the words do not influence one another, i.e. they are mutually independent, we get

$$p(\vec{w} \mid c_i) = p(w_0, w_1, \ldots, w_n \mid c_i) = p(w_0 \mid c_i)\, p(w_1 \mid c_i) \cdots p(w_n \mid c_i)$$

For any $0 \le j \le n$, $p(w_j \mid c_i)$ is estimated as

$$p(w_j \mid c_i) = \frac{N_{j,i}}{N_i}$$

where $N_{j,i}$ is the number of times word $w_j$ occurs in class-$c_i$ emails and $N_i$ is the total word count over all class-$c_i$ emails.

$p(c_i)$ is the prior probability of class $c_i$.

During training, Laplace smoothing is applied to avoid zero-probability estimates; and to prevent underflow when many small numbers are multiplied, all probabilities are taken in logarithm (ln), which turns the products into sums without changing the ordering of the class scores.
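
Concretely, with the initialization used in the code below (numerator counts start at 1, denominators at 2), the smoothed estimate and the log-space decision rule are:

$$\hat{p}(w_j \mid c_i) = \frac{N_{j,i} + 1}{N_i + 2}, \qquad \mathrm{score}_i(\vec{x}) = \ln p(c_i) + \sum_{j} x_j \ln \hat{p}(w_j \mid c_i)$$

and an email with 0/1 feature vector $\vec{x}$ is labeled spam iff $\mathrm{score}_1(\vec{x}) > \mathrm{score}_0(\vec{x})$. (Textbook Laplace smoothing for a multinomial model would add the vocabulary size to the denominator; adding 2 is simply the choice this implementation makes.)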

function [p0Vec, p1Vec, pSpam] = trainNaiveBayes(X, y)
% Returns two vectors and one probability:
% p0Vec is [log p(w_0|c_0) log p(w_1|c_0) ...]
% p1Vec is [log p(w_0|c_1) log p(w_1|c_1) ...]
% pSpam is the prior probability that an email is spam
    numTrainDocs = size(X, 1);
    numWords = size(X, 2);
    % p{spam} = number of spam emails / number of training examples
    pSpam = 1.0 * sum(y) / numTrainDocs;
    % Laplace smoothing: counts start at 1, denominators at 2
    p0Num = ones(1, numWords);
    p1Num = ones(1, numWords);
    p0Denom = 2.0;
    p1Denom = 2.0;
    for i = 1 : numTrainDocs
        vec_temp = X(i, :);
        if y(i) == 1
            p1Num = p1Num + vec_temp;
            p1Denom = p1Denom + sum(vec_temp);
        else
            p0Num = p0Num + vec_temp;
            p0Denom = p0Denom + sum(vec_temp);
        end
    end
    % Take logs to avoid underflow at classification time
    p0Vec = log(p0Num / p0Denom);
    p1Vec = log(p1Num / p1Denom);
end
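
For reference, the training loop above can also be vectorized away; a minimal sketch, assuming the same inputs (X an m-by-n 0/1 matrix, y an m-by-1 0/1 vector):

% Vectorized equivalent of the training loop (sketch)
spamRows = (y == 1);
p1Num   = 1 + sum(X(spamRows, :), 1);    % Laplace-smoothed word counts, spam
p0Num   = 1 + sum(X(~spamRows, :), 1);   % Laplace-smoothed word counts, ham
p1Denom = 2 + sum(sum(X(spamRows, :)));  % smoothed total word count in spam
p0Denom = 2 + sum(sum(X(~spamRows, :)));
p1Vec = log(p1Num / p1Denom);            % elementwise division by a scalar
p0Vec = log(p0Num / p0Denom);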

Classify

Convert the email into a 0/1 vector X over the vocabulary, then sum the corresponding probabilities (already in log form).

function label = classifyNaiveBayes(X, p0Vec, p1Vec, pSpam)
% X is an m-by-n matrix of 0/1 feature vectors, one email per row
    m = size(X, 1);
    p1 = zeros(m, 1);
    p0 = zeros(m, 1);
    for i = 1 : m
        p1(i, 1) = p1Vec * X(i, :)' + log(pSpam);
        p0(i, 1) = p0Vec * X(i, :)' + log(1.0 - pSpam);
    end
    label = (p1 > p0);
end
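
The per-email loop can likewise be replaced by a single matrix product; a sketch under the same assumptions:

% Vectorized classification (sketch): X is m-by-n, p0Vec/p1Vec are 1-by-n
p1 = X * p1Vec' + log(pSpam);
p0 = X * p0Vec' + log(1.0 - pSpam);
label = (p1 > p0);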

Example

Load the training set and train.

%% Train
clear; clc;
load('spamTrain.mat');
fprintf('\nTraining NaiveBayes (Spam Classification)\n')
fprintf('(this may take 1 to 2 minutes) ...\n')
[p0Vec, p1Vec, pSpam] = trainNaiveBayes(X, y);
t = classifyNaiveBayes(X, p0Vec, p1Vec, pSpam);
fprintf('Training Accuracy: %f\n', mean(double(t == y)) * 100);

Load the test set and evaluate.

%% Test
load('spamTest.mat');
fprintf('\nEvaluating the trained NaiveBayes on a test set ...\n')
t = classifyNaiveBayes(Xtest, p0Vec, p1Vec, pSpam);
fprintf('Test Accuracy: %f\n', mean(double(t == ytest)) * 100);

Load the set of example emails and classify them.

%% Example
fprintf('\n testing...');
for i = 1 : 25
    fileName = ['C:\Users\hujie\Desktop\ML\Support Vector Machines\email\spam\' ...
                sprintf('%d', i) '.txt'];
    file_contents = readFile(fileName);
    word_indices  = processEmail(file_contents);
    if i == 1
        x = emailFeatures(word_indices);
        y = 1;
    else
        x = [x emailFeatures(word_indices)];
        y = [y; 1];
    end
    fprintf('.');
    fileName = ['C:\Users\hujie\Desktop\ML\Support Vector Machines\email\ham\' ...
                sprintf('%d', i) '.txt'];
    file_contents = readFile(fileName);
    word_indices  = processEmail(file_contents);
    x = [x emailFeatures(word_indices)];
    y = [y; 0];
end
fprintf('done!\n');
x = x';
t = classifyNaiveBayes(x, p0Vec, p1Vec, pSpam);
fprintf('Test Accuracy: %f\n', mean(y == t) * 100.0);

SVM

The SVM is trained with MATLAB's built-in library functions.
I tried the libsvm package as well, but given the relative scarcity of documentation, I currently find MATLAB's built-in functions easier to use.
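
A note for newer MATLAB versions: the svmtrain/svmclassify functions used below belong to an older Statistics Toolbox API that was later deprecated and removed. On a recent release, the same experiment could be run with fitcsvm/predict; a sketch, where the KernelScale value is only an assumed rough analogue of rbf_sigma = 70 (the two parameterizations are not numerically identical):

% Sketch with the current Statistics and Machine Learning Toolbox API
model = fitcsvm(X, y, 'KernelFunction', 'rbf', 'KernelScale', 70);
p = predict(model, Xtest);
fprintf('Test Accuracy: %f\n', mean(double(p == ytest)) * 100);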

Train

Training with a linear kernel produced no error on the training set but a very large error on the test set.
Training with an RBF kernel at the default σ = 1 also gave poor results; after tuning the parameter (σ = 70 is used below), much better performance was obtained.
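
The tuning can be done with a simple hold-out search; a minimal sketch using the same svmtrain API, where Xval/yval are an assumed held-out validation split and the candidate values are illustrative:

% Hold-out search over rbf_sigma (sketch)
sigmas = [1 10 30 70 100];
bestAcc = 0; bestSigma = sigmas(1);
for s = sigmas
    m = svmtrain(X, y, 'kernel_function', 'rbf', 'rbf_sigma', s);
    acc = mean(double(svmclassify(m, Xval) == yval));
    if acc > bestAcc
        bestAcc = acc;
        bestSigma = s;
    end
end
fprintf('Best sigma = %g (validation accuracy %.2f%%)\n', bestSigma, bestAcc * 100);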

Classify

Pass the trained model and the data to be classified to the classification function.

Example

Load the training set and train.

%% Train
load('spamTrain.mat');
fprintf('\nTraining SVM with RBF kernel (Spam Classification)\n')
fprintf('(this may take 1 to 2 minutes) ...\n')
model = svmtrain(X, y, 'kernel_function', 'rbf', 'rbf_sigma', 70);
p = svmclassify(model, X);
fprintf('Training Accuracy: %f\n', mean(double(p == y)) * 100);

Load the test set and evaluate.

%% Test
load('spamTest.mat');
fprintf('\nEvaluating the trained SVM on a test set ...\n')
p = svmclassify(model, Xtest);
fprintf('Test Accuracy: %f\n', mean(double(p == ytest)) * 100);

Load the set of example emails and classify them.

%% Example
fprintf('\n testing...');
for i = 1 : 25
    fileName = ['C:\Users\hujie\Desktop\ML\Support Vector Machines\email\spam\' ...
                sprintf('%d', i) '.txt'];
    file_contents = readFile(fileName);
    word_indices  = processEmail(file_contents);
    if i == 1
        x = emailFeatures(word_indices);
        y = 1;
    else
        x = [x emailFeatures(word_indices)];
        y = [y; 1];
    end
    fprintf('.');
    fileName = ['C:\Users\hujie\Desktop\ML\Support Vector Machines\email\ham\' ...
                sprintf('%d', i) '.txt'];
    file_contents = readFile(fileName);
    word_indices  = processEmail(file_contents);
    x = [x emailFeatures(word_indices)];
    y = [y; 0];
end
fprintf('done!\n');
x = x';
p = svmclassify(model, x);
fprintf('Test Accuracy: %f\n', mean(y == p) * 100.0);

Summary

Test results:

Training SVM with RBF kernel (Spam Classification)
(this may take 1 to 2 minutes) ...
Training Accuracy: 98.175000

Evaluating the trained SVM on a test set ...
Test Accuracy: 97.900000

 testing............................done!
Test Accuracy: 92.000000

Training NaiveBayes (Spam Classification)
(this may take 1 to 2 minutes) ...
Training Accuracy: 97.200000

Evaluating the trained NaiveBayes on a test set ...
Test Accuracy: 97.300000

 testing............................done!
Test Accuracy: 98.000000

On the example emails, the SVM reached 92% accuracy, and getting there required experimenting with different kernels and kernel parameters.
Naive Bayes reached 98% accuracy on the same emails, and its training is cheaper than the SVM's.
So, as far as my current understanding goes, the Naive Bayes classifier is simpler in principle and more efficient to implement and use than the SVM for this task.
