Spam Email Binary Classification: Naive Bayes vs. SVM (MATLAB)


  • Preprocess
    • ReadFile
    • ProcessEmail
  • NaiveBayes
    • Train
    • Classify
    • Example
  • SVM
    • Train
    • Classify
    • Example
  • Summary

This post compares Naive Bayes and SVM on the binary classification of spam email.
Given an email, the classifier decides whether it is spam (1) or not (0).

Preprocess

Preprocessing of the emails.

ReadFile

First, read in an email and return its contents.

function file_contents = readFile(filename)
% Load File
    fid = fopen(filename);
    if fid ~= -1   % fopen returns -1 on failure, which is truthy, so test explicitly
        file_contents = fscanf(fid, '%c', inf);
        fclose(fid);
    else
        file_contents = '';
        fprintf('Unable to open %s\n', filename);
    end
end

ProcessEmail

The email is preprocessed in the following ways:

  • Convert the whole email to lowercase.
  • Strip HTML tags.
  • Replace numbers with 'number'.
  • Replace URLs with 'httpaddr'.
  • Replace email addresses with 'emailaddr'.
  • Replace money symbols with 'dollar', and so on.
  • Stem each word, e.g., "discount, discounts, discounted" -> "discount"; "include, including, includes" -> "includ".

These transformations are implemented with regular expressions.
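
As a quick illustration, here is a toy string of my own (not from the dataset) run through the same substitutions, in the same order as the processEmail function below:

% Toy demonstration of the regexprep substitutions (sketch)
s = lower('Visit http://example.com now, only $100 off!');
s = regexprep(s, '<[^<>]+>', ' ');                     % strip HTML tags
s = regexprep(s, '[0-9]+', 'number');                  % numbers
s = regexprep(s, '(http|https)://[^\s]*', 'httpaddr'); % URLs
s = regexprep(s, '[^\s]+@[^\s]+', 'emailaddr');        % email addresses
s = regexprep(s, '[$]+', 'dollar');                    % dollar signs
% s is now: 'visit httpaddr now, only dollarnumber off!'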

After preprocessing, the email is mapped onto a vocabulary list, a dataset made up of words that occur with high frequency in spam. The vocabulary I use contains 1899 words.

The getVocabList function

function vocabList = getVocabList()
%% Read the fixed vocabulary list
    fid = fopen('vocab.txt');
    % Store all dictionary words in cell array vocab{}
    n = 1899;  % Total number of words in the dictionary
    vocabList = cell(n, 1);
    for i = 1:n
        % Word Index (can ignore since it will be = i)
        fscanf(fid, '%d', 1);
        % Actual Word
        vocabList{i} = fscanf(fid, '%s', 1);
    end
    fclose(fid);
end
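
For reference, getVocabList expects vocab.txt to contain one index-word pair per line, matching the two fscanf calls above. A hypothetical excerpt (the actual file is not reproduced in this post; the entries shown are assumed):

1    aa
2    ab
...
1899 zip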

The processEmail function

function word_indices = processEmail(email_contents)
% Load Vocabulary
    vocabList = getVocabList();
% Init return value
    word_indices = [];
% Lower case
    email_contents = lower(email_contents);
% Strip all HTML
% Looks for any expression that starts with < and ends with >,
% does not contain any < or > inside the tag, and replaces it with a space
    email_contents = regexprep(email_contents, '<[^<>]+>', ' ');
% Handle Numbers
% Look for one or more characters between 0-9
    email_contents = regexprep(email_contents, '[0-9]+', 'number');
% Handle URLS
% Look for strings starting with http:// or https://
    email_contents = regexprep(email_contents, ...
                           '(http|https)://[^\s]*', 'httpaddr');
% Handle Email Addresses
% Look for strings with @ in the middle
    email_contents = regexprep(email_contents, '[^\s]+@[^\s]+', 'emailaddr');
% Handle $ sign
    email_contents = regexprep(email_contents, '[$]+', 'dollar');

    while ~isempty(email_contents)
        % Tokenize and also get rid of any punctuation
        [str, email_contents] = ...
           strtok(email_contents, ...
                  [' @$/#.-:&*+=[]?!(){},''">_<;%' char(10) char(13)]);
        % Remove any non alphanumeric characters
        str = regexprep(str, '[^a-zA-Z0-9]', '');
        % Stem the word
        % (the porterStemmer sometimes has issues, so we use a try catch block)
        try
            str = porterStemmer(strtrim(str));
        catch
            str = '';
            continue;
        end
        % Skip the word if it is too short
        if length(str) < 1
           continue;
        end
        % Look up the word in the vocabulary and record its index
        for i = 1 : length(vocabList)
            if strcmp(str, vocabList{i})
                word_indices = [word_indices i];
            end
        end
    end
end

The emailFeatures function

function x = emailFeatures(word_indices)
% Map the word indices to a 0/1 feature vector over the 1899-word vocabulary
    n = 1899;
    x = zeros(n, 1);
    x(word_indices) = 1;
end
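
Putting the three functions together, a single email is turned into a feature vector like this ('emailSample1.txt' is just a placeholder filename):

% Usage sketch: from raw email text to a 1899-by-1 0/1 feature vector
file_contents = readFile('emailSample1.txt');
word_indices  = processEmail(file_contents);
x = emailFeatures(word_indices);
fprintf('Non-zero feature entries: %d\n', sum(x > 0));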

NaiveBayes

Train

By Bayes' theorem,

$$p(c_i \mid \vec{w}) = \frac{p(\vec{w} \mid c_i)\, p(c_i)}{p(\vec{w})}$$

where $p(c_i \mid \vec{w})$ is the probability that a given vector $\vec{w}$ belongs to class $c_i$; the final prediction is the class for which this probability is largest. Since $p(\vec{w})$ is the same for every class, it does not affect the comparison and need not be computed.

$p(\vec{w} \mid c_i)$ is the probability of observing the vocabulary vector $\vec{w}$ given class $c_i$. Assuming the words do not influence one another, i.e. they are mutually independent, we get

$$p(\vec{w} \mid c_i) = p(w_0, w_1, \ldots, w_n \mid c_i) = p(w_0 \mid c_i)\, p(w_1 \mid c_i) \cdots p(w_n \mid c_i)$$

For any $0 \le j \le n$, $p(w_j \mid c_i)$ is estimated as

$$p(w_j \mid c_i) = \frac{N_{j,i}}{N_i}$$

where $N_{j,i}$ is the number of times word $w_j$ occurs in class-$c_i$ emails and $N_i$ is the total word count over all class-$c_i$ emails.

$p(c_i)$ is the prior probability of class $c_i$.

During training, Laplace smoothing is applied to avoid zero-probability estimates; and to prevent underflow when many small numbers are multiplied, all probabilities are taken in logarithm (ln), which turns the products into sums without changing the ordering of the class scores.
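
Concretely, with the initialization used in the code below (numerator counts start at 1, denominators at 2), the smoothed estimate and the log-space decision rule are:

$$\hat{p}(w_j \mid c_i) = \frac{N_{j,i} + 1}{N_i + 2}, \qquad \mathrm{score}_i(\vec{x}) = \ln p(c_i) + \sum_{j} x_j \ln \hat{p}(w_j \mid c_i)$$

and an email with 0/1 feature vector $\vec{x}$ is labeled spam iff $\mathrm{score}_1(\vec{x}) > \mathrm{score}_0(\vec{x})$. (Textbook Laplace smoothing for a multinomial model would add the vocabulary size to the denominator; adding 2 is simply the choice this implementation makes.)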

function [p0Vec, p1Vec, pSpam] = trainNaiveBayes(X, y)
% Returns two vectors and one probability:
% p0Vec is [log p(w_0|c_0) log p(w_1|c_0) ...]
% p1Vec is [log p(w_0|c_1) log p(w_1|c_1) ...]
% pSpam is the prior probability that an email is spam
    numTrainDocs = size(X, 1);
    numWords = size(X, 2);
    % p{spam} = number of spam emails / number of training examples
    pSpam = 1.0 * sum(y) / numTrainDocs;
    % Laplace smoothing: counts start at 1, denominators at 2
    p0Num = ones(1, numWords);
    p1Num = ones(1, numWords);
    p0Denom = 2.0;
    p1Denom = 2.0;
    for i = 1 : numTrainDocs
        vec_temp = X(i, :);
        if y(i) == 1
            p1Num = p1Num + vec_temp;
            p1Denom = p1Denom + sum(vec_temp);
        else
            p0Num = p0Num + vec_temp;
            p0Denom = p0Denom + sum(vec_temp);
        end
    end
    % Take logs to avoid underflow at classification time
    p0Vec = log(p0Num / p0Denom);
    p1Vec = log(p1Num / p1Denom);
end
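
For reference, the training loop above can also be vectorized away; a minimal sketch, assuming the same inputs (X an m-by-n 0/1 matrix, y an m-by-1 0/1 vector):

% Vectorized equivalent of the training loop (sketch)
spamRows = (y == 1);
p1Num   = 1 + sum(X(spamRows, :), 1);    % Laplace-smoothed word counts, spam
p0Num   = 1 + sum(X(~spamRows, :), 1);   % Laplace-smoothed word counts, ham
p1Denom = 2 + sum(sum(X(spamRows, :)));  % smoothed total word count in spam
p0Denom = 2 + sum(sum(X(~spamRows, :)));
p1Vec = log(p1Num / p1Denom);            % elementwise division by a scalar
p0Vec = log(p0Num / p0Denom);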

Classify

Convert the email into a 0/1 vector X over the vocabulary, then sum the corresponding probabilities (already in log form).

function label = classifyNaiveBayes(X, p0Vec, p1Vec, pSpam)
% X is an m-by-n matrix of 0/1 feature vectors, one email per row
    m = size(X, 1);
    p1 = zeros(m, 1);
    p0 = zeros(m, 1);
    for i = 1 : m
        p1(i, 1) = p1Vec * X(i, :)' + log(pSpam);
        p0(i, 1) = p0Vec * X(i, :)' + log(1.0 - pSpam);
    end
    label = (p1 > p0);
end
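
The per-email loop can likewise be replaced by a single matrix product; a sketch under the same assumptions:

% Vectorized classification (sketch): X is m-by-n, p0Vec/p1Vec are 1-by-n
p1 = X * p1Vec' + log(pSpam);
p0 = X * p0Vec' + log(1.0 - pSpam);
label = (p1 > p0);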

Example

Load the training set and train.

%% Train
clear; clc;
load('spamTrain.mat');
fprintf('\nTraining NaiveBayes (Spam Classification)\n')
fprintf('(this may take 1 to 2 minutes) ...\n')
[p0Vec, p1Vec, pSpam] = trainNaiveBayes(X, y);
t = classifyNaiveBayes(X, p0Vec, p1Vec, pSpam);
fprintf('Training Accuracy: %f\n', mean(double(t == y)) * 100);

Load the test set and evaluate.

%% Test
load('spamTest.mat');
fprintf('\nEvaluating the trained NaiveBayes on a test set ...\n')
t = classifyNaiveBayes(Xtest, p0Vec, p1Vec, pSpam);
fprintf('Test Accuracy: %f\n', mean(double(t == ytest)) * 100);

Load the set of example emails and classify them.

%% Example
fprintf('\n testing...');
for i = 1 : 25
    fileName = ['C:\Users\hujie\Desktop\ML\Support Vector Machines\email\spam\' ...
                sprintf('%d', i) '.txt'];
    file_contents = readFile(fileName);
    word_indices  = processEmail(file_contents);
    if i == 1
        x = emailFeatures(word_indices);
        y = 1;
    else
        x = [x emailFeatures(word_indices)];
        y = [y; 1];
    end
    fprintf('.');
    fileName = ['C:\Users\hujie\Desktop\ML\Support Vector Machines\email\ham\' ...
                sprintf('%d', i) '.txt'];
    file_contents = readFile(fileName);
    word_indices  = processEmail(file_contents);
    x = [x emailFeatures(word_indices)];
    y = [y; 0];
end
fprintf('done!\n');
x = x';
t = classifyNaiveBayes(x, p0Vec, p1Vec, pSpam);
fprintf('Test Accuracy: %f\n', mean(y == t) * 100.0);

SVM

The SVM is trained with MATLAB's built-in library functions.
I tried the libsvm package as well, but given the relative scarcity of documentation, I currently find MATLAB's built-in functions easier to use.
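
A note for newer MATLAB versions: the svmtrain/svmclassify functions used below belong to an older Statistics Toolbox API that was later deprecated and removed. On a recent release, the same experiment could be run with fitcsvm/predict; a sketch, where the KernelScale value is only an assumed rough analogue of rbf_sigma = 70 (the two parameterizations are not numerically identical):

% Sketch with the current Statistics and Machine Learning Toolbox API
model = fitcsvm(X, y, 'KernelFunction', 'rbf', 'KernelScale', 70);
p = predict(model, Xtest);
fprintf('Test Accuracy: %f\n', mean(double(p == ytest)) * 100);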

Train

Training with a linear kernel produced no error on the training set but a very large error on the test set.
Training with an RBF kernel at the default σ = 1 also gave poor results; after tuning the parameter (σ = 70 is used below), much better performance was obtained.
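
The tuning can be done with a simple hold-out search; a minimal sketch using the same svmtrain API, where Xval/yval are an assumed held-out validation split and the candidate values are illustrative:

% Hold-out search over rbf_sigma (sketch)
sigmas = [1 10 30 70 100];
bestAcc = 0; bestSigma = sigmas(1);
for s = sigmas
    m = svmtrain(X, y, 'kernel_function', 'rbf', 'rbf_sigma', s);
    acc = mean(double(svmclassify(m, Xval) == yval));
    if acc > bestAcc
        bestAcc = acc;
        bestSigma = s;
    end
end
fprintf('Best sigma = %g (validation accuracy %.2f%%)\n', bestSigma, bestAcc * 100);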

Classify

Pass the trained model and the data to be classified to the classification function.

Example

Load the training set and train.

%% Train
load('spamTrain.mat');
fprintf('\nTraining SVM with RBF kernel (Spam Classification)\n')
fprintf('(this may take 1 to 2 minutes) ...\n')
model = svmtrain(X, y, 'kernel_function', 'rbf', 'rbf_sigma', 70);
p = svmclassify(model, X);
fprintf('Training Accuracy: %f\n', mean(double(p == y)) * 100);

Load the test set and evaluate.

%% Test
load('spamTest.mat');
fprintf('\nEvaluating the trained SVM on a test set ...\n')
p = svmclassify(model, Xtest);
fprintf('Test Accuracy: %f\n', mean(double(p == ytest)) * 100);

Load the set of example emails and classify them.

%% Example
fprintf('\n testing...');
for i = 1 : 25
    fileName = ['C:\Users\hujie\Desktop\ML\Support Vector Machines\email\spam\' ...
                sprintf('%d', i) '.txt'];
    file_contents = readFile(fileName);
    word_indices  = processEmail(file_contents);
    if i == 1
        x = emailFeatures(word_indices);
        y = 1;
    else
        x = [x emailFeatures(word_indices)];
        y = [y; 1];
    end
    fprintf('.');
    fileName = ['C:\Users\hujie\Desktop\ML\Support Vector Machines\email\ham\' ...
                sprintf('%d', i) '.txt'];
    file_contents = readFile(fileName);
    word_indices  = processEmail(file_contents);
    x = [x emailFeatures(word_indices)];
    y = [y; 0];
end
fprintf('done!\n');
x = x';
p = svmclassify(model, x);
fprintf('Test Accuracy: %f\n', mean(y == p) * 100.0);

Summary

Test results:

Training SVM with RBF kernel (Spam Classification)
(this may take 1 to 2 minutes) ...
Training Accuracy: 98.175000

Evaluating the trained SVM on a test set ...
Test Accuracy: 97.900000

 testing............................done!
Test Accuracy: 92.000000

Training NaiveBayes (Spam Classification)
(this may take 1 to 2 minutes) ...
Training Accuracy: 97.200000

Evaluating the trained NaiveBayes on a test set ...
Test Accuracy: 97.300000

 testing............................done!
Test Accuracy: 98.000000

On the example emails, the SVM reached 92% accuracy, and getting there required experimenting with different kernels and kernel parameters.
Naive Bayes reached 98% accuracy on the same emails, and its training is cheaper than the SVM's.
So, as far as my current understanding goes, the Naive Bayes classifier is simpler in principle and more efficient to implement and use than the SVM for this task.
