Binary Spam Classification: Naive Bayes vs. SVM (MATLAB)
- Preprocess
  - ReadFile
  - ProcessEmail
- NaiveBayes
  - Train
  - Classify
  - Example
- SVM
  - Train
  - Classify
  - Example
- Summary
This post works through the binary classification of spam email, comparing Naive Bayes against an SVM.
Given an email, the classifier decides whether it is spam (1) or not (0).
Preprocess
Preprocessing the emails.
ReadFile
First, read an email from disk and return its contents.
```matlab
function file_contents = readFile(filename)
% Load File
fid = fopen(filename);
if fid
    file_contents = fscanf(fid, '%c', inf);
    fclose(fid);
else
    file_contents = '';
    fprintf('Unable to open %s\n', filename);
end
end
```
ProcessEmail
The email text is preprocessed as follows:
- Convert the whole email to lowercase.
- Strip HTML markup.
- Replace every number with 'number'.
- Replace every URL with 'httpaddr'.
- Replace every email address with 'emailaddr'.
- Replace money symbols such as $ with 'dollar'.
- Stem each word to a common root, e.g. "discount, discounts, discounted" -> "discount"; "include, including, includes" -> "includ".
The replacements are implemented with regular expressions; stemming uses a Porter stemmer.
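The substitution steps above are easy to express with regular expressions; here is a minimal Python sketch of the same normalization (the patterns mirror the `regexprep` calls in `processEmail` below, and the sample email text is made up for illustration):

```python
import re

def normalize_email(text):
    """Apply the normalization steps listed above, in the same order."""
    text = text.lower()                                      # lowercase everything
    text = re.sub(r'<[^<>]+>', ' ', text)                    # strip HTML tags
    text = re.sub(r'[0-9]+', 'number', text)                 # digits -> 'number'
    text = re.sub(r'(http|https)://\S*', 'httpaddr', text)   # URLs -> 'httpaddr'
    text = re.sub(r'\S+@\S+', 'emailaddr', text)             # addresses -> 'emailaddr'
    text = re.sub(r'[$]+', 'dollar', text)                   # $ signs -> 'dollar'
    return text

print(normalize_email('<p>Win $100 now!</p> Visit http://spam.com or mail me@x.com'))
```

Tokenization and Porter stemming would follow these substitutions, as in the MATLAB code below.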
After normalization, the email is mapped onto a vocabulary list made up of words that occur with high frequency in spam. The vocabulary I use contains 1899 words.
The getVocabList function:
```matlab
function vocabList = getVocabList()
% Read the fixed vocabulary list
fid = fopen('vocab.txt');

% Store all dictionary words in cell array vocab{}
n = 1899;  % Total number of words in the dictionary
vocabList = cell(n, 1);
for i = 1:n
    % Word Index (can ignore since it will be = i)
    fscanf(fid, '%d', 1);
    % Actual Word
    vocabList{i} = fscanf(fid, '%s', 1);
end
fclose(fid);
end
```
The processEmail function:
```matlab
function word_indices = processEmail(email_contents)
% Load Vocabulary
vocabList = getVocabList();

% Init return value
word_indices = [];

% Lower case
email_contents = lower(email_contents);

% Strip all HTML
% Looks for any expression that starts with < and ends with >,
% does not have any < or > inside the tag, and replaces it with a space
email_contents = regexprep(email_contents, '<[^<>]+>', ' ');

% Handle Numbers
% Look for one or more characters between 0-9
email_contents = regexprep(email_contents, '[0-9]+', 'number');

% Handle URLS
% Look for strings starting with http:// or https://
email_contents = regexprep(email_contents, ...
    '(http|https)://[^\s]*', 'httpaddr');

% Handle Email Addresses
% Look for strings with @ in the middle
email_contents = regexprep(email_contents, '[^\s]+@[^\s]+', 'emailaddr');

% Handle $ sign
email_contents = regexprep(email_contents, '[$]+', 'dollar');

while ~isempty(email_contents)
    % Tokenize and also get rid of any punctuation
    [str, email_contents] = ...
        strtok(email_contents, ...
        [' @$/#.-:&*+=[]?!(){},''">_<;%' char(10) char(13)]);

    % Remove any non alphanumeric characters
    str = regexprep(str, '[^a-zA-Z0-9]', '');

    % Stem the word
    % (the porterStemmer sometimes has issues, so we use a try catch block)
    try
        str = porterStemmer(strtrim(str));
    catch
        str = '';
        continue;
    end

    % Skip the word if it is too short
    if length(str) < 1
        continue;
    end

    % Record the index of every vocabulary word that matches
    for i = 1:length(vocabList)
        if strcmp(str, vocabList{i})
            word_indices = [word_indices i];
        end
    end
end
end
```
The emailFeatures function:
```matlab
function x = emailFeatures(word_indices)
n = 1899;
x = zeros(n, 1);
x(word_indices) = 1;
end
```
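`emailFeatures` is just an indicator vector over the vocabulary; in NumPy terms it is one line of fancy indexing (a sketch with the same 1899-word vocabulary size, except that Python indices are 0-based where MATLAB's are 1-based):

```python
import numpy as np

def email_features(word_indices, n=1899):
    """Binary feature vector: x[i] = 1 iff vocabulary word i occurs in the email."""
    x = np.zeros(n)
    x[np.asarray(word_indices, dtype=int)] = 1  # duplicates in word_indices are harmless
    return x

x = email_features([3, 17, 3, 1898])  # the word at index 3 appears twice
```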
NaiveBayes
Train
By Bayes' theorem,

$$P(c \mid w_1, \dots, w_n) = \frac{P(w_1, \dots, w_n \mid c)\, P(c)}{P(w_1, \dots, w_n)}$$

where, under the naive conditional-independence assumption,

$$P(w_1, \dots, w_n \mid c) = \prod_{i} P(w_i \mid c)$$

For any given email the denominator is the same for both classes, so it suffices to compare

$$P(c) \prod_{i} P(w_i \mid c) \quad \text{for } c \in \{0, 1\}.$$
During training, Laplace smoothing is applied so that a word never seen in one class does not yield a zero probability (which would zero out the entire product). To avoid underflow when multiplying many small numbers, all probabilities are taken in logarithms (ln): this preserves monotonicity while turning multiplications into additions.
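The underflow problem is easy to demonstrate: multiplying a few hundred word probabilities drives the product to exactly 0 in double precision, while the sum of their logs stays perfectly representable (the probabilities here are made up for illustration):

```python
import math

# 400 word probabilities around 1e-3, as might come out of a 1899-word vocabulary
probs = [1e-3] * 400

product = 1.0
for p in probs:
    product *= p  # 1e-1200 is far below the ~1e-308 double-precision limit -> underflows to 0.0

log_sum = sum(math.log(p) for p in probs)  # about -2763, perfectly representable

print(product, log_sum)
```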
```matlab
function [p0Vec, p1Vec, pSpam] = trainNaiveBayes(X, y)
% Returns two vectors and one probability:
%   p0Vec is [log p(w_0|c_0) log p(w_1|c_0) ...]
%   p1Vec is [log p(w_0|c_1) log p(w_1|c_1) ...]
%   pSpam is the prior probability that an email is spam
numTrainDocs = size(X, 1);
numWords = size(X, 2);

% P(spam) = number of spam emails / number of training emails
pSpam = 1.0 * sum(y) / numTrainDocs;

% Laplace smoothing: start the counts at 1 and the denominators at 2
p0Num = ones(1, numWords);
p1Num = ones(1, numWords);
p0Denom = 2.0;
p1Denom = 2.0;
for i = 1:numTrainDocs
    vec_temp = X(i, :);
    if y(i) == 1
        p1Num = p1Num + vec_temp;
        p1Denom = p1Denom + sum(vec_temp);
    else
        p0Num = p0Num + vec_temp;
        p0Denom = p0Denom + sum(vec_temp);
    end
end

% Work in log space to avoid underflow
p0Vec = log(p0Num / p0Denom);
p1Vec = log(p1Num / p1Denom);
end
```
Classify
To classify, convert the email into a 0/1 vector X over the vocabulary, then sum the corresponding log-probabilities (the logs were already taken during training, so the product becomes a dot product).
```matlab
function label = classifyNaiveBayes(X, p0Vec, p1Vec, pSpam)
m = size(X, 1);
p0 = zeros(m, 1);
p1 = zeros(m, 1);
for i = 1:m
    p1(i, 1) = p1Vec * X(i, :)' + log(pSpam);
    p0(i, 1) = p0Vec * X(i, :)' + log(1.0 - pSpam);
end
label = (p1 > p0);
end
```
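The two functions translate almost line for line into NumPy. A toy end-to-end sketch with a made-up 4-word vocabulary and hand-written training "emails" (all data here is illustrative, not taken from the spam dataset):

```python
import numpy as np

def train_naive_bayes(X, y):
    """Laplace-smoothed, log-space class-conditional word probabilities."""
    p_spam = y.mean()                  # P(spam) = spam count / training count
    num = np.ones((2, X.shape[1]))     # Laplace: start counts at 1 ...
    denom = np.array([2.0, 2.0])       # ... and denominators at 2
    for c in (0, 1):
        num[c] += X[y == c].sum(axis=0)
        denom[c] += X[y == c].sum()
    return np.log(num[0] / denom[0]), np.log(num[1] / denom[1]), p_spam

def classify_naive_bayes(X, p0_vec, p1_vec, p_spam):
    """Compare log-posteriors; products of probabilities become dot products of logs."""
    p1 = X @ p1_vec + np.log(p_spam)
    p0 = X @ p0_vec + np.log(1.0 - p_spam)
    return (p1 > p0).astype(int)

# toy vocabulary: [free, money, meeting, report]
X = np.array([[1, 1, 0, 0],   # spam
              [1, 1, 0, 0],   # spam
              [0, 0, 1, 1],   # ham
              [0, 0, 1, 1]])  # ham
y = np.array([1, 1, 0, 0])
p0_vec, p1_vec, p_spam = train_naive_bayes(X, y)
pred = classify_naive_bayes(X, p0_vec, p1_vec, p_spam)
```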
Example
Load the training set and train:
```matlab
%% Train
clear; clc;
load('spamTrain.mat');
fprintf('\nTraining NaiveBayes (Spam Classification)\n')
fprintf('(this may take 1 to 2 minutes) ...\n')
[p0Vec, p1Vec, pSpam] = trainNaiveBayes(X, y);
t = classifyNaiveBayes(X, p0Vec, p1Vec, pSpam);
fprintf('Training Accuracy: %f\n', mean(double(t == y)) * 100);
```
Load the test set and evaluate:
```matlab
%% Test
load('spamTest.mat');
fprintf('\nEvaluating the trained NaiveBayes on a test set ...\n')
t = classifyNaiveBayes(Xtest, p0Vec, p1Vec, pSpam);
fprintf('Test Accuracy: %f\n', mean(double(t == ytest)) * 100);
```
Load a folder of raw emails and classify them:
```matlab
%% Example
fprintf('\n testing...');
for i = 1:25
    fileName = ['C:\Users\hujie\Desktop\ML\Support Vector Machines\email\spam\' sprintf('%d', i) '.txt'];
    file_contents = readFile(fileName);
    word_indices = processEmail(file_contents);
    if i == 1
        x = emailFeatures(word_indices);
        y = 1;
    else
        x = [x emailFeatures(word_indices)];
        y = [y; 1];
    end
    fprintf('.');
    fileName = ['C:\Users\hujie\Desktop\ML\Support Vector Machines\email\ham\' sprintf('%d', i) '.txt'];
    file_contents = readFile(fileName);
    word_indices = processEmail(file_contents);
    x = [x emailFeatures(word_indices)];
    y = [y; 0];
end
fprintf('done!\n');
x = x';
t = classifyNaiveBayes(x, p0Vec, p1Vec, pSpam);
fprintf('Test Accuracy: %f\n', mean(y == t) * 100.0);
```
SVM
The SVM is trained with MATLAB's built-in functions.
I also tried the libsvm package, but its documentation is sparse; for now the built-in functions are easier to work with.
Train
Training with a linear kernel produced zero error on the training set but a very large error on the test set, i.e. overfitting.
Training with an RBF kernel instead generalizes much better; rather than keeping the default kernel width, `rbf_sigma` is tuned by hand (70 in the code below).
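For reference, the Gaussian (RBF) kernel is K(x, z) = exp(-||x − z||² / (2σ²)). A minimal NumPy sketch with σ = 70, the `rbf_sigma` value used in the training call below (the feature indices here are made up):

```python
import numpy as np

def rbf_kernel(x, z, sigma=70.0):
    """Gaussian kernel: similarity decays with squared distance, scaled by sigma."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

# On 0/1 word vectors, ||x - z||^2 is just the number of words the two emails disagree on.
a = np.zeros(1899); a[[3, 17, 42]] = 1
b = np.zeros(1899); b[[3, 17, 99]] = 1
k = rbf_kernel(a, b)  # the two emails disagree on 2 words
```

With a large σ the kernel varies slowly across the data, giving a smoother decision boundary, which is one way to rein in the overfitting seen with the linear kernel.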
Classify
Pass the trained model and the data to classify into `svmclassify`.
Example
Load the training set and train:
```matlab
%% Train
load('spamTrain.mat');
fprintf('\nTraining Linear SVM (Spam Classification)\n')
fprintf('(this may take 1 to 2 minutes) ...\n')
model = svmtrain(X, y, 'kernel_function', 'rbf', 'rbf_sigma', 70);
p = svmclassify(model, X);
fprintf('Training Accuracy: %f\n', mean(double(p == y)) * 100);
```
Load the test set and evaluate:
```matlab
%% Test
load('spamTest.mat');
fprintf('\nEvaluating the trained Linear SVM on a test set ...\n')
p = svmclassify(model, Xtest);
fprintf('Test Accuracy: %f\n', mean(double(p == ytest)) * 100);
```
Load a folder of raw emails and classify them:
```matlab
%% Example
fprintf('\n testing...');
for i = 1:25
    fileName = ['C:\Users\hujie\Desktop\ML\Support Vector Machines\email\spam\' sprintf('%d', i) '.txt'];
    file_contents = readFile(fileName);
    word_indices = processEmail(file_contents);
    if i == 1
        x = emailFeatures(word_indices);
        y = 1;
    else
        x = [x emailFeatures(word_indices)];
        y = [y; 1];
    end
    fprintf('.');
    fileName = ['C:\Users\hujie\Desktop\ML\Support Vector Machines\email\ham\' sprintf('%d', i) '.txt'];
    file_contents = readFile(fileName);
    word_indices = processEmail(file_contents);
    x = [x emailFeatures(word_indices)];
    y = [y; 0];
end
fprintf('done!\n');
x = x';
p = svmclassify(model, x);
fprintf('Test Accuracy: %f\n', mean(y == p) * 100.0);
```
Summary
Test results:
```
Training Linear SVM (Spam Classification)
(this may take 1 to 2 minutes) ...
Training Accuracy: 98.175000

Evaluating the trained Linear SVM on a test set ...
Test Accuracy: 97.900000

 testing............................done!
Test Accuracy: 92.000000

Training NaiveBayes (Spam Classification)
(this may take 1 to 2 minutes) ...
Training Accuracy: 97.200000

Evaluating the trained NaiveBayes on a test set ...
Test Accuracy: 97.300000

 testing............................done!
Test Accuracy: 98.000000
```
On the 50 sample emails the SVM reached 92% accuracy, and getting there required experimenting with different kernels and kernel parameters.
Naive Bayes reached 98% on the same emails, and its training is cheaper than the SVM's.
So, at least as far as this experiment goes, the Naive Bayes classifier is simpler in principle and more efficient to implement and use than the SVM.