编程作业(六)
来源:互联网 发布:tensorflow windows10 编辑:程序博客网 时间:2024/06/04 01:02
支持向量机
支持向量机
本部分练习,我们将在2D示例数据集上使用支持向量机。通过在这些数据集上使用支持向量机,将帮助我们初识支持向量机的运行原理,以及如何使用高斯核函数支持向量机。
任务一 示例数据集1
本任务要求我们修改参数C的值,观察支持向量机对数据集的判定边界。ex6.m文件中已将相关代码备好,其代码如下:
%% =============== Part 1: Loading and Visualizing Data ================% We start the exercise by first loading and visualizing the dataset. % The following code will load the dataset into your environment and plot% the data.%fprintf('Loading and Visualizing Data ...\n')% Load from ex6data1: % You will have X, y in your environmentload('ex6data1.mat');% Plot training dataplotData(X, y);fprintf('Program paused. Press enter to continue.\n');pause;%% ==================== Part 2: Training Linear SVM ====================% The following code will train a linear SVM on the dataset and plot the% decision boundary learned.%% Load from ex6data1: % You will have X, y in your environmentload('ex6data1.mat');fprintf('\nTraining Linear SVM ...\n')% You should try to change the C value below and see how the decision% boundary varies (e.g., try C = 1000)C = 1;model = svmTrain(X, y, C, @linearKernel, 1e-3, 20);visualizeBoundaryLinear(X, y, model);fprintf('Program paused. Press enter to continue.\n');pause;
当参数C = 1时,其运行结果如下:
当参数C = 100时,其运行结果如下:
任务二 高斯核函数支持向量机
本部分我们使用高斯核函数支持向量机对数据集进行非线性地划分。
- 首先我们需要使用高斯核函数构建新的特征变量fi,i = 1, 2, 3, ......因此,我们需要在gaussianKernel.m文件中,将高斯核函数补充完整。
参考代码如下:
sim = exp(-(x1 - x2)' * (x1 - x2) / (2 * sigma * sigma));
在ex6.m文件中运行如下部分代码,可得到的结果为:0.324652。
%% =============== Part 3: Implementing Gaussian Kernel ===============% You will now implement the Gaussian kernel to use% with the SVM. You should complete the code in gaussianKernel.m%fprintf('\nEvaluating the Gaussian Kernel ...\n')x1 = [1 2 1]; x2 = [0 4 -1]; sigma = 2;sim = gaussianKernel(x1, x2, sigma);fprintf(['Gaussian Kernel between x1 = [1; 2; 1], x2 = [0; 4; -1], sigma = %f :' ... '\n\t%f\n(for sigma = 2, this value should be about 0.324652)\n'], sigma, sim);fprintf('Program paused. Press enter to continue.\n');pause;
- 我们需要导入新的数据集-样例数据集2,通过使用高斯核函数支持向量机,对该数据集进行非线性划分。如下为ex6.m文件中已备好的相关代码。
%% =============== Part 4: Visualizing Dataset 2 ================% The following code will load the next dataset into your environment and % plot the data. %fprintf('Loading and Visualizing Data ...\n')% Load from ex6data2: % You will have X, y in your environmentload('ex6data2.mat');% Plot training dataplotData(X, y);fprintf('Program paused. Press enter to continue.\n');pause;%% ========== Part 5: Training SVM with RBF Kernel (Dataset 2) ==========% After you have implemented the kernel, we can now use it to train the % SVM classifier.% fprintf('\nTraining SVM with RBF Kernel (this may take 1 to 2 minutes) ...\n');% Load from ex6data2: % You will have X, y in your environmentload('ex6data2.mat');% SVM ParametersC = 1; sigma = 0.1;% We set the tolerance and max_passes lower here so that the code will run% faster. However, in practice, you will want to run the training to% convergence.model= svmTrain(X, y, C, @(x1, x2) gaussianKernel(x1, x2, sigma)); visualizeBoundary(X, y, model);fprintf('Program paused. Press enter to continue.\n');pause;
其运行结果为:
- 为了更进一步熟悉使用高斯核函数支持向量机,我们导入新的数据集——样例数据集3。在ex6data3.mat文件中,其将该数据集分为了两部分,即训练集和交叉验证集。我们需要将dataset3Params.m文件补充完整,即我们需要在参数C∈{0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30}和参数σ∈{0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30}中找到一个最优值,使其能够对数据集进行正确地划分。
dataset3Params.m文件的参考代码如下:
C_temp = [0.01; 0.03; 0.1; 0.3; 1; 3; 10; 30]; sigma_temp = [0.01; 0.03; 0.1; 0.3; 1; 3; 10; 30];error_val = zeros(length(C_temp), length(sigma_temp));for i = 1 : length(C_temp) for j = 1 : length(sigma_temp) model = svmTrain(X, y, C_temp(i), @(x1, x2) gaussianKernel(x1, x2, sigma_temp(j))); predictions = svmPredict(model, Xval); error_val(i, j) = mean(double(predictions ~= yval)); endend[I, J] = find(error_val == min(error_val(:))); % 找最小元素位置C = C_temp(I) % 1sigma = sigma_temp(J) % 0.100
ex6.m文件中本部分代码如下:
%% =============== Part 6: Visualizing Dataset 3 ================% The following code will load the next dataset into your environment and % plot the data. %fprintf('Loading and Visualizing Data ...\n')% Load from ex6data3: % You will have X, y in your environmentload('ex6data3.mat');% Plot training dataplotData(X, y);fprintf('Program paused. Press enter to continue.\n');pause;%% ========== Part 7: Training SVM with RBF Kernel (Dataset 3) ==========% This is a different dataset that you can use to experiment with. Try% different values of C and sigma here.% % Load from ex6data3: % You will have X, y in your environmentload('ex6data3.mat');% Try different SVM Parameters here[C, sigma] = dataset3Params(X, y, Xval, yval);% Train the SVMmodel= svmTrain(X, y, C, @(x1, x2) gaussianKernel(x1, x2, sigma));visualizeBoundary(X, y, model);fprintf('Program paused. Press enter to continue.\n');pause;
其运行结果为:
垃圾邮件分类器
通过使用支持向量机实现垃圾邮件过滤器。
任务一 电子邮件预处理
首先,我们导入样本数据,其内容如下:
众所周知,每封电子邮件都包含有类似于URLs和电子邮件地址等内容,但其主体内容是不同的。因此,为了提高分类效率,我们对这些类似的内容进行归一化处理,例如:将URLs用“httpaddr”文本代替等。该功能代码已在processEmail.m文件中准备好了。
我们运行ex6_spam.m文件中该部分代码,结果如下:
==== Processed Email ====anyon know how much it cost to host a web portal well it depend on how manivisitor you re expect thi can be anywher from less than number buck a monthto a coupl of dollarnumb you should checkout httpaddr or perhap amazon ecnumbif your run someth big to unsubscrib yourself from thi mail list send anemail to emailaddr
任务二 词汇表
本部分我们利用已有的词汇表与样例邮件中的单词构建映射关系。因此,我们需要对processEmail.m文件进行补充,构建单词表映射关系,即样例邮件中的某一单词属于词汇表,则将其添加至word_indices变量;若不属于则跳过,验证下一单词。
processEmail.m文件的参考代码如下:
for i = 1 : length(vocabList) if (strcmp(vocabList{i}, str)) word_indices = [word_indices; i]; end end
任务三 提取样例邮件的特征变量
本部分将word_indices变量中的数据转换为特性变量x的向量模式,即
其中,若第i个单词在样例邮件中,则xi = 1;否则xi = 0。
emailFeatures.m文件中的参考代码如下:
x(word_indices) = 1;
任务四 使用支持向量机训练
本部分通过使用支持向量机模型对训练集进行训练,对交叉验证集进行验证。本部分代码如下:
%% =========== Part 3: Train Linear SVM for Spam Classification ========% In this section, you will train a linear classifier to determine if an% email is Spam or Not-Spam.% Load the Spam Email dataset% You will have X, y in your environmentload('spamTrain.mat');fprintf('\nTraining Linear SVM (Spam Classification)\n')fprintf('(this may take 1 to 2 minutes) ...\n')C = 0.1;model = svmTrain(X, y, C, @linearKernel);p = svmPredict(model, X);fprintf('Training Accuracy: %f\n', mean(double(p == y)) * 100);%% =================== Part 4: Test Spam Classification ================% After training the classifier, we can evaluate it on a test set. We have% included a test set in spamTest.mat% Load the test dataset% You will have Xtest, ytest in your environmentload('spamTest.mat');fprintf('\nEvaluating the trained Linear SVM on a test set ...\n')p = svmPredict(model, Xtest);fprintf('Test Accuracy: %f\n', mean(double(p == ytest)) * 100);pause;
任务五 预测垃圾邮件
本部分使用支持向量机对新的电子邮件进行预测。本部分代码为:
%% =================== Part 6: Try Your Own Emails =====================% Now that you've trained the spam classifier, you can use it on your own% emails! In the starter code, we have included spamSample1.txt,% spamSample2.txt, emailSample1.txt and emailSample2.txt as examples. % The following code reads in one of these emails and then uses your % learned SVM classifier to determine whether the email is Spam or % Not Spam% Set the file to be read in (change this to spamSample2.txt,% emailSample1.txt or emailSample2.txt to see different predictions on% different emails types). Try your own emails as well!filename = 'spamSample1.txt';% Read and predictfile_contents = readFile(filename);word_indices = processEmail(file_contents);x = emailFeatures(word_indices);p = svmPredict(model, x);fprintf('\nProcessed %s\n\nSpam Classification: %d\n', filename, p);fprintf('(1 indicates spam, 0 indicates not spam)\n\n');