编程作业（六）

来源：互联网发布：tensorflow windows10 编辑：程序博客网时间：2024/06/04 01:02

支持向量机

本部分练习，我们将在2D示例数据集上使用支持向量机。通过在这些数据集上使用支持向量机，将帮助我们初识支持向量机的运行原理，以及如何使用高斯核函数支持向量机。

任务一示例数据集1

本任务要求我们修改参数C的值，观察支持向量机对数据集的判定边界。ex6.m文件中已将相关代码备好，其代码如下：

%% =============== Part 1: Loading and Visualizing Data ================%  We start the exercise by first loading and visualizing the dataset. %  The following code will load the dataset into your environment and plot%  the data.%fprintf('Loading and Visualizing Data ...\n')% Load from ex6data1: % You will have X, y in your environmentload('ex6data1.mat');% Plot training dataplotData(X, y);fprintf('Program paused. Press enter to continue.\n');pause;%% ==================== Part 2: Training Linear SVM ====================%  The following code will train a linear SVM on the dataset and plot the%  decision boundary learned.%% Load from ex6data1: % You will have X, y in your environmentload('ex6data1.mat');fprintf('\nTraining Linear SVM ...\n')% You should try to change the C value below and see how the decision% boundary varies (e.g., try C = 1000)C = 1;model = svmTrain(X, y, C, @linearKernel, 1e-3, 20);visualizeBoundaryLinear(X, y, model);fprintf('Program paused. Press enter to continue.\n');pause;

当参数C = 1时，其运行结果如下：

C = 1

当参数C = 100时，其运行结果如下：

C = 100

任务二高斯核函数支持向量机

本部分我们使用高斯核函数支持向量机对数据集进行非线性地划分。

首先我们需要使用高斯核函数构建新的特征变量f_i，i = 1, 2, 3, ......因此，我们需要在gaussianKernel.m文件中，将高斯核函数补充完整。

高斯核函数

参考代码如下：

sim = exp(-(x1 - x2)' * (x1 - x2) / (2 * sigma * sigma));

在ex6.m文件中运行如下部分代码，可得到的结果为：0.324652。

%% =============== Part 3: Implementing Gaussian Kernel ===============%  You will now implement the Gaussian kernel to use%  with the SVM. You should complete the code in gaussianKernel.m%fprintf('\nEvaluating the Gaussian Kernel ...\n')x1 = [1 2 1]; x2 = [0 4 -1]; sigma = 2;sim = gaussianKernel(x1, x2, sigma);fprintf(['Gaussian Kernel between x1 = [1; 2; 1], x2 = [0; 4; -1], sigma = %f :' ...         '\n\t%f\n(for sigma = 2, this value should be about 0.324652)\n'], sigma, sim);fprintf('Program paused. Press enter to continue.\n');pause;

我们需要导入新的数据集-样例数据集2，通过使用高斯核函数支持向量机，对该数据集进行非线性划分。如下为ex6.m文件中已备好的相关代码。

%% =============== Part 4: Visualizing Dataset 2 ================%  The following code will load the next dataset into your environment and %  plot the data. %fprintf('Loading and Visualizing Data ...\n')% Load from ex6data2: % You will have X, y in your environmentload('ex6data2.mat');% Plot training dataplotData(X, y);fprintf('Program paused. Press enter to continue.\n');pause;%% ========== Part 5: Training SVM with RBF Kernel (Dataset 2) ==========%  After you have implemented the kernel, we can now use it to train the %  SVM classifier.% fprintf('\nTraining SVM with RBF Kernel (this may take 1 to 2 minutes) ...\n');% Load from ex6data2: % You will have X, y in your environmentload('ex6data2.mat');% SVM ParametersC = 1; sigma = 0.1;% We set the tolerance and max_passes lower here so that the code will run% faster. However, in practice, you will want to run the training to% convergence.model= svmTrain(X, y, C, @(x1, x2) gaussianKernel(x1, x2, sigma)); visualizeBoundary(X, y, model);fprintf('Program paused. Press enter to continue.\n');pause;

其运行结果为：

为了更进一步熟悉使用高斯核函数支持向量机，我们导入新的数据集——样例数据集3。在ex6data3.mat文件中，其将该数据集分为了两部分，即训练集和交叉验证集。我们需要将dataset3Params.m文件补充完整，即我们需要在参数C∈{0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30}和参数σ∈{0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30}中找到一个最优值，使其能够对数据集进行正确地划分。

dataset3Params.m文件的参考代码如下：

C_temp = [0.01; 0.03; 0.1; 0.3; 1; 3; 10; 30];  sigma_temp = [0.01; 0.03; 0.1; 0.3; 1; 3; 10; 30];error_val = zeros(length(C_temp), length(sigma_temp));for i = 1 : length(C_temp)    for j = 1 : length(sigma_temp)        model = svmTrain(X, y, C_temp(i), @(x1, x2) gaussianKernel(x1, x2, sigma_temp(j)));        predictions = svmPredict(model, Xval);        error_val(i, j) = mean(double(predictions ~= yval));    endend[I, J] = find(error_val ==  min(error_val(:)));    % 找最小元素位置C = C_temp(I)                                      % 1sigma = sigma_temp(J)                              % 0.100

ex6.m文件中本部分代码如下：

%% =============== Part 6: Visualizing Dataset 3 ================%  The following code will load the next dataset into your environment and %  plot the data. %fprintf('Loading and Visualizing Data ...\n')% Load from ex6data3: % You will have X, y in your environmentload('ex6data3.mat');% Plot training dataplotData(X, y);fprintf('Program paused. Press enter to continue.\n');pause;%% ========== Part 7: Training SVM with RBF Kernel (Dataset 3) ==========%  This is a different dataset that you can use to experiment with. Try%  different values of C and sigma here.% % Load from ex6data3: % You will have X, y in your environmentload('ex6data3.mat');% Try different SVM Parameters here[C, sigma] = dataset3Params(X, y, Xval, yval);% Train the SVMmodel= svmTrain(X, y, C, @(x1, x2) gaussianKernel(x1, x2, sigma));visualizeBoundary(X, y, model);fprintf('Program paused. Press enter to continue.\n');pause;

其运行结果为：

垃圾邮件分类器

通过使用支持向量机实现垃圾邮件过滤器。

任务一电子邮件预处理

首先，我们导入样本数据，其内容如下：

众所周知，每封电子邮件都包含有类似于URLs和电子邮件地址等内容，但其主体内容是不同的。因此，为了提高分类效率，我们对这些类似的内容进行归一化处理，例如：将URLs用“httpaddr”文本代替等。该功能代码已在processEmail.m文件中准备好了。

我们运行ex6_spam.m文件中该部分代码，结果如下：

==== Processed Email ====anyon know how much it cost to host a web portal well it depend on how manivisitor you re expect thi can be anywher from less than number buck a monthto a coupl of dollarnumb you should checkout httpaddr or perhap amazon ecnumbif your run someth big to unsubscrib yourself from thi mail list send anemail to emailaddr

任务二词汇表

本部分我们利用已有的词汇表与样例邮件中的单词构建映射关系。因此，我们需要对processEmail.m文件进行补充，构建单词表映射关系，即样例邮件中的某一单词属于词汇表，则将其添加至word_indices变量；若不属于则跳过，验证下一单词。

processEmail.m文件的参考代码如下：

    for i = 1 : length(vocabList)        if (strcmp(vocabList{i}, str))            word_indices = [word_indices; i];        end    end

任务三提取样例邮件的特征变量

本部分将word_indices变量中的数据转换为特性变量x的向量模式，即

其中，若第i个单词在样例邮件中，则x_i = 1；否则x_i = 0。

emailFeatures.m文件中的参考代码如下：

x(word_indices) = 1;

任务四使用支持向量机训练

本部分通过使用支持向量机模型对训练集进行训练，对交叉验证集进行验证。本部分代码如下：

%% =========== Part 3: Train Linear SVM for Spam Classification ========%  In this section, you will train a linear classifier to determine if an%  email is Spam or Not-Spam.% Load the Spam Email dataset% You will have X, y in your environmentload('spamTrain.mat');fprintf('\nTraining Linear SVM (Spam Classification)\n')fprintf('(this may take 1 to 2 minutes) ...\n')C = 0.1;model = svmTrain(X, y, C, @linearKernel);p = svmPredict(model, X);fprintf('Training Accuracy: %f\n', mean(double(p == y)) * 100);%% =================== Part 4: Test Spam Classification ================%  After training the classifier, we can evaluate it on a test set. We have%  included a test set in spamTest.mat% Load the test dataset% You will have Xtest, ytest in your environmentload('spamTest.mat');fprintf('\nEvaluating the trained Linear SVM on a test set ...\n')p = svmPredict(model, Xtest);fprintf('Test Accuracy: %f\n', mean(double(p == ytest)) * 100);pause;

任务五预测垃圾邮件

本部分使用支持向量机对新的电子邮件进行预测。本部分代码为：

%% =================== Part 6: Try Your Own Emails =====================%  Now that you've trained the spam classifier, you can use it on your own%  emails! In the starter code, we have included spamSample1.txt,%  spamSample2.txt, emailSample1.txt and emailSample2.txt as examples. %  The following code reads in one of these emails and then uses your %  learned SVM classifier to determine whether the email is Spam or %  Not Spam% Set the file to be read in (change this to spamSample2.txt,% emailSample1.txt or emailSample2.txt to see different predictions on% different emails types). Try your own emails as well!filename = 'spamSample1.txt';% Read and predictfile_contents = readFile(filename);word_indices  = processEmail(file_contents);x             = emailFeatures(word_indices);p = svmPredict(model, x);fprintf('\nProcessed %s\n\nSpam Classification: %d\n', filename, p);fprintf('(1 indicates spam, 0 indicates not spam)\n\n');

阅读全文

0 0