Andrew Ng Machine Learning Exercise 7: K-means Clustering and Principal Component Analysis

Source: Internet; Published by: 通联数据股份公司深圳; Editor: 程序博客网; Time: 2024/05/21 03:20

1 K-means Clustering

1.1 Implementing K-means

The K-means algorithm is a method to automatically cluster similar data examples together.

The K-means algorithm is as follows:

% Initialize centroids
centroids = kMeansInitCentroids(X, K);
for iter = 1:iterations
    % Cluster assignment step: Assign each data point to the
    % closest centroid. idx(i) corresponds to c^(i), the index
    % of the centroid assigned to example i
    idx = findClosestCentroids(X, centroids);
    % Move centroid step: Compute means based on centroid
    % assignments
    centroids = computeMeans(X, idx, K);
end

1.1.1 Finding closest centroids

%% ================= Part 1: Find Closest Centroids ====================
%  To help you implement K-Means, we have divided the learning algorithm
%  into two functions -- findClosestCentroids and computeCentroids. In this
%  part, you should complete the code in the findClosestCentroids function.
%
fprintf('Finding closest centroids.\n\n');
% Load an example dataset that we will be using
load('ex7data2.mat');
% Select an initial set of centroids
K = 3; % 3 Centroids
initial_centroids = [3 3; 6 2; 8 5];
% Find the closest centroids for the examples using the
% initial_centroids
idx = findClosestCentroids(X, initial_centroids);
fprintf('Closest centroids for the first 3 examples: \n')
fprintf(' %d', idx(1:3));
fprintf('\n(the closest centroids should be 1, 3, 2 respectively)\n');
fprintf('Program paused. Press enter to continue.\n');
pause;

findClosestCentroids.m

function idx = findClosestCentroids(X, centroids)
%FINDCLOSESTCENTROIDS computes the centroid memberships for every example
%   idx = FINDCLOSESTCENTROIDS (X, centroids) returns the closest centroids
%   in idx for a dataset X where each row is a single example. idx = m x 1
%   vector of centroid assignments (i.e. each entry in range [1..K])
%
% Set K
K = size(centroids, 1);
% You need to return the following variables correctly.
idx = zeros(size(X,1), 1);
% ====================== YOUR CODE HERE ======================
% Instructions: Go over every example, find its closest centroid, and store
%               the index inside idx at the appropriate location.
%               Concretely, idx(i) should contain the index of the centroid
%               closest to example i. Hence, it should be a value in the
%               range 1..K
%
% Note: You can use a for-loop over the examples to compute this.
%
for i = 1:size(X,1)
    % Smallest squared distance found so far for example i. Starting from
    % Inf instead of a fixed constant works for any data scale, and the
    % variable name no longer shadows the built-in min function.
    min_dist = Inf;
    for j = 1:K
        dist = sum((X(i,:) - centroids(j,:)).^2);
        if dist <= min_dist
            min_dist = dist;
            idx(i,1) = j;
        end
    end
end
% =============================================================
end
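
The double loop above can also be written with one pass per centroid. The following is only a sketch: the function name findClosestCentroidsVec and the distance matrix D are assumptions introduced here, not part of the exercise files. One difference to be aware of is tie handling: min returns the first smallest entry, while the loop version with <= keeps the last one.

```matlab
% Hypothetical vectorized variant (name not from the exercise).
function idx = findClosestCentroidsVec(X, centroids)
    K = size(centroids, 1);
    m = size(X, 1);
    D = zeros(m, K);   % D(i, j) = squared distance from example i to centroid j
    for j = 1:K
        delta = bsxfun(@minus, X, centroids(j, :));  % subtract centroid j from every row
        D(:, j) = sum(delta .^ 2, 2);
    end
    [~, idx] = min(D, [], 2);   % column index of the smallest distance in each row
end
```

For example, findClosestCentroidsVec([1 1; 6 2.1; 9 5], [3 3; 6 2; 8 5]) assigns the three points to centroids 1, 2 and 3.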

1.1.2 Computing centroid means

Given assignments of every point to a centroid, the second phase of the algorithm recomputes, for each centroid, the mean of the points that were assigned to it.

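Recompute the centroid of each cluster.

The mean of the x-coordinates of all points in a cluster is the x-coordinate of that cluster's centroid, and the mean of their y-coordinates is its y-coordinate.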
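
Written as a formula, the update looks as follows. The symbols are assumptions introduced for this note: C_k is the set of examples currently assigned to centroid k, and c^(i) is the assignment stored in idx(i).

```latex
\mu_k = \frac{1}{|C_k|} \sum_{i \in C_k} x^{(i)}, \qquad C_k = \{\, i : c^{(i)} = k \,\}
```

The mean is taken per coordinate, so for 2D data the first component of \mu_k averages the x-coordinates and the second averages the y-coordinates.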

%% ===================== Part 2: Compute Means =========================
%  After implementing the closest centroids function, you should now
%  complete the computeCentroids function.
%
fprintf('\nComputing centroids means.\n\n');
%  Compute means based on the closest centroids found in the previous part.
centroids = computeCentroids(X, idx, K);
fprintf('Centroids computed after initial finding of closest centroids: \n')
fprintf(' %f %f \n' , centroids');
fprintf('\n(the centroids should be\n');
fprintf('   [ 2.428301 3.157924 ]\n');
fprintf('   [ 5.813503 2.633656 ]\n');
fprintf('   [ 7.119387 3.616684 ]\n\n');
fprintf('Program paused. Press enter to continue.\n');
pause;

computeCentroids.m

function centroids = computeCentroids(X, idx, K)
%COMPUTECENTROIDS returns the new centroids by computing the means of the
%data points assigned to each centroid.
%   centroids = COMPUTECENTROIDS(X, idx, K) returns the new centroids by
%   computing the means of the data points assigned to each centroid. It is
%   given a dataset X where each row is a single data point, a vector
%   idx of centroid assignments (i.e. each entry in range [1..K]) for each
%   example, and K, the number of centroids. You should return a matrix
%   centroids, where each row of centroids is the mean of the data points
%   assigned to it.
%
% Useful variables
[m n] = size(X);
% You need to return the following variables correctly.
centroids = zeros(K, n);
% ====================== YOUR CODE HERE ======================
% Instructions: Go over every centroid and compute mean of all points that
%               belong to it. Concretely, the row vector centroids(i, :)
%               should contain the mean of the data points assigned to
%               centroid i.
%
% Note: You can use a for-loop over the centroids to compute this.
%
for i=1:K
    list = find(idx==i);
    for j=1:size(list,1)
        centroids(i,:)=centroids(i,:)+X(list(j),:);
    end;
    centroids(i,:)=centroids(i,:)./size(list,1);
end;
% =============================================================
end
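
One edge case in the function above: if no example is assigned to centroid i, list is empty and the final division is 0/0, which produces NaN. A possible guard is sketched below. Keeping the previous centroid is just one choice among several, and the name computeCentroidsSafe and the extra argument old_centroids are assumptions, not part of the exercise.

```matlab
% Hypothetical variant that leaves a centroid unchanged when its cluster is empty.
function centroids = computeCentroidsSafe(X, idx, K, old_centroids)
    centroids = old_centroids;        % default: keep the previous position
    for k = 1:K
        members = (idx == k);         % logical mask of examples assigned to centroid k
        if any(members)
            centroids(k, :) = mean(X(members, :), 1);  % column-wise mean of the members
        end
    end
end
```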

1.2 K-means on example dataset

%% =================== Part 3: K-Means Clustering ======================
%  After you have completed the two functions computeCentroids and
%  findClosestCentroids, you have all the necessary pieces to run the
%  kMeans algorithm. In this part, you will run the K-Means algorithm on
%  the example dataset we have provided.
%
fprintf('\nRunning K-Means clustering on example dataset.\n\n');
% Load an example dataset
load('ex7data2.mat');
% Settings for running K-Means
K = 3;
max_iters = 10;
% For consistency, here we set centroids to specific values
% but in practice you want to generate them automatically, such as by
% setting them to be random examples (as can be seen in
% kMeansInitCentroids).
initial_centroids = [3 3; 6 2; 8 5];
% Run K-Means algorithm. The 'true' at the end tells our function to plot
% the progress of K-Means
[centroids, idx] = runkMeans(X, initial_centroids, max_iters, true);
fprintf('\nK-Means Done.\n\n');
fprintf('Program paused. Press enter to continue.\n');
pause;

runkMeans.m

function [centroids, idx] = runkMeans(X, initial_centroids, ...
                                      max_iters, plot_progress)
%RUNKMEANS runs the K-Means algorithm on data matrix X, where each row of X
%is a single example
%   [centroids, idx] = RUNKMEANS(X, initial_centroids, max_iters, ...
%   plot_progress) runs the K-Means algorithm on data matrix X, where each
%   row of X is a single example. It uses initial_centroids as the
%   initial centroids. max_iters specifies the total number of iterations
%   of K-Means to execute. plot_progress is a true/false flag that
%   indicates if the function should also plot its progress as the
%   learning happens. This is set to false by default. runkMeans returns
%   centroids, a Kxn matrix of the computed centroids and idx, a m x 1
%   vector of centroid assignments (i.e. each entry in range [1..K])
%
% Set default value for plot progress
if ~exist('plot_progress', 'var') || isempty(plot_progress)
    plot_progress = false;
end
% Plot the data if we are plotting progress
if plot_progress
    figure;
    hold on;
end
% Initialize values
[m n] = size(X);
K = size(initial_centroids, 1);
centroids = initial_centroids;
previous_centroids = centroids;
idx = zeros(m, 1);
% Run K-Means
for i=1:max_iters
    % Output progress
    fprintf('K-Means iteration %d/%d...\n', i, max_iters);
    if exist('OCTAVE_VERSION')
        fflush(stdout);
    end
    % For each example in X, assign it to the closest centroid
    idx = findClosestCentroids(X, centroids);
    % Optionally, plot progress here
    if plot_progress
        plotProgresskMeans(X, centroids, previous_centroids, idx, K, i);
        previous_centroids = centroids;
        fprintf('Press enter to continue.\n');
        pause;
    end
    % Given the memberships, compute new centroids
    centroids = computeCentroids(X, idx, K);
end
% Hold off if we are plotting progress
if plot_progress
    hold off;
end
end


1.3 Random initialization

Randomly initialize the cluster centroids.

kMeansInitCentroids.m

function centroids = kMeansInitCentroids(X, K)
%KMEANSINITCENTROIDS This function initializes K centroids that are to be
%used in K-Means on the dataset X
%   centroids = KMEANSINITCENTROIDS(X, K) returns K initial centroids to be
%   used with the K-Means on the dataset X
%
% You should return these values correctly
centroids = zeros(K, size(X, 2));
% ====================== YOUR CODE HERE ======================
% Instructions: You should set centroids to randomly chosen examples from
%               the dataset X
%
% Randomly reorder the indices of examples
randidx = randperm(size(X, 1));
% Take the first K examples as centroids
centroids = X(randidx(1:K), :);
% =============================================================
end

1.4 Image compression with K-means

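RGB encoding: the color of each pixel is represented with 24 bits, using 8 bits (0-255) each for the red, green, and blue values.

Our image contains thousands of colors, and we want to reduce it to 16 colors.

Treat every pixel of the image as a data example and use the K-means algorithm to find the 16 colors that best cluster the pixels in the 3-dimensional RGB space.

Once you have computed the cluster centroids, you use these 16 colors to replace the pixels of the original image.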

1.4.1 K-means on pixels

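First read the image and reshape it into an m x 3 matrix of pixel colors (m = 128 * 128 = 16384), then run K-means on this matrix.

After finding the K = 16 colors that represent the image, assign every pixel to one of these 16 clusters and replace its color with the color of its centroid.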
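
The replacement step is a single indexing operation. The toy values below are made up for illustration (a hypothetical 2-color palette and three pixels); only the expression centroids(idx, :) mirrors the exercise script.

```matlab
% Toy illustration of mapping each pixel to the color of its centroid.
centroids = [0 0 0; 1 1 1];        % hypothetical palette: row 1 black, row 2 white
idx = [2; 1; 2];                   % cluster assignment of three pixels
X_recovered = centroids(idx, :);   % row i is the palette color chosen for pixel i
disp(X_recovered);
```

Indexing with idx repeats palette rows as needed, so X_recovered has one row per pixel even though centroids has only K rows.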

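This reduces the space needed to describe the image:
Original: 24 bits for each of the 128 * 128 pixels, for a total of 128 * 128 * 24 = 393216 bits.
Now: storing the 16 colors takes 16 * 24 bits, and each pixel needs only 4 bits to store the index of the color it uses, which is 128 * 128 * 4 bits. The total is 16 * 24 + 128 * 128 * 4 = 65920 bits.
This compresses the image to about 1/6 of its original size.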
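
The same arithmetic can be checked with a few lines of code. This is a small sketch; the variable names are chosen here for illustration and do not appear in the exercise.

```matlab
% Storage needed before and after compressing a 128 x 128 image to K colors.
n_pixels = 128 * 128;
K = 16;
bits_per_color = 24;                         % 8 bits each for red, green, blue
bits_per_index = ceil(log2(K));              % bits needed to address one of K colors
original_bits = n_pixels * bits_per_color;   % every pixel stores a full color
compressed_bits = K * bits_per_color + n_pixels * bits_per_index;  % palette + indices
fprintf('original:   %d bits\n', original_bits);
fprintf('compressed: %d bits\n', compressed_bits);
fprintf('ratio:      %.2f\n', original_bits / compressed_bits);
```

Changing K shows the trade-off: each doubling of K adds one bit per pixel to the index, while the palette itself stays negligible.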


%% ============= Part 4: K-Means Clustering on Pixels ===============
%  In this exercise, you will use K-Means to compress an image. To do this,
%  you will first run K-Means on the colors of the pixels in the image and
%  then you will map each pixel onto its closest centroid.
%
%  You should now complete the code in kMeansInitCentroids.m
%
fprintf('\nRunning K-Means clustering on pixels from an image.\n\n');
%  Load an image of a bird
A = double(imread('bird_small.png'));
% If imread does not work for you, you can try instead
%   load ('bird_small.mat');
A = A / 255; % Divide by 255 so that all values are in the range 0 - 1
% Size of the image
img_size = size(A);
% Reshape the image into an Nx3 matrix where N = number of pixels.
% Each row will contain the Red, Green and Blue pixel values
% This gives us our dataset matrix X that we will use K-Means on.
X = reshape(A, img_size(1) * img_size(2), 3);
% Run your K-Means algorithm on this data
% You should try different values of K and max_iters here
K = 16;
max_iters = 10;
% When using K-Means, it is important to initialize the centroids
% randomly.
% You should complete the code in kMeansInitCentroids.m before proceeding
initial_centroids = kMeansInitCentroids(X, K);
% Run K-Means
[centroids, idx] = runkMeans(X, initial_centroids, max_iters);
fprintf('Program paused. Press enter to continue.\n');
pause;
%% ================= Part 5: Image Compression ======================
%  In this part of the exercise, you will use the clusters of K-Means to
%  compress an image. To do this, we first find the closest clusters for
%  each example. After that, we map each pixel to the value of its
%  closest centroid.
fprintf('\nApplying K-Means to compress an image.\n\n');
% Find closest cluster members
idx = findClosestCentroids(X, centroids);
% Essentially, now we have represented the image X in terms of the
% indices in idx.
% We can now recover the image from the indices (idx) by mapping each pixel
% (specified by its index in idx) to the centroid value
X_recovered = centroids(idx,:);
% Reshape the recovered image into proper dimensions
X_recovered = reshape(X_recovered, img_size(1), img_size(2), 3);
% Display the original image
subplot(1, 2, 1);
imagesc(A);
title('Original');
% Display compressed image side by side
subplot(1, 2, 2);
imagesc(X_recovered)
title(sprintf('Compressed, with %d colors.', K));
fprintf('Program paused. Press enter to continue.\n');
pause;

2 Principal Component Analysis

2.1 Example Dataset

Visualize the process of using PCA to reduce the data from 2D to 1D.

2.2 Implementing PCA

PCA consists of two steps:
1. Compute the covariance matrix of the data.
2. Use MATLAB's SVD function to compute the eigenvectors U1, U2, ..., Un.
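
In symbols, the two steps can be written as below. This assumes X is the m x n matrix whose rows are the normalized examples; the symbol \Sigma for the covariance matrix is notation introduced for this note.

```latex
\Sigma = \frac{1}{m} X^{T} X, \qquad [U, S, V] = \operatorname{svd}(\Sigma)
```

Because \Sigma is symmetric and positive semi-definite, the columns of U are its eigenvectors U1, ..., Un and the diagonal of S holds the corresponding eigenvalues, sorted from largest to smallest.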

Before using PCA, it is important to normalize the data.
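
The scripts in this exercise call featureNormalize for this step, but its source is not shown here. A minimal sketch of what such a normalization typically does is given below; the name normalizeFeatures is hypothetical and the exercise's own function may differ in detail.

```matlab
% Hypothetical helper: subtract each feature's mean and divide by its standard deviation.
function [X_norm, mu, sigma] = normalizeFeatures(X)
    mu = mean(X, 1);                           % 1 x n row of feature means
    sigma = std(X, 0, 1);                      % 1 x n row of sample standard deviations
    X_norm = bsxfun(@minus, X, mu);            % center every column at zero
    X_norm = bsxfun(@rdivide, X_norm, sigma);  % scale every column to unit deviation
end
```

Note that a feature with zero variance would lead to a division by zero in this sketch, so constant columns need to be removed or handled separately.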

%% ================== Part 1: Load Example Dataset  ===================
%  We start this exercise by using a small dataset that is easy to
%  visualize
%
fprintf('Visualizing example dataset for PCA.\n\n');
%  The following command loads the dataset. You should now have the
%  variable X in your environment
load ('ex7data1.mat');
%  Visualize the example dataset
plot(X(:, 1), X(:, 2), 'bo');
axis([0.5 6.5 2 8]); axis square;
fprintf('Program paused. Press enter to continue.\n');
pause;
%% =============== Part 2: Principal Component Analysis ===============
%  You should now implement PCA, a dimension reduction technique. You
%  should complete the code in pca.m
%
fprintf('\nRunning PCA on example dataset.\n\n');
%  Before running PCA, it is important to first normalize X
%  (mu, returned by featureNormalize, is the mean of each feature)
[X_norm, mu, sigma] = featureNormalize(X);
%  Run PCA
[U, S] = pca(X_norm);
%  Draw the eigenvectors centered at mean of data. These lines show the
%  directions of maximum variations in the dataset.
hold on;
drawLine(mu, mu + 1.5 * S(1,1) * U(:,1)', '-k', 'LineWidth', 2);
drawLine(mu, mu + 1.5 * S(2,2) * U(:,2)', '-k', 'LineWidth', 2);
hold off;
fprintf('Top eigenvector: \n');
fprintf(' U(:,1) = %f %f \n', U(1,1), U(2,1));
fprintf('\n(you should expect to see -0.707107 -0.707107)\n');
fprintf('Program paused. Press enter to continue.\n');
pause;

pca.m

function [U, S] = pca(X)
%PCA Run principal component analysis on the dataset X
%   [U, S] = pca(X) computes eigenvectors of the covariance matrix of X
%   Returns the eigenvectors U, the eigenvalues (on diagonal) in S
%
% Useful values
[m, n] = size(X);
% You need to return the following variables correctly.
U = zeros(n);
S = zeros(n);
% ====================== YOUR CODE HERE ======================
% Instructions: You should first compute the covariance matrix. Then, you
%               should use the "svd" function to compute the eigenvectors
%               and eigenvalues of the covariance matrix.
%
% Note: When computing the covariance matrix, remember to divide by m (the
%       number of examples).
%
sigma = X' * X / m;      % compute the covariance matrix
[U, S, V] = svd(sigma);  % use svd to compute the eigenvectors U and the diagonal matrix S
% =========================================================================
end

2.3 Dimensionality Reduction with PCA

Use the eigenvectors returned by PCA to map the data into a lower-dimensional space, x(i)→z(i) (e.g., projecting the data from 2D to 1D).
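
As a formula, the mapping and its approximate inverse can be written as follows. The name U_reduce for the matrix made of the first K columns of U is an assumption introduced for this note.

```latex
z^{(i)} = U_{\text{reduce}}^{T} x^{(i)}, \qquad x_{\text{approx}}^{(i)} = U_{\text{reduce}} \, z^{(i)}
```

With the examples stored as rows of X, the same operations for all examples at once are Z = X U_reduce and X_rec = Z U_reduce^T, which is the form used in projectData.m and recoverData.m.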

%% =================== Part 3: Dimension Reduction ===================
%  You should now implement the projection step to map the data onto the
%  first k eigenvectors. The code will then plot the data in this reduced
%  dimensional space.  This will show you what the data looks like when
%  using only the corresponding eigenvectors to reconstruct it.
%
%  You should complete the code in projectData.m
%
fprintf('\nDimension reduction on example dataset.\n\n');
%  Plot the normalized dataset (returned from pca)
plot(X_norm(:, 1), X_norm(:, 2), 'bo');
axis([-4 3 -4 3]); axis square
%  Project the data onto K = 1 dimension
K = 1;
Z = projectData(X_norm, U, K);
fprintf('Projection of the first example: %f\n', Z(1));
fprintf('\n(this value should be about 1.481274)\n\n');
X_rec  = recoverData(Z, U, K);
fprintf('Approximation of the first example: %f %f\n', X_rec(1, 1), X_rec(1, 2));
fprintf('\n(this value should be about  -1.047419 -1.047419)\n\n');
%  Draw lines connecting the projected points to the original points
hold on;
plot(X_rec(:, 1), X_rec(:, 2), 'ro');
for i = 1:size(X_norm, 1)
    drawLine(X_norm(i,:), X_rec(i,:), '--k', 'LineWidth', 1);
end
hold off
fprintf('Program paused. Press enter to continue.\n');
pause;

2.3.1 Projecting the data onto the principal components

projectData.m

function Z = projectData(X, U, K)
%PROJECTDATA Computes the reduced data representation when projecting only
%on to the top k eigenvectors
%   Z = projectData(X, U, K) computes the projection of
%   the normalized inputs X into the reduced dimensional space spanned by
%   the first K columns of U. It returns the projected examples in Z.
%
% You need to return the following variables correctly.
Z = zeros(size(X, 1), K);
% ====================== YOUR CODE HERE ======================
% Instructions: Compute the projection of the data using only the top K
%               eigenvectors in U (first K columns).
%               For the i-th example X(i,:), the projection on to the k-th
%               eigenvector is given as follows:
%                    x = X(i, :)';
%                    projection_k = x' * U(:, k);
%
Z = X * U(:,1:K);  % compute Z, the representation of X in the new K-dimensional space
% =============================================================
end

2.3.2 Reconstructing an approximation of the data

recoverData.m

function X_rec = recoverData(Z, U, K)
%RECOVERDATA Recovers an approximation of the original data when using the
%projected data
%   X_rec = RECOVERDATA(Z, U, K) recovers an approximation of the
%   original data that has been reduced to K dimensions. It returns the
%   approximate reconstruction in X_rec.
%
% You need to return the following variables correctly.
X_rec = zeros(size(Z, 1), size(U, 1));
% ====================== YOUR CODE HERE ======================
% Instructions: Compute the approximation of the data by projecting back
%               onto the original space using the top K eigenvectors in U.
%
%               For the i-th example Z(i,:), the (approximate)
%               recovered data for dimension j is given as follows:
%                    v = Z(i, :)';
%                    recovered_j = v' * U(j, 1:K)';
%
%               Notice that U(j, 1:K) is a row vector.
%
X_rec = Z * U(:,1:K)';  % reconstruct X, mapping it from K dimensions back to N dimensions
% =============================================================
end


2.4 Face Image Dataset

%% =============== Part 4: Loading and Visualizing Face Data =============
%  We start the exercise by first loading and visualizing the dataset.
%  The following code will load the dataset into your environment
%
fprintf('\nLoading face dataset.\n\n');
%  Load Face dataset
load ('ex7faces.mat')
%  Display the first 100 faces in the dataset
displayData(X(1:100, :));
fprintf('Program paused. Press enter to continue.\n');
pause;


2.4.1 PCA on Faces

%% =========== Part 5: PCA on Face Data: Eigenfaces  ===================
%  Run PCA and visualize the eigenvectors which are in this case eigenfaces
%  We display the first 36 eigenfaces.
%
fprintf(['\nRunning PCA on face dataset.\n' ...
         '(this might take a minute or two ...)\n\n']);
%  Before running PCA, it is important to first normalize X by subtracting
%  the mean value from each feature
[X_norm, mu, sigma] = featureNormalize(X);
%  Run PCA
[U, S] = pca(X_norm);
%  Visualize the top 36 eigenvectors found
displayData(U(:, 1:36)');
fprintf('Program paused. Press enter to continue.\n');
pause;


2.4.2 Dimensionality Reduction

Projecting each face image onto the top principal components allows you to use your learning algorithm with a smaller input size (e.g., 100 dimensions) instead of the original 1024 dimensions. This can help speed up your learning algorithm.

%% ==== Part 7: Visualization of Faces after PCA Dimension Reduction ====
%  Project images to the eigen space using the top K eigen vectors and
%  visualize only using those K dimensions
%  Compare to the original input, which is also displayed
fprintf('\nVisualizing the projected (reduced dimension) faces.\n\n');
K = 100;
% Z must hold the projection of the faces onto the top K eigenvectors
% before it can be passed to recoverData
Z = projectData(X_norm, U, K);
X_rec  = recoverData(Z, U, K);
% Display normalized data
subplot(1, 2, 1);
displayData(X_norm(1:100,:));
title('Original faces');
axis square;
% Display reconstructed data from only k eigenfaces
subplot(1, 2, 2);
displayData(X_rec(1:100,:));
title('Recovered faces');
axis square;
fprintf('Program paused. Press enter to continue.\n');
pause;

2.5 Optional (ungraded) exercise: PCA for visualization

Above, we ran K-means in the 3-dimensional RGB space. Here we use PCA to project the data from 3D to 2D so that it can be visualized.

%% === Part 8(a): Optional (ungraded) Exercise: PCA for Visualization ===
%  One useful application of PCA is to use it to visualize high-dimensional
%  data. In the last K-Means exercise you ran K-Means on 3-dimensional
%  pixel colors of an image. We first visualize this output in 3D, and then
%  apply PCA to obtain a visualization in 2D.
close all; clc
% Reload the image from the previous exercise and run K-Means on it
% For this to work, you need to complete the K-Means assignment first
A = double(imread('bird_small.png'));
% If imread does not work for you, you can try instead
%   load ('bird_small.mat');
A = A / 255;
img_size = size(A);
X = reshape(A, img_size(1) * img_size(2), 3);
K = 16;
max_iters = 10;
initial_centroids = kMeansInitCentroids(X, K);
[centroids, idx] = runkMeans(X, initial_centroids, max_iters);
%  Sample 1000 random indexes (since working with all the data is
%  too expensive). If you have a fast computer, you may increase this.
sel = floor(rand(1000, 1) * size(X, 1)) + 1;
%  Setup Color Palette
palette = hsv(K);
colors = palette(idx(sel), :);
%  Visualize the data and centroid memberships in 3D
figure;
scatter3(X(sel, 1), X(sel, 2), X(sel, 3), 10, colors);
title('Pixel dataset plotted in 3D. Color shows centroid memberships');
fprintf('Program paused. Press enter to continue.\n');
pause;
%% === Part 8(b): Optional (ungraded) Exercise: PCA for Visualization ===
% Use PCA to project this cloud to 2D for visualization
% Subtract the mean to use PCA
[X_norm, mu, sigma] = featureNormalize(X);
% PCA and project the data to 2D
[U, S] = pca(X_norm);
Z = projectData(X_norm, U, 2);
% Plot in 2D
figure;
plotDataPoints(Z(sel, :), idx(sel), K);
title('Pixel dataset plotted in 2D, using PCA for dimensionality reduction');
fprintf('Program paused. Press enter to continue.\n');
pause;
