Programming Exercise 7: K-means Clustering and Principal Component Analysis, Part 1


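Hello everyone, I'm Mac Jiang. Today I'm sharing my code for the first part of Coursera - Stanford University - Machine Learning - Programming Exercise 7: K-means Clustering and Principal Component Analysis. The first part covers K-means clustering, that is, how to implement the K-means algorithm. The code I wrote is correct, but it is not necessarily the best; if you know a better implementation, please leave a comment. You are welcome to repost this blog post, but please credit the source when you do. Thanks. Part 2 is at: http://blog.csdn.net/a1015553840/article/details/50879343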

All right, let's walk through the implementation of the first part, K-means clustering.

This part has two main pieces:

(1) Train the K-means algorithm and plot its progress in Octave, showing the cluster assignments and the movement of the centroids at each iteration.

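(2) Use K-means to compress an image. The image is 128*128 pixels; each pixel is described by three color channels (RGB), and each channel takes 1 byte (8 bits) with values in the range 0-255. Uncompressed, the image needs 128*128*3 bytes = 128*128*24 bits = 393,216 bits. We use K-means to cluster the pixel colors into 16 representative colors. Each pixel then needs only 4 bits to say which of the 16 colors it uses, and adding the table that maps the 16 colors to their RGB values gives 128*128*4 + 16*24 = 65,920 bits. The compressed image therefore takes roughly 1/6 of the original storage.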
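The storage arithmetic above can be checked with a few lines of Octave/MATLAB. This is only a sketch; the variable names are mine and do not appear in the exercise files.

```octave
% Storage estimate for a 128x128 RGB image, before and after
% reducing it to a 16-color palette. Variable names are illustrative.
rows = 128; cols = 128;
bits_per_pixel = 3 * 8;                  % R, G, B at 8 bits each
K = 16;                                  % palette size
bits_per_index = ceil(log2(K));          % 4 bits are enough to address 16 colors

original_bits   = rows * cols * bits_per_pixel;
compressed_bits = rows * cols * bits_per_index + K * bits_per_pixel;
ratio = original_bits / compressed_bits;

fprintf('original:   %d bits\n', original_bits);
fprintf('compressed: %d bits\n', compressed_bits);
fprintf('ratio:      %.2f\n', ratio);
```

The ratio comes out slightly below 6, which is where the "roughly 1/6" figure comes from.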

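Datasets: ex7data2.mat --- training examples for the K-means algorithm

bird_small.png --- image used for the compression test

Functions: displayData.m --- visualizes the data in the training set X

drawLine.m --- draws the line used when reducing 2D data to 1D

plotDataPoints.m --- plots the K-means data points, using a different color for each centroid they belong to

plotProgresskMeans.m --- plots the centroids as K-means progresses

runkMeans.m --- runs the K-means algorithm

ex7.m --- main script for K-means; it controls how the exercise proceeds

kMeansInitCentroids.m --- initializes the K-means centroids; you need to complete this code!

findClosestCentroids.m --- assigns every example to the cluster of its closest centroid; you need to complete this code!

computeCentroids.m --- computes the new centroid of each cluster found above; you need to complete this code!

In total, three files in this part need to be completed.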

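The K-means algorithm works as follows:

Initialize the centroids (implemented in kMeansInitCentroids.m);

Repeat {

for i = 1 to m: compute the distance from example i to every centroid and assign the example to the closest one (implemented in findClosestCentroids.m)

for k = 1 to K: once the examples are assigned, compute the new centroid of cluster k (implemented in computeCentroids.m)

}

So our task is to write three operations: initialization, assigning examples to clusters, and computing the new centroids.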
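The loop above can be sketched in a self-contained form as follows. This is not the exercise's runkMeans.m; it is a minimal version on made-up toy data, with fixed initial centroids so that the result is reproducible, and with a guard for empty clusters that I added as an assumption.

```octave
% Minimal K-means loop on toy data (all values are made up).
X = [1 1; 1.2 0.8; 0.9 1.1; 5 5; 5.1 4.9; 4.8 5.2];
K = 2;
centroids = X([1 4], :);       % fixed initial centroids for reproducibility
max_iters = 10;
m = size(X, 1);
idx = zeros(m, 1);

for iter = 1:max_iters
    % Assignment step: give each example the index of its closest centroid
    for i = 1:m
        diffs = centroids - repmat(X(i, :), K, 1);
        [~, idx(i)] = min(sum(diffs .^ 2, 2));
    end
    % Update step: move each centroid to the mean of its members
    for k = 1:K
        members = (idx == k);
        if any(members)        % leave the centroid alone if its cluster is empty
            centroids(k, :) = mean(X(members, :), 1);
        end
    end
end

disp(idx');
disp(centroids);
```

On this data the first three points end up in cluster 1 and the last three in cluster 2, and the centroids settle at the means of those two groups.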

1. Control flow of ex7.m

%% Machine Learning Online Class
%  Exercise 7 | Principle Component Analysis and K-Means Clustering
%
%  Instructions
%  ------------
%
%  This file contains code that helps you get started on the
%  exercise. You will need to complete the following functions:
%
%     pca.m
%     projectData.m
%     recoverData.m
%     computeCentroids.m
%     findClosestCentroids.m
%     kMeansInitCentroids.m
%
%  For this exercise, you will not need to change any code in this file,
%  or any other files other than those mentioned above.
%

%% Initialization
clear ; close all; clc

%% ================= Part 1: Find Closest Centroids ====================
%  To help you implement K-Means, we have divided the learning algorithm
%  into two functions -- findClosestCentroids and computeCentroids. In this
%  part, you should complete the code in the findClosestCentroids function.
%
fprintf('Finding closest centroids.\n\n');

% Load an example dataset that we will be using
load('ex7data2.mat');

% Select an initial set of centroids
K = 3; % 3 Centroids
initial_centroids = [3 3; 6 2; 8 5];

% Find the closest centroids for the examples using the
% initial_centroids
idx = findClosestCentroids(X, initial_centroids);

fprintf('Closest centroids for the first 3 examples: \n')
fprintf(' %d', idx(1:3));
fprintf('\n(the closest centroids should be 1, 3, 2 respectively)\n');

fprintf('Program paused. Press enter to continue.\n');
pause;

%% ===================== Part 2: Compute Means =========================
%  After implementing the closest centroids function, you should now
%  complete the computeCentroids function.
%
fprintf('\nComputing centroids means.\n\n');

%  Compute means based on the closest centroids found in the previous part.
centroids = computeCentroids(X, idx, K);

fprintf('Centroids computed after initial finding of closest centroids: \n')
fprintf(' %f %f \n' , centroids');
fprintf('\n(the centroids should be\n');
fprintf('   [ 2.428301 3.157924 ]\n');
fprintf('   [ 5.813503 2.633656 ]\n');
fprintf('   [ 7.119387 3.616684 ]\n\n');

fprintf('Program paused. Press enter to continue.\n');
pause;

%% =================== Part 3: K-Means Clustering ======================
%  After you have completed the two functions computeCentroids and
%  findClosestCentroids, you have all the necessary pieces to run the
%  kMeans algorithm. In this part, you will run the K-Means algorithm on
%  the example dataset we have provided.
%
fprintf('\nRunning K-Means clustering on example dataset.\n\n');

% Load an example dataset
load('ex7data2.mat');

% Settings for running K-Means
K = 3;
max_iters = 10;

% For consistency, here we set centroids to specific values
% but in practice you want to generate them automatically, such as by
% settings them to be random examples (as can be seen in
% kMeansInitCentroids).
initial_centroids = [3 3; 6 2; 8 5];

% Run K-Means algorithm. The 'true' at the end tells our function to plot
% the progress of K-Means
[centroids, idx] = runkMeans(X, initial_centroids, max_iters, true);
fprintf('\nK-Means Done.\n\n');

fprintf('Program paused. Press enter to continue.\n');
pause;

%% ============= Part 4: K-Means Clustering on Pixels ===============
fprintf('\nRunning K-Means clustering on pixels from an image.\n\n');

%  Load an image of a bird
A = double(imread('bird_small.png'));

A = A / 255; % Divide by 255 so that all values are in the range 0 - 1

% The image has 128 rows and 128 columns; each pixel has three RGB channels
% of 1 byte = 8 bits each, so 128*128*24 bits in total
img_size = size(A);

% The original image A has img_size(1) rows and img_size(2) columns, and each
% pixel's color is given by three RGB values of 8 bits each (3 bytes), so A is
% img_size(1) x img_size(2) x 3.
% To run K-means we flatten the rows and columns into a list of length
% img_size(1)*img_size(2), where each element has three RGB values.
% After flattening, each row is one input x with three dimensions (R, G, B).
X = reshape(A, img_size(1) * img_size(2), 3);

% The goal is to reduce the 256*256*256 possible RGB colors to 16 colors,
% which are found by K-means.
% Uncompressed, the image takes 128*128*3 bytes = 128*128*24 bits = 393216 bits.
% With 16 colors, each pixel needs only 4 bits for the color index, plus
% 16*24 bits to record the RGB values of the 16 colors.
% Total: 16*24 + 128*128*4 = 65920 bits, so the image shrinks by nearly a factor of 6.
K = 16;
max_iters = 10;

% Initialize the centroids
initial_centroids = kMeansInitCentroids(X, K);

% Run K-means
[centroids, idx] = runkMeans(X, initial_centroids, max_iters);

fprintf('Program paused. Press enter to continue.\n');
pause;

%% ================= Part 5: Image Compression ======================
fprintf('\nApplying K-Means to compress an image.\n\n');

% Find closest cluster members
idx = findClosestCentroids(X, centroids);

% Essentially, now we have represented the image X as in terms of the
% indices in idx.

% We can now recover the image from the indices (idx) by mapping each pixel
% (specified by its index in idx) to the centroid value
X_recovered = centroids(idx,:);

% Reshape the recovered image into proper dimensions
% This exercise does not actually compress the image; it only replaces each
% pixel's color with one of the 16 colors, but it illustrates the idea behind
% this kind of compression.
X_recovered = reshape(X_recovered, img_size(1), img_size(2), 3);

% Display the original image
subplot(1, 2, 1);
imagesc(A);
title('Original');

% Display compressed image side by side
subplot(1, 2, 2);
imagesc(X_recovered)
title(sprintf('Compressed, with %d colors.', K));

fprintf('Program paused. Press enter to continue.\n');
pause;

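Part 1: Find Closest Centroids --- using ex7data2.mat and the current centroids, compute the distance from every example to every centroid and assign each example to the cluster of its closest centroid

Part 2: Compute Means --- using the assignments from Part 1, compute the new centroid of each cluster

Part 3: K-Means Clustering --- cluster the data with K-means and plot how the assignments and the centroids change at each iteration

Part 4: K-Means Clustering on Pixels --- cluster the image's pixels with K-means to find 16 representative colors

Part 5: Image Compression --- compress the image using the 16 colors from Part 4. The image is not actually compressed here; each pixel's color is simply replaced by the closest of the 16 colors. The exercise only demonstrates the idea behind this kind of compression and does not implement the storage format.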
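To make the last point concrete, here is a small sketch of the palette lookup that Part 5 performs with `centroids(idx,:)`. The four-pixel "image" and the two-color palette are made-up values, not data from the exercise.

```octave
% Toy "image" of 4 pixels (one RGB row per pixel, values in [0, 1])
% and a 2-color palette. All values are made up for illustration.
X = [0.9 0.1 0.1;
     0.8 0.2 0.1;
     0.1 0.1 0.9;
     0.2 0.1 0.8];
centroids = [0.85 0.15 0.10;    % palette entry 1: reddish
             0.15 0.10 0.85];   % palette entry 2: bluish
idx = [1; 1; 2; 2];             % per-pixel palette index

% Indexing the palette by idx expands it back to one RGB row per pixel.
X_recovered = centroids(idx, :);

disp(size(X_recovered));                        % same shape as X
disp(size(unique(X_recovered, 'rows'), 1));     % number of distinct colors left
```

`X_recovered` has the same size as `X`, so nothing has been saved in memory; an actual compressed file would store only `idx` and `centroids`.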

2. Implementing kMeansInitCentroids.m

function centroids = kMeansInitCentroids(X, K)
%KMEANSINITCENTROIDS This function initializes K centroids that are to be
%used in K-Means on the dataset X
%   centroids = KMEANSINITCENTROIDS(X, K) returns K initial centroids to be
%   used with the K-Means on the dataset X
%

% You should return this values correctly
centroids = zeros(K, size(X, 2));

% ====================== YOUR CODE HERE ======================
% Instructions: You should set centroids to randomly chosen examples from
%               the dataset X
%

% Initialize the centroids by picking K random rows of X
randidx = randperm(size(X,1));            % random permutation of the row indices of X
centroids = X(randidx(1:K),:);            % take the first K shuffled rows as the initial centroids

% =============================================================

end

The initial centroids are K examples picked at random from the training set X, so we shuffle the row indices of X and take the first K.
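As a quick illustration of the shuffle-and-take-K idea on made-up data (the 5x2 matrix below is not from the exercise):

```octave
% Toy dataset: 5 examples with 2 features (made-up values).
X = [1 2; 3 4; 5 6; 7 8; 9 10];
K = 3;

randidx = randperm(size(X, 1));      % random ordering of the row indices 1..m
centroids = X(randidx(1:K), :);      % the first K shuffled rows become the centroids

disp(centroids);
```

Because `randperm` returns each index exactly once, the K chosen rows are always distinct examples; the output itself differs from run to run.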

3. Implementing findClosestCentroids.m

function idx = findClosestCentroids(X, centroids)
%FINDCLOSESTCENTROIDS computes the centroid memberships for every example
%   idx = FINDCLOSESTCENTROIDS (X, centroids) returns the closest centroids
%   in idx for a dataset X where each row is a single example. idx = m x 1
%   vector of centroid assignments (i.e. each entry in range [1..K])
%

% Set K
K = size(centroids, 1);

% You need to return the following variables correctly.
idx = zeros(size(X,1), 1);

% ====================== YOUR CODE HERE ======================
% Instructions: Go over every example, find its closest centroid, and store
%               the index inside idx at the appropriate location.
%               Concretely, idx(i) should contain the index of the centroid
%               closest to example i. Hence, it should be a value in the
%               range 1..K
%
% Note: You can use a for-loop over the examples to compute this.
%

temp = zeros(K,1);                                % squared distances from one example to each centroid
for i = 1:size(X,1),                              % loop over every example in X
    for j = 1:K,                                  % squared distance from x(i) to centroid j
        temp(j) = sum((X(i,:) - centroids(j,:)).^2);
    end
    [value,idx(i)] = min(temp,[],1);              % row of the smallest entry = index of the closest centroid
end

% =============================================================

end
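The double loop can also be written with one loop over the centroids only. The following is a sketch of that alternative, not part of my submitted solution; the toy points are made up, while the centroids are the initial ones used in ex7.m.

```octave
% Toy data (made up); the centroids are the initial ones from ex7.m.
X = [2 3; 6 1.5; 8 5.5; 3.5 3];
centroids = [3 3; 6 2; 8 5];
K = size(centroids, 1);
m = size(X, 1);

dist2 = zeros(m, K);                 % dist2(i, j): squared distance from example i to centroid j
for j = 1:K
    diffs = bsxfun(@minus, X, centroids(j, :));
    dist2(:, j) = sum(diffs .^ 2, 2);
end
[~, idx] = min(dist2, [], 2);        % column of each row's minimum = closest centroid

disp(idx');
```

This trades the loop over m examples for an m x K matrix of distances, which is usually the better deal when m is much larger than K.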

4. Implementing computeCentroids.m

function centroids = computeCentroids(X, idx, K)
%COMPUTECENTROIDS returns the new centroids by computing the means of the
%data points assigned to each centroid.
%   centroids = COMPUTECENTROIDS(X, idx, K) returns the new centroids by
%   computing the means of the data points assigned to each centroid. It is
%   given a dataset X where each row is a single data point, a vector
%   idx of centroid assignments (i.e. each entry in range [1..K]) for each
%   example, and K, the number of centroids. You should return a matrix
%   centroids, where each row of centroids is the mean of the data points
%   assigned to it.
%

% Useful variables
[m n] = size(X);

% You need to return the following variables correctly.
centroids = zeros(K, n);

% ====================== YOUR CODE HERE ======================
% Instructions: Go over every centroid and compute mean of all points that
%               belong to it. Concretely, the row vector centroids(i, :)
%               should contain the mean of the data points assigned to
%               centroid i.
%
% Note: You can use a for-loop over the centroids to compute this.
%

for i = 1:K,                                              % loop over the centroids, one at a time
    % Vectorized: (idx == i) is 1 where idx equals i and 0 elsewhere.
    % Multiplying X' by it sums the examples assigned to centroid i,
    % and dividing by sum(idx == i) gives their mean.
    % An inner for-loop over the examples would also work, but the
    % vectorized form is faster.
    centroids(i,:) = (X' * (idx == i)) / sum(idx == i);
end

% =============================================================

end

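Here we loop over the clusters (K in total) and compute the centroid of each one. Looping over the clusters takes one for-loop. Computing the centroid of a single cluster could use a second for-loop over the examples, but that is slow, so we use a vectorized expression instead.

The vectorized approach uses (idx == i) to build an m*1 vector whose entries are 1 where idx equals i and 0 elsewhere.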
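The mask trick can be tried on a tiny example. The data below is made up, and the variable names `mask`, `total` and `centroid_k` are mine; the second computation with `mean` is only there as a cross-check.

```octave
% Toy data: 4 examples, 2 features, 2 clusters (made-up values).
X = [1 2; 3 4; 10 20; 30 40];
idx = [1; 1; 2; 2];
k = 1;

mask = double(idx == k);             % m x 1 vector: 1 where the example is in cluster k
total = X' * mask;                   % n x 1 sum of the rows of X that belong to cluster k
centroid_k = (total / sum(mask))';   % divide by the member count, then back to a row

% The same result via logical indexing and mean
centroid_check = mean(X(idx == k, :), 1);

disp(centroid_k);
disp(centroid_check);
```

Both lines print the mean of the first two rows. Note that if a cluster has no members, `sum(mask)` is 0 and the division yields NaN, so an implementation used beyond this exercise may want to handle empty clusters explicitly.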

FROM:http://blog.csdn.net/a1015553840/article/details/50877623

2 0