Coursera "Machine Learning" Programming Exercise 7: K-means Clustering and Principal Component Analysis



K-means


K-means is an iterative algorithm: it takes an unlabeled dataset and groups the examples into clusters. Suppose we want to partition the data into K clusters; the procedure is:
1. Pick K random points, called the cluster centroids.
2. For each example in the dataset, compute its distance to each of the K centroids and assign it to the nearest one; all examples assigned to the same centroid form one cluster.
3. For each cluster, compute the mean of its examples and move the corresponding centroid to that mean.
4. Repeat steps 2-3 until the centroids stop moving.

In MATLAB, the main loop looks like this:

% Initialize centroids
centroids = kMeansInitCentroids(X, K);
for iter = 1:iterations
    % Cluster assignment step: Assign each data point to the
    % closest centroid. idx(i) corresponds to c^(i), the index
    % of the centroid assigned to example i
    idx = findClosestCentroids(X, centroids);

    % Move centroid step: Compute means based on centroid
    % assignments (implemented below as computeCentroids)
    centroids = computeCentroids(X, idx, K);
end

Let's first run K-means on a simple 2D dataset to get an intuitive feel for the algorithm.
[Figure: the example 2D dataset]

Random Initialization

The assignment fixes K = 3 and sets the initial centroid positions to [3 3; 6 2; 8 5].
In practice, a common initialization strategy is to pick K examples from the dataset at random and use them as the initial centroids. The implementation:

function centroids = kMeansInitCentroids(X, K)
%KMEANSINITCENTROIDS This function initializes K centroids that are to be
%used in K-Means on the dataset X
%   centroids = KMEANSINITCENTROIDS(X, K) returns K initial centroids to be
%   used with the K-Means on the dataset X
%

% You should return this values correctly
centroids = zeros(K, size(X, 2));

% ====================== YOUR CODE HERE ======================
% Instructions: You should set centroids to randomly chosen examples from
%               the dataset X
%

% Initialize the centroids to be random examples

% Randomly reorder the indices of examples
randidx = randperm(size(X, 1));
% Take the first K examples as centroids
centroids = X(randidx(1:K), :);

% =============================================================

end

One weakness of K-means is that it can converge to a local minimum, depending on how the centroids are initialized.
To mitigate this, we usually run K-means several times, each time with a fresh random initialization, and keep the run with the lowest value of the cost (distortion) function, as sketched below. This works well when K is small (roughly 2-10); when K is large, multiple restarts tend not to bring much improvement.
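
A minimal sketch of this restart strategy, built from the functions developed in this exercise (the restart and iteration counts are illustrative choices, not values from the assignment):

% Run K-means numRestarts times and keep the clustering with the
% lowest distortion J (counts are illustrative, not from the assignment)
numRestarts = 10;
iterations  = 10;
bestCost = Inf;
for r = 1:numRestarts
    centroids = kMeansInitCentroids(X, K);
    for iter = 1:iterations
        idx = findClosestCentroids(X, centroids);
        centroids = computeCentroids(X, idx, K);
    end
    % Distortion: mean squared distance from each example to its centroid
    J = mean(sum((X - centroids(idx, :)).^2, 2));
    if J < bestCost
        bestCost = J;
        bestCentroids = centroids;
        bestIdx = idx;
    end
end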

Finding the Closest Centroids

For every example i we find the centroid j that minimizes the (squared) Euclidean distance:

c^{(i)} := \arg\min_j \lVert x^{(i)} - \mu_j \rVert^2

where \mu_j is the position of the j-th centroid. The implementation:

function idx = findClosestCentroids(X, centroids)
%FINDCLOSESTCENTROIDS computes the centroid memberships for every example
%   idx = FINDCLOSESTCENTROIDS (X, centroids) returns the closest centroids
%   in idx for a dataset X where each row is a single example. idx = m x 1
%   vector of centroid assignments (i.e. each entry in range [1..K])
%

% Set K
K = size(centroids, 1);

% You need to return the following variables correctly.
idx = zeros(size(X, 1), 1);

% ====================== YOUR CODE HERE ======================
% Instructions: Go over every example, find its closest centroid, and store
%               the index inside idx at the appropriate location.
%               Concretely, idx(i) should contain the index of the centroid
%               closest to example i. Hence, it should be a value in the
%               range 1..K
%
% Note: You can use a for-loop over the examples to compute this.
%

for i = 1:size(X, 1)
    % Squared Euclidean distance from example i to each of the K centroids
    M = sum((repmat(X(i, :), K, 1) - centroids).^2, 2);
    % Index of the nearest centroid (min breaks ties by the first match)
    [~, idx(i)] = min(M);
end

% =============================================================

end
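
The loop over the examples is easy to read but slow for large datasets. An optional vectorized variant (a sketch, not the assignment's reference solution) builds the full m-by-K matrix of squared distances at once using the expansion \lVert x - \mu \rVert^2 = \lVert x \rVert^2 - 2 x^T \mu + \lVert \mu \rVert^2:

% Vectorized distances; relies on implicit expansion (MATLAB R2016b+
% or Octave; use bsxfun on older MATLAB versions).
% D(i, j) = squared distance from example i to centroid j
D = sum(X.^2, 2) - 2 * X * centroids' + sum(centroids.^2, 2)';
[~, idx] = min(D, [], 2);   % row-wise argmin over the K centroids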

Computing the Centroid Means

The centroid of the k-th cluster is recomputed as the mean of the examples assigned to it:

\mu_k := \frac{1}{|C_k|} \sum_{i \in C_k} x^{(i)}

where C_k is the set of examples currently assigned to centroid k. The implementation:

function centroids = computeCentroids(X, idx, K)
%COMPUTECENTROIDS returns the new centroids by computing the means of the
%data points assigned to each centroid.
%   centroids = COMPUTECENTROIDS(X, idx, K) returns the new centroids by
%   computing the means of the data points assigned to each centroid. It is
%   given a dataset X where each row is a single data point, a vector
%   idx of centroid assignments (i.e. each entry in range [1..K]) for each
%   example, and K, the number of centroids. You should return a matrix
%   centroids, where each row of centroids is the mean of the data points
%   assigned to it.
%

% Useful variables
[m, n] = size(X);

% You need to return the following variables correctly.
centroids = zeros(K, n);

% ====================== YOUR CODE HERE ======================
% Instructions: Go over every centroid and compute mean of all points that
%               belong to it. Concretely, the row vector centroids(i, :)
%               should contain the mean of the data points assigned to
%               centroid i.
%
% Note: You can use a for-loop over the centroids to compute this.
%

for i = 1:K
    id = find(idx == i);
    % mean(..., 1) keeps the result a row vector even when only one
    % example is assigned to this centroid
    centroids(i, :) = mean(X(id, :), 1);
end

% =============================================================

end

Running K-means

Combining the two steps gives a working K-means. After 10 iterations we get the figure below:
[Figure: K-means clustering of the 2D dataset after 10 iterations]
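
For reference, a concrete driver for this run might look as follows (a sketch; the data file name ex7data2.mat is an assumption based on the assignment materials):

% Load the example dataset (file name assumed from the assignment)
load('ex7data2.mat');                 % provides the matrix X
K = 3;
centroids = [3 3; 6 2; 8 5];          % initial centroids given in the assignment
for iter = 1:10
    idx = findClosestCentroids(X, centroids);
    centroids = computeCentroids(X, idx, K);
end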

Principal Component Analysis


We first implement PCA on a simple 2D dataset.

The 2D Dataset

[Figure: the example 2D dataset for PCA]

Implementing PCA

The first step is mean normalization: compute the mean \mu_j of every feature and replace each x_j with x_j - \mu_j. If the features are on different scales, we also divide each feature by its standard deviation \sigma_j.
The second step is to compute the covariance matrix \Sigma:

\Sigma = \frac{1}{m} X^T X

The third step is to compute the eigenvectors of the covariance matrix.
In MATLAB/Octave we can obtain them via the singular value decomposition: [U, S, V] = svd(Sigma).

Feature Normalization

function [X_norm, mu, sigma] = featureNormalize(X)
%FEATURENORMALIZE Normalizes the features in X
%   FEATURENORMALIZE(X) returns a normalized version of X where
%   the mean value of each feature is 0 and the standard deviation
%   is 1. This is often a good preprocessing step to do when
%   working with learning algorithms.

mu = mean(X);
X_norm = bsxfun(@minus, X, mu);

sigma = std(X_norm);
X_norm = bsxfun(@rdivide, X_norm, sigma);

% ============================================================

end
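
The returned mu and sigma are worth keeping around: they let us map points back to the original units later, e.g. when plotting reconstructions. A one-line sketch (X_orig is a hypothetical name, not part of the assignment):

% Undo the normalization: scale by sigma, then shift by mu
X_orig = bsxfun(@plus, bsxfun(@times, X_norm, sigma), mu);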

Implementation

We compute the covariance matrix and then its eigenvectors. The implementation:

function [U, S] = pca(X)
%PCA Run principal component analysis on the dataset X
%   [U, S] = pca(X) computes eigenvectors of the covariance matrix of X
%   Returns the eigenvectors U, the eigenvalues (on diagonal) in S
%

% Useful values
[m, n] = size(X);

% You need to return the following variables correctly.
U = zeros(n);
S = zeros(n);

% ====================== YOUR CODE HERE ======================
% Instructions: You should first compute the covariance matrix. Then, you
%               should use the "svd" function to compute the eigenvectors
%               and eigenvalues of the covariance matrix.
%
% Note: When computing the covariance matrix, remember to divide by m (the
%       number of examples).
%

Sigma = X' * X / m;
[U, S, V] = svd(Sigma);

% =========================================================================

end

Visualizing the eigenvectors:
[Figure: the computed eigenvectors plotted over the dataset]

Dimensionality Reduction with PCA

For an n-dimensional dataset, PCA returns an n×n matrix U whose columns are the direction vectors onto which the data can be projected with the smallest projection error. To reduce the data from n dimensions to K, we take the first K columns of U, obtaining an n×K matrix U_reduce, and compute the new feature matrix Z:

Z = X \, U_{reduce}

(per example, z^{(i)} = U_{reduce}^T x^{(i)}). The implementation:

function Z = projectData(X, U, K)
%PROJECTDATA Computes the reduced data representation when projecting only
%on to the top k eigenvectors
%   Z = projectData(X, U, K) computes the projection of
%   the normalized inputs X into the reduced dimensional space spanned by
%   the first K columns of U. It returns the projected examples in Z.
%

% You need to return the following variables correctly.
Z = zeros(size(X, 1), K);

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the projection of the data using only the top K
%               eigenvectors in U (first K columns).
%               For the i-th example X(i,:), the projection on to the k-th
%               eigenvector is given as follows:
%                    x = X(i, :)';
%                    projection_k = x' * U(:, k);
%

Z = X * U(:, 1:K);

% =============================================================

end

Recovering the Data

After compressing the data, we can approximately recover the original features:

X_{rec} = Z \, U_{reduce}^T

The implementation:

function X_rec = recoverData(Z, U, K)
%RECOVERDATA Recovers an approximation of the original data when using the
%projected data
%   X_rec = RECOVERDATA(Z, U, K) recovers an approximation of the
%   original data that has been reduced to K dimensions. It returns the
%   approximate reconstruction in X_rec.
%

% You need to return the following variables correctly.
X_rec = zeros(size(Z, 1), size(U, 1));

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the approximation of the data by projecting back
%               onto the original space using the top K eigenvectors in U.
%
%               For the i-th example Z(i,:), the (approximate)
%               recovered data for dimension j is given as follows:
%                    v = Z(i, :)';
%                    recovered_j = v' * U(j, 1:K)';
%
%               Notice that U(j, 1:K) is a row vector.
%

X_rec = Z * U(:, 1:K)';

% =============================================================

end

Visualizing the projection:
[Figure: the normalized data points and their projections onto the first principal component]
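
Tying the pieces together, an end-to-end sketch of the PCA pipeline on the 2D example (the data file name ex7data1.mat is an assumption based on the assignment materials):

% Load the example dataset (file name assumed from the assignment)
load('ex7data1.mat');                     % provides the matrix X (m x 2)
[X_norm, mu, sigma] = featureNormalize(X);
[U, S] = pca(X_norm);
K = 1;                                    % reduce from 2 dimensions to 1
Z = projectData(X_norm, U, K);            % projected data
X_rec = recoverData(Z, U, K);             % approximate reconstruction
% The reconstruction lies along the first principal component;
% X_norm - X_rec gives the projection error for each example.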
