Coursera "Machine Learning" Programming Exercise 7: K-means Clustering and Principal Component Analysis
Source: Internet · Editor: 程序博客网 · Published: 2024/06/06 03:22
K-means
K-means is an iterative algorithm: it takes an unlabeled dataset and groups the data into clusters. Suppose we want to partition the data into K clusters; the procedure is:
1. First choose K random points, called cluster centroids.
2. For each example in the dataset, compute its distance to each of the K centroids and assign it to the closest one; all examples assigned to the same centroid form one cluster.
3. For each cluster, compute the mean of its points and move the cluster's centroid to that mean.
4. Repeat steps 2-3 until the centroids no longer move.
Described in MATLAB:

% Initialize centroids
centroids = kMeansInitCentroids(X, K);
for iter = 1:iterations
    % Cluster assignment step: Assign each data point to the
    % closest centroid. idx(i) corresponds to c^(i), the index
    % of the centroid assigned to example i
    idx = findClosestCentroids(X, centroids);
    % Move centroid step: Compute means based on centroid
    % assignments
    centroids = computeMeans(X, idx, K);
end
We first run K-means on a simple 2D dataset to get an intuitive feel for the algorithm.
Random initialization
The assignment simply fixes K = 3 and sets the initial centroids to [3 3; 6 2; 8 5].
In practice, one initialization strategy is to choose K random examples as the centroids; the code is as follows:
function centroids = kMeansInitCentroids(X, K)
%KMEANSINITCENTROIDS This function initializes K centroids that are to be
%   used in K-Means on the dataset X
%   centroids = KMEANSINITCENTROIDS(X, K) returns K initial centroids to be
%   used with the K-Means on the dataset X
%

% You should return these values correctly
centroids = zeros(K, size(X, 2));

% ====================== YOUR CODE HERE ======================
% Instructions: You should set centroids to randomly chosen examples from
%               the dataset X
%

% Initialize the centroids to be random examples

% Randomly reorder the indices of examples
randidx = randperm(size(X, 1));
% Take the first K examples as centroids
centroids = X(randidx(1:K), :);

% =============================================================
end
One problem with K-means is that it can get stuck in a local minimum, depending on the initialization.
To address this, we usually run K-means many times, each time with a fresh random initialization, and then compare the results and keep the run with the lowest cost. This works well when K is small (2-10); when K is large, multiple restarts may not bring a noticeable improvement.
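The multiple-restart strategy can be sketched in NumPy (a Python translation of the idea rather than the assignment's MATLAB code; `run_kmeans` and `kmeans_with_restarts` are my own illustrative names):

```python
import numpy as np

def run_kmeans(X, K, iters, rng):
    """One K-means run initialized from K random examples."""
    centroids = X[rng.permutation(len(X))[:K]].astype(float)
    idx = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Assignment step: index of the nearest centroid for every example
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        idx = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        for k in range(K):
            if np.any(idx == k):
                centroids[k] = X[idx == k].mean(axis=0)
    # Distortion J: average squared distance to the assigned centroid
    cost = ((X - centroids[idx]) ** 2).sum() / len(X)
    return centroids, idx, cost

def kmeans_with_restarts(X, K, iters=10, restarts=20, seed=0):
    """Run K-means several times and keep the lowest-cost result."""
    rng = np.random.default_rng(seed)
    runs = [run_kmeans(X, K, iters, rng) for _ in range(restarts)]
    return min(runs, key=lambda r: r[2])
```

Each restart draws a different random initialization from the shared generator, so the `min` over runs implements exactly the "pick the lowest-cost run" rule described above.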
Finding the closest centroids
For each example i, find the centroid j that minimizes the squared Euclidean distance:
c^(i) := argmin_j || x^(i) - mu_j ||^2
The code is as follows:
function idx = findClosestCentroids(X, centroids)
%FINDCLOSESTCENTROIDS computes the centroid memberships for every example
%   idx = FINDCLOSESTCENTROIDS (X, centroids) returns the closest centroids
%   in idx for a dataset X where each row is a single example. idx = m x 1
%   vector of centroid assignments (i.e. each entry in range [1..K])
%

% Set K
K = size(centroids, 1);

% You need to return the following variables correctly.
idx = zeros(size(X,1), 1);

% ====================== YOUR CODE HERE ======================
% Instructions: Go over every example, find its closest centroid, and store
%               the index inside idx at the appropriate location.
%               Concretely, idx(i) should contain the index of the centroid
%               closest to example i. Hence, it should be a value in the
%               range 1..K
%
% Note: You can use a for-loop over the examples to compute this.
%

for i = 1:size(X,1)
    % Squared Euclidean distance from example i to each of the K centroids
    M = sum((repmat(X(i,:), K, 1) - centroids).^2, 2);
    % min returns the index of the smallest distance; unlike
    % find(M == min(M)), it always yields a single index on ties
    [~, idx(i)] = min(M);
end

% ============================================================
end
Computing the centroids
The centroid of the k-th cluster is the mean of the examples assigned to it:
mu_k = (1 / |C_k|) * sum_{i in C_k} x^(i)
where C_k is the set of examples assigned to centroid k. The code is as follows:
function centroids = computeCentroids(X, idx, K)
%COMPUTECENTROIDS returns the new centroids by computing the means of the
%   data points assigned to each centroid.
%   centroids = COMPUTECENTROIDS(X, idx, K) returns the new centroids by
%   computing the means of the data points assigned to each centroid. It is
%   given a dataset X where each row is a single data point, a vector
%   idx of centroid assignments (i.e. each entry in range [1..K]) for each
%   example, and K, the number of centroids. You should return a matrix
%   centroids, where each row of centroids is the mean of the data points
%   assigned to it.
%

% Useful variables
[m n] = size(X);

% You need to return the following variables correctly.
centroids = zeros(K, n);

% ====================== YOUR CODE HERE ======================
% Instructions: Go over every centroid and compute mean of all points that
%               belong to it. Concretely, the row vector centroids(i, :)
%               should contain the mean of the data points assigned to
%               centroid i.
%
% Note: You can use a for-loop over the centroids to compute this.
%

for i = 1:K
    id = find(idx == i);
    % Mean along dimension 1; the explicit 1 keeps this correct even when
    % only one point is assigned (sum(X(id,:)) would collapse the row)
    centroids(i,:) = mean(X(id,:), 1);
end

% ============================================================
end
Running K-means
Combining the two steps yields a working K-means. After 10 iterations on the 2D dataset we get the figure below:
Principal Component Analysis
We first implement PCA on a simple 2D dataset.
The 2D dataset
Implementing PCA
The first step is mean normalization: compute the mean mu_j of every feature and replace x_j with x_j - mu_j. If the features are on different scales, we also divide each by its standard deviation sigma_j.
The second step is to compute the covariance matrix Sigma:
Sigma = (1/m) * X' * X
The third step is to compute the eigenvectors of this covariance matrix.
In MATLAB/Octave we can obtain them with a singular value decomposition: [U, S, V] = svd(Sigma).
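To make the covariance-plus-svd recipe concrete, here is a NumPy sketch on made-up 2D data whose variance lies mostly along the direction (1, 1) (the data is an assumption purely for illustration):

```python
import numpy as np

# Synthetic data: points near the line y = x, plus a little noise
rng = np.random.default_rng(0)
t = rng.normal(size=(200, 1))
X = np.hstack([t, t]) + 0.1 * rng.normal(size=(200, 2))

# Step 1: mean normalization
X = X - X.mean(axis=0)

# Step 2: covariance matrix Sigma = X'X / m
m = X.shape[0]
Sigma = X.T @ X / m

# Step 3: svd of the (symmetric) covariance matrix gives the
# eigenvectors as columns of U and the eigenvalues on the diagonal S
U, S, _ = np.linalg.svd(Sigma)
```

The first column of U comes out close to (1, 1)/sqrt(2), the direction of greatest variance, and the first eigenvalue S[0] dominates S[1], which is what the exercise's eigenvector plot shows.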
Feature normalization
function [X_norm, mu, sigma] = featureNormalize(X)
%FEATURENORMALIZE Normalizes the features in X
%   FEATURENORMALIZE(X) returns a normalized version of X where
%   the mean value of each feature is 0 and the standard deviation
%   is 1. This is often a good preprocessing step to do when
%   working with learning algorithms.

mu = mean(X);
X_norm = bsxfun(@minus, X, mu);

sigma = std(X_norm);
X_norm = bsxfun(@rdivide, X_norm, sigma);

% ============================================================
end
Implementation
Compute the covariance matrix and then its eigenvectors; the code is as follows:
function [U, S] = pca(X)
%PCA Run principal component analysis on the dataset X
%   [U, S, X] = pca(X) computes eigenvectors of the covariance matrix of X
%   Returns the eigenvectors U, the eigenvalues (on diagonal) in S
%

% Useful values
[m, n] = size(X);

% You need to return the following variables correctly.
U = zeros(n);
S = zeros(n);

% ====================== YOUR CODE HERE ======================
% Instructions: You should first compute the covariance matrix. Then, you
%               should use the "svd" function to compute the eigenvectors
%               and eigenvalues of the covariance matrix.
%
% Note: When computing the covariance matrix, remember to divide by m (the
%       number of examples).
%

Sigma = X' * X / m;
[U, S, V] = svd(Sigma);

% =========================================================================
end
Visualizing the eigenvectors
Dimensionality reduction with PCA
The matrix U returned by pca is n x n, and its columns are the directions that minimize the projection error of the data. To reduce the data from n to K dimensions, we take the first K columns of U to form an n x K matrix U_reduce, and compute the new feature vectors Z as:
Z = X * U_reduce
The code is as follows:
function Z = projectData(X, U, K)
%PROJECTDATA Computes the reduced data representation when projecting only
%   on to the top k eigenvectors
%   Z = projectData(X, U, K) computes the projection of
%   the normalized inputs X into the reduced dimensional space spanned by
%   the first K columns of U. It returns the projected examples in Z.
%

% You need to return the following variables correctly.
Z = zeros(size(X, 1), K);

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the projection of the data using only the top K
%               eigenvectors in U (first K columns).
%               For the i-th example X(i,:), the projection on to the k-th
%               eigenvector is given as follows:
%                   x = X(i, :)';
%                   projection_k = x' * U(:, k);
%

Z = X * U(:, 1:K);

% =============================================================
end
Recovering the data
Having compressed the data, we can approximately recover the original features as:
X_rec = Z * U_reduce'
The code is as follows:
function X_rec = recoverData(Z, U, K)
%RECOVERDATA Recovers an approximation of the original data when using the
%   projected data
%   X_rec = RECOVERDATA(Z, U, K) recovers an approximation of the
%   original data that has been reduced to K dimensions. It returns the
%   approximate reconstruction in X_rec.
%

% You need to return the following variables correctly.
X_rec = zeros(size(Z, 1), size(U, 1));

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the approximation of the data by projecting back
%               onto the original space using the top K eigenvectors in U.
%
%               For the i-th example Z(i,:), the (approximate)
%               recovered data for dimension j is given as follows:
%                   v = Z(i, :)';
%                   recovered_j = v' * U(j, 1:K)';
%
%               Notice that U(j, 1:K) is a row vector.
%

X_rec = Z * U(:, 1:K)';

% =============================================================
end
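Putting projection and recovery together, here is a NumPy round-trip sketch (the function names mirror the assignment's MATLAB ones but are my own Python stand-ins, and the orthonormal U is generated for illustration rather than taken from real data):

```python
import numpy as np

def project_data(X, U, K):
    # Z = X * U(:, 1:K): project onto the first K eigenvectors
    return X @ U[:, :K]

def recover_data(Z, U, K):
    # X_rec = Z * U(:, 1:K)': map back into the original n-dim space
    return Z @ U[:, :K].T

# Toy demonstration with an orthonormal U (as svd of a covariance returns)
rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # any orthonormal basis
X = rng.normal(size=(5, 3))

Z = project_data(X, U, 2)      # 5 x 2 compressed representation
X_rec = recover_data(Z, U, 2)  # 5 x 3 approximate reconstruction
```

Because U is orthonormal, keeping all K = n components recovers X exactly, and for K < n the residual X - X_rec is orthogonal to the retained subspace, which is why the recovered points in the exercise's plot lie on the projection line.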
Visualizing the projections