Machine Learning Study Group, Note 35: Bisecting K-means Clustering

Source: Internet · Editor: 程序博客网 · 2024/06/06 05:24

Standard K-means typically converges only to a local minimum of its cost function. Bisecting K-means mitigates this: rather than starting by randomly choosing K cluster centers, it first puts all points into a single cluster and then splits that cluster in two. It computes the cost (the sum of squared errors, SSE) of each candidate division, keeps the split that reduces the total error the most, and repeats the process until the desired number of clusters is reached. (A common variant instead always splits the cluster that currently has the largest SSE.)

The main idea of bisecting K-means:

First treat all the points as one cluster, then split that cluster in two. After that, repeatedly choose the split that most reduces the clustering cost function (the sum of squared errors) and divide that cluster into two. Continue in this way until the number of clusters equals the user-specified number K.
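The cost being minimized is the sum of squared errors (SSE): the total squared Euclidean distance from each point to its cluster's centroid. A minimal NumPy sketch (the function name `cluster_sse` and the sample points are illustrative, not from the original article):

```python
import numpy as np

def cluster_sse(points, centroid):
    """Sum of squared Euclidean distances from each point to the centroid."""
    return float(np.sum((points - centroid) ** 2))

pts = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
c = pts.mean(axis=0)           # centroid of the cluster: (1, 1)
print(cluster_sse(pts, c))     # each corner contributes 1 + 1 = 2, so 8.0
```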

The flow of bisecting K-means is roughly as follows:

Initialize the cluster list to contain a single cluster holding all the points.
repeat
{
    {run several trial bisections of the selected cluster}
    for i = 1 to number_of_trials do
        bisect the selected cluster using basic k-means
    end for
    Keep the two clusters from the trial with the smallest total error.
    Add these two clusters to the cluster list.
} until the cluster list contains k clusters
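The loop above can be sketched in Python with NumPy. This is an illustrative sketch, not the article's MATLAB code: the `two_means` helper here uses deterministic seeding (the first point plus the point farthest from it) in place of the random restarts in the pseudocode, so the example is reproducible; all names (`two_means`, `bisecting_kmeans`) are mine.

```python
import numpy as np

def two_means(X, iters=100):
    """Bisect one cluster with basic 2-means (deterministic farthest-point seeding)."""
    cent = np.stack([X[0], X[np.argmax(((X - X[0]) ** 2).sum(axis=1))]])
    for _ in range(iters):
        d = ((X[:, None, :] - cent[None, :, :]) ** 2).sum(axis=2)  # (n, 2) distances
        labels = d.argmin(axis=1)
        new = np.stack([X[labels == j].mean(axis=0) if np.any(labels == j) else cent[j]
                        for j in range(2)])
        if np.allclose(new, cent):
            break
        cent = new
    err = ((X - cent[labels]) ** 2).sum(axis=1)  # per-point squared error
    return labels, err

def bisecting_kmeans(X, K):
    """Split clusters until there are K, always keeping the split with lowest total SSE."""
    labels = np.zeros(len(X), dtype=int)              # one cluster holding all points
    errs = ((X - X.mean(axis=0)) ** 2).sum(axis=1)    # per-point SSE contribution
    num = 1
    while num < K:
        best_sse, best_j, best_lab, best_err = np.inf, -1, None, None
        for j in range(num):                          # try bisecting every cluster
            mask = labels == j
            lab, err = two_means(X[mask])
            total = err.sum() + errs[~mask].sum()     # total SSE if this split is kept
            if total < best_sse:
                best_sse, best_j, best_lab, best_err = total, j, lab, err
        mask = labels == best_j                       # commit the best split:
        labels[mask] = np.where(best_lab == 0, best_j, num)  # half 1 keeps the index,
        errs[mask] = best_err                                # half 2 gets a new one
        num += 1
    return labels

# Two well-separated blobs should end up in two different clusters.
X = np.vstack([np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float),
               np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float) + 10.0])
print(bisecting_kmeans(X, 2))   # -> [0 0 0 0 1 1 1 1]
```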

Bisecting K-means is less susceptible to bad local minima than plain K-means, but it does not guarantee a globally optimal clustering: it is a greedy algorithm, and the repeated trial bisections also make its computational cost considerable.

Appendix: MATLAB implementation

function bikMeans
%%
clc
clear
close all
%%
biK = 4;
biDataSet = load('testSet.txt');
[row,col] = size(biDataSet);
% matrix of cluster centroids
biCentSet = zeros(biK,col);
% start with a single cluster
numCluster = 1;
% column 1: index of the assigned centroid; column 2: squared distance to it
biClusterAssume = zeros(row,2);
% initialize the first centroid as the mean of all points
biCentSet(1,:) = mean(biDataSet);
for i = 1:row
    biClusterAssume(i,1) = numCluster;
    biClusterAssume(i,2) = distEclud(biDataSet(i,:),biCentSet(1,:));
end
while numCluster < biK
    minSSE = inf;  % was a magic constant (10000); inf is safe for any data scale
    % find the cluster whose bisection gives the smallest total SSE
    for j = 1:numCluster
        curCluster = biDataSet(biClusterAssume(:,1) == j,:);
        [splitCentSet,splitClusterAssume] = kMeans(curCluster,2);
        splitSSE = sum(splitClusterAssume(:,2));
        noSplitSSE = sum(biClusterAssume(biClusterAssume(:,1) ~= j,2));
        curSSE = splitSSE + noSplitSSE;
        fprintf('Total SSE if cluster %d is split: %f\n', j, curSSE);
        if curSSE < minSSE
            minSSE = curSSE;
            bestClusterToSplit = j;
            bestClusterAssume = splitClusterAssume;
            bestCentSet = splitCentSet;
        end
    end
    bestClusterToSplit   % echo intermediate results (no semicolon on purpose)
    bestCentSet
    % one more cluster now exists
    numCluster = numCluster + 1;
    % relabel the two halves: half 1 keeps the old index, half 2 gets a new one
    bestClusterAssume(bestClusterAssume(:,1) == 1,1) = bestClusterToSplit;
    bestClusterAssume(bestClusterAssume(:,1) == 2,1) = numCluster;
    % update and append centroid coordinates
    biCentSet(bestClusterToSplit,:) = bestCentSet(1,:);
    biCentSet(numCluster,:) = bestCentSet(2,:);
    biCentSet
    % update the assignment and error of every point in the split cluster
    biClusterAssume(biClusterAssume(:,1) == bestClusterToSplit,:) = bestClusterAssume;
end
figure
for i = 1:biK
    pointCluster = find(biClusterAssume(:,1) == i);
    scatter(biDataSet(pointCluster,1),biDataSet(pointCluster,2),5)
    hold on
end
scatter(biCentSet(:,1),biCentSet(:,2),300,'+')
hold off
end

% squared Euclidean distance
function dist = distEclud(vecA,vecB)
dist = sum(power((vecA-vecB),2));
end

% basic K-means
function [centSet,clusterAssment] = kMeans(dataSet,K)
[row,col] = size(dataSet);
% matrix of cluster centroids
centSet = zeros(K,col);
% random initial centroids, drawn uniformly within the range of each column
for i = 1:col
    minV = min(dataSet(:,i));
    rangV = max(dataSet(:,i)) - minV;
    centSet(:,i) = repmat(minV,[K,1]) + rangV*rand(K,1);
end
% column 1: assigned cluster; column 2: squared distance to its centroid
clusterAssment = zeros(row,2);
clusterChange = true;
while clusterChange
    clusterChange = false;
    % assign every point to its nearest centroid (could be vectorized)
    for i = 1:row
        minDist = inf;
        minIndex = 0;
        for j = 1:K
            distCal = distEclud(dataSet(i,:),centSet(j,:));
            if distCal < minDist
                minDist = distCal;
                minIndex = j;
            end
        end
        if minIndex ~= clusterAssment(i,1)
            clusterChange = true;
        end
        clusterAssment(i,1) = minIndex;
        clusterAssment(i,2) = minDist;
    end
    % recompute each centroid; mean(...,1) guards the single-point-cluster case
    for j = 1:K
        simpleCluster = find(clusterAssment(:,1) == j);
        if ~isempty(simpleCluster)  % an empty cluster keeps its old centroid
            centSet(j,:) = mean(dataSet(simpleCluster,:),1);
        end
    end
end
end