Bisecting k-means in Python

Source: Internet · Site: 程序博客网 (programming blog) · Date: 2024/05/18 23:53

The bisecting k-means algorithm

Idea:

Start by treating all points as a single cluster, then split that cluster in two. At each subsequent step, choose the cluster whose two-way split most reduces the clustering cost function (the sum of squared errors, SSE) and split it into two clusters. Repeat until the number of clusters equals the user-specified k.

Pseudocode:

*************************************************************
Treat all data points as a single cluster
While the number of clusters is less than k:
    For each cluster:
        Run k-means clustering (k = 2) on that cluster
        Compute the total error after the split
    Perform the split on the cluster whose split yields the lowest total error
*************************************************************
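The loop above can be sketched compactly with NumPy alone. This is a minimal illustration under my own naming (`kmeans_2`, `bisecting_kmeans` are not from the implementation below); it follows the same trial-split-and-keep-the-best logic:

```python
import numpy as np

def kmeans_2(points, iters=10, seed=0):
    """Plain k-means with k=2; returns labels and per-point squared errors."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=2, replace=False)].astype(float)
    for _ in range(iters):
        d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(2):
            if (labels == j).any():
                centers[j] = points[labels == j].mean(axis=0)
    d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d.argmin(axis=1)
    return labels, d[np.arange(len(points)), labels]

def bisecting_kmeans(points, k, seed=0):
    labels = np.zeros(len(points), dtype=int)  # everyone starts in cluster 0
    errors = ((points - points.mean(axis=0)) ** 2).sum(axis=1)
    n_clusters = 1
    while n_clusters < k:
        best_sse, best = np.inf, None
        for c in range(n_clusters):            # trial-split every cluster
            mask = labels == c
            if mask.sum() < 2:
                continue
            sub_labels, sub_err = kmeans_2(points[mask], seed=seed)
            if len(np.unique(sub_labels)) < 2:  # skip degenerate splits
                continue
            sse = sub_err.sum() + errors[~mask].sum()
            if sse < best_sse:                  # keep the split with the lowest total SSE
                best_sse, best = sse, (c, mask, sub_labels, sub_err)
        if best is None:
            break
        c, mask, sub_labels, sub_err = best
        labels[mask] = np.where(sub_labels == 1, n_clusters, c)
        errors[mask] = sub_err
        n_clusters += 1
    return labels
```

Note that the inner k-means is only ever run with k = 2, which is what keeps the per-step cost low.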

Python implementation:

from numpy import *
import matplotlib.pyplot as plt

def createCenter(dataSet, k):
    # pick k random samples from the data set as the initial centroids
    n = shape(dataSet)[0]
    d = shape(dataSet)[1]
    centroids = zeros((k, d))
    for i in range(k):
        c = int(random.uniform(0, n - 1))
        centroids[i, :] = dataSet[c, :]
    return centroids

def getDist(vec1, vec2):
    # Euclidean distance
    return sqrt(sum(power(vec1 - vec2, 2)))

def kmeans(dataSet, k):
    n = shape(dataSet)[0]
    clusterAssment = mat(zeros((n, 2)))
    centroids = createCenter(dataSet, k)

    clusterChanged = True
    while clusterChanged:
        clusterChanged = False
        for i in range(n):
            minDist = inf
            minIndex = -1
            for j in range(k):
                distJI = getDist(dataSet[i, :], centroids[j, :])
                if distJI < minDist:
                    minDist = distJI
                    minIndex = j
            if clusterAssment[i, 0] != minIndex:  # convergence condition: assignments no longer change
                clusterChanged = True
            clusterAssment[i, :] = minIndex, minDist ** 2

        # update centroids
        for i in range(k):
            ptsInCluster = dataSet[nonzero(clusterAssment[:, 0].A == i)[0]]
            centroids[i, :] = mean(ptsInCluster, axis=0)
    return centroids, clusterAssment

def print_result(dataSet, k, centroids, clusterAssment):
    n, d = dataSet.shape
    if d != 2:
        print("Cannot draw!")
        return 1
    mark = ['or', 'ob', 'og', 'ok', '^r', '+r', 'sr', 'dr', '<r', 'pr']
    if k > len(mark):
        print("Sorry, your k is too large")
        return 1
    for i in range(n):
        markIndex = int(clusterAssment[i, 0])
        plt.plot(dataSet[i, 0], dataSet[i, 1], mark[markIndex])
    # draw the centroids
    mark = ['Dr', 'Db', 'Dg', 'Dk', '^b', '+b', 'sb', 'db', '<b', 'pb']
    for i in range(k):
        plt.plot(centroids[i, 0], centroids[i, 1], mark[i], markersize=12)
    plt.show()

def biKmeans(dataSet, k):
    numSamples = dataSet.shape[0]
    # first column stores which cluster each sample belongs to,
    # second column stores the squared error between the sample and its centroid
    clusterAssment = mat(zeros((numSamples, 2)))

    # step 1: the initial cluster is the whole data set
    centroid = mean(dataSet, axis=0).tolist()[0]
    centList = [centroid]
    for i in range(numSamples):
        clusterAssment[i, 1] = getDist(mat(centroid), dataSet[i, :]) ** 2

    while len(centList) < k:
        minSSE = inf  # min sum of squared errors
        numCurrCluster = len(centList)
        # try splitting each cluster
        for i in range(numCurrCluster):
            # step 2: get the samples in cluster i
            pointsInCurrCluster = dataSet[nonzero(clusterAssment[:, 0].A == i)[0], :]

            # step 3: split it into 2 sub-clusters using k-means
            centroids, splitClusterAssment = kmeans(pointsInCurrCluster, 2)

            # step 4: total sum of squared errors if this cluster is split
            splitSSE = sum(splitClusterAssment[:, 1])
            notSplitSSE = sum(clusterAssment[nonzero(clusterAssment[:, 0].A != i)[0], 1])
            currSplitSSE = splitSSE + notSplitSSE

            # step 5: keep the split that yields the minimum total SSE
            if currSplitSSE < minSSE:
                minSSE = currSplitSSE
                bestCentroidToSplit = i
                bestNewCentroids = centroids.copy()
                bestClusterAssment = splitClusterAssment.copy()

        # step 6: remap sub-cluster indices: 1 -> new cluster id, 0 -> the split cluster's id
        bestClusterAssment[nonzero(bestClusterAssment[:, 0].A == 1)[0], 0] = numCurrCluster
        bestClusterAssment[nonzero(bestClusterAssment[:, 0].A == 0)[0], 0] = bestCentroidToSplit

        # step 7: replace the split centroid and append the new one
        centList[bestCentroidToSplit] = bestNewCentroids[0, :].tolist()
        centList.append(bestNewCentroids[1, :].tolist())

        # step 8: update the index and error of the samples whose cluster has changed
        clusterAssment[nonzero(clusterAssment[:, 0].A == bestCentroidToSplit)[0], :] = bestClusterAssment

    plt.figure()
    print_result(dataSet, len(centList), mat(centList), clusterAssment)
    print('Congratulations, cluster using bi-kmeans complete!')
    return mat(centList), clusterAssment

Here, biKmeans(dataSet, k) is the main body of the bisecting algorithm. The procedure is roughly:

1. Initialize the centroid and set up the required data structures

2. Trial-split every cluster in two and keep the best split

3. Update the membership of each cluster
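The bookkeeping in step 3 (the index remapping in step 6 of the code) deserves a small illustration: when cluster i is split, sub-cluster 0 keeps index i and sub-cluster 1 receives the next free index. A toy sketch with made-up labels:

```python
import numpy as np

# Suppose 3 clusters exist (indices 0..2) and cluster 1 was chosen for splitting.
labels = np.array([0, 1, 1, 2, 1, 0])
split_labels = np.array([0, 1, 1])  # k-means(k=2) labels for the points of cluster 1
n_clusters = 3

# sub-cluster 1 -> new index 3, sub-cluster 0 -> the old index 1
remapped = np.where(split_labels == 1, n_clusters, 1)
labels[labels == 1] = remapped
print(labels)  # -> [0 1 3 2 3 0]
```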

Result of the bisecting clustering (figure):


Advantages of bisecting k-means:

  • It can speed up k-means, because fewer similarity (distance) computations are needed overall
  • It is much less sensitive to the initialization problem, because fewer random starting points are chosen and each step keeps the split with the smallest total error

Result of plain k-means (figure):

