Bisecting k-means in Python

Source: Internet · Site: 程序博客网 (programming blog) · Date: 2024/05/18 23:53

The bisecting k-means algorithm

Idea:

Start by treating all points as a single cluster, then split that cluster in two. At each subsequent step, choose the cluster whose two-way split most reduces the clustering cost function (the sum of squared errors, SSE) and split it into two clusters. Repeat until the number of clusters equals the user-specified k.

Pseudocode:

*************************************************************
Treat all data points as a single cluster
While the number of clusters is less than k:
    For each cluster:
        Run k-means clustering (k = 2) on that cluster
        Compute the total error after the split
    Perform the split on the cluster whose split yields the lowest total error
*************************************************************
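The loop above can be sketched compactly with NumPy alone. This is a minimal illustration under my own naming (`kmeans_2`, `bisecting_kmeans` are not from the implementation below); it follows the same trial-split-and-keep-the-best logic:

```python
import numpy as np

def kmeans_2(points, iters=10, seed=0):
    """Plain k-means with k=2; returns labels and per-point squared errors."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=2, replace=False)].astype(float)
    for _ in range(iters):
        d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(2):
            if (labels == j).any():
                centers[j] = points[labels == j].mean(axis=0)
    d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d.argmin(axis=1)
    return labels, d[np.arange(len(points)), labels]

def bisecting_kmeans(points, k, seed=0):
    labels = np.zeros(len(points), dtype=int)  # everyone starts in cluster 0
    errors = ((points - points.mean(axis=0)) ** 2).sum(axis=1)
    n_clusters = 1
    while n_clusters < k:
        best_sse, best = np.inf, None
        for c in range(n_clusters):            # trial-split every cluster
            mask = labels == c
            if mask.sum() < 2:
                continue
            sub_labels, sub_err = kmeans_2(points[mask], seed=seed)
            if len(np.unique(sub_labels)) < 2:  # skip degenerate splits
                continue
            sse = sub_err.sum() + errors[~mask].sum()
            if sse < best_sse:                  # keep the split with the lowest total SSE
                best_sse, best = sse, (c, mask, sub_labels, sub_err)
        if best is None:
            break
        c, mask, sub_labels, sub_err = best
        labels[mask] = np.where(sub_labels == 1, n_clusters, c)
        errors[mask] = sub_err
        n_clusters += 1
    return labels
```

Note that the inner k-means is only ever run with k = 2, which is what keeps the per-step cost low.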

Python implementation:

from numpy import *
import matplotlib.pyplot as plt

def createCenter(dataSet, k):
    # pick k random samples from the data set as the initial centroids
    n = shape(dataSet)[0]
    d = shape(dataSet)[1]
    centroids = zeros((k, d))
    for i in range(k):
        c = int(random.uniform(0, n - 1))
        centroids[i, :] = dataSet[c, :]
    return centroids

def getDist(vec1, vec2):
    # Euclidean distance
    return sqrt(sum(power(vec1 - vec2, 2)))

def kmeans(dataSet, k):
    n = shape(dataSet)[0]
    clusterAssment = mat(zeros((n, 2)))
    centroids = createCenter(dataSet, k)

    clusterChanged = True
    while clusterChanged:
        clusterChanged = False
        for i in range(n):
            minDist = inf
            minIndex = -1
            for j in range(k):
                distJI = getDist(dataSet[i, :], centroids[j, :])
                if distJI < minDist:
                    minDist = distJI
                    minIndex = j
            if clusterAssment[i, 0] != minIndex:  # convergence condition: assignments no longer change
                clusterChanged = True
            clusterAssment[i, :] = minIndex, minDist ** 2

        # update centroids
        for i in range(k):
            ptsInCluster = dataSet[nonzero(clusterAssment[:, 0].A == i)[0]]
            centroids[i, :] = mean(ptsInCluster, axis=0)
    return centroids, clusterAssment

def print_result(dataSet, k, centroids, clusterAssment):
    n, d = dataSet.shape
    if d != 2:
        print("Cannot draw!")
        return 1
    mark = ['or', 'ob', 'og', 'ok', '^r', '+r', 'sr', 'dr', '<r', 'pr']
    if k > len(mark):
        print("Sorry, your k is too large")
        return 1
    for i in range(n):
        markIndex = int(clusterAssment[i, 0])
        plt.plot(dataSet[i, 0], dataSet[i, 1], mark[markIndex])
    # draw the centroids
    mark = ['Dr', 'Db', 'Dg', 'Dk', '^b', '+b', 'sb', 'db', '<b', 'pb']
    for i in range(k):
        plt.plot(centroids[i, 0], centroids[i, 1], mark[i], markersize=12)
    plt.show()

def biKmeans(dataSet, k):
    numSamples = dataSet.shape[0]
    # first column stores which cluster each sample belongs to,
    # second column stores the squared error between the sample and its centroid
    clusterAssment = mat(zeros((numSamples, 2)))

    # step 1: the initial cluster is the whole data set
    centroid = mean(dataSet, axis=0).tolist()[0]
    centList = [centroid]
    for i in range(numSamples):
        clusterAssment[i, 1] = getDist(mat(centroid), dataSet[i, :]) ** 2

    while len(centList) < k:
        minSSE = inf  # min sum of squared errors
        numCurrCluster = len(centList)
        # try splitting each cluster
        for i in range(numCurrCluster):
            # step 2: get the samples in cluster i
            pointsInCurrCluster = dataSet[nonzero(clusterAssment[:, 0].A == i)[0], :]

            # step 3: split it into 2 sub-clusters using k-means
            centroids, splitClusterAssment = kmeans(pointsInCurrCluster, 2)

            # step 4: total sum of squared errors if this cluster is split
            splitSSE = sum(splitClusterAssment[:, 1])
            notSplitSSE = sum(clusterAssment[nonzero(clusterAssment[:, 0].A != i)[0], 1])
            currSplitSSE = splitSSE + notSplitSSE

            # step 5: keep the split that yields the minimum total SSE
            if currSplitSSE < minSSE:
                minSSE = currSplitSSE
                bestCentroidToSplit = i
                bestNewCentroids = centroids.copy()
                bestClusterAssment = splitClusterAssment.copy()

        # step 6: remap sub-cluster indices: 1 -> new cluster id, 0 -> the split cluster's id
        bestClusterAssment[nonzero(bestClusterAssment[:, 0].A == 1)[0], 0] = numCurrCluster
        bestClusterAssment[nonzero(bestClusterAssment[:, 0].A == 0)[0], 0] = bestCentroidToSplit

        # step 7: replace the split centroid and append the new one
        centList[bestCentroidToSplit] = bestNewCentroids[0, :].tolist()
        centList.append(bestNewCentroids[1, :].tolist())

        # step 8: update the index and error of the samples whose cluster has changed
        clusterAssment[nonzero(clusterAssment[:, 0].A == bestCentroidToSplit)[0], :] = bestClusterAssment

    plt.figure()
    print_result(dataSet, len(centList), mat(centList), clusterAssment)
    print('Congratulations, cluster using bi-kmeans complete!')
    return mat(centList), clusterAssment

Here, biKmeans(dataSet, k) is the main body of the bisecting algorithm. The procedure is roughly:

1. Initialize the centroid and set up the required data structures

2. Trial-split every cluster in two and keep the best split

3. Update the membership of each cluster
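The bookkeeping in step 3 (the index remapping in step 6 of the code) deserves a small illustration: when cluster i is split, sub-cluster 0 keeps index i and sub-cluster 1 receives the next free index. A toy sketch with made-up labels:

```python
import numpy as np

# Suppose 3 clusters exist (indices 0..2) and cluster 1 was chosen for splitting.
labels = np.array([0, 1, 1, 2, 1, 0])
split_labels = np.array([0, 1, 1])  # k-means(k=2) labels for the points of cluster 1
n_clusters = 3

# sub-cluster 1 -> new index 3, sub-cluster 0 -> the old index 1
remapped = np.where(split_labels == 1, n_clusters, 1)
labels[labels == 1] = remapped
print(labels)  # -> [0 1 3 2 3 0]
```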

Result of the bisecting clustering (figure):


Advantages of bisecting k-means:

  • It can speed up k-means, because fewer similarity (distance) computations are needed overall
  • It is much less sensitive to the initialization problem, because fewer random starting points are chosen and each step keeps the split with the smallest total error

Result of plain k-means (figure):

