Bisecting k-means: a Python implementation
Source: Internet · Editor: 程序博客网 · Time: 2024/05/18 23:53
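The bisecting k-means algorithm:
Idea of the algorithm:
Start by treating all points as a single cluster and split that cluster in two. Then, at each step, choose the cluster whose split most reduces the clustering cost function (the sum of squared errors, SSE) and split it into two clusters. Repeat until the number of clusters equals the user-specified number k.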
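As a minimal sketch of the cost function mentioned above, the snippet below computes the SSE of a clustering in plain Python. The function name `sse` and the toy points are assumptions made for illustration; they are not part of the original code.

```python
def sse(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        center = centroids[label]
        total += sum((p - c) ** 2 for p, c in zip(point, center))
    return total


# Toy data (hypothetical): two well-separated groups on a line.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]

# One cluster: every point is assigned to the overall mean (6, 0).
one_cluster = sse(points, [0, 0, 0, 0], [(6.0, 0.0)])

# Two clusters after bisecting: centroids (1, 0) and (11, 0).
two_clusters = sse(points, [0, 0, 1, 1], [(1.0, 0.0), (11.0, 0.0)])

print(one_cluster, two_clusters)
```

On this toy data the split lowers the SSE, which is the quantity bisecting k-means tries to reduce at every step.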
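Pseudocode:
*************************************************************
Treat all data points as one cluster
While the number of clusters is less than k
    For each cluster
        Run k-means (k = 2) on that cluster
        Compute the total error after the split
    Split the cluster whose split gives the lowest total error
*************************************************************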
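The selection step can be sketched as follows, assuming (as step 5 of the implementation below does) that the cluster chosen is the one whose split yields the lowest total SSE. The total after splitting cluster i is the SSE of its two sub-clusters plus the unchanged SSE of every other cluster. The function name `choose_split` and the numbers are hypothetical.

```python
def choose_split(cluster_sse, split_sse):
    """Return (index, total) of the cluster whose split gives the lowest total SSE.

    cluster_sse[i]: SSE of cluster i before any split.
    split_sse[i]:   SSE of cluster i's points after 2-means is run on it.
    """
    best_index, best_total = None, float("inf")
    all_sse = sum(cluster_sse)
    for i, (before, after) in enumerate(zip(cluster_sse, split_sse)):
        total = all_sse - before + after  # the other clusters stay as they are
        if total < best_total:
            best_index, best_total = i, total
    return best_index, best_total


# Hypothetical numbers: cluster 1 has the largest SSE on its own,
# but splitting cluster 0 removes more error overall.
cluster_sse = [50.0, 60.0, 10.0]
split_sse = [5.0, 40.0, 4.0]
best_index, best_total = choose_split(cluster_sse, split_sse)
print(best_index, best_total)
```

The example is chosen to show that the cluster with the largest error is not necessarily the best one to split; what matters is the total error that remains after the split.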
Python implementation:
import numpy as np
import matplotlib.pyplot as plt


def createCenter(dataSet, k):
    # pick k rows of dataSet at random as the initial centroids
    n, d = np.shape(dataSet)
    centroids = np.zeros((k, d))
    # distinct rows whenever possible, so two centroids do not start on the same sample
    rows = np.random.choice(n, size=k, replace=(n < k))
    for i in range(k):
        centroids[i, :] = dataSet[rows[i], :]
    return centroids


def getDist(vec1, vec2):
    # Euclidean distance between two vectors
    return np.sqrt(np.sum(np.power(vec1 - vec2, 2)))


def kmeans(dataSet, k):
    n = np.shape(dataSet)[0]
    # column 0: cluster index, column 1: squared distance to that centroid
    clusterAssment = np.matrix(np.zeros((n, 2)))
    centroids = createCenter(dataSet, k)
    clusterChanged = True
    while clusterChanged:
        clusterChanged = False
        for i in range(n):
            minDist = np.inf
            minIndex = -1
            for j in range(k):
                distJI = getDist(dataSet[i, :], centroids[j, :])
                if distJI < minDist:
                    minDist = distJI
                    minIndex = j
            # convergence condition: the assignments no longer change
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True
            clusterAssment[i, :] = minIndex, minDist ** 2
        # update centroids
        for i in range(k):
            ptsInCluster = dataSet[np.nonzero(clusterAssment[:, 0].A == i)[0]]
            if ptsInCluster.shape[0] > 0:  # keep the old centroid if the cluster is empty
                centroids[i, :] = np.mean(ptsInCluster, axis=0)
    return centroids, clusterAssment


def print_result(dataSet, k, centroids, clusterAssment):
    n, d = dataSet.shape
    if d != 2:
        print("Cannot draw!")
        return 1
    mark = ['or', 'ob', 'og', 'ok', '^r', '+r', 'sr', 'dr', '<r', 'pr']
    if k > len(mark):
        print("Sorry your k is too large")
        return 1
    # draw the samples, one marker style per cluster
    for i in range(n):
        markIndex = int(clusterAssment[i, 0])
        plt.plot(dataSet[i, 0], dataSet[i, 1], mark[markIndex])
    mark = ['Dr', 'Db', 'Dg', 'Dk', '^b', '+b', 'sb', 'db', '<b', 'pb']
    # draw the centroids
    for i in range(k):
        plt.plot(centroids[i, 0], centroids[i, 1], mark[i], markersize=12)
    plt.show()


def biKmeans(dataSet, k):
    dataSet = np.matrix(dataSet)  # the indexing below relies on np.matrix semantics
    numSamples = dataSet.shape[0]
    # first column stores which cluster this sample belongs to,
    # second column stores the error between this sample and its centroid
    clusterAssment = np.matrix(np.zeros((numSamples, 2)))

    # step 1: the init cluster is the whole data set
    centroid = np.mean(dataSet, axis=0).tolist()[0]
    centList = [centroid]
    for i in range(numSamples):
        clusterAssment[i, 1] = getDist(np.matrix(centroid), dataSet[i, :]) ** 2

    while len(centList) < k:
        # min sum of square error
        minSSE = np.inf
        numCurrCluster = len(centList)
        # for each cluster
        for i in range(numCurrCluster):
            # step 2: get samples in cluster i
            pointsInCurrCluster = dataSet[np.nonzero(clusterAssment[:, 0].A == i)[0], :]
            if pointsInCurrCluster.shape[0] < 2:
                continue  # a cluster with fewer than two samples cannot be split

            # step 3: cluster it to 2 sub-clusters using k-means
            centroids, splitClusterAssment = kmeans(pointsInCurrCluster, 2)

            # step 4: calculate the sum of square error after split this cluster
            splitSSE = np.sum(splitClusterAssment[:, 1])
            notSplitSSE = np.sum(clusterAssment[np.nonzero(clusterAssment[:, 0].A != i)[0], 1])
            currSplitSSE = splitSSE + notSplitSSE

            # step 5: find the best split cluster which has the min sum of square error
            if currSplitSSE < minSSE:
                minSSE = currSplitSSE
                bestCentroidToSplit = i
                bestNewCentroids = centroids.copy()
                bestClusterAssment = splitClusterAssment.copy()

        if np.isinf(minSSE):
            break  # no cluster can be split any further

        # step 6: modify the cluster index for adding new cluster
        bestClusterAssment[np.nonzero(bestClusterAssment[:, 0].A == 1)[0], 0] = numCurrCluster
        bestClusterAssment[np.nonzero(bestClusterAssment[:, 0].A == 0)[0], 0] = bestCentroidToSplit

        # step 7: update and append the centroids of the new 2 sub-cluster
        centList[bestCentroidToSplit] = bestNewCentroids[0, :].tolist()
        centList.append(bestNewCentroids[1, :].tolist())

        # step 8: update the index and error of the samples whose cluster have been changed
        # (nonzero() returns a (rows, cols) tuple; only the row indices are wanted)
        clusterAssment[np.nonzero(clusterAssment[:, 0].A == bestCentroidToSplit)[0], :] = bestClusterAssment

    plt.figure()
    print_result(dataSet, len(centList), np.matrix(centList), clusterAssment)
    print('Congratulations, cluster using bi-kmeans complete!')
    return np.matrix(centList), clusterAssment
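Here biKmeans(dataSet, k) is the body of the bisecting algorithm. Its process is roughly as follows:
1. Initialize the centroid and set up the data structures that are needed
2. Bisect every cluster and keep the best split
3. Update the cluster assignments, and with them the members of each cluster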
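The last step, writing the result of a split back into the global assignment, can be sketched in isolation as below. This is an illustration in plain Python lists rather than the matrix indexing used in biKmeans; the function name `apply_split` and its parameters are assumptions.

```python
def apply_split(labels, split_index, sub_labels, new_index):
    """Merge the result of a 2-way split back into the global label list.

    labels:      current cluster index of every sample.
    split_index: index of the cluster that was bisected.
    sub_labels:  0/1 labels from 2-means, one per sample of that cluster,
                 in the same order as those samples appear in `labels`.
    new_index:   index given to the newly created cluster.
    """
    updated = list(labels)
    sub_iter = iter(sub_labels)
    for pos, label in enumerate(labels):
        if label == split_index:
            # sub-cluster 0 keeps the old index, sub-cluster 1 gets the new one
            updated[pos] = split_index if next(sub_iter) == 0 else new_index
    return updated


labels = [0, 1, 0, 1, 1]
updated = apply_split(labels, split_index=1, sub_labels=[0, 1, 1], new_index=2)
print(updated)
```

Samples outside the split cluster keep their labels, so only the members of the bisected cluster need to be touched in each round.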
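Result of the partition:
Advantages of bisecting:
- Bisecting k-means can run faster than plain k-means, because fewer similarity computations are needed
- It is less sensitive to initialization, because fewer random points are drawn and each step keeps the split with the lowest error
Result of plain k-means: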