【机器学习实战04】k-均值聚类算法

来源:互联网 发布:windows中的用户账户 编辑:程序博客网 时间:2024/05/09 06:21

1、聚类定义

聚类是一种无监督学习,它将相似的对象归为一类,簇内的对象越相似,聚类的效果越好。k-均值首先发现k个不同的簇,且每个簇的中心采用簇中所含值的均值计算而成。

2、开发机器学习应用程序的步骤

(1)收集数据:收集各种样本数据,为了节省时间,可以使用公开的可用数据源
(2)准备输入数据:确保数据格式符合要求,本书采用的格式是Python语言的List。
(3)数据分析:人工分析以前得到的数据,确保数据集中没有垃圾数据。
(4)训练算法,这一步才是机器学习真正的开始,对于监督学习,我们把之前的格式化数据输入到算法中,得到分类模型,方便后续处理。对于无监督学习,不存在目标变量,因此也不需要训练算法!!!
(5)测试算法:这一步使用第4步机器学习的模型。这一步主要测试算法的工作效率。
(6)使用算法:将机器学习算法转换为应用程序,执行实际任务。

3、K-均值聚类

首先,随机确定k个初始点作为质心。然后为每个点找距离其最近的质心,将其分配到该质心所对应的簇中。完成这一步之后,更新每个簇的质心为该簇所有点的平均值。迭代以上过程,直到簇分配结果不再改变,完成K-均值聚类。

4、k-均值伪代码

创建k个点作为起始质心(随机选择)
当任意一个点的簇分配结果发生改变时
        对数据集中的每个数据点
                对每个质心
                        计算质心与数据点之间的距离
                将数据点分配到距其最近的簇
         对每一个簇,计算簇中所有点的均值并将均值作为质心

5、算法实现

k-均值聚类支持函数

#encoding:utf-8from numpy import *from math import *#加载数据集def loadDataSet(fileName):      #将一个文本文件导入到列表中    dataMat = []                #创建一个dataMat空列表    fr = open(fileName)    for line in fr.readlines():  #一行一行的读取文本文件        curLine = line.strip().split('\t')        fltLine = map(float,curLine) #将文本文件转化为字符型        dataMat.append(fltLine)    return dataMat#计算两个向量的欧式距离  def distEclud(vecA, vecB):          return sqrt(sum(power(vecA - vecB, 2))) #随机构建初始质心def randCent(dataSet, k):               #构建一个包含k个随机质心的集合        #数据列数    n = shape(dataSet)[1]    #初始化质心    centroids = mat(zeros((k,n)))    for j in range(n):                          #数据集中每一维的最小和最大值,保证随机选取的质心在边界之内        minJ = min(dataSet[:,j])         rangeJ = float(max(dataSet[:,j]) - minJ)        centroids[:,j] = mat(minJ + rangeJ * random.rand(k,1))    return centroids#数据导入data = loadDataSet('testSet.txt')print "data:"print dataprint "\n"#将数据按照矩阵方式输出dataMat = mat(data)print "dataMat:"print dataMatprint "\n"#矩阵中的最大值和最小值print "矩阵中的最大值和最小值:"min1 = min(dataMat[:,0])print "min(dataMat[:,0]):"print min1min2 = min(dataMat[:,1])print "min(dataMat[:,1]):"print min2max1 = max(dataMat[:,0])print "max(dataMat[:,0]):"print max1max2 = max(dataMat[:,1])print "max(dataMat[:,1]):"print max2print "\n"#k个随机质心的集合,随机质心必须在整个数据集的边界之内,也就是在最小值和最大值之间centroids = randCent(dataMat, 2)print "质心:"print centroidsprint "\n"#计算距离distance = distEclud(dataMat[0],dataMat[1])print "distance:"print distance

Output:

data(列表形式):[[1.658985, 4.285136], [-3.453687, 3.424321], [4.838138, -1.151539], [-5.379713, -3.362104], [0.972564, 2.924086], [-3.567919, 1.531611], [0.450614, -3.302219], [-3.487105, -1.724432], [2.668759, 1.594842], [-3.156485, 3.191137], [3.165506, -3.999838], [-2.786837, -3.099354], [4.208187, 2.984927], [-2.123337, 2.943366], [0.704199, -0.479481], [-0.39237, -3.963704], [2.831667, 1.574018], [-0.790153, 3.343144], [2.943496, -3.357075], [-3.195883, -2.283926], [2.336445, 2.875106], [-1.786345, 2.554248], [2.190101, -1.90602], [-3.403367, -2.778288], [1.778124, 3.880832], [-1.688346, 2.230267], [2.592976, -2.054368], [-4.007257, -3.207066], [2.257734, 3.387564], [-2.679011, 0.785119], [0.939512, -4.023563], [-3.674424, -2.261084], [2.046259, 2.735279], [-3.18947, 1.780269], [4.372646, -0.822248], [-2.579316, -3.497576], [1.889034, 5.1904], [-0.798747, 2.185588], [2.83652, -2.658556], [-3.837877, -3.253815], [2.096701, 3.886007], [-2.709034, 2.923887], [3.367037, -3.184789], [-2.121479, -4.232586], [2.329546, 3.179764], [-3.284816, 3.273099], [3.091414, -3.815232], [-3.762093, -2.432191], [3.542056, 2.778832], [-1.736822, 4.241041], [2.127073, -2.98368], [-4.323818, -3.938116], [3.792121, 5.135768], [-4.786473, 3.358547], [2.624081, -3.260715], [-4.009299, -2.978115], [2.493525, 1.96371], [-2.513661, 2.642162], [1.864375, -3.176309], [-3.171184, -3.572452], [2.89422, 2.489128], [-2.562539, 2.884438], [3.491078, -3.947487], [-2.565729, -2.012114], [3.332948, 3.983102], [-1.616805, 3.573188], [2.280615, -2.559444], [-2.651229, -3.103198], [2.321395, 3.154987], [-1.685703, 2.939697], [3.031012, -3.620252], [-4.599622, -2.185829], [4.196223, 1.126677], [-2.133863, 3.093686], [4.668892, -2.562705], [-2.793241, -2.149706], [2.884105, 3.043438], [-2.967647, 2.848696], [4.479332, -1.764772], [-4.905566, -2.91107]]dataMat(矩阵形式):[[ 1.658985  4.285136] [-3.453687  3.424321] [ 4.838138 -1.151539] [-5.379713 -3.362104] [ 0.972564  2.924086] [-3.567919  1.531611] [ 0.450614 -3.302219] [-3.487105 -1.724432] [ 2.668759  1.594842] [-3.156485  3.191137] [ 3.165506 -3.999838] [-2.786837 -3.099354] [ 4.208187  2.984927] [-2.123337  2.943366] [ 0.704199 -0.479481] [-0.39237  -3.963704] [ 2.831667  1.574018] [-0.790153  3.343144] [ 2.943496 -3.357075] [-3.195883 -2.283926] [ 2.336445  2.875106] [-1.786345  2.554248] [ 2.190101 -1.90602 ] [-3.403367 -2.778288] [ 1.778124  3.880832] [-1.688346  2.230267] [ 2.592976 -2.054368] [-4.007257 -3.207066] [ 2.257734  3.387564] [-2.679011  0.785119] [ 0.939512 -4.023563] [-3.674424 -2.261084] [ 2.046259  2.735279] [-3.18947   1.780269] [ 4.372646 -0.822248] [-2.579316 -3.497576] [ 1.889034  5.1904  ] [-0.798747  2.185588] [ 2.83652  -2.658556] [-3.837877 -3.253815] [ 2.096701  3.886007] [-2.709034  2.923887] [ 3.367037 -3.184789] [-2.121479 -4.232586] [ 2.329546  3.179764] [-3.284816  3.273099] [ 3.091414 -3.815232] [-3.762093 -2.432191] [ 3.542056  2.778832] [-1.736822  4.241041] [ 2.127073 -2.98368 ] [-4.323818 -3.938116] [ 3.792121  5.135768] [-4.786473  3.358547] [ 2.624081 -3.260715] [-4.009299 -2.978115] [ 2.493525  1.96371 ] [-2.513661  2.642162] [ 1.864375 -3.176309] [-3.171184 -3.572452] [ 2.89422   2.489128] [-2.562539  2.884438] [ 3.491078 -3.947487] [-2.565729 -2.012114] [ 3.332948  3.983102] [-1.616805  3.573188] [ 2.280615 -2.559444] [-2.651229 -3.103198] [ 2.321395  3.154987] [-1.685703  2.939697] [ 3.031012 -3.620252] [-4.599622 -2.185829] [ 4.196223  1.126677] [-2.133863  3.093686] [ 4.668892 -2.562705] [-2.793241 -2.149706] [ 2.884105  3.043438] [-2.967647  2.848696] [ 4.479332 -1.764772] [-4.905566 -2.91107 ]]矩阵中的最大值和最小值:min(dataMat[:,0]):[[-5.379713]]min(dataMat[:,1]):[[-4.232586]]max(dataMat[:,0]):[[ 4.838138]]max(dataMat[:,1]):[[ 5.1904]]质心:[[-0.63033973  3.67867827] [ 4.73957277 -1.20849738]]distance:5.18463281668

所有支持函数正常运行之后,就可以准备实现完整的k-均值聚类算法了。

k-均值聚类算法:

#encoding:utf-8from numpy import *from math import *#加载数据集def loadDataSet(fileName):      #将一个文本文件导入到列表中    dataMat = []                #创建一个dataMat空列表    fr = open(fileName)    for line in fr.readlines():  #一行一行的读取文本文件        curLine = line.strip().split('\t')        fltLine = map(float,curLine) #将文本文件转化为字符型        dataMat.append(fltLine)    return dataMat#计算两个向量的欧式距离  def distEclud(vecA, vecB):          return sqrt(sum(power(vecA - vecB, 2))) #随机构建初始质心def randCent(dataSet, k):               #构建一个包含k个随机质心的集合        #数据列数    n = shape(dataSet)[1]    #初始化质心    centroids = mat(zeros((k,n)))    for j in range(n):                          #数据集中每一维的最小和最大值,保证随机选取的质心在边界之内        minJ = min(dataSet[:,j])         rangeJ = float(max(dataSet[:,j]) - minJ)        centroids[:,j] = mat(minJ + rangeJ * random.rand(k,1))    return centroids #k-均值算法:四个参数(数据集、簇的数目、距离、创建初始质心)def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):        #数据点的行数    m = shape(dataSet)[0]    #用于记录数据点到质心的距离平方    clusterAssment = mat(zeros((m,2)))     #中心点                                   centroids = createCent(dataSet, k)    #聚类结束标志    clusterChanged = True    while clusterChanged:        clusterChanged = False        #遍历每条数据        for i in range(m):                #设置两个变量,分别存放数据点到质心的距离,及数据点属于哪个质心            minDist = inf; minIndex = -1            #遍历每个质心            for j in range(k):                distJI = distMeas(centroids[j,:],dataSet[i,:])                if distJI < minDist:                        #将数据归为最近的质心                    minDist = distJI; minIndex = j            #簇分配结果发生改变,更新标志            if clusterAssment[i,0] != minIndex: clusterChanged = True            clusterAssment[i,:] = minIndex,minDist**2        print centroids        #更新质心        for cent in range(k):#recalculate centroids            ptsInClust = dataSet[nonzero(clusterAssment[:,0].A==cent)[0]]#get all the point in this cluster            centroids[cent,:] = mean(ptsInClust, axis=0) #assign centroid to mean     return centroids, clusterAssment#数据导入data = loadDataSet('testSet.txt')print "data:"print dataprint "\n"#将数据按照矩阵方式输出dataMat = mat(data)print "dataMat:"print dataMatprint "\n"centroids, clusterAssment = kMeans(dataMat,4)print centroids,clusterAssment

Output:

data:[[1.658985, 4.285136], [-3.453687, 3.424321], [4.838138, -1.151539], [-5.379713, -3.362104], [0.972564, 2.924086], [-3.567919, 1.531611], [0.450614, -3.302219], [-3.487105, -1.724432], [2.668759, 1.594842], [-3.156485, 3.191137], [3.165506, -3.999838], [-2.786837, -3.099354], [4.208187, 2.984927], [-2.123337, 2.943366], [0.704199, -0.479481], [-0.39237, -3.963704], [2.831667, 1.574018], [-0.790153, 3.343144], [2.943496, -3.357075], [-3.195883, -2.283926], [2.336445, 2.875106], [-1.786345, 2.554248], [2.190101, -1.90602], [-3.403367, -2.778288], [1.778124, 3.880832], [-1.688346, 2.230267], [2.592976, -2.054368], [-4.007257, -3.207066], [2.257734, 3.387564], [-2.679011, 0.785119], [0.939512, -4.023563], [-3.674424, -2.261084], [2.046259, 2.735279], [-3.18947, 1.780269], [4.372646, -0.822248], [-2.579316, -3.497576], [1.889034, 5.1904], [-0.798747, 2.185588], [2.83652, -2.658556], [-3.837877, -3.253815], [2.096701, 3.886007], [-2.709034, 2.923887], [3.367037, -3.184789], [-2.121479, -4.232586], [2.329546, 3.179764], [-3.284816, 3.273099], [3.091414, -3.815232], [-3.762093, -2.432191], [3.542056, 2.778832], [-1.736822, 4.241041], [2.127073, -2.98368], [-4.323818, -3.938116], [3.792121, 5.135768], [-4.786473, 3.358547], [2.624081, -3.260715], [-4.009299, -2.978115], [2.493525, 1.96371], [-2.513661, 2.642162], [1.864375, -3.176309], [-3.171184, -3.572452], [2.89422, 2.489128], [-2.562539, 2.884438], [3.491078, -3.947487], [-2.565729, -2.012114], [3.332948, 3.983102], [-1.616805, 3.573188], [2.280615, -2.559444], [-2.651229, -3.103198], [2.321395, 3.154987], [-1.685703, 2.939697], [3.031012, -3.620252], [-4.599622, -2.185829], [4.196223, 1.126677], [-2.133863, 3.093686], [4.668892, -2.562705], [-2.793241, -2.149706], [2.884105, 3.043438], [-2.967647, 2.848696], [4.479332, -1.764772], [-4.905566, -2.91107]]dataMat:[[ 1.658985  4.285136] [-3.453687  3.424321] [ 4.838138 -1.151539] [-5.379713 -3.362104] [ 0.972564  2.924086] [-3.567919  1.531611] [ 0.450614 -3.302219] [-3.487105 -1.724432] [ 2.668759  1.594842] [-3.156485  3.191137] [ 3.165506 -3.999838] [-2.786837 -3.099354] [ 4.208187  2.984927] [-2.123337  2.943366] [ 0.704199 -0.479481] [-0.39237  -3.963704] [ 2.831667  1.574018] [-0.790153  3.343144] [ 2.943496 -3.357075] [-3.195883 -2.283926] [ 2.336445  2.875106] [-1.786345  2.554248] [ 2.190101 -1.90602 ] [-3.403367 -2.778288] [ 1.778124  3.880832] [-1.688346  2.230267] [ 2.592976 -2.054368] [-4.007257 -3.207066] [ 2.257734  3.387564] [-2.679011  0.785119] [ 0.939512 -4.023563] [-3.674424 -2.261084] [ 2.046259  2.735279] [-3.18947   1.780269] [ 4.372646 -0.822248] [-2.579316 -3.497576] [ 1.889034  5.1904  ] [-0.798747  2.185588] [ 2.83652  -2.658556] [-3.837877 -3.253815] [ 2.096701  3.886007] [-2.709034  2.923887] [ 3.367037 -3.184789] [-2.121479 -4.232586] [ 2.329546  3.179764] [-3.284816  3.273099] [ 3.091414 -3.815232] [-3.762093 -2.432191] [ 3.542056  2.778832] [-1.736822  4.241041] [ 2.127073 -2.98368 ] [-4.323818 -3.938116] [ 3.792121  5.135768] [-4.786473  3.358547] [ 2.624081 -3.260715] [-4.009299 -2.978115] [ 2.493525  1.96371 ] [-2.513661  2.642162] [ 1.864375 -3.176309] [-3.171184 -3.572452] [ 2.89422   2.489128] [-2.562539  2.884438] [ 3.491078 -3.947487] [-2.565729 -2.012114] [ 3.332948  3.983102] [-1.616805  3.573188] [ 2.280615 -2.559444] [-2.651229 -3.103198] [ 2.321395  3.154987] [-1.685703  2.939697] [ 3.031012 -3.620252] [-4.599622 -2.185829] [ 4.196223  1.126677] [-2.133863  3.093686] [ 4.668892 -2.562705] [-2.793241 -2.149706] [ 2.884105  3.043438] [-2.967647  2.848696] [ 4.479332 -1.764772] [-4.905566 -2.91107 ]][[ 3.23128394  0.95068522] [ 1.44392461 -0.39902729] [-1.62230418 -2.52007168] [-3.51187771  3.79667864]][[ 2.95373358  2.32801413] [ 2.5935345  -2.92880329] [-3.01169468 -3.01238673] [-2.46154315  2.78737555]][[ 2.6265299   3.10868015] [ 2.80293085 -2.7315146 ] [-3.38237045 -2.9473363 ] [-2.46154315  2.78737555]][[ 2.6265299   3.10868015] [ 2.80293085 -2.7315146 ] [-3.38237045 -2.9473363 ] [-2.46154315  2.78737555]] [[  0.00000000e+00   2.32019150e+00] [  3.00000000e+00   1.39004893e+00] [  1.00000000e+00   6.63839104e+00] [  2.00000000e+00   4.16140951e+00] [  0.00000000e+00   2.76967820e+00] [  3.00000000e+00   2.80101213e+00] [  1.00000000e+00   5.85909807e+00] [  2.00000000e+00   1.50646425e+00] [  0.00000000e+00   2.29348924e+00] [  3.00000000e+00   6.45967483e-01] [  1.00000000e+00   1.74010499e+00] [  2.00000000e+00   3.77769471e-01] [  0.00000000e+00   2.51695402e+00] [  3.00000000e+00   1.38716420e-01] [  1.00000000e+00   9.47633071e+00] [  2.00000000e+00   9.97310599e+00] [  0.00000000e+00   2.39726914e+00] [  3.00000000e+00   3.10242360e+00] [  1.00000000e+00   4.11084375e-01] [  2.00000000e+00   4.74890795e-01] [  0.00000000e+00   1.38706133e-01] [  3.00000000e+00   5.10240996e-01] [  1.00000000e+00   1.05700176e+00] [  2.00000000e+00   2.90181828e-02] [  0.00000000e+00   1.31601105e+00] [  3.00000000e+00   9.08203769e-01] [  1.00000000e+00   5.02608557e-01] [  2.00000000e+00   4.57942717e-01] [  0.00000000e+00   2.13786618e-01] [  3.00000000e+00   4.05632356e+00] [  1.00000000e+00   5.14171888e+00] [  2.00000000e+00   5.56237495e-01] [  0.00000000e+00   4.76142736e-01] [  3.00000000e+00   1.54414110e+00] [  1.00000000e+00   6.10930460e+00] [  2.00000000e+00   9.47660177e-01] [  0.00000000e+00   4.87745774e+00] [  3.00000000e+00   3.12703929e+00] [  1.00000000e+00   6.45118831e-03] [  2.00000000e+00   3.01415411e-01] [  0.00000000e+00   8.84955695e-01] [  3.00000000e+00   7.98870968e-02] [  1.00000000e+00   5.23673430e-01] [  2.00000000e+00   3.24171404e+00] [  0.00000000e+00   9.32523506e-02] [  3.00000000e+00   9.13705455e-01] [  1.00000000e+00   1.25766593e+00] [  2.00000000e+00   4.09563895e-01] [  0.00000000e+00   9.46987842e-01] [  3.00000000e+00   2.63836399e+00] [  1.00000000e+00   5.20371222e-01] [  2.00000000e+00   1.86796790e+00] [  0.00000000e+00   5.46768776e+00] [  3.00000000e+00   5.73153563e+00] [  1.00000000e+00   3.12040332e-01] [  2.00000000e+00   3.93986735e-01] [  0.00000000e+00   1.32864695e+00] [  3.00000000e+00   2.38032454e-02] [  1.00000000e+00   1.07872914e+00] [  2.00000000e+00   4.35369355e-01] [  0.00000000e+00   4.55502856e-01] [  3.00000000e+00   1.96212809e-02] [  1.00000000e+00   1.95213538e+00] [  2.00000000e+00   1.54154401e+00] [  0.00000000e+00   1.26364010e+00] [  3.00000000e+00   1.33108375e+00] [  1.00000000e+00   3.02422139e-01] [  2.00000000e+00   5.58860689e-01] [  0.00000000e+00   9.52516316e-02] [  3.00000000e+00   6.25129762e-01] [  1.00000000e+00   8.41875177e-01] [  2.00000000e+00   2.06159470e+00] [  0.00000000e+00   6.39227291e+00] [  3.00000000e+00   2.01200372e-01] [  1.00000000e+00   3.51030769e+00] [  2.00000000e+00   9.83287604e-01] [  0.00000000e+00   7.06014703e-02] [  3.00000000e+00   2.59901305e-01] [  1.00000000e+00   3.74491207e+00] [  2.00000000e+00   2.32143993e+00]]

6、二分k-均值聚类

二分-K均值是为了解决k-均值的用户自定义输入簇值k所延伸出来的自己判断k数目,其基本思路是:
为了得到k个簇,将所有点的集合分裂成两个簇,从这些簇中选取一个继续分裂,如此下去,直到产生k个簇。

#encoding :utf-8from numpy import *from math import *#加载数据集def loadDataSet(fileName):      #将一个文本文件导入到列表中    dataMat = []                #创建一个dataMat空列表    fr = open(fileName)    for line in fr.readlines():  #一行一行的读取文本文件        curLine = line.strip().split('\t')        fltLine = map(float,curLine) #将文本文件转化为字符型        dataMat.append(fltLine)    return dataMat#计算两个向量的欧式距离  def distEclud(vecA, vecB):          return sqrt(sum(power(vecA - vecB, 2))) #随机构建初始质心def randCent(dataSet, k):               #构建一个包含k个随机质心的集合        #数据列数    n = shape(dataSet)[1]    #初始化质心    centroids = mat(zeros((k,n)))    for j in range(n):                          #数据集中每一维的最小和最大值,保证随机选取的质心在边界之内        minJ = min(dataSet[:,j])         rangeJ = float(max(dataSet[:,j]) - minJ)        centroids[:,j] = mat(minJ + rangeJ * random.rand(k,1))    return centroids #k-均值算法:四个参数(数据集、簇的数目、距离、创建初始质心)def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):        #数据点的行数    m = shape(dataSet)[0]    #用于记录数据点到质心的距离平方    clusterAssment = mat(zeros((m,2)))     #中心点                                   centroids = createCent(dataSet, k)    #聚类结束标志    clusterChanged = True    while clusterChanged:        clusterChanged = False        #遍历每条数据        for i in range(m):                #设置两个变量,分别存放数据点到质心的距离,及数据点属于哪个质心            minDist = inf; minIndex = -1            #遍历每个质心            for j in range(k):                distJI = distMeas(centroids[j,:],dataSet[i,:])                if distJI < minDist:                        #将数据归为最近的质心                    minDist = distJI; minIndex = j            #簇分配结果发生改变,更新标志            if clusterAssment[i,0] != minIndex: clusterChanged = True            clusterAssment[i,:] = minIndex,minDist**2        print centroids        #更新质心        for cent in range(k):#recalculate centroids            ptsInClust = dataSet[nonzero(clusterAssment[:,0].A==cent)[0]]#get all the point in this cluster            centroids[cent,:] = mean(ptsInClust, axis=0) #assign centroid to mean     return centroids, clusterAssmentdef biKmeans(dataSet, k, distMeas=distEclud):    m = shape(dataSet)[0]    clusterAssment = mat(zeros((m,2)))    #创建一个初始簇    centroid0 = mean(dataSet, axis=0).tolist()[0]    centList =[centroid0] #create a list with one centroid    #计算初始误差     for j in range(m):#calc initial Error        clusterAssment[j,1] = distMeas(mat(centroid0), dataSet[j,:])**2    while (len(centList) < k):        lowestSSE = inf        #遍历每个簇,尝试划分每个簇          for i in range(len(centList)):            ptsInCurrCluster = dataSet[nonzero(clusterAssment[:,0].A==i)[0],:]#get the data points currently in cluster i            centroidMat, splitClustAss = kMeans(ptsInCurrCluster, 2, distMeas)            #划分后的误差平方和            sseSplit = sum(splitClustAss[:,1])#compare the SSE to the currrent minimum            #剩余的误差之和            sseNotSplit = sum(clusterAssment[nonzero(clusterAssment[:,0].A!=i)[0],1])            print "sseSplit, and notSplit: ",sseSplit,sseNotSplit            if (sseSplit + sseNotSplit) < lowestSSE:                bestCentToSplit = i                bestNewCents = centroidMat                bestClustAss = splitClustAss.copy()                lowestSSE = sseSplit + sseNotSplit        #更新分配结果        bestClustAss[nonzero(bestClustAss[:,0].A == 1)[0],0] = len(centList) #change 1 to 3,4, or whatever        bestClustAss[nonzero(bestClustAss[:,0].A == 0)[0],0] = bestCentToSplit        print 'the bestCentToSplit is: ',bestCentToSplit        print 'the len of bestClustAss is: ', len(bestClustAss)        centList[bestCentToSplit] = bestNewCents[0,:].tolist()[0]#replace a centroid with two best centroids         centList.append(bestNewCents[1,:].tolist()[0])        clusterAssment[nonzero(clusterAssment[:,0].A == bestCentToSplit)[0],:]= bestClustAss#reassign new clusters, and SSE    return mat(centList), clusterAssment#数据导入data = loadDataSet('testSet.txt')print "data:"print dataprint "\n"#将数据按照矩阵方式输出dataMat = mat(data)print "dataMat:"print dataMatprint "\n"mat, clusterAssment = biKmeans(dataMat, 4)print mat, clusterAssment

Output:

data:[[1.658985, 4.285136], [-3.453687, 3.424321], [4.838138, -1.151539], [-5.379713, -3.362104], [0.972564, 2.924086], [-3.567919, 1.531611], [0.450614, -3.302219], [-3.487105, -1.724432], [2.668759, 1.594842], [-3.156485, 3.191137], [3.165506, -3.999838], [-2.786837, -3.099354], [4.208187, 2.984927], [-2.123337, 2.943366], [0.704199, -0.479481], [-0.39237, -3.963704], [2.831667, 1.574018], [-0.790153, 3.343144], [2.943496, -3.357075], [-3.195883, -2.283926], [2.336445, 2.875106], [-1.786345, 2.554248], [2.190101, -1.90602], [-3.403367, -2.778288], [1.778124, 3.880832], [-1.688346, 2.230267], [2.592976, -2.054368], [-4.007257, -3.207066], [2.257734, 3.387564], [-2.679011, 0.785119], [0.939512, -4.023563], [-3.674424, -2.261084], [2.046259, 2.735279], [-3.18947, 1.780269], [4.372646, -0.822248], [-2.579316, -3.497576], [1.889034, 5.1904], [-0.798747, 2.185588], [2.83652, -2.658556], [-3.837877, -3.253815], [2.096701, 3.886007], [-2.709034, 2.923887], [3.367037, -3.184789], [-2.121479, -4.232586], [2.329546, 3.179764], [-3.284816, 3.273099], [3.091414, -3.815232], [-3.762093, -2.432191], [3.542056, 2.778832], [-1.736822, 4.241041], [2.127073, -2.98368], [-4.323818, -3.938116], [3.792121, 5.135768], [-4.786473, 3.358547], [2.624081, -3.260715], [-4.009299, -2.978115], [2.493525, 1.96371], [-2.513661, 2.642162], [1.864375, -3.176309], [-3.171184, -3.572452], [2.89422, 2.489128], [-2.562539, 2.884438], [3.491078, -3.947487], [-2.565729, -2.012114], [3.332948, 3.983102], [-1.616805, 3.573188], [2.280615, -2.559444], [-2.651229, -3.103198], [2.321395, 3.154987], [-1.685703, 2.939697], [3.031012, -3.620252], [-4.599622, -2.185829], [4.196223, 1.126677], [-2.133863, 3.093686], [4.668892, -2.562705], [-2.793241, -2.149706], [2.884105, 3.043438], [-2.967647, 2.848696], [4.479332, -1.764772], [-4.905566, -2.91107]]dataMat:[[ 1.658985  4.285136] [-3.453687  3.424321] [ 4.838138 -1.151539] [-5.379713 -3.362104] [ 0.972564  2.924086] [-3.567919  1.531611] [ 0.450614 -3.302219] [-3.487105 -1.724432] [ 2.668759  1.594842] [-3.156485  3.191137] [ 3.165506 -3.999838] [-2.786837 -3.099354] [ 4.208187  2.984927] [-2.123337  2.943366] [ 0.704199 -0.479481] [-0.39237  -3.963704] [ 2.831667  1.574018] [-0.790153  3.343144] [ 2.943496 -3.357075] [-3.195883 -2.283926] [ 2.336445  2.875106] [-1.786345  2.554248] [ 2.190101 -1.90602 ] [-3.403367 -2.778288] [ 1.778124  3.880832] [-1.688346  2.230267] [ 2.592976 -2.054368] [-4.007257 -3.207066] [ 2.257734  3.387564] [-2.679011  0.785119] [ 0.939512 -4.023563] [-3.674424 -2.261084] [ 2.046259  2.735279] [-3.18947   1.780269] [ 4.372646 -0.822248] [-2.579316 -3.497576] [ 1.889034  5.1904  ] [-0.798747  2.185588] [ 2.83652  -2.658556] [-3.837877 -3.253815] [ 2.096701  3.886007] [-2.709034  2.923887] [ 3.367037 -3.184789] [-2.121479 -4.232586] [ 2.329546  3.179764] [-3.284816  3.273099] [ 3.091414 -3.815232] [-3.762093 -2.432191] [ 3.542056  2.778832] [-1.736822  4.241041] [ 2.127073 -2.98368 ] [-4.323818 -3.938116] [ 3.792121  5.135768] [-4.786473  3.358547] [ 2.624081 -3.260715] [-4.009299 -2.978115] [ 2.493525  1.96371 ] [-2.513661  2.642162] [ 1.864375 -3.176309] [-3.171184 -3.572452] [ 2.89422   2.489128] [-2.562539  2.884438] [ 3.491078 -3.947487] [-2.565729 -2.012114] [ 3.332948  3.983102] [-1.616805  3.573188] [ 2.280615 -2.559444] [-2.651229 -3.103198] [ 2.321395  3.154987] [-1.685703  2.939697] [ 3.031012 -3.620252] [-4.599622 -2.185829] [ 4.196223  1.126677] [-2.133863  3.093686] [ 4.668892 -2.562705] [-2.793241 -2.149706] [ 2.884105  3.043438] [-2.967647  2.848696] [ 4.479332 -1.764772] [-4.905566 -2.91107 ]][[-0.61154259 -0.99840509] [ 0.21655089  0.84736297]][[-0.77465184 -2.93442862] [ 0.47379212  2.62599895]][[-0.54735726 -2.93692713] [ 0.2978695   2.76065064]][[-0.2897198  -2.83942545] [ 0.08249337  2.94802785]]sseSplit, and notSplit:  792.916856537 0.0the bestCentToSplit is:  0the len of bestClustAss is:  80[[-1.675842   -3.5709562 ] [-0.44613055 -1.66674358]][[-3.17656652 -2.99858519] [ 2.90100553 -2.66351205]][[-3.38237045 -2.9473363 ] [ 2.80293085 -2.7315146 ]]sseSplit, and notSplit:  83.5874695564 326.284075201[[ 0.17931125  3.21151684] [-2.49413705  4.77613273]][[ 1.6577805   2.93121792] [-2.84303986  2.97924629]][[ 2.6265299   3.10868015] [-2.46154315  2.78737555]]sseSplit, and notSplit:  66.36683512 466.632781336the bestCentToSplit is:  0the len of bestClustAss is:  40[[-1.61809661 -4.16618929] [-2.19748766 -3.31337473]][[-1.2569245  -4.098145  ] [-3.61853111 -2.81946867]]sseSplit, and notSplit:  19.619284753 377.270301893[[-4.25497296  3.22644812] [-3.07524324  4.4511578 ]][[-3.28879656  2.53721789] [ 1.06125497  3.06729526]][[-2.64677572  2.78993217] [ 2.31553173  3.07737886]][[-2.46154315  2.78737555] [ 2.6265299   3.10868015]]sseSplit, and notSplit:  66.36683512 83.5874695564[[ 1.60206889 -3.5436413 ] [ 3.03830692 -1.88925585]][[ 2.3728161 -3.548637 ] [ 3.2330456 -1.9143922]][[ 2.44798442 -3.43588358] [ 3.3353505  -1.67496113]][[ 2.47787177 -3.37608915] [ 3.406612   -1.53444757]]sseSplit, and notSplit:  31.6296069802 358.885318066the bestCentToSplit is:  1the len of bestClustAss is:  40[[-3.38237045 -2.9473363 ] [-2.46154315  2.78737555] [ 2.80293085 -2.7315146 ] [ 2.6265299   3.10868015]] [[  3.00000000e+00   2.32019150e+00] [  1.00000000e+00   1.39004893e+00] [  2.00000000e+00   6.63839104e+00] [  0.00000000e+00   4.16140951e+00] [  3.00000000e+00   2.76967820e+00] [  1.00000000e+00   2.80101213e+00] [  2.00000000e+00   5.85909807e+00] [  0.00000000e+00   1.50646425e+00] [  3.00000000e+00   2.29348924e+00] [  1.00000000e+00   6.45967483e-01] [  2.00000000e+00   1.74010499e+00] [  0.00000000e+00   3.77769471e-01] [  3.00000000e+00   2.51695402e+00] [  1.00000000e+00   1.38716420e-01] [  2.00000000e+00   9.47633071e+00] [  0.00000000e+00   9.97310599e+00] [  3.00000000e+00   2.39726914e+00] [  1.00000000e+00   3.10242360e+00] [  2.00000000e+00   4.11084375e-01] [  0.00000000e+00   4.74890795e-01] [  3.00000000e+00   1.38706133e-01] [  1.00000000e+00   5.10240996e-01] [  2.00000000e+00   1.05700176e+00] [  0.00000000e+00   2.90181828e-02] [  3.00000000e+00   1.31601105e+00] [  1.00000000e+00   9.08203769e-01] [  2.00000000e+00   5.02608557e-01] [  0.00000000e+00   4.57942717e-01] [  3.00000000e+00   2.13786618e-01] [  1.00000000e+00   4.05632356e+00] [  2.00000000e+00   5.14171888e+00] [  0.00000000e+00   5.56237495e-01] [  3.00000000e+00   4.76142736e-01] [  1.00000000e+00   1.54414110e+00] [  2.00000000e+00   6.10930460e+00] [  0.00000000e+00   9.47660177e-01] [  3.00000000e+00   4.87745774e+00] [  1.00000000e+00   3.12703929e+00] [  2.00000000e+00   6.45118831e-03] [  0.00000000e+00   3.01415411e-01] [  3.00000000e+00   8.84955695e-01] [  1.00000000e+00   7.98870968e-02] [  2.00000000e+00   5.23673430e-01] [  0.00000000e+00   3.24171404e+00] [  3.00000000e+00   9.32523506e-02] [  1.00000000e+00   9.13705455e-01] [  2.00000000e+00   1.25766593e+00] [  0.00000000e+00   4.09563895e-01] [  3.00000000e+00   9.46987842e-01] [  1.00000000e+00   2.63836399e+00] [  2.00000000e+00   5.20371222e-01] [  0.00000000e+00   1.86796790e+00] [  3.00000000e+00   5.46768776e+00] [  1.00000000e+00   5.73153563e+00] [  2.00000000e+00   3.12040332e-01] [  0.00000000e+00   3.93986735e-01] [  3.00000000e+00   1.32864695e+00] [  1.00000000e+00   2.38032454e-02] [  2.00000000e+00   1.07872914e+00] [  0.00000000e+00   4.35369355e-01] [  3.00000000e+00   4.55502856e-01] [  1.00000000e+00   1.96212809e-02] [  2.00000000e+00   1.95213538e+00] [  0.00000000e+00   1.54154401e+00] [  3.00000000e+00   1.26364010e+00] [  1.00000000e+00   1.33108375e+00] [  2.00000000e+00   3.02422139e-01] [  0.00000000e+00   5.58860689e-01] [  3.00000000e+00   9.52516316e-02] [  1.00000000e+00   6.25129762e-01] [  2.00000000e+00   8.41875177e-01] [  0.00000000e+00   2.06159470e+00] [  3.00000000e+00   6.39227291e+00] [  1.00000000e+00   2.01200372e-01] [  2.00000000e+00   3.51030769e+00] [  0.00000000e+00   9.83287604e-01] [  3.00000000e+00   7.06014703e-02] [  1.00000000e+00   2.59901305e-01] [  2.00000000e+00   3.74491207e+00] [  0.00000000e+00   2.32143993e+00]]
0 0