[Machine Learning in Action 04] The k-means clustering algorithm
Source: Internet  Editor: 程序博客网  Time: 2024/05/09 06:21
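1. Definition of clustering
Clustering is a form of unsupervised learning that groups similar objects into the same cluster; the more similar the objects within a cluster, the better the clustering. k-means finds k distinct clusters, and the center of each cluster is computed as the mean of the values it contains.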
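To make "the center is the mean of the values in the cluster" concrete, here is a minimal sketch in plain Python. The helper name `centroid` and the three sample points are made up for illustration and do not come from the original code.

```python
def centroid(points):
    """Return the coordinate-wise mean of a list of equal-length points."""
    n = len(points)
    dims = len(points[0])
    return [sum(p[d] for p in points) / n for d in range(dims)]


# Three hypothetical 2-D points that belong to one cluster
cluster = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(centroid(cluster))  # [3.0, 4.0]
```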
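2. Steps in developing a machine learning application
(1) Collect data: gather sample data; to save time, publicly available data sources can be used.
(2) Prepare the input data: make sure the data is in a usable format. The book uses the Python list.
(3) Analyze the data: manually inspect the data obtained in the previous steps and make sure the data set contains no garbage values.
(4) Train the algorithm: this is where machine learning really begins. For supervised learning, the formatted data from the earlier steps is fed to the algorithm to produce a model for later use. For unsupervised learning there is no target variable, so there is no training step.
(5) Test the algorithm: this step uses the model from step 4 and evaluates how well the algorithm works.
(6) Use the algorithm: turn the machine learning algorithm into an application that performs a real task.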
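3. k-means clustering
First, k initial points are chosen at random as centroids. Each point in the data set is then assigned to the cluster of its nearest centroid. After this step, each centroid is updated to the mean of all points in its cluster. These two steps are repeated until the cluster assignments no longer change, at which point k-means clustering is complete.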
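4. k-means pseudocode
Create k points as starting centroids (chosen at random)
While the cluster assignment of any point has changed:
    For every point in the data set:
        For every centroid:
            Calculate the distance between the centroid and the point
        Assign the point to the cluster with the nearest centroid
    For every cluster, compute the mean of its points and use that mean as the new centroid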
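The pseudocode above can be sketched in a few lines of plain Python. This is only an illustration under some assumptions: the function name `kmeans`, the toy points, and the fixed starting centroids (passed in instead of being drawn at random, so the run is repeatable) are all invented here, and an empty cluster simply keeps its previous centroid.

```python
import math


def kmeans(points, centroids):
    """Plain-Python k-means; `centroids` holds the k starting points."""
    centroids = [list(c) for c in centroids]
    assignment = [-1] * len(points)
    changed = True
    while changed:  # loop until no point switches cluster
        changed = False
        # Assignment step: attach every point to its nearest centroid
        for i, p in enumerate(points):
            dists = [math.dist(p, c) for c in centroids]
            nearest = dists.index(min(dists))
            if assignment[i] != nearest:
                assignment[i] = nearest
                changed = True
        # Update step: move every centroid to the mean of its members
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its old centroid (sketch choice)
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


# Hypothetical toy data: two obvious groups
points = [[1.0, 1.0], [1.0, 2.0], [9.0, 9.0], [9.0, 10.0]]
centroids, assignment = kmeans(points, [[0.0, 0.0], [10.0, 10.0]])
print(centroids)   # [[1.0, 1.5], [9.0, 9.5]]
print(assignment)  # [0, 0, 1, 1]
```

Passing the starting centroids in explicitly is a design choice for the sketch only; with random starting points the result can differ from run to run.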
5. Implementation
Support functions for k-means clustering
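import numpy as np


# Load the data set
def loadDataSet(fileName):
    # Read a tab-separated text file into a list of rows
    dataMat = []  # start with an empty list
    with open(fileName) as fr:
        for line in fr:  # read the text file line by line
            curLine = line.strip().split('\t')
            fltLine = list(map(float, curLine))  # convert every field to float
            dataMat.append(fltLine)
    return dataMat


# Euclidean distance between two vectors
def distEclud(vecA, vecB):
    return np.sqrt(np.sum(np.power(vecA - vecB, 2)))


# Build random starting centroids
def randCent(dataSet, k):
    # Returns a matrix of k random centroids
    n = np.shape(dataSet)[1]  # number of columns (dimensions)
    centroids = np.asmatrix(np.zeros((k, n)))  # initialise the centroids
    for j in range(n):
        # Min and max of each dimension, so that the random centroids
        # stay inside the bounds of the data set
        minJ = dataSet[:, j].min()
        rangeJ = float(dataSet[:, j].max() - minJ)
        centroids[:, j] = np.asmatrix(minJ + rangeJ * np.random.rand(k, 1))
    return centroids


# Load the data
data = loadDataSet('testSet.txt')
print("data:")
print(data)
print("\n")

# Show the data as a matrix
dataMat = np.asmatrix(data)
print("dataMat:")
print(dataMat)
print("\n")

# Minimum and maximum of each column. The built-in min/max iterate over
# the rows of the column, so each result is a 1x1 matrix.
print("Max and min values in the matrix:")
min1 = min(dataMat[:, 0])
print("min(dataMat[:,0]):")
print(min1)
min2 = min(dataMat[:, 1])
print("min(dataMat[:,1]):")
print(min2)
max1 = max(dataMat[:, 0])
print("max(dataMat[:,0]):")
print(max1)
max2 = max(dataMat[:, 1])
print("max(dataMat[:,1]):")
print(max2)
print("\n")

# k random centroids; they must lie inside the bounds of the data set,
# that is, between the minimum and maximum of each column
centroids = randCent(dataMat, 2)
print("Centroids:")
print(centroids)
print("\n")

# Distance between the first two points
distance = distEclud(dataMat[0], dataMat[1])
print("distance:")
print(distance)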
Output:
data (list form):
[[1.658985, 4.285136], [-3.453687, 3.424321], [4.838138, -1.151539], [-5.379713, -3.362104],
 [0.972564, 2.924086], [-3.567919, 1.531611], [0.450614, -3.302219], [-3.487105, -1.724432],
 [2.668759, 1.594842], [-3.156485, 3.191137], [3.165506, -3.999838], [-2.786837, -3.099354],
 [4.208187, 2.984927], [-2.123337, 2.943366], [0.704199, -0.479481], [-0.39237, -3.963704],
 [2.831667, 1.574018], [-0.790153, 3.343144], [2.943496, -3.357075], [-3.195883, -2.283926],
 [2.336445, 2.875106], [-1.786345, 2.554248], [2.190101, -1.90602], [-3.403367, -2.778288],
 [1.778124, 3.880832], [-1.688346, 2.230267], [2.592976, -2.054368], [-4.007257, -3.207066],
 [2.257734, 3.387564], [-2.679011, 0.785119], [0.939512, -4.023563], [-3.674424, -2.261084],
 [2.046259, 2.735279], [-3.18947, 1.780269], [4.372646, -0.822248], [-2.579316, -3.497576],
 [1.889034, 5.1904], [-0.798747, 2.185588], [2.83652, -2.658556], [-3.837877, -3.253815],
 [2.096701, 3.886007], [-2.709034, 2.923887], [3.367037, -3.184789], [-2.121479, -4.232586],
 [2.329546, 3.179764], [-3.284816, 3.273099], [3.091414, -3.815232], [-3.762093, -2.432191],
 [3.542056, 2.778832], [-1.736822, 4.241041], [2.127073, -2.98368], [-4.323818, -3.938116],
 [3.792121, 5.135768], [-4.786473, 3.358547], [2.624081, -3.260715], [-4.009299, -2.978115],
 [2.493525, 1.96371], [-2.513661, 2.642162], [1.864375, -3.176309], [-3.171184, -3.572452],
 [2.89422, 2.489128], [-2.562539, 2.884438], [3.491078, -3.947487], [-2.565729, -2.012114],
 [3.332948, 3.983102], [-1.616805, 3.573188], [2.280615, -2.559444], [-2.651229, -3.103198],
 [2.321395, 3.154987], [-1.685703, 2.939697], [3.031012, -3.620252], [-4.599622, -2.185829],
 [4.196223, 1.126677], [-2.133863, 3.093686], [4.668892, -2.562705], [-2.793241, -2.149706],
 [2.884105, 3.043438], [-2.967647, 2.848696], [4.479332, -1.764772], [-4.905566, -2.91107]]
dataMat (matrix form):
[[ 1.658985 4.285136]
 [-3.453687 3.424321]
 [ 4.838138 -1.151539]
 [-5.379713 -3.362104]
 [ 0.972564 2.924086]
 [-3.567919 1.531611]
 [ 0.450614 -3.302219]
 [-3.487105 -1.724432]
 [ 2.668759 1.594842]
 [-3.156485 3.191137]
 [ 3.165506 -3.999838]
 [-2.786837 -3.099354]
 [ 4.208187 2.984927]
 [-2.123337 2.943366]
 [ 0.704199 -0.479481]
 [-0.39237 -3.963704]
 [ 2.831667 1.574018]
 [-0.790153 3.343144]
 [ 2.943496 -3.357075]
 [-3.195883 -2.283926]
 [ 2.336445 2.875106]
 [-1.786345 2.554248]
 [ 2.190101 -1.90602 ]
 [-3.403367 -2.778288]
 [ 1.778124 3.880832]
 [-1.688346 2.230267]
 [ 2.592976 -2.054368]
 [-4.007257 -3.207066]
 [ 2.257734 3.387564]
 [-2.679011 0.785119]
 [ 0.939512 -4.023563]
 [-3.674424 -2.261084]
 [ 2.046259 2.735279]
 [-3.18947 1.780269]
 [ 4.372646 -0.822248]
 [-2.579316 -3.497576]
 [ 1.889034 5.1904 ]
 [-0.798747 2.185588]
 [ 2.83652 -2.658556]
 [-3.837877 -3.253815]
 [ 2.096701 3.886007]
 [-2.709034 2.923887]
 [ 3.367037 -3.184789]
 [-2.121479 -4.232586]
 [ 2.329546 3.179764]
 [-3.284816 3.273099]
 [ 3.091414 -3.815232]
 [-3.762093 -2.432191]
 [ 3.542056 2.778832]
 [-1.736822 4.241041]
 [ 2.127073 -2.98368 ]
 [-4.323818 -3.938116]
 [ 3.792121 5.135768]
 [-4.786473 3.358547]
 [ 2.624081 -3.260715]
 [-4.009299 -2.978115]
 [ 2.493525 1.96371 ]
 [-2.513661 2.642162]
 [ 1.864375 -3.176309]
 [-3.171184 -3.572452]
 [ 2.89422 2.489128]
 [-2.562539 2.884438]
 [ 3.491078 -3.947487]
 [-2.565729 -2.012114]
 [ 3.332948 3.983102]
 [-1.616805 3.573188]
 [ 2.280615 -2.559444]
 [-2.651229 -3.103198]
 [ 2.321395 3.154987]
 [-1.685703 2.939697]
 [ 3.031012 -3.620252]
 [-4.599622 -2.185829]
 [ 4.196223 1.126677]
 [-2.133863 3.093686]
 [ 4.668892 -2.562705]
 [-2.793241 -2.149706]
 [ 2.884105 3.043438]
 [-2.967647 2.848696]
 [ 4.479332 -1.764772]
 [-4.905566 -2.91107 ]]
Max and min values in the matrix:
min(dataMat[:,0]):
[[-5.379713]]
min(dataMat[:,1]):
[[-4.232586]]
max(dataMat[:,0]):
[[ 4.838138]]
max(dataMat[:,1]):
[[ 5.1904]]
Centroids:
[[-0.63033973 3.67867827]
 [ 4.73957277 -1.20849738]]
distance:
5.18463281668
Once all the support functions run correctly, the full k-means clustering algorithm can be implemented.
The k-means clustering algorithm:
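Before moving on, the two support functions can be sanity-checked by hand. The sketch below re-implements them in plain Python (no NumPy), so the names `dist_euclid` and `rand_cent` are stand-ins invented for this check rather than the functions from the listing. It uses the first three rows of the data shown in the output and compares the distance between the first two rows with the `distance` value printed there.

```python
import math
import random


def dist_euclid(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rand_cent(points, k, rng):
    """k random centroids, each coordinate drawn as min + range * U[0, 1)."""
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]
    return [[lows[d] + (highs[d] - lows[d]) * rng.random() for d in range(dims)]
            for _ in range(k)]


# First three rows of the data set as printed in the output
points = [[1.658985, 4.285136], [-3.453687, 3.424321], [4.838138, -1.151539]]

print(dist_euclid([0.0, 0.0], [3.0, 4.0]))  # 5.0 (the 3-4-5 triangle)
print(dist_euclid(points[0], points[1]))    # about 5.18463, as in the output

# Every random centroid stays inside the bounding box of the points
cents = rand_cent(points, 2, random.Random(0))
print(cents)
```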
import numpy as np


# Load the data set
def loadDataSet(fileName):
    # Read a tab-separated text file into a list of rows
    dataMat = []  # start with an empty list
    with open(fileName) as fr:
        for line in fr:  # read the text file line by line
            curLine = line.strip().split('\t')
            fltLine = list(map(float, curLine))  # convert every field to float
            dataMat.append(fltLine)
    return dataMat


# Euclidean distance between two vectors
def distEclud(vecA, vecB):
    return np.sqrt(np.sum(np.power(vecA - vecB, 2)))


# Build random starting centroids
def randCent(dataSet, k):
    # Returns a matrix of k random centroids
    n = np.shape(dataSet)[1]  # number of columns (dimensions)
    centroids = np.asmatrix(np.zeros((k, n)))  # initialise the centroids
    for j in range(n):
        # Min and max of each dimension, so that the random centroids
        # stay inside the bounds of the data set
        minJ = dataSet[:, j].min()
        rangeJ = float(dataSet[:, j].max() - minJ)
        centroids[:, j] = np.asmatrix(minJ + rangeJ * np.random.rand(k, 1))
    return centroids


# k-means: four parameters (data set, number of clusters,
# distance function, function that creates the starting centroids)
def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
    m = np.shape(dataSet)[0]  # number of data points
    # Column 0: index of the assigned cluster; column 1: squared distance
    # from the point to that cluster's centroid
    clusterAssment = np.asmatrix(np.zeros((m, 2)))
    centroids = createCent(dataSet, k)  # starting centroids
    clusterChanged = True  # loop flag
    while clusterChanged:
        clusterChanged = False
        # Go through every data point
        for i in range(m):
            # Smallest distance found so far and the centroid it belongs to
            minDist = np.inf
            minIndex = -1
            # Go through every centroid
            for j in range(k):
                distJI = distMeas(centroids[j, :], dataSet[i, :])
                if distJI < minDist:
                    # Keep the nearest centroid
                    minDist = distJI
                    minIndex = j
            # The assignment changed, so another pass is needed
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True
            clusterAssment[i, :] = minIndex, minDist ** 2
        print(centroids)
        # Update the centroids
        for cent in range(k):
            # All points currently assigned to this cluster
            ptsInClust = dataSet[np.nonzero(clusterAssment[:, 0].A == cent)[0]]
            # The new centroid is the mean of those points
            centroids[cent, :] = np.mean(ptsInClust, axis=0)
    return centroids, clusterAssment


# Load the data
data = loadDataSet('testSet.txt')
print("data:")
print(data)
print("\n")

# Show the data as a matrix
dataMat = np.asmatrix(data)
print("dataMat:")
print(dataMat)
print("\n")

centroids, clusterAssment = kMeans(dataMat, 4)
print(centroids, clusterAssment)
Output:
(The data and dataMat printouts are identical to those in the previous output and are omitted here.)
[[ 3.23128394 0.95068522]
 [ 1.44392461 -0.39902729]
 [-1.62230418 -2.52007168]
 [-3.51187771 3.79667864]]
[[ 2.95373358 2.32801413]
 [ 2.5935345 -2.92880329]
 [-3.01169468 -3.01238673]
 [-2.46154315 2.78737555]]
[[ 2.6265299 3.10868015]
 [ 2.80293085 -2.7315146 ]
 [-3.38237045 -2.9473363 ]
 [-2.46154315 2.78737555]]
[[ 2.6265299 3.10868015]
 [ 2.80293085 -2.7315146 ]
 [-3.38237045 -2.9473363 ]
 [-2.46154315 2.78737555]] [[ 0.00000000e+00 2.32019150e+00]
 [ 3.00000000e+00 1.39004893e+00]
 [ 1.00000000e+00 6.63839104e+00]
 [ 2.00000000e+00 4.16140951e+00]
 [ 0.00000000e+00 2.76967820e+00]
 [ 3.00000000e+00 2.80101213e+00]
 [ 1.00000000e+00 5.85909807e+00]
 [ 2.00000000e+00 1.50646425e+00]
 [ 0.00000000e+00 2.29348924e+00]
 [ 3.00000000e+00 6.45967483e-01]
 [ 1.00000000e+00 1.74010499e+00]
 [ 2.00000000e+00 3.77769471e-01]
 [ 0.00000000e+00 2.51695402e+00]
 [ 3.00000000e+00 1.38716420e-01]
 [ 1.00000000e+00 9.47633071e+00]
 [ 2.00000000e+00 9.97310599e+00]
 [ 0.00000000e+00 2.39726914e+00]
 [ 3.00000000e+00 3.10242360e+00]
 [ 1.00000000e+00 4.11084375e-01]
 [ 2.00000000e+00 4.74890795e-01]
 [ 0.00000000e+00 1.38706133e-01]
 [ 3.00000000e+00 5.10240996e-01]
 [ 1.00000000e+00 1.05700176e+00]
 [ 2.00000000e+00 2.90181828e-02]
 [ 0.00000000e+00 1.31601105e+00]
 [ 3.00000000e+00 9.08203769e-01]
 [ 1.00000000e+00 5.02608557e-01]
 [ 2.00000000e+00 4.57942717e-01]
 [ 0.00000000e+00 2.13786618e-01]
 [ 3.00000000e+00 4.05632356e+00]
 [ 1.00000000e+00 5.14171888e+00]
 [ 2.00000000e+00 5.56237495e-01]
 [ 0.00000000e+00 4.76142736e-01]
 [ 3.00000000e+00 1.54414110e+00]
 [ 1.00000000e+00 6.10930460e+00]
 [ 2.00000000e+00 9.47660177e-01]
 [ 0.00000000e+00 4.87745774e+00]
 [ 3.00000000e+00 3.12703929e+00]
 [ 1.00000000e+00 6.45118831e-03]
 [ 2.00000000e+00 3.01415411e-01]
 [ 0.00000000e+00 8.84955695e-01]
 [ 3.00000000e+00 7.98870968e-02]
 [ 1.00000000e+00 5.23673430e-01]
 [ 2.00000000e+00 3.24171404e+00]
 [ 0.00000000e+00 9.32523506e-02]
 [ 3.00000000e+00 9.13705455e-01]
 [ 1.00000000e+00 1.25766593e+00]
 [ 2.00000000e+00 4.09563895e-01]
 [ 0.00000000e+00 9.46987842e-01]
 [ 3.00000000e+00 2.63836399e+00]
 [ 1.00000000e+00 5.20371222e-01]
 [ 2.00000000e+00 1.86796790e+00]
 [ 0.00000000e+00 5.46768776e+00]
 [ 3.00000000e+00 5.73153563e+00]
 [ 1.00000000e+00 3.12040332e-01]
 [ 2.00000000e+00 3.93986735e-01]
 [ 0.00000000e+00 1.32864695e+00]
 [ 3.00000000e+00 2.38032454e-02]
 [ 1.00000000e+00 1.07872914e+00]
 [ 2.00000000e+00 4.35369355e-01]
 [ 0.00000000e+00 4.55502856e-01]
 [ 3.00000000e+00 1.96212809e-02]
 [ 1.00000000e+00 1.95213538e+00]
 [ 2.00000000e+00 1.54154401e+00]
 [ 0.00000000e+00 1.26364010e+00]
 [ 3.00000000e+00 1.33108375e+00]
 [ 1.00000000e+00 3.02422139e-01]
 [ 2.00000000e+00 5.58860689e-01]
 [ 0.00000000e+00 9.52516316e-02]
 [ 3.00000000e+00 6.25129762e-01]
 [ 1.00000000e+00 8.41875177e-01]
 [ 2.00000000e+00 2.06159470e+00]
 [ 0.00000000e+00 6.39227291e+00]
 [ 3.00000000e+00 2.01200372e-01]
 [ 1.00000000e+00 3.51030769e+00]
 [ 2.00000000e+00 9.83287604e-01]
 [ 0.00000000e+00 7.06014703e-02]
 [ 3.00000000e+00 2.59901305e-01]
 [ 1.00000000e+00 3.74491207e+00]
 [ 2.00000000e+00 2.32143993e+00]]
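In the result above, each row of clusterAssment holds the cluster index in the first column and the squared distance to that cluster's centroid in the second (this follows from the line `clusterAssment[i,:] = minIndex,minDist**2` in kMeans). Summing the second column per cluster therefore gives each cluster's sum of squared errors (SSE). The sketch below does this in plain Python for the first five rows of the output only; the helper name `sse_per_cluster` is invented for this illustration.

```python
from collections import defaultdict


def sse_per_cluster(assignments):
    """Sum the squared distances (second field) for each cluster index (first field)."""
    totals = defaultdict(float)
    for idx, sq_dist in assignments:
        totals[int(idx)] += sq_dist
    return dict(totals)


# First five rows of clusterAssment, copied from the output above
rows = [
    (0.0, 2.32019150),
    (3.0, 1.39004893),
    (1.0, 6.63839104),
    (2.0, 4.16140951),
    (0.0, 2.76967820),
]

per_cluster = sse_per_cluster(rows)
total = sum(per_cluster.values())
print(per_cluster)  # SSE contributed by these five points, per cluster
print(total)        # their combined SSE
```

Because the starting centroids are random, a different run can end with different centroids; comparing the total SSE of several runs is one simple way to pick the better result.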
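6. Bisecting k-means clustering
k-means can converge to a local minimum rather than the global one, which produces poor clusters. Bisecting k-means was developed to address this problem. It does not choose the number of clusters by itself: k is still supplied by the caller, as the k parameter of biKmeans in the code that follows shows. The basic idea is:
To obtain k clusters, start with all points in a single cluster and split it in two. Then repeatedly pick one of the existing clusters to split, choosing the one whose split gives the lowest total sum of squared errors (SSE), and continue until there are k clusters.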
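The selection rule can be written out explicitly. The notation below is introduced here for illustration and is not from the original post: C_j is a cluster with centroid mu_j, and C_i^a, C_i^b are the two halves obtained by running 2-means on cluster C_i. The two bracketed groups correspond to the quantities the code names sseSplit and sseNotSplit.

```latex
\mathrm{SSE}(C_j) = \sum_{x \in C_j} \lVert x - \mu_j \rVert^2

i^{*} = \arg\min_{i} \Big[
    \underbrace{\mathrm{SSE}(C_i^{a}) + \mathrm{SSE}(C_i^{b})}_{\texttt{sseSplit}}
    + \underbrace{\sum_{j \neq i} \mathrm{SSE}(C_j)}_{\texttt{sseNotSplit}}
\Big]
```

The cluster with index i* is the one that gets split in the current round.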
import numpy as np


# Load the data set
def loadDataSet(fileName):
    # Read a tab-separated text file into a list of rows
    dataMat = []  # start with an empty list
    with open(fileName) as fr:
        for line in fr:  # read the text file line by line
            curLine = line.strip().split('\t')
            fltLine = list(map(float, curLine))  # convert every field to float
            dataMat.append(fltLine)
    return dataMat


# Euclidean distance between two vectors
def distEclud(vecA, vecB):
    return np.sqrt(np.sum(np.power(vecA - vecB, 2)))


# Build random starting centroids
def randCent(dataSet, k):
    # Returns a matrix of k random centroids
    n = np.shape(dataSet)[1]  # number of columns (dimensions)
    centroids = np.asmatrix(np.zeros((k, n)))  # initialise the centroids
    for j in range(n):
        # Min and max of each dimension, so that the random centroids
        # stay inside the bounds of the data set
        minJ = dataSet[:, j].min()
        rangeJ = float(dataSet[:, j].max() - minJ)
        centroids[:, j] = np.asmatrix(minJ + rangeJ * np.random.rand(k, 1))
    return centroids


# k-means: four parameters (data set, number of clusters,
# distance function, function that creates the starting centroids)
def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
    m = np.shape(dataSet)[0]  # number of data points
    # Column 0: index of the assigned cluster; column 1: squared distance
    # from the point to that cluster's centroid
    clusterAssment = np.asmatrix(np.zeros((m, 2)))
    centroids = createCent(dataSet, k)  # starting centroids
    clusterChanged = True  # loop flag
    while clusterChanged:
        clusterChanged = False
        # Go through every data point
        for i in range(m):
            # Smallest distance found so far and the centroid it belongs to
            minDist = np.inf
            minIndex = -1
            # Go through every centroid
            for j in range(k):
                distJI = distMeas(centroids[j, :], dataSet[i, :])
                if distJI < minDist:
                    # Keep the nearest centroid
                    minDist = distJI
                    minIndex = j
            # The assignment changed, so another pass is needed
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True
            clusterAssment[i, :] = minIndex, minDist ** 2
        print(centroids)
        # Update the centroids
        for cent in range(k):
            # All points currently assigned to this cluster
            ptsInClust = dataSet[np.nonzero(clusterAssment[:, 0].A == cent)[0]]
            # The new centroid is the mean of those points
            centroids[cent, :] = np.mean(ptsInClust, axis=0)
    return centroids, clusterAssment


def biKmeans(dataSet, k, distMeas=distEclud):
    m = np.shape(dataSet)[0]
    clusterAssment = np.asmatrix(np.zeros((m, 2)))
    # Start with a single cluster whose centroid is the mean of all points
    centroid0 = np.mean(dataSet, axis=0).tolist()[0]
    centList = [centroid0]  # list holding one centroid
    # Initial error of every point with respect to that centroid
    for j in range(m):
        clusterAssment[j, 1] = distMeas(np.asmatrix(centroid0), dataSet[j, :]) ** 2
    while len(centList) < k:
        lowestSSE = np.inf
        # Try splitting every existing cluster
        for i in range(len(centList)):
            # Data points currently in cluster i
            ptsInCurrCluster = dataSet[np.nonzero(clusterAssment[:, 0].A == i)[0], :]
            centroidMat, splitClustAss = kMeans(ptsInCurrCluster, 2, distMeas)
            # SSE of cluster i after the split
            sseSplit = np.sum(splitClustAss[:, 1])
            # SSE of all the other clusters
            sseNotSplit = np.sum(clusterAssment[np.nonzero(clusterAssment[:, 0].A != i)[0], 1])
            print("sseSplit, and notSplit: ", sseSplit, sseNotSplit)
            # Keep the split with the lowest total SSE
            if (sseSplit + sseNotSplit) < lowestSSE:
                bestCentToSplit = i
                bestNewCents = centroidMat
                bestClustAss = splitClustAss.copy()
                lowestSSE = sseSplit + sseNotSplit
        # Update the assignments: label 1 becomes the index of the new
        # cluster, label 0 keeps the index of the cluster that was split
        bestClustAss[np.nonzero(bestClustAss[:, 0].A == 1)[0], 0] = len(centList)
        bestClustAss[np.nonzero(bestClustAss[:, 0].A == 0)[0], 0] = bestCentToSplit
        print('the bestCentToSplit is: ', bestCentToSplit)
        print('the len of bestClustAss is: ', len(bestClustAss))
        # Replace one centroid with the two centroids of the best split
        centList[bestCentToSplit] = bestNewCents[0, :].tolist()[0]
        centList.append(bestNewCents[1, :].tolist()[0])
        # Store the new cluster indices and squared errors
        clusterAssment[np.nonzero(clusterAssment[:, 0].A == bestCentToSplit)[0], :] = bestClustAss
    return np.asmatrix(centList), clusterAssment


# Load the data
data = loadDataSet('testSet.txt')
print("data:")
print(data)
print("\n")

# Show the data as a matrix
dataMat = np.asmatrix(data)
print("dataMat:")
print(dataMat)
print("\n")

centroids, clusterAssment = biKmeans(dataMat, 4)
print(centroids, clusterAssment)
Output:
(The data and dataMat printouts are identical to those in the first output and are omitted here.)
[[-0.61154259 -0.99840509]
 [ 0.21655089 0.84736297]]
[[-0.77465184 -2.93442862]
 [ 0.47379212 2.62599895]]
[[-0.54735726 -2.93692713]
 [ 0.2978695 2.76065064]]
[[-0.2897198 -2.83942545]
 [ 0.08249337 2.94802785]]
sseSplit, and notSplit: 792.916856537 0.0
the bestCentToSplit is: 0
the len of bestClustAss is: 80
[[-1.675842 -3.5709562 ]
 [-0.44613055 -1.66674358]]
[[-3.17656652 -2.99858519]
 [ 2.90100553 -2.66351205]]
[[-3.38237045 -2.9473363 ]
 [ 2.80293085 -2.7315146 ]]
sseSplit, and notSplit: 83.5874695564 326.284075201
[[ 0.17931125 3.21151684]
 [-2.49413705 4.77613273]]
[[ 1.6577805 2.93121792]
 [-2.84303986 2.97924629]]
[[ 2.6265299 3.10868015]
 [-2.46154315 2.78737555]]
sseSplit, and notSplit: 66.36683512 466.632781336
the bestCentToSplit is: 0
the len of bestClustAss is: 40
[[-1.61809661 -4.16618929]
 [-2.19748766 -3.31337473]]
[[-1.2569245 -4.098145 ]
 [-3.61853111 -2.81946867]]
sseSplit, and notSplit: 19.619284753 377.270301893
[[-4.25497296 3.22644812]
 [-3.07524324 4.4511578 ]]
[[-3.28879656 2.53721789]
 [ 1.06125497 3.06729526]]
[[-2.64677572 2.78993217]
 [ 2.31553173 3.07737886]]
[[-2.46154315 2.78737555]
 [ 2.6265299 3.10868015]]
sseSplit, and notSplit: 66.36683512 83.5874695564
[[ 1.60206889 -3.5436413 ]
 [ 3.03830692 -1.88925585]]
[[ 2.3728161 -3.548637 ]
 [ 3.2330456 -1.9143922]]
[[ 2.44798442 -3.43588358]
 [ 3.3353505 -1.67496113]]
[[ 2.47787177 -3.37608915]
 [ 3.406612 -1.53444757]]
sseSplit, and notSplit: 31.6296069802 358.885318066
the bestCentToSplit is: 1
the len of bestClustAss is: 40
[[-3.38237045 -2.9473363 ]
 [-2.46154315 2.78737555]
 [ 2.80293085 -2.7315146 ]
 [ 2.6265299 3.10868015]] [[ 3.00000000e+00 2.32019150e+00]
 [ 1.00000000e+00 1.39004893e+00]
 [ 2.00000000e+00 6.63839104e+00]
 [ 0.00000000e+00 4.16140951e+00]
 [ 3.00000000e+00 2.76967820e+00]
 [ 1.00000000e+00 2.80101213e+00]
 [ 2.00000000e+00 5.85909807e+00]
 [ 0.00000000e+00 1.50646425e+00]
 [ 3.00000000e+00 2.29348924e+00]
 [ 1.00000000e+00 6.45967483e-01]
 [ 2.00000000e+00 1.74010499e+00]
 [ 0.00000000e+00 3.77769471e-01]
 [ 3.00000000e+00 2.51695402e+00]
 [ 1.00000000e+00 1.38716420e-01]
 [ 2.00000000e+00 9.47633071e+00]
 [ 0.00000000e+00 9.97310599e+00]
 [ 3.00000000e+00 2.39726914e+00]
 [ 1.00000000e+00 3.10242360e+00]
 [ 2.00000000e+00 4.11084375e-01]
 [ 0.00000000e+00 4.74890795e-01]
 [ 3.00000000e+00 1.38706133e-01]
 [ 1.00000000e+00 5.10240996e-01]
 [ 2.00000000e+00 1.05700176e+00]
 [ 0.00000000e+00 2.90181828e-02]
 [ 3.00000000e+00 1.31601105e+00]
 [ 1.00000000e+00 9.08203769e-01]
 [ 2.00000000e+00 5.02608557e-01]
 [ 0.00000000e+00 4.57942717e-01]
 [ 3.00000000e+00 2.13786618e-01]
 [ 1.00000000e+00 4.05632356e+00]
 [ 2.00000000e+00 5.14171888e+00]
 [ 0.00000000e+00 5.56237495e-01]
 [ 3.00000000e+00 4.76142736e-01]
 [ 1.00000000e+00 1.54414110e+00]
 [ 2.00000000e+00 6.10930460e+00]
 [ 0.00000000e+00 9.47660177e-01]
 [ 3.00000000e+00 4.87745774e+00]
 [ 1.00000000e+00 3.12703929e+00]
 [ 2.00000000e+00 6.45118831e-03]
 [ 0.00000000e+00 3.01415411e-01]
 [ 3.00000000e+00 8.84955695e-01]
 [ 1.00000000e+00 7.98870968e-02]
 [ 2.00000000e+00 5.23673430e-01]
 [ 0.00000000e+00 3.24171404e+00]
 [ 3.00000000e+00 9.32523506e-02]
 [ 1.00000000e+00 9.13705455e-01]
 [ 2.00000000e+00 1.25766593e+00]
 [ 0.00000000e+00 4.09563895e-01]
 [ 3.00000000e+00 9.46987842e-01]
 [ 1.00000000e+00 2.63836399e+00]
 [ 2.00000000e+00 5.20371222e-01]
 [ 0.00000000e+00 1.86796790e+00]
 [ 3.00000000e+00 5.46768776e+00]
 [ 1.00000000e+00 5.73153563e+00]
 [ 2.00000000e+00 3.12040332e-01]
 [ 0.00000000e+00 3.93986735e-01]
 [ 3.00000000e+00 1.32864695e+00]
 [ 1.00000000e+00 2.38032454e-02]
 [ 2.00000000e+00 1.07872914e+00]
 [ 0.00000000e+00 4.35369355e-01]
 [ 3.00000000e+00 4.55502856e-01]
 [ 1.00000000e+00 1.96212809e-02]
 [ 2.00000000e+00 1.95213538e+00]
 [ 0.00000000e+00 1.54154401e+00]
 [ 3.00000000e+00 1.26364010e+00]
 [ 1.00000000e+00 1.33108375e+00]
 [ 2.00000000e+00 3.02422139e-01]
 [ 0.00000000e+00 5.58860689e-01]
 [ 3.00000000e+00 9.52516316e-02]
 [ 1.00000000e+00 6.25129762e-01]
 [ 2.00000000e+00 8.41875177e-01]
 [ 0.00000000e+00 2.06159470e+00]
 [ 3.00000000e+00 6.39227291e+00]
 [ 1.00000000e+00 2.01200372e-01]
 [ 2.00000000e+00 3.51030769e+00]
 [ 0.00000000e+00 9.83287604e-01]
 [ 3.00000000e+00 7.06014703e-02]
 [ 1.00000000e+00 2.59901305e-01]
 [ 2.00000000e+00 3.74491207e+00]
 [ 0.00000000e+00 2.32143993e+00]]