K-means and PAM Clustering Algorithms: Python Implementation and Comparison
Source: Internet  Editor: 程序博客网  Time: 2024/06/08 13:32
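K-means clustering (partitioning by K means): put simply, the general procedure is as follows. First, randomly select k points as the initial centers and assign every point to its nearest center, which gives the initial k clusters. Then compute the mean of each cluster and reassign every point to the nearest of the new centers. Repeat these steps until the assignments no longer change. The figure below illustrates the K-means method.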
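The assign/update loop described above can be sketched on a toy dataset. This is a minimal standard-library illustration, not the article's implementation: the helper names `assign`, `update` and `kmeans`, the six sample points, and the fixed initial centers (used instead of random ones so the run is reproducible) are all assumptions made for the example.

```python
import math


def assign(points, centers):
    """Return, for each point, the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points]


def update(points, labels, centers):
    """Move each center to the mean of its points; keep it if its cluster is empty."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new_centers.append(old)
    return new_centers


def kmeans(points, centers):
    """Alternate assignment and update until the assignment stops changing."""
    labels = None
    while True:
        new_labels = assign(points, centers)
        if new_labels == labels:
            return centers, labels
        labels = new_labels
        centers = update(points, labels, centers)


# Hypothetical toy data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers, labels = kmeans(points, centers=[(1.0, 1.0), (2.0, 1.0)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Both initial centers sit in the same group here, yet the loop still recovers the two groups after two updates because the groups are far apart.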
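PAM (Partitioning Around Medoids) is the basic K-medoids (K-center-point partitioning) algorithm. The basic procedure is as follows. First, randomly select k objects as medoids and assign every object to its nearest medoid. Then pick a non-medoid object to replace a medoid and compute how much the total distance improves after reassignment. Clustering is an iterative process of repeatedly swapping medoid and non-medoid objects until the objective function no longer improves. The cases that arise when a non-medoid replaces a medoid are analyzed in the figure below (the cost, with respect to object j, of replacing i with h).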
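The swap evaluation can be illustrated with a small sketch. It computes the change in the objective (the sum of distances from each object to its nearest medoid) when medoid i is replaced by non-medoid h; that total is the sum over all objects j of the per-object costs mentioned above. The function names `total_cost` and `swap_gain` and the six one-dimensional sample points are assumptions made for this example, not part of the article's code.

```python
import math


def total_cost(points, medoid_idx):
    """Sum over all points of the distance to the nearest medoid."""
    return sum(min(math.dist(p, points[m]) for m in medoid_idx) for p in points)


def swap_gain(points, medoid_idx, i, h):
    """Cost change when medoid i is replaced by non-medoid h (negative = improvement)."""
    swapped = [h if m == i else m for m in medoid_idx]
    return total_cost(points, swapped) - total_cost(points, medoid_idx)


# Hypothetical toy data: two groups of three points on a line.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (11.0, 0.0), (12.0, 0.0)]
medoids = [0, 3]

# Replacing medoid 0 with point 1 (the middle of the left group) lowers the cost.
print(swap_gain(points, medoids, i=0, h=1))  # -1.0
```

A swap is worth accepting only when this value is negative; PAM stops once no candidate swap yields a negative value.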
Dataset: N=600, K=15
9.802 10.132
10.35 9.768
10.098 9.988
9.73 9.91
9.754 10.43
9.836 9.902
10.238 9.866
9.53 9.862
10.154 9.82
9.336 10.456
9.378 10.21
9.712 10.264
9.638 10.208
9.518 9.956
10.236 9.91
9.4 10.086
10.196 9.746
10.138 9.828
10.062 10.26
10.394 9.984
10.284 10.348
9.706 9.978
9.906 10.588
10.356 9.198
9.954 9.704
9.796 10.378
10.386 10.608
10.41 9.912
10.172 10.598
10.286 9.712
9.932 10.234
10.298 9.948
10.352 9.932
9.848 10.328
10.514 10.498
9.944 9.934
9.92 10.022
9.908 10.606
10.182 9.99
10.256 9.25
12.04 10.028
12.082 10.044
12.4 10.156
11.988 9.926
12.34 9.918
12.228 9.978
12.348 10.488
12.044 9.358
11.736 10.122
12.35 9.798
11.246 10.122
12.276 10.99
12.374 10.018
12.53 10
12.27 9.792
12.364 10.176
12.458 10.18
11.952 9.682
11.772 9.924
11.502 10.008
12.134 9.482
11.628 10.286
12.064 9.616
11.906 9.82
11.736 10.29
12.114 10.904
11.59 9.712
12.648 9.814
12.164 11.018
12.22 9.796
11.846 9.634
11.808 10.058
12.096 9.846
11.594 10.078
12.252 9.938
11.998 9.676
11.894 10.012
12.274 9.936
12.176 10.364
12.104 10.388
11.372 11.466
10.94 11.482
11.084 11.554
11.232 11.374
11.22 11.64
10.962 11.75
11.014 11.746
11.524 10.982
11.012 11.364
11.2 11.062
11.626 11.894
11.23 11.728
11.144 11.91
11.106 11.868
11.53 11.918
11.21 11.114
10.746 11.702
11.154 11.692
11.412 11.924
10.948 11.532
10.988 12.298
10.96 11.392
11.656 11.346
11.178 12.062
11.368 11.56
11.264 11.724
11.554 11.576
10.974 11.114
11.12 11.634
11.51 12.052
10.95 11.402
11.864 12.406
11.198 10.854
11.65 11.496
11.248 11.722
11.602 11.888
11.424 11.454
11.312 11.718
10.736 11.68
11.56 11.798
10.028 12.268
9.282 11.976
9.178 11.53
9.954 12.398
9.622 11.558
9.914 11.844
9.07 11.092
10.578 11.354
9.582 12.14
9.622 11.528
9.35 11.71
10.234 11.974
8.986 12.31
9.438 12.11
9.592 12.012
9.666 11.88
9.364 12.012
9.71 11.772
9.992 11.836
9.916 12.028
9.382 12.226
9.808 12.23
9.272 12.152
9.392 11.18
9.28 11.976
9.848 11.632
9.322 11.514
9.718 11.95
9.12 11.76
8.978 12.37
10.072 12.202
9.966 11.822
9.506 11.648
9.702 11.536
9.45 11.96
9.916 11.962
9.96 11.538
9.014 11.744
9.024 11.846
10.296 11.61
7.87 10.838
8.164 10.534
8.214 10.62
8.166 10.698
8.05 10.746
7.978 11.13
8.08 10.992
8.472 11.082
8.494 10.584
8.354 10.38
8.096 10.714
7.882 10.832
7.908 11.346
7.814 10.872
8.28 10.104
8.082 10.676
8.068 10.118
8.116 10.698
8.042 10.79
8.096 10.878
8.124 10.932
8.632 11.124
8.27 10.716
7.622 10.148
8.198 11.398
8.582 11.064
7.942 11.076
8.004 10.574
8.504 11.378
8.118 11.012
7.874 11.296
7.668 10.924
7.966 10.72
7.94 10.996
7.988 11.228
8.164 11.112
8.386 10.772
8.248 10.994
8.286 10.734
8.224 10.316
7.976 9.578
7.876 8.796
8.172 9.01
8.068 9.202
8.416 8.654
8.71 8.458
8.056 8.434
7.304 9.266
8.118 8.608
7.616 9.446
8.092 8.956
8.368 8.968
8.022 9.334
8.32 9.062
7.832 8.952
7.704 8.672
8.236 9.108
8.37 8.904
8.352 8.896
8.046 9.228
7.71 9.538
8.534 8.55
7.996 9.172
8.046 9.204
8.622 9.174
7.776 8.898
8.226 9.038
7.904 9.194
7.874 8.856
7.992 8.952
8.262 9.468
8.088 9.294
8.034 9.792
8.352 9.016
7.85 9.334
8.404 9.366
7.892 8.808
8.202 9.232
7.668 9.026
8.242 9.308
9.432 8.61
10.066 8.19
9.146 8.044
9.662 7.866
9.6 7.874
8.618 8.552
9.334 7.658
9.424 7.83
8.892 8.166
9.386 7.746
9.878 8.054
9.558 7.948
9.222 8.002
9.52 8.282
9.76 7.932
9.568 8.052
9.736 7.552
9.584 8.478
9.358 8.242
9.404 7.79
9.458 8.54
9.482 7.766
8.844 8.024
9.29 8.472
9.274 7.566
9.11 8.014
9.542 7.688
9.432 8.122
9.786 8.066
9.382 7.664
9.404 8.228
9.146 8.158
9.622 8.004
10.286 7.892
9.43 7.676
9.44 8.058
9.788 7.684
9.586 7.91
9.694 7.448
9.576 7.866
11.442 8.688
11.466 8.558
10.674 8.936
11.23 8.126
11.614 8.588
11.59 8.496
11.536 7.586
11.638 8.266
11.16 8.43
10.904 8.532
11.284 8.742
11.25 8.192
10.84 8.218
11.798 8.836
11.51 8.094
10.932 7.796
11.404 8.206
11.088 8.326
11.334 8.17
11.272 8.394
11.59 8.408
11.212 8.516
11.566 8.024
11.246 8.584
11.252 8.566
10.78 8.294
11.04 8.322
11.198 7.886
11.168 8.262
11.88 8.08
11.356 8.586
11.182 8.342
10.836 8.664
11.696 8.906
11.282 8.28
10.718 8.534
10.444 8.684
11.124 8.618
11.392 8.94
11.212 8.308
16.674 9.638
16.162 10.302
16.612 10.218
16.1 9.702
16.404 10.072
15.93 10.106
16.128 9.888
16.41 10.188
15.982 9.92
16.224 10.02
16.296 9.458
16.586 10.174
16.314 10.716
16.278 9.452
16.622 9.652
16.22 9.494
16.626 10.162
16.982 10.596
16.27 10.128
16.202 9.7
16.532 9.776
17.124 9.726
16.47 9.698
16.004 10.28
16.366 9.796
16.268 9.522
16.13 9.748
16.67 10.498
16.488 10.542
16.57 10.21
16.456 10.112
16.482 9.986
16.584 9.754
16.1 9.93
16.226 9.67
16.448 9.566
16.572 9.624
16.436 9.41
16.502 9.98
16.418 9.966
14.362 14.644
14.138 14.63
14.064 15.072
13.692 14.958
14.238 15.296
13.73 15.128
13.952 14.868
13.986 14.77
13.916 14.996
13.874 14.954
14.168 15.276
14.278 15.152
14.098 14.9
13.764 15.212
13.948 14.218
14.13 14.838
13.362 14.932
13.546 14.844
14.13 15.11
13.816 14.602
14.386 14.686
13.786 14.726
14.204 14.822
13.856 15.206
14.074 14.384
13.68 14.988
14.204 14.976
13.388 15.39
13.708 15.048
14.114 15.366
14.4 15.04
14.194 15.04
13.888 15.436
13.958 15.322
13.922 14.802
13.652 14.602
14.294 14.996
13.81 14.526
13.408 15.34
13.834 14.778
8.826 16.474
8.33 16.488
8.468 16.378
8.904 15.846
8.662 16.354
8.684 16.776
8.33 16.066
8.904 16.402
8.778 16.486
8.81 16.458
8.398 16.576
8.542 15.918
9.064 16.456
9.152 16.094
8.614 15.908
8.566 17.012
8.12 16.11
8.844 16.026
8.398 16.282
8.808 15.59
8.502 16.166
8.942 16.19
8.376 16.112
8.518 15.84
8.878 16.004
8.582 16.774
8.248 16.154
8.588 16.24
8.706 16.374
8.524 16.392
8.458 16.452
8.83 16.36
8.616 16.112
8.844 16.362
8.468 15.928
8.62 16.674
8.974 16.53
8.826 16.084
8.104 15.962
8.386 16.24
4.576 12.878
4.46 13.16
3.632 12.862
4.238 12.506
4.348 13.268
3.788 12.372
4.19 12.772
3.86 12.706
3.978 13.308
4.336 12.854
4.218 13.03
4.25 13.002
4.334 13.06
4.654 12.566
4.38 12.792
3.968 13.016
4.614 12.526
3.95 12.67
4.038 12.67
4.426 12.238
4.066 12.514
4.248 12.392
4.61 12.95
4.328 12.99
4.5 12.522
4.176 12.71
4.492 12.464
4.134 12.834
4.316 12.764
4.454 12.084
4.052 12.934
4.26 13.118
4.058 13.718
4.24 12.626
3.838 12.232
4.128 13.4
3.764 12.38
4.424 13.186
4.234 12.994
5.13 13.296
3.9 6.742
3.994 7.206
4.278 7.222
4.172 6.848
3.882 6.894
3.936 6.994
4.162 6.87
3.762 7.1
4.256 7.612
4.55 6.822
4.062 6.984
4.026 7.23
4.364 7.184
4.292 7.208
4.288 6.91
4.018 7.062
4.07 7.104
4.548 7.654
4.402 7.082
3.692 7.49
4.888 7.194
4.456 7.146
4.732 7.154
4.088 7.212
4.502 6.928
3.402 7.434
4.246 6.692
4.166 7.256
4.852 6.83
4.398 7.428
5.016 7.054
4.25 6.76
3.738 7.082
4.254 7.264
4.122 7.238
3.878 7.232
4.55 7.29
4.03 7.126
4.412 7.022
4.276 7.244
8.376 3.788
8.81 3.864
8.218 3.548
8.374 3.748
8.102 4.3
8.386 3.952
8.858 3.274
8.884 3.504
8.294 3.38
8.38 3.178
8.738 4.294
9.1 3.84
9.086 3.6
8.616 3.45
8.624 3.698
8.632 3.82
8.286 3.704
8.816 3.58
8.722 3.854
8.298 3.378
9.014 4.034
8.87 3.554
8.562 3.662
8.6 3.828
8.94 3.836
8.768 4.32
8.838 3.926
8.288 3.466
8.652 3.782
8.376 3.956
7.724 3.414
8.374 4.136
8.6 3.712
9.026 3.788
8.534 3.252
8.874 3.602
8.796 3.888
8.592 3.988
8.98 4.014
8.562 3.856
13.894 4.16
14.278 5.26
14.364 4.748
14.108 4.918
13.998 5.498
14.4 5.296
14.3 5.368
13.958 5.35
13.842 4.984
13.85 4.246
13.978 5.356
14.366 5.104
14.272 4.94
14.336 5.176
14.744 5.248
14.306 5.06
13.986 5.05
14.44 5.33
14.004 4.92
13.332 4.592
14.218 5.544
14.154 4.768
13.468 4.92
13.67 5.406
13.664 5.016
14.12 4.87
13.836 4.51
14.204 5.064
14.004 5.228
13.266 4.858
13.668 5.34
14.528 4.812
14.318 4.592
14.018 5.182
14.37 4.884
14.198 4.804
14.32 4.59
13.636 5.218
14.41 4.656
14.02 5.614
K-means Python implementation:
# coding=utf-8
import linecache
import random

import numpy as np
from matplotlib import pyplot as plt


def loadDataSet(fileName):
    # One data point per line, coordinates separated by whitespace
    dataMat = []
    with open(fileName) as fr:
        for line in fr:
            curLine = line.strip().split()
            if curLine:  # skip blank lines
                dataMat.append([float(v) for v in curLine])  # convert to float
    return dataMat


# Distance between two vectors (Euclidean distance)
def distEclud(vecA, vecB):
    return np.sqrt(np.sum(np.power(vecA - vecB, 2)))


# Disabled alternative initialization: draw every centroid coordinate
# uniformly between the minimum and maximum of that column.
# def randCent(dataSet, k):
#     minJ = dataSet.min(axis=0)
#     rangeJ = dataSet.max(axis=0) - minJ
#     return minJ + rangeJ * np.random.rand(k, dataSet.shape[1])


# Random initial centroids (Ng's course initializes by picking K random data points)
def randCent(dataSet, k):
    idx = random.sample(range(len(dataSet)), k)
    return dataSet[idx].astype(float)  # fancy indexing returns a copy


def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
    m = dataSet.shape[0]  # number of data points
    # Column 0: index of the assigned centroid; column 1: squared distance to it
    clusterAssment = np.zeros((m, 2))
    clusterAssment[:, 0] = -1  # no point is assigned yet
    centroids = createCent(dataSet, k)
    clusterChanged = True
    while clusterChanged:
        clusterChanged = False
        for i in range(m):  # assign each data point to the closest centroid
            minDist = np.inf
            minIndex = -1
            for j in range(k):
                distJI = distMeas(centroids[j, :], dataSet[i, :])
                if distJI < minDist:
                    minDist = distJI
                    minIndex = j
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True
            clusterAssment[i, :] = minIndex, minDist ** 2
        print(centroids)
        for cent in range(k):  # recalculate centroids
            ptsInClust = dataSet[clusterAssment[:, 0] == cent]  # all points in this cluster
            if len(ptsInClust):
                centroids[cent, :] = np.mean(ptsInClust, axis=0)  # centroid = cluster mean
    return centroids, clusterAssment


def show(dataSet, k, centroids, clusterAssment):
    numSamples, dim = dataSet.shape
    mark = ['or', 'ob', 'og', 'ok', '^r', '+r', 'sr', 'dr', '<r', 'pr', 'xr',
            'sb', 'sg', 'sk', '2r', '<b', '<g', '+b', '+g', 'pb']
    for i in range(numSamples):
        markIndex = int(clusterAssment[i, 0])
        plt.plot(dataSet[i, 0], dataSet[i, 1], mark[markIndex])
    # To draw the centroids as well:
    # mark = ['Dr', 'Db', 'Dg', 'Dk', '^b', '+b', 'sb', 'db', '<b', 'pb',
    #         'or', 'ob', 'og', 'ok', '^r', '+r', 'sr', 'dr', '<r', 'pr']
    # for i in range(k):
    #     plt.plot(centroids[i, 0], centroids[i, 1], mark[i], markersize=12)
    plt.show()


def getDataset(filename, k_sample):
    # Read k_sample randomly chosen lines of the file
    dataMat = []
    with open(filename) as myfile:
        lines = len(myfile.readlines())
    # linecache.getline counts lines from 1, so sample from 1..lines
    SampleLine = random.sample(range(1, lines + 1), k_sample)
    for i in SampleLine:
        curLine = linecache.getline(filename, i).strip().split()
        if curLine:  # skip blank lines
            dataMat.append([float(v) for v in curLine])  # convert to float
    return dataMat


def main():
    dataMat = np.array(loadDataSet('R15.txt'))
    myCentroids, clustAssing = kMeans(dataMat, 15)
    print(myCentroids)
    show(dataMat, 15, myCentroids, clustAssing)


if __name__ == '__main__':
    main()
The results of K-means clustering on this data are shown in the figures below:
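Comparing the two figures above, the same code produces results of varying quality: the first is far from ideal, while the second classifies every point correctly. Judging from the debug output, the cause is the random choice of the initial centers: if two of the K initial points lie very close to each other, they end up splitting one cluster between them and the clusters are not cleanly separated.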
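One common way to reduce this sensitivity, which the code above does not do, is to run K-means several times from different random initial centers and keep the run with the lowest sum of squared errors (SSE). The sketch below is a standard-library illustration under that assumption; the function names `kmeans` and `best_of`, the three synthetic blobs, and the seeds are all hypothetical choices for the example.

```python
import math
import random


def kmeans(points, k, rng):
    """One K-means run from k randomly chosen data points; returns (sse, centers, labels)."""
    centers = rng.sample(points, k)
    labels = None
    while True:
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    sse = sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
    return sse, centers, labels


def best_of(points, k, restarts, seed=0):
    """Run K-means `restarts` times; return the lowest-SSE run and the SSE of every run."""
    rng = random.Random(seed)
    runs = [kmeans(points, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: run[0]), [run[0] for run in runs]


# Hypothetical data: three Gaussian blobs of 30 points each.
data_rng = random.Random(42)
points = [(data_rng.gauss(cx, 0.3), data_rng.gauss(cy, 0.3))
          for cx, cy in [(0.0, 0.0), (5.0, 0.0), (0.0, 5.0)]
          for _ in range(30)]

(best_sse, best_centers, best_labels), all_sse = best_of(points, k=3, restarts=10)
print(sorted(round(s, 2) for s in all_sse))  # SSE of each restart
print(round(best_sse, 2))                    # SSE of the run that is kept
```

If the printed SSE values differ between restarts, the larger ones correspond to runs where two initial centers landed in the same blob, which is the failure described above.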
PAM Python implementation:
# coding=utf-8
import linecache
import random

import numpy as np
from matplotlib import pyplot as plt
from scipy.spatial.distance import pdist


def loadDataSet(fileName):
    # One data point per line, coordinates separated by whitespace
    dataMat = []
    with open(fileName) as fr:
        for line in fr:
            curLine = line.strip().split()
            if curLine:  # skip blank lines
                dataMat.append([float(v) for v in curLine])  # convert to float
    return dataMat


def pearson_distance(vector1, vector2):
    # Despite its name, this returns the Euclidean distance:
    # pdist uses metric='euclidean' by default.
    X = np.vstack([vector1, vector2])
    return float(pdist(X)[0])


distances_cache = {}  # (medoid index, point index) -> distance


def totalcost(blogwords, costf, medoids_idx):
    # Assign every point to its nearest medoid and sum the distances
    size = len(blogwords)
    total_cost = 0.0
    medoids = {}
    for idx in medoids_idx:
        medoids[idx] = []
    for i in range(size):
        choice = None
        min_cost = np.inf
        for m in medoids:
            tmp = distances_cache.get((m, i), None)
            if tmp is None:
                tmp = costf(blogwords[m], blogwords[i])
                distances_cache[(m, i)] = tmp
            if tmp < min_cost:
                choice = m
                min_cost = tmp
        medoids[choice].append(i)
        total_cost += min_cost
    return total_cost, medoids


def kmedoids(blogwords, k):
    size = len(blogwords)
    medoids_idx = random.sample(range(size), k)
    pre_cost, medoids = totalcost(blogwords, pearson_distance, medoids_idx)
    print(pre_cost)
    # Start from the initial solution so that the loop ends even when
    # no swap improves on it
    current_cost = pre_cost
    best_choice = list(medoids_idx)
    best_res = dict(medoids)
    iter_count = 0
    while True:
        # Try replacing each medoid with every other member of its own cluster
        for m in medoids:
            for item in medoids[m]:
                if item != m:
                    idx = medoids_idx.index(m)
                    swap_temp = medoids_idx[idx]
                    medoids_idx[idx] = item
                    tmp, medoids_ = totalcost(blogwords, pearson_distance, medoids_idx)
                    # print(tmp, '-------->', medoids_.keys())
                    if tmp < current_cost:
                        best_choice = list(medoids_idx)
                        best_res = dict(medoids_)
                        current_cost = tmp
                    medoids_idx[idx] = swap_temp  # undo the trial swap
        iter_count += 1
        print(current_cost, iter_count)
        if best_choice == medoids_idx:
            break  # no swap lowered the cost
        medoids = best_res
        medoids_idx = list(best_choice)
    return current_cost, best_choice, best_res


def show(dataSet, k, centroids, clusterAssment):
    # clusterAssment maps each medoid index to the list of its member indices
    mark = ['or', 'ob', 'og', 'ok', '^r', '+r', 'sr', 'dr', '<r', 'pr', 'xr',
            'sb', 'sg', 'sk', '2r', '<b', '<g', '+b', '+g', 'pb']
    for j, members in enumerate(clusterAssment.values()):
        for i in members:
            plt.plot(dataSet[i, 0], dataSet[i, 1], mark[j])
    # To draw the medoids as well:
    # mark = ['Dr', 'Db', 'Dg', 'Dk', '^b', '+b', 'sb', 'db', '<b', 'pb',
    #         'or', 'ob', 'og', 'ok', '^r', '+r', 'sr', 'dr', '<r', 'pr']
    # for i in range(k):
    #     plt.plot(centroids[i][0], centroids[i][1], mark[i], markersize=12)
    plt.show()


def getDataset(filename, k_sample):
    # Read k_sample randomly chosen lines of the file
    dataMat = []
    with open(filename) as myfile:
        lines = len(myfile.readlines())
    # linecache.getline counts lines from 1, so sample from 1..lines
    SampleLine = random.sample(range(1, lines + 1), k_sample)
    for i in SampleLine:
        curLine = linecache.getline(filename, i).strip().split()
        if curLine:  # skip blank lines
            dataMat.append([float(v) for v in curLine])  # convert to float
    return dataMat


if __name__ == '__main__':
    dataMat = getDataset('R15.txt', 150)
    best_cost, best_choice, best_medoids = kmedoids(dataMat, 15)
    dataMat = np.array(dataMat)
    listone = [dataMat[i] for i in best_choice]
    show(dataMat, 15, listone, best_medoids)
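Because of running-time constraints, the PAM algorithm was run on only a sample of the data here, which is why the plot contains fewer points. The clustering is stable, but not optimal.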
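A rough count shows why sampling helps. In the swap loop of the PAM code above, every non-medoid point is tried once as a replacement for the medoid of its own cluster, and each trial calls `totalcost`, which compares every point with every medoid. The helper below (`lookups_per_pass` is a name made up for this estimate) counts those distance lookups for one pass; it is a back-of-the-envelope sketch that ignores ties and the benefit of the distance cache, not a measurement.

```python
def lookups_per_pass(n, k):
    """Approximate distance lookups in one pass of the swap loop."""
    candidate_swaps = n - k   # each non-medoid is tried against its own medoid
    cost_per_swap = n * k     # totalcost compares each point with each medoid
    return candidate_swaps * cost_per_swap


sampled = lookups_per_pass(150, 15)
full = lookups_per_pass(600, 15)
print(sampled)                   # 303750
print(full)                      # 5265000
print(round(full / sampled, 1))  # 17.3
```

Under these assumptions, one pass over all 600 points costs roughly 17 times as much as one pass over the 150-point sample, before counting any extra passes the larger problem may need.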