聚类算法（2）

来源：互联网发布：录制脚本的软件编辑：程序博客网时间：2024/06/12 22:19

（接上《聚类算法（1）》）

除了在聚类算法（1）中的分级聚类，关于聚类还可以使用K-均值聚类（也是为了克服分级聚类的两个不足之处）。

算法描述：

这里有比较详尽的算法描述。（嗯，《Programming Collective Intelligence 》上有幅超级好的插图。。。）

#----------------------------------------------------------------------# K-均值聚类#----------------------------------------------------------------------def kcluster(rows, distancefunc=pearson, k=4): """K-Means Clustering""" # 确定每个点的最大值和最小值 ranges = [ (min([row[i] for row in rows]), max([row[i] for row in rows])) for i in range(len(rows[0]))] # 随机创建K个中心点 clusters = [ [random.random()*(ranges[i][1]-ranges[i][0]) + ranges[i][0] for i in range(len(rows[0]))] for j in range(k) ] lastmatches = None for t in range(100): bestmatches = [[] for i in range(k)] # 在每一行寻找距离最近的中心点 for j in range(len(rows)): row = rows[j] bestmatch = 0 for i in range(k): d = distancefunc(clusters[i], row) if d < distancefunc(clusters[bestmatch], row): bestmatch = i bestmatches[bestmatch].append(j) # 如果结果与上一次相同，则过程结束 if bestmatches == lastmatches: break lastmatches = bestmatches # 移动中心点 for i in range(k): avgs = [0.0]*len(rows[0]) if len(bestmatches[i]) > 0: for rowid in bestmatches[i]: for m in range(len(rows[rowid])): avgs[m] += rows[rowid][m] for j in range(len(avgs)): avgs[j] /= len(bestmatches[i]) clusters[i] = avgs return bestmatches

说明：

1.rows是什么？

rows是从数据文件中读入的关于单词的频度数据。比如这里的rows，共有32项（有32个博客，即初始时有32个聚类），每项有629个数据（即，有629个单词，当然这些数据只是数字，是某单词（共629个）（该单词保存在其他地方）所对应的频度）。而我们进行的是，根据这些单词，分类那32个博客。故，我们的pearson算法计算的是一个长度629的列表与另一个629长的列表之间的距离。

这个629长的列表即是我们所谓的点（我们要聚类的对象）。

2.clusters是什么？

说clusters是K个中心点。那么K个中心点到底是什么？注意这里的“点”和平常所谓的点是不同的，上面的粗体部分已具体解释。说，clusters是k个中心点，即是说，clusters是个长度为k的列表，表中的每个元素是长度为620的数组（其中为我们随机产生的单词频度）。

那么，

“ 在每一行寻找距离最近的中心点”的意义也就明确了：将这32个条目（每个条目629项数据）中的某些同这k个所谓的中心点相关联。

另，

注意“移动中心点”中的if len(bestmatches[i]) > 0:一句是不可缺少的。因为可能这K个点中存在某些点，没有数据与他们相近，于是他们的长度就为0。

同时注意bestmatches中存的是什么？看代码：bestmatches[bestmatch].append(j)，我们知，bestmatches中存放的实际是某条数据在入内数据集中的序号。而在计算平均数时，也是计算的这些所有数据的平均数。

总之，记住这里的点实际上是个表。

最后再append一个Tanimoto算法，用于距离度量。更精确的说，是用于这样的数据的距离度量：该数据中只可能有0,1两种取值，其中0代表不存在，1代表存在。（即yes和no的问题）

#----------------------------------------------------------------------# 距离度量： Tanimoto算法def tanimoto(v1, v2): c1,c2,shr = 0,0,0 for i in range(len(v1)): if v1[i]: c1 += 1 if v2[i]: c2 += 1 if v1[i] and v2[i]: shr += 1 return (1.0 - (float(shr))/(c1+c2-shr))

注意这里是距离度量，因此，返回0，表示二者之间无距离，即两者喜欢的东西一模一样；返回1，则表示两者不存在都喜欢的东西（因为我们进行的是对一个列表度量距离，该列表可能代表了两个人喜欢的东西的列表）。

最后，贴出完整的代码：

数据聚类算法：

# chr03 -- clusters.pyfrom math import sqrtimport random#----------------------------------------------------------------------# 加载数据文件def readfile(filename): """load data""" with open(filename) as file: lines = [line for line in file] # 第一列是列标题 colnames = lines[0].strip().split('/t')[1:] rownames = [] data = [] for line in lines[1:]: p = line.strip().split('/t') # 每行第一列是行名 rownames.append(p[0]) # 其余部分是数据 data.append([float(x) for x in p[1:]]) return rownames,colnames,data#----------------------------------------------------------------------# 皮尔逊算法--求相似度def pearson(v1, v2): """pearson""" # 求和 sum1 = sum(v1) sum2 = sum(v2) # 求平方和 sum1sq = sum([pow(v,2) for v in v1]) sum2sq = sum([pow(v,2) for v in v2]) # 求乘积之和 psum = sum([v1[i]*v2[i] for i in range(len(v1))]) # 计算pearson score num = psum - (sum1*sum2/len(v1)) den = sqrt((sum1sq-pow(sum1,2)/len(v1))*(sum2sq-pow(sum2,2)/len(v1))) if den == 0: return 0 return 1.0-num/den#----------------------------------------------------------------------# 距离度量： Tanimoto算法def tanimoto(v1, v2): c1,c2,shr = 0,0,0 for i in range(len(v1)): if v1[i]: c1 += 1 if v2[i]: c2 += 1 if v1[i] and v2[i]: shr += 1 return (1.0 - (float(shr))/(c1+c2-shr)) #----------------------------------------------------------------------# 聚类算法def hcluster(rows, distancefunc = pearson): """cluster""" distances = {} currentclusterid = 1 # 最开始的聚类就是数据集中的行 clust = [bicluster(rows[i], id=i) for i in range(len(rows))] while len(clust) > 1: lowestpair = (0,1) closest = distancefunc(clust[0].vec, clust[1].vec) # 遍历每个配对，寻找最小值 for i in range(len(clust)): for j in range(i+1, len(clust)): # 用distances缓存距离的计算值 if (clust[i].id, clust[j].id) not in distances: distances[(clust[i].id, clust[j].id)] = distancefunc(clust[i].vec, clust[j].vec) d = distances[(clust[i].id, clust[j].id)] if d < closest: closest = d lowestpair = (i, j) # 计算两个聚类的平均值 mergevec = [(clust[lowestpair[0]].vec[i] + clust[lowestpair[1]].vec[i])/2.0 for i in range(len(clust[0].vec))] # 建立新的聚类 newcluster = bicluster(mergevec, left = clust[lowestpair[0]], right = clust[lowestpair[1]], distance = closest, id = currentclusterid) # 不在原始集合中的聚类，其id为负数 currentclusterid -= 1 del clust[lowestpair[0]] del clust[lowestpair[1] - 1] clust.append(newcluster) return clust[0]#----------------------------------------------------------------------# 列聚类 -- 倒置数据#----------------------------------------------------------------------def rotatematrix(data): newdata = [] for i in range(len(data[0])): newrow = [data[j][i] for j in range(len(data))] newdata.append(newrow) return newdata#----------------------------------------------------------------------# K-均值聚类#----------------------------------------------------------------------def kcluster(rows, distancefunc=pearson, k=4): """K-Means Clustering""" # 确定每个点的最大值和最小值 ranges = [ (min([row[i] for row in rows]), max([row[i] for row in rows])) for i in range(len(rows[0]))] # 随机创建K个中心点 clusters = [ [random.random()*(ranges[i][1]-ranges[i][0]) + ranges[i][0] for i in range(len(rows[0]))] for j in range(k) ] lastmatches = None for t in range(100): bestmatches = [[] for i in range(k)] # 在每一行寻找距离最近的中心点 for j in range(len(rows)): row = rows[j] bestmatch = 0 for i in range(k): d = distancefunc(clusters[i], row) if d < distancefunc(clusters[bestmatch], row): bestmatch = i bestmatches[bestmatch].append(j) # 如果结果与上一次相同，则过程结束 if bestmatches == lastmatches: break lastmatches = bestmatches # 移动中心点 for i in range(k): avgs = [0.0]*len(rows[0]) if len(bestmatches[i]) > 0: for rowid in bestmatches[i]: for m in range(len(rows[rowid])): avgs[m] += rows[rowid][m] for j in range(len(avgs)): avgs[j] /= len(bestmatches[i]) clusters[i] = avgs return bestmatches ########################################################################class bicluster: """cluster data store class""" #---------------------------------------------------------------------- def __init__(self, vec, left=None, right=None, distance=0.0, id=None): """Constructor""" self.vec = vec self.left = left self.right = right self.distance = distance self.id = id #----------------------------------------------------------------------# test -- clusters.py#----------------------------------------------------------------------if __name__ == '__main__': print('test: clusters.py') print('====================') blogname, words, data = readfile('blogdata.txt') # rdata = rotatematrix(data) # clust = hcluster(rdata) # drawdendrogram(clust, blogname) kcluster(data, k=10)

从博客订阅源中生成初始数据：

# chr03 -- generatefeedvector.py# -*- coding: gb2312 -*-import feedparserimport re#----------------------------------------------------------------------# 返回一个RSS订阅源的标题及一个包含单词计数情况的词典def getwordcounts(url): """counting the words in a Feed""" d = feedparser.parse(url) wc = {} # 若为空则直接返回 if not d.entries: return None,None # 遍历所有条目 for e in d.entries: if 'summary' in e: summary = e.summary else: summary = e.description # 提取一个单词列表 words = getwords(e.title+' '+summary) for word in words: wc.setdefault(word, 0) wc[word] += 1 return d.feed.title,wc#----------------------------------------------------------------------# 返回剥去html标记的单词列表def getwords(html): """return words list without html signal""" # 去掉HTML标记并将其替换成'' txt = re.compile(r'<[^>]+>').sub('', html) # 利用所有非字母字符拆分单词 words = re.compile(r'[^A-Z^a-z]+').split(txt) # 转换为小写形式 return [word.lower() for word in words if word != '']#----------------------------------------------------------------------# main codeapcount = {} # 出现这些单词的博客数目wordcounts = {} # 单词计数with open('feedlist.txt') as file: feedlist = [line for line in file]for feedurl in feedlist: title, wc = getwordcounts(feedurl) if title == None: continue wordcounts[title] = wc for word, count in wc.items(): apcount.setdefault(word, 0) if count >= 1: apcount[word] += 1wordlist = []for w,bc in apcount.items(): frac = float(bc)/len(feedlist) if 0.1<frac<0.5: wordlist.append(w)with open('blogdata.txt', 'w') as outfile: outfile.write('Blog') for word in wordlist: outfile.write('/t%s' % word) outfile.write('/n') for blog,wc in wordcounts.items(): outfile.write(blog) for word in wordlist: if word in wc: outfile.write('/t%d' % wc[word]) else: outfile.write('/t0') outfile.write('/n')#----------------------------------------------------------------------# test: generatefeedvector.py#----------------------------------------------------------------------if __name__ == '__main__': pass

作树状图：

# chr03 -- drawtree.pyfrom math import sqrtfrom PIL import Image,ImageDraw#----------------------------------------------------------------------# 树状图#----------------------------------------------------------------------# 树状图 -- 树的总高度def getheight(clust): """get the height of the tree""" if clust.left == None and clust.right == None: return 1 else: return (getheight(clust.left) + getheight(clust.right)) #----------------------------------------------------------------------# 树状图 -- 树的距离def getdepth(clust): """the depth of the tree""" if clust.left == None and clust.right == None: return 0 else: return (max(getdepth(clust.left), getdepth(clust.right)) + clust.distance)#----------------------------------------------------------------------# 生成图片def drawdendrogram(clust, labels, jpeg = 'clusters.jpg'): """draw the picture""" # 获得高度和宽度 h = getheight(clust)*30 w = 1400 depth = getdepth(clust) # 由于宽度固定，因此需要对距离进行调整 scaling = float(w-150)/depth # 创建白色背景的图片 img = Image.new('RGB', (w,h), (255,255,255)) draw = ImageDraw.Draw(img) draw.line((0,h/2,10,h/2), fill=(255,0,0)) # 画第一个节点 drawnode(draw, clust, 10, (h/2), scaling, labels) img.save(jpeg, 'JPEG')#----------------------------------------------------------------------# 画节点和连线def drawnode(draw, clust, x, y, scaling, labels): """draw node and lined node""" if clust.id < 0: h1 = getheight(clust.left)*20 h2 = getheight(clust.right)*20 top = y - (h1+h2)/2 bottom = y + (h1+h2)/2 # 线的长度 l1 = clust.distance*scaling # 聚类到其子节点的垂直线 draw.line((x, top+h1/2, x, bottom-h2/2), fill=(255,0,0)) # 连接左侧节点的水平线 draw.line((x, top+h1/2, x+l1, top+h1/2), fill=(255,0,0)) # 连接右侧节点的水平线 draw.line((x, bottom-h2/2, x+l1, bottom-h2/2), fill=(255,0,0)) # 绘制左右节点 drawnode(draw, clust.left, x+l1, top+h1/2, scaling, labels) drawnode(draw, clust.right, x+l1, bottom-h2/2, scaling, labels) else: # 是叶节点则绘制其标签 draw.text((x+5,y-7), labels[clust.id], (0,0,0)) #----------------------------------------------------------------------# test -- drawtree.py#----------------------------------------------------------------------if __name__ == '__main__': print('test: clusters.py') print('====================') pass

完

By Kewing