C++ HAC example


http://blog.o-x-t.com/2009/01/23/hierarchical_clustering/

Below are my reading notes on its code:

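Flat clustering -- produces no structure. Efficient, O(n)
Hierarchical clustering -- the number of clusters need not be fixed in advance; less efficient, at least O(n^2)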

Algorithms -- single link
   complete link
   group average
   centroid
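
To make the first three criteria concrete, here is a minimal sketch that computes the distance between two clusters from a precomputed pairwise distance matrix. Only SINGLE_LINKAGE is a name taken from the original code; the other enum values and the function cluster_distance are made up for illustration. The average here is taken over cross-cluster pairs only, which is a simplification: some textbook definitions of group average also count pairs inside the merged cluster. Centroid linkage is left out because it needs the vectors themselves, not just pairwise distances.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// SINGLE_LINKAGE appears in the original code; the other two names are assumed.
enum Linkage { SINGLE_LINKAGE, COMPLETE_LINKAGE, AVERAGE_LINKAGE };

// Distance between clusters a and b (lists of item indices, both non-empty),
// given a symmetric matrix dist of pairwise item distances.
double cluster_distance(const std::vector<std::vector<double>>& dist,
                        const std::vector<int>& a,
                        const std::vector<int>& b,
                        Linkage how) {
    double lo = dist[a[0]][b[0]];  // smallest cross-cluster distance so far
    double hi = lo;                // largest cross-cluster distance so far
    double sum = 0.0;              // sum over all cross-cluster pairs
    for (int i : a) {
        for (int j : b) {
            const double d = dist[i][j];
            lo = std::min(lo, d);
            hi = std::max(hi, d);
            sum += d;
        }
    }
    if (how == SINGLE_LINKAGE) return lo;    // closest pair decides
    if (how == COMPLETE_LINKAGE) return hi;  // farthest pair decides
    return sum / static_cast<double>(a.size() * b.size());  // mean over cross pairs
}
```

Single link tends to chain clusters together, complete link favours compact clusters, and the average sits in between.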

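HAC -- bottom-up (agglomerative) algorithm; the merge operations are monotonic (this holds for single link, complete link and group average; centroid clustering can produce inversions)
HAC cutoff -- cut at a similarity level given in advance, cut at the knee (the largest gap between successive merges), or cut at a specified number of clusters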

title --> day is the text before the |, title_string is the text after the | with spaces removed, word is each word after the |
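
The split described above can be sketched as follows. This is not the original parser: the struct name ParsedTitle and the function parse_title are hypothetical, and the sketch assumes words are separated by whitespace and that a line without a | is all "day".

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record; the field names follow the notes above.
struct ParsedTitle {
    std::string day;                 // text before '|'
    std::string title_string;        // text after '|' with whitespace removed
    std::vector<std::string> words;  // whitespace-separated words after '|'
};

ParsedTitle parse_title(const std::string& line) {
    ParsedTitle t;
    const std::string::size_type bar = line.find('|');
    if (bar == std::string::npos) {  // assumption: no separator means no title text
        t.day = line;
        return t;
    }
    t.day = line.substr(0, bar);
    std::istringstream in(line.substr(bar + 1));
    std::string w;
    while (in >> w) {          // operator>> skips any run of whitespace
        t.words.push_back(w);
        t.title_string += w;   // concatenating the words drops the spaces
    }
    return t;
}
```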

create D vectors for each title
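First fetch the titles through titles_it
For each title_it:
   resize D
   take the words of title_it into title_words
   look each word up in all_words; if it is there, set the corresponding position to 1, D[index] = 1; the D here stands for "different words"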

D_size -- number of distinct words
N -- number of titles
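
A minimal sketch of building one such D vector, assuming all_words is a std::vector<std::string> of the distinct words (so D_size == all_words.size()). The function name make_indicator is hypothetical, and the linear std::find is chosen for clarity rather than speed.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

// Returns a 0/1 vector of length D_size: D[index] == 1 exactly when
// all_words[index] occurs among title_words.
std::vector<int> make_indicator(const std::vector<std::string>& all_words,
                                const std::vector<std::string>& title_words) {
    std::vector<int> D(all_words.size(), 0);  // plays the role of "resize D"
    for (const std::string& w : title_words) {
        const auto it = std::find(all_words.begin(), all_words.end(), w);
        if (it != all_words.end()) {
            const std::size_t index =
                static_cast<std::size_t>(std::distance(all_words.begin(), it));
            D[index] = 1;
        }
    }
    return D;
}
```

With a large vocabulary, a std::map from word to index would avoid the repeated linear scans.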

All operations act on titles

classification(SINGLE_LINKAGE, JACCARD);

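C holds the distance matrix       std::vector< std::vector< distances > > C;

P holds the priority queues of sorted distances   std::vector< std::multiset< distances, Cmp > > P;

II marks which clusters are active

A holds the clusters of titles   std::vector< std::vector< int > > A;

C, P, II and A are all initialized with size N, where N is the number of titles, i.e. what is usually called the number of documents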

i --> 0 to N - 1
   V_temp   -->   std::vector<distances>, one row of C
   D_temp   -->   distances D_temp
   Q_temp   -->   std::multiset<distances, Cmp>, one row of P

   j   -->   0 to N - 1
       compute the distance, store it in D_temp and insert it into Q_temp
   C[i]=V_temp;
   II[i]=1;
   P[i]=Q_temp;


   std::vector<int> A_i;
   A_i.push_back(i);
   A[i]=A_i;   //each cluster initially contains only itself
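
Put together, the initialization loop might look like the sketch below. It is a reconstruction from these notes, not the original source: the fields of distances (dist, index), the HacState bundle, init_state and the dist_fn callback are all assumptions, and so is leaving the self-distance out of the queue (otherwise each title would be its own nearest neighbour).

```cpp
#include <cstddef>
#include <set>
#include <vector>

struct distances {  // field names assumed
    double dist;    // distance to another title
    int index;      // index of that title
};

struct Cmp {  // smaller distance first, ties broken by smaller index
    bool operator()(const distances& a, const distances& b) const {
        return a.dist != b.dist ? a.dist < b.dist : a.index < b.index;
    }
};

struct HacState {  // hypothetical bundle of the four structures
    std::vector<std::vector<distances>> C;          // distance matrix
    std::vector<std::multiset<distances, Cmp>> P;   // sorted neighbours per title
    std::vector<int> II;                            // 1 = cluster still active
    std::vector<std::vector<int>> A;                // members of each cluster
};

// dist_fn(i, j) returns the distance between titles i and j.
template <class DistFn>
HacState init_state(int N, DistFn dist_fn) {
    HacState s;
    s.C.resize(N);
    s.P.resize(N);
    s.II.resize(N);
    s.A.resize(N);
    for (int i = 0; i < N; ++i) {
        std::vector<distances> V_temp;
        std::multiset<distances, Cmp> Q_temp;
        for (int j = 0; j < N; ++j) {
            distances D_temp;
            D_temp.dist = dist_fn(i, j);
            D_temp.index = j;
            V_temp.push_back(D_temp);
            if (j != i) Q_temp.insert(D_temp);  // assumption: skip the self-distance
        }
        s.C[i] = V_temp;
        s.II[i] = 1;
        s.P[i] = Q_temp;
        s.A[i] = std::vector<int>(1, i);  // each cluster starts as just itself
    }
    return s;
}
```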

Cmp:
   when the distances are equal, the entry with the smaller index sorts first
   otherwise, the entry with the smaller distance sorts first
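
A comparator with this ordering could be written as below. The member names dist and index are assumed from how the notes refer to them; the original struct may carry more fields.

```cpp
#include <set>

struct distances {  // field names assumed
    double dist;    // distance value
    int index;      // index of the title this distance points to
};

struct Cmp {
    bool operator()(const distances& a, const distances& b) const {
        if (a.dist != b.dist) return a.dist < b.dist;  // smaller distance first
        return a.index < b.index;                      // tie: smaller index first
    }
};

// One row of P: iterating from begin() visits the nearest neighbour first.
typedef std::multiset<distances, Cmp> DistanceQueue;
```

Because the multiset keeps its elements sorted by Cmp, *begin() is always the current nearest neighbour, which is what the merge loop relies on.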

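Distance functions:

dot_distance -- simply counts how many words the two titles have in common

jaccard_distance -- 1 - (number of shared words)/(number of distinct words in the two titles combined)

The original code's D_temp.dist = - jaccard_distance(titles[i].words, titles[j].words); is wrong; the leading "-" should be removed. Cmp puts the smallest dist first, so negating the distance would rank the most dissimilar pair first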
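
Both measures can be sketched on 0/1 word vectors. This assumes each title is represented by its D vector as a std::vector<int>; the original passes titles[i].words, whose type these notes do not show, so the signatures here are a guess. Returning 0 for two empty titles is also an assumption.

```cpp
#include <cstddef>
#include <vector>

// Number of words the two titles share (dot product of two 0/1 vectors).
double dot_distance(const std::vector<int>& a, const std::vector<int>& b) {
    double shared = 0.0;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i)
        shared += a[i] * b[i];
    return shared;
}

// 1 - |intersection| / |union|: 0 for identical word sets, 1 for disjoint ones.
double jaccard_distance(const std::vector<int>& a, const std::vector<int>& b) {
    int inter = 0;  // words present in both titles
    int uni = 0;    // words present in at least one title
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) {
        if (a[i] && b[i]) ++inter;
        if (a[i] || b[i]) ++uni;
    }
    if (uni == 0) return 0.0;  // assumption: two empty titles count as identical
    return 1.0 - static_cast<double>(inter) / uni;
}
```

Note that dot_distance grows with similarity, so despite its name it behaves like a similarity score, whereas jaccard_distance is a true distance in [0, 1].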

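Building the clusters
n   -->   0 to N - K - 1
   min_dist = 1000
   min_index = 0
   k   -->   0 to N - 2
       if cluster k is active, compare the dist of the first entry of P[k] with min_dist and replace min_dist if it is smaller; since P is sorted, only the first entry needs to be compared

   k1, k2 are the pair at the smallest distance
   N_k1 -- number of members stored in cluster k1
   N_k2 -- number of members stored in cluster k2
   merge k2 into k1
   II[k2] = 0   //remove it from the active clusters
   recompute P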
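
The whole merge loop can be sketched for the single-link case as follows. This is a compact reconstruction under several assumptions, not the original code: K is taken to be the target number of clusters, C is simplified to a matrix of plain doubles, the function name hac_single_link is made up, and the scan covers every active cluster. "Recompute P" is done by replacing the entries for k1 and k2 in each remaining queue with the merged distance.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <vector>

struct distances { double dist; int index; };  // field names assumed
struct Cmp {
    bool operator()(const distances& a, const distances& b) const {
        return a.dist != b.dist ? a.dist < b.dist : a.index < b.index;
    }
};

// Merges until K clusters remain; C is a symmetric N x N distance matrix.
// Returns the surviving clusters as lists of title indices.
std::vector<std::vector<int>> hac_single_link(std::vector<std::vector<double>> C, int K) {
    const int N = static_cast<int>(C.size());
    std::vector<std::multiset<distances, Cmp>> P(N);
    std::vector<int> II(N, 1);
    std::vector<std::vector<int>> A(N);
    for (int i = 0; i < N; ++i) {
        A[i].push_back(i);
        for (int j = 0; j < N; ++j)
            if (j != i) P[i].insert(distances{C[i][j], j});
    }
    for (int n = 0; n < N - K; ++n) {
        // P is sorted, so only the head of each active queue is inspected.
        int k1 = -1;
        for (int k = 0; k < N; ++k)
            if (II[k] && !P[k].empty() &&
                (k1 < 0 || P[k].begin()->dist < P[k1].begin()->dist))
                k1 = k;
        if (k1 < 0) break;  // nothing left to merge
        const int k2 = P[k1].begin()->index;
        A[k1].insert(A[k1].end(), A[k2].begin(), A[k2].end());  // k2 joins k1
        II[k2] = 0;                                             // k2 is no longer active
        P[k1].clear();
        for (int i = 0; i < N; ++i) {
            if (!II[i] || i == k1) continue;
            P[i].erase(distances{C[i][k1], k1});
            P[i].erase(distances{C[i][k2], k2});
            const double d = std::min(C[i][k1], C[i][k2]);  // single link: keep the minimum
            C[i][k1] = C[k1][i] = d;
            P[i].insert(distances{d, k1});
            P[k1].insert(distances{d, i});
        }
    }
    std::vector<std::vector<int>> out;
    for (int k = 0; k < N; ++k)
        if (II[k]) out.push_back(A[k]);
    return out;
}
```

For complete link the std::min would become std::max; group average and centroid need the cluster sizes, which is presumably what N_k1 and N_k2 are for.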