A Worked Example of Hierarchical Clustering

Connections 17(2):78-80
Copyright 1994 INSNA

Stephen P. Borgatti
University of South Carolina

Given a set of N items to be clustered, and an NxN distance (or similarity) matrix, the basic process of Johnson's (1967) hierarchical clustering is this:

1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters equal the distances (similarities) between the items they contain.

2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now you have one less cluster.

3. Compute distances (similarities) between the new cluster and each of the old clusters.

4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
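The four steps can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the article: the function name `agglomerate`, the dictionary-of-dictionaries layout for the distances, and the four-item toy data are all assumptions made for the example. For brevity it recomputes cluster-to-cluster distances from the item distances on every pass instead of updating a matrix, which gives the same merges for the linkage rules discussed here.

```python
from itertools import combinations


def agglomerate(labels, dist, linkage=min):
    """Return the merges as (cluster_a, cluster_b, level) tuples.

    `dist[x][y]` is the distance between items x and y (hypothetical layout).
    `linkage` reduces a collection of item distances to one number;
    `min` gives single-link, `max` gives complete-link.
    """
    # Step 1: every item starts in its own cluster.
    clusters = [(label,) for label in labels]
    merges = []

    def cluster_distance(pair):
        a, b = pair
        # Step 3, done on demand: derive the distance between two
        # clusters from the distances between their members.
        return linkage(dist[x][y] for x in a for y in b)

    # Step 4: repeat until a single cluster remains.
    while len(clusters) > 1:
        # Step 2: find and merge the closest pair of clusters.
        a, b = min(combinations(clusters, 2), key=cluster_distance)
        merges.append((a, b, cluster_distance((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Made-up toy distances between four items, for illustration only.
toy = {
    "a": {"a": 0, "b": 2, "c": 6, "d": 10},
    "b": {"a": 2, "b": 0, "c": 5, "d": 9},
    "c": {"a": 6, "b": 5, "c": 0, "d": 4},
    "d": {"a": 10, "b": 9, "c": 4, "d": 0},
}

for a, b, level in agglomerate(list(toy), toy):
    print("/".join(a), "+", "/".join(b), "at", level)
```

With `linkage=min`, the toy data merge a with b at 2, c with d at 4, and the two pairs at 5.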

Step 3 can be done in different ways, which is what distinguishes single-link from complete-link and average-link clustering. In single-link clustering (also called the connectedness or minimum method), we consider the distance between one cluster and another cluster to be equal to the shortest distance from any member of one cluster to any member of the other cluster. If the data consist of similarities, we consider the similarity between one cluster and another cluster to be equal to the greatest similarity from any member of one cluster to any member of the other cluster. In complete-link clustering (also called the diameter or maximum method), we consider the distance between one cluster and another cluster to be equal to the longest distance from any member of one cluster to any member of the other cluster. In average-link clustering, we consider the distance between one cluster and another cluster to be equal to the average distance from any member of one cluster to any member of the other cluster. A variation on average-link clustering is the UCLUS method of D'Andrade (1978), which uses the median distance.
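The linkage rules differ only in how they reduce the set of member-to-member distances to a single number. A minimal Python sketch, with function names chosen for this example rather than taken from any package, and using three of the city distances from the example that follows (BOS to NY 206, BOS to DC 429, NY to DC 233):

```python
from statistics import mean, median


def pair_distances(a, b, dist):
    """All distances from a member of cluster a to a member of cluster b."""
    return [dist[x][y] for x in a for y in b]


def single_link(a, b, dist):
    return min(pair_distances(a, b, dist))      # shortest distance


def complete_link(a, b, dist):
    return max(pair_distances(a, b, dist))      # longest distance


def average_link(a, b, dist):
    return mean(pair_distances(a, b, dist))     # average distance


def median_link(a, b, dist):
    # Median distance, as the text describes for UCLUS; a simplified
    # stand-in, not a full implementation of that method.
    return median(pair_distances(a, b, dist))


miles = {
    "BOS": {"BOS": 0, "NY": 206, "DC": 429},
    "NY": {"BOS": 206, "NY": 0, "DC": 233},
    "DC": {"BOS": 429, "NY": 233, "DC": 0},
}

bos_ny, dc = ("BOS", "NY"), ("DC",)
print(single_link(bos_ny, dc, miles))    # 233
print(complete_link(bos_ny, dc, miles))  # 429
print(average_link(bos_ny, dc, miles))   # 331
```

For similarity data the roles of `min` and `max` swap: single-link takes the greatest similarity and complete-link the smallest.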

Example. The following pages trace a hierarchical clustering of distances in miles between U.S. cities. The method of clustering is single-link.

Input distance matrix:

      BOS    NY    DC   MIA   CHI   SEA    SF    LA   DEN
BOS     0   206   429  1504   963  2976  3095  2979  1949
NY    206     0   233  1308   802  2815  2934  2786  1771
DC    429   233     0  1075   671  2684  2799  2631  1616
MIA  1504  1308  1075     0  1329  3273  3053  2687  2037
CHI   963   802   671  1329     0  2013  2142  2054   996
SEA  2976  2815  2684  3273  2013     0   808  1131  1307
SF   3095  2934  2799  3053  2142   808     0   379  1235
LA   2979  2786  2631  2687  2054  1131   379     0  1059
DEN  1949  1771  1616  2037   996  1307  1235  1059     0

The nearest pair of cities is BOS and NY, at distance 206. These are merged into a single cluster called "BOS/NY".

Then we compute the distance from this new compound object to all other objects. In single link clustering the rule is that the distance from the compound object to another object is equal to the shortest distance from any member of the cluster to the outside object. So the distance from "BOS/NY" to DC is chosen to be 233, which is the distance from NY to DC. Similarly, the distance from "BOS/NY" to DEN is chosen to be 1771.
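The update rule just described can be written as a small helper. The sketch below is an assumption-laden illustration: the name `merge_single_link` and the nested-dictionary matrix are invented for the example, and only four of the nine cities are included to keep it short.

```python
def merge_single_link(dist, a, b):
    """Return a new distance table in which clusters a and b are merged.

    The merged cluster is labelled "a/b"; its distance to every other
    cluster is the smaller of the two old distances (single-link rule).
    """
    merged = f"{a}/{b}"
    others = [k for k in dist if k not in (a, b)]
    # Distances among the untouched clusters are copied unchanged.
    new = {k: {m: dist[k][m] for m in others} for k in others}
    new[merged] = {merged: 0}
    for k in others:
        d = min(dist[a][k], dist[b][k])
        new[merged][k] = d
        new[k][merged] = d
    return new


# Four cities from the input matrix above.
miles = {
    "BOS": {"BOS": 0, "NY": 206, "DC": 429, "DEN": 1949},
    "NY": {"BOS": 206, "NY": 0, "DC": 233, "DEN": 1771},
    "DC": {"BOS": 429, "NY": 233, "DC": 0, "DEN": 1616},
    "DEN": {"BOS": 1949, "NY": 1771, "DC": 1616, "DEN": 0},
}

after = merge_single_link(miles, "BOS", "NY")
print(after["BOS/NY"]["DC"])   # 233, the NY to DC distance
print(after["BOS/NY"]["DEN"])  # 1771, the NY to DEN distance
```

Replacing `min` with `max` in the helper would give the complete-link update instead.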

After merging BOS with NY:

        BOS/NY    DC   MIA   CHI   SEA    SF    LA   DEN
BOS/NY       0   233  1308   802  2815  2934  2786  1771
DC         233     0  1075   671  2684  2799  2631  1616
MIA       1308  1075     0  1329  3273  3053  2687  2037
CHI        802   671  1329     0  2013  2142  2054   996
SEA       2815  2684  3273  2013     0   808  1131  1307
SF        2934  2799  3053  2142   808     0   379  1235
LA        2786  2631  2687  2054  1131   379     0  1059
DEN       1771  1616  2037   996  1307  1235  1059     0

The nearest pair of objects is BOS/NY and DC, at distance 233. These are merged into a single cluster called "BOS/NY/DC". Then we compute the distance from this new cluster to all other clusters, to get a new distance matrix:

After merging DC with BOS/NY:

           BOS/NY/DC   MIA   CHI   SEA    SF    LA   DEN
BOS/NY/DC          0  1075   671  2684  2799  2631  1616
MIA             1075     0  1329  3273  3053  2687  2037
CHI              671  1329     0  2013  2142  2054   996
SEA             2684  3273  2013     0   808  1131  1307
SF              2799  3053  2142   808     0   379  1235
LA              2631  2687  2054  1131   379     0  1059
DEN             1616  2037   996  1307  1235  1059     0

Now, the nearest pair of objects is SF and LA, at distance 379. These are merged into a single cluster called "SF/LA". Then we compute the distance from this new cluster to all other objects, to get a new distance matrix:

After merging SF with LA:

           BOS/NY/DC   MIA   CHI   SEA  SF/LA   DEN
BOS/NY/DC          0  1075   671  2684   2631  1616
MIA             1075     0  1329  3273   2687  2037
CHI              671  1329     0  2013   2054   996
SEA             2684  3273  2013     0    808  1307
SF/LA           2631  2687  2054   808      0  1059
DEN             1616  2037   996  1307   1059     0

Now, the nearest pair of objects is CHI and BOS/NY/DC, at distance 671. These are merged into a single cluster called "BOS/NY/DC/CHI". Then we compute the distance from this new cluster to all other clusters, to get a new distance matrix:

After merging CHI with BOS/NY/DC:

               BOS/NY/DC/CHI   MIA   SEA  SF/LA   DEN
BOS/NY/DC/CHI              0  1075  2013   2054   996
MIA                     1075     0  3273   2687  2037
SEA                     2013  3273     0    808  1307
SF/LA                   2054  2687   808      0  1059
DEN                      996  2037  1307   1059     0

Now, the nearest pair of objects is SEA and SF/LA, at distance 808. These are merged into a single cluster called "SF/LA/SEA". Then we compute the distance from this new cluster to all other clusters, to get a new distance matrix:

After merging SEA with SF/LA:

               BOS/NY/DC/CHI   MIA  SF/LA/SEA   DEN
BOS/NY/DC/CHI              0  1075       2013   996
MIA                     1075     0       2687  2037
SF/LA/SEA               2013  2687          0  1059
DEN                      996  2037       1059     0

Now, the nearest pair of objects is DEN and BOS/NY/DC/CHI, at distance 996. These are merged into a single cluster called "BOS/NY/DC/CHI/DEN". Then we compute the distance from this new cluster to all other clusters, to get a new distance matrix:

After merging DEN with BOS/NY/DC/CHI:

                   BOS/NY/DC/CHI/DEN   MIA  SF/LA/SEA
BOS/NY/DC/CHI/DEN                  0  1075       1059
MIA                             1075     0       2687
SF/LA/SEA                       1059  2687          0

Now, the nearest pair of objects is BOS/NY/DC/CHI/DEN and SF/LA/SEA, at distance 1059. These are merged into a single cluster called "BOS/NY/DC/CHI/DEN/SF/LA/SEA". Then we compute the distance from this new compound object to all other objects, to get a new distance matrix:

After merging SF/LA/SEA with BOS/NY/DC/CHI/DEN:

                             BOS/NY/DC/CHI/DEN/SF/LA/SEA   MIA
BOS/NY/DC/CHI/DEN/SF/LA/SEA                            0  1075
MIA                                                 1075     0

Finally, we merge the last two clusters at level 1075. This process is summarized by the clustering diagram printed by many software packages:

In the diagram, the columns are associated with the items and the rows are associated with levels (stages) of clustering. An 'X' is placed between two columns in a given row if the corresponding items are merged at that stage in the clustering.
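The merge levels in the trace can be checked mechanically. The sketch below is one possible way to do it in plain Python, with names (`CITIES`, `MILES`, `single_link_trace`) chosen for this example. It uses the input distance matrix from the start of the example and the single-link rule, and it records each merge with its level. Cluster labels are joined in the order the code happens to merge them, so they may list the same cities in a different order than the text.

```python
CITIES = ["BOS", "NY", "DC", "MIA", "CHI", "SEA", "SF", "LA", "DEN"]
MILES = [
    [0, 206, 429, 1504, 963, 2976, 3095, 2979, 1949],
    [206, 0, 233, 1308, 802, 2815, 2934, 2786, 1771],
    [429, 233, 0, 1075, 671, 2684, 2799, 2631, 1616],
    [1504, 1308, 1075, 0, 1329, 3273, 3053, 2687, 2037],
    [963, 802, 671, 1329, 0, 2013, 2142, 2054, 996],
    [2976, 2815, 2684, 3273, 2013, 0, 808, 1131, 1307],
    [3095, 2934, 2799, 3053, 2142, 808, 0, 379, 1235],
    [2979, 2786, 2631, 2687, 2054, 1131, 379, 0, 1059],
    [1949, 1771, 1616, 2037, 996, 1307, 1235, 1059, 0],
]


def single_link_trace(labels, matrix):
    """Return the single-link merges as (cluster_a, cluster_b, level)."""
    index = {label: i for i, label in enumerate(labels)}
    clusters = [[label] for label in labels]
    trace = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single-link: shortest member-to-member distance.
                d = min(matrix[index[x]][index[y]]
                        for x in clusters[i] for y in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        trace.append(("/".join(clusters[i]), "/".join(clusters[j]), d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return trace


for a, b, level in single_link_trace(CITIES, MILES):
    print(f"{level:5d}  {a} + {b}")
```

The levels come out as 206, 233, 379, 671, 808, 996, 1059 and 1075, one per stage of the clustering.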

0 0