
Foundations and Trends® in Machine Learning
Vol. 2, No. 3 (2009) 235–274
© 2010 U. von Luxburg
DOI: 10.1561/2200000008

Clustering Stability: An Overview

By Ulrike von Luxburg

Contents

1 Introduction
2 Clustering Stability: Definition and Implementation
3 Stability Analysis of the K-Means Algorithm
  3.1 The Idealized K-Means Algorithm
  3.2 The Actual K-Means Algorithm
  3.3 Relationships between the Results
4 Beyond K-Means
5 Outlook
References



Clustering Stability: An Overview

Ulrike von Luxburg

Max Planck Institute for Biological Cybernetics, Tübingen, Germany,

ulrike.luxburg@tuebingen.mpg.de

Abstract

A popular method for selecting the number of clusters is based on

stability arguments: one chooses the number of clusters such that the

corresponding clustering results are “most stable”. In recent years, a

series of papers has analyzed the behavior of this method from a theo-

retical point of view. However, the results are very technical and difficult to interpret for non-experts. In this monograph we give a high-level overview of the existing literature on clustering stability. In addition to presenting the results in a slightly informal but accessible way, we relate them to each other and discuss their different implications.


1 Introduction

Model selection is a difficult problem in non-parametric clustering. The

obvious reason is that, as opposed to supervised classification, there is

no ground truth against which we could “test” our clustering results.

One of the most pressing questions in practice is how to determine the

number of clusters. Various ad hoc methods have been suggested in

the literature, but none of them is entirely convincing. These methods

usually suffer from the fact that they implicitly have to define “what a
clustering is” before they can assign different scores to different num-

bers of clusters. In recent years a new method has become increasingly

popular: selecting the number of clusters based on clustering stability.

Instead of defining “what is a clustering”, the basic philosophy is simply

that a clustering should be a structure on the data set that is “stable”.

That is, if applied to several data sets from the same underlying model

or of the same data-generating process, a clustering algorithm should

obtain similar results. In this philosophy it is not so important how

the clusters look (this is taken care of by the clustering algorithm), but

that they can be constructed in a stable manner.

   The basic intuition of why people believe that this is a good principle

can be described by Figure 1.1. Shown is a data distribution with four

underlying clusters (depicted by the black circles), and different samples
from this distribution (depicted by red diamonds). If we cluster this

data set into K = 2 clusters, there are two reasonable solutions: a horizontal and a vertical split. If a clustering algorithm is applied repeatedly to different samples from this distribution, it might sometimes construct the horizontal and sometimes the vertical solution. Obviously, these two solutions are very different from each other, hence the clustering results are instable. Similar effects take place if we start with K = 5. In this case, we necessarily have to split an existing cluster into two clusters, and depending on the sample this could happen to any of the four clusters. Again the clustering solution is instable. Finally, if we apply the algorithm with the correct number K = 4, we observe stable results (not shown in the figure): the clustering algorithm always discovers the correct clusters (maybe up to a few outlier points). In this example, the stability principle detects the correct number of clusters.

Fig. 1.1 Idea of clustering stability. Instable clustering solutions if the number of clusters is too small (first row, k = 2) or too large (second row, k = 5). See text for details.

    At first glance, using stability-based principles for model selection

appears to be very attractive. It is elegant as it avoids having to define what a

good clustering is. It is a meta-principle that can be applied to any basic

clustering algorithm and does not require a particular clustering model.

Finally, it sounds “very fundamental” from a philosophy of inference

point of view.



    However, the longer one thinks about this principle, the less obvious

it becomes that model selection based on clustering stability “always

works”. What is clear is that solutions that are completely instable

should not be considered at all. However, if there are several stable

solutions, is it always the best choice to select the one corresponding

to the most stable results? One could conjecture that the most sta-

ble parameter always corresponds to the simplest solution, but clearly

there exist situations where the simplest solution is not what we

are looking for. To find out how model selection based on clustering

stability works we need theoretical results.

    In this monograph we discuss a series of theoretical results on clus-

tering stability that have been obtained in recent years. In Section 2

we review different protocols for how clustering stability is computed

and used for model selection. In Section 3 we concentrate on theoretical

results for the K-means algorithm and discuss their various relations.

This is the main section of the paper. Results for more general cluster-

ing algorithms are presented in Section 4.


2 Clustering Stability: Definition and Implementation

A clustering C_K of a data set S = {X_1, ..., X_n} is a function that assigns labels to all points of S, that is, C_K : S → {1, ..., K}. Here K denotes the number of clusters. A clustering algorithm is a procedure that takes a set S of points as input and outputs a clustering of S. The clustering algorithms considered in this monograph take an additional parameter as input, namely the number K of clusters they are supposed to construct. We analyze clustering stability in a statistical setup. The data set S is assumed to consist of n data points X_1, ..., X_n that have been drawn independently from some unknown underlying distribution P on some space X. The final goal is to use these sample points to construct a good partition of the underlying space X. For some theoretical results it will be easier to ignore sampling effects and directly work on the underlying space X endowed with the probability distribution P. This can be considered as the case of having “infinitely many” data points. We sometimes call this the limit case for n → ∞.

Assume we agree on a way to compute distances d(C, C') between different clusterings C and C' (see below for details). Then, for a fixed probability distribution P, a fixed number K of clusters and a fixed sample size n, the instability of a clustering algorithm is defined as the expected distance between two clusterings C_K(S_n), C_K(S_n') on different data sets S_n, S_n' of size n, that is:

\mathrm{Instab}(K, n) := \mathbb{E}\,\big[ d\big( C_K(S_n), C_K(S_n') \big) \big].    (2.1)

The expectation is taken with respect to the drawing of the two sam-

ples.

    In practice, a large variety of methods has been devised to compute

stability scores and use them for model selection. On a very general

level they work as follows:

Given: a set S of data points, and a clustering algorithm A that takes the number k of clusters as input.

(1) For k = 2, ..., k_max:

    (a) Generate perturbed versions S_b (b = 1, ..., b_max) of the original data set (for example by subsampling or adding noise, see below).

    (b) For b = 1, ..., b_max: cluster the data set S_b with algorithm A into k clusters to obtain the clustering C_b.

    (c) For b, b' = 1, ..., b_max: compute pairwise distances d(C_b, C_b') between these clusterings (using one of the distance functions described below).

    (d) Compute the instability as the mean distance between the clusterings C_b:

        \mathrm{Instab}(k, n) = \frac{1}{b_{\max}^2} \sum_{b, b' = 1}^{b_{\max}} d(C_b, C_{b'}).

(2) Choose the parameter k that gives the best stability, in the simplest case as follows:

        K := \operatorname*{argmin}_{k} \mathrm{Instab}(k, n)

    (see below for more options).
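To make the protocol concrete, here is a minimal Python sketch of it. It is an illustration rather than a reference implementation: the function names (instability, choose_k, label_distance), the use of subsampling as the perturbation, scikit-learn's KMeans as the base algorithm A, and the comparison of clusterings on the overlap of two subsamples are all choices made for this sketch, not prescriptions of the protocol above.

```python
import numpy as np
from itertools import permutations
from sklearn.cluster import KMeans

def label_distance(labels_a, labels_b, k):
    """Fraction of points on which two labelings of the same points disagree,
    minimized over all permutations of the k labels (feasible for small k)."""
    best = 1.0
    for perm in permutations(range(k)):
        mapped = np.array([perm[l] for l in labels_a])
        best = min(best, float(np.mean(mapped != labels_b)))
    return best

def instability(X, k, b_max=20, subsample_frac=0.8, rng=None):
    """Step (1) of the protocol: cluster b_max random subsamples with k-means and
    average the pairwise distances between the resulting clusterings, compared
    on the points that the two subsamples have in common."""
    rng = np.random.default_rng(rng)
    n = len(X)
    m = int(subsample_frac * n)
    idx = [rng.choice(n, size=m, replace=False) for _ in range(b_max)]
    labels = [KMeans(n_clusters=k, n_init=10).fit_predict(X[ind]) for ind in idx]
    dists = []
    for a in range(b_max):
        for b in range(a + 1, b_max):
            common, pos_a, pos_b = np.intersect1d(idx[a], idx[b], return_indices=True)
            if len(common) > 0:
                dists.append(label_distance(labels[a][pos_a], labels[b][pos_b], k))
    return float(np.mean(dists))

def choose_k(X, k_max=10):
    """Step (2): pick the k with the smallest (unnormalized) instability."""
    scores = {k: instability(X, k) for k in range(2, k_max + 1)}
    return min(scores, key=scores.get), scores
```

As discussed below, the raw scores returned by choose_k should be normalized before the minimum is taken, since instability trivially scales with k.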

    This scheme gives a very rough overview of how clustering stability

can be used for model selection. In practice, many details have to be

taken into account, and they will be discussed in the next section.

Finally, we want to mention an approach that is vaguely related to

clustering stability, namely the ensemble method [26]. Here, an ensem-

ble of algorithms is applied to one fixed data set. Then a final clustering



is built from the results of the individual algorithms. We are not going

to discuss this approach in our monograph.

Generating perturbed versions of the data set. To be able to evaluate the stability of a fixed clustering algorithm we need to run the clustering algorithm several times on slightly different data sets.

To this end we need to generate perturbed versions of the original data

set. In practice, the following schemes have been used:

• Draw a random subsample of the original data set without replacement [5, 12, 15, 17].

• Add random noise to the original data points [8, 19].

• If the original data set is high-dimensional, use different random projections into low-dimensional spaces, and then cluster the low-dimensional data sets [25].

• If we work in a model-based framework, sample data from the model [14].

• Draw a random sample of the original data with replacement. This approach has not been reported in the literature yet, but it avoids the problem of setting the size of the subsample. For good reasons, this kind of sampling is the standard in the bootstrap literature [11] and might also have advantages in the stability setting. This scheme requires that the algorithm can deal with weighted data points (because some data points will occur several times in the sample).
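For illustration, here is a minimal sketch of three of the perturbation schemes just listed (subsampling without replacement, additive noise, and a bootstrap sample expressed as integer weights); the function names and default parameters are ours, not part of the cited methods.

```python
import numpy as np

def subsample(X, frac=0.8, rng=None):
    """Random subsample of the data without replacement, as in [5, 12, 15, 17]."""
    rng = np.random.default_rng(rng)
    keep = rng.choice(len(X), size=int(frac * len(X)), replace=False)
    return X[keep]

def jitter(X, sigma=0.05, rng=None):
    """Add isotropic Gaussian noise to the original data points, as in [8, 19]."""
    rng = np.random.default_rng(rng)
    return X + sigma * rng.standard_normal(X.shape)

def bootstrap_weights(X, rng=None):
    """Sample with replacement, returned as integer weights (point i occurs
    weights[i] times); only usable with algorithms that accept weighted points."""
    rng = np.random.default_rng(rng)
    n = len(X)
    weights = np.bincount(rng.integers(0, n, size=n), minlength=n)
    return X, weights
```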

In all cases, there is a trade-off that has to be treated carefully. If we

change the data set too much (for example, the subsample is too small,

or the noise too large), then we might destroy the structure we want

to discover by clustering. If we change the data set too little, then the

clustering algorithm will always obtain the same results, and we will

observe trivial stability. It is hard to quantify this trade-off in practice.

Which clusterings to compare? Different protocols are used to compare the clusterings on the different data sets S_b.

• Compare the clustering of the original data set with the clusterings obtained on subsamples [17].


• Compare clusterings of overlapping subsamples on the data points where both clusterings are defined [5].

• Compare clusterings of disjoint subsamples [12, 15]. Here we first need to apply an extension operator to extend each clustering to the domain of the other clustering.

Distances between clusterings. If two clusterings are defined on the

same data points, then it is straightforward to compute a distance score

between these clusterings based on any of the well-known clustering

distances such as the Rand index, Jaccard index, Hamming distance,

minimal matching distance, and Variation of Information distance [18].

All these distances count, in some way or the other, points or pairs of

points on which the two clusterings agree or disagree. The most conve-

nient choice from a theoretical point of view is the minimal matching

distance. For two clusterings C, C' of the same data set of n points it is defined as:

d_{\mathrm{MM}}(C, C') := \min_{\pi} \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{ C(X_i) \neq \pi(C'(X_i)) \},    (2.2)

where the minimum is taken over all permutations π of the K labels.

Intuitively, the minimal matching distance measures the same quantity

as the 0–1-loss used in supervised classification. For a stability study

involving the adjusted Rand index or an adjusted mutual information

index see Vinh and Epps [27].
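As a practical aside, Equation (2.2) does not require enumerating all K! permutations: maximizing the number of label agreements over permutations is an assignment problem, which the sketch below (our naming) solves with SciPy's Hungarian-algorithm routine.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def minimal_matching_distance(labels_a, labels_b, k):
    """Minimal matching distance of Equation (2.2): the fraction of points on
    which the two clusterings disagree, minimized over permutations of the
    K labels, computed via an optimal assignment on the contingency table."""
    n = len(labels_a)
    contingency = np.zeros((k, k), dtype=int)
    for a, b in zip(labels_a, labels_b):
        contingency[a, b] += 1          # points with label a in C and label b in C'
    rows, cols = linear_sum_assignment(-contingency)  # maximize total agreements
    return 1.0 - contingency[rows, cols].sum() / n
```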

If two clusterings are defined on different data sets one has two choices. If the two data sets have a big overlap one can use a restriction operator to restrict the clusterings to the points that are contained in both data sets. On this restricted set one can then compute a standard distance between the two clusterings. The other possibility is to use an extension operator to extend both clusterings from their domain to the domain of the other clustering. Then one can compute a standard distance between the two clusterings as they are now both defined on the joint domain. For center-based clusterings, as constructed by the K-means algorithm, a natural extension operator exists. Namely,

to a new data point we simply assign the label of the closest cluster

center. A more general scheme to extend an existing clustering to new



data points is to train a classifier on the old data points and use its

predictions as labels on the new data points. However, in the context

of clustering stability it is not obvious what kind of bias we introduce

with this approach.
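The center-based extension operator described above can be written in a few lines; the sketch below (our naming) extends two center-based clusterings to the union of their samples and compares them with the minimal matching distance from the previous sketch.

```python
import numpy as np

def extend_by_centers(centers, X_new):
    """Extension operator for center-based clusterings: each new point gets the
    label of its closest cluster center."""
    d2 = ((X_new[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def distance_on_joint_domain(centers_a, centers_b, X_a, X_b, k):
    """Extend both clusterings to the joint domain of the two samples and
    compute the minimal matching distance there."""
    X_union = np.vstack([X_a, X_b])
    labels_a = extend_by_centers(centers_a, X_union)
    labels_b = extend_by_centers(centers_b, X_union)
    return minimal_matching_distance(labels_a, labels_b, k)  # sketch above
```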

Stability scores and their normalization. The stability protocol outlined above results in a set of distance values (d(C_b, C_{b'}))_{b, b' = 1, ..., b_max}. In most approaches, one summarizes these values by taking their mean:

\mathrm{Instab}(k, n) = \frac{1}{b_{\max}^2} \sum_{b, b' = 1}^{b_{\max}} d(C_b, C_{b'}).

Note that the mean is the simplest summary statistic one can compute

based on the distance values d(C_b, C_{b'}). A different approach is to use the area under the cumulative distribution function of the distance values as the stability score, see Ben-Hur et al. [5] or Bertoni and Valentini [6]

for details. In principle one could also come up with more elaborate

statistics based on distance values. To the best of our knowledge, such

concepts have not been used so far.

The simplest way to select the number K of clusters is to minimize the instability:

K = \operatorname*{argmin}_{k=2,\dots,k_{\max}} \mathrm{Instab}(k, n).

This approach has been suggested in Levine and Domany [17]. However, an important fact to note is that Instab(k, n) trivially scales with k, regardless of what the underlying data structure is. For example, in the top left plot in Figure 2.1 we can see that even for a completely unclustered data set, Instab(k, n) increases with k. When using stability for model selection, one should correct for the trivial scaling of Instab, otherwise it might be meaningless to take the minimum afterwards.

Fig. 2.1 Normalized stability scores. Left plots: data points from a uniform density on [0, 1]^2. Right plots: data points from a mixture of four well-separated Gaussians in R^2. The first row always shows the unnormalized instability Instab for K = 2, ..., 15. The second row shows the instability Instab_null obtained on a reference distribution (uniform distribution). The third row shows the normalized stability Instab_norm.

There exist several different normalization protocols:

• Normalization using a reference null distribution [6, 12]. One repeatedly samples data sets from some reference null distribution. Such a distribution is defined on the same domain as the data points, but does not possess any cluster structure. In simple cases one can use the uniform distribution on the data domain as null distribution. A more practical approach

  is to scramble the individual dimensions of the existing data

  points and use the “scrambled points” as null distribution

  (see [6, 12] for details). Once we have drawn several data

  sets from the null distribution, we cluster them using our

  clustering algorithm and compute the corresponding stabil-

ity score Instab_null as above. The normalized stability is then defined as Instab_norm := Instab / Instab_null.

• Normalization by random labels [15]. First, we cluster each of the data sets S_b as in the protocol above to obtain the clusterings C_b. Then, we randomly permute these labels. That is, we assign to data point X_i the label that belonged to X_{π(i)}, where π is a permutation of {1, ..., n}. This leads to a permuted clustering C_{b,perm}. We then compute the stability score Instab as above, and similarly we compute Instab_perm for the permuted clusterings. The normalized stability is then defined as Instab_norm := Instab / Instab_perm.

Once we have computed the normalized stability scores Instab_norm, we can choose the number of clusters that has the smallest normalized instability, that is:

K = \operatorname*{argmin}_{k=2,\dots,k_{\max}} \mathrm{Instab}_{\mathrm{norm}}(k, n).

This approach has been taken for example in Ben-Hur et al. [5] and

Lange et al. [15].
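To illustrate the null-distribution normalization concretely, the following sketch (our naming, reusing the instability helper from the earlier protocol sketch) builds null data sets by scrambling the individual dimensions of the data, as described above, and selects the k with the smallest normalized instability. The details of the procedures in [6, 12, 15] differ; this only shows the overall structure.

```python
import numpy as np

def scramble_dimensions(X, rng=None):
    """Reference null data: permute each coordinate independently, which keeps
    the marginal distributions but destroys any cluster structure."""
    rng = np.random.default_rng(rng)
    return np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])

def normalized_instability(X, k, n_null=5, rng=None):
    """Instab_norm = Instab / Instab_null, with Instab_null averaged over
    several scrambled null data sets."""
    rng = np.random.default_rng(rng)
    instab = instability(X, k, rng=rng)                 # from the protocol sketch
    instab_null = np.mean([instability(scramble_dimensions(X, rng), k, rng=rng)
                           for _ in range(n_null)])
    return instab / instab_null

def choose_k_normalized(X, k_max=10):
    """Select the k with the smallest normalized instability."""
    scores = {k: normalized_instability(X, k) for k in range(2, k_max + 1)}
    return min(scores, key=scores.get), scores
```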

Selecting K based on statistical tests. A second approach to select the final number of clusters is to use a statistical test. Similarly to the normalization considered above, the idea is to compute stability scores not only on the actual data set, but also on “null data sets” drawn from some reference null distribution. Then one tests whether, for a given parameter k, the stability on the actual data is significantly larger than the one computed on the null data. If there are several values k for which this is the case, then one selects the one that is most significant. The most well-known implementation of such a procedure uses bootstrap methods [12]. Other authors use a χ²-test [6] or a test based on the Bernstein inequality [7].
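The sketch below illustrates the general shape of such a test with a simple empirical p-value against scrambled null data sets; it is not the bootstrap test of [12], the χ²-test of [6], or the Bernstein-inequality test of [7], and it reuses the instability and scramble_dimensions helpers from the earlier sketches.

```python
import numpy as np

def stability_p_value(X, k, n_null=20, rng=None):
    """Empirical p-value: how often a clustering of null data is at least as
    stable (i.e. has at most the instability) as the clustering of the real data."""
    rng = np.random.default_rng(rng)
    observed = instability(X, k, rng=rng)
    null_scores = np.array([instability(scramble_dimensions(X, rng), k, rng=rng)
                            for _ in range(n_null)])
    return (1 + np.sum(null_scores <= observed)) / (1 + n_null)

def choose_k_by_test(X, k_max=10, alpha=0.05):
    """Among the values of k that pass the test, select the most significant one."""
    pvals = {k: stability_p_value(X, k) for k in range(2, k_max + 1)}
    significant = {k: p for k, p in pvals.items() if p <= alpha}
    return (min(significant, key=significant.get) if significant else None), pvals
```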

To summarize, there are many different implementations for selecting the number K of clusters based on stability scores. Until now, there does not exist any convincing empirical study that thoroughly compares all these approaches on a variety of data sets. In my opinion, even fundamental issues such as the normalization have not been investigated in enough detail. For example, in my experience normalization often has no effect whatsoever (but I did not conduct a thorough study either). To put stability-based model selection on a firm ground it would be crucial to compare the different approaches with each other in an extensive case study.


3 Stability Analysis of the K-Means Algorithm

The vast majority of papers about clustering stability use the K-means algorithm as the basic clustering algorithm. In this section we discuss the stability results for the K-means algorithm in depth. Later, in Section 4, we will see how these results can be extended to other clustering algorithms.

For simpler reference we briefly recapitulate the K-means algorithm (details can be found in many textbooks, for example [13]). Given a set of n data points X_1, ..., X_n ∈ R^d and a fixed number K of clusters to construct, the K-means algorithm attempts to minimize the clustering objective function:

Q_K^{(n)}(c_1, \dots, c_K) = \frac{1}{n} \sum_{i=1}^{n} \min_{k=1,\dots,K} \lVert X_i - c_k \rVert^2,    (3.1)

where c_1, ..., c_K denote the centers of the K clusters. In the limit n → ∞, the K-means clustering is the one that minimizes the limit objective function:

Q_K^{(\infty)}(c_1, \dots, c_K) = \int \min_{k=1,\dots,K} \lVert X - c_k \rVert^2 \, dP(X),    (3.2)

where P is the underlying probability distribution.


Given an initial set c^{(0)} = {c_1^{(0)}, ..., c_K^{(0)}} of centers, the K-means algorithm iterates the following two steps until convergence:

(1) Assign data points to the closest cluster centers: for i = 1, ..., n,

    C^{(t)}(X_i) := \operatorname*{argmin}_{k=1,\dots,K} \lVert X_i - c_k^{(t)} \rVert.

(2) Re-adjust the cluster means: for k = 1, ..., K,

    c_k^{(t+1)} := \frac{1}{N_k} \sum_{\{ i \,|\, C^{(t)}(X_i) = k \}} X_i,

where N_k denotes the number of points in cluster k.
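For completeness, here is a compact NumPy sketch of these two steps (our naming; in practice one would rather use an existing implementation such as scikit-learn's KMeans, which adds careful initialization and multiple restarts).

```python
import numpy as np

def kmeans(X, k, n_iter=100, rng=None):
    """Plain K-means (Lloyd's algorithm): alternate the assignment step (1) and
    the re-adjustment step (2) above.  The result is a local optimum of the
    objective (3.1) and need not be the global optimum."""
    rng = np.random.default_rng(rng)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # random initialization
    for _ in range(n_iter):
        # step (1): assign every point to its closest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # step (2): move every center to the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```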

It is well known that, in general, the K-means algorithm terminates in a local optimum of Q_K^{(n)} and does not necessarily find the global optimum. We study the K-means algorithm in two different scenarios:

The idealized scenario: Here we assume an idealized algorithm that always finds the global optimum of the K-means objective function Q_K^{(n)}. For simplicity, we call this algorithm the idealized K-means algorithm.

The realistic scenario: Here we analyze the actual K-means algorithm as described above. In particular, we take into account its property of getting stuck in local optima. We also take into account the initialization of the algorithm.

In both scenarios, our theoretical investigations are based on the following simple protocol to compute the stability of the K-means algorithm:

(1) We assume that we have access to as many independent samples of size n of the underlying distribution as we want. That is, we ignore artifacts introduced by the fact that in practice we draw subsamples of one fixed, given sample and thus might introduce a bias.

(2) As distance between two K-means clusterings of two samples S, S' we use the minimal matching distance between the extended clusterings on the domain S ∪ S'.

(3) We work with the expected minimal matching distance as in Equation (2.1), that is, we analyze the expected instability Instab rather than the empirical estimate of it that is used in practice. This does not do much harm as instability scores are highly concentrated around their means anyway.
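Under this protocol, Instab(K, n) can be approximated by Monte Carlo simulation whenever we can draw fresh samples from the underlying distribution P. The sketch below (our naming) does this with the kmeans, extend_by_centers, and minimal_matching_distance helpers from the earlier sketches; note that it uses the actual K-means iteration with random initialization rather than the idealized algorithm, so it corresponds to the realistic scenario.

```python
import numpy as np

def estimate_instab(sample_from_P, k, n, n_pairs=50, rng=None):
    """Monte Carlo estimate of Instab(K, n): average minimal matching distance
    between K-means clusterings of independent samples of size n, compared on
    the union of the two samples via the center-based extension operator."""
    rng = np.random.default_rng(rng)
    dists = []
    for _ in range(n_pairs):
        X1, X2 = sample_from_P(n, rng), sample_from_P(n, rng)
        _, centers1 = kmeans(X1, k, rng=rng)
        _, centers2 = kmeans(X2, k, rng=rng)
        X_union = np.vstack([X1, X2])
        dists.append(minimal_matching_distance(extend_by_centers(centers1, X_union),
                                               extend_by_centers(centers2, X_union), k))
    return float(np.mean(dists))

def four_gaussians(n, rng):
    """Illustrative sampler: a mixture of four well-separated Gaussians in R^2,
    the kind of distribution shown in Figures 1.1 and 2.1 (parameters are ours)."""
    means = np.array([[0.0, 0.0], [0.0, 5.0], [5.0, 0.0], [5.0, 5.0]])
    comp = rng.integers(0, 4, size=n)
    return means[comp] + 0.3 * rng.standard_normal((n, 2))
```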

3.1 The Idealized K-Means Algorithm

In this section we focus on the idealized K-means algorithm, that is, the algorithm that always finds the global optimum c^{(n)} of the K-means objective function:

c^{(n)} := (c_1^{(n)}, \dots, c_K^{(n)}) := \operatorname*{argmin}_{c} Q_K^{(n)}(c).

3.1.1 First Convergence Result and the Role of Symmetry

The starting point for the results in this section is the following obser-

vation [4]. Consider the situation in Figure 3.1a. Here the data contains

three clusters, but two of them are closer to each other than to the third

cluster. Assume we run the idealized K-means algorithm with K = 2 on such a data set. Separating the left two clusters from the right cluster (solid line) leads to a much better value of Q_K^{(n)} than, say, separating the top two clusters from the bottom one (dashed line). Hence, as soon as we have a reasonable amount of data, idealized (!) K-means with K = 2 always constructs the first solution (solid line). Consequently, it is stable in spite of the fact that K = 2 is the wrong number of clusters. Note that this would not happen if the data set were symmetric, as depicted in Figure 3.1b. Here neither the solution depicted by the dashed line nor the one with the solid line is clearly superior, which leads to instability if the idealized K-means algorithm is applied to different samples. Similar examples can be constructed to detect that K is too large, see Figure 3.1c and d. With K = 3 it is clearly the best

solution to split the big cluster in Figure 3.1c, thus clustering this data

set is stable. In Figure 3.1d, however, due to symmetry reasons neither

splitting the top nor the bottom cluster leads to a clear advantage.

Again this leads to instability.


Fig. 3.1 If data sets are not symmetric, idealized K-means is stable even if the number K of clusters is too small (a) or too large (c). Instability of the wrong number of clusters only occurs in symmetric data sets (b and d).

These informal observations suggest that unless the data set contains perfect symmetries, the idealized K-means algorithm is stable even if K is wrong. This can be formalized with the following theorem.

Theorem 3.1 (Stability and global optima of the objective function). Let P be a probability distribution on R^d and Q_K^{(∞)} the limit K-means objective function as defined in Equation (3.2), for some fixed value K > 1.

(1) If Q_K^{(∞)} has a unique global minimum, then the idealized K-means algorithm is perfectly stable as n → ∞, that is:

    \lim_{n \to \infty} \mathrm{Instab}(K, n) = 0.

(2) If Q_K^{(∞)} has several global minima (for example, because the probability distribution is symmetric), then the idealized K-means algorithm is instable, that is:

    \lim_{n \to \infty} \mathrm{Instab}(K, n) > 0.

This theorem has been proved (in a slightly more general setting) in

references [2, 4].

Proof sketch, Part 1. It is well known that if the objective function Q_K^{(∞)} has a unique global minimum, then the centers c^{(n)} constructed by the idealized K-means algorithm on a sample of n points almost


surely converge to the true population centers c^{(∞)} as n → ∞ [20]. This means that given some ε > 0 we can find some large n such that c^{(n)} is ε-close to c^{(∞)} with high probability. As a consequence, if we compare two clusterings on different samples of size n, the centers of the two clusterings are at most 2ε-close to each other. Finally, one can show that if the cluster centers of two clusterings are ε-close, then their minimal matching distance is small as well. Thus, the expected distance between the clusterings constructed on two samples of size n becomes arbitrarily small and eventually converges to 0 as n → ∞.

Part 2. For simplicity, consider the symmetric situation in Figure 3.1b. Here the probability distribution has three axes of symmetry. For K = 2 the objective function Q_2^{(∞)} has three global minima c^{(1)}, c^{(2)}, c^{(3)} corresponding to the three symmetric solutions. In such a situation, the idealized K-means algorithm on a sample of n points gets arbitrarily close to one of the global optima, that is, min_{i=1,2,3} d(c^{(n)}, c^{(i)}) → 0 [16]. In particular, the sequence (c^{(n)})_n of empirical centers has three convergent subsequences, each of which converges to one of the global solutions. One can easily conclude that if we compare two clusterings on random samples, with probability 1/3 they belong to “the same subsequence” and thus their distance will become arbitrarily small. With probability 2/3 they “belong to different subsequences”, and thus their distance remains larger than a constant a > 0. From the latter we can conclude that Instab(K, n) is always larger than 2a/3.

The interpretation of this theorem is distressing. The stability or instability of parameter K does not depend on whether K is “correct” or “wrong”, but only on whether the K-means objective function for this particular value K has one or several global minima. However, the number of global minima is usually not related to the number of clusters, but rather to the fact that the underlying probability distribution has symmetries. In particular, if we consider “natural” data distributions, such distributions are rarely perfectly symmetric. Consequently, the corresponding functions Q_K^{(∞)} usually only have one global minimum, for any value of K. In practice this means that for a large sample size n, the idealized K-means algorithm is stable for any value of K.

This seems to suggest that model selection based on clustering stability



does not work. However, we will see later in Section 3.3 that this result

is essentially an artifact of the idealized clustering setting and does not

carry over to the realistic setting.

3.1.2 Refined Convergence Results for the Case of a Unique Global Minimum

Above we have seen that if, for a particular distribution P and a particular value K, the objective function Q_K^{(∞)} has a unique global minimum, then the idealized K-means algorithm is stable in the sense that lim_{n→∞} Instab(K, n) = 0. At first glance, this seems to suggest that stability cannot distinguish between different values k_1 and k_2 (at least for large n). However, this point of view is too simplistic. It can happen that even though both Instab(k_1, n) and Instab(k_2, n) converge to 0 as n → ∞, this happens “faster” for k_1 than for k_2. If measured relative to the absolute values of Instab(k_1, n) and Instab(k_2, n), the difference between Instab(k_1, n) and Instab(k_2, n) can still be large enough to be “significant”.

The key to verifying this intuition is to study the limit process more closely. This line of work has been established by Shamir and Tishby in a series of papers [22, 23, 24]. The main idea is that instead of studying the convergence of Instab(k, n) one needs to consider the rescaled instability √n · Instab(k, n). One can prove that the rescaled instability converges in distribution, and the limit distribution depends on k. In particular, the means of the limit distributions are different for different values of k. This can be formalized as follows.

Theorem 3.2 (Convergence of rescaled stability). Assume that the probability distribution P has a density p. Consider a fixed parameter K, and assume that the corresponding limit objective function Q_K^{(∞)} has a unique global minimum c^{(∞)} = (c_1^{(∞)}, ..., c_K^{(∞)}). The boundary between clusters i and j is denoted by B_ij. Let m ∈ N, and let S_{n,1}, ..., S_{n,2m} be samples of size n drawn independently from P. Let C_K(S_{n,i}) be the result of the idealized K-means clustering on sample S_{n,i}. Compute the instability as the mean distance between clusterings of disjoint pairs of samples, that is:

\mathrm{Instab}(K, n) := \frac{1}{m} \sum_{i=1}^{m} d_{\mathrm{MM}}\big( C_K(S_{n,2i-1}), C_K(S_{n,2i}) \big).    (3.3)

Then, as n → ∞ and m → ∞, the rescaled instability √n · Instab(K, n) converges in probability to

\mathrm{RInstab}(K) := \sum_{1 \le i < j \le K} \int_{B_{ij}} \frac{V_{ij}}{\lVert c_i^{(\infty)} - c_j^{(\infty)} \rVert} \, p(x) \, dx,    (3.4)

where V_ij stands for a term describing the asymptotics of the random fluctuations of the cluster boundary between cluster i and cluster j (the exact formula is given in [23, 24]).

    Note that even though the definition of instability in Equation (3.3)

differs slightly from the definition in Equation (2.1), intuitively it mea-

sures the same quantity. The definition in Equation (3.3) just has the

technical advantage that all pairs of samples are independent from one

another.
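To see the rescaling at work numerically, one can estimate √n · Instab(K, n) for increasing n, for example with the estimate_instab and four_gaussians helpers from the earlier sketch. This is only a loose illustration: the sketch runs the actual K-means iteration rather than the idealized global optimizer assumed by Theorem 3.2, and the helper names are ours.

```python
import numpy as np

def rescaled_instability(sample_from_P, k, sample_sizes=(100, 400, 1600), rng=None):
    """Estimate sqrt(n) * Instab(K, n) for several sample sizes n.  Under the
    assumptions of Theorem 3.2 these values approach a k-dependent constant
    instead of decaying to 0."""
    rng = np.random.default_rng(rng)
    return {n: float(np.sqrt(n)) * estimate_instab(sample_from_P, k, n, rng=rng)
            for n in sample_sizes}

# Example: compare the limits for different k on the four-Gaussian mixture, e.g.
# rescaled_instability(four_gaussians, k=4) versus rescaled_instability(four_gaussians, k=5).
```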

Proof sketch. It is well known that if Q_K^{(∞)} has a unique global minimum, then the centers constructed by the idealized K-means algorithm on a finite sample satisfy a central limit theorem [21]. That is, if we rescale the distances between the sample-based centers and the true centers with the factor √n, these rescaled distances converge to a normal distribution as n → ∞. When the cluster centers converge, the same can be said about the cluster boundaries. In this case, instability essentially counts how many points change side when the cluster boundaries move by some small amount. The points that potentially change side are the points close to the boundary of the true limit clustering. Counting these points is what the integrals ∫_{B_ij} ... p(x) dx in the definition of RInstab take care of. The exact characterization of how the cluster boundaries “jitter” can be derived from the central limit theorem. This leads to the term V_ij / ‖c_i^{(∞)} − c_j^{(∞)}‖ in the integral. V_ij characterizes how the cluster centers themselves “jitter”. The normalization ‖c_i^{(∞)} − c_j^{(∞)}‖ is needed to transform jittering of cluster centers to jittering of cluster boundaries: if two cluster centers are


very far apart from each other, the cluster boundary only jitters by a small amount if the centers move by ε, say. However, if the centers are very close to each other (say, they have distance 3ε), then moving the centers by ε has a large impact on the cluster boundary. The details of this proof are very technical; we refer the interested reader to references [23, 24].

Let us briefly explain how the result in Theorem 3.2 is compatible with the result in Theorem 3.1. On a high level, the difference between both results resembles the difference between the law of large numbers and the central limit theorem in probability theory. The LLN studies the convergence of the mean of a sum of random variables to its expectation (note that Instab has the form of a sum of random variables). The CLT is concerned with the same expression, but rescaled with a factor √n. For the rescaled sum, the CLT then gives results on the convergence in distribution. Note that in the particular case of instability, the distribution of distances lives on the non-negative numbers only. This is why the rescaled instability in Theorem 3.2 is positive and not 0 as in the limit of Instab in Theorem 3.1. A toy figure explaining the different convergence processes can be seen in Figure 3.2.

Theorem 3.2 tells us that different parameters k usually lead to different rescaled stabilities in the limit for n → ∞. Thus we can hope that if the sample size n is large enough we can distinguish between different values of k based on the stability of the corresponding clusterings. An important question is now which values of k lead to stable and which ones lead to instable results, for a given distribution P.

3.1.3 Characterizing Stable Clusterings

It is a straightforward consequence of Theorem 3.2 that if we consider different values k_1 and k_2 and the clustering objective functions Q_{k_1}^{(∞)} and Q_{k_2}^{(∞)} have unique global minima, then the rescaled stability values RInstab(k_1) and RInstab(k_2) can differ from each other. Now we want to investigate which values of k lead to high stability and which ones lead to low stability.

Conclusion 3.3 (Instable clusterings). Assume that Q_K^{(∞)} has a unique global optimum. If Instab(K, n) is large, the idealized K-means clustering tends to have cluster boundaries in high-density regions of the space.

Fig. 3.2 Different convergence processes. The left column shows the convergence studied in Theorem 3.1: as the sample size n → ∞, the distribution of distances d_MM(C, C') is degenerate, all mass is concentrated on 0. The right column shows the convergence studied in Theorem 3.2: the rescaled distances converge to a non-trivial distribution, and its mean (depicted by the cross) is positive. To go from the left to the right side one has to rescale by √n.

There exist two different derivations of this conclusion, which have been obtained independently from each other by completely different methods [3, 22]. On a high level, the reason why the conclusion tends to hold is that if cluster boundaries jitter in a region of high density,

then more points “change side” than if the boundaries jitter in a region

of low density.

First derivation, informal, based on references [22, 24]. Assume that n is large enough such that we are already in the asymptotic regime (that is, the solution c^{(n)} constructed on the finite sample is close to the true population solution c^{(∞)}). Then the rescaled instability computed on the sample is close to the expression given in Equation (3.4). If the cluster boundaries B_ij lie in a high-density region of the space, then

the integral in Equation (3.4) is large — compared to a situation where

the cluster boundaries lie in low-density regions of the space. From a

high level point of view, this justifies the conclusion above. However,


note that it is difficult to identify how exactly the quantities p, B_ij, and V_ij influence RInstab, as they are not independent of each other.

Second derivation, more formal, based on Ben-David and von Luxburg [3]. A formal way to prove the conclusion is as follows. We introduce a new distance d_boundary between two clusterings. This distance measures how far the cluster boundaries of two clusterings are apart from each other. One can prove that the K-means quality function Q_K^{(∞)} is continuous with respect to this distance function. This means that if two clusterings C, C' are close with respect to d_boundary, then they have similar quality values. Moreover, if Q_K^{(∞)} has a unique global optimum, we can invert this argument and show that if a clustering C is close to the optimal limit clustering C^{(∞)}, then the distance d_boundary(C, C^{(∞)}) is small. Now consider the clustering C^{(n)} based on a sample of size n. One can prove the following key statement. If C^{(n)} converges uniformly (over the space of all probability distributions) in the sense that with probability at least 1 − δ we have d_boundary(C^{(n)}, C^{(∞)}) ≤ γ, then:

\mathrm{Instab}(K, n) \le 2\delta + P\big( T_\gamma(B) \big).    (3.5)

Here P(T_γ(B)) denotes the probability mass of a tube of width γ around the cluster boundaries B of C^{(∞)}. Results in [1] establish the uniform convergence of the idealized K-means algorithm. This proves the conjecture: Equation (3.5) shows that if Instab is high, then there is a lot of mass around the cluster boundaries, namely the cluster boundaries are in a region of high density.

   For stable clusterings, the situation is not as simple. It is tempting

to make the following conjecture.

Conjecture 3.4 (Stable clusterings). Assume that Q_K^{(∞)} has a unique global optimum. If Instab(K, n) is “small”, the idealized K-means clustering tends to have cluster boundaries in low-density regions of the space.

Argument in favor of the conjecture: As in the first approach above, considering the limit expression of RInstab reveals that if the cluster boundary lies in a low-density area of the space, then the integral in

RInstab tends to have a low value. In the extreme case where the cluster

boundaries go through a region of zero density, the rescaled instability

is even 0.

Argument against the conjecture: counter-examples! One can con-

struct artificial examples where clusterings are stable although their

decision boundary lies in a high-density region of the space ([3]). The

way to construct such examples is to ensure that the variations of the

cluster centers happen in parallel to cluster boundaries and not orthog-

onal to cluster boundaries. In this case, the sampling variation does

not lead to jittering of the cluster boundary, hence the result is rather

stable.

    These counter-examples show that Conjecture 3.4 cannot be true in

general. However, my personal opinion is that the counter-examples are

rather artificial, and that similar situations will rarely be encountered

in practice. I believe that the conjecture “tends to hold” in practice.

It might be possible to formalize this intuition by proving that the

statement of the conjecture holds on a subset of “nice” and “natural”

probability distributions.

    The important consequence of Conclusion 3.3 and Conjecture 3.4

(if true) is the following.

Conclusion 3.5 (Stability of idealized K-means detects whether K is too large). Assume that the underlying distribution P has K well-separated clusters, and assume that these clusters can be represented by a center-based clustering model. Then the following statements tend to hold for the idealized K-means algorithm.

(1) If K is too large, then the clusterings obtained by the idealized K-means algorithm tend to be instable.

(2) If K is correct or too small, then the clusterings obtained by the idealized K-means algorithm tend to be stable (unless the objective function has several global minima, for example due to symmetries).


3.2 The Actual K-MeansAlgorithm

257

Given Conclusion 3.3 and Conjecture 3.4 it is easy to see why Conclusion 3.5 is true. If K is larger than the correct number of clusters, one necessarily has to split a true cluster into several smaller clusters. The corresponding boundary goes through a region of high density (the cluster which is being split). According to Conclusion 3.3 this leads to instability. If K is correct, then the idealized (!) K-means algorithm discovers the correct clustering and thus has decision boundaries between the true clusters, that is, in low-density regions of the space. If K is too small, then the K-means algorithm has to group clusters together. In this situation, the cluster boundaries are still between true clusters, hence in a low-density region of the space.


原创粉丝点击
热门问题 老师的惩罚 人脸识别 我在镇武司摸鱼那些年 重生之率土为王 我在大康的咸鱼生活 盘龙之生命进化 天生仙种 凡人之先天五行 春回大明朝 姑娘不必设防,我是瞎子 20岁胸下垂松软怎么办 断奶时乳房有肿块怎么办 孩子断奶后乳房变小怎么办 断奶了月经不来怎么办 钥匙在门上拔不出来怎么办 钥匙拔不下来了怎么办 养了几天鱼死了怎么办 乌龟的眼睛肿了怎么办 手被鱼刺扎了怎么办 被鱼刺扎手肿了怎么办 手被桂鱼扎了怎么办 三岁宝宝卡鱼刺怎么办 一岁宝宝卡鱼刺怎么办 鱼刺卡在胸口了怎么办 婴儿被鱼刺卡了怎么办 幼儿被鱼刺卡到怎么办 鱼刺被吞下去了怎么办 喉咙卡到鱼刺下不去怎么办 被小鱼刺卡了怎么办 晚上被鱼刺卡到怎么办 一个小鱼刺卡了怎么办 卡了一个小鱼刺怎么办 鱼刺卡在气管里怎么办 刺蛾幼虫 蛰了怎么办 被杨树辣子蛰了怎么办 蜇了老子蜇了怎么办 被刺蛾幼虫蛰了怎么办 孕妇被蚊虫叮咬发痒怎么办 白掌叶子尖发黄怎么办 白掌叶子卷了怎么办 白掌叶子全软了怎么办? 发财树有黄斑了怎么办 幸福树叶子蔫了怎么办 幸福树枝条塌了怎么办? 幸福树叶子嫣了怎么办 毒蚊子叮咬肿硬怎么办 被蚊子咬了很痒怎么办 蚊子咬了脚肿了怎么办 小孩被蚊子咬了怎么办 小狗老喜欢咬人怎么办 狗狗喜欢咬手怎么办