
Foundations and Trends® in Machine Learning
Vol. 2, No. 3 (2009) 235–274
© 2010 U. von Luxburg
DOI: 10.1561/2200000008

Clustering Stability: An Overview

By Ulrike von Luxburg

Contents

1 Introduction
2 Clustering Stability: Definition and Implementation
3 Stability Analysis of the K-Means Algorithm
  3.1 The Idealized K-Means Algorithm
  3.2 The Actual K-Means Algorithm
  3.3 Relationships between the Results
4 Beyond K-Means
5 Outlook
References



Clustering Stability: An Overview

Ulrike von Luxburg

Max Planck Institute for Biological Cybernetics, Tübingen, Germany,

ulrike.luxburg@tuebingen.mpg.de

Abstract

A popular method for selecting the number of clusters is based on

stability arguments: one chooses the number of clusters such that the

corresponding clustering results are “most stable”. In recent years, a

series of papers has analyzed the behavior of this method from a theo-

retical point of view. However, the results are very technical and difficult to interpret for non-experts. In this monograph we give a high-level overview of the existing literature on clustering stability. In addition to presenting the results in a slightly informal but accessible way, we relate them to each other and discuss their different implications.


1 Introduction

Model selection is a difficult problem in non-parametric clustering. The

obvious reason is that, as opposed to supervised classification, there is

no ground truth against which we could “test” our clustering results.

One of the most pressing questions in practice is how to determine the

number of clusters. Various ad hoc methods have been suggested in

the literature, but none of them is entirely convincing. These methods

usually suffer from the fact that they implicitly have to define “what a
clustering is” before they can assign different scores to different num-

bers of clusters. In recent years a new method has become increasingly

popular: selecting the number of clusters based on clustering stability.

Instead of defining “what is a clustering”, the basic philosophy is simply

that a clustering should be a structure on the data set that is “stable”.

That is, if applied to several data sets from the same underlying model

or of the same data-generating process, a clustering algorithm should

obtain similar results. In this philosophy it is not so important how

the clusters look (this is taken care of by the clustering algorithm), but

that they can be constructed in a stable manner.

   The basic intuition of why people believe that this is a good principle

can be described by Figure 1.1. Shown is a data distribution with four

underlying clusters (depicted by the black circles), and different samples
from this distribution (depicted by red diamonds). If we cluster this

data set into K = 2 clusters, there are two reasonable solutions: a horizontal and a vertical split. If a clustering algorithm is applied repeatedly to different samples from this distribution, it might sometimes construct the horizontal and sometimes the vertical solution. Obviously, these two solutions are very different from each other, hence the clustering results are instable. Similar effects take place if we start with K = 5. In this case, we necessarily have to split an existing cluster into two clusters, and depending on the sample this could happen to any of the four clusters. Again the clustering solution is instable. Finally, if we apply the algorithm with the correct number K = 4, we observe stable results (not shown in the figure): the clustering algorithm always discovers the correct clusters (maybe up to a few outlier points). In this example, the stability principle detects the correct number of clusters.

Fig. 1.1 Idea of clustering stability. Instable clustering solutions if the number of clusters is too small (first row, k = 2) or too large (second row, k = 5). See text for details.

    At first glance, using stability-based principles for model selection

appears to be very attractive. It is elegant as it avoids having to define what a

good clustering is. It is a meta-principle that can be applied to any basic

clustering algorithm and does not require a particular clustering model.

Finally, it sounds “very fundamental” from a philosophy of inference

point of view.



    However, the longer one thinks about this principle, the less obvious

it becomes that model selection based on clustering stability “always

works”. What is clear is that solutions that are completely instable

should not be considered at all. However, if there are several stable

solutions, is it always the best choice to select the one corresponding

to the most stable results? One could conjecture that the most sta-

ble parameter always corresponds to the simplest solution, but clearly

there exist situations where the simplest solution is not what we

are looking for. To find out how model selection based on clustering

stability works we need theoretical results.

    In this monograph we discuss a series of theoretical results on clus-

tering stability that have been obtained in recent years. In Section 2

we review different protocols for how clustering stability is computed

and used for model selection. In Section 3 we concentrate on theoretical

results for the K-means algorithm and discuss their various relations.

This is the main section of the paper. Results for more general cluster-

ing algorithms are presented in Section 4.


2 Clustering Stability: Definition and Implementation

A clustering C_K of a data set S = {X_1, ..., X_n} is a function that assigns labels to all points of S, that is, C_K : S → {1, ..., K}. Here K denotes the number of clusters. A clustering algorithm is a procedure that takes a set S of points as input and outputs a clustering of S. The clustering algorithms considered in this monograph take an additional parameter as input, namely the number K of clusters they are supposed to construct. We analyze clustering stability in a statistical setup. The data set S is assumed to consist of n data points X_1, ..., X_n that have been drawn independently from some unknown underlying distribution P on some space X. The final goal is to use these sample points to construct a good partition of the underlying space X. For some theoretical results it will be easier to ignore sampling effects and directly work on the underlying space X endowed with the probability distribution P. This can be considered as the case of having “infinitely many” data points. We sometimes call this the limit case for n → ∞.

Assume we agree on a way to compute distances d(C, C') between different clusterings C and C' (see below for details). Then, for a fixed probability distribution P, a fixed number K of clusters and a fixed sample size n, the instability of a clustering algorithm is defined as the expected distance between two clusterings C_K(S_n), C_K(S_n') on different data sets S_n, S_n' of size n, that is:

\mathrm{Instab}(K, n) := \mathbb{E}\,\big[ d\big( C_K(S_n), C_K(S_n') \big) \big].    (2.1)

The expectation is taken with respect to the drawing of the two sam-

ples.

    In practice, a large variety of methods has been devised to compute

stability scores and use them for model selection. On a very general

level they work as follows:

Given: a set S of data points, and a clustering algorithm A that takes the number k of clusters as input.

(1) For k = 2, ..., k_max:

    (a) Generate perturbed versions S_b (b = 1, ..., b_max) of the original data set (for example by subsampling or adding noise, see below).

    (b) For b = 1, ..., b_max: cluster the data set S_b with algorithm A into k clusters to obtain the clustering C_b.

    (c) For b, b' = 1, ..., b_max: compute pairwise distances d(C_b, C_b') between these clusterings (using one of the distance functions described below).

    (d) Compute the instability as the mean distance between the clusterings C_b:

        \mathrm{Instab}(k, n) = \frac{1}{b_{\max}^2} \sum_{b, b' = 1}^{b_{\max}} d(C_b, C_{b'}).

(2) Choose the parameter k that gives the best stability, in the simplest case as follows:

        K := \operatorname*{argmin}_{k} \mathrm{Instab}(k, n)

    (see below for more options).
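To make the protocol concrete, here is a minimal Python sketch of it. It is an illustration rather than a reference implementation: the function names (instability, choose_k, label_distance), the use of subsampling as the perturbation, scikit-learn's KMeans as the base algorithm A, and the comparison of clusterings on the overlap of two subsamples are all choices made for this sketch, not prescriptions of the protocol above.

```python
import numpy as np
from itertools import permutations
from sklearn.cluster import KMeans

def label_distance(labels_a, labels_b, k):
    """Fraction of points on which two labelings of the same points disagree,
    minimized over all permutations of the k labels (feasible for small k)."""
    best = 1.0
    for perm in permutations(range(k)):
        mapped = np.array([perm[l] for l in labels_a])
        best = min(best, float(np.mean(mapped != labels_b)))
    return best

def instability(X, k, b_max=20, subsample_frac=0.8, rng=None):
    """Step (1) of the protocol: cluster b_max random subsamples with k-means and
    average the pairwise distances between the resulting clusterings, compared
    on the points that the two subsamples have in common."""
    rng = np.random.default_rng(rng)
    n = len(X)
    m = int(subsample_frac * n)
    idx = [rng.choice(n, size=m, replace=False) for _ in range(b_max)]
    labels = [KMeans(n_clusters=k, n_init=10).fit_predict(X[ind]) for ind in idx]
    dists = []
    for a in range(b_max):
        for b in range(a + 1, b_max):
            common, pos_a, pos_b = np.intersect1d(idx[a], idx[b], return_indices=True)
            if len(common) > 0:
                dists.append(label_distance(labels[a][pos_a], labels[b][pos_b], k))
    return float(np.mean(dists))

def choose_k(X, k_max=10):
    """Step (2): pick the k with the smallest (unnormalized) instability."""
    scores = {k: instability(X, k) for k in range(2, k_max + 1)}
    return min(scores, key=scores.get), scores
```

As discussed below, the raw scores returned by choose_k should be normalized before the minimum is taken, since instability trivially scales with k.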

    This scheme gives a very rough overview of how clustering stability

can be used for model selection. In practice, many details have to be

taken into account, and they will be discussed in the next section.

Finally, we want to mention an approach that is vaguely related to

clustering stability, namely the ensemble method [26]. Here, an ensem-

ble of algorithms is applied to one fixed data set. Then a final clustering



is built from the results of the individual algorithms. We are not going

to discuss this approach in our monograph.

Generating perturbed versions of the data set. To be able to evaluate the stability of a fixed clustering algorithm we need to run the clustering algorithm several times on slightly different data sets.

To this end we need to generate perturbed versions of the original data

set. In practice, the following schemes have been used:

• Draw a random subsample of the original data set without replacement [5, 12, 15, 17].

• Add random noise to the original data points [8, 19].

• If the original data set is high-dimensional, use different random projections into low-dimensional spaces, and then cluster the low-dimensional data sets [25].

• If we work in a model-based framework, sample data from the model [14].

• Draw a random sample of the original data with replacement. This approach has not been reported in the literature yet, but it avoids the problem of setting the size of the subsample. For good reasons, this kind of sampling is the standard in the bootstrap literature [11] and might also have advantages in the stability setting. This scheme requires that the algorithm can deal with weighted data points (because some data points will occur several times in the sample).
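For illustration, here is a minimal sketch of three of the perturbation schemes just listed (subsampling without replacement, additive noise, and a bootstrap sample expressed as integer weights); the function names and default parameters are ours, not part of the cited methods.

```python
import numpy as np

def subsample(X, frac=0.8, rng=None):
    """Random subsample of the data without replacement, as in [5, 12, 15, 17]."""
    rng = np.random.default_rng(rng)
    keep = rng.choice(len(X), size=int(frac * len(X)), replace=False)
    return X[keep]

def jitter(X, sigma=0.05, rng=None):
    """Add isotropic Gaussian noise to the original data points, as in [8, 19]."""
    rng = np.random.default_rng(rng)
    return X + sigma * rng.standard_normal(X.shape)

def bootstrap_weights(X, rng=None):
    """Sample with replacement, returned as integer weights (point i occurs
    weights[i] times); only usable with algorithms that accept weighted points."""
    rng = np.random.default_rng(rng)
    n = len(X)
    weights = np.bincount(rng.integers(0, n, size=n), minlength=n)
    return X, weights
```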

In all cases, there is a trade-off that has to be treated carefully. If we

change the data set too much (for example, the subsample is too small,

or the noise too large), then we might destroy the structure we want

to discover by clustering. If we change the data set too little, then the

clustering algorithm will always obtain the same results, and we will

observe trivial stability. It is hard to quantify this trade-off in practice.

Which clusterings to compare? Different protocols are used to compare the clusterings on the different data sets S_b.

• Compare the clustering of the original data set with the clusterings obtained on subsamples [17].


• Compare clusterings of overlapping subsamples on the data points where both clusterings are defined [5].

• Compare clusterings of disjoint subsamples [12, 15]. Here we first need to apply an extension operator to extend each clustering to the domain of the other clustering.

Distances between clusterings. If two clusterings are defined on the

same data points, then it is straightforward to compute a distance score

between these clusterings based on any of the well-known clustering

distances such as the Rand index, Jaccard index, Hamming distance,

minimal matching distance, and Variation of Information distance [18].

All these distances count, in some way or the other, points or pairs of

points on which the two clusterings agree or disagree. The most conve-

nient choice from a theoretical point of view is the minimal matching

distance. For two clusterings C, C' of the same data set of n points it is defined as:

d_{\mathrm{MM}}(C, C') := \min_{\pi} \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{ C(X_i) \neq \pi(C'(X_i)) \},    (2.2)

where the minimum is taken over all permutations π of the K labels.

Intuitively, the minimal matching distance measures the same quantity

as the 0–1-loss used in supervised classification. For a stability study

involving the adjusted Rand index or an adjusted mutual information

index see Vinh and Epps [27].
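As a practical aside, Equation (2.2) does not require enumerating all K! permutations: maximizing the number of label agreements over permutations is an assignment problem, which the sketch below (our naming) solves with SciPy's Hungarian-algorithm routine.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def minimal_matching_distance(labels_a, labels_b, k):
    """Minimal matching distance of Equation (2.2): the fraction of points on
    which the two clusterings disagree, minimized over permutations of the
    K labels, computed via an optimal assignment on the contingency table."""
    n = len(labels_a)
    contingency = np.zeros((k, k), dtype=int)
    for a, b in zip(labels_a, labels_b):
        contingency[a, b] += 1          # points with label a in C and label b in C'
    rows, cols = linear_sum_assignment(-contingency)  # maximize total agreements
    return 1.0 - contingency[rows, cols].sum() / n
```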

If two clusterings are defined on different data sets one has two choices. If the two data sets have a big overlap one can use a restriction operator to restrict the clusterings to the points that are contained in both data sets. On this restricted set one can then compute a standard distance between the two clusterings. The other possibility is to use an extension operator to extend both clusterings from their domain to the domain of the other clustering. Then one can compute a standard distance between the two clusterings as they are now both defined on the joint domain. For center-based clusterings, as constructed by the K-means algorithm, a natural extension operator exists. Namely,

to a new data point we simply assign the label of the closest cluster

center. A more general scheme to extend an existing clustering to new



data points is to train a classifier on the old data points and use its

predictions as labels on the new data points. However, in the context

of clustering stability it is not obvious what kind of bias we introduce

with this approach.
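The center-based extension operator described above can be written in a few lines; the sketch below (our naming) extends two center-based clusterings to the union of their samples and compares them with the minimal matching distance from the previous sketch.

```python
import numpy as np

def extend_by_centers(centers, X_new):
    """Extension operator for center-based clusterings: each new point gets the
    label of its closest cluster center."""
    d2 = ((X_new[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def distance_on_joint_domain(centers_a, centers_b, X_a, X_b, k):
    """Extend both clusterings to the joint domain of the two samples and
    compute the minimal matching distance there."""
    X_union = np.vstack([X_a, X_b])
    labels_a = extend_by_centers(centers_a, X_union)
    labels_b = extend_by_centers(centers_b, X_union)
    return minimal_matching_distance(labels_a, labels_b, k)  # sketch above
```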

Stability scores and their normalization. The stability protocol outlined above results in a set of distance values (d(C_b, C_{b'}))_{b, b' = 1, ..., b_max}. In most approaches, one summarizes these values by taking their mean:

\mathrm{Instab}(k, n) = \frac{1}{b_{\max}^2} \sum_{b, b' = 1}^{b_{\max}} d(C_b, C_{b'}).

Note that the mean is the simplest summary statistic one can compute

based on the distance values d(C_b, C_{b'}). A different approach is to use the area under the cumulative distribution function of the distance values as the stability score, see Ben-Hur et al. [5] or Bertoni and Valentini [6]

for details. In principle one could also come up with more elaborate

statistics based on distance values. To the best of our knowledge, such

concepts have not been used so far.

The simplest way to select the number K of clusters is to minimize the instability:

K = \operatorname*{argmin}_{k=2,\dots,k_{\max}} \mathrm{Instab}(k, n).

This approach has been suggested in Levine and Domany [17]. However, an important fact to note is that Instab(k, n) trivially scales with k, regardless of what the underlying data structure is. For example, in the top left plot in Figure 2.1 we can see that even for a completely unclustered data set, Instab(k, n) increases with k. When using stability for model selection, one should correct for the trivial scaling of Instab, otherwise it might be meaningless to take the minimum afterwards.

Fig. 2.1 Normalized stability scores. Left plots: data points from a uniform density on [0, 1]^2. Right plots: data points from a mixture of four well-separated Gaussians in R^2. The first row always shows the unnormalized instability Instab for K = 2, ..., 15. The second row shows the instability Instab_null obtained on a reference distribution (uniform distribution). The third row shows the normalized stability Instab_norm.

There exist several different normalization protocols:

• Normalization using a reference null distribution [6, 12]. One repeatedly samples data sets from some reference null distribution. Such a distribution is defined on the same domain as the data points, but does not possess any cluster structure. In simple cases one can use the uniform distribution on the data domain as null distribution. A more practical approach

  is to scramble the individual dimensions of the existing data

  points and use the “scrambled points” as null distribution

  (see [6, 12] for details). Once we have drawn several data

  sets from the null distribution, we cluster them using our

  clustering algorithm and compute the corresponding stabil-

ity score Instab_null as above. The normalized stability is then defined as Instab_norm := Instab / Instab_null.

• Normalization by random labels [15]. First, we cluster each of the data sets S_b as in the protocol above to obtain the clusterings C_b. Then, we randomly permute these labels. That is, we assign to data point X_i the label that belonged to X_{π(i)}, where π is a permutation of {1, ..., n}. This leads to a permuted clustering C_{b,perm}. We then compute the stability score Instab as above, and similarly we compute Instab_perm for the permuted clusterings. The normalized stability is then defined as Instab_norm := Instab / Instab_perm.

Once we have computed the normalized stability scores Instab_norm, we can choose the number of clusters that has the smallest normalized instability, that is:

K = \operatorname*{argmin}_{k=2,\dots,k_{\max}} \mathrm{Instab}_{\mathrm{norm}}(k, n).

This approach has been taken for example in Ben-Hur et al. [5] and

Lange et al. [15].
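To illustrate the null-distribution normalization concretely, the following sketch (our naming, reusing the instability helper from the earlier protocol sketch) builds null data sets by scrambling the individual dimensions of the data, as described above, and selects the k with the smallest normalized instability. The details of the procedures in [6, 12, 15] differ; this only shows the overall structure.

```python
import numpy as np

def scramble_dimensions(X, rng=None):
    """Reference null data: permute each coordinate independently, which keeps
    the marginal distributions but destroys any cluster structure."""
    rng = np.random.default_rng(rng)
    return np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])

def normalized_instability(X, k, n_null=5, rng=None):
    """Instab_norm = Instab / Instab_null, with Instab_null averaged over
    several scrambled null data sets."""
    rng = np.random.default_rng(rng)
    instab = instability(X, k, rng=rng)                 # from the protocol sketch
    instab_null = np.mean([instability(scramble_dimensions(X, rng), k, rng=rng)
                           for _ in range(n_null)])
    return instab / instab_null

def choose_k_normalized(X, k_max=10):
    """Select the k with the smallest normalized instability."""
    scores = {k: normalized_instability(X, k) for k in range(2, k_max + 1)}
    return min(scores, key=scores.get), scores
```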

Selecting K based on statistical tests. A second approach to select the final number of clusters is to use a statistical test. Similarly to the normalization considered above, the idea is to compute stability scores not only on the actual data set, but also on “null data sets” drawn from some reference null distribution. Then one tests whether, for a given parameter k, the stability on the actual data is significantly larger than the one computed on the null data. If there are several values k for which this is the case, then one selects the one that is most significant. The most well-known implementation of such a procedure uses bootstrap methods [12]. Other authors use a χ²-test [6] or a test based on the Bernstein inequality [7].
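The sketch below illustrates the general shape of such a test with a simple empirical p-value against scrambled null data sets; it is not the bootstrap test of [12], the χ²-test of [6], or the Bernstein-inequality test of [7], and it reuses the instability and scramble_dimensions helpers from the earlier sketches.

```python
import numpy as np

def stability_p_value(X, k, n_null=20, rng=None):
    """Empirical p-value: how often a clustering of null data is at least as
    stable (i.e. has at most the instability) as the clustering of the real data."""
    rng = np.random.default_rng(rng)
    observed = instability(X, k, rng=rng)
    null_scores = np.array([instability(scramble_dimensions(X, rng), k, rng=rng)
                            for _ in range(n_null)])
    return (1 + np.sum(null_scores <= observed)) / (1 + n_null)

def choose_k_by_test(X, k_max=10, alpha=0.05):
    """Among the values of k that pass the test, select the most significant one."""
    pvals = {k: stability_p_value(X, k) for k in range(2, k_max + 1)}
    significant = {k: p for k, p in pvals.items() if p <= alpha}
    return (min(significant, key=significant.get) if significant else None), pvals
```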

To summarize, there are many different implementations for selecting the number K of clusters based on stability scores. Until now, there does not exist any convincing empirical study that thoroughly compares all these approaches on a variety of data sets. In my opinion, even fundamental issues such as the normalization have not been investigated in enough detail. For example, in my experience normalization often has no effect whatsoever (but I did not conduct a thorough study either). To put stability-based model selection on a firm ground it would be crucial to compare the different approaches with each other in an extensive case study.


3 Stability Analysis of the K-Means Algorithm

The vast majority of papers about clustering stability use the K-means algorithm as the basic clustering algorithm. In this section we discuss the stability results for the K-means algorithm in depth. Later, in Section 4, we will see how these results can be extended to other clustering algorithms.

For simpler reference we briefly recapitulate the K-means algorithm (details can be found in many textbooks, for example [13]). Given a set of n data points X_1, ..., X_n ∈ R^d and a fixed number K of clusters to construct, the K-means algorithm attempts to minimize the clustering objective function:

Q_K^{(n)}(c_1, \dots, c_K) = \frac{1}{n} \sum_{i=1}^{n} \min_{k=1,\dots,K} \lVert X_i - c_k \rVert^2,    (3.1)

where c_1, ..., c_K denote the centers of the K clusters. In the limit n → ∞, the K-means clustering is the one that minimizes the limit objective function:

Q_K^{(\infty)}(c_1, \dots, c_K) = \int \min_{k=1,\dots,K} \lVert X - c_k \rVert^2 \, dP(X),    (3.2)

where P is the underlying probability distribution.


Given an initial set c^{(0)} = {c_1^{(0)}, ..., c_K^{(0)}} of centers, the K-means algorithm iterates the following two steps until convergence:

(1) Assign data points to the closest cluster centers: for i = 1, ..., n,

    C^{(t)}(X_i) := \operatorname*{argmin}_{k=1,\dots,K} \lVert X_i - c_k^{(t)} \rVert.

(2) Re-adjust the cluster means: for k = 1, ..., K,

    c_k^{(t+1)} := \frac{1}{N_k} \sum_{\{ i \,|\, C^{(t)}(X_i) = k \}} X_i,

where N_k denotes the number of points in cluster k.
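For completeness, here is a compact NumPy sketch of these two steps (our naming; in practice one would rather use an existing implementation such as scikit-learn's KMeans, which adds careful initialization and multiple restarts).

```python
import numpy as np

def kmeans(X, k, n_iter=100, rng=None):
    """Plain K-means (Lloyd's algorithm): alternate the assignment step (1) and
    the re-adjustment step (2) above.  The result is a local optimum of the
    objective (3.1) and need not be the global optimum."""
    rng = np.random.default_rng(rng)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # random initialization
    for _ in range(n_iter):
        # step (1): assign every point to its closest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # step (2): move every center to the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```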

It is well known that, in general, the K-means algorithm terminates in a local optimum of Q_K^{(n)} and does not necessarily find the global optimum. We study the K-means algorithm in two different scenarios:

The idealized scenario: Here we assume an idealized algorithm that always finds the global optimum of the K-means objective function Q_K^{(n)}. For simplicity, we call this algorithm the idealized K-means algorithm.

The realistic scenario: Here we analyze the actual K-means algorithm as described above. In particular, we take into account its property of getting stuck in local optima. We also take into account the initialization of the algorithm.

In both scenarios, our theoretical investigations are based on the following simple protocol to compute the stability of the K-means algorithm:

(1) We assume that we have access to as many independent samples of size n of the underlying distribution as we want. That is, we ignore artifacts introduced by the fact that in practice we draw subsamples of one fixed, given sample and thus might introduce a bias.

(2) As distance between two K-means clusterings of two samples S, S' we use the minimal matching distance between the extended clusterings on the domain S ∪ S'.

(3) We work with the expected minimal matching distance as in Equation (2.1), that is, we analyze the expected instability Instab rather than the empirical estimate of it that is used in practice. This does not do much harm as instability scores are highly concentrated around their means anyway.
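Under this protocol, Instab(K, n) can be approximated by Monte Carlo simulation whenever we can draw fresh samples from the underlying distribution P. The sketch below (our naming) does this with the kmeans, extend_by_centers, and minimal_matching_distance helpers from the earlier sketches; note that it uses the actual K-means iteration with random initialization rather than the idealized algorithm, so it corresponds to the realistic scenario.

```python
import numpy as np

def estimate_instab(sample_from_P, k, n, n_pairs=50, rng=None):
    """Monte Carlo estimate of Instab(K, n): average minimal matching distance
    between K-means clusterings of independent samples of size n, compared on
    the union of the two samples via the center-based extension operator."""
    rng = np.random.default_rng(rng)
    dists = []
    for _ in range(n_pairs):
        X1, X2 = sample_from_P(n, rng), sample_from_P(n, rng)
        _, centers1 = kmeans(X1, k, rng=rng)
        _, centers2 = kmeans(X2, k, rng=rng)
        X_union = np.vstack([X1, X2])
        dists.append(minimal_matching_distance(extend_by_centers(centers1, X_union),
                                               extend_by_centers(centers2, X_union), k))
    return float(np.mean(dists))

def four_gaussians(n, rng):
    """Illustrative sampler: a mixture of four well-separated Gaussians in R^2,
    the kind of distribution shown in Figures 1.1 and 2.1 (parameters are ours)."""
    means = np.array([[0.0, 0.0], [0.0, 5.0], [5.0, 0.0], [5.0, 5.0]])
    comp = rng.integers(0, 4, size=n)
    return means[comp] + 0.3 * rng.standard_normal((n, 2))
```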

3.1 The Idealized K-Means Algorithm

In this section we focus on the idealized K-means algorithm, that is, the algorithm that always finds the global optimum c^{(n)} of the K-means objective function:

c^{(n)} := (c_1^{(n)}, \dots, c_K^{(n)}) := \operatorname*{argmin}_{c} Q_K^{(n)}(c).

3.1.1 First Convergence Result and the Role of Symmetry

The starting point for the results in this section is the following obser-

vation [4]. Consider the situation in Figure 3.1a. Here the data contains

three clusters, but two of them are closer to each other than to the third

cluster. Assume we run the idealized K-means algorithm with K = 2 on such a data set. Separating the left two clusters from the right cluster (solid line) leads to a much better value of Q_K^{(n)} than, say, separating the top two clusters from the bottom one (dashed line). Hence, as soon as we have a reasonable amount of data, idealized (!) K-means with K = 2 always constructs the first solution (solid line). Consequently, it is stable in spite of the fact that K = 2 is the wrong number of clusters. Note that this would not happen if the data set were symmetric, as depicted in Figure 3.1b. Here neither the solution depicted by the dashed line nor the one with the solid line is clearly superior, which leads to instability if the idealized K-means algorithm is applied to different samples. Similar examples can be constructed to detect that K is too large, see Figure 3.1c and d. With K = 3 it is clearly the best

solution to split the big cluster in Figure 3.1c, thus clustering this data

set is stable. In Figure 3.1d, however, due to symmetry reasons neither

splitting the top nor the bottom cluster leads to a clear advantage.

Again this leads to instability.


Fig. 3.1 If data sets are not symmetric, idealized K-means is stable even if the number K of clusters is too small (a) or too large (c). Instability of the wrong number of clusters only occurs in symmetric data sets (b and d).

These informal observations suggest that unless the data set contains perfect symmetries, the idealized K-means algorithm is stable even if K is wrong. This can be formalized with the following theorem.

Theorem 3.1 (Stability and global optima of the objective function). Let P be a probability distribution on R^d and Q_K^{(∞)} the limit K-means objective function as defined in Equation (3.2), for some fixed value K > 1.

(1) If Q_K^{(∞)} has a unique global minimum, then the idealized K-means algorithm is perfectly stable as n → ∞, that is:

    \lim_{n \to \infty} \mathrm{Instab}(K, n) = 0.

(2) If Q_K^{(∞)} has several global minima (for example, because the probability distribution is symmetric), then the idealized K-means algorithm is instable, that is:

    \lim_{n \to \infty} \mathrm{Instab}(K, n) > 0.

This theorem has been proved (in a slightly more general setting) in

references [2, 4].

Proof sketch, Part 1. It is well known that if the objective function Q_K^{(∞)} has a unique global minimum, then the centers c^{(n)} constructed by the idealized K-means algorithm on a sample of n points almost


surely converge to the true population centers c^{(∞)} as n → ∞ [20]. This means that given some ε > 0 we can find some large n such that c^{(n)} is ε-close to c^{(∞)} with high probability. As a consequence, if we compare two clusterings on different samples of size n, the centers of the two clusterings are at most 2ε-close to each other. Finally, one can show that if the cluster centers of two clusterings are ε-close, then their minimal matching distance is small as well. Thus, the expected distance between the clusterings constructed on two samples of size n becomes arbitrarily small and eventually converges to 0 as n → ∞.

Part 2. For simplicity, consider the symmetric situation in Figure 3.1b. Here the probability distribution has three axes of symmetry. For K = 2 the objective function Q_2^{(∞)} has three global minima c^{(1)}, c^{(2)}, c^{(3)} corresponding to the three symmetric solutions. In such a situation, the idealized K-means algorithm on a sample of n points gets arbitrarily close to one of the global optima, that is, min_{i=1,2,3} d(c^{(n)}, c^{(i)}) → 0 [16]. In particular, the sequence (c^{(n)})_n of empirical centers has three convergent subsequences, each of which converges to one of the global solutions. One can easily conclude that if we compare two clusterings on random samples, with probability 1/3 they belong to “the same subsequence” and thus their distance will become arbitrarily small. With probability 2/3 they “belong to different subsequences”, and thus their distance remains larger than a constant a > 0. From the latter we can conclude that Instab(K, n) is always larger than 2a/3.

The interpretation of this theorem is distressing. The stability or instability of parameter K does not depend on whether K is “correct” or “wrong”, but only on whether the K-means objective function for this particular value K has one or several global minima. However, the number of global minima is usually not related to the number of clusters, but rather to the fact that the underlying probability distribution has symmetries. In particular, if we consider “natural” data distributions, such distributions are rarely perfectly symmetric. Consequently, the corresponding functions Q_K^{(∞)} usually only have one global minimum, for any value of K. In practice this means that for a large sample size n, the idealized K-means algorithm is stable for any value of K.

This seems to suggest that model selection based on clustering stability



does not work. However, we will see later in Section 3.3 that this result

is essentially an artifact of the idealized clustering setting and does not

carry over to the realistic setting.

3.1.2 Refined Convergence Results for the Case of a Unique Global Minimum

Above we have seen that if, for a particular distribution P and a particular value K, the objective function Q_K^{(∞)} has a unique global minimum, then the idealized K-means algorithm is stable in the sense that lim_{n→∞} Instab(K, n) = 0. At first glance, this seems to suggest that stability cannot distinguish between different values k_1 and k_2 (at least for large n). However, this point of view is too simplistic. It can happen that even though both Instab(k_1, n) and Instab(k_2, n) converge to 0 as n → ∞, this happens “faster” for k_1 than for k_2. If measured relative to the absolute values of Instab(k_1, n) and Instab(k_2, n), the difference between Instab(k_1, n) and Instab(k_2, n) can still be large enough to be “significant”.

The key to verifying this intuition is to study the limit process more closely. This line of work has been established by Shamir and Tishby in a series of papers [22, 23, 24]. The main idea is that instead of studying the convergence of Instab(k, n) one needs to consider the rescaled instability √n · Instab(k, n). One can prove that the rescaled instability converges in distribution, and the limit distribution depends on k. In particular, the means of the limit distributions are different for different values of k. This can be formalized as follows.

Theorem 3.2 (Convergence of rescaled stability). Assume that the probability distribution P has a density p. Consider a fixed parameter K, and assume that the corresponding limit objective function Q_K^{(∞)} has a unique global minimum c^{(∞)} = (c_1^{(∞)}, ..., c_K^{(∞)}). The boundary between clusters i and j is denoted by B_ij. Let m ∈ N, and let S_{n,1}, ..., S_{n,2m} be samples of size n drawn independently from P. Let C_K(S_{n,i}) be the result of the idealized K-means clustering on sample S_{n,i}. Compute the instability as the mean distance between clusterings of disjoint pairs of samples, that is:

\mathrm{Instab}(K, n) := \frac{1}{m} \sum_{i=1}^{m} d_{\mathrm{MM}}\big( C_K(S_{n,2i-1}), C_K(S_{n,2i}) \big).    (3.3)

Then, as n → ∞ and m → ∞, the rescaled instability √n · Instab(K, n) converges in probability to

\mathrm{RInstab}(K) := \sum_{1 \le i < j \le K} \int_{B_{ij}} \frac{V_{ij}}{\lVert c_i^{(\infty)} - c_j^{(\infty)} \rVert} \, p(x) \, dx,    (3.4)

where V_ij stands for a term describing the asymptotics of the random fluctuations of the cluster boundary between cluster i and cluster j (the exact formula is given in [23, 24]).

    Note that even though the definition of instability in Equation (3.3)

differs slightly from the definition in Equation (2.1), intuitively it mea-

sures the same quantity. The definition in Equation (3.3) just has the

technical advantage that all pairs of samples are independent from one

another.
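To see the rescaling at work numerically, one can estimate √n · Instab(K, n) for increasing n, for example with the estimate_instab and four_gaussians helpers from the earlier sketch. This is only a loose illustration: the sketch runs the actual K-means iteration rather than the idealized global optimizer assumed by Theorem 3.2, and the helper names are ours.

```python
import numpy as np

def rescaled_instability(sample_from_P, k, sample_sizes=(100, 400, 1600), rng=None):
    """Estimate sqrt(n) * Instab(K, n) for several sample sizes n.  Under the
    assumptions of Theorem 3.2 these values approach a k-dependent constant
    instead of decaying to 0."""
    rng = np.random.default_rng(rng)
    return {n: float(np.sqrt(n)) * estimate_instab(sample_from_P, k, n, rng=rng)
            for n in sample_sizes}

# Example: compare the limits for different k on the four-Gaussian mixture, e.g.
# rescaled_instability(four_gaussians, k=4) versus rescaled_instability(four_gaussians, k=5).
```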

Proof sketch. It is well known that if Q_K^{(∞)} has a unique global minimum, then the centers constructed by the idealized K-means algorithm on a finite sample satisfy a central limit theorem [21]. That is, if we rescale the distances between the sample-based centers and the true centers with the factor √n, these rescaled distances converge to a normal distribution as n → ∞. When the cluster centers converge, the same can be said about the cluster boundaries. In this case, instability essentially counts how many points change side when the cluster boundaries move by some small amount. The points that potentially change side are the points close to the boundary of the true limit clustering. Counting these points is what the integrals ∫_{B_ij} ... p(x) dx in the definition of RInstab take care of. The exact characterization of how the cluster boundaries “jitter” can be derived from the central limit theorem. This leads to the term V_ij / ‖c_i^{(∞)} − c_j^{(∞)}‖ in the integral. V_ij characterizes how the cluster centers themselves “jitter”. The normalization ‖c_i^{(∞)} − c_j^{(∞)}‖ is needed to transform jittering of cluster centers to jittering of cluster boundaries: if two cluster centers are


very far apart from each other, the cluster boundary only jitters by a small amount if the centers move by ε, say. However, if the centers are very close to each other (say, they have distance 3ε), then moving the centers by ε has a large impact on the cluster boundary. The details of this proof are very technical; we refer the interested reader to references [23, 24].

Let us briefly explain how the result in Theorem 3.2 is compatible with the result in Theorem 3.1. On a high level, the difference between both results resembles the difference between the law of large numbers and the central limit theorem in probability theory. The LLN studies the convergence of the mean of a sum of random variables to its expectation (note that Instab has the form of a sum of random variables). The CLT is concerned with the same expression, but rescaled with a factor √n. For the rescaled sum, the CLT then gives results on the convergence in distribution. Note that in the particular case of instability, the distribution of distances lives on the non-negative numbers only. This is why the rescaled instability in Theorem 3.2 is positive and not 0 as in the limit of Instab in Theorem 3.1. A toy figure explaining the different convergence processes can be seen in Figure 3.2.

Theorem 3.2 tells us that different parameters k usually lead to different rescaled stabilities in the limit for n → ∞. Thus we can hope that if the sample size n is large enough we can distinguish between different values of k based on the stability of the corresponding clusterings. An important question is now which values of k lead to stable and which ones lead to instable results, for a given distribution P.

3.1.3 Characterizing Stable Clusterings

It is a straightforward consequence of Theorem 3.2 that if we consider different values k_1 and k_2 and the clustering objective functions Q_{k_1}^{(∞)} and Q_{k_2}^{(∞)} have unique global minima, then the rescaled stability values RInstab(k_1) and RInstab(k_2) can differ from each other. Now we want to investigate which values of k lead to high stability and which ones lead to low stability.

Conclusion 3.3 (Instable clusterings). Assume that Q_K^{(∞)} has a unique global optimum. If Instab(K, n) is large, the idealized K-means clustering tends to have cluster boundaries in high-density regions of the space.

Fig. 3.2 Different convergence processes. The left column shows the convergence studied in Theorem 3.1: as the sample size n → ∞, the distribution of distances d_MM(C, C') is degenerate, all mass is concentrated on 0. The right column shows the convergence studied in Theorem 3.2: the rescaled distances converge to a non-trivial distribution, and its mean (depicted by the cross) is positive. To go from the left to the right side one has to rescale by √n.

There exist two different derivations of this conclusion, which have been obtained independently from each other by completely different methods [3, 22]. On a high level, the reason why the conclusion tends to hold is that if cluster boundaries jitter in a region of high density,

then more points “change side” than if the boundaries jitter in a region

of low density.

First derivation, informal, based on references [22, 24]. Assume that n is large enough such that we are already in the asymptotic regime (that is, the solution c^{(n)} constructed on the finite sample is close to the true population solution c^{(∞)}). Then the rescaled instability computed on the sample is close to the expression given in Equation (3.4). If the cluster boundaries B_ij lie in a high-density region of the space, then

the integral in Equation (3.4) is large — compared to a situation where

the cluster boundaries lie in low-density regions of the space. From a

high level point of view, this justifies the conclusion above. However,


note that it is difficult to identify how exactly the quantities p, B_ij, and V_ij influence RInstab, as they are not independent of each other.

Second derivation, more formal, based on Ben-David and von Luxburg [3]. A formal way to prove the conclusion is as follows. We introduce a new distance d_boundary between two clusterings. This distance measures how far the cluster boundaries of two clusterings are apart from each other. One can prove that the K-means quality function Q_K^{(∞)} is continuous with respect to this distance function. This means that if two clusterings C, C' are close with respect to d_boundary, then they have similar quality values. Moreover, if Q_K^{(∞)} has a unique global optimum, we can invert this argument and show that if a clustering C is close to the optimal limit clustering C^{(∞)}, then the distance d_boundary(C, C^{(∞)}) is small. Now consider the clustering C^{(n)} based on a sample of size n. One can prove the following key statement. If C^{(n)} converges uniformly (over the space of all probability distributions) in the sense that with probability at least 1 − δ we have d_boundary(C^{(n)}, C^{(∞)}) ≤ γ, then:

\mathrm{Instab}(K, n) \le 2\delta + P\big( T_\gamma(B) \big).    (3.5)

Here P(T_γ(B)) denotes the probability mass of a tube of width γ around the cluster boundaries B of C^{(∞)}. Results in [1] establish the uniform convergence of the idealized K-means algorithm. This proves the conjecture: Equation (3.5) shows that if Instab is high, then there is a lot of mass around the cluster boundaries, namely the cluster boundaries are in a region of high density.

   For stable clusterings, the situation is not as simple. It is tempting

to make the following conjecture.

Conjecture 3.4 (Stable clusterings). Assume that Q_K^{(∞)} has a unique global optimum. If Instab(K, n) is “small”, the idealized K-means clustering tends to have cluster boundaries in low-density regions of the space.

Argument in favor of the conjecture: As in the first approach above, considering the limit expression of RInstab reveals that if the cluster boundary lies in a low-density area of the space, then the integral in

RInstab tends to have a low value. In the extreme case where the cluster

boundaries go through a region of zero density, the rescaled instability

is even 0.

Argument against the conjecture: counter-examples! One can con-

struct artificial examples where clusterings are stable although their

decision boundary lies in a high-density region of the space ([3]). The

way to construct such examples is to ensure that the variations of the

cluster centers happen in parallel to cluster boundaries and not orthog-

onal to cluster boundaries. In this case, the sampling variation does

not lead to jittering of the cluster boundary, hence the result is rather

stable.

    These counter-examples show that Conjecture 3.4 cannot be true in

general. However, my personal opinion is that the counter-examples are

rather artificial, and that similar situations will rarely be encountered

in practice. I believe that the conjecture “tends to hold” in practice.

It might be possible to formalize this intuition by proving that the

statement of the conjecture holds on a subset of “nice” and “natural”

probability distributions.

    The important consequence of Conclusion 3.3 and Conjecture 3.4

(if true) is the following.

Conclusion 3.5 (Stability of idealized K-means detects whether K is too large). Assume that the underlying distribution P has K well-separated clusters, and assume that these clusters can be represented by a center-based clustering model. Then the following statements tend to hold for the idealized K-means algorithm.

(1) If K is too large, then the clusterings obtained by the idealized K-means algorithm tend to be instable.

(2) If K is correct or too small, then the clusterings obtained by the idealized K-means algorithm tend to be stable (unless the objective function has several global minima, for example due to symmetries).


3.2 The Actual K-MeansAlgorithm

257

Given Conclusion 3.3 and Conjecture 3.4 it is easy to see why Conclusion 3.5 is true. If K is larger than the correct number of clusters, one necessarily has to split a true cluster into several smaller clusters. The corresponding boundary goes through a region of high density (the cluster which is being split). According to Conclusion 3.3 this leads to instability. If K is correct, then the idealized (!) K-means algorithm discovers the correct clustering and thus has decision boundaries between the true clusters, that is, in low-density regions of the space. If K is too small, then the K-means algorithm has to group clusters together. In this situation, the cluster boundaries are still between true clusters, hence in a low-density region of the space.


原创粉丝点击
热门问题 老师的惩罚 人脸识别 我在镇武司摸鱼那些年 重生之率土为王 我在大康的咸鱼生活 盘龙之生命进化 天生仙种 凡人之先天五行 春回大明朝 姑娘不必设防,我是瞎子 20岁胸下垂松软怎么办 断奶时乳房有肿块怎么办 孩子断奶后乳房变小怎么办 断奶了月经不来怎么办 钥匙在门上拔不出来怎么办 钥匙拔不下来了怎么办 养了几天鱼死了怎么办 乌龟的眼睛肿了怎么办 手被鱼刺扎了怎么办 被鱼刺扎手肿了怎么办 手被桂鱼扎了怎么办 三岁宝宝卡鱼刺怎么办 一岁宝宝卡鱼刺怎么办 鱼刺卡在胸口了怎么办 婴儿被鱼刺卡了怎么办 幼儿被鱼刺卡到怎么办 鱼刺被吞下去了怎么办 喉咙卡到鱼刺下不去怎么办 被小鱼刺卡了怎么办 晚上被鱼刺卡到怎么办 一个小鱼刺卡了怎么办 卡了一个小鱼刺怎么办 鱼刺卡在气管里怎么办 刺蛾幼虫 蛰了怎么办 被杨树辣子蛰了怎么办 蜇了老子蜇了怎么办 被刺蛾幼虫蛰了怎么办 孕妇被蚊虫叮咬发痒怎么办 白掌叶子尖发黄怎么办 白掌叶子卷了怎么办 白掌叶子全软了怎么办? 发财树有黄斑了怎么办 幸福树叶子蔫了怎么办 幸福树枝条塌了怎么办? 幸福树叶子嫣了怎么办 毒蚊子叮咬肿硬怎么办 被蚊子咬了很痒怎么办 蚊子咬了脚肿了怎么办 小孩被蚊子咬了怎么办 小狗老喜欢咬人怎么办 狗狗喜欢咬手怎么办