Foundations and Trends® in Machine Learning
Vol. 2, No. 3 (2009) 235–274
© 2010 U. von Luxburg
DOI: 10.1561/2200000008

Clustering Stability: An Overview
By Ulrike von Luxburg
Contents

1 Introduction
2 Clustering Stability: Definition and Implementation
3 Stability Analysis of the K-Means Algorithm
  3.1 The Idealized K-Means Algorithm
  3.2 The Actual K-Means Algorithm
  3.3 Relationships between the results
4 Beyond K-Means
5 Outlook
References
Ulrike von Luxburg
Max Planck Institute for Biological Cybernetics, Tübingen, Germany
ulrike.luxburg@tuebingen.mpg.de
Abstract
A popular method for selecting the number of clusters is based on
stability arguments: one chooses the number of clusters such that the
corresponding clustering results are “most stable”. In recent years, a
series of papers has analyzed the behavior of this method from a
theoretical point of view. However, the results are very technical and
difficult to interpret for non-experts. In this monograph we give a
high-level overview of the existing literature on clustering stability.
In addition to presenting the results in a slightly informal but
accessible way, we relate them to each other and discuss their
different implications.
1 Introduction
Model selection is a difficult problem in non-parametric clustering. The
obvious reason is that, as opposed to supervised classification, there is
no ground truth against which we could “test” our clustering results.
One of the most pressing questions in practice is how to determine the
number of clusters. Various ad hoc methods have been suggested in
the literature, but none of them is entirely convincing. These methods
usually suffer from the fact that they implicitly have to define “what a
clustering is” before they can assign different scores to different
numbers of clusters. In recent years a new method has become increasingly
popular: selecting the number of clusters based on clustering stability.
Instead of defining “what is a clustering”, the basic philosophy is simply
that a clustering should be a structure on the data set that is “stable”.
That is, if applied to several data sets from the same underlying model
or data-generating process, a clustering algorithm should obtain similar
results. In this philosophy it is not so important how the clusters look
(this is taken care of by the clustering algorithm), but that they can be
constructed in a stable manner.
The basic intuition of why people believe that this is a good principle
can be described by Figure 1.1. Shown is a data distribution with four
underlying clusters (depicted by the black circles), and different samples
from this distribution (depicted by red diamonds).

[Fig. 1.1 Idea of clustering stability. Instable clustering solutions if the number of
clusters is too small (first row, k = 2) or too large (second row, k = 5), shown for two
samples (Sample 1, Sample 2). See text for details.]

If we cluster this
data set into K = 2 clusters, there are two reasonable solutions: a
horizontal and a vertical split. If a clustering algorithm is applied
repeatedly to different samples from this distribution, it might sometimes
construct the horizontal and sometimes the vertical solution. Obviously,
these two solutions are very different from each other, hence the
clustering results are instable. Similar effects take place if we start with
K = 5. In this case, we necessarily have to split an existing cluster into
two clusters, and depending on the sample this could happen to any
of the four clusters. Again the clustering solution is instable. Finally,
if we apply the algorithm with the correct number K = 4, we observe
stable results (not shown in the figure): the clustering algorithm always
discovers the correct clusters (maybe up to a few outlier points). In this
example, the stability principle detects the correct number of clusters.
At first glance, using stability-based principles for model selection
appears very attractive. It is elegant as it avoids having to define what a
good clustering is. It is a meta-principle that can be applied to any basic
clustering algorithm and does not require a particular clustering model.
Finally, it sounds “very fundamental” from a philosophy-of-inference
point of view.
However, the longer one thinks about this principle, the less obvious
it becomes that model selection based on clustering stability “always
works”. What is clear is that solutions that are completely instable
should not be considered at all. However, if there are several stable
solutions, is it always the best choice to select the one corresponding
to the most stable results? One could conjecture that the most stable
parameter always corresponds to the simplest solution, but clearly
there exist situations where the simplest solution is not what we
are looking for. To find out how model selection based on clustering
stability works we need theoretical results.
In this monograph we discuss a series of theoretical results on
clustering stability that have been obtained in recent years. In Section 2
we review different protocols for how clustering stability is computed
and used for model selection. In Section 3 we concentrate on theoretical
results for the K-means algorithm and discuss their various relations.
This is the main section of the paper. Results for more general clustering
algorithms are presented in Section 4.
2 Clustering Stability: Definition and Implementation
A clustering C_K of a data set S = {X_1, ..., X_n} is a function that
assigns labels to all points of S, that is, C_K : S → {1, ..., K}. Here K
denotes the number of clusters. A clustering algorithm is a procedure
that takes a set S of points as input and outputs a clustering of S.
The clustering algorithms considered in this monograph take an
additional parameter as input, namely the number K of clusters they are
supposed to construct. We analyze clustering stability in a statistical
setup. The data set S is assumed to consist of n data points X_1, ..., X_n
that have been drawn independently from some unknown underlying
distribution P on some space X. The final goal is to use these sample
points to construct a good partition of the underlying space X. For
some theoretical results it will be easier to ignore sampling effects and
directly work on the underlying space X endowed with the probability
distribution P. This can be considered as the case of having “infinitely
many” data points. We sometimes call this the limit case for n → ∞.

Assume we agree on a way to compute distances d(C, C′) between
different clusterings C and C′ (see below for details). Then, for a fixed
probability distribution P, a fixed number K of clusters, and a fixed
sample size n, the instability of a clustering algorithm is defined as the
expected distance between two clusterings C_K(S_n), C_K(S′_n) on different
data sets S_n, S′_n of size n, that is:

    Instab(K, n) := E[ d(C_K(S_n), C_K(S′_n)) ].    (2.1)

The expectation is taken with respect to the drawing of the two samples.
In practice, a large variety of methods has been devised to compute
stability scores and use them for model selection. On a very general
level they work as follows:

Given: a set S of data points, a clustering algorithm A that takes
the number k of clusters as input.

(1) For k = 2, ..., k_max:
    (a) Generate perturbed versions S_b (b = 1, ..., b_max) of the
        original data set (for example by subsampling or adding
        noise, see below).
    (b) For b = 1, ..., b_max: cluster the data set S_b with
        algorithm A into k clusters to obtain clustering C_b.
    (c) For b, b′ = 1, ..., b_max: compute pairwise distances
        d(C_b, C_b′) between these clusterings (using one of the
        distance functions described below).
    (d) Compute instability as the mean distance between
        clusterings C_b:

        Instab(k, n) = (1 / b_max²) Σ_{b,b′=1}^{b_max} d(C_b, C_b′).

(2) Choose the parameter k that gives the best stability, in the
    simplest case as follows:

        K := argmin_k Instab(k, n)

    (see below for more options).
This scheme gives a very rough overview of how clustering stability
can be used for model selection. In practice, many details have to be
taken into account, and they will be discussed in the next section.
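To make this protocol concrete, here is a minimal Python sketch (illustrative, not from the original text): it uses scikit-learn's KMeans as the base algorithm A, subsampling as the perturbation scheme, and one minus the adjusted Rand index as a stand-in for the clustering distance d; any of the distances discussed below could be substituted.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_rand_score

    def instability(S, k, b_max=20, subsample_frac=0.8, seed=0):
        """Instab(k, n): mean pairwise distance between clusterings of
        perturbed versions S_b of the data set S (steps (a)-(d) above)."""
        rng = np.random.default_rng(seed)
        n = len(S)
        m = int(subsample_frac * n)
        labelings = []  # labels per subsample clustering, -1 where undefined
        for _ in range(b_max):
            idx = rng.choice(n, size=m, replace=False)  # perturbed version S_b
            labels = KMeans(n_clusters=k, n_init=10).fit_predict(S[idx])
            full = np.full(n, -1)
            full[idx] = labels
            labelings.append(full)
        dists = []
        for b in range(b_max):
            for b2 in range(b + 1, b_max):
                # compare on the points contained in both subsamples
                common = (labelings[b] >= 0) & (labelings[b2] >= 0)
                dists.append(1.0 - adjusted_rand_score(labelings[b][common],
                                                       labelings[b2][common]))
        return float(np.mean(dists))

    def select_k(S, k_max=10):
        """Step (2), simplest rule: K := argmin_k Instab(k, n)."""
        return min(range(2, k_max + 1), key=lambda k: instability(S, k))

Calling select_k(S) on a data matrix S returns the number of clusters with the smallest raw instability; note that this ignores the normalization issues discussed later in this section.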
Finally, we want to mention an approach that is vaguely related to
clustering stability, namely the ensemble method [26]. Here, an ensemble
of algorithms is applied to one fixed data set. Then a final clustering
is built from the results of the individual algorithms. We are not going
to discuss this approach in our monograph.
Generating perturbed versions of the data set. To be able to
evaluate the stability of a fixed clustering algorithm we need to run
the clustering algorithm several times on slightly different data sets.
To this end we need to generate perturbed versions of the original data
set. In practice, the following schemes have been used:

• Draw a random subsample of the original data set without
  replacement [5, 12, 15, 17].
• Add random noise to the original data points [8, 19].
• If the original data set is high-dimensional, use different random
  projections in low-dimensional spaces, and then cluster the
  low-dimensional data sets [25].
• If we work in a model-based framework, sample data from the
  model [14].
• Draw a random sample of the original data with replacement.
  This approach has not been reported in the literature yet, but
  it avoids the problem of setting the size of the subsample. For
  good reasons, this kind of sampling is the standard in the
  bootstrap literature [11] and might also have advantages in
  the stability setting. This scheme requires that the algorithm
  can deal with weighted data points (because some data points
  will occur several times in the sample).
In all cases, there is a trade-off that has to be treated carefully. If we
change the data set too much (for example, the subsample is too small,
or the noise too large), then we might destroy the structure we want
to discover by clustering. If we change the data set too little, then the
clustering algorithm will always obtain the same results, and we will
observe trivial stability. It is hard to quantify this trade-off in practice.
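As an illustration of the schemes above, here is a small sketch (assuming NumPy arrays; the subsample fraction and noise level are arbitrary choices, subject to exactly the trade-off just described):

    import numpy as np

    rng = np.random.default_rng(0)

    def subsample(S, frac=0.8):
        # random subsample without replacement (the most common scheme)
        idx = rng.choice(len(S), size=int(frac * len(S)), replace=False)
        return S[idx]

    def jitter(S, sigma=0.05):
        # add isotropic Gaussian noise to the original data points
        return S + rng.normal(scale=sigma, size=S.shape)

    def bootstrap_weighted(S):
        # sample with replacement; return unique points plus multiplicity
        # weights, for algorithms that can handle weighted data points
        idx = rng.choice(len(S), size=len(S), replace=True)
        unique, counts = np.unique(idx, return_counts=True)
        return S[unique], counts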
Which clusterings to compare? Different protocols are used to
compare the clusterings on the different data sets S_b.

• Compare the clustering of the original data set with the
  clusterings obtained on subsamples [17].
• Compare clusterings of overlapping subsamples on the data
  points where both clusterings are defined [5].
• Compare clusterings of disjoint subsamples [12, 15]. Here we
  first need to apply an extension operator to extend each
  clustering to the domain of the other clustering.
Distances between clusterings. If two clusterings are defined on the
same data points, then it is straightforward to compute a distance score
between these clusterings based on any of the well-known clustering
distances such as the Rand index, Jaccard index, Hamming distance,
minimal matching distance, and variation of information distance [18].
All these distances count, in some way or the other, points or pairs of
points on which the two clusterings agree or disagree. The most
convenient choice from a theoretical point of view is the minimal matching
distance. For two clusterings C, C′ of the same data set of n points it is
defined as:

    d_MM(C, C′) := min_π (1/n) Σ_{i=1}^{n} 1{C(X_i) ≠ π(C′(X_i))},    (2.2)

where the minimum is taken over all permutations π of the K labels.
Intuitively, the minimal matching distance measures the same quantity
as the 0–1-loss used in supervised classification. For a stability study
involving the adjusted Rand index or an adjusted mutual information
index see Vinh and Epps [27].
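A direct implementation of Equation (2.2) would enumerate all K! permutations. A standard shortcut (an implementation choice of this sketch, not prescribed by the text) is to find the optimal permutation via the Hungarian algorithm on the contingency table; labels are assumed to be integers 0, ..., k−1:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def minimal_matching_distance(labels_a, labels_b, k):
        # d_MM(C, C'): fraction of points on which the two clusterings
        # disagree, minimized over all permutations of the k labels
        contingency = np.zeros((k, k), dtype=int)
        for a, b in zip(labels_a, labels_b):
            contingency[a, b] += 1
        # maximizing agreement = minimizing disagreement
        row, col = linear_sum_assignment(-contingency)
        return 1.0 - contingency[row, col].sum() / len(labels_a)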
If two clusterings are defined on different data sets one has two
choices. If the two data sets have a big overlap one can use a restriction
operator to restrict the clusterings to the points that are contained in
both data sets. On this restricted set one can then compute a standard
distance between the two clusterings. The other possibility is to use
an extension operator to extend both clusterings from their domain to
the domain of the other clustering. Then one can compute a standard
distance between the two clusterings as they are now both defined
on the joint domain. For center-based clusterings, as constructed by
the K-means algorithm, a natural extension operator exists. Namely,
to a new data point we simply assign the label of the closest cluster
center. A more general scheme to extend an existing clustering to new
data points is to train a classifier on the old data points and use its
predictions as labels on the new data points. However, in the context
of clustering stability it is not obvious what kind of bias we introduce
with this approach.
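For the center-based case the natural extension operator is essentially a one-liner; a minimal NumPy sketch:

    import numpy as np

    def extend_clustering(centers, new_points):
        # natural extension operator for center-based clusterings:
        # assign each new point the label of its closest cluster center
        d2 = ((new_points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)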
Stability scores and their normalization. The stability protocol
outlined above results in a set of distance values (d(C_b, C_b′))_{b,b′=1,...,b_max}.
In most approaches, one summarizes these values by taking their mean:

    Instab(k, n) = (1 / b_max²) Σ_{b,b′=1}^{b_max} d(C_b, C_b′).

Note that the mean is the simplest summary statistic one can compute
based on the distance values d(C_b, C_b′). A different approach is to use
the area under the cumulative distribution function of the distance values
as the stability score; see Ben-Hur et al. [5] or Bertoni and Valentini [6]
for details. In principle one could also come up with more elaborate
statistics based on the distance values. To the best of our knowledge, such
concepts have not been used so far.
The simplest way to select the number K of clusters is to minimize
the instability:

    K = argmin_{k=2,...,k_max} Instab(k, n).

This approach has been suggested in Levine and Domany [17]. However,
an important fact to note is that Instab(k, n) trivially scales with k,
regardless of what the underlying data structure is. For example, in
the top left plot in Figure 2.1 we can see that even for a completely
unclustered data set, Instab(k, n) increases with k.

[Fig. 2.1 Normalized stability scores. Left plots: data points from a uniform density on
[0, 1]². Right plots: data points from a mixture of four well-separated Gaussians in R².
The first row shows the unnormalized instability Instab for K = 2, ..., 15, the second row
the instability Instab_null obtained on a reference distribution (uniform distribution),
and the third row the normalized stability Instab_norm.]

When using stability for model selection, one should correct for this
trivial scaling of Instab, otherwise it might be meaningless to take the
minimum afterwards. There exist several different normalization protocols:
• Normalization using a reference null distribution [6, 12]. One
  repeatedly samples data sets from some reference null distribution.
  Such a distribution is defined on the same domain as the data
  points, but does not possess any cluster structure. In simple cases
  one can use the uniform distribution on the data domain as null
  distribution. A more practical approach is to scramble the
  individual dimensions of the existing data points and use the
  “scrambled points” as null distribution (see [6, 12] for details).
  Once we have drawn several data sets from the null distribution,
  we cluster them using our clustering algorithm and compute the
  corresponding stability score Instab_null as above. The normalized
  stability is then defined as Instab_norm := Instab/Instab_null
  (a code sketch follows after this list).

• Normalization by random labels [15]. First, we cluster each of the
  data sets S_b as in the protocol above to obtain the clusterings C_b.
  Then, we randomly permute these labels. That is, we assign to data
  point X_i the label that belonged to X_π(i), where π is a
  permutation of {1, ..., n}. This leads to a permuted clustering
  C_b,perm. We then compute the stability score Instab as above, and
  similarly we compute Instab_perm for the permuted clusterings. The
  normalized stability is then defined as
  Instab_norm := Instab/Instab_perm.
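As an illustration, here is a hedged sketch of the first normalization protocol (reference null distribution obtained by scrambling dimensions), reusing the instability function sketched earlier in this section; the number of null data sets is an arbitrary choice:

    import numpy as np

    def scramble_dimensions(S, seed=0):
        # null distribution via scrambling: permute each coordinate of the
        # data independently, destroying cluster structure while keeping
        # the marginal distributions
        rng = np.random.default_rng(seed)
        S_null = S.copy()
        for j in range(S.shape[1]):
            rng.shuffle(S_null[:, j])
        return S_null

    def normalized_instability(S, k, n_null=5):
        # Instab_norm := Instab / Instab_null, with Instab_null averaged
        # over several scrambled null data sets
        instab = instability(S, k)
        instab_null = np.mean([instability(scramble_dimensions(S, seed=s), k)
                               for s in range(n_null)])
        return instab / instab_null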
Once we have computed the normalized stability scores Instab_norm we can
choose the number of clusters that has the smallest normalized instability,
that is:

    K = argmin_{k=2,...,k_max} Instab_norm(k, n).

This approach has been taken for example in Ben-Hur et al. [5] and
Lange et al. [15].
Selecting K based on statistical tests. A second approach to select
the final number of clusters is to use a statistical test. Similarly to
the normalization considered above, the idea is to compute stability
scores not only on the actual data set, but also on “null data sets”
drawn from some reference null distribution. Then one tests whether,
for a given parameter k, the stability on the actual data is significantly
larger than the one computed on the null data. If there are several
values k for which this is the case, then one selects the one that is most
significant. The most well-known implementation of such a procedure
uses bootstrap methods [12]. Other authors use a χ²-test [6] or a test
based on the Bernstein inequality [7].
To summarize, there are many different implementations for selecting
the number K of clusters based on stability scores. Until now,
there does not exist any convincing empirical study that thoroughly
compares all these approaches on a variety of data sets. In my opinion,
even fundamental issues such as the normalization have not been
investigated in enough detail. For example, in my experience
normalization often has no effect whatsoever (but I did not conduct a
thorough study either). To put stability-based model selection on a firm
ground it would be crucial to compare the different approaches with each
other in an extensive case study.
3 Stability Analysis of the K-Means Algorithm
The vast majority of papers about clustering stability use the K-means
algorithm as the basic clustering algorithm. In this section we discuss the
stability results for the K-means algorithm in depth. Later, in Section 4,
we will see how these results can be extended to other clustering
algorithms.
For simpler reference we briefly recapitulate the K-means algorithm
(details can be found in many text books, for example [13]). Given a set
of n data points X_1, ..., X_n ∈ R^d and a fixed number K of clusters to
construct, the K-means algorithm attempts to minimize the clustering
objective function:

    Q_K^(n)(c_1, ..., c_K) = (1/n) Σ_{i=1}^{n} min_{k=1,...,K} ||X_i − c_k||²,    (3.1)

where c_1, ..., c_K denote the centers of the K clusters. In the limit
n → ∞, the K-means clustering is the one that minimizes the limit
objective function:

    Q_K^(∞)(c_1, ..., c_K) = ∫ min_{k=1,...,K} ||X − c_k||² dP(X),    (3.2)

where P is the underlying probability distribution.
Given an initial set c^(0) = {c_1^(0), ..., c_K^(0)} of centers, the K-means
algorithm iterates the following two steps until convergence:

(1) Assign data points to closest cluster centers:

    ∀ i = 1, ..., n:  C^(t)(X_i) := argmin_{k=1,...,K} ||X_i − c_k^(t)||.

(2) Re-adjust cluster means:

    ∀ k = 1, ..., K:  c_k^(t+1) := (1/N_k) Σ_{i : C^(t)(X_i)=k} X_i,

where N_k denotes the number of points in cluster k.
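For reference, here is a minimal NumPy sketch of these two iteration steps (Lloyd's algorithm); initialization by randomly chosen data points is just one common choice:

    import numpy as np

    def kmeans(X, K, n_iter=100, seed=0):
        # plain Lloyd iterations for the objective Q_K^(n) in Equation (3.1)
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=K, replace=False)]  # c^(0)
        for _ in range(n_iter):
            # step (1): assign each point to its closest center
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # step (2): re-adjust each center to the mean of its cluster
            new_centers = np.array([X[labels == k].mean(axis=0)
                                    if np.any(labels == k) else centers[k]
                                    for k in range(K)])
            if np.allclose(new_centers, centers):
                break  # converged, typically to a local optimum of Q_K^(n)
            centers = new_centers
        return centers, labels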
It is well known that, in general, the K-means algorithm terminates
in a local optimum of Q_K^(n) and does not necessarily find the global
optimum. We study the K-means algorithm in two different scenarios:

The idealized scenario: Here we assume an idealized algorithm that
always finds the global optimum of the K-means objective function
Q_K^(n). For simplicity, we call this algorithm the idealized K-means
algorithm.

The realistic scenario: Here we analyze the actual K-means
algorithm as described above. In particular, we take into account its
property of getting stuck in local optima. We also take into account
the initialization of the algorithm.
In both scenarios, our theoretical investigations are based on the
following simple protocol to compute the stability of the K-means
algorithm:

(1) We assume to have access to as many independent samples of
    size n of the underlying distribution as we want. That is, we
    ignore artifacts introduced by the fact that in practice we draw
    subsamples of one fixed, given sample and thus might introduce
    a bias.

(2) As distance between two K-means clusterings of two samples
    S, S′ we use the minimal matching distance between the
    extended clusterings on the domain S ∪ S′.

(3) We work with the expected minimal matching distance as in
    Equation (2.1), that is, we analyze Instab rather than its
    empirical counterpart used in practice. This does not do much
    harm as instability scores are highly concentrated around their
    means anyway.
3.1 The Idealized K-Means Algorithm

In this section we focus on the idealized K-means algorithm, that is,
the algorithm that always finds the global optimum c^(n) of the K-means
objective function:

    c^(n) := (c_1^(n), ..., c_K^(n)) := argmin_c Q_K^(n)(c).
3.1.1 First Convergence Result and the Role of Symmetry
The starting point for the results in this section is the following
observation [4]. Consider the situation in Figure 3.1a. Here the data
contains three clusters, but two of them are closer to each other than to
the third cluster. Assume we run the idealized K-means algorithm with
K = 2 on such a data set. Separating the left two clusters from the right
cluster (solid line) leads to a much better value of Q_K^(n) than, say,
separating the top two clusters from the bottom one (dashed line). Hence,
as soon as we have a reasonable amount of data, idealized (!) K-means
with K = 2 always constructs the first solution (solid line). Consequently,
it is stable in spite of the fact that K = 2 is the wrong number of
clusters. Note that this would not happen if the data set was symmetric,
as depicted in Figure 3.1b. Here neither the solution depicted by the
dashed line nor the one with the solid line is clearly superior, which
leads to instability if the idealized K-means algorithm is applied to
different samples. Similar examples can be constructed to detect that
K is too large, see Figures 3.1c and d. With K = 3 it is clearly the best
solution to split the big cluster in Figure 3.1c, thus clustering this data
set is stable. In Figure 3.1d, however, due to symmetry reasons neither
splitting the top nor the bottom cluster leads to a clear advantage.
Again this leads to instability.
[Fig. 3.1 If data sets are not symmetric, idealized K-means is stable even if the number K
of clusters is too small (a) or too large (c). Instability of the wrong number of clusters
only occurs in symmetric data sets (b and d).]
These informal observations suggest that unless the data set contains
perfect symmetries, the idealized K-means algorithm is stable even if K
is wrong. This can be formalized with the following theorem.

Theorem 3.1 (Stability and global optima of the objective
function). Let P be a probability distribution on R^d and Q_K^(∞) the
limit K-means objective function as defined in Equation (3.2), for some
fixed value K > 1.

(1) If Q_K^(∞) has a unique global minimum, then the idealized
    K-means algorithm is perfectly stable when n → ∞, that is:

        lim_{n→∞} Instab(K, n) = 0.

(2) If Q_K^(∞) has several global minima (for example, because the
    probability distribution is symmetric), then the idealized
    K-means algorithm is instable, that is:

        lim_{n→∞} Instab(K, n) > 0.

This theorem has been proved (in a slightly more general setting) in
references [2, 4].
Proof sketch, Part 1. It is well known that if the objective function
Q_K^(∞) has a unique global minimum, then the centers c^(n) constructed
by the idealized K-means algorithm on a sample of n points almost
surely converge to the true population centers c^(∗) as n → ∞ [20]. This
means that given some ε > 0 we can find some large n such that c^(n) is
ε-close to c^(∗) with high probability. As a consequence, if we compare
two clusterings on different samples of size n, the centers of the two
clusterings are at most 2ε-close to each other. Finally, one can show that
if the cluster centers of two clusterings are ε-close, then their minimal
matching distance is small as well. Thus, the expected distance between
the clusterings constructed on two samples of size n becomes arbitrarily
small and eventually converges to 0 as n → ∞.
Part 2. For simplicity, consider the symmetric situation in Figure 3.1b.
Here the probability distribution has three axes of symmetry. For K = 2
the objective function Q_2^(∞) has three global minima c^(∗1), c^(∗2), c^(∗3)
corresponding to the three symmetric solutions. In such a situation, the
idealized K-means algorithm on a sample of n points gets arbitrarily
close to one of the global optima, that is,
min_{i=1,...,3} d(c^(n), c^(∗i)) → 0 [16]. In particular, the sequence
(c^(n))_n of empirical centers has three convergent subsequences, each of
which converges to one of the global solutions. One can easily conclude
that if we compare two clusterings on random samples, with probability
1/3 they belong to “the same subsequence” and thus their distance will
become arbitrarily small. With probability 2/3 they “belong to different
subsequences”, and thus their distance remains larger than a constant
a > 0. From the latter we can conclude that Instab(K, n) is always
larger than 2a/3.
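This instability under symmetry is easy to observe numerically. The following toy simulation (my own illustration, not from the monograph) draws independent samples from a rotationally symmetric mixture of three Gaussians, approximates the idealized K-means algorithm by many restarts, and measures how far apart the K = 2 center configurations of two samples are; the average distance stays bounded away from zero:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)

    def sample_symmetric_mixture(n):
        # three Gaussian clusters at the corners of an equilateral triangle:
        # a symmetric distribution, so Q_2 has several global minima
        angles = 2 * np.pi * np.arange(3) / 3
        centers = 4 * np.stack([np.cos(angles), np.sin(angles)], axis=1)
        comp = rng.integers(0, 3, size=n)
        return centers[comp] + rng.normal(scale=0.5, size=(n, 2))

    dists = []
    for _ in range(20):
        # many restarts as a stand-in for the idealized (global) algorithm
        km1 = KMeans(n_clusters=2, n_init=50).fit(sample_symmetric_mixture(500))
        km2 = KMeans(n_clusters=2, n_init=50).fit(sample_symmetric_mixture(500))
        c1, c2 = km1.cluster_centers_, km2.cluster_centers_
        # distance between the two center sets, up to the label ordering
        dists.append(min(np.linalg.norm(c1 - c2), np.linalg.norm(c1 - c2[::-1])))
    print(np.mean(dists))  # stays clearly positive: instability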
The interpretation of this theorem is distressing. The stability or
instability of parameter K does not depend on whether K is “correct”
or “wrong”, but only on whether the K-means objective function for
this particular value K has one or several global minima. However, the
number of global minima is usually not related to the number of
clusters, but rather to the fact that the underlying probability
distribution has symmetries. In particular, if we consider “natural” data
distributions, such distributions are rarely perfectly symmetric.
Consequently, the corresponding functions Q_K^(∞) usually have only one
global minimum, for any value of K. In practice this means that for a
large sample size n, the idealized K-means algorithm is stable for any
value of K. This seems to suggest that model selection based on
clustering stability does not work. However, we will see later in
Section 3.3 that this result is essentially an artifact of the idealized
clustering setting and does not carry over to the realistic setting.
3.1.2 Refined Convergence Results for the Case of a Unique Global Minimum
Above we have seen that if, for a particular distribution P and a
particular value K, the objective function Q_K^(∞) has a unique global
minimum, then the idealized K-means algorithm is stable in the sense
that lim_{n→∞} Instab(K, n) = 0. At first glance, this seems to suggest
that stability cannot distinguish between different values k_1 and k_2 (at
least for large n). However, this point of view is too simplistic. It can
happen that even though both Instab(k_1, n) and Instab(k_2, n) converge
to 0 as n → ∞, this happens “faster” for k_1 than for k_2. If measured
relative to the absolute values of Instab(k_1, n) and Instab(k_2, n), the
difference between Instab(k_1, n) and Instab(k_2, n) can still be large
enough to be “significant”.

The key in verifying this intuition is to study the limit process
more closely. This line of work has been established by Shamir and
Tishby in a series of papers [22, 23, 24]. The main idea is that instead
of studying the convergence of Instab(k, n) one needs to consider the
rescaled instability √n · Instab(k, n). One can prove that the rescaled
instability converges in distribution, and the limit distribution depends
on k. In particular, the means of the limit distributions are different
for different values of k. This can be formalized as follows.
Theorem 3.2 (Convergence of rescaled stability). Assume that
the probability distribution P has a density p. Consider a fixed
parameter K, and assume that the corresponding limit objective function
Q_K^(∞) has a unique global minimum c^(∗) = (c_1^(∗), ..., c_K^(∗)). The
boundary between clusters i and j is denoted by B_ij. Let m ∈ N, and let
S_{n,1}, ..., S_{n,2m} be samples of size n drawn independently from P.
Let C_K(S_{n,i}) be the result of the idealized K-means clustering on
sample S_{n,i}. Compute the instability as the mean distance between
clusterings of disjoint pairs of samples, that is:

    Instab(K, n) := (1/m) Σ_{i=1}^{m} d_MM(C_K(S_{n,2i−1}), C_K(S_{n,2i})).    (3.3)

Then, as n → ∞ and m → ∞, the rescaled instability √n · Instab(K, n)
converges in probability to

    RInstab(K) := Σ_{1≤i<j≤K} ∫_{B_ij} ( V_ij / ||c_i^(∗) − c_j^(∗)|| ) p(x) dx,    (3.4)

where V_ij stands for a term describing the asymptotics of the random
fluctuations of the cluster boundary between cluster i and cluster j
(the exact formula is given in [23, 24]).

Note that even though the definition of instability in Equation (3.3)
differs slightly from the definition in Equation (2.1), intuitively it
measures the same quantity. The definition in Equation (3.3) just has the
technical advantage that all pairs of samples are independent from one
another.
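Here is a hedged sketch of the estimator in Equation (3.3), including the rescaling by √n; the extension to the joint domain uses the closest-center operator from Section 2, sample(n) stands for an assumed routine drawing a fresh sample from P, and minimal_matching_distance is the sketch from Section 2:

    import numpy as np
    from sklearn.cluster import KMeans

    def rescaled_instability(sample, K, n, m=50):
        # sqrt(n) * Instab(K, n), with Instab estimated over m disjoint
        # pairs of samples as in Equation (3.3)
        dists = []
        for _ in range(m):
            S1, S2 = sample(n), sample(n)
            km1 = KMeans(n_clusters=K, n_init=20).fit(S1)
            km2 = KMeans(n_clusters=K, n_init=20).fit(S2)
            # extend both clusterings to the joint domain S1 ∪ S2 by
            # assigning each point to its closest center
            joint = np.vstack([S1, S2])
            dists.append(minimal_matching_distance(
                km1.predict(joint), km2.predict(joint), K))
        return np.sqrt(n) * np.mean(dists)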
Proof sketch. It is well known that if Q_K^(∞) has a unique global
minimum, then the centers constructed by the idealized K-means
algorithm on a finite sample satisfy a central limit theorem [21]. That is,
if we rescale the distances between the sample-based centers and the
true centers by the factor √n, these rescaled distances converge to a
normal distribution as n → ∞. When the cluster centers converge, the
same can be said about the cluster boundaries. In this case, instability
essentially counts how many points change side when the cluster
boundaries move by some small amount. The points that potentially
change side are the points close to the boundary of the true limit
clustering. Counting these points is what the integrals
∫_{B_ij} ... p(x) dx in the definition of RInstab take care of. The exact
characterization of how the cluster boundaries “jitter” can be derived
from the central limit theorem. This leads to the term
V_ij / ||c_i^(∗) − c_j^(∗)|| in the integral. V_ij characterizes how the
cluster centers themselves “jitter”. The normalization by
||c_i^(∗) − c_j^(∗)|| is needed to transform jittering of cluster centers
into jittering of cluster boundaries: if two cluster centers are very far
apart from each other, the cluster boundary only jitters by a small
amount if the centers move by ε, say. However, if the centers are very
close to each other (say, they have distance 3ε), then moving the centers
by ε has a large impact on the cluster boundary. The details of this
proof are very technical; we refer the interested reader to
references [23, 24].
Let us briefly explain how the result in Theorem 3.2 is compatible
with the result in Theorem 3.1. On a high level, the difference between
both results resembles the difference between the law of large numbers
and the central limit theorem in probability theory. The LLN studies
the convergence of the mean of a sum of random variables to its
expectation (note that Instab has the form of a sum of random
variables). The CLT is concerned with the same expression, but rescaled
by a factor √n. For the rescaled sum, the CLT then gives results
on the convergence in distribution. Note that in the particular case of
instability, the distribution of distances lives on the non-negative
numbers only. This is why the rescaled instability in Theorem 3.2 is
positive and not 0 as in the limit of Instab in Theorem 3.1. A toy figure
explaining the different convergence processes can be seen in Figure 3.2.

Theorem 3.2 tells us that different parameters k usually lead to
different rescaled stabilities in the limit for n → ∞. Thus we can hope
that if the sample size n is large enough we can distinguish between
different values of k based on the stability of the corresponding
clusterings. An important question is now which values of k lead to
stable and which ones lead to instable results, for a given distribution P.
3.1.3 Characterizing Stable Clusterings
It is a straightforward consequence of Theorem 3.2 that if we consider
different values k_1 and k_2 and the clustering objective functions
Q_{k_1}^(∞) and Q_{k_2}^(∞) have unique global minima, then the rescaled
stability values RInstab(k_1) and RInstab(k_2) can differ from each other.
Now we want to investigate which values of k lead to high stability and
which ones lead to low stability.

Conclusion 3.3 (Instable clusterings). Assume that Q_K^(∞) has a
unique global optimum. If Instab(K, n) is large, the idealized K-means
clustering tends to have cluster boundaries in high-density regions of
the space.

[Fig. 3.2 Different convergence processes. The left column shows the convergence studied
in Theorem 3.1: as the sample size n → ∞, the distribution of distances d_MM(C, C′) is
degenerate, all mass is concentrated on 0. The right column shows the convergence studied
in Theorem 3.2: the rescaled distances converge to a non-trivial distribution, and its mean
(depicted by the cross) is positive. To go from the left to the right side one has to rescale
by √n.]
There exist two different derivations of this conclusion, which have
been obtained independently from each other by completely different
methods [3, 22]. On a high level, the reason why the conclusion tends
to hold is that if cluster boundaries jitter in a region of high density,
then more points “change side” than if the boundaries jitter in a region
of low density.

First derivation, informal, based on references [22, 24]. Assume that
n is large enough such that we are already in the asymptotic regime
(that is, the solution c^(n) constructed on the finite sample is close to
the true population solution c^(∗)). Then the rescaled instability computed
on the sample is close to the expression given in Equation (3.4). If the
cluster boundaries B_ij lie in a high-density region of the space, then
the integral in Equation (3.4) is large, compared to a situation where
the cluster boundaries lie in low-density regions of the space. From a
high-level point of view, this justifies the conclusion above. However,
note that it is difficult to identify how exactly the quantities p, B_ij,
and V_ij influence RInstab, as they are not independent of each other.
Second derivation, more formal, based on Ben-David and von
Luxburg [3]. A formal way to prove the conclusion is as follows. We
introduce a new distance d_boundary between two clusterings. This
distance measures how far the cluster boundaries of two clusterings are
apart from each other. One can prove that the K-means quality function
Q_K^(∞) is continuous with respect to this distance function. This
means that if two clusterings C, C′ are close with respect to d_boundary,
then they have similar quality values. Moreover, if Q_K^(∞) has a unique
global optimum, we can invert this argument and show that if a
clustering C is close to the optimal limit clustering C^∗, then the
distance d_boundary(C, C^∗) is small. Now consider the clustering C^(n)
based on a sample of size n. One can prove the following key statement.
If C^(n) converges uniformly (over the space of all probability
distributions) in the sense that with probability at least 1 − δ we have
d_boundary(C^(n), C) ≤ γ, then:

    Instab(K, n) ≤ 2δ + P(T_γ(B)).    (3.5)

Here P(T_γ(B)) denotes the probability mass of a tube of width γ
around the cluster boundaries B of C. Results in [1] establish the
uniform convergence of the idealized K-means algorithm. This proves the
conclusion: Equation (3.5) shows that if Instab is high, then there is a
lot of mass around the cluster boundaries, that is, the cluster boundaries
are in a region of high density.
For stable clusterings, the situation is not as simple. It is tempting
to make the following conjecture.

Conjecture 3.4 (Stable clusterings). Assume that Q_K^(∞) has a
unique global optimum. If Instab(K, n) is “small”, the idealized
K-means clustering tends to have cluster boundaries in low-density
regions of the space.

Argument in favor of the conjecture: As in the first approach above,
considering the limit expression of RInstab reveals that if the cluster
boundary lies in a low-density area of the space, then the integral in
RInstab tends to have a low value. In the extreme case where the cluster
boundaries go through a region of zero density, the rescaled instability
is even 0.

Argument against the conjecture: counter-examples! One can
construct artificial examples where clusterings are stable although their
decision boundary lies in a high-density region of the space [3]. The
way to construct such examples is to ensure that the variations of the
cluster centers happen in parallel to the cluster boundaries and not
orthogonal to them. In this case, the sampling variation does not lead
to jittering of the cluster boundary, hence the result is rather stable.

These counter-examples show that Conjecture 3.4 cannot be true in
general. However, my personal opinion is that the counter-examples are
rather artificial, and that similar situations will rarely be encountered
in practice. I believe that the conjecture “tends to hold” in practice.
It might be possible to formalize this intuition by proving that the
statement of the conjecture holds on a subset of “nice” and “natural”
probability distributions.
The important consequence of Conclusion 3.3 and Conjecture 3.4
(if true) is the following.

Conclusion 3.5 (Stability of idealized K-means detects whether
K is too large). Assume that the underlying distribution P has K
well-separated clusters, and assume that these clusters can be
represented by a center-based clustering model. Then the following
statements tend to hold for the idealized K-means algorithm.

(1) If K is too large, then the clusterings obtained by the idealized
    K-means algorithm tend to be instable.
(2) If K is correct or too small, then the clusterings obtained by
    the idealized K-means algorithm tend to be stable (unless
    the objective function has several global minima, for example
    due to symmetries).
Given Conclusion 3.3 and Conjecture 3.4 it is easy to see why
Conclusion 3.5 is true. If K is larger than the correct number of clusters,
one necessarily has to split a true cluster into several smaller clusters.
The corresponding boundary goes through a region of high density (the
cluster which is being split). According to Conclusion 3.3 this leads to
instability. If K is correct, then the idealized (!) K-means algorithm
discovers the correct clustering and thus has decision boundaries between
the true clusters, that is, in low-density regions of the space. If K is
too small, then the K-means algorithm has to group clusters together.
In this situation, the cluster boundaries are still between true clusters,
hence in a low-density region of the space.