数据挖掘中的聚类分析


Notes on Cluster Analysis in Data Mining, a Coursera course taught by the data-mining expert Prof. Jiawei Han.

Week 1

-Considerations for Cluster Analysis

  • partitioning criteria (single level vs. hierarchical partitioning)
  • separation of clusters (exclusive vs. non-exclusive [e.g.: one doc may belong to more than one class])
  • similarity measure (distance-based vs. connectivity-based [e.g., density or contiguity])
  • clustering space (full space [e.g., often when low dimensional] vs. subspace [e.g., often in high-dimensional clustering])

Four issues:
-Quality

  • deal with different types of attributes: numerical, categorical, text, multimedia, networks, and mixture of multiple types
  • clusters with arbitrary shape
  • deal with noisy data

-Scalability

  • clustering all the data instead of only samples
  • high dimensionality
  • incremental or stream clustering and insensitivity to input order

-Constraint-based clustering

  • user-given preferences or constraints

-Interpretability and usability

Cluster Analysis Categorization:
-Technique-centered

  • distance-based
  • density-based and grid-based methods
  • probabilistic and generative models
  • leveraging dimensionality reduction methods
  • high-dimensional clustering
  • scalable techniques for cluster analysis

-Data type-centered

  • clustering numerical data, categorical data, text, multimedia, time-series data, sequences, stream data, networked data, uncertain data.

-Additional insight-centered

  • visual insights, semi-supervised, ensemble-based, validation-based.

Typical Clustering Methods:
-Distance-based

  • partitioning algo.: k-means, k-medians, k-medoids
  • hierarchical algo.: agglomerative vs. divisive method

-Density-based and grid-based

  • density-based: form dense regions at a fine level of granularity, then post-process to merge the dense regions into clusters of arbitrary shape
  • grid-based: individual regions are formed into a grid-like structure

-Probabilistic and generative models

  • Assume a specific form of the generative model (e.g., a mixture of Gaussians)
  • Model parameters are estimated with EM algo.
  • Then estimate the generative probability of the underlying data points.

-High-dimensional clustering

  • subspace clustering (bottom-up, top-down, correlation-based method vs. δ-cluster method)
  • dimensionality reduction (co-clustering [column reduction]: PLSI, LDA, NMF, spectral clustering)

Lecture2:
Good clustering:

  • High intra-class similarity (Cohesive)
  • Low inter-class similarity (Distinctive between clusters)

proximity: similarity or dissimilarity

-Dissimilarity Matrix

  • triangular matrix (symmetric)
  • distance functions are usually different for different types of data

-Distance on numeric data: Minkowski Distance

  • A popular distance measure:
    d(i,j) = \sqrt[p]{|x_{i1}-x_{j1}|^p + |x_{i2}-x_{j2}|^p + \cdots + |x_{il}-x_{jl}|^p}
    where i = (x_{i1}, x_{i2}, \dots, x_{il}) and j = (x_{j1}, x_{j2}, \dots, x_{jl}) are two l-dimensional data objects and p is the order (this distance is also commonly called the L_p norm). A small numeric sketch follows this list.

  • Property:
    positivity; symmetry; triangle inequality.

  • p=1: Manhattan (or city block) distance

  • p=2: Euclidean distance
  • p → ∞: “supremum” distance
    In this case,
    d(i,j) = \lim_{p \to \infty} \sqrt[p]{|x_{i1}-x_{j1}|^p + |x_{i2}-x_{j2}|^p + \cdots + |x_{il}-x_{jl}|^p} = \max_{f=1,\dots,l} |x_{if}-x_{jf}|
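
A minimal numeric sketch of the Minkowski family (Manhattan, Euclidean, and supremum distances), assuming plain Python lists as the l-dimensional objects; the function name minkowski_distance is ours, not from the lecture.

```python
def minkowski_distance(x, y, p):
    """L_p distance between two l-dimensional points x and y."""
    if p == float("inf"):                              # supremum distance
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

i = [1.0, 2.0, 3.0]
j = [4.0, 0.0, 3.0]
print(minkowski_distance(i, j, 1))             # Manhattan: |1-4| + |2-0| + |3-3| = 5.0
print(minkowski_distance(i, j, 2))             # Euclidean: sqrt(9 + 4 + 0) ≈ 3.606
print(minkowski_distance(i, j, float("inf")))  # supremum: max(3, 2, 0) = 3.0
```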

-Proximity measure for binary attribute
Draw a contingency table for binary data.

  • for symmetric binary variables: d(i,j) = \frac{r+s}{q+r+s+t}
  • for asymmetric binary variables: d(i,j) = \frac{r+s}{q+r+s}
  • Jaccard coefficient (similarity measure): sim_{Jaccard}(i,j) = \frac{q}{q+r+s}, which is computed in the same way as coherence(i,j). A sketch follows this list.
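
A sketch of the binary-attribute measures above, assuming two 0/1 vectors; q, r, s, t are the contingency-table counts and the helper name binary_measures is ours.

```python
def binary_measures(x, y):
    """Contingency counts and the three measures for two 0/1 vectors."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)  # both 1
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)  # 1 in x, 0 in y
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)  # 0 in x, 1 in y
    t = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)  # both 0
    d_symmetric  = (r + s) / (q + r + s + t)
    d_asymmetric = (r + s) / (q + r + s)   # 0/0 matches (t) carry no information
    sim_jaccard  = q / (q + r + s)
    return d_symmetric, d_asymmetric, sim_jaccard

# e.g. two patients' test results (1 = positive, 0 = negative)
print(binary_measures([1, 0, 1, 0, 0, 0], [1, 0, 0, 0, 0, 1]))
```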

-Proximity measure for categorical attribute

  • simple matching: d(i,j) = \frac{p-m}{p}, where m is the number of matching attributes and p is the total number of attributes (a sketch follows this list)
  • or use a large number of binary variables: create one binary attribute for each of the nominal states
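
A small sketch of simple matching on nominal attributes, assuming tuples of category labels; p is the total number of attributes, m the number of matches, and the function name is ours.

```python
def simple_matching_distance(x, y):
    """d(i, j) = (p - m) / p for two nominal attribute vectors."""
    p = len(x)                                   # total number of attributes
    m = sum(1 for a, b in zip(x, y) if a == b)   # number of matching attributes
    return (p - m) / p

print(simple_matching_distance(("red", "round", "small"),
                               ("red", "square", "small")))  # (3 - 2) / 3 ≈ 0.333
```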

-Proximity measure for ordinal attribute
Replace x_{if} by its rank r_{if}, map the rank onto [0, 1] via z_{if} = \frac{r_{if}-1}{M_f-1} (where M_f is the number of ordered states of attribute f), and then treat z_{if} as interval-scaled.
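
A sketch of the rank-normalization step for one ordinal attribute, assuming the ordered list of states is known; the helper name is ours.

```python
def ordinal_to_interval(values, ordered_states):
    """Map ordinal values onto [0, 1] via their ranks: z = (r - 1) / (M - 1)."""
    M = len(ordered_states)
    rank = {state: i + 1 for i, state in enumerate(ordered_states)}  # ranks 1..M
    return [(rank[v] - 1) / (M - 1) for v in values]

# e.g. grades treated as an ordinal attribute
print(ordinal_to_interval(["fair", "excellent", "good"],
                          ordered_states=["fair", "good", "excellent"]))  # [0.0, 1.0, 0.5]
```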

-Attributes of mixed type

  • a dataset may contain all attribute types: nominal, symmetric binary, asymmetric binary, numeric, and ordinal;
  • use a weighted formula to combine their effects (see the sketch below):
    d(i,j) = \frac{\sum_{f=1}^{p} w_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} w_{ij}^{(f)}}
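
A sketch of the weighted combination for mixed-type attributes, assuming the per-attribute distances d_ij^(f) and indicator weights w_ij^(f) (e.g., w = 0 when a value is missing) have already been computed; the names are ours.

```python
def mixed_type_distance(per_attribute_d, weights):
    """d(i, j) = sum_f w_f * d_f / sum_f w_f over the p attributes."""
    return sum(w * d for w, d in zip(weights, per_attribute_d)) / sum(weights)

# three attributes (numeric, binary, ordinal); the middle value is missing, so w = 0
print(mixed_type_distance([0.25, 1.0, 0.5], [1, 0, 1]))  # (0.25 + 0.5) / 2 = 0.375
```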

-Covariance for two variables
Covariance between two variables X_1 and X_2:
\sigma_{12} = E[X_1 X_2] - E[X_1]\,E[X_2]

-Correlation coefficient
\rho_{12} = \frac{\sigma_{12}}{\sqrt{\sigma_1^2 \sigma_2^2}}

-Covariance matrix
\Sigma = E[(X-\mu)(X-\mu)^T] = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{pmatrix}
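
A numpy sketch of the covariance and correlation formulas above; np.cov and np.corrcoef are standard numpy routines, and the small sample is made up for illustration.

```python
import numpy as np

x1 = np.array([2.0, 4.0, 6.0, 8.0])
x2 = np.array([1.0, 3.0, 2.0, 5.0])

# population covariance: sigma_12 = E[X1 * X2] - E[X1] * E[X2]
sigma12 = np.mean(x1 * x2) - np.mean(x1) * np.mean(x2)

# correlation coefficient: rho_12 = sigma_12 / sqrt(sigma_1^2 * sigma_2^2)
rho12 = sigma12 / np.sqrt(np.var(x1) * np.var(x2))

print(sigma12, rho12)
print(np.cov(x1, x2, bias=True))   # 2x2 covariance matrix (population version)
print(np.corrcoef(x1, x2))         # off-diagonal entries match rho12
```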

Lecture3:

Partitioning method: Discovering the groupings in the data by optimizing a specific objective function and iteratively improving the quality of partitions.

K-partitioning method:

Partitioning a dataset D of n objects into a set of K clusters so that an objective function is optimized (e.g., the sum of squared distances is minimized.)

A typical objective function:
Sum of Squared Errors (SSE)
SSE(C) = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - c_k \rVert^2, where c_k is the centroid of cluster C_k (a small computation sketch follows).
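
A small numpy sketch of the SSE objective, assuming the points, their cluster labels, and the centroids are given; the function name sse is ours.

```python
import numpy as np

def sse(points, labels, centroids):
    """Sum of squared distances from each point to the centroid of its cluster."""
    return sum(np.sum((points[labels == k] - c) ** 2)
               for k, c in enumerate(centroids))

points    = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
labels    = np.array([0, 0, 1, 1])
centroids = np.array([points[labels == 0].mean(axis=0),
                      points[labels == 1].mean(axis=0)])
print(sse(points, labels, centroids))  # 2.25: each point sits close to its centroid
```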

Problem Definition:

Given K, find a partition of K clusters that optimizes the chosen partitioning criterion.

Global optimal: Needs to exhaustively enumerate all partitions.

Heuristic methods (i.e., greedy algo.):
- K-means, K-medians, K-Medoids, etc.

K-Means:

  • Each cluster is represented by the center of the cluster (a minimal sketch follows this list)
  • Efficiency: O(tKn), where t is the number of iterations; normally, K, t ≪ n
  • often terminates at a local optimum
  • Need to specify K
  • objects in a continuous n-dimensional space
    • use the K-modes for categorical data
  • Sensitive to noisy data and outliers
    • variations: use K-medians, K-medoids, etc.
  • Not suitable for discovering clusters with non-convex shapes
    • use density-based clustering, kernel K-means, etc.
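
A minimal Lloyd-style K-means sketch with numpy (random initialization, fixed number of iterations); it is our own illustration of the generic algorithm, not code from the course.

```python
import numpy as np

def k_means(X, K, n_iter=20, seed=0):
    """Plain K-means: alternate assignment and centroid update for n_iter rounds."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]  # random initial centers
    for _ in range(n_iter):
        # assign each point to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute each centroid as the mean of its assigned points
        centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                              else centroids[k] for k in range(K)])
    return labels, centroids

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centroids = k_means(X, K=2)
print(labels)      # the two tight groups end up in different clusters
print(centroids)
```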

Variations of K-Means

  • choose better initial centroid estimates
    • K-means++, Intelligent K-means, Genetic K-means
  • choose different representative prototypes for the clusters

    • K-medoids, K-medians, K-modes
  • applying feature transformation techniques

    • weighted K-means, kernel K-means

Initialization of K-means

  • K-means++
    • The first centroid is selected at random
    • Each subsequent centroid is chosen from the remaining points with probability proportional to its squared distance from the nearest already-selected centroid (i.e., far-away points are favored by the weighted probability score)
    • The selection continues until K centroids are obtained (a seeding sketch follows this list)
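
A sketch of the K-means++ seeding described above, using numpy; each new centroid is drawn with probability proportional to its squared distance from the nearest centroid already chosen.

```python
import numpy as np

def kmeans_pp_init(X, K, seed=0):
    """Pick K initial centroids: the first at random, the rest distance-weighted."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]           # first centroid: uniform at random
    for _ in range(K - 1):
        # squared distance of every point to its nearest already-chosen centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()                        # far-away points are more likely
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9], [10.0, 0.0]])
print(kmeans_pp_init(X, K=3))   # seeds tend to land in well-separated regions
```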

The K-medoids clustering methods

K-medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in the cluster.

PAM (Partitioning Around Medoids)

  • starts from an initial set of medoids
  • iteratively replaces one of the medoids by one of the non-medoids if it improves the total sum of the squared errors (SSE) of the resulting clustering
  • works effectively for small data sets but does not scale well for large data sets (due to the computational complexity)
  • computational complexity O(K(n-K)^2) per iteration (quite expensive!); a compact sketch follows this list
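
A compact PAM-style sketch, assuming numpy and squared Euclidean distance as the cost, that greedily tries medoid/non-medoid swaps and keeps any swap that lowers the total cost; it illustrates the idea rather than an optimized implementation.

```python
import numpy as np

def total_cost(X, medoids):
    """Sum over all points of the squared distance to the closest medoid."""
    d = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2) ** 2
    return d.min(axis=1).sum()

def pam(X, K, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=K, replace=False))  # initial medoids
    improved = True
    while improved:
        improved = False
        for m in range(K):                       # each current medoid
            for h in range(len(X)):              # each candidate non-medoid
                if h in medoids:
                    continue
                candidate = medoids.copy()
                candidate[m] = h
                if total_cost(X, candidate) < total_cost(X, medoids):
                    medoids, improved = candidate, True
    return medoids

X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.2],
              [9.0, 9.0], [9.1, 8.8], [8.9, 9.2]])
print(pam(X, K=2))   # indices of one representative object from each tight group
```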

Efficiency improvements on PAM

  • CLARA (Kaufmann & Rousseeuw, 1990):
    • PAM on samples: O(Ks^2 + K(n-K)), where s is the sample size.
  • CLARANS (Ng & Han, 1994): randomized re-sampling, ensuring both efficiency and quality

K-modes:

An extension of K-means that replaces the means of clusters with modes, designed for categorical data.

Dissimilarity measure between object X and the center of cluster Z:
\Phi(x_i, z_j) = 1 - \frac{n_r^{\,j}}{n_l},
where n_l is the number of objects in cluster l and n_r^j is the number of objects whose attribute value is r. In other words, the dissimilarity is frequency-based (a sketch follows below).
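
A sketch of one common reading of this frequency-based dissimilarity (close to the fuzzy K-modes variant): an attribute that mismatches the mode costs 1, a match costs 1 − n_r^j/n_l; the helper names and this exact reading are our assumptions.

```python
def kmodes_dissimilarity(x, z, cluster):
    """Frequency-based Phi summed over attributes for object x and cluster mode z."""
    n_l = len(cluster)                                   # number of objects in cluster l
    total = 0.0
    for j, (x_val, z_val) in enumerate(zip(x, z)):
        if x_val != z_val:
            total += 1.0                                 # plain mismatch with the mode
        else:
            n_r = sum(1 for obj in cluster if obj[j] == z_val)  # objects sharing the mode value
            total += 1.0 - n_r / n_l
    return total

cluster = [("red", "round"), ("red", "square"), ("blue", "round")]
z = ("red", "round")   # per-attribute modes of the cluster
print(kmodes_dissimilarity(("red", "round"), z, cluster))   # (1 - 2/3) + (1 - 2/3) ≈ 0.667
print(kmodes_dissimilarity(("blue", "square"), z, cluster)) # 1 + 1 = 2.0
```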

For a mixture of categorical and numerical data: use the K-prototypes method.
