数据挖掘中的聚类分析


Notes on Cluster Analysis in Data Mining, a Coursera course taught by the data-mining expert Prof. Jiawei Han.

Week 1

-Considerations for Cluster Analysis

  • partitioning criteria (single level vs. hierarchical partitioning)
  • separation of clusters (exclusive vs. non-exclusive [e.g.: one doc may belong to more than one class])
  • similarity measure (distance-based vs. connectivity-based [e.g., density or contiguity])
  • clustering space (full space [e.g., often when low dimensional] vs. subspace [e.g., often in high-dimensional clustering])

Four issues:
-Quality

  • deal with different types of attributes: numerical, categorical, text, multimedia, networks, and mixture of multiple types
  • clusters with arbitrary shape
  • deal with noisy data

-Scalability

  • clustering all the data instead of only samples
  • high dimensionality
  • incremental or stream clustering and insensitivity to input order

-Constraint-based clustering

  • user-given preferences or constraints

-Interpretability and usability

Cluster Analysis Categorization:
-Technique-centered

  • distance-based
  • density-based and grid-based methods
  • probabilistic and generative models
  • leveraging dimensionality reduction methods
  • high-dimensional clustering
  • scalable techniques for cluster analysis

-Data type-centered

  • clustering numerical data, categorical data, text, multimedia, time-series data, sequences, stream data, networked data, uncertain data.

-Additional insight-centered

  • visual insights, semi-supervised, ensemble-based, validation-based.

Typical Clustering Methods:
-Distance-based

  • partitioning algo.: k-means, k-medians, k-medoids
  • hierarchical algo.: agglomerative vs. divisive method

-Density-based and grid-based

  • density-based: form dense regions at a fine level of granularity, then post-process to merge the dense regions into clusters of arbitrary shape
  • grid-based: individual regions are formed into a grid-like structure

-Probabilistic and generative models

  • Assume a specific form of the generative model (e.g., a mixture of Gaussians)
  • Model parameters are estimated with EM algo.
  • Then estimate the generative probability of the underlying data points.

-High-dimensional clustering

  • subspace clustering (bottom-up, top-down, correlation-based method vs. δ-cluster method)
  • dimensionality reduction (co-clustering [column reduction]: PLSI, LDA, NMF, spectral clustering)

Lecture2:
Good clustering:

  • High intra-class similarity (Cohesive)
  • Low inter-class similarity (Distinctive between clusters)

proximity: similarity or dissimilarity

-Dissimilarity Matrix

  • triangular matrix (symmetric)
  • distance functions are usually different for different types of data

-Distance on numeric data: Minkowski Distance

  • A popular distance measure:
    d(i,j) = \sqrt[p]{|x_{i1}-x_{j1}|^p + |x_{i2}-x_{j2}|^p + \cdots + |x_{il}-x_{jl}|^p}
    where i = (x_{i1}, x_{i2}, \dots, x_{il}) and j = (x_{j1}, x_{j2}, \dots, x_{jl}) are two l-dimensional data objects and p is the order (this distance is also commonly called the L_p norm). A small numeric sketch follows this list.

  • Property:
    positivity; symmetry; triangle inequality.

  • p=1: Manhattan (or city block) distance

  • p=2: Euclidean distance
  • p → ∞: “supremum” distance
    In this case,
    d(i,j) = \lim_{p \to \infty} \sqrt[p]{|x_{i1}-x_{j1}|^p + |x_{i2}-x_{j2}|^p + \cdots + |x_{il}-x_{jl}|^p} = \max_{f=1,\dots,l} |x_{if}-x_{jf}|
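
A minimal numeric sketch of the Minkowski family (Manhattan, Euclidean, and supremum distances), assuming plain Python lists as the l-dimensional objects; the function name minkowski_distance is ours, not from the lecture.

```python
def minkowski_distance(x, y, p):
    """L_p distance between two l-dimensional points x and y."""
    if p == float("inf"):                              # supremum distance
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

i = [1.0, 2.0, 3.0]
j = [4.0, 0.0, 3.0]
print(minkowski_distance(i, j, 1))             # Manhattan: |1-4| + |2-0| + |3-3| = 5.0
print(minkowski_distance(i, j, 2))             # Euclidean: sqrt(9 + 4 + 0) ≈ 3.606
print(minkowski_distance(i, j, float("inf")))  # supremum: max(3, 2, 0) = 3.0
```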

-Proximity measure for binary attribute
Draw a contingency table for binary data.

  • for symmetric binary variables: d(i,j) = \frac{r+s}{q+r+s+t}
  • for asymmetric binary variables: d(i,j) = \frac{r+s}{q+r+s}
  • Jaccard coefficient (similarity measure): sim_{Jaccard}(i,j) = \frac{q}{q+r+s}, which is computed in the same way as coherence(i,j). A sketch follows this list.
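
A sketch of the binary-attribute measures above, assuming two 0/1 vectors; q, r, s, t are the contingency-table counts and the helper name binary_measures is ours.

```python
def binary_measures(x, y):
    """Contingency counts and the three measures for two 0/1 vectors."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)  # both 1
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)  # 1 in x, 0 in y
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)  # 0 in x, 1 in y
    t = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)  # both 0
    d_symmetric  = (r + s) / (q + r + s + t)
    d_asymmetric = (r + s) / (q + r + s)   # 0/0 matches (t) carry no information
    sim_jaccard  = q / (q + r + s)
    return d_symmetric, d_asymmetric, sim_jaccard

# e.g. two patients' test results (1 = positive, 0 = negative)
print(binary_measures([1, 0, 1, 0, 0, 0], [1, 0, 0, 0, 0, 1]))
```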

-Proximity measure for categorical attribute

  • simple matching: d(i,j) = \frac{p-m}{p}, where m is the number of matching attributes and p is the total number of attributes (a sketch follows this list)
  • or use a large number of binary variables: create one binary attribute for each of the nominal states
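
A small sketch of simple matching on nominal attributes, assuming tuples of category labels; p is the total number of attributes, m the number of matches, and the function name is ours.

```python
def simple_matching_distance(x, y):
    """d(i, j) = (p - m) / p for two nominal attribute vectors."""
    p = len(x)                                   # total number of attributes
    m = sum(1 for a, b in zip(x, y) if a == b)   # number of matching attributes
    return (p - m) / p

print(simple_matching_distance(("red", "round", "small"),
                               ("red", "square", "small")))  # (3 - 2) / 3 ≈ 0.333
```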

-Proximity measure for ordinal attribute
Replace x_{if} by its rank r_{if}, map the rank onto [0, 1] via z_{if} = \frac{r_{if}-1}{M_f-1} (where M_f is the number of ordered states of attribute f), and then treat z_{if} as interval-scaled.
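
A sketch of the rank-normalization step for one ordinal attribute, assuming the ordered list of states is known; the helper name is ours.

```python
def ordinal_to_interval(values, ordered_states):
    """Map ordinal values onto [0, 1] via their ranks: z = (r - 1) / (M - 1)."""
    M = len(ordered_states)
    rank = {state: i + 1 for i, state in enumerate(ordered_states)}  # ranks 1..M
    return [(rank[v] - 1) / (M - 1) for v in values]

# e.g. grades treated as an ordinal attribute
print(ordinal_to_interval(["fair", "excellent", "good"],
                          ordered_states=["fair", "good", "excellent"]))  # [0.0, 1.0, 0.5]
```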

-Attributes of mixed type

  • a dataset may contain all attribute types: nominal, symmetric binary, asymmetric binary, numeric, and ordinal;
  • use a weighted formula to combine their effects (see the sketch below):
    d(i,j) = \frac{\sum_{f=1}^{p} w_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} w_{ij}^{(f)}}
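
A sketch of the weighted combination for mixed-type attributes, assuming the per-attribute distances d_ij^(f) and indicator weights w_ij^(f) (e.g., w = 0 when a value is missing) have already been computed; the names are ours.

```python
def mixed_type_distance(per_attribute_d, weights):
    """d(i, j) = sum_f w_f * d_f / sum_f w_f over the p attributes."""
    return sum(w * d for w, d in zip(weights, per_attribute_d)) / sum(weights)

# three attributes (numeric, binary, ordinal); the middle value is missing, so w = 0
print(mixed_type_distance([0.25, 1.0, 0.5], [1, 0, 1]))  # (0.25 + 0.5) / 2 = 0.375
```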

-Covariance for two variables
Covariance between two variables X_1 and X_2:
\sigma_{12} = E[X_1 X_2] - E[X_1]\,E[X_2]

-Correlation coefficient
\rho_{12} = \frac{\sigma_{12}}{\sqrt{\sigma_1^2 \sigma_2^2}}

-Covariance matrix
\Sigma = E[(X-\mu)(X-\mu)^T] = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{pmatrix}
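
A numpy sketch of the covariance and correlation formulas above; np.cov and np.corrcoef are standard numpy routines, and the small sample is made up for illustration.

```python
import numpy as np

x1 = np.array([2.0, 4.0, 6.0, 8.0])
x2 = np.array([1.0, 3.0, 2.0, 5.0])

# population covariance: sigma_12 = E[X1 * X2] - E[X1] * E[X2]
sigma12 = np.mean(x1 * x2) - np.mean(x1) * np.mean(x2)

# correlation coefficient: rho_12 = sigma_12 / sqrt(sigma_1^2 * sigma_2^2)
rho12 = sigma12 / np.sqrt(np.var(x1) * np.var(x2))

print(sigma12, rho12)
print(np.cov(x1, x2, bias=True))   # 2x2 covariance matrix (population version)
print(np.corrcoef(x1, x2))         # off-diagonal entries match rho12
```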

Lecture3:

Partitioning method: Discovering the groupings in the data by optimizing a specific objective function and iteratively improving the quality of partitions.

K-partitioning method:

Partitioning a dataset D of n objects into a set of K clusters so that an objective function is optimized (e.g., the sum of squared distances is minimized.)

A typical objective function:
Sum of Squared Errors (SSE)
SSE(C) = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - c_k \rVert^2, where c_k is the centroid of cluster C_k (a small computation sketch follows).
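
A small numpy sketch of the SSE objective, assuming the points, their cluster labels, and the centroids are given; the function name sse is ours.

```python
import numpy as np

def sse(points, labels, centroids):
    """Sum of squared distances from each point to the centroid of its cluster."""
    return sum(np.sum((points[labels == k] - c) ** 2)
               for k, c in enumerate(centroids))

points    = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
labels    = np.array([0, 0, 1, 1])
centroids = np.array([points[labels == 0].mean(axis=0),
                      points[labels == 1].mean(axis=0)])
print(sse(points, labels, centroids))  # 2.25: each point sits close to its centroid
```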

Problem Definition:

Given K, find a partition of K clusters that optimizes the chosen partitioning criterion.

Global optimal: Needs to exhaustively enumerate all partitions.

Heuristic methods (i.e., greedy algo.):
- K-means, K-medians, K-Medoids, etc.

K-Means:

  • Each cluster is represented by the center of the cluster (a minimal sketch follows this list)
  • Efficiency: O(tKn), where t is the number of iterations; normally, K, t ≪ n
  • often terminates at a local optimum
  • Need to specify K
  • objects in a continuous n-dimensional space
    • use the K-modes for categorical data
  • Sensitive to noisy data and outliers
    • variations: use K-medians, K-medoids, etc.
  • Not suitable for discovering clusters with non-convex shapes
    • use density-based clustering, kernel K-means, etc.
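
A minimal Lloyd-style K-means sketch with numpy (random initialization, fixed number of iterations); it is our own illustration of the generic algorithm, not code from the course.

```python
import numpy as np

def k_means(X, K, n_iter=20, seed=0):
    """Plain K-means: alternate assignment and centroid update for n_iter rounds."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]  # random initial centers
    for _ in range(n_iter):
        # assign each point to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute each centroid as the mean of its assigned points
        centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                              else centroids[k] for k in range(K)])
    return labels, centroids

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centroids = k_means(X, K=2)
print(labels)      # the two tight groups end up in different clusters
print(centroids)
```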

Variations of K-Means

  • choose better initial centroid estimates
    • K-means++, Intelligent K-means, Genetic K-means
  • choose different representative prototypes for the clusters

    • K-medoids, K-medians, K-modes
  • applying feature transformation techniques

    • weighted K-means, kernel K-means

Initialization of K-means

  • K-means++
    • The first centroid is selected at random
    • Each subsequent centroid is chosen from the remaining points with probability proportional to its squared distance from the nearest already-selected centroid (i.e., far-away points are favored by the weighted probability score)
    • The selection continues until K centroids are obtained (a seeding sketch follows this list)
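
A sketch of the K-means++ seeding described above, using numpy; each new centroid is drawn with probability proportional to its squared distance from the nearest centroid already chosen.

```python
import numpy as np

def kmeans_pp_init(X, K, seed=0):
    """Pick K initial centroids: the first at random, the rest distance-weighted."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]           # first centroid: uniform at random
    for _ in range(K - 1):
        # squared distance of every point to its nearest already-chosen centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()                        # far-away points are more likely
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9], [10.0, 0.0]])
print(kmeans_pp_init(X, K=3))   # seeds tend to land in well-separated regions
```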

The K-medoids clustering methods

K-medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in the cluster.

PAM (Partitioning Around Medoids)

  • starts from an initial set of medoids
  • iteratively replaces one of the medoids by one of the non-medoids if it improves the total sum of the squared errors (SSE) of the resulting clustering
  • works effectively for small data sets but does not scale well for large data sets (due to the computational complexity)
  • computational complexity O(K(n-K)^2) per iteration (quite expensive!); a compact sketch follows this list
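
A compact PAM-style sketch, assuming numpy and squared Euclidean distance as the cost, that greedily tries medoid/non-medoid swaps and keeps any swap that lowers the total cost; it illustrates the idea rather than an optimized implementation.

```python
import numpy as np

def total_cost(X, medoids):
    """Sum over all points of the squared distance to the closest medoid."""
    d = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2) ** 2
    return d.min(axis=1).sum()

def pam(X, K, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=K, replace=False))  # initial medoids
    improved = True
    while improved:
        improved = False
        for m in range(K):                       # each current medoid
            for h in range(len(X)):              # each candidate non-medoid
                if h in medoids:
                    continue
                candidate = medoids.copy()
                candidate[m] = h
                if total_cost(X, candidate) < total_cost(X, medoids):
                    medoids, improved = candidate, True
    return medoids

X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.2],
              [9.0, 9.0], [9.1, 8.8], [8.9, 9.2]])
print(pam(X, K=2))   # indices of one representative object from each tight group
```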

Efficiency improvements on PAM

  • CLARA (Kaufmann & Rousseeuw, 1990):
    • PAM on samples: O(Ks^2 + K(n-K)), where s is the sample size.
  • CLARANS (Ng & Han, 1994): randomized re-sampling, ensuring both efficiency and quality

K-modes:

An extension of K-means that replaces the means of clusters with modes, designed for categorical data.

Dissimilarity measure between object X and the center of cluster Z:
\Phi(x_i, z_j) = 1 - \frac{n_r^{\,j}}{n_l},
where n_l is the number of objects in cluster l and n_r^j is the number of objects whose attribute value is r. In other words, the dissimilarity is frequency-based (a sketch follows below).
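
A sketch of one common reading of this frequency-based dissimilarity (close to the fuzzy K-modes variant): an attribute that mismatches the mode costs 1, a match costs 1 − n_r^j/n_l; the helper names and this exact reading are our assumptions.

```python
def kmodes_dissimilarity(x, z, cluster):
    """Frequency-based Phi summed over attributes for object x and cluster mode z."""
    n_l = len(cluster)                                   # number of objects in cluster l
    total = 0.0
    for j, (x_val, z_val) in enumerate(zip(x, z)):
        if x_val != z_val:
            total += 1.0                                 # plain mismatch with the mode
        else:
            n_r = sum(1 for obj in cluster if obj[j] == z_val)  # objects sharing the mode value
            total += 1.0 - n_r / n_l
    return total

cluster = [("red", "round"), ("red", "square"), ("blue", "round")]
z = ("red", "round")   # per-attribute modes of the cluster
print(kmodes_dissimilarity(("red", "round"), z, cluster))   # (1 - 2/3) + (1 - 2/3) ≈ 0.667
print(kmodes_dissimilarity(("blue", "square"), z, cluster)) # 1 + 1 = 2.0
```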

For a mixture of categorical and numerical data: use the K-prototypes method.
