Cluster Analysis in Data Mining
Notes on Cluster Analysis in Data Mining, a Coursera course taught by the data-mining authority Prof. Jiawei Han.
Week 1
-Considerations for Cluster Analysis
- partitioning criteria (single level vs. hierarchical partitioning)
- separation of clusters (exclusive vs. non-exclusive [e.g.: one doc may belong to more than one class])
- similarity measure (distance-based vs. connectivity-based [e.g., density or contiguity])
- clustering space (full space [e.g., often when low dimensional] vs. subspace [e.g., often in high-dimensional clustering])
Four issues:
-Quality
- deal with different types of attributes: numerical, categorical, text, multimedia, networks, and mixture of multiple types
- clusters with arbitrary shape
- deal with noisy data
-Scalability
 - clustering all the data instead of only samples
- high dimensionality
- incremental or stream clustering and insensitivity to input order
-Constraint-based clustering
- user-given preferences or constraints
-Interpretability and usability
Cluster Analysis Categorization:
-Technique-centered
- distance-based
- density-based and grid-based methods
- probabilistic and generative models
- leveraging dimensionality reduction methods
- high-dimensional clustering
- scalable tech for cluster analysis
-Data type-centered
- clustering numerical data, categorical data, text, multimedia, time-series data, sequences, stream data, networked data, uncertain data.
-Additional insight-centered
- visual insights, semi-supervised, ensemble-based, validation-based.
Typical Clustering Methods:
-Distance-based
- partitioning algo.: k-means, k-medians, k-medoids
- hierarchical algo.: agglomerative vs. divisive method
-Density-based and grid-based
 - density-based: estimate density at a fine level of granularity, then post-process to merge dense regions into clusters of arbitrary shape.
- grid-based: individual regions are formed into a grid-like structure
-Probabilistic and generative models
 - Assume a specific form of the generative model (e.g., a mixture of Gaussians)
- Model parameters are estimated with EM algo.
- Then estimate the generative probability of the underlying data points.
-High-dimensional clustering
 - subspace clustering (bottom-up, top-down, correlation-based method vs. δ-cluster method)
 - dimensionality reduction (co-clustering [column reduction]: PLSI, LDA, NMF, spectral clustering)
Lecture2:
Good clustering:
- High intra-class similarity (Cohesive)
- Low inter-class similarity (Distinctive between clusters)
proximity: similarity or dissimilarity
-Dissimilarity Matrix
 - triangular matrix (symmetric)
- distance functions are usually different for different types of data
-Distance on numeric data: Minkowski Distance
A popular distance measure:
d(i, j) = (|x_i1 − x_j1|^p + |x_i2 − x_j2|^p + ⋯ + |x_il − x_jl|^p)^(1/p),
where i = (x_i1, x_i2, …, x_il) and j = (x_j1, x_j2, …, x_jl) are l-dimensional data points and p is the order (this distance is also commonly called the L_p norm). Properties: positivity; symmetry; triangle inequality.
- p = 1: Manhattan (or city block) distance
- p = 2: Euclidean distance
- p → ∞: "supremum" distance; in this case, d(i, j) = lim_{p→∞} (|x_i1 − x_j1|^p + ⋯ + |x_il − x_jl|^p)^(1/p) = max_{f=1,…,l} |x_if − x_jf|.
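The three special cases above can be checked with a minimal sketch (the vectors `i` and `j` are made-up examples):

```python
import math

def minkowski(x, y, p):
    """Minkowski (L_p) distance between two equal-length numeric vectors."""
    if math.isinf(p):
        # supremum distance: the largest coordinate-wise difference
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

i, j = [1, 2, 3], [4, 6, 3]
print(minkowski(i, j, 1))         # Manhattan: 3 + 4 + 0 = 7
print(minkowski(i, j, 2))         # Euclidean: sqrt(9 + 16 + 0) = 5.0
print(minkowski(i, j, math.inf))  # supremum: max(3, 4, 0) = 4
```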
-Proximity measure for binary attribute
Draw a contingency table for binary data.
 - for symmetric binary variables:
   d(i, j) = (r + s) / (q + r + s + t)
 - for asymmetric binary variables:
   d(i, j) = (r + s) / (q + r + s)
 - Jaccard coefficient (similarity measure):
   sim_Jaccard(i, j) = q / (q + r + s), computed the same way as coherence(i, j).
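A small sketch of these three measures computed from the contingency counts (the two vectors are invented; 1 = presence, 0 = absence):

```python
def binary_proximity(x, y):
    """Contingency counts for two binary vectors and the derived measures."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)  # both 1
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)  # 1 in x only
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)  # 1 in y only
    t = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)  # both 0
    d_sym = (r + s) / (q + r + s + t)   # symmetric binary dissimilarity
    d_asym = (r + s) / (q + r + s)      # asymmetric: 0-0 matches ignored
    jaccard = q / (q + r + s)           # Jaccard similarity = 1 - d_asym
    return d_sym, d_asym, jaccard

x = [1, 0, 1, 1, 0, 0]
y = [1, 1, 1, 0, 0, 0]
print(binary_proximity(x, y))  # q=2, r=1, s=1, t=2
```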
-Proximity measure for categorical attribute
 - simple matching:
   d(i, j) = (p − m) / p, where p is the total number of variables and m the number of matches
 - or use a large number of binary variables
-Proximity measure for ordinal attribute
compute ranks
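A minimal sketch of the rank-based treatment: replace each ordinal value by its rank r in 1..M, map it to [0, 1] via z = (r − 1) / (M − 1), and then apply a numeric distance. The attribute values below are made up for illustration:

```python
def ordinal_to_numeric(values, order):
    """Map ordinal values to [0, 1]: rank r in 1..M becomes (r - 1) / (M - 1)."""
    rank = {v: i + 1 for i, v in enumerate(order)}  # ranks 1..M
    M = len(order)
    return [(rank[v] - 1) / (M - 1) for v in values]

grades = ["fair", "good", "excellent", "fair"]
print(ordinal_to_numeric(grades, order=["fair", "good", "excellent"]))
# [0.0, 0.5, 1.0, 0.0]
```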
-Attributes of mixed type
- a dataset may contain all attribute types: nominal, symmetric binary, asymmetric binary, numeric, and ordinal;
 - use a weighted formula to combine their effects:
   d(i, j) = Σ_{f=1}^{p} w_ij^(f) d_ij^(f) / Σ_{f=1}^{p} w_ij^(f)
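The weighted combination can be sketched as follows; the per-attribute dissimilarities and weights in the example are invented (a weight of 0 marks an attribute that should not contribute, e.g. a missing value or an asymmetric-binary 0-0 match):

```python
def mixed_dissimilarity(terms):
    """Weighted combination d(i,j) = sum_f w_f * d_f / sum_f w_f.

    `terms` is a list of (w_f, d_f) pairs, one per attribute: w_f is an
    indicator/weight, d_f the per-attribute dissimilarity in [0, 1].
    """
    num = sum(w * d for w, d in terms)
    den = sum(w for w, _ in terms)
    return num / den if den else 0.0

# three attributes: numeric (d=0.2), nominal mismatch (d=1), missing (w=0)
print(mixed_dissimilarity([(1, 0.2), (1, 1.0), (0, 0.0)]))  # (0.2+1.0)/2 = 0.6
```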
-Covariance between two variables: Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)] = E[XY] − μ_X μ_Y
-Correlation coefficient: ρ(X, Y) = Cov(X, Y) / (σ_X σ_Y)
-Covariance matrix: Σ = E[(X − μ)(X − μ)^T]
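A tiny sketch of the two scalar quantities, using the population convention (divide by n; the sample versions divide by n − 1); the data vectors are invented:

```python
def covariance(xs, ys):
    """Population covariance: mean of (x - mean_x) * (y - mean_y)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    """Pearson correlation: covariance normalized by the std deviations."""
    sx = covariance(xs, xs) ** 0.5
    sy = covariance(ys, ys) ** 0.5
    return covariance(xs, ys) / (sx * sy)

xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]  # perfectly linear in xs, so correlation = 1
print(covariance(xs, ys), correlation(xs, ys))
```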
Lecture3:
Partitioning method: Discovering the groupings in the data by optimizing a specific objective function and iteratively improving the quality of partitions.
Partitioning a dataset D of n objects into a set of K clusters so that an objective function is optimized (e.g., the sum of squared distances is minimized).
A typical objective function: the Sum of Squared Errors (SSE),
SSE = Σ_{k=1}^{K} Σ_{x_i ∈ C_k} ||x_i − c_k||^2, where c_k is the centroid of cluster C_k.
Problem Definition:
Given K, find a partition of K clusters that optimizes the chosen partitioning criterion.
Global optimum: requires exhaustively enumerating all partitions.
Heuristic methods (i.e., greedy algo.):
- K-means, K-medians, K-Medoids, etc.
K-Means:
- Each cluster is represented by the center of the cluster
- Efficiency: O(tKn), where n = # objects, K = # clusters, t = # iterations; normally K, t ≪ n
- Often terminates at a local optimum
- Need to specify K, the number of clusters, in advance
- Applicable only to objects in a continuous n-dimensional space
 - use K-modes for categorical data
- Sensitive to noisy data and outliers
 - variations: use K-medians, K-medoids, etc.
- Not suitable to discover clusters with non-convex shapes
 - use density-based clustering, kernel K-means, etc.
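A minimal Lloyd-style sketch of the K-means loop on 2-D tuples, reporting the SSE objective (the sample points are made up; a real implementation would use numpy or scikit-learn):

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def mean(pts):
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

def kmeans(points, K, iters=100, seed=0):
    """Plain Lloyd's K-means; returns (centroids, SSE). Cost is O(tKn)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, K)          # random initial centers
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid's cluster
        clusters = [[] for _ in range(K)]
        for p in points:
            k = min(range(K), key=lambda c: dist2(p, centroids[c]))
            clusters[k].append(p)
        # update step: each centroid moves to the mean of its cluster
        new = [mean(c) if c else centroids[k] for k, c in enumerate(clusters)]
        if new == centroids:                   # converged (local optimum)
            break
        centroids = new
    sse = sum(min(dist2(p, c) for c in centroids) for p in points)
    return centroids, sse

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, sse = kmeans(pts, K=2)
print(centers, sse)  # two well-separated groups, small SSE
```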
Variations of K-means:
- choose better initial centroid estimates: K-means++, Intelligent K-means, Genetic K-means
- choose different representative prototypes for the clusters: K-medoids, K-medians, K-modes
- apply feature transformation techniques: weighted K-means, kernel K-means
Initialization of K-means++:
- The first centroid is selected at random
- The next centroid selected is the one farthest from the currently selected centroids (selection is based on a weighted probability score)
- The selection continues until K centroids are obtained
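The seeding steps above can be sketched as follows (the sample points are invented; in the usual K-means++ scheme the weighted probability is proportional to the squared distance to the nearest chosen center):

```python
import random

def kmeanspp_init(points, K, seed=0):
    """K-means++ seeding: sample each next center with probability
    proportional to the squared distance to its nearest chosen center."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]             # first center: uniform at random
    while len(centers) < K:
        # D(x)^2 for each point: squared distance to the closest chosen center
        d2 = [min((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers)
              for p in points]
        # weighted draw: far-away points are more likely to be picked
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(kmeanspp_init(pts, K=2))
```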
The K-medoids method: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.
PAM (Partitioning Around Medoids)
- starts from an initial set of medoids
- iteratively replaces one of the medoids by one of the non-medoids if it improves the total sum of the squared errors (SSE) of the resulting clustering
- works effectively for small data sets but does not scale well to large data sets, due to the computational complexity O(K(n − K)^2) per iteration (quite expensive!)
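A sketch of the PAM swap loop on toy 2-D data (the Manhattan metric and the sample points are illustrative choices): each pass tries every medoid/non-medoid swap and keeps any that lowers the total cost, which is where the O(K(n − K)^2) behaviour comes from.

```python
def total_cost(points, medoids, dist):
    """Total cost: each point contributes its distance to the nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam(points, K, dist):
    """PAM: greedily swap a medoid with a non-medoid while the cost improves."""
    medoids = list(points[:K])        # naive initial medoids: first K objects
    improved = True
    while improved:
        improved = False
        for m in list(medoids):       # try replacing each current medoid ...
            for o in points:          # ... by each non-medoid object
                if o in medoids:
                    continue
                candidate = [o if x == m else x for x in medoids]
                if total_cost(points, candidate, dist) < total_cost(points, medoids, dist):
                    medoids, improved = candidate, True
    return medoids

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

pts = [(0, 0), (0, 1), (1, 1), (9, 9), (9, 8), (8, 8)]
print(pam(pts, K=2, dist=manhattan))  # one medoid per natural group
```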
Efficiency improvements on PAM
- CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples, O(Ks^2 + K(n − K)), where s is the sample size
- CLARANS (Ng & Han, 1994): randomized re-sampling, ensuring efficiency + quality
K-Modes: an extension to K-means that replaces the means of clusters with modes, for categorical data.
Dissimilarity measure between an object X = (x_1, …, x_l) and the mode Z = (z_1, …, z_l) of a cluster: d(X, Z) = Σ_{j=1}^{l} δ(x_j, z_j), where δ(x_j, z_j) = 0 if x_j = z_j and 1 otherwise (simple matching).
For a mixture of categorical and numerical data: use the K-prototype method.