Unsupervised Classification Algorithms: Clustering



Tags (space-separated): Spark machine learning


1. Introduction to Clustering

1.1 Definition of clustering: Unsupervised Classification

Clustering is searching for patterns in complex data. Patterns can lead to business decisions.

“Cluster analysis is a set of methods for constructing
a (hopefully) sensible and informative classification
of an initially unclassified set of data, using the variable
values observed on each individual.”
Everitt (1998), The Cambridge Dictionary of Statistics

1.2 Types of Clustering:

1.2.1 Hierarchical


Problems with hierarchical clustering
• Hierarchical methods do not scale up well
• Previous merges or divisions are irrevocable
• There are many hierarchical clustering methods, each defining cluster similarity in different ways, and no one method is “best”

1.2.2 Partitive


Problems with partitive clustering: it might
• make you guess the number of clusters present
• make assumptions about the shape of the clusters, usually that they are (hyper-)spherical
• be influenced by seed location, outliers, and the order in which the observations are read
• make it difficult to determine the optimal grouping, due to the combinatorial explosion of potential solutions

3. Measuring Similarity

3.1 Euclidean Distance

Euclidean distance gives the linear distance between any two points in n-dimensional space.

It is a generalization of the Pythagorean theorem.
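A minimal sketch in plain Scala (the array-based points are hypothetical): the Euclidean distance is the square root of the sum of squared coordinate differences.

```scala
// Euclidean distance: square root of the sum of squared coordinate differences
def euclidean(a: Array[Double], b: Array[Double]): Double =
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

// Example: the distance between (0, 0) and (3, 4) is 5.0
println(euclidean(Array(0.0, 0.0), Array(3.0, 4.0)))
```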

3.2 City Block (Manhattan) Distance

The distance between two points is measured along the sides of a right triangle.

It is the distance you would travel if you had to walk along the streets of a city laid out on a rectangular grid.
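A matching sketch for the city block distance, i.e. the sum of absolute coordinate differences (same hypothetical points as above):

```scala
// Manhattan (city block) distance: sum of absolute coordinate differences
def manhattan(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => math.abs(x - y) }.sum

// Example: walking from (0, 0) to (3, 4) along the grid covers 7.0 blocks
println(manhattan(Array(0.0, 0.0), Array(3.0, 4.0)))
```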

3.3 Hamming Distance

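As a minimal sketch, Hamming distance counts the positions at which two equal-length sequences differ; the binary strings below are hypothetical.

```scala
// Hamming distance: number of positions where two equal-length sequences differ
def hamming(a: String, b: String): Int = {
  require(a.length == b.length, "sequences must have equal length")
  a.zip(b).count { case (x, y) => x != y }
}

// Example: "10110" and "10011" differ in two positions
println(hamming("10110", "10011")) // 2
```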

3.4 Correlation

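As a minimal sketch, one common choice is Pearson correlation used as a similarity measure between two observation profiles (values near 1 mean the profiles move together); the array inputs below are hypothetical.

```scala
// Pearson correlation between two equal-length numeric profiles
def pearson(a: Array[Double], b: Array[Double]): Double = {
  val meanA = a.sum / a.length
  val meanB = b.sum / b.length
  val cov   = a.zip(b).map { case (x, y) => (x - meanA) * (y - meanB) }.sum
  val sdA   = math.sqrt(a.map(x => (x - meanA) * (x - meanA)).sum)
  val sdB   = math.sqrt(b.map(y => (y - meanB) * (y - meanB)).sum)
  cov / (sdA * sdB)
}

// Perfectly linearly related profiles have correlation 1.0
println(pearson(Array(1.0, 2.0, 3.0), Array(2.0, 4.0, 6.0)))
```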

3.5 Density-Based Similarity

Density-based methods define similarity as the distance between derived density “bubbles” (hyper-spheres).

4. The Process of Applying Clustering

4.1 Preparation for Clustering

4.1.1 Data and sample selection (Who am I clustering?)

4.1.2 Variable selection/clustering (What characteristics matter?)

• Variable reduction for redundancy and irrelevancy
• Variable clustering
• Selecting relevant variables is the secret to clustering success.

4.1.3 Graphical exploration (What shape/how many clusters?)

(1) Plotting can help to determine such key things as:
• the shape of the clusters
• the relative cluster dispersion (variation)
• the approximate number of clusters

(2) Dimension reduction
Dimension-reduction techniques, such as principal component analysis and multidimensional scaling, can be applied to summarize multivariate data.
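Since this post works with Spark MLlib, here is a sketch of principal component analysis using the RowMatrix API; the file path and the number of components (2) are illustrative, and sc is the spark-shell SparkContext as in the K-means example at the end of the post.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Hypothetical input: one observation per line, space-separated numeric variables
val rows = sc.textFile("data/mllib/kmeans_data.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))

// Assemble the observations into a distributed row matrix
val mat = new RowMatrix(rows)

// Keep the top 2 principal components and project the observations onto them
val pc = mat.computePrincipalComponents(2)
val reduced = mat.multiply(pc) // rows of `reduced` are the 2-dimensional summaries
```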

4.1.4 Variable standardization (Are variable scales comparable?)

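A sketch of standardization with MLlib's StandardScaler; the input loading and the withMean/withStd settings are illustrative choices.

```scala
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical input, parsed as in the previous sketch
val rows = sc.textFile("data/mllib/kmeans_data.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))

// Fit a scaler that centers each variable and rescales it to unit variance
val scaler = new StandardScaler(withMean = true, withStd = true).fit(rows)

// Transform every observation so the variables are on comparable scales before clustering
val standardized = scaler.transform(rows)
```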

4.1.5 Variable transformation (Are variables correlated? Are clusters elongated?)


4.2 Partitive Clustering

Partitive clustering minimizes or maximizes a specified error criterion, for example
• cluster separation, or
• within-cluster similarity (homogeneity)

Natural Grouping Criterion
Borrowing concepts from least-squares estimation yields a natural grouping criterion:
• maximize the between-cluster sum of squares, or
• minimize the within-cluster sum of squares.
A large between-cluster sum of squares implies that the clusters are well separated. A small within-cluster sum of squares implies that the members of each cluster are homogeneous.

The Trace Function
Decompose the total variation T into within-group (W) and between-group (B) variation, T = W + B:

W = Σ_i Σ_j (x_ij − x̄_i)(x_ij − x̄_i)′   B = Σ_i n_i (x̄_i − x̄)(x̄_i − x̄)′

Here x_ij is the jth member of cluster i, x̄_i is the cluster i mean, x̄ is the vector of sample means for each variable, and n_i is the size of cluster i. There are g clusters (groups), so each sum over i runs from 1 to g.

The trace function
The criterion is computed from the distance between each observation and its cluster seed, adjusted for missing values:

d(x, c) = sqrt( (n / m) · Σ_i (x_i − c_i)² )   (summed over the nonmissing variables)

where x_i is the value of the ith variable for the observation, c_i is the value of the ith variable for the cluster seed, n is the number of variables, and m is the number of nonmissing variables.

Trace summarizes matrix W into a single number by adding together its diagonal (variance) elements.
• Simply adding matrix elements together makes trace very efficient, but it also makes it scale dependent.
• The trace ignores the off-diagonal elements of a matrix, which in clustering means that all variables are treated as though they were independent (uncorrelated).
• The trace function adds all the within-cluster sums of squares (SS) together, which compounds the impact of information from correlated variables.

The Spherical Structure Problem
Because the trace function only considers the diagonal elements of W, it tends to form roughly spherical clusters. This can sometimes be managed using data transformation techniques.

Trace(W) also tends to produce clusters with about the same number of observations in each cluster. Alternative clustering techniques exist to manage this problem.

4.2.1 The K-Means Methodology

The three-step k-means methodology is given below:
1. Select (or specify) an initial set of cluster seeds.
2. Read the observations and update the seeds (known after the update as reference vectors). Repeat until convergence is attained.
3. Make one final pass through the data, assigning each observation to its nearest reference vector.

1. Select inputs.
2. Select k cluster centers.
3. Assign cases to the closest center.
4. Update cluster centers.
5. Re-assign cases.
6. Repeat steps 4 and 5 until convergence.
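The steps above can be written out as a small from-scratch sketch (plain Scala on in-memory points, not the Spark implementation; the seed choice, data, and fixed iteration count are simplifications):

```scala
// Naive k-means: assign each point to its nearest center, then re-average the centers
def kMeansSketch(points: Seq[Array[Double]], k: Int, iterations: Int): Seq[Array[Double]] = {
  def dist2(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  var centers: Seq[Array[Double]] = points.take(k).map(_.clone) // step 2: initial seeds
  for (_ <- 1 to iterations) {
    // steps 3 and 5: assign each case to its closest center
    val clusters = points.groupBy(p => centers.indices.minBy(i => dist2(p, centers(i))))
    // step 4: move each center to the mean of its members (empty clusters keep their seed)
    centers = centers.indices.map { i =>
      clusters.get(i) match {
        case Some(members) =>
          Array.tabulate(members.head.length)(d => members.map(_(d)).sum / members.size)
        case None => centers(i)
      }
    }
  }
  centers
}
```

For real data you would use the Spark MLlib implementation shown in the last section of this post.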

What Value of k to Use
The number of seeds, k, typically translates to the final number of clusters that are obtained. The choice of k can be made using a variety of methods:
• Subject-matter knowledge (there are most likely five groups)
• Convenience (it is convenient to market to three to four groups)
• Constraints (you have six products and need six segments)
• Arbitrarily (always pick 20)
• Based on the data (Ward’s method; a data-driven sketch follows this list)
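For the data-driven option, one common heuristic (not mentioned in the notes above) is the elbow method: train Spark's KMeans for several candidate k and compare the within-set sum of squared errors. The data path and candidate range below are illustrative.

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical data, parsed the same way as in the MLlib example below
val parsedData = sc.textFile("data/mllib/kmeans_data.txt")
  .map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

// Train k-means for each candidate k and print its cost; look for the "elbow"
// where adding more clusters stops reducing the cost substantially
(2 to 10).foreach { k =>
  val model = KMeans.train(parsedData, k, 20)
  println(s"k = $k, WSSSE = ${model.computeCost(parsedData)}")
}
```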

5. Spark MLlib Implementation

The Spark MLlib package supports several clustering models; this post covers K-means.

5.1 K-means

K-means clustering requires the number of clusters to be specified in advance.

The implementation takes the following parameters:
(1) k is the number of desired clusters.
(2) maxIterations is the maximum number of iterations to run.
(3) initializationMode specifies either random initialization or initialization via k-means||.
(4) runs is the number of times to run the k-means algorithm (k-means is not guaranteed to find a globally optimal solution, and when run multiple times on a given dataset, the algorithm returns the best clustering result).
(5) initializationSteps determines the number of steps in the k-means|| algorithm.
(6) epsilon determines the distance threshold within which we consider k-means to have converged.
(7) initialModel is an optional set of cluster centers used for initialization. If this parameter is supplied, only one run is performed.
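These parameters correspond to setters on org.apache.spark.mllib.clustering.KMeans (setRuns and setInitialModel also exist but are omitted here). A sketch of configuring them explicitly, with illustrative values and data parsed as in the example below:

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical data, parsed as in the example below
val parsedData = sc.textFile("data/mllib/kmeans_data.txt")
  .map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

// Configure the parameters listed above on a KMeans instance and run it
val kmeans = new KMeans()
  .setK(2)                                         // (1) desired number of clusters
  .setMaxIterations(20)                            // (2) maximum iterations
  .setInitializationMode(KMeans.K_MEANS_PARALLEL)  // (3) "k-means||" (or KMeans.RANDOM)
  .setInitializationSteps(5)                       // (5) steps of the k-means|| initialization
  .setEpsilon(1e-4)                                // (6) convergence distance threshold

val model = kmeans.run(parsedData)
```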

Example from the official Spark documentation:

```scala
// Import the required classes
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

// Load the data
val data = sc.textFile("data/mllib/kmeans_data.txt")

// Parse the data: split each line on spaces, convert the fields to Double,
// build a dense vector, and cache the result
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

// Cluster the data into two classes, using 20 iterations
val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)

// Evaluate the clustering by computing the within-set sum of squared errors
val WSSSE = clusters.computeCost(parsedData)
println("Within Set Sum of Squared Errors = " + WSSSE)

// Save the model
clusters.save(sc, "clustermodelpath")

// Load the model back later
val sameModel = KMeansModel.load(sc, "clustermodelpath")
```
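A possible follow-up (not part of the original example): once trained, the model can assign a new observation to its nearest cluster center with predict; the sample vector is hypothetical.

```scala
import org.apache.spark.mllib.linalg.Vectors

// Assign a new observation to its nearest cluster center
val point = Vectors.dense(0.1, 0.1, 0.1)
println(s"Point $point is assigned to cluster ${clusters.predict(point)}")
```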