Unsupervised Classification: Clustering
Tags (space-separated): Spark machine-learning
1. Introduction to Clustering
1.1 Definition of Clustering – Unsupervised Classification
Clustering searches for patterns in complex data; the patterns it finds can inform business decisions.
“Cluster analysis is a set of methods for constructing a (hopefully) sensible and informative classification of an initially unclassified set of data, using the variable values observed on each individual.”
— Everitt (1998), The Cambridge Dictionary of Statistics
1.2 Types of Clustering:
1.2.1 Hierarchical
Problems with hierarchical clustering
• Hierarchical methods do not scale up well
• Previous merges or divisions are irrevocable
• There are many hierarchical clustering methods, each defining cluster similarity in different ways, and no one method is “best”
1.2.2 Partitive
Problems with partitive clustering: it might
• make you guess the number of clusters present
• make assumptions about the shape of the clusters, usually that
they are (hyper-) spherical
• be influenced by seed location, outliers, and the order the
observations are read in
• be difficult to determine the optimal grouping, due to the combinatorial explosion of potential solutions.
3. Measuring Similarity
3.1 Euclidean Distance
Euclidean distance gives the linear distance between any two points in n-dimensional space.
It is a generalization of the Pythagorean theorem.
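A minimal Python sketch of the Euclidean distance (the function name is mine, for illustration):

```python
import math

def euclidean(p, q):
    """Linear distance between two points in n-dimensional space,
    a generalization of the Pythagorean theorem."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# A 3-4-5 right triangle: the hypotenuse has length 5.
print(euclidean((0, 0), (3, 4)))  # 5.0
```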
3.2 City Block (Manhattan) Distance
The distance between two points is measured
along the sides of a right triangle.
It is the distance that you would travel if you had to walk along the streets of a right-angled city.
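A matching sketch of the city-block distance; note that it is never smaller than the Euclidean distance between the same two points:

```python
def manhattan(p, q):
    """City-block distance: the sum of absolute coordinate differences,
    i.e. the length of a walk along the sides of the right triangle
    rather than its hypotenuse."""
    return sum(abs(a - b) for a, b in zip(p, q))

print(manhattan((0, 0), (3, 4)))  # 7 (versus 5.0 for the Euclidean distance)
```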
3.3 Hamming Distance
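Hamming distance counts the positions at which two equal-length sequences differ, which suits categorical or binary variables. A quick sketch:

```python
def hamming(p, q):
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(p, q))

print(hamming("karolin", "kathrin"))        # 3
print(hamming([1, 0, 1, 1], [1, 1, 1, 0]))  # 2
```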
3.4 Correlation
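Correlation-based similarity regards two observations as similar when their profiles across variables move together, regardless of scale. A sketch of the Pearson correlation:

```python
def pearson(x, y):
    """Pearson correlation: +1 for perfectly aligned profiles,
    -1 for perfectly opposed ones, near 0 for unrelated profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

print(pearson([1, 2, 3], [10, 20, 30]))  # ≈ 1.0: same pattern on a different scale
```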
3.5 Density-Based Similarity
Density-based methods define similarity as the distance
between derived density “bubbles” (hyper-spheres).
4. The Process of Applying Clustering
4.1 Preparation for Clustering
4.1.1 Data and sample selection (Who am I clustering?)
4.1.2 Variable selection/ clustering (What characteristics matter?)
• Variable reduction for redundancy and irrelevancy
• Variable clustering
• Selecting relevant variables is the secret to clustering success.
4.1.3 Graphical exploration (What shape/how many clusters?)
(1) Plotting can help to determine such key things as
» the shape of the clusters,
» the relative cluster dispersion (variation),
» the approximate number of clusters
(2) Dimension reduction
Dimension reduction techniques, such as principal component analysis and multidimensional scaling, can be applied to summarise multivariate data.
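As an illustrative sketch (NumPy only; variable names are mine), principal component analysis via the SVD of the centered data summarizes correlated variables with a few components:

```python
import numpy as np

rng = np.random.default_rng(0)
# 100 observations in 5 variables that really live on a 2-dimensional subspace
x = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 5))
xc = x - x.mean(axis=0)                      # center each variable
_, s, vt = np.linalg.svd(xc, full_matrices=False)
scores = xc @ vt[:2].T                       # project onto the first 2 components
explained = (s ** 2)[:2].sum() / (s ** 2).sum()
print(scores.shape, round(explained, 3))     # (100, 2) 1.0 — 2 components suffice
```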
4.1.4 Variable standardization (Are variable scales comparable?)
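When variable scales are not comparable, z-score standardization is a common remedy; a sketch (the helper name is mine):

```python
def standardize(values):
    """z-scores: rescale to mean 0 and standard deviation 1 so that
    variables on different scales contribute comparably to distances."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

incomes = [20_000, 40_000, 60_000]  # large scale
ages = [20, 40, 60]                 # small scale
print(standardize(incomes))  # ≈ [-1.2247, 0.0, 1.2247]
print(standardize(ages))     # identical: scale no longer dominates
```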
4.1.5 Variable transformation (Are variables correlated? Are clusters elongated?)
4.2 Partitive Clustering
Partitive clustering minimizes or maximizes a specified error criterion, for example
• cluster separation, or
• within-cluster similarity (homogeneity)
Natural Grouping Criterion
Borrowing concepts from least-squares estimation yields a natural grouping criterion:
• maximize the between-cluster sum of squares, or
• minimize the within-cluster sum of squares.
A large between-cluster sum of squares implies that the clusters are well separated.
A small within-cluster sum of squares implies that the members of each cluster are homogeneous.
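As a toy check of this least-squares decomposition, the sketch below computes the total (T), within-cluster (W), and between-cluster (B) sums of squares for two one-dimensional clusters and confirms T = W + B:

```python
import numpy as np

clusters = [np.array([1.0, 2.0, 3.0]), np.array([10.0, 11.0, 12.0])]
allx = np.concatenate(clusters)
grand = allx.mean()

t = ((allx - grand) ** 2).sum()                              # total SS
w = sum(((c - c.mean()) ** 2).sum() for c in clusters)       # within-cluster SS
b = sum(len(c) * (c.mean() - grand) ** 2 for c in clusters)  # between-cluster SS
print(t, w, b)  # 125.5 4.0 121.5 — small W, large B: tight, well-separated clusters
```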
The Trace Function
Decompose the total variation into within-group (W) and between-group (B) variation, T = W + B, where

W = Σᵢ₌₁ᵍ Σⱼ (xᵢⱼ − x̄ᵢ)(xᵢⱼ − x̄ᵢ)ᵀ
B = Σᵢ₌₁ᵍ nᵢ (x̄ᵢ − x̄)(x̄ᵢ − x̄)ᵀ

Here xᵢⱼ is the jth member of cluster i, x̄ᵢ is the cluster i mean, and x̄ is the vector of sample means for each variable. There are g clusters (groups), with nᵢ observations in cluster i.
The trace function
The distance between an observation and a cluster seed, adjusted for missing values, is

d = √( (n/m) Σᵢ (xᵢ − cᵢ)² )

where xᵢ is the value of the ith variable for the observation, cᵢ is the value of the ith variable for the cluster seed, n is the number of variables, and m is the number of nonmissing variables (the sum runs over the nonmissing variables).
Trace summarizes matrix W into a single number by
adding together its diagonal (variance) elements.
Simply adding matrix elements together makes trace very
efficient, but it also makes it scale dependent.
The trace ignores the off-diagonal elements of a matrix,
which in clustering means that all variables are treated
as though they were independent (uncorrelated).
The trace function adds all the within-cluster sums
of squares (SS) together, which compounds the impact
of information from correlated variables.
The Spherical Structure Problem
Because the trace function only considers the diagonal elements of W, it tends to form roughly spherical clusters. This can sometimes be managed using data transformation techniques.
Trace(W) also tends to produce clusters with about the same number
of observations in each cluster. Alternative clustering techniques exist
to manage this problem.
4.2.1 The K-Means Methodology
The three-step k-means methodology is given
below:
1. Select (or specify) an initial set of cluster seeds.
2. Read the observations and update the seeds (known
after the update as reference vectors). Repeat until
convergence is attained.
3. Make one final pass through the data, assigning
each observation to its nearest reference vector.
1. Select inputs.
2. Select k cluster centers.
3. Assign cases to the closest center.
4. Update cluster centers.
5. Re-assign cases.
6. Repeat steps 4 and 5 until convergence.
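The steps above can be sketched in plain Python (function and variable names are mine; a real application would use a library implementation):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Sketch of the k-means loop: seed selection, iterative updates,
    then a final assignment pass."""
    def nearest(p, centers):
        # index of the center closest to p (squared Euclidean distance)
        return min(range(len(centers)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))

    rng = random.Random(seed)
    centers = rng.sample(points, k)               # step 1: initial seeds
    for _ in range(iters):                        # step 2: update until stable
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        new = [tuple(sum(dim) / len(g) for dim in zip(*g)) if g else centers[i]
               for i, g in enumerate(groups)]
        if new == centers:
            break
        centers = new
    # step 3: final pass, assigning each observation to its nearest reference vector
    labels = [nearest(p, centers) for p in points]
    return centers, labels

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centers, labels = kmeans(pts, 2)
print(labels)  # the two well-separated groups receive two distinct labels
```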
What Value of k to Use
The number of seeds, k, typically translates to the
final number of clusters that are obtained. The choice
of k can be made using a variety of methods.
• Subject-matter knowledge (There
are most likely five groups.)
• Convenience (It is convenient to
market to three to four groups.)
• Constraints (You have six products
and need six segments.)
• Arbitrarily (Always pick 20.)
• Based on the data (Ward’s method)
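A common data-driven heuristic not listed above is the elbow method: run the clustering for increasing k and look for the value of k where the within-cluster sum of squares (WSS) curve flattens. A NumPy sketch under that assumption (the helper names are mine):

```python
import numpy as np

def kmeans_wss(x, k, iters=50, seed=0):
    """Run a basic k-means and return the within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        centers = np.array([x[labels == i].mean(axis=0) if (labels == i).any()
                            else centers[i] for i in range(k)])
    d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(d.min(axis=1).sum())

# Three well-separated blobs: the WSS curve should flatten sharply near k = 3.
rng = np.random.default_rng(1)
x = np.vstack([rng.normal(loc, 0.3, size=(30, 2)) for loc in [(0, 0), (5, 5), (10, 0)]])
for k in range(1, 6):
    print(k, round(kmeans_wss(x, k), 1))
```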
5. Implementation in Spark MLlib
The spark.mllib package supports the following clustering models.
5.1 K-means
The K-means method requires the number of clusters to be defined in advance.
Its implementation involves the following parameters:
(1) k is the number of desired clusters.
(2) maxIterations is the maximum number of iterations to run.
(3) initializationMode specifies either random initialization or initialization via k-means||.
(4) runs is the number of times to run the k-means algorithm (k-means is not guaranteed to find a globally optimal solution; when run multiple times on a given dataset, the algorithm returns the best clustering result).
(5) initializationSteps determines the number of steps in the k-means|| algorithm.
(6) epsilon determines the distance threshold within which we consider k-means to have converged.
(7) initialModel is an optional set of cluster centers used for initialization. If this parameter is supplied, only one run is performed.
Example from the official Spark documentation:

// Import the required classes
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

// Load the data
val data = sc.textFile("data/mllib/kmeans_data.txt")
// Parse the data: split each line on spaces, convert the fields to Double,
// build dense vectors, and cache the result
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

// Cluster the data into two classes using at most 20 iterations
val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)

// Evaluate the clustering by computing the Within Set Sum of Squared Errors
val WSSSE = clusters.computeCost(parsedData)
println("Within Set Sum of Squared Errors = " + WSSSE)

// Save the model, and load it back later
clusters.save(sc, "clustermodelpath")
val sameModel = KMeansModel.load(sc, "clustermodelpath")