Machine Learning - XIII. Clustering (Week 8)


http://blog.csdn.net/pipisorry/article/details/44677097

Machine Learning - Andrew Ng course notes

Clustering

{An unsupervised learning algorithm: we learn from unlabeled data instead of labeled data.}

Unsupervised Learning: Introduction

A typical supervised learning problem: given a labeled training set, the goal is to find the decision boundary that separates the positive examples from the negative examples.


The supervised learning problem in this case is: given a set of labeled examples, fit a hypothesis to it.

Unsupervised learning: give this sort of unlabeled training set to an algorithm and simply ask the algorithm to find some structure in the data for us.

One type of structure an algorithm might find is that the points are grouped into two separate clusters; an algorithm that finds clusters like the ones circled is called a clustering algorithm.

Applications of clustering algorithms

Social network analysis: using information about who you email most frequently, and who those people email most frequently, to find coherent groups of people, for example coherent groups of friends in a social network.


K-Means

{A partition-based clustering algorithm}

In the clustering problem we are given an unlabeled data set and we would like to have an algorithm automatically group the data into coherent subsets or into coherent clusters for us.
Graphical illustration of the K-means steps

Randomly initialize two points, called the cluster centroids.

Note: two cluster centroids, because we want to group the data into two clusters.

K-means is an iterative algorithm that repeats two steps: first a cluster assignment step, then a move centroid step.

Cluster assignment step: go through each example and, depending on whether it is closer to the red cluster centroid or the blue cluster centroid, assign each data point to one of the two cluster centroids.

Move centroid step: move each cluster centroid to the average of the points colored the same colour (i.e., the points assigned to it).
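The two steps above can be sketched in NumPy. This is an illustrative sketch only (the function name and array shapes are my own); it assumes no cluster ends up empty, a case the notes return to later.

```python
import numpy as np

def kmeans_step(X, centroids):
    """One K-means iteration: cluster assignment, then move centroids.

    X is an (m, n) data matrix, centroids is (K, n). Assumes every
    cluster keeps at least one assigned point."""
    # Cluster assignment step: assign each example to its closest centroid.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    c = dists.argmin(axis=1)  # c[i] is the cluster index of example i
    # Move centroid step: move each centroid to the mean of its points.
    new_centroids = np.array([X[c == k].mean(axis=0)
                              for k in range(len(centroids))])
    return c, new_centroids
```

Iterating `kmeans_step` until the assignments stop changing is the whole algorithm.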

The effect of iterating these two steps:



Converged: if you keep running additional iterations of K-means, the cluster centroids will not change any further and the colours of the points will not change any further. At this point K-means has converged, and it has done a pretty good job finding the two clusters in this data.

Note: [Interactive visualization of K-means clustering]

The K-means algorithm


Note:

1. For unsupervised learning with K-means, the convention is that x(i) is an n-dimensional vector (x(i) in R^n), rather than an (n+1)-dimensional vector (we drop the x0 = 1 convention).

2. Cluster assignment step: find the index k that minimizes the distance between example x(i) and cluster centroid mu_k, and set c(i) := k. By convention the distance between x(i) and the cluster centroid is written as the squared distance ||x(i) - mu_k||^2.

3. What if there is a cluster centroid with zero points assigned to it? (This can happen, although in practice not that often.)

The simplest fix is to just eliminate that cluster centroid, ending up with K-1 clusters.

If you really need K clusters, you can instead randomly reinitialize that cluster centroid.
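A fuller sketch of the loop, including the reinitialization fix for empty clusters. The function name, the fixed iteration count, and the seed handling are my own choices; a real implementation would also stop once the assignments stop changing.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """K-means with random reinitialization of empty clusters (a sketch)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as K distinct, randomly chosen training examples.
    centroids = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iters):
        # Cluster assignment step.
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        c = dists.argmin(axis=1)
        # Move centroid step.
        for k in range(K):
            members = X[c == k]
            if len(members) == 0:
                # Empty cluster: reinitialize it to a random example
                # (alternatively, drop it and keep K-1 clusters).
                centroids[k] = X[rng.integers(len(X))]
            else:
                centroids[k] = members.mean(axis=0)
    return c, centroids
```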

K-means for non-separated clusters

K-means is also often applied to data sets that do not consist of several well-separated clusters.

An example of market segmentation: given examples of different people's heights and weights, design T-shirts in three sizes: small, medium and large.

Note: even though beforehand the data did not seem to contain 3 well-separated clusters, K-means will still separate the data into multiple clusters.
[Why K-Means is not always a good idea]


Optimization Objective

Most supervised learning algorithms have an optimization objective, some cost function that the algorithm tries to minimize.
Knowing the optimization objective of K-means is useful for two purposes:
First, it helps us debug the learning algorithm and make sure K-means is running correctly;
second, it helps K-means find better clusters and avoid local optima.

Variable definitions:

The optimization objective


Note: this cost function is sometimes also called the distortion function, or the distortion of the K-means algorithm.

Analysis of the optimization objective within the K-means algorithm

Note:

1. The cluster assignment step does not change the cluster centroids; it picks the values c(1), c(2), ..., c(m) that minimize the cost function (distortion function) J with the centroids held fixed. It is possible to prove that mathematically.
2. For the second step of K-means, again without proof, it can be shown mathematically that the move centroid step chooses the values of mu that minimize J, with the assignments held fixed.
3. So K-means takes the two sets of variables and partitions them into two halves: it first minimizes J with respect to the variables c, then minimizes J with respect to the variables mu, and then keeps iterating.
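Writing the distortion as J(c, mu) = (1/m) * sum_i ||x(i) - mu_{c(i)}||^2, it is cheap to compute, and a useful debugging check is that J never increases from one full iteration to the next. A minimal sketch (the function name is my own):

```python
import numpy as np

def distortion(X, c, centroids):
    """J = (1/m) * sum_i ||x_i - mu_{c_i}||^2, the K-means cost."""
    return float(np.mean(np.sum((X - centroids[c]) ** 2, axis=1)))
```

Because the assignment step minimizes J over c and the move centroid step minimizes J over mu, the value of `distortion` after each iteration must be non-increasing; if it ever goes up, there is a bug in the implementation.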


Random Initialization

{How to initialize K-means, and how to make K-means avoid local optima}

Random initialization

Note: K-means can end up converging to different solutions depending on exactly how the clusters were initialized, that is, depending on the random initialization. In particular, K-means can actually end up at local optima.

Local optima

Note: a local optimum means K-means is not doing a good job of minimizing the distortion function J.

Multiple random initializations

Note:

1. Every centroid needs to start at a different location; otherwise, centroids that start together will be updated in the same way at each iteration and will never spread out into different clusters.

2. But if K is very large (much greater than 10), multiple random initializations are less likely to make a huge difference: there is a much higher chance that your first random initialization already gives a pretty decent solution, and multiple random initializations will probably give only a slightly better one.
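The multiple-random-initializations recipe can be sketched as: run K-means from many random starts and keep the solution with the lowest distortion J. This is a self-contained illustration; the function names and the default of 50 runs are my own choices.

```python
import numpy as np

def kmeans_once(X, K, n_iters, rng):
    """One K-means run from a random start; returns (assignments, centroids, J)."""
    # Initialize each centroid at a different training example (all distinct).
    mu = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iters):
        c = np.linalg.norm(X[:, None] - mu[None], axis=2).argmin(axis=1)
        for k in range(K):
            if np.any(c == k):
                mu[k] = X[c == k].mean(axis=0)
    J = float(np.mean(np.sum((X - mu[c]) ** 2, axis=1)))
    return c, mu, J

def kmeans_best_of(X, K, n_init=50, n_iters=50, seed=0):
    """Run K-means n_init times; keep the run with the lowest distortion J."""
    rng = np.random.default_rng(seed)
    runs = [kmeans_once(X, K, n_iters, rng) for _ in range(n_init)]
    return min(runs, key=lambda run: run[2])
```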


Choosing the Number of Clusters

What is the right value of K?

 

Note: there actually isn't a great way of answering this or doing it automatically; by far the most common way of choosing the number of clusters is still to choose it manually, by looking at visualizations, at the output of the clustering algorithm, or something else.

Choosing the value of K


Note:

1. Elbow method: distortion goes down rapidly until K = 3, and goes down very slowly after that; so pick K = 3.
2. This method does not work for every clustering problem: just as often you end up with a plot like the one on the right, where it is not clear where the elbow really is, which makes it harder to choose the number of clusters this way.

3. Too many clusters will still overfit the data, as the algorithm will find "structure" that is purely fictitious and not truly present in the data.
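The Elbow method can be sketched by running K-means over a range of K values and plotting the final distortion J(K). Note that J always decreases as K grows, so you look for the point where the decrease slows sharply, not for a minimum. A self-contained sketch (the function name is my own):

```python
import numpy as np

def kmeans_J(X, K, n_iters=50, seed=0):
    """Run K-means once and return the final distortion J (elbow-plot helper)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iters):
        c = np.linalg.norm(X[:, None] - mu[None], axis=2).argmin(axis=1)
        for k in range(K):
            if np.any(c == k):
                mu[k] = X[c == k].mean(axis=0)
    return float(np.mean(np.sum((X - mu[c]) ** 2, axis=1)))

# Elbow plot: compute J for each candidate K and look for the bend, e.g.
#   Js = [kmeans_J(X, K) for K in range(1, 9)]
```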


Note:

1. If K-means is being run for some later, downstream purpose, such as market segmentation, then often a better way to determine the number of clusters is to see how well different numbers of clusters serve that downstream purpose.

2. The more finely the clothing sizes are subdivided, the higher the cost; the right trade-off depends on the market.

Summary:

The better way to think about how to choose the number of clusters is to ask: for what purpose are you running K-means? Then choose the number of clusters K that best serves that later purpose.

You can also define a metric (for example, average cluster diameter) and use binary search to determine the best k.

[(Stack Overflow) How do I determine k when using k-means clustering?]

[Using a parametric bootstrap to find the best K for K-means]




Review


