Machine Learning: Unsupervised Learning with the K-means Clustering Method

Source: Internet | Editor: 程序博客网 | Time: 2024/06/16 20:13

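I. Introduction

The lectures after the last one on SVM covered learning theory, which is fairly deep and which I still need time to digest. So I will first summarize the first unsupervised machine learning algorithm: K-means clustering.

In unsupervised learning the data samples carry no labels, and the learning algorithm has to discover the internal structure and regularities of the data by itself. It is like solving exercises without an answer key, so the results are harder to evaluate and often weaker than those of supervised learning on the same task. However, the biggest practical obstacle in machine learning today is still the need for large amounts of labeled samples: good algorithms need data to train on, and not everyone can easily obtain labeled data. Research on unsupervised learning algorithms is therefore both necessary and long-term.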

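II. Algorithm Overview

K-means is an iterative algorithm built on a simple idea: given a number of clusters K, it partitions the samples into K clusters and determines which cluster each sample point belongs to.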

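First, manually specify the number K of cluster centers and initialize their positions according to some rule. Each iteration then consists of two steps:

(1) Cluster assignment: go through the samples and assign each one to the nearest of the K cluster centers.

(2) Move centers: move each cluster center to the mean of the sample points assigned to it in the previous step.

Repeat these two steps until convergence. [Figure: illustration of the K-means iterations]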

An animated version is also available at: https://datasciencelab.files.wordpress.com/2013/12/p_n100_k7.gif
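The two steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the full implementation given later in the article: the function names (`assign`, `move_centers`, `kmeans`), the random choice of initial centers, and the rule that an empty cluster keeps its old center are all assumptions made for the example.

```python
import random


def assign(points, centers):
    """Cluster assignment: index of the nearest center for each point."""
    labels = []
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
        labels.append(dists.index(min(dists)))
    return labels


def move_centers(points, labels, centers):
    """Move each center to the mean of the points assigned to it."""
    new_centers = []
    for i, old in enumerate(centers):
        members = [p for p, l in zip(points, labels) if l == i]
        if not members:
            # Assumption: an empty cluster simply keeps its old center.
            new_centers.append(old)
            continue
        new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return new_centers


def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: k random samples as initial centers
    labels = None
    for _ in range(max_iter):
        new_labels = assign(points, centers)
        if new_labels == labels:  # no assignment changed: converged
            break
        labels = new_labels
        centers = move_centers(points, labels, centers)
    return centers, labels


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, labels = kmeans(points, k=2)
print(sorted(centers), labels)
```

On this tiny data set the two lower-left points end up in one cluster and the two upper-right points in the other, whichever pair of points is drawn as the initial centers.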

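The algorithm is very simple. A few points deserve attention:

(1) When computing the distance from each sample point to each cluster center, the ordinary distance between two points can be used, but other distance measures are possible as well. For n-dimensional points x and y:

① Euclidean distance, the most common choice, i.e. the 2-norm: $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

② City block distance, often used in image processing: $d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$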
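As a small illustration of the two distance measures, the sketch below computes both for the same pair of points. The function names are hypothetical and chosen only for this example.

```python
import math


def euclidean_distance(a, b):
    """2-norm of the difference between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cityblock_distance(a, b):
    """Sum of absolute coordinate differences (1-norm)."""
    return sum(abs(x - y) for x, y in zip(a, b))


p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean_distance(p, q))  # 5.0
print(cityblock_distance(p, q))  # 7.0
```

Note that if a distance other than the Euclidean one is used in the assignment step, the mean is no longer guaranteed to be the center that minimizes the within-cluster cost.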

(2) What exactly is k-means optimizing? (The k-means objective function)

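The following formulas describe the steps of the algorithm:

① Initialize the k cluster centers $\mu_1, \dots, \mu_k$ (k is usually small, around 2 to 5).

② Assign each sample point $x_j$ ($j = 1, \dots, m$) to the cluster center nearest to it, where t denotes the t-th iteration: $c_j^{(t)} = \arg\min_{i} \lVert x_j - \mu_i^{(t)} \rVert^2$

③ Update each cluster center to the mean of the sample points that belong to it: $\mu_i^{(t+1)} = \frac{\sum_{j=1}^{m} 1\{c_j^{(t)} = i\}\, x_j}{\sum_{j=1}^{m} 1\{c_j^{(t)} = i\}}$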

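Its objective function is:

$J(c, \mu) = \sum_{j=1}^{m} \lVert x_j - \mu_{c_j} \rVert^2$

The parameters fall into two groups. c is defined per sample point and has m entries, each of which is the index of the cluster center that the sample point belongs to; μ is defined per cluster center and has k entries. The optimization is exactly the iteration of the algorithm, and it has two parts: each step fixes one group of parameters and optimizes the other. The assignment step minimizes J over c with μ fixed, and the move-centers step minimizes J over μ with c fixed.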
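To make the alternating optimization concrete, the sketch below evaluates the cost J (taken here as the sum of squared distances from each point to its assigned center) before and after each of the two steps on a tiny one-dimensional data set. The helper names and the deliberately poor starting assignment are assumptions made for illustration.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cost(points, labels, centers):
    """J(c, mu): sum of squared distances to the assigned centers."""
    return sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))


def assign_step(points, centers):
    """Minimize J over the assignments c with the centers mu fixed."""
    return [min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            for p in points]


def update_step(points, labels, centers):
    """Minimize J over the centers mu with the assignments c fixed."""
    new_centers = []
    for i, old in enumerate(centers):
        members = [p for p, l in zip(points, labels) if l == i]
        if members:
            new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            new_centers.append(old)  # assumption: empty cluster keeps its center
    return new_centers


points = [(0.0,), (1.0,), (9.0,), (10.0,)]
centers = [(0.0,), (1.0,)]
labels = [0, 0, 0, 0]                            # deliberately poor starting assignment

j0 = cost(points, labels, centers)
labels = assign_step(points, centers)            # fix mu, optimize c
j1 = cost(points, labels, centers)
centers = update_step(points, labels, centers)   # fix c, optimize mu
j2 = cost(points, labels, centers)
print(j0, j1, j2)
```

Neither step can increase J, which is why the printed values are non-increasing and why the iteration converges, although possibly to a local optimum.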

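That leaves the most important question: how should the number of k-means cluster centers and their initial positions be chosen?

① First, k < m: the number of cluster centers must be smaller than the number of sample points. Second, the elbow method can be used to choose k. [Figure: elbow plot of the cost against k]

Run k-means for different values of k, typically k = 1 to 10, and plot the final value of the cost function against k. The k at the elbow-like point of the curve is usually a good choice. This method does not always work, though, since the curve may show no clear elbow. Often the best approach is still to choose k manually, based on the scenario in which k-means is applied.

② Different initial positions of the cluster centers can lead to different final results and convergence speeds. One way to choose them is as follows: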
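The elbow curve can be computed as in the following sketch, which runs a plain k-means for k = 1 to 6 on synthetic data and prints the final cost for each k. Everything here is an assumption made for the example: the three Gaussian blobs, the helper names, the random initialization, and the choice to keep the best of several restarts for each k.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_cost(points, k, rng, max_iter=100):
    """Run plain k-means once and return the final cost (sum of squared distances)."""
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda i: sq_dist(p, centers[i]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for i in range(k):
            members = [p for p, l in zip(points, labels) if l == i]
            if members:  # an empty cluster keeps its old center
                centers[i] = tuple(sum(col) / len(members) for col in zip(*members))
    return sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))


def elbow_curve(points, k_values, restarts=5, seed=0):
    """Best final cost over several restarts, for each k."""
    rng = random.Random(seed)
    return [min(kmeans_cost(points, k, rng) for _ in range(restarts))
            for k in k_values]


def make_blobs(seed=0):
    """Synthetic data: three well-separated 2-D blobs of 30 points each."""
    rng = random.Random(seed)
    blob_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
    return [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
            for cx, cy in blob_centers for _ in range(30)]


points = make_blobs()
k_values = list(range(1, 7))
costs = elbow_curve(points, k_values)
for k, j in zip(k_values, costs):
    print(k, round(j, 2))
```

With data like this the cost should drop sharply up to k = 3 and flatten afterwards, which is the elbow. On real data the bend is often much less pronounced.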

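1. Randomly choose one point from the input data set (which is to be split into k clusters) as the first cluster center.
2. For every point x in the data set, compute its distance D(x) to the nearest cluster center among those already chosen.
3. Choose a new data point as the next cluster center, such that points with a larger D(x) are more likely to be chosen. (In the k-means++ scheme, the probability is proportional to D(x)².)
4. Repeat steps 2 and 3 until k cluster centers have been chosen.
5. Run the standard k-means algorithm with these k initial cluster centers.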
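Steps 1 to 4 can be sketched as follows. The function name `kmeanspp_init` is hypothetical, and the weighting by squared distance follows the usual k-means++ convention, which is one specific way of making far-away points more likely. The sketch also assumes that the data contains at least k distinct points.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_init(points, k, seed=0):
    """Pick k initial centers; far-away points are more likely to be chosen."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]  # step 1: first center uniformly at random
    while len(centers) < k:
        # step 2: squared distance of every point to its nearest chosen center
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        # step 3: sample the next center with probability proportional to D(x)^2
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers  # step 4 is the loop condition above


points = [(0.0, 0.0), (0.0, 0.1), (10.0, 10.0), (10.0, 10.1)]
centers = kmeanspp_init(points, k=2, seed=1)
print(centers)
```

A point that is already a center has D(x) = 0 and therefore weight zero, so it cannot be picked a second time.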

III. Code Implementation

from math import pi, sin, cos
from collections import namedtuple
from random import random, choice
from copy import copy

FLOAT_MAX = 1e100


class Point:
    __slots__ = ["x", "y", "group"]

    def __init__(self, x=0.0, y=0.0, group=0):
        self.x, self.y, self.group = x, y, group


def generate_points(npoints, radius):
    points = [Point() for _ in range(npoints)]

    # note: this is not a uniform 2-d distribution
    for p in points:
        r = random() * radius
        ang = random() * 2 * pi
        p.x = r * cos(ang)
        p.y = r * sin(ang)

    return points


def nearest_cluster_center(point, cluster_centers):
    """Distance and index of the closest cluster center"""
    def sqr_distance_2D(a, b):
        return (a.x - b.x) ** 2 + (a.y - b.y) ** 2

    min_index = point.group
    min_dist = FLOAT_MAX

    for i, cc in enumerate(cluster_centers):
        d = sqr_distance_2D(cc, point)
        if min_dist > d:
            min_dist = d
            min_index = i

    return (min_index, min_dist)


def kpp(points, cluster_centers):
    """k-means++ initialization of the cluster centers"""
    cluster_centers[0] = copy(choice(points))
    d = [0.0 for _ in range(len(points))]

    for i in range(1, len(cluster_centers)):
        # squared distance of every point to its nearest chosen center
        total = 0
        for j, p in enumerate(points):
            d[j] = nearest_cluster_center(p, cluster_centers[:i])[1]
            total += d[j]

        # pick the next center with probability proportional to d[j]
        total *= random()

        for j, di in enumerate(d):
            total -= di
            if total > 0:
                continue
            cluster_centers[i] = copy(points[j])
            break

    for p in points:
        p.group = nearest_cluster_center(p, cluster_centers)[0]


def lloyd(points, nclusters):
    cluster_centers = [Point() for _ in range(nclusters)]

    # call k++ init
    kpp(points, cluster_centers)

    lenpts10 = len(points) >> 10

    changed = 0
    while True:
        # group element for centroids are used as counters
        for cc in cluster_centers:
            cc.x = 0
            cc.y = 0
            cc.group = 0

        for p in points:
            cluster_centers[p.group].group += 1
            cluster_centers[p.group].x += p.x
            cluster_centers[p.group].y += p.y

        for cc in cluster_centers:
            cc.x /= cc.group
            cc.y /= cc.group

        # find closest centroid of each point
        changed = 0
        for p in points:
            min_i = nearest_cluster_center(p, cluster_centers)[0]
            if min_i != p.group:
                changed += 1
                p.group = min_i

        # stop when 99.9% of points are good
        if changed <= lenpts10:
            break

    for i, cc in enumerate(cluster_centers):
        cc.group = i

    return cluster_centers


def print_eps(points, cluster_centers, W=400, H=400):
    """Write the clustering result to stdout as an EPS image"""
    Color = namedtuple("Color", "r g b")

    colors = []
    for i in range(len(cluster_centers)):
        colors.append(Color((3 * (i + 1) % 11) / 11.0,
                            (7 * i % 11) / 11.0,
                            (9 * i % 11) / 11.0))

    max_x = max_y = -FLOAT_MAX
    min_x = min_y = FLOAT_MAX

    for p in points:
        if max_x < p.x: max_x = p.x
        if min_x > p.x: min_x = p.x
        if max_y < p.y: max_y = p.y
        if min_y > p.y: min_y = p.y

    scale = min(W / (max_x - min_x),
                H / (max_y - min_y))
    cx = (max_x + min_x) / 2
    cy = (max_y + min_y) / 2

    print("%%!PS-Adobe-3.0\n%%%%BoundingBox: -5 -5 %d %d" % (W + 10, H + 10))

    print("/l {rlineto} def /m {rmoveto} def\n" +
          "/c { .25 sub exch .25 sub exch .5 0 360 arc fill } def\n" +
          "/s { moveto -2 0 m 2 2 l 2 -2 l -2 -2 l closepath " +
          "   gsave 1 setgray fill grestore gsave 3 setlinewidth" +
          " 1 setgray stroke grestore 0 setgray stroke }def")

    for i, cc in enumerate(cluster_centers):
        print("%g %g %g setrgbcolor" %
              (colors[i].r, colors[i].g, colors[i].b))

        for p in points:
            if p.group != i:
                continue
            print("%.3f %.3f c" % ((p.x - cx) * scale + W / 2,
                                   (p.y - cy) * scale + H / 2))

        print("\n0 setgray %g %g s" % ((cc.x - cx) * scale + W / 2,
                                       (cc.y - cy) * scale + H / 2))

    print("\n%%EOF")


def main():
    npoints = 30000
    k = 7  # number of clusters

    points = generate_points(npoints, 10)
    cluster_centers = lloyd(points, k)
    print_eps(points, cluster_centers)


if __name__ == "__main__":
    main()
