Machine Learning Algorithms: k-means Clustering


1. How k-means Works
k-means is one of the simplest clustering algorithms, but the ideas behind it are far from trivial. Clustering is unsupervised learning: regression, naive Bayes, SVM, and so on all come with a class label y, meaning each training example is given its class. In clustering, the samples have only features x and no label y. Suppose, for example, that the stars in the universe are represented as a set of points (x, y, z) in three-dimensional space. The goal of clustering is to discover the latent class y of each sample x and group samples of the same class together. For the stars above, the result of clustering is a set of star clusters: points within a cluster are close to each other, while stars in different clusters are far apart.

The k-means algorithm proceeds as follows:

1. Randomly pick k elements from the data set D as the initial centers of the k clusters.

2. For each remaining element, compute its distance to each of the k cluster centers and assign it to the nearest cluster.

3. Based on the resulting clusters, recompute each of the k cluster centers as the arithmetic mean, per dimension, of all elements in the cluster.

4. Reassign every element in D to the nearest of the new centers.

5. Repeat steps 3 and 4 until the assignments no longer change (or the change between two consecutive iterations falls below a threshold, or a maximum number of iterations is reached).

6. Output the result.
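The steps above can be sketched in a few lines of vectorized NumPy. This is a minimal illustration under my own assumptions, not the article's implementation (that follows in section 2); the function name, the seed parameter, and the fixed iteration cap are illustrative choices.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """One run of the steps above on an (n, m) array X."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct samples as the initial centers.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for it in range(max_iter):
        # Steps 2/4: squared Euclidean distance of every sample to every center.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 5: stop once the assignments no longer change.
        if it > 0 and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: each center becomes the per-dimension mean of its cluster.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:          # guard against an empty cluster
                centroids[j] = members.mean(axis=0)
    return centroids, labels
```

The pairwise-distance matrix makes steps 2 and 4 a single line at the cost of O(n·k) memory, which matches the O(T·n·k·m) time bound discussed below.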

Time complexity: O(T*n*k*m)

Space complexity: O(n*m)

where n is the number of elements, k is the number of clusters chosen in step 1, m is the number of features per element, and T is the number of iterations in step 5.

Strengths and weaknesses:
Strengths: fast and simple.
Weaknesses: the final result depends on the choice of initial centers, so the algorithm easily gets stuck in a local optimum; convergence is slow on large data sets; and the number of clusters k must be known in advance.
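A common workaround for the initialization sensitivity noted above is to run k-means several times from different random starts and keep the run with the smallest within-cluster sum of squared errors (scikit-learn's `n_init` parameter does exactly this). Below is a self-contained NumPy sketch of that idea; the function names and defaults are my own, not from the article.

```python
import numpy as np

def kmeans_once(X, k, rng, max_iter=100):
    # One run of Lloyd's algorithm from one random initialization.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):   # converged
            break
        centroids = new
    # Within-cluster sum of squared errors (SSE) of this run.
    sse = ((X - centroids[labels]) ** 2).sum()
    return centroids, labels, sse

def kmeans_restarts(X, k, n_init=10, seed=0):
    # Run n_init times and keep the solution with the lowest SSE.
    rng = np.random.default_rng(seed)
    runs = [kmeans_once(X, k, rng) for _ in range(n_init)]
    return min(runs, key=lambda run: run[2])
```

Restarts reduce, but do not eliminate, the risk of a bad local optimum; smarter seeding such as k-means++ attacks the same weakness from the initialization side.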

2. Python Implementation

```python
from numpy import *
import matplotlib.pyplot as plt

# calculate Euclidean distance
def euclDistance(vector1, vector2):
    return sqrt(sum(power(vector2 - vector1, 2)))

# init centroids with random samples
# (note: may pick the same sample twice; sampling without replacement is safer)
def initCentroids(dataSet, k):
    numSamples, dim = dataSet.shape
    centroids = zeros((k, dim))
    for i in range(k):
        index = int(random.uniform(0, numSamples))
        centroids[i, :] = dataSet[index, :]
    return centroids

# k-means cluster
def kmeans(dataSet, k):
    numSamples = dataSet.shape[0]
    # first column stores which cluster this sample belongs to,
    # second column stores the error between this sample and its centroid
    clusterAssment = mat(zeros((numSamples, 2)))
    clusterChanged = True

    ## step 1: init centroids
    centroids = initCentroids(dataSet, k)

    while clusterChanged:
        clusterChanged = False
        ## for each sample
        for i in range(numSamples):
            minDist = float('inf')
            minIndex = 0
            ## step 2: find the closest centroid
            for j in range(k):
                distance = euclDistance(centroids[j, :], dataSet[i, :])
                if distance < minDist:
                    minDist = distance
                    minIndex = j
            ## step 3: update its cluster assignment
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True
                clusterAssment[i, :] = minIndex, minDist ** 2
        ## step 4: update centroids
        for j in range(k):
            pointsInCluster = dataSet[nonzero(clusterAssment[:, 0].A == j)[0]]
            if len(pointsInCluster) > 0:  # skip empty clusters to avoid NaN centers
                centroids[j, :] = mean(pointsInCluster, axis=0)

    print('Congratulations, cluster complete!')
    return centroids, clusterAssment

# show your cluster (only works with 2-D data)
def showCluster(dataSet, k, centroids, clusterAssment):
    numSamples, dim = dataSet.shape
    if dim != 2:
        print("Sorry! I can not draw because the dimension of your data is not 2!")
        return 1

    mark = ['or', 'ob', 'og', 'ok', '^r', '+r', 'sr', 'dr', '<r', 'pr']
    if k > len(mark):
        print("Sorry! Your k is too large! please contact Zouxy")
        return 1

    # draw all samples
    for i in range(numSamples):
        markIndex = int(clusterAssment[i, 0])
        plt.plot(dataSet[i, 0], dataSet[i, 1], mark[markIndex])

    # draw the centroids
    mark = ['Dr', 'Db', 'Dg', 'Dk', '^b', '+b', 'sb', 'db', '<b', 'pb']
    for i in range(k):
        plt.plot(centroids[i, 0], centroids[i, 1], mark[i], markersize=12)
    plt.show()

# test code
if __name__ == '__main__':
    ## step 1: load data
    dataSet = []
    fileIn = open(r'D:/mypython/MachinesLearning/Clustering/testSet.txt')
    for line in fileIn.readlines():
        lineArr = line.strip().split('\t')
        dataSet.append([float(lineArr[0]), float(lineArr[1])])
    fileIn.close()

    ## step 2: clustering
    dataSet = mat(dataSet)
    k = 4
    centroids, clusterAssment = kmeans(dataSet, k)

    ## step 3: show the result
    showCluster(dataSet, k, centroids, clusterAssment)
```

http://blog.csdn.net/zouxy09/article/details/17589329
