SparkML Clustering (Part 1): K-means Clustering


------------------------------ Table of Contents ------------------------------

K-means Theory

Matlab Implementation

Spark Source Code Analysis

Spark Source Code

Spark Experiment

--------------------------------------------------------------------------------

The theory behind K-means clustering is covered at http://cs229.stanford.edu/notes/cs229-notes7a.pdf

The algorithm can be summarized in three steps (a minimal sketch follows the list):

1. Initialize K cluster centers.

2. Compute the distance from each point to the K cluster centers.

3. Based on those distances, find better K cluster centers: assign each point to its nearest center, move each center to the mean of its assigned points, and repeat until convergence.
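To make these three steps concrete, here is a minimal single-machine sketch in plain Scala (no Spark). The sample data, k, maxIter, and the take-the-first-k initialization are illustrative assumptions, not Spark's implementation:

object LloydSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def kmeans(points: Seq[Point], k: Int, maxIter: Int): Array[Point] = {
    // Step 1: initialize k centers (here, naively, the first k points)
    var centers = points.take(k).toArray
    for (_ <- 1 to maxIter) {
      // Step 2: assign each point to its nearest center
      val clusters = points.groupBy(p => centers.indices.minBy(i => sqDist(p, centers(i))))
      // Step 3: move each center to the mean of its assigned points
      centers = centers.indices.map { i =>
        val members = clusters.getOrElse(i, Seq(centers(i)))  // keep empty clusters in place
        members.transpose.map(_.sum / members.size).toArray
      }.toArray
    }
    centers
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(Array(1.0, 1.0), Array(1.2, 0.8), Array(5.0, 5.0), Array(5.1, 4.9))
    kmeans(data, k = 2, maxIter = 10).foreach(c => println(c.mkString("[", ", ", "]")))
  }
}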


Question 1: How do we determine these K initial centers? Randomly, or with an algorithm?

Answer: We can choose the K points at random, or use an algorithm to determine them. The k-means++ algorithm is commonly used for this selection.

For details on k-means++, see https://en.wikipedia.org/wiki/K-means%2B%2B and http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf. The procedure is as follows:

1. Randomly select one point from the input dataset as the first cluster center C1.

2. Compute the distance from each data point to C1 and store it in D(x).

3. Select the data point with the largest distance in D(x) as the second cluster center C2.

4. Compute the distance from each data point to C1 and to C2; each point is assigned to whichever cluster center is nearer.

At this point, all data points are divided into two clusters: one attached to C1, the other attached to C2.

Question 2: What if K is greater than 2? Do we just collect every point's distance to the nearer of C1 and C2 into one D(x), and again pick the point with the largest distance?

Answer: You could! But an algorithm always wants an edge over its predecessors, and getting that edge means taking more into account. The following steps are the result of taking more into account:

5. Compute the distance from each data point to its nearest cluster center (continuing step 4: for points in cluster C1, the distance to C1; for points in cluster C2, the distance to C2), then sum all of these distances to obtain Sum.

6. Draw a random value temp uniformly from (0, Sum). Then subtract point by point: temp = temp - D(x_i) for i = 1, 2, .... The moment temp < 0, stop; the point x_i at which this happens becomes the next cluster center. (This samples points with probability proportional to their distance D(x).)

7. Continue this way until the K-th center has been found (a runnable sketch of steps 5-7 follows).
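Here is a minimal Scala sketch of the seeding procedure in steps 5-7. One hedge: the text above weights points by their plain distance D(x), while standard k-means++ weights by the squared distance; the sketch uses squared distances, but the temp-subtraction mechanism is exactly step 6:

import scala.util.Random
import scala.collection.mutable.ArrayBuffer

object KMeansPlusPlusSketch {
  type Point = Array[Double]

  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def chooseCenters(points: IndexedSeq[Point], k: Int, rnd: Random): IndexedSeq[Point] = {
    // Step 1: the first center is a uniformly random data point
    val centers = ArrayBuffer(points(rnd.nextInt(points.length)))
    while (centers.length < k) {
      // D(x): each point's (squared) distance to its nearest chosen center
      val d = points.map(p => centers.map(c => sqDist(p, c)).min)
      val sum = d.sum
      // Step 6: draw temp in (0, Sum), subtract D(x_i) for i = 1, 2, ...;
      // the point at which temp drops below zero becomes the next center
      var temp = rnd.nextDouble() * sum
      var i = 0
      while (temp >= 0 && i < points.length) {
        temp -= d(i)
        i += 1
      }
      centers += points(i - 1)
    }
    centers.toIndexedSeq
  }
}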


Now let's walk through the algorithm with a Matlab implementation.

Data and code are available via Baidu Cloud: http://pan.baidu.com/s/1mh9ldnM (password: dcz4)

% Import the data
filename = 'C:\Users\andrew\Desktop\kmeans\IrisData.txt';
delimiter = '\t';
startRow = 2;
formatSpec = '%f%f%f%f%s%[^\n\r]';
fileID = fopen(filename,'r');
dataArray = textscan(fileID, formatSpec, 'Delimiter', delimiter, 'HeaderLines', startRow-1, 'ReturnOnError', false);
fclose(fileID);
VarName1 = dataArray{:, 1};
Sepalwidth = dataArray{:, 2};   % sepal width
Petallength = dataArray{:, 3};  % petal length
Petalwidth = dataArray{:, 4};   % petal width
Species = dataArray{:, 5};      % species

% Inspect the distribution of the samples
figure(1)
plot(Petalwidth,Petallength,'o')
xlabel('Petal width')
ylabel('Petal length')
title('Distribution before clustering')

% For a more intuitive view, mark each species with its own symbol
% Rows 1-50:    I. setosa
% Rows 51-100:  I. versicolor
% Rows 101-150: I. virginica
figure(2)
hold on
plot(Petalwidth(1:50),Petallength(1:50),'rs','LineWidth',2,'MarkerEdgeColor','k','MarkerFaceColor','g','MarkerSize',10)
plot(Petalwidth(51:100),Petallength(51:100),'mo','LineWidth',2,'MarkerEdgeColor','k','MarkerFaceColor',[.49 1 .63],'MarkerSize',10)
plot(Petalwidth(101:end),Petallength(101:end),'rh','LineWidth',2,'MarkerEdgeColor','k','MarkerFaceColor','r','MarkerSize',10)
xlabel('Petal width')
ylabel('Petal length')
title('Actual species distribution')
hold off

% Call our own kmeans implementation and inspect the result
X = [Petalwidth,Petallength];
[cid,nr,centers] = mykmeans(X,3);
figure(3)
hold on
for i = 1:length(Petalwidth)
    if cid(i) == 1
        plot(Petalwidth(i),Petallength(i),'rs','LineWidth',2,'MarkerEdgeColor','k','MarkerFaceColor','g','MarkerSize',10)
    end
    if cid(i) == 2
        plot(Petalwidth(i),Petallength(i),'mo','LineWidth',2,'MarkerEdgeColor','k','MarkerFaceColor',[.49 1 .63],'MarkerSize',10)
    end
    if cid(i) == 3
        plot(Petalwidth(i),Petallength(i),'rh','LineWidth',2,'MarkerEdgeColor','k','MarkerFaceColor','r','MarkerSize',10)
    end
end
title('Clustering result')
hold off

% centers =
%
%    2.0478    5.6261
%    0.2460    1.4620
%    1.3593    4.2926
function [cid,nr,centers] = mykmeans(X,k)
% Parameters:
%   X:       data set (n*m)
%   k:       number of cluster centers
%   cid:     cluster index assigned to each data point
%   nr:      number of points in each cluster
%   centers: the final cluster centers

% Find the initial k cluster centers
[n,m] = size(X);
cid = zeros(1,n);
nr = zeros(1,k);
% Pick k cluster centers at random from the input data
temp = randperm(n);
a = temp(1:k);
nc = X(a',:);
% Optimize the k cluster centers (Lloyd iterations)
maxIter = 100;
iter = 1;
while iter < maxIter
    % Assign each point to its nearest center
    for i = 1:n
        dist = sum((repmat(X(i,:),k,1)-nc).^2,2);
        [~,ind] = min(dist);
        cid(i) = ind;
    end
    % Move each center to the mean of its assigned points
    for i = 1:k
        ind = find(cid==i);
        nc(i,:) = mean(X(ind,:));
        nr(i) = length(ind);
    end
    iter = iter + 1;
end
% Refinement pass: reassign single points when that lowers the size-adjusted cost
tempMaxiter = 2;
tempIter = 1;
tempMove = 1;
while tempIter < tempMaxiter && tempMove ~= 0
    tempMove = 0;
    for i = 1:n
        dist = sum((repmat(X(i,:),k,1)-nc).^2,2);
        r = cid(i);
        dadj = nr./(nr+1).*dist';
        [~,ind] = min(dadj);
        if ind ~= r
            cid(i) = ind;
            ic = find(cid==ind);
            nc(ind,:) = mean(X(ic,:));
            tempMove = 1;
        end
    end
    tempIter = tempIter + 1;
end
centers = nc;
end


SparkML Source Code Analysis

Spark's K-means implementation consists of the following pieces:

The KMeans class and its companion object KMeans.

The VectorWithNorm class (a vector paired with its precomputed norm, in support of the fastSquaredDistance function).

The KMeansModel class and its companion object KMeansModel.

Responsibilities:

KMeans class:

Holds the training parameters (number of cluster centers: k; maximum iterations: maxIterations; number of parallel runs: runs, default 1; initialization algorithm: initializationMode; random seed: seed; convergence threshold: epsilon; etc.) — see the configuration sketch below.

Trains the model (run): initializes the centers (randomly or via the k-means|| variant of k-means++) and iterates the cluster-center computation (runAlgorithm).

Performs the k-means|| parallel initialization (initKMeansParallel).
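For reference, this is how those parameters are typically wired together through the setters shown in the source below (the values here are illustrative, not recommendations):

import org.apache.spark.mllib.clustering.KMeans

// Configure a KMeans instance with the parameters described above
val kmeans = new KMeans()
  .setK(3)                                         // number of cluster centers
  .setMaxIterations(50)                            // maximum number of iterations
  .setInitializationMode(KMeans.K_MEANS_PARALLEL)  // "k-means||"; KMeans.RANDOM also works
  .setSeed(42L)                                    // seed for center initialization
  .setEpsilon(1e-4)                                // convergence threshold
// val model = kmeans.run(trainingData)            // trainingData: a cached RDD[Vector]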


KMeans companion object:

Provides overloaded train methods for different combinations of input parameters.

Finds the cluster center closest to a given point (findClosest) and computes a point's cost (pointCost).

Computes squared distances quickly (fastSquaredDistance) — see the sketch below.
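Two tricks are at work here. findClosest skips a center whenever the lower bound (||a|| - ||b||)^2 already exceeds the best squared distance found so far, which is valid because ||a - b|| >= | ||a|| - ||b|| |. When the bound does not rule a center out, fastSquaredDistance uses precomputed norms. Below is a minimal sketch of the underlying identity; it is a simplification, since the real MLUtils.fastSquaredDistance also falls back to an exact computation when this formula would lose precision to cancellation:

// ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * (a . b), with norms precomputed
def sqDistFromNorms(a: Array[Double], normA: Double,
                    b: Array[Double], normB: Double): Double = {
  var dot = 0.0
  var i = 0
  while (i < a.length) { dot += a(i) * b(i); i += 1 }
  normA * normA + normB * normB - 2.0 * dot
}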


KMeansModel class:

Computes the sum of squared distances from each sample to its nearest cluster center (computeCost).

Saves the model (save).

Predicts cluster assignments (predict) — a usage sketch follows.
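A quick usage sketch (it assumes `clusters`, `parsedData`, and `sc` exist as in the experiment at the end of this post; the test point and save path are illustrative):

import org.apache.spark.mllib.linalg.Vectors

val idx: Int = clusters.predict(Vectors.dense(1.4, 0.2))  // cluster index of one point
val wssse: Double = clusters.computeCost(parsedData)      // sum of squared distances
clusters.save(sc, "target/tmp/KMeansModel")               // illustrative path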


KMeansModel companion object:

Loads a saved model (load).

Handles the model format version (object SaveLoadV1_0).

Spark source code:

KMeans class

@Since("0.8.0")
class KMeans private (
    private var k: Int,
    private var maxIterations: Int,
    private var runs: Int,
    private var initializationMode: String,
    private var initializationSteps: Int,
    private var epsilon: Double,
    private var seed: Long) extends Serializable with Logging {

  /**
   * Constructs a KMeans instance with default parameters: {k: 2, maxIterations: 20, runs: 1,
   * initializationMode: "k-means||", initializationSteps: 5, epsilon: 1e-4, seed: random}.
   */
  // Constructor. Defaults: 2 clusters, 20 iterations, 1 run,
  // initialization mode KMeans.K_MEANS_PARALLEL, 5 initialization steps, epsilon = 1e-4
  @Since("0.8.0")
  def this() = this(2, 20, 1, KMeans.K_MEANS_PARALLEL, 5, 1e-4, Utils.random.nextLong())

  /**
   * Number of clusters to create (k).
   */
  @Since("1.4.0")
  def getK: Int = k

  /**
   * Set the number of clusters to create (k). Default: 2.
   */
  @Since("0.8.0")
  def setK(k: Int): this.type = {
    this.k = k
    this
  }

  /**
   * Maximum number of iterations allowed.
   */
  @Since("1.4.0")
  def getMaxIterations: Int = maxIterations

  /**
   * Set maximum number of iterations allowed. Default: 20.
   */
  @Since("0.8.0")
  def setMaxIterations(maxIterations: Int): this.type = {
    this.maxIterations = maxIterations
    this
  }

  /**
   * The initialization algorithm. This can be either "random" or "k-means||".
   */
  @Since("1.4.0")
  def getInitializationMode: String = initializationMode

  /**
   * Set the initialization algorithm. This can be either "random" to choose random points as
   * initial cluster centers, or "k-means||" to use a parallel variant of k-means++
   * (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: k-means||.
   */
  @Since("0.8.0")
  def setInitializationMode(initializationMode: String): this.type = {
    KMeans.validateInitMode(initializationMode)
    this.initializationMode = initializationMode
    this
  }

  /**
   * :: Experimental ::
   * Number of runs of the algorithm to execute in parallel.
   */
  @Since("1.4.0")
  @deprecated("Support for runs is deprecated. This param will have no effect in 2.0.0.", "1.6.0")
  def getRuns: Int = runs

  /**
   * :: Experimental ::
   * Set the number of runs of the algorithm to execute in parallel. We initialize the algorithm
   * this many times with random starting conditions (configured by the initialization mode), then
   * return the best clustering found over any run. Default: 1.
   */
  @Since("0.8.0")
  @deprecated("Support for runs is deprecated. This param will have no effect in 2.0.0.", "1.6.0")
  def setRuns(runs: Int): this.type = {
    internalSetRuns(runs)
  }

  // Internal version of setRuns for Python API, this should be removed at the same time as setRuns
  // this is done to avoid deprecation warnings in our build.
  private[mllib] def internalSetRuns(runs: Int): this.type = {
    if (runs <= 0) {
      throw new IllegalArgumentException("Number of runs must be positive")
    }
    if (runs != 1) {
      logWarning("Setting number of runs is deprecated and will have no effect in 2.0.0")
    }
    this.runs = runs
    this
  }

  /**
   * Number of steps for the k-means|| initialization mode
   */
  @Since("1.4.0")
  def getInitializationSteps: Int = initializationSteps

  /**
   * Set the number of steps for the k-means|| initialization mode. This is an advanced
   * setting -- the default of 5 is almost always enough. Default: 5.
   */
  @Since("0.8.0")
  def setInitializationSteps(initializationSteps: Int): this.type = {
    if (initializationSteps <= 0) {
      throw new IllegalArgumentException("Number of initialization steps must be positive")
    }
    this.initializationSteps = initializationSteps
    this
  }

  /**
   * The distance threshold within which we consider centers to have converged.
   */
  @Since("1.4.0")
  def getEpsilon: Double = epsilon

  /**
   * Set the distance threshold within which we consider centers to have converged.
   * If all centers move less than this Euclidean distance, we stop iterating one run.
   */
  @Since("0.8.0")
  def setEpsilon(epsilon: Double): this.type = {
    this.epsilon = epsilon
    this
  }

  /**
   * The random seed for cluster initialization.
   */
  @Since("1.4.0")
  def getSeed: Long = seed

  /**
   * Set the random seed for cluster initialization.
   */
  @Since("1.4.0")
  def setSeed(seed: Long): this.type = {
    this.seed = seed
    this
  }

  // Initial cluster centers can be provided as a KMeansModel object rather than using the
  // random or k-means|| initializationMode
  private var initialModel: Option[KMeansModel] = None

  /**
   * Set the initial starting point, bypassing the random initialization or k-means||
   * The condition model.k == this.k must be met, failure results
   * in an IllegalArgumentException.
   */
  @Since("1.4.0")
  def setInitialModel(model: KMeansModel): this.type = {
    require(model.k == k, "mismatched cluster count")
    initialModel = Some(model)
    this
  }

  /**
   * Train a K-means model on the given set of points; `data` should be cached for high
   * performance, because this is an iterative algorithm.
   */
  @Since("0.8.0")
  // The run method that trains the model.
  // As the doc comment notes, the data should be cached because the algorithm is iterative.
  def run(data: RDD[Vector]): KMeansModel = {
    if (data.getStorageLevel == StorageLevel.NONE) {
      logWarning("The input data is not directly cached, which may hurt performance if its"
        + " parent RDDs are also uncached.")
    }

    // Compute squared norms and cache them.
    // Compute the L2 norms and cache them
    val norms = data.map(Vectors.norm(_, 2.0))
    norms.persist()
    val zippedData = data.zip(norms).map { case (v, norm) =>
      new VectorWithNorm(v, norm)
    }
    // Call runAlgorithm, which implements the K-means iterations
    val model = runAlgorithm(zippedData)
    norms.unpersist()

    // Warn at the end of the run as well, for increased visibility.
    if (data.getStorageLevel == StorageLevel.NONE) {
      logWarning("The input data was not directly cached, which may hurt performance if its"
        + " parent RDDs are also uncached.")
    }
    model  // return the model
  }

  /**
   * Implementation of K-Means algorithm.
   */
  private def runAlgorithm(data: RDD[VectorWithNorm]): KMeansModel = {
    val sc = data.sparkContext

    val initStartTime = System.nanoTime()

    // Only one run is allowed when initialModel is given
    val numRuns = if (initialModel.nonEmpty) {
      if (runs > 1) logWarning("Ignoring runs; one run is allowed when initialModel is given.")
      1
    } else {
      runs
    }

    // Initialize the cluster centers
    val centers = initialModel match {
      case Some(kMeansCenters) => {
        Array(kMeansCenters.clusterCenters.map(s => new VectorWithNorm(s)))
      }
      case None => {
        if (initializationMode == KMeans.RANDOM) {
          initRandom(data)
        } else {
          initKMeansParallel(data)
        }
      }
    }

    val initTimeInSeconds = (System.nanoTime() - initStartTime) / 1e9
    logInfo(s"Initialization with $initializationMode took " + "%.3f".format(initTimeInSeconds) +
      " seconds.")

    val active = Array.fill(numRuns)(true)
    val costs = Array.fill(numRuns)(0.0)

    var activeRuns = new ArrayBuffer[Int] ++ (0 until numRuns)
    var iteration = 0

    val iterationStartTime = System.nanoTime()

    // Execute iterations of Lloyd's algorithm until all runs have converged
    // Lloyd's algorithm; compare with the theory section above and with the
    // Matlab version. The extra machinery here handles broadcasting/caching
    // and multiple parallel runs.
    while (iteration < maxIterations && !activeRuns.isEmpty) {
      type WeightedPoint = (Vector, Long)
      def mergeContribs(x: WeightedPoint, y: WeightedPoint): WeightedPoint = {
        axpy(1.0, x._1, y._1)
        (y._1, x._2 + y._2)
      }

      val activeCenters = activeRuns.map(r => centers(r)).toArray
      val costAccums = activeRuns.map(_ => sc.accumulator(0.0))

      val bcActiveCenters = sc.broadcast(activeCenters)

      // Find the sum and count of points mapping to each center
      // For each center, accumulate the contributions (vector sum and count)
      // of the points assigned to it
      val totalContribs = data.mapPartitions { points =>
        val thisActiveCenters = bcActiveCenters.value
        val runs = thisActiveCenters.length
        val k = thisActiveCenters(0).length
        val dims = thisActiveCenters(0)(0).vector.size

        val sums = Array.fill(runs, k)(Vectors.zeros(dims))
        val counts = Array.fill(runs, k)(0L)

        points.foreach { point =>
          (0 until runs).foreach { i =>
            val (bestCenter, cost) = KMeans.findClosest(thisActiveCenters(i), point)
            costAccums(i) += cost
            val sum = sums(i)(bestCenter)
            axpy(1.0, point.vector, sum)
            counts(i)(bestCenter) += 1
          }
        }

        // contribs = ((run index, center index), (vector sum of assigned points, their count))
        val contribs = for (i <- 0 until runs; j <- 0 until k) yield {
          ((i, j), (sums(i)(j), counts(i)(j)))
        }
        contribs.iterator
      }.reduceByKey(mergeContribs).collectAsMap()

      bcActiveCenters.unpersist(blocking = false)

      // Update the cluster centers and costs for each active run
      for ((run, i) <- activeRuns.zipWithIndex) {
        var changed = false
        var j = 0
        while (j < k) {
          val (sum, count) = totalContribs((i, j))
          if (count != 0) {
            scal(1.0 / count, sum)
            val newCenter = new VectorWithNorm(sum)
            if (KMeans.fastSquaredDistance(newCenter, centers(run)(j)) > epsilon * epsilon) {
              changed = true
            }
            centers(run)(j) = newCenter
          }
          j += 1
        }
        if (!changed) {
          active(run) = false
          logInfo("Run " + run + " finished in " + (iteration + 1) + " iterations")
        }
        costs(run) = costAccums(i).value
      }

      activeRuns = activeRuns.filter(active(_))
      iteration += 1
    }

    val iterationTimeInSeconds = (System.nanoTime() - iterationStartTime) / 1e9
    logInfo(s"Iterations took " + "%.3f".format(iterationTimeInSeconds) + " seconds.")

    if (iteration == maxIterations) {
      logInfo(s"KMeans reached the max number of iterations: $maxIterations.")
    } else {
      logInfo(s"KMeans converged in $iteration iterations.")
    }

    val (minCost, bestRun) = costs.zipWithIndex.min
    logInfo(s"The cost for the best run is $minCost.")

    new KMeansModel(centers(bestRun).map(_.vector))
  }

  /**
   * Initialize `runs` sets of cluster centers at random.
   */
  // Pick the k cluster centers at random
  private def initRandom(data: RDD[VectorWithNorm])
    : Array[Array[VectorWithNorm]] = {
    // Sample all the cluster centers in one pass to avoid repeated scans
    val sample = data.takeSample(true, runs * k, new XORShiftRandom(this.seed).nextInt()).toSeq
    Array.tabulate(runs)(r => sample.slice(r * k, (r + 1) * k).map { v =>
      new VectorWithNorm(Vectors.dense(v.vector.toArray), v.norm)
    }.toArray)
  }

  /**
   * Initialize `runs` sets of cluster centers using the k-means|| algorithm by Bahmani et al.
   * (Bahmani et al., Scalable K-Means++, VLDB 2012). This is a variant of k-means++ that tries
   * to find dissimilar cluster centers by starting with a random center and then doing
   * passes where more centers are chosen with probability proportional to their squared distance
   * to the current cluster set. It results in a provable approximation to an optimal clustering.
   *
   * The original paper can be found at http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf.
   */
  // k-means|| initialization (the parallel variant of k-means++)
  private def initKMeansParallel(data: RDD[VectorWithNorm])
    : Array[Array[VectorWithNorm]] = {
    // Initialize empty centers and point costs.
    val centers = Array.tabulate(runs)(r => ArrayBuffer.empty[VectorWithNorm])
    var costs = data.map(_ => Array.fill(runs)(Double.PositiveInfinity))

    // Initialize each run's first center to a random point.
    val seed = new XORShiftRandom(this.seed).nextInt()
    val sample = data.takeSample(true, runs, seed).toSeq
    val newCenters = Array.tabulate(runs)(r => ArrayBuffer(sample(r).toDense))

    /** Merges new centers to centers. */
    def mergeNewCenters(): Unit = {
      var r = 0
      while (r < runs) {
        centers(r) ++= newCenters(r)
        newCenters(r).clear()
        r += 1
      }
    }

    // On each step, sample 2 * k points on average for each run with probability proportional
    // to their squared distance from that run's centers. Note that only distances between points
    // and new centers are computed in each iteration.
    var step = 0
    while (step < initializationSteps) {
      val bcNewCenters = data.context.broadcast(newCenters)
      val preCosts = costs
      costs = data.zip(preCosts).map { case (point, cost) =>
        Array.tabulate(runs) { r =>
          math.min(KMeans.pointCost(bcNewCenters.value(r), point), cost(r))
        }
      }.persist(StorageLevel.MEMORY_AND_DISK)
      val sumCosts = costs
        .aggregate(new Array[Double](runs))(
          seqOp = (s, v) => {
            // s += v
            var r = 0
            while (r < runs) {
              s(r) += v(r)
              r += 1
            }
            s
          },
          combOp = (s0, s1) => {
            // s0 += s1
            var r = 0
            while (r < runs) {
              s0(r) += s1(r)
              r += 1
            }
            s0
          }
        )

      bcNewCenters.unpersist(blocking = false)
      preCosts.unpersist(blocking = false)

      val chosen = data.zip(costs).mapPartitionsWithIndex { (index, pointsWithCosts) =>
        val rand = new XORShiftRandom(seed ^ (step << 16) ^ index)
        pointsWithCosts.flatMap { case (p, c) =>
          val rs = (0 until runs).filter { r =>
            rand.nextDouble() < 2.0 * c(r) * k / sumCosts(r)
          }
          if (rs.length > 0) Some((p, rs)) else None
        }
      }.collect()
      mergeNewCenters()
      chosen.foreach { case (p, rs) =>
        rs.foreach(newCenters(_) += p.toDense)
      }
      step += 1
    }

    mergeNewCenters()
    costs.unpersist(blocking = false)

    // Finally, we might have a set of more than k candidate centers for each run; weigh each
    // candidate by the number of points in the dataset mapping to it and run a local k-means++
    // on the weighted centers to pick just k of them
    val bcCenters = data.context.broadcast(centers)
    val weightMap = data.flatMap { p =>
      Iterator.tabulate(runs) { r =>
        ((r, KMeans.findClosest(bcCenters.value(r), p)._1), 1.0)
      }
    }.reduceByKey(_ + _).collectAsMap()

    bcCenters.unpersist(blocking = false)

    val finalCenters = (0 until runs).par.map { r =>
      val myCenters = centers(r).toArray
      val myWeights = (0 until myCenters.length).map(i => weightMap.getOrElse((r, i), 0.0)).toArray
      LocalKMeans.kMeansPlusPlus(r, myCenters, myWeights, k, 30)
    }

    finalCenters.toArray
  }
}
KMeans companion object

/**
 * Top-level methods for calling K-means clustering.
 */
@Since("0.8.0")
object KMeans {

  // Initialization mode names
  @Since("0.8.0")
  val RANDOM = "random"
  @Since("0.8.0")
  val K_MEANS_PARALLEL = "k-means||"

  /**
   * Trains a k-means model using the given set of parameters.
   *
   * @param data Training points as an `RDD` of `Vector` types.
   * @param k Number of clusters to create.
   * @param maxIterations Maximum number of iterations allowed.
   * @param runs Number of runs to execute in parallel. The best model according to the cost
   *             function will be returned. (default: 1)
   * @param initializationMode The initialization algorithm. This can either be "random" or
   *                           "k-means||". (default: "k-means||")
   * @param seed Random seed for cluster initialization. Default is to generate seed based
   *             on system time.
   */
  @Since("1.3.0")
  // Six parameters: training data, number of clusters k, maxIterations,
  // runs (default 1), initializationMode, and seed
  def train(
      data: RDD[Vector],
      k: Int,
      maxIterations: Int,
      runs: Int,
      initializationMode: String,
      seed: Long): KMeansModel = {
    new KMeans().setK(k)
      .setMaxIterations(maxIterations)
      .internalSetRuns(runs)
      .setInitializationMode(initializationMode)
      .setSeed(seed)
      .run(data)
  }

  /**
   * Trains a k-means model using the given set of parameters.
   *
   * @param data Training points as an `RDD` of `Vector` types.
   * @param k Number of clusters to create.
   * @param maxIterations Maximum number of iterations allowed.
   * @param runs Number of runs to execute in parallel. The best model according to the cost
   *             function will be returned. (default: 1)
   * @param initializationMode The initialization algorithm. This can either be "random" or
   *                           "k-means||". (default: "k-means||")
   */
  @Since("0.8.0")
  // Five parameters: data, k, maxIterations, runs (default 1), initializationMode
  def train(
      data: RDD[Vector],
      k: Int,
      maxIterations: Int,
      runs: Int,
      initializationMode: String): KMeansModel = {
    new KMeans().setK(k)
      .setMaxIterations(maxIterations)
      .internalSetRuns(runs)
      .setInitializationMode(initializationMode)
      .run(data)
  }

  /**
   * Trains a k-means model using specified parameters and the default values for unspecified.
   */
  // Three parameters: data, k, maxIterations
  @Since("0.8.0")
  def train(
      data: RDD[Vector],
      k: Int,
      maxIterations: Int): KMeansModel = {
    train(data, k, maxIterations, 1, K_MEANS_PARALLEL)
  }

  /**
   * Trains a k-means model using specified parameters and the default values for unspecified.
   */
  // Four parameters: data, k, maxIterations, runs (default 1)
  @Since("0.8.0")
  def train(
      data: RDD[Vector],
      k: Int,
      maxIterations: Int,
      runs: Int): KMeansModel = {
    train(data, k, maxIterations, runs, K_MEANS_PARALLEL)
  }

  /**
   * Returns the index of the closest center to the given point, as well as the squared distance.
   */
  // Find the cluster center closest to the given point; returns (bestIndex, bestDistance)
  private[mllib] def findClosest(
      centers: TraversableOnce[VectorWithNorm],
      point: VectorWithNorm): (Int, Double) = {
    var bestDistance = Double.PositiveInfinity
    var bestIndex = 0
    var i = 0
    centers.foreach { center =>
      // Since `\|a - b\| \geq |\|a\| - \|b\||`, we can use this lower bound to avoid unnecessary
      // distance computation.
      var lowerBoundOfSqDist = center.norm - point.norm
      lowerBoundOfSqDist = lowerBoundOfSqDist * lowerBoundOfSqDist
      if (lowerBoundOfSqDist < bestDistance) {
        val distance: Double = fastSquaredDistance(center, point)
        if (distance < bestDistance) {
          bestDistance = distance
          bestIndex = i
        }
      }
      i += 1
    }
    (bestIndex, bestDistance)
  }

  /**
   * Returns the K-means cost of a given point against the given cluster centers.
   */
  // Cost of a point: the squared distance to its nearest center
  private[mllib] def pointCost(
      centers: TraversableOnce[VectorWithNorm],
      point: VectorWithNorm): Double =
    findClosest(centers, point)._2

  /**
   * Returns the squared Euclidean distance between two vectors computed by
   * [[org.apache.spark.mllib.util.MLUtils#fastSquaredDistance]].
   */
  // Squared Euclidean distance between two points, using their precomputed norms
  private[clustering] def fastSquaredDistance(
      v1: VectorWithNorm,
      v2: VectorWithNorm): Double = {
    MLUtils.fastSquaredDistance(v1.vector, v1.norm, v2.vector, v2.norm)
  }

  // Validate the initialization mode
  private[spark] def validateInitMode(initMode: String): Boolean = {
    initMode match {
      case KMeans.RANDOM => true
      case KMeans.K_MEANS_PARALLEL => true
      case _ => false
    }
  }
}

VectorWithNorm class

/**
 * A vector with its norm for fast distance computation.
 *
 * @see [[org.apache.spark.mllib.clustering.KMeans#fastSquaredDistance]]
 */
private[clustering]
// Helper class defined within the clustering package
class VectorWithNorm(val vector: Vector, val norm: Double) extends Serializable {

  def this(vector: Vector) = this(vector, Vectors.norm(vector, 2.0))

  def this(array: Array[Double]) = this(Vectors.dense(array))

  /** Converts the vector to a dense vector. */
  def toDense: VectorWithNorm = new VectorWithNorm(Vectors.dense(vector.toArray), norm)
}

KMeansModel class

class KMeansModel @Since("1.1.0") (@Since("1.0.0") val clusterCenters: Array[Vector])
  extends Saveable with Serializable with PMMLExportable {

  /**
   * A Java-friendly constructor that takes an Iterable of Vectors.
   */
  @Since("1.4.0")
  def this(centers: java.lang.Iterable[Vector]) = this(centers.asScala.toArray)

  /**
   * Total number of clusters.
   */
  @Since("0.8.0")
  def k: Int = clusterCenters.length

  /**
   * Returns the cluster index that a given point belongs to.
   */
  @Since("0.8.0")
  def predict(point: Vector): Int = {
    KMeans.findClosest(clusterCentersWithNorm, new VectorWithNorm(point))._1
  }

  /**
   * Maps given points to their cluster indices.
   */
  @Since("1.0.0")
  def predict(points: RDD[Vector]): RDD[Int] = {
    val centersWithNorm = clusterCentersWithNorm
    val bcCentersWithNorm = points.context.broadcast(centersWithNorm)
    points.map(p => KMeans.findClosest(bcCentersWithNorm.value, new VectorWithNorm(p))._1)
  }

  /**
   * Maps given points to their cluster indices.
   */
  @Since("1.0.0")
  def predict(points: JavaRDD[Vector]): JavaRDD[java.lang.Integer] =
    predict(points.rdd).toJavaRDD().asInstanceOf[JavaRDD[java.lang.Integer]]

  /**
   * Return the K-means cost (sum of squared distances of points to their nearest center) for this
   * model on the given data.
   */
  @Since("0.8.0")
  def computeCost(data: RDD[Vector]): Double = {
    val centersWithNorm = clusterCentersWithNorm
    val bcCentersWithNorm = data.context.broadcast(centersWithNorm)
    data.map(p => KMeans.pointCost(bcCentersWithNorm.value, new VectorWithNorm(p))).sum()
  }

  private def clusterCentersWithNorm: Iterable[VectorWithNorm] =
    clusterCenters.map(new VectorWithNorm(_))

  @Since("1.4.0")
  override def save(sc: SparkContext, path: String): Unit = {
    KMeansModel.SaveLoadV1_0.save(sc, this, path)
  }

  override protected def formatVersion: String = "1.0"
}

KMeansModel companion object

object KMeansModel extends Loader[KMeansModel] {

  @Since("1.4.0")
  override def load(sc: SparkContext, path: String): KMeansModel = {
    KMeansModel.SaveLoadV1_0.load(sc, path)
  }

  private case class Cluster(id: Int, point: Vector)

  private object Cluster {
    def apply(r: Row): Cluster = {
      Cluster(r.getInt(0), r.getAs[Vector](1))
    }
  }

  private[clustering]
  object SaveLoadV1_0 {

    private val thisFormatVersion = "1.0"

    private[clustering]
    val thisClassName = "org.apache.spark.mllib.clustering.KMeansModel"

    def save(sc: SparkContext, model: KMeansModel, path: String): Unit = {
      val sqlContext = SQLContext.getOrCreate(sc)
      import sqlContext.implicits._
      val metadata = compact(render(
        ("class" -> thisClassName) ~ ("version" -> thisFormatVersion) ~ ("k" -> model.k)))
      sc.parallelize(Seq(metadata), 1).saveAsTextFile(Loader.metadataPath(path))
      val dataRDD = sc.parallelize(model.clusterCenters.zipWithIndex).map { case (point, id) =>
        Cluster(id, point)
      }.toDF()
      dataRDD.write.parquet(Loader.dataPath(path))
    }

    def load(sc: SparkContext, path: String): KMeansModel = {
      implicit val formats = DefaultFormats
      val sqlContext = SQLContext.getOrCreate(sc)
      val (className, formatVersion, metadata) = Loader.loadMetadata(sc, path)
      assert(className == thisClassName)
      assert(formatVersion == thisFormatVersion)
      val k = (metadata \ "k").extract[Int]
      val centroids = sqlContext.read.parquet(Loader.dataPath(path))
      Loader.checkSchema[Cluster](centroids.schema)
      val localCentroids = centroids.rdd.map(Cluster.apply).collect()
      assert(k == localCentroids.length)
      new KMeansModel(localCentroids.sortBy(_.id).map(_.point))
    }
  }
}


SparkML Experiment

Note: for convenience, the dataset here consists only of columns 3 and 4 of the Matlab data (petal length and petal width), and is purely numeric.

package Cluster

import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.{SparkConf, SparkContext}

object myKmean {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("KMeansExample").setMaster("local")
    val sc = new SparkContext(conf)

    // Load and parse the data
    val data = sc.textFile("C:\\Users\\andrew\\Desktop\\kmeans\\IrisData.txt")
    println(data)
    data.collect.foreach(println)
    val parsedData = data.map(s => Vectors.dense(s.split('\t').map(_.toDouble)))
    parsedData.collect.foreach(println)

    // Cluster the data into three classes using KMeans
    val initMode = KMeans.K_MEANS_PARALLEL  // i.e. "k-means||"
    val numClusters = 3
    val numIterations = 20
    val clusters = KMeans.train(parsedData, numClusters, numIterations, 1, initMode)
    clusters.clusterCenters.foreach(println)
    // Cluster center coordinates:
    // [5.595833333333332,2.0374999999999988]
    // [1.462,0.24599999999999994]
    // [4.269230769230769,1.3423076923076924]

    // Evaluate clustering by computing Within Set Sum of Squared Errors
    val WSSSE = clusters.computeCost(parsedData)
    println("Within Set Sum of Squared Errors = " + WSSSE)
    // Within Set Sum of Squared Errors = 31.371358974359016

    // Save and load model
    clusters.save(sc, "target/org/apache/spark/KMeansExample/KMeansModel")
    val sameModel = KMeansModel.load(sc, "target/org/apache/spark/KMeansExample/KMeansModel")

    sc.stop()
  }
}
Compare with the Matlab cluster centers:

centers =
    2.0478    5.6261
    0.2460    1.4620
    1.3593    4.2926

// Spark cluster centers:
[5.595833333333332,2.0374999999999988]
[1.462,0.24599999999999994]
[4.269230769230769,1.3423076923076924]

Note that the column order is swapped: the Matlab centers are [petal width, petal length], while the Spark input has columns [petal length, petal width]. Allowing for that swap, the two sets of centers match closely; the small remaining differences presumably come from different initializations assigning a few boundary points differently.



