In the previous post we took a brief look at Supervised Learning in Spark MLlib; this post focuses on Unsupervised Learning. It covers KMeans, PCA (Principal Component Analysis), and SVD (Singular Value Decomposition).

1.Unsupervised Learning

First, let's look at how Wikipedia defines Unsupervised Learning:
"In machine learning, the problem of unsupervised learning is that of trying to find hidden structure in unlabeled data."





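KMeans is one of the simplest clustering algorithms. For a detailed introduction, see my other post, "KMeans笔记" (KMeans Notes). In short, it iteratively computes the distance from the selected cluster centers to each feature vector, assigns each vector to its nearest center, and then recomputes the centers, repeating until the partition is satisfactory. In Spark MLlib, the interface that implements KMeans is the KMeans class. The code is as follows: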

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.{SparkContext, SparkConf}

object KMeansDemo {
  def main(args: Array[String]): Unit = {
    if (args.length != 3) {
      System.err.println("Usage: <input file> <iteration number> <cluster number>")
      System.exit(1)
    }
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    val data = sc.textFile(args(0))
    // Maximum number of iterations
    val iteration = args(1).toInt
    // Number of clusters
    val cluster = args(2).toInt
    // Parse each space-separated line into a dense vector
    val parseData = data.map(line => Vectors.dense(line.split(" ").map(_.toDouble)))
    // Train a model
    val model = KMeans.train(parseData, cluster, iteration)
    // Print the center of each cluster
    val centers = model.clusterCenters
    for (center <- centers) {
      println(center)
    }
    sc.stop()
  }
}
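The iteration described above (assign each point to its nearest center, then recompute the centers) can be sketched without Spark. This is only a minimal illustration under stated assumptions: the object name KMeansSketch, its helper functions, and the hand-picked initial centers are all hypothetical, and MLlib's own KMeans is distributed and chooses its initial centers differently.

```scala
// Minimal, Spark-free sketch of the KMeans iteration; all names are illustrative.
object KMeansSketch {
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the center closest to p
  def nearest(p: Point, centers: Seq[Point]): Int =
    centers.indices.minBy(i => sqDist(p, centers(i)))

  // Component-wise mean of a non-empty group of points
  def mean(points: Seq[Point]): Point =
    points.head.indices.map(d => points.map(_(d)).sum / points.size).toVector

  // Repeat "assign to nearest center, then recompute centers" maxIterations times
  def run(points: Seq[Point], init: Seq[Point], maxIterations: Int): Seq[Point] =
    (1 to maxIterations).foldLeft(init) { (centers, _) =>
      val groups = points.groupBy(p => nearest(p, centers))
      // Keep the old center if no point was assigned to it
      centers.indices.map(i => groups.get(i).map(mean).getOrElse(centers(i)))
    }

  def main(args: Array[String]): Unit = {
    val points = Seq(
      Vector(0.0, 0.0), Vector(0.0, 1.0),
      Vector(10.0, 10.0), Vector(10.0, 11.0))
    // Assumption: initial centers are picked by hand instead of randomly
    val centers = run(points, init = Seq(points(0), points(2)), maxIterations = 5)
    centers.foreach(c => println(c.mkString(" ")))
  }
}
```

The sketch runs a fixed number of iterations, which mirrors the iteration-number argument passed to KMeans.train in the demo above; a real implementation would usually also stop early once the centers no longer move.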

2.PCA(Principal Component Analysis)

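PCA is an algorithm that reduces dimensionality by using the covariance matrix. For a detailed introduction, see my other post, "PCA算法详解" (PCA Algorithm Explained). To run PCA in Spark MLlib, you must first construct the distributed matrix RowMatrix mentioned in the first post of this series, and then call RowMatrix's computePrincipalComponents() method. The code is as follows: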

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.{SparkContext, SparkConf}

object PCA {
  def main(args: Array[String]): Unit = {
    if (args.length != 2) {
      System.err.println("Usage: <input file> <reduce dimensions>")
      System.exit(1)
    }
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    val data = sc.textFile(args(0))
    // Number of dimensions to reduce to
    val dimension = args(1).toInt
    // Parse each space-separated line into a dense vector
    val vectors = data.map { line =>
      val values = line.split(" ").map(_.toDouble)
      Vectors.dense(values)
    }
    val mat = new RowMatrix(vectors)
    // Compute the principal components
    val pc = mat.computePrincipalComponents(dimension)
    // Project the rows onto the linear space spanned by the principal components
    val pca = mat.multiply(pc)
    println("PCA matrix rows: " + pca.numRows() + ", columns: " + pca.numCols())
    sc.stop()
  }
}
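To make the role of the covariance matrix concrete, here is a small Spark-free sketch that builds the sample covariance matrix of a few rows and approximates its leading eigenvector, that is, the first principal component, by power iteration. The object name PcaSketch and its helpers are hypothetical, and power iteration is a technique swapped in for illustration; it is not claimed to be how computePrincipalComponents() works internally.

```scala
// Minimal, Spark-free sketch of PCA's first component; all names are illustrative.
object PcaSketch {
  type Vec = Vector[Double]

  // Sample covariance matrix of the rows (each column is one feature)
  def covariance(rows: Seq[Vec]): Vector[Vec] = {
    val n = rows.size
    val d = rows.head.size
    val means = Vector.tabulate(d)(j => rows.map(_(j)).sum / n)
    Vector.tabulate(d, d) { (i, j) =>
      rows.map(r => (r(i) - means(i)) * (r(j) - means(j))).sum / (n - 1)
    }
  }

  def dot(a: Vec, b: Vec): Double = a.zip(b).map { case (x, y) => x * y }.sum

  // Scale v to unit length
  def normalize(v: Vec): Vec = {
    val norm = math.sqrt(dot(v, v))
    v.map(_ / norm)
  }

  // Power iteration: repeatedly apply the matrix and renormalize, which
  // converges toward the eigenvector with the largest eigenvalue.
  // Assumption: the all-ones start vector is not orthogonal to that eigenvector.
  def firstComponent(cov: Vector[Vec], iterations: Int): Vec = {
    val start = normalize(Vector.fill(cov.size)(1.0))
    (1 to iterations).foldLeft(start) { (v, _) =>
      normalize(cov.map(row => dot(row, v)))
    }
  }

  def main(args: Array[String]): Unit = {
    // Toy data lying exactly on the line y = 2x
    val rows = Seq(Vector(1.0, 2.0), Vector(2.0, 4.0), Vector(3.0, 6.0))
    val pc = firstComponent(covariance(rows), iterations = 50)
    println("First principal component: " + pc.mkString(" "))
    // Project each row onto the component, like mat.multiply(pc) with one column
    rows.foreach(r => println(dot(r, pc)))
  }
}
```

Because the toy data lies on the line y = 2x, the resulting component points along the direction (1, 2), and each two-dimensional row is reduced to a single coordinate along that direction.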

3.SVD(Singular Value Decomposition)

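SVD reduces dimensionality through the singular value decomposition of a matrix, although the decomposition used here is not as strict as the mathematical definition: computeSVD(k) keeps only the top k singular values and their singular vectors. In Spark MLlib, as with PCA, you must first construct a RowMatrix and then call its computeSVD() method. The code is as follows: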

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.{SparkContext, SparkConf}

object SVD {
  def main(args: Array[String]): Unit = {
    if (args.length != 2) {
      System.err.println("Usage: <input file> <reduce dimension>")
      System.exit(1)
    }
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    val data = sc.textFile(args(0))
    // Number of singular values to keep
    val dimension = args(1).toInt
    // Parse each space-separated line into a dense vector
    val vectors = data.map { line =>
      val values = line.split(" ").map(_.toDouble)
      Vectors.dense(values)
    }
    val mat = new RowMatrix(vectors)
    // Compute the SVD; computeU defaults to false, in which case svd.U is null
    val svd = mat.computeSVD(dimension, computeU = true)
    // U factor: left singular vectors, a distributed RowMatrix
    val U = svd.U
    // Singular values in descending order, a local Vector
    val vec = svd.s
    // V factor: right singular vectors, a local Matrix
    val V = svd.V
    println("U rows: " + U.numRows() + ", columns: " + U.numCols())
    println("Singular values: " + vec)
    println("V:\n" + V)
    sc.stop()
  }
}
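For reference, the factors returned by computeSVD() can be related to the usual truncated decomposition as follows. The symbols are my own notation rather than anything defined by MLlib: A is the m × n matrix held in the RowMatrix and k is the dimension argument, so that svd.U, svd.s and svd.V correspond to U_k, the diagonal of Σ_k, and V_k.

```latex
A_{m \times n} \approx U_k \Sigma_k V_k^{\top}, \qquad
U_k \in \mathbb{R}^{m \times k}, \quad
\Sigma_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k), \quad
V_k \in \mathbb{R}^{n \times k}, \quad
\sigma_1 \ge \dots \ge \sigma_k \ge 0

A V_k = U_k \Sigma_k
```

The second identity is what makes SVD usable for dimensionality reduction: multiplying the original rows by V_k, for example with mat.multiply(svd.V), should yield a k-column representation of the data, in the same way the PCA example projects rows onto the principal components.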

