Machine Learning with Spark MLlib (Part 3)

Source: Internet | Editor: 程序博客网 | Time: 2024/04/25 20:34

In the previous post we took a brief look at Supervised Learning in Spark MLlib. This post focuses on Unsupervised Learning and covers KMeans, PCA (Principal Component Analysis), and SVD (Singular Value Decomposition).

1. Unsupervised Learning

First, let us look at how Wikipedia defines Unsupervised Learning:
"In machine learning, the problem of unsupervised learning is that of trying to find hidden structure in unlabeled data."

Put simply, unsupervised learning tries to find hidden structure in data that has not been labeled.

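In this post, KMeans is the clustering algorithm. In the previous post we summed up classification as sorting things into known categories (分门别类). Clustering can likewise be summed up by the saying "birds of a feather flock together" (物以类聚): clustering is the process of partitioning a collection of physical or abstract objects into multiple classes, each made up of similar objects.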

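PCA and SVD, on the other hand, are used for dimensionality reduction. We know that an SVM can use a Gaussian kernel to map data into a higher-dimensional space, which makes the problem easier to solve (for example, the data can become linearly separable). In general, though, reducing the dimensionality makes the data easier to understand. Which direction to take simply depends on the task at hand.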

2. KMeans

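KMeans is one of the simplest clustering algorithms. For a detailed description, see my other post, "KMeans Notes". In short, the algorithm iteratively computes the distance from each chosen cluster center to every data point, assigns each point to the cluster whose center is nearest, and then recomputes the cluster centers, repeating until the partition is satisfactory. In Spark MLlib, the interface that implements KMeans is KMeans. The code is as follows: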

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.{SparkContext, SparkConf}

object KMeansDemo {
  def main(args: Array[String]): Unit = {
    if (args.length != 3) {
      System.err.println("Usage: <input file> <iteration number> <cluster number>")
      System.exit(1)
    }
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    val data = sc.textFile(args(0))
    // Maximum number of iterations
    val iteration = args(1).toInt
    // Number of clusters (k)
    val cluster = args(2).toInt
    // Parse each space-separated line into a dense vector;
    // cache the RDD because KMeans passes over it repeatedly
    val parseData = data.map(line => Vectors.dense(line.split(" ").map(_.toDouble))).cache()
    // Train a model
    val model = KMeans.train(parseData, cluster, iteration)
    // Print the center of each cluster
    val centers = model.clusterCenters
    for (center <- centers) {
      println(center)
    }
    sc.stop()
  }
}
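To make the iteration described in this section concrete, here is a minimal plain-Scala sketch of the assign-then-recompute loop, with no Spark dependency. It is only an illustration under simplifying assumptions: the names KMeansSketch, step, and run are hypothetical, the initial centers are supplied by the caller, and the loop runs for a fixed number of iterations. It is not how MLlib implements KMeans internally.

```scala
object KMeansSketch {
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the center nearest to p.
  def nearest(p: Point, centers: Seq[Point]): Int =
    centers.indices.minBy(i => sqDist(p, centers(i)))

  // One iteration: assign every point to its nearest center,
  // then move each center to the mean of the points assigned to it.
  def step(points: Seq[Point], centers: Seq[Point]): Seq[Point] = {
    val groups = points.groupBy(p => nearest(p, centers))
    centers.indices.map { i =>
      groups.get(i) match {
        case Some(ps) =>
          val dim = ps.head.length
          Vector.tabulate(dim)(d => ps.map(_(d)).sum / ps.size)
        case None =>
          centers(i) // empty cluster: keep the old center (an assumption of this sketch)
      }
    }
  }

  // Repeat the step a fixed number of times, starting from caller-supplied centers.
  def run(points: Seq[Point], init: Seq[Point], maxIterations: Int): Seq[Point] =
    (1 to maxIterations).foldLeft(init)((cs, _) => step(points, cs))
}
```

For example, calling KMeansSketch.run on the four points (0,0), (0,2), (10,10), (10,12) with initial centers (0,0) and (10,10) moves the centers to the means of the two obvious groups.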

3. PCA (Principal Component Analysis)

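PCA is an algorithm that reduces dimensionality by means of the covariance matrix. For a detailed description, see my other post, "PCA Algorithm Explained". To run PCA in Spark MLlib, you must first construct a RowMatrix, the distributed matrix introduced in the first post of this series, and then call its computePrincipalComponents() method. The code is as follows: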

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.{SparkContext, SparkConf}

object PCA {
  def main(args: Array[String]): Unit = {
    if (args.length != 2) {
      System.err.println("Usage: <input file> <reduce dimensions>")
      System.exit(1)
    }
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    val data = sc.textFile(args(0))
    // Number of dimensions to keep after reduction
    val dimension = args(1).toInt
    // Parse each space-separated line into a dense vector
    val vectors = data.map { line =>
      val values = line.split(" ").map(_.toDouble)
      Vectors.dense(values)
    }
    val mat = new RowMatrix(vectors)
    // Compute the principal components
    val pc = mat.computePrincipalComponents(dimension)
    // Project the rows onto the linear space spanned by the principal components
    val pca = mat.multiply(pc)
    println("PCA matrix rows: " + pca.numRows() + ", columns: " + pca.numCols())
    sc.stop()
  }
}
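As a small illustration of how the covariance matrix leads to a principal component, the following plain-Scala sketch computes the sample covariance matrix of a tiny data set and then finds its dominant eigenvector by power iteration. This is a sketch under stated assumptions, not MLlib's internal method: the names PcaSketch, covariance, and topComponent are hypothetical, power iteration is swapped in for a full eigendecomposition, and only the first component is recovered.

```scala
object PcaSketch {
  type Matrix = Array[Array[Double]]

  // Sample covariance matrix; each element of rows is one observation.
  def covariance(rows: Seq[Array[Double]]): Matrix = {
    val n = rows.size
    val d = rows.head.length
    val mean = Array.tabulate(d)(j => rows.map(_(j)).sum / n)
    Array.tabulate(d, d) { (i, j) =>
      rows.map(r => (r(i) - mean(i)) * (r(j) - mean(j))).sum / (n - 1)
    }
  }

  // Power iteration: repeatedly multiply a vector by the matrix and normalize it.
  // For a covariance matrix with a unique largest eigenvalue, the vector
  // converges to the direction of the first principal component.
  def topComponent(cov: Matrix, iterations: Int = 100): Array[Double] = {
    val d = cov.length
    var v = Array.fill(d)(1.0 / math.sqrt(d))
    for (_ <- 1 to iterations) {
      val w = cov.map(row => row.zip(v).map { case (a, b) => a * b }.sum)
      val norm = math.sqrt(w.map(x => x * x).sum)
      if (norm > 0) v = w.map(_ / norm)
    }
    v
  }
}
```

With observations that lie on a single line, such as (1,2), (2,4), (3,6), the returned vector points along that line, which is the direction that keeps all of the variance when the data is reduced to one dimension.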

4. SVD (Singular Value Decomposition)

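SVD reduces dimensionality by means of the singular value decomposition of a matrix, although the decomposition used here is not as strict as the full mathematical definition: only the leading singular values and their vectors are kept. In Spark MLlib, as with PCA, you must first construct a RowMatrix and then call its computeSVD() method. The code is as follows: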

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.{SparkContext, SparkConf}

object SVD {
  def main(args: Array[String]): Unit = {
    if (args.length != 2) {
      System.err.println("Usage: <input file> <reduce dimension>")
      System.exit(1)
    }
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    val data = sc.textFile(args(0))
    // Number of singular values to keep
    val dimension = args(1).toInt
    // Parse each space-separated line into a dense vector
    val vectors = data.map { line =>
      val values = line.split(" ").map(_.toDouble)
      Vectors.dense(values)
    }
    val mat = new RowMatrix(vectors)
    // Compute the SVD; computeU defaults to false, which leaves svd.U null,
    // so request U explicitly
    val svd = mat.computeSVD(dimension, computeU = true)
    // U: left singular vectors (a distributed RowMatrix)
    val U = svd.U
    // s: singular values in descending order (a local Vector)
    val s = svd.s
    // V: right singular vectors (a local Matrix)
    val V = svd.V
    println("Singular values: " + s)
    println("V:\n" + V)
    sc.stop()
  }
}
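For reference, the factors returned by computeSVD(k) can be written out as follows. This is the standard truncated SVD; the symbols m (number of rows), n (number of columns), and k (the requested dimension) are notation introduced here, not names from the code above.

```latex
A \approx U \Sigma V^{T}, \qquad
U \in \mathbb{R}^{m \times k}, \quad
\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_k), \quad
V \in \mathbb{R}^{n \times k}, \qquad
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_k \ge 0

% Reduced representation of the rows of A in k dimensions:
A V = U \Sigma \in \mathbb{R}^{m \times k}
```

Here U corresponds to svd.U, the diagonal of Σ to svd.s, and V to svd.V. Since the columns of V are orthonormal, the reduced rows AV could be obtained in the same way as in the PCA example, by multiplying the RowMatrix by V.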

1 0