Spark MLlib 1.6 -- Data Types

Source: Internet. Editor: 程序博客网. Time: 2024/06/10 15:07

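Translator's note (continued):

Back from the 2016 Lunar New Year holiday, I have reorganized my earlier translation of the Spark MLlib documentation and am picking up the unfinished work.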

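MLlib is Spark's machine learning library. Its goal is to make practical machine learning scalable and easy to use. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs.

It is divided into two packages:

1 spark.mllib contains the original API built on top of RDDs.

2 spark.ml provides a higher-level API built on top of DataFrames for constructing machine learning pipelines.

Using spark.ml is recommended because the DataFrame API is more versatile and flexible for machine learning applications. spark.mllib will continue to be supported alongside the development of spark.ml. Users can keep using spark.mllib features and can expect more to come. Developers should contribute new algorithms to spark.ml if they fit the pipeline concept well, for example feature extractors and transformers.

The main features of the machine learning packages are listed below, followed by the details.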

spark.mllib: data types, algorithms, and utilities

· Data types

· Basic statistics

o summary statistics

o correlations

o stratified sampling

o hypothesis testing

o streaming significance testing

o random data generation

· Classification and regression

o linear models (SVMs, logistic regression, linear regression)

o naive Bayes

o decision trees

o ensembles of trees (Random Forests and Gradient-Boosted Trees)

o isotonic regression

· Collaborative filtering

o alternating least squares (ALS)

· Clustering

o k-means

o Gaussian mixture

o power iteration clustering (PIC)

o latent Dirichlet allocation (LDA)

o bisecting k-means

o streaming k-means

· Dimensionality reduction

o singular value decomposition (SVD)

o principal component analysis (PCA)

· Feature extraction and transformation

· Frequent pattern mining

o FP-growth

o association rules

o PrefixSpan

· Evaluation metrics

· PMML model export

· Optimization (developer)

o stochastic gradient descent

o limited-memory BFGS (L-BFGS)

1 Data Types – MLlib

· Local vector

· Labeled point

· Local matrix

· Distributed matrix

o RowMatrix

o IndexedRowMatrix

o CoordinateMatrix

o BlockMatrix

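MLlib supports local vectors and local matrices stored on a single machine, as well as distributed matrices backed by one or more RDDs. Local vectors and local matrices are simple data models that serve as public interfaces, while the underlying linear algebra operations are provided by Breeze and jblas. A training example used in supervised learning is called a "labeled point" in MLlib (translator's note: a sample point that carries class information).

1.1 Local vector

A local vector has integer-typed, 0-based indices and double-typed values, and is stored on a single machine. MLlib supports two types of local vectors: dense and sparse. A dense vector is backed by a double array holding its entry values. A sparse vector is backed by two parallel arrays of the same length: one holds the indices of the nonzero entries and the other holds their values. For example, the vector (1.0, 0.0, 3.0) is represented in dense format as [1.0, 0.0, 3.0] and in sparse format as (3, [0, 2], [1.0, 3.0]), where 3 is the size of the vector (translator's note: [0, 2] are the indices of the nonzero entries and [1.0, 3.0] are the corresponding values).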
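The relationship between the two representations can be sketched in plain Scala. This is only an illustration built on standard collections; the object `LocalVectorSketch` and the case class `SparseRepr` are hypothetical names introduced here and are not MLlib's DenseVector or SparseVector.

```scala
// Illustrative only: plain Scala collections, not MLlib classes.
object LocalVectorSketch {
  // Hypothetical container mirroring the (size, indices, values) triple.
  final case class SparseRepr(size: Int, indices: Seq[Int], values: Seq[Double])

  // Keep only the nonzero entries together with their 0-based positions.
  def toSparse(dense: Seq[Double]): SparseRepr = {
    val nonZero = dense.zipWithIndex.filter { case (v, _) => v != 0.0 }
    SparseRepr(dense.length, nonZero.map(_._2), nonZero.map(_._1))
  }

  // Rebuild the dense form by writing each stored value back at its index.
  def toDense(sparse: SparseRepr): Seq[Double] = {
    val out = Array.fill(sparse.size)(0.0)
    sparse.indices.zip(sparse.values).foreach { case (i, v) => out(i) = v }
    out.toSeq
  }
}
```

Calling `LocalVectorSketch.toSparse(Seq(1.0, 0.0, 3.0))` yields the triple (3, [0, 2], [1.0, 3.0]) from the example above, and `toDense` reverses it.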

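The base class of local vectors is Vector, and two implementations are provided: DenseVector and SparseVector. We recommend using the factory methods implemented in Vectors to create local vectors.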

Scala Vector API: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Vector

Scala Vectors API: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Vectors

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Create a dense vector (1.0, 0.0, 3.0).

val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)

// Create a sparse vector (1.0, 0.0, 3.0) by specifying its indices and values corresponding to nonzero entries.

val sv1: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

// Create a sparse vector (1.0, 0.0, 3.0) by specifying its nonzero entries.

val sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0)))

Note:

Scala imports scala.collection.immutable.Vector by default, so you must import org.apache.spark.mllib.linalg.Vector explicitly to use MLlib's Vector.

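1.2 Labeled point

A labeled point is a local vector, either dense or sparse, associated with a label. In MLlib, labeled points are used in supervised learning algorithms. The label is stored as a double, so labeled points can be used in both regression and classification. For binary classification, a label should be either 0 (negative) or 1 (positive). For multiclass classification, labels should be class indices starting from zero, that is, from 0 to (number of classes - 1).

A labeled point is represented by the case class LabeledPoint.

Scala LabeledPoint API: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.regression.LabeledPoint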

import org.apache.spark.mllib.linalg.Vectors

import org.apache.spark.mllib.regression.LabeledPoint

// Create a labeled point with a positive label and a dense feature vector.

val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))

// Create a labeled point with a negative label and a sparse feature vector.

val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))

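1.2.1 Sparse data

Sparse training data is very common in practice. MLlib supports reading training examples stored in LIBSVM format, the default format used by LIBSVM and LIBLINEAR, so users of those tools will find it familiar. It is a text format in which each line represents a labeled sparse feature vector, as follows: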

label index1:value1 index2:value2 ...

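Note that the feature indices in the file are one-based and in ascending order. After loading into Spark, they are converted to zero-based indices automatically.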
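To make the index shift concrete, here is a minimal sketch of parsing a single line of this format in plain Scala. It is an illustration under simplifying assumptions (well-formed input, no comments or blank lines), not the code behind MLUtils.loadLibSVMFile; `LibSvmLineSketch` and `parseLine` are hypothetical names.

```scala
// Illustrative only: parses one well-formed line, no error handling.
object LibSvmLineSketch {
  // Parse "label index1:value1 index2:value2 ..." into
  // (label, 0-based indices, values).
  def parseLine(line: String): (Double, Seq[Int], Seq[Double]) = {
    val tokens = line.trim.split("\\s+").toSeq
    val label = tokens.head.toDouble
    val pairs = tokens.tail.map { token =>
      val parts = token.split(":")
      // Indices in the file are 1-based, so shift them down by one.
      (parts(0).toInt - 1, parts(1).toDouble)
    }
    (label, pairs.map(_._1), pairs.map(_._2))
  }
}
```

For the line `1 1:1.0 3:3.0`, the sketch returns label 1.0 with indices [0, 2] and values [1.0, 3.0].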

MLUtils.loadLibSVMFile reads training examples stored in LIBSVM format.

Scala MLUtils API : http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.util.MLUtils

import org.apache.spark.mllib.regression.LabeledPoint

import org.apache.spark.mllib.util.MLUtils

import org.apache.spark.rdd.RDD

val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

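1.3 Local matrix

A local matrix has integer-typed row and column indices and double-typed values, and is stored on a single machine. MLlib supports dense matrices, whose entries are stored in a single double array in column-major order, and sparse matrices, whose nonzero entries are stored in Compressed Sparse Column (CSC) format, also in column-major order. For example, the dense matrix

1.0 2.0
3.0 4.0
5.0 6.0

of size (3, 2) is stored in the one-dimensional array [1.0, 3.0, 5.0, 2.0, 4.0, 6.0].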

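The base class of local matrices is Matrix, and two implementations are provided: DenseMatrix and SparseMatrix. We recommend using the factory methods implemented in Matrices to create local matrices. Remember that local matrices in MLlib are stored in column-major order.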
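Column-major storage means that entry (i, j) of a matrix with numRows rows sits at offset i + j * numRows in the backing array. The sketch below shows that mapping in plain Scala; `ColumnMajorSketch` and its methods are hypothetical helpers written for this illustration, not part of MLlib.

```scala
// Illustrative only: index arithmetic for a column-major array.
object ColumnMajorSketch {
  // Entry (i, j) of a matrix with numRows rows lives at i + j * numRows.
  def at(values: Seq[Double], numRows: Int, i: Int, j: Int): Double =
    values(i + j * numRows)

  // Recover the matrix row by row from the column-major array.
  def rows(values: Seq[Double], numRows: Int, numCols: Int): Seq[Seq[Double]] =
    (0 until numRows).map(i => (0 until numCols).map(j => at(values, numRows, i, j)))
}
```

Applied to the array [1.0, 3.0, 5.0, 2.0, 4.0, 6.0] with 3 rows and 2 columns, `rows` gives back (1.0, 2.0), (3.0, 4.0), (5.0, 6.0).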

Scala Matrix API: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Matrix

Scala Matrices API: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Matrices

import org.apache.spark.mllib.linalg.{Matrix, Matrices}

// Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))

val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))

// Create a sparse matrix ((9.0, 0.0), (0.0, 8.0), (0.0, 6.0))

val sm: Matrix = Matrices.sparse(3, 2, Array(0, 1, 3), Array(0, 2, 1), Array(9, 6, 8))
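The three arrays passed to Matrices.sparse above are the CSC column pointers, row indices, and values. As a rough sketch of how to read them, the following plain Scala expands CSC arrays into a dense grid. `CscSketch` and `toDenseRows` are hypothetical names, and this is not how MLlib implements SparseMatrix.

```scala
// Illustrative only: decodes CSC arrays into dense rows for inspection.
object CscSketch {
  // Column j owns the stored entries at positions colPtrs(j) until colPtrs(j + 1).
  // rowIndices(k) is the row of the k-th stored value.
  def toDenseRows(numRows: Int, numCols: Int, colPtrs: Seq[Int],
                  rowIndices: Seq[Int], values: Seq[Double]): Seq[Seq[Double]] = {
    val grid = Array.fill(numRows, numCols)(0.0)
    for (j <- 0 until numCols; k <- colPtrs(j) until colPtrs(j + 1))
      grid(rowIndices(k))(j) = values(k)
    grid.map(_.toSeq).toSeq
  }
}
```

With colPtrs [0, 1, 3], rowIndices [0, 2, 1], and values [9.0, 6.0, 8.0], column 0 holds 9.0 in row 0, and column 1 holds 6.0 in row 2 and 8.0 in row 1, which is the matrix ((9.0, 0.0), (0.0, 8.0), (0.0, 6.0)).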

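1.4 Distributed matrix

A distributed matrix has long-typed row and column indices and double-typed values, and is stored across one or more RDDs. Converting a distributed matrix to a different format may require a global shuffle, which is expensive, so it is important to choose the right format for storing large distributed matrices. Three basic types are introduced below; a fourth, BlockMatrix, is covered in section 1.4.4.

The first type is RowMatrix. A RowMatrix is a row-oriented distributed matrix without meaningful row indices, for example a collection of feature vectors. It is backed by an RDD of its rows, where each row is a local vector. We assume that the number of columns is not huge, so that a single local vector can reasonably be communicated to the driver and can also be stored and operated on using a single node.

The second type is IndexedRowMatrix. It is similar to a RowMatrix but has row indices, which can be used to identify rows and to execute joins.

The third type is CoordinateMatrix. It is a distributed matrix stored in coordinate list (COO) format (https://en.wikipedia.org/wiki/Sparse_matrix#Coordinate_list_.28COO.29), backed by an RDD of its entries.

Note:

The underlying RDDs of a distributed matrix must be deterministic, because the matrix size is cached. In general, using non-deterministic RDDs can lead to errors.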

1.4.1 RowMatrix

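Since each row is represented by a local vector, the number of columns is limited by the integer range, and in practice it should be much smaller.

A RowMatrix can be created from an RDD[Vector] instance. We can then compute its column summary statistics and decompositions. QR decomposition has the form A = QR, where Q is an orthogonal matrix and R is an upper triangular matrix. For singular value decomposition (SVD, https://en.wikipedia.org/wiki/Singular_value_decomposition) and principal component analysis (PCA, https://en.wikipedia.org/wiki/Principal_component_analysis), see the dimensionality reduction chapter: http://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html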
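To illustrate what A = QR means on a small matrix, here is a sketch of classical Gram-Schmidt in plain Scala. It is a different, single-machine technique chosen for readability and is not the algorithm behind tallSkinnyQR. It assumes the input has full column rank and returns the thin form, where Q has orthonormal columns and the same shape as A. All names (`QrSketch`, `qr`, `multiply`) are hypothetical.

```scala
// Illustrative only: classical Gram-Schmidt on a small local matrix.
object QrSketch {
  type Mat = Seq[Seq[Double]] // row-major: mat(i)(j)

  private def column(a: Mat, j: Int): Seq[Double] = a.map(_(j))

  private def dot(x: Seq[Double], y: Seq[Double]): Double =
    x.zip(y).map { case (p, q) => p * q }.sum

  // Returns (Q, R): Q is m x n with orthonormal columns, R is n x n upper triangular.
  // Assumes the columns of `a` are linearly independent.
  def qr(a: Mat): (Mat, Mat) = {
    val n = a.head.length
    val qCols = scala.collection.mutable.ArrayBuffer.empty[Seq[Double]]
    val r = Array.fill(n, n)(0.0)
    for (j <- 0 until n) {
      var v = column(a, j)
      // Subtract the projections onto the previously computed columns of Q.
      for (k <- 0 until j) {
        r(k)(j) = dot(qCols(k), column(a, j))
        v = v.zip(qCols(k)).map { case (vi, qi) => vi - r(k)(j) * qi }
      }
      r(j)(j) = math.sqrt(dot(v, v))
      qCols += v.map(_ / r(j)(j))
    }
    val q = a.indices.map(i => qCols.map(_(i)).toSeq)
    (q, r.map(_.toSeq).toSeq)
  }

  // Plain matrix product, used here to check that Q * R reproduces A.
  def multiply(x: Mat, y: Mat): Mat =
    x.map(row => y.head.indices.map(j => dot(row, y.map(_(j)))))
}
```

Multiplying the returned Q and R with `multiply` should reproduce the input up to floating-point rounding, and the entries of R below the diagonal stay zero.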

Scala RowMatrix API : http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix

import org.apache.spark.mllib.linalg.Vector

import org.apache.spark.mllib.linalg.distributed.RowMatrix

import org.apache.spark.rdd.RDD

val rows: RDD[Vector] = ... // an RDD of local vectors

// Create a RowMatrix from an RDD[Vector].

val mat: RowMatrix = new RowMatrix(rows)

// Get its size.

val m = mat.numRows()

val n = mat.numCols()

// QR decomposition

val qrResult = mat.tallSkinnyQR(true)

1.4.2 IndexedRowMatrix

An IndexedRowMatrix can be created from an RDD[IndexedRow] instance, where IndexedRow is a wrapper over (Long, Vector). An IndexedRowMatrix can be converted to a RowMatrix by dropping its row indices.

Scala IndexedRowMatrix API : http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix

import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}

import org.apache.spark.rdd.RDD

val rows: RDD[IndexedRow] = ... // an RDD of indexed rows

// Create an IndexedRowMatrix from an RDD[IndexedRow].

val mat: IndexedRowMatrix = new IndexedRowMatrix(rows)

// Get its size.

val m = mat.numRows()

val n = mat.numCols()

// Drop its row indices.

val rowMat: RowMatrix = mat.toRowMatrix()

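1.4.3 CoordinateMatrix

A CoordinateMatrix is a distributed matrix backed by an RDD of its entries. Each entry is a tuple of the form (i: Long, j: Long, value: Double), where i is the row index, j is the column index, and value is the entry value. A CoordinateMatrix should be used only when both dimensions of the matrix are huge and the matrix is very sparse.

A CoordinateMatrix can be created from an RDD[MatrixEntry] instance, where MatrixEntry is a wrapper over (Long, Long, Double). A CoordinateMatrix can be converted to an IndexedRowMatrix with sparse rows by calling toIndexedRowMatrix. Other computations on CoordinateMatrix are not currently supported.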
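Conceptually, turning coordinate entries into indexed sparse rows amounts to grouping the entries by row index. The sketch below does this on a local Seq of plain tuples. It only illustrates the idea and is not the distributed implementation of toIndexedRowMatrix; `CooSketch`, `SparseRow`, and `toIndexedSparseRows` are hypothetical names.

```scala
// Illustrative only: local grouping of (i, j, value) entries by row.
object CooSketch {
  // A sparse row as (size, indices, values), mirroring the sparse vector format.
  type SparseRow = (Int, Seq[Int], Seq[Double])

  // Group entries by row index i, then sort each row's entries by column index j.
  def toIndexedSparseRows(entries: Seq[(Long, Long, Double)], numCols: Int): Seq[(Long, SparseRow)] =
    entries.groupBy(_._1).toSeq.sortBy(_._1).map { case (i, rowEntries) =>
      val sorted = rowEntries.sortBy(_._2)
      (i, (numCols, sorted.map(_._2.toInt), sorted.map(_._3)))
    }
}
```

For the entries (0, 0, 9.0), (2, 1, 6.0), (1, 1, 8.0) with 2 columns, the result has one sparse row per row index: row 0 holds 9.0 at column 0, row 1 holds 8.0 at column 1, and row 2 holds 6.0 at column 1.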

Scala CoordinateMatrix API : http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.distributed.CoordinateMatrix

import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

import org.apache.spark.rdd.RDD

val entries: RDD[MatrixEntry] = ... // an RDD of matrix entries

// Create a CoordinateMatrix from an RDD[MatrixEntry].

val mat: CoordinateMatrix = new CoordinateMatrix(entries)

// Get its size.

val m = mat.numRows()

val n = mat.numCols()

// Convert it to an IndexedRowMatrix whose rows are sparse vectors.

val indexedRowMatrix = mat.toIndexedRowMatrix()

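1.4.4 BlockMatrix

A BlockMatrix is a distributed matrix backed by an RDD of MatrixBlocks, where a MatrixBlock is a tuple of ((Int, Int), Matrix). Here (Int, Int) is the index of the block, and Matrix is the sub-matrix at that index, with size rowsPerBlock x colsPerBlock. BlockMatrix supports methods such as add and multiply with another BlockMatrix. The helper function validate checks whether a BlockMatrix is set up properly.

A BlockMatrix is most easily created from an IndexedRowMatrix or a CoordinateMatrix by calling toBlockMatrix. By default, toBlockMatrix creates blocks of size 1024 x 1024. Users can change the block size by calling toBlockMatrix(rowsPerBlock, colsPerBlock).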
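The relationship between a global entry position and its block can be sketched with integer division and remainder. The helper below is a hypothetical illustration in plain Scala, assuming blocks are laid out on a regular grid of rowsPerBlock x colsPerBlock. It does not call any MLlib API.

```scala
// Illustrative only: index arithmetic for a regular block grid.
object BlockIndexSketch {
  // Map a global entry (i, j) to ((blockRow, blockCol), (rowInBlock, colInBlock)).
  def locate(i: Long, j: Long, rowsPerBlock: Int, colsPerBlock: Int): ((Int, Int), (Int, Int)) =
    (((i / rowsPerBlock).toInt, (j / colsPerBlock).toInt),
     ((i % rowsPerBlock).toInt, (j % colsPerBlock).toInt))

  // Number of blocks needed along one dimension (the last block may be partial).
  def numBlocks(dim: Long, perBlock: Int): Int =
    ((dim + perBlock - 1) / perBlock).toInt
}
```

With the default 1024 x 1024 block size, entry (1500, 10) falls in block (1, 0) at local position (476, 10), and a dimension of 2500 needs 3 blocks.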

Scala BlockMatrix API : http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.distributed.BlockMatrix

import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, CoordinateMatrix, MatrixEntry}

import org.apache.spark.rdd.RDD

val entries: RDD[MatrixEntry] = ... // an RDD of (i, j, v) matrix entries

// Create a CoordinateMatrix from an RDD[MatrixEntry].

val coordMat: CoordinateMatrix = new CoordinateMatrix(entries)

// Transform the CoordinateMatrix to a BlockMatrix

val matA: BlockMatrix = coordMat.toBlockMatrix().cache()

// Validate whether the BlockMatrix is set up properly. Throws an Exception when it is not valid.

// Nothing happens if it is valid.

matA.validate()

// Calculate A^T A.

val ata = matA.transpose.multiply(matA)

0 0