Spark MLlib: Basic Data Types


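Spark MLlib supports a fairly wide range of data types, from the basic RDD dataset up to vectors and matrices, both local and distributed across a cluster. The basic MLlib data types are:

Type                  Description
Local vector          A vector stored on a single machine; the basic set of values handed to Spark to operate on
Labeled point         A vector paired with a label, so that samples can be assigned to different classes
Local matrix          A matrix whose data is stored on a single machine
Distributed matrix    A matrix whose data is stored across the machines of a cluster; four implementations:
                      RowMatrix, IndexedRowMatrix, CoordinateMatrix, BlockMatrix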

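MLlib supports local vectors and matrices stored on a single machine, as well as distributed matrices backed by one or more RDDs. Local vectors and local matrices are simple data models that serve as public interfaces. The underlying linear algebra operations are provided by Breeze. A training example used in supervised learning is called a labeled point in MLlib.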

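1. Local vector: a local vector has integer-typed, 0-based indices and double-typed values, and is stored on a single machine. MLlib supports two types of local vectors: dense and sparse. A dense vector is backed by a double array holding its entry values, while a sparse vector is backed by two parallel arrays: indices and values. For example, the vector (1.0, 0.0, 3.0) can be represented in dense format as [1.0, 0.0, 3.0], or in sparse format as (3, [0, 2], [1.0, 3.0]), where 3 is the size of the vector.
A dense vector can be thought of as MLlib's own collection type; its contents and method calls resemble those of an Array. A sparse vector instead pairs every value with an explicit position: given the value array Array(9, 5, 2, 7), for instance, the four values are stored at the four positions named by the index array. Indices start at 0, not 1.
Values may be written as integer or floating-point literals; either way they are stored as Double.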

The base class of local vectors is Vector, and we provide two implementations: DenseVector and SparseVector. We recommend using the factory methods implemented in Vectors to create local vectors.

Refer to the Vector Scala docs and Vectors Scala docs for details on the API.

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Create a dense vector (1.0, 0.0, 3.0).
val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)
// Create a sparse vector (1.0, 0.0, 3.0) by specifying its indices and values corresponding to nonzero entries.
val sv1: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
// Create a sparse vector (1.0, 0.0, 3.0) by specifying its nonzero entries.
val sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0)))
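To make the relationship between the dense and sparse representations concrete, here is a minimal sketch in plain Scala with no Spark dependency. The object name SparseVectorSketch and its two helpers are hypothetical, introduced only for illustration; they are not part of MLlib.

```scala
object SparseVectorSketch {
  // Expand a (size, indices, values) sparse representation into a dense array.
  // Indices are 0-based; positions that are not listed stay at 0.0.
  def toDense(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
    require(indices.length == values.length, "indices and values must have the same length")
    val dense = Array.fill(size)(0.0)
    indices.zip(values).foreach { case (i, v) => dense(i) = v }
    dense
  }

  // Collect the nonzero entries of a dense array as (size, indices, values).
  def toSparse(dense: Array[Double]): (Int, Array[Int], Array[Double]) = {
    val nonzero = dense.zipWithIndex.filter { case (v, _) => v != 0.0 }
    (dense.length, nonzero.map(_._2), nonzero.map(_._1))
  }
}
```

Calling SparseVectorSketch.toDense(3, Array(0, 2), Array(1.0, 3.0)) rebuilds the dense form of the example vector, and toSparse goes the other way.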

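2. Labeled point: labeled points are used heavily in supervised learning algorithms. The label is stored as a double, so a labeled point can be used in both classification and regression. For binary classification, the label should be either 0 (negative) or 1 (positive). For multiclass classification, labels should be class indices starting from zero: 0, 1, 2, …. In MLlib, decision trees, random forests, and GBDT all take labeled points as input. In a classification problem such as credit card fraud detection, for example, each record can be marked with 0 or 1.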

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Create a labeled point with a positive label and a dense feature vector.
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
// Create a labeled point with a negative label and a sparse feature vector.
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
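Raw data often carries class names rather than numeric labels. The sketch below shows one possible way to turn class names into zero-based double labels of the kind described above. The object LabelEncodingSketch and the sorted-order assignment are assumptions made for illustration; MLlib does not prescribe how the mapping is built.

```scala
object LabelEncodingSketch {
  // Assign each distinct class name a zero-based index (in sorted order)
  // and store it as a Double, since labels are stored as doubles.
  def buildLabelIndex(classes: Seq[String]): Map[String, Double] =
    classes.distinct.sorted.zipWithIndex.map { case (name, i) => name -> i.toDouble }.toMap
}
```

For binary problems, check that the positive class ends up as 1.0; with sorted order that depends on the class names, so an explicit mapping may be the safer choice there.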

Sparse data

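Sparse training data is very common in practice. MLlib supports reading training examples stored in LIBSVM format, the default format used by LIBSVM and LIBLINEAR. It is a text format in which each line represents a labeled sparse feature vector using the following format:
label index1:value1 index2:value2 …
The indices are one-based and in ascending order. After loading, the feature indices are converted to zero-based.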

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
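To show what a single LIBSVM line contains, the following plain-Scala sketch parses one line into a label plus parallel index and value arrays. The object LibSvmLineSketch is hypothetical and far simpler than a real loader (no comment handling, no validation); it only illustrates the format and the shift from LIBSVM's one-based indices to zero-based ones.

```scala
object LibSvmLineSketch {
  // Parse "label index1:value1 index2:value2 ..." into
  // (label, zero-based indices, values).
  def parse(line: String): (Double, Array[Int], Array[Double]) = {
    val tokens = line.trim.split("\\s+")
    val label = tokens.head.toDouble
    val pairs = tokens.tail.map { token =>
      val parts = token.split(":")
      // LIBSVM indices are one-based; subtract 1 to get zero-based indices.
      (parts(0).toInt - 1, parts(1).toDouble)
    }
    (label, pairs.map(_._1), pairs.map(_._2))
  }
}
```

For example, the line "1 1:1.0 3:3.0" carries label 1.0 and the sparse features (indices [0, 2], values [1.0, 3.0]).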

Local matrix
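A local matrix has integer-typed row and column indices and double-typed values, and is stored on a single machine. MLlib supports dense matrices, whose entry values are stored in a single double array in column-major order, and sparse matrices, whose nonzero entry values are stored in Compressed Sparse Column (CSC) format in column-major order.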

The base class of local matrices is Matrix, and we provide two implementations: DenseMatrix, and SparseMatrix. We recommend using the factory methods implemented in Matrices to create local matrices. Remember, local matrices in MLlib are stored in column-major order.

Refer to the Matrix Scala docs and Matrices Scala docs for details on the API.

import org.apache.spark.mllib.linalg.{Matrix, Matrices}

// Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))
val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
// Create a sparse matrix ((9.0, 0.0), (0.0, 8.0), (0.0, 6.0))
val sm: Matrix = Matrices.sparse(3, 2, Array(0, 1, 3), Array(0, 2, 1), Array(9, 6, 8))
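Column-major order and the CSC arrays are easier to follow when decoded by hand. The sketch below is plain Scala with no Spark dependency; LocalMatrixSketch, denseAt, and cscToRows are hypothetical names used only for illustration. It shows where entry (i, j) sits in a column-major array and how the three CSC arrays (column pointers, row indices, values) expand into rows.

```scala
object LocalMatrixSketch {
  // Column-major lookup: entry (i, j) of a matrix with numRows rows
  // sits at position i + j * numRows of the flat values array.
  def denseAt(numRows: Int, values: Array[Double], i: Int, j: Int): Double =
    values(i + j * numRows)

  // Expand CSC arrays into a row-by-row 2D array.
  // Positions colPtrs(j) until colPtrs(j + 1) of rowIndices/values belong to column j.
  def cscToRows(numRows: Int, numCols: Int, colPtrs: Array[Int],
                rowIndices: Array[Int], values: Array[Double]): Array[Array[Double]] = {
    val out = Array.fill(numRows, numCols)(0.0)
    for (j <- 0 until numCols; k <- colPtrs(j) until colPtrs(j + 1))
      out(rowIndices(k))(j) = values(k)
    out
  }
}
```

With the arrays from the sparse example above, column 0 owns position 0 (value 9 in row 0) and column 1 owns positions 1 and 2 (value 6 in row 2, value 8 in row 1), which yields the matrix ((9.0, 0.0), (0.0, 8.0), (0.0, 6.0)).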

Distributed matrix
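A distributed matrix has long-typed row and column indices and double-typed values, and is stored distributively in one or more RDDs. Choosing the right format to store large, distributed matrices is very important. Converting a distributed matrix to a different format may require a global shuffle, which is quite expensive. Four types of distributed matrices have been implemented so far.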

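The basic type is called RowMatrix. A RowMatrix is a row-oriented distributed matrix without meaningful row indices, for example a collection of feature vectors. It is backed by an RDD of its rows, where each row is a local vector. A RowMatrix is assumed not to have a huge number of columns, so that a single local vector can reasonably be sent to the driver and stored or operated on by a single node. An IndexedRowMatrix is similar to a RowMatrix but has row indices, which can be used to identify rows and perform joins. A CoordinateMatrix is a distributed matrix stored in coordinate list (COO) format, backed by an RDD of its entries. A BlockMatrix is a distributed matrix backed by an RDD of MatrixBlock, where a MatrixBlock is a tuple of ((Int, Int), Matrix).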

RowMatrix
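A RowMatrix can be created from an RDD[Vector] instance. We can then compute its column summary statistics and decompositions. QR decomposition has the form A = QR, where Q is an orthogonal matrix and R is an upper triangular matrix. For singular value decomposition (SVD) and principal component analysis (PCA), refer to Dimensionality Reduction.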

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

val rows: RDD[Vector] = ... // an RDD of local vectors
// Create a RowMatrix from an RDD[Vector].
val mat: RowMatrix = new RowMatrix(rows)

// Get its size.
val m = mat.numRows()
val n = mat.numCols()

// QR decomposition
val qrResult = mat.tallSkinnyQR(true)
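As a rough illustration of what a column summary statistic means for a row-oriented matrix, the following plain-Scala sketch computes per-column means over an in-memory sequence of rows. ColumnStatsSketch is a hypothetical local analogue, not MLlib code; on a real RowMatrix the same idea would be carried out as a distributed aggregation over the row RDD.

```scala
object ColumnStatsSketch {
  // Compute the mean of every column, visiting the data one row at a time,
  // the way a row-oriented matrix exposes it.
  def columnMeans(rows: Seq[Array[Double]]): Array[Double] = {
    require(rows.nonEmpty, "at least one row is required")
    val numCols = rows.head.length
    val sums = Array.fill(numCols)(0.0)
    rows.foreach { row =>
      require(row.length == numCols, "all rows must have the same length")
      for (j <- 0 until numCols) sums(j) += row(j)
    }
    sums.map(_ / rows.length)
  }
}
```

Because each row contributes to a running sum per column, only one vector of length numCols has to be kept, which is why the number of columns is expected to stay moderate.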

IndexedRowMatrix
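An IndexedRowMatrix is similar to a RowMatrix but has meaningful row indices. It is backed by an RDD of indexed rows, so that each row is represented by its index (long-typed) and a local vector.

An IndexedRowMatrix can be created from an RDD[IndexedRow] instance, where IndexedRow is a wrapper over (Long, Vector). An IndexedRowMatrix can be converted to a RowMatrix by dropping its row indices.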

import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}
import org.apache.spark.rdd.RDD

val rows: RDD[IndexedRow] = ... // an RDD of indexed rows
// Create an IndexedRowMatrix from an RDD[IndexedRow].
val mat: IndexedRowMatrix = new IndexedRowMatrix(rows)

// Get its size.
val m = mat.numRows()
val n = mat.numCols()

// Drop its row indices.
val rowMat: RowMatrix = mat.toRowMatrix()

CoordinateMatrix
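A CoordinateMatrix is a distributed matrix backed by an RDD of its entries. Each entry is a tuple of (i: Long, j: Long, value: Double), where i is the row index, j is the column index, and value is the entry value. A CoordinateMatrix should be used only when both dimensions of the matrix are huge and the matrix is very sparse.

A CoordinateMatrix can be created from an RDD[MatrixEntry] instance, where MatrixEntry is a wrapper over (Long, Long, Double). A CoordinateMatrix can be converted to an IndexedRowMatrix with sparse rows by calling toIndexedRowMatrix. Other computations on CoordinateMatrix are not currently supported.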

import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
import org.apache.spark.rdd.RDD

val entries: RDD[MatrixEntry] = ... // an RDD of matrix entries
// Create a CoordinateMatrix from an RDD[MatrixEntry].
val mat: CoordinateMatrix = new CoordinateMatrix(entries)

// Get its size.
val m = mat.numRows()
val n = mat.numCols()

// Convert it to an IndexedRowMatrix whose rows are sparse vectors.
val indexedRowMatrix = mat.toIndexedRowMatrix()
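The conversion from coordinate entries to indexed sparse rows can be pictured as a group-by on the row index. The sketch below does this on an in-memory collection in plain Scala. CoordinateSketch and its Entry case class are hypothetical stand-ins for illustration, and the sketch assumes column indices fit in an Int; it is not the MLlib implementation.

```scala
object CoordinateSketch {
  // Stand-in for a (i, j, value) matrix entry.
  final case class Entry(i: Long, j: Long, value: Double)

  // Group entries by row index; each row becomes a pair of parallel arrays
  // (column indices in ascending order, values), i.e. a sparse row.
  def toIndexedRows(entries: Seq[Entry]): Map[Long, (Array[Int], Array[Double])] =
    entries.groupBy(_.i).map { case (rowIndex, rowEntries) =>
      val sorted = rowEntries.sortBy(_.j)
      val cols = sorted.map(_.j.toInt).toArray
      val vals = sorted.map(_.value).toArray
      (rowIndex, (cols, vals))
    }
}
```

Rows with no entries simply do not appear in the result, which is part of why this format suits very sparse matrices.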

BlockMatrix
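A BlockMatrix is a distributed matrix backed by an RDD of MatrixBlocks, where a MatrixBlock is a tuple of ((Int, Int), Matrix): (Int, Int) is the index of the block, and Matrix is the sub-matrix at that index, of size rowsPerBlock x colsPerBlock. BlockMatrix supports methods such as add and multiply with another BlockMatrix. BlockMatrix also has a helper function, validate, which can be used to check whether the BlockMatrix is set up properly.
A BlockMatrix can be created from an IndexedRowMatrix or a CoordinateMatrix by calling toBlockMatrix. By default, toBlockMatrix creates blocks of size 1024 x 1024. Users can change the block size by supplying the values through toBlockMatrix(rowsPerBlock, colsPerBlock).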

import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, CoordinateMatrix, MatrixEntry}
import org.apache.spark.rdd.RDD

val entries: RDD[MatrixEntry] = ... // an RDD of (i, j, v) matrix entries
// Create a CoordinateMatrix from an RDD[MatrixEntry].
val coordMat: CoordinateMatrix = new CoordinateMatrix(entries)
// Transform the CoordinateMatrix to a BlockMatrix
val matA: BlockMatrix = coordMat.toBlockMatrix().cache()

// Validate whether the BlockMatrix is set up properly. Throws an Exception when it is not valid.
// Nothing happens if it is valid.
matA.validate()

// Calculate A^T A.
val ata = matA.transpose.multiply(matA)
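To see how a global entry relates to the block grid, the sketch below maps a global (i, j) position to a block index and an offset inside that block, using integer division and remainder. BlockIndexSketch and locate are hypothetical names; the mapping is an assumption about how a fixed-size block layout is naturally addressed, offered for illustration rather than as the MLlib internals.

```scala
object BlockIndexSketch {
  // For global entry (i, j), return ((blockRow, blockCol), (rowInBlock, colInBlock))
  // under a fixed block size of rowsPerBlock x colsPerBlock.
  def locate(i: Long, j: Long, rowsPerBlock: Int, colsPerBlock: Int): ((Int, Int), (Int, Int)) = {
    val block = ((i / rowsPerBlock).toInt, (j / colsPerBlock).toInt)
    val offset = ((i % rowsPerBlock).toInt, (j % colsPerBlock).toInt)
    (block, offset)
  }
}
```

With the default 1024 x 1024 blocks, entry (2500, 10) falls in block (2, 0) at local position (452, 10).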
