Spark MLlib (2): Data Types

Source: Internet  Editor: 程序博客网  Time: 2024/06/06 06:43

Machine Learning Lib - Data Types

MLlib supports local vectors and matrices stored on a single machine, as well as distributed matrices backed by one or more RDDs. Local vectors and local matrices are simple data models that serve as public interfaces. The underlying linear algebra operations are provided by Breeze and jblas. A training example used in supervised learning is called a “labeled point” in MLlib.

Local vector


A local vector has integer-typed, 0-based indices and double-typed values. MLlib supports two kinds of local vector: dense and sparse. The official Scala code below creates both.

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Create a dense vector (1.0, 0.0, 3.0).
val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)
// Create a sparse vector (1.0, 0.0, 3.0) by specifying its indices and values corresponding to nonzero entries.
val sv1: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
// Create a sparse vector (1.0, 0.0, 3.0) by specifying its nonzero entries.
val sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0)))
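To make the relationship between the two representations concrete, here is a minimal plain-Scala sketch (no Spark dependency) of how the (size, indices, values) triple given to Vectors.sparse maps back to a dense array. The object SparseVectorSketch and its toDense helper are hypothetical names introduced only for illustration; this is not how MLlib stores or converts vectors internally.

```scala
// Hypothetical illustration; not part of MLlib.
object SparseVectorSketch {
  /** Expands (size, indices, values) into a dense array; unlisted positions stay 0.0. */
  def toDense(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
    require(indices.length == values.length, "indices and values must have the same length")
    val dense = Array.fill(size)(0.0)
    for (k <- indices.indices) {
      val i = indices(k)
      require(i >= 0 && i < size, s"index $i out of range for size $size")
      dense(i) = values(k)
    }
    dense
  }

  def main(args: Array[String]): Unit = {
    // Same arguments as Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)).
    println(toDense(3, Array(0, 2), Array(1.0, 3.0)).mkString("(", ", ", ")"))
  }
}
```

The sparse form only pays off when most entries are zero, since it stores two arrays (indices and values) instead of one.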


Labeled point

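A labeled point is a local vector, either dense or sparse, paired with a label, and is used in supervised learning. The official Scala code below creates labeled points.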
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Create a labeled point with a positive label and a dense feature vector.
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
// Create a labeled point with a negative label and a sparse feature vector.
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))

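Sparse data
MLlib can read training examples stored in LIBSVM format, the default text format used by LIBSVM and LIBLINEAR. Each line represents one labeled sparse feature vector, with one-based indices in ascending order:
label index1:value1 index2:value2 ...

Code to load the file: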

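import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")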
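As a rough picture of what one line of such a file contains, the plain-Scala sketch below parses a single "label index:value ..." line. LibSvmLineSketch and parseLine are hypothetical names for illustration only; the real parsing is done by MLUtils.loadLibSVMFile. The sketch assumes well-formed input and shifts the file's one-based indices to the zero-based indices that MLlib vectors use.

```scala
// Hypothetical illustration; MLUtils.loadLibSVMFile does the real parsing.
object LibSvmLineSketch {
  /** Parses "label index1:value1 ..." into (label, zero-based indices, values). */
  def parseLine(line: String): (Double, Array[Int], Array[Double]) = {
    val tokens = line.trim.split("\\s+")
    val label = tokens.head.toDouble
    val pairs = tokens.tail.map { item =>
      val parts = item.split(":")
      // Indices in the file are one-based; shift them to zero-based.
      (parts(0).toInt - 1, parts(1).toDouble)
    }
    (label, pairs.map(_._1), pairs.map(_._2))
  }

  def main(args: Array[String]): Unit = {
    val (label, indices, values) = parseLine("1 1:1.0 3:3.0")
    println(s"label=$label indices=${indices.mkString(",")} values=${values.mkString(",")}")
  }
}
```

The sample line "1 1:1.0 3:3.0" corresponds to label 1.0 with the sparse feature vector (1.0, 0.0, 3.0), provided the vector size is known to be 3.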

Local Matrix

Code to create a local matrix (the values are given in column-major order):

import org.apache.spark.mllib.linalg.{Matrix, Matrices}

// Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))
val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
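The order of the values array can look surprising at first: Matrices.dense takes the entries column by column, so entry (i, j) of a matrix with numRows rows sits at position i + j * numRows. The plain-Scala sketch below checks this against the example matrix; ColumnMajorSketch and its at helper are hypothetical names used only for illustration and do not depend on Spark.

```scala
// Hypothetical illustration of column-major indexing; not part of MLlib.
object ColumnMajorSketch {
  /** Returns entry (i, j) of a numRows x numCols matrix stored column by column. */
  def at(numRows: Int, numCols: Int, values: Array[Double], i: Int, j: Int): Double = {
    require(values.length == numRows * numCols, "values must hold numRows * numCols entries")
    require(i >= 0 && i < numRows && j >= 0 && j < numCols, "index out of range")
    values(i + j * numRows)
  }

  def main(args: Array[String]): Unit = {
    // Same array as in Matrices.dense(3, 2, ...).
    val values = Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0)
    for (i <- 0 until 3) {
      // Prints one matrix row per line: "1.0 2.0", "3.0 4.0", "5.0 6.0".
      println((0 until 2).map(j => at(3, 2, values, i, j)).mkString(" "))
    }
  }
}
```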


Distributed Matrix

The distributed matrix types include RowMatrix, IndexedRowMatrix, and CoordinateMatrix.
A RowMatrix is backed by an RDD of its rows and has no meaningful row indices; each row is a local vector.
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

val rows: RDD[Vector] = ... // an RDD of local vectors
// Create a RowMatrix from an RDD[Vector].
val mat: RowMatrix = new RowMatrix(rows)

// Get its size.
val m = mat.numRows()
val n = mat.numCols()


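An IndexedRowMatrix is also backed by an RDD of rows. It is similar to a RowMatrix but carries row indices, and it can be converted to a RowMatrix by dropping them.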
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}
import org.apache.spark.rdd.RDD

val rows: RDD[IndexedRow] = ... // an RDD of indexed rows
// Create an IndexedRowMatrix from an RDD[IndexedRow].
val mat: IndexedRowMatrix = new IndexedRowMatrix(rows)

// Get its size.
val m = mat.numRows()
val n = mat.numCols()

// Drop its row indices.
val rowMat: RowMatrix = mat.toRowMatrix()


A CoordinateMatrix is backed by an RDD of its entries, and it can be converted to an IndexedRowMatrix.
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
import org.apache.spark.rdd.RDD

val entries: RDD[MatrixEntry] = ... // an RDD of matrix entries
// Create a CoordinateMatrix from an RDD[MatrixEntry].
val mat: CoordinateMatrix = new CoordinateMatrix(entries)

// Get its size.
val m = mat.numRows()
val n = mat.numCols()

// Convert it to an IndexedRowMatrix whose rows are sparse vectors.
val indexedRowMatrix = mat.toIndexedRowMatrix()
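The idea behind converting coordinate entries into indexed rows can be pictured on a single machine: group the (row, column, value) entries by row index, and each group becomes one sparse row. The plain-Scala sketch below does only that. The names CoordinateEntriesSketch, Entry, and toIndexedRows are hypothetical and used for illustration; the sketch operates on an in-memory Seq rather than an RDD and is not MLlib's implementation.

```scala
// Hypothetical single-machine illustration; not MLlib's implementation.
object CoordinateEntriesSketch {
  /** One nonzero entry: (row index, column index, value). */
  final case class Entry(i: Long, j: Long, value: Double)

  /** Groups entries by row, giving (row index, (column, value) pairs sorted by column). */
  def toIndexedRows(entries: Seq[Entry]): Seq[(Long, Seq[(Long, Double)])] =
    entries
      .groupBy(_.i)
      .toSeq
      .map { case (row, es) => (row, es.map(e => (e.j, e.value)).sortBy(_._1)) }
      .sortBy(_._1)

  def main(args: Array[String]): Unit = {
    val entries = Seq(Entry(0L, 0L, 1.0), Entry(2L, 1L, 6.0), Entry(0L, 1L, 2.0))
    toIndexedRows(entries).foreach { case (row, cols) => println(s"row $row -> $cols") }
  }
}
```

Rows with no entries (row 1 in the sample data) simply do not appear in the result, which is why the coordinate form suits very sparse matrices.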




