CoordinateMatrix To IndexedRowMatrix or To RowMatrix then SVD


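According to the Spark 1.0.0 Scala docs, the linalg package in MLlib provides three distributed matrix types: RowMatrix, IndexedRowMatrix, and CoordinateMatrix. RowMatrix has both SVD and PCA methods, IndexedRowMatrix has an SVD method, and CoordinateMatrix has no SVD method at all. Yet CoordinateMatrix is the best choice for large sparse matrices. So what do you do when you want the SVD of one?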

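There is a way. A CoordinateMatrix can be converted to an IndexedRowMatrix with toIndexedRowMatrix() and then decomposed. It can also be converted to a RowMatrix with toRowMatrix() (the source code shows that this first converts to an IndexedRowMatrix and then to a RowMatrix), after which both SVD and PCA are available. One question is worth considering, though: does the conversion pad the (row, column, value) entries of the CoordinateMatrix with zeros, as a dense matrix would? If it did, a large sparse matrix could balloon during conversion and run out of memory.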
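To put a rough number on that concern, here is a back-of-envelope estimate in plain Scala. The dimensions and nonzero count are hypothetical, and the byte counts assume 8-byte doubles and 4-byte column indices while ignoring per-object overhead, so treat it as a sketch rather than a measurement.

```scala
// Rough storage estimate; all sizes below are hypothetical.
object StorageEstimate {
  // Dense storage: every cell holds an 8-byte Double.
  def denseBytes(rows: Long, cols: Long): Long = rows * cols * 8L

  // Sparse-row storage: each nonzero keeps a 4-byte Int column index
  // plus an 8-byte Double value. Per-row object overhead is ignored.
  def sparseBytes(nonZeros: Long): Long = nonZeros * (4L + 8L)

  def main(args: Array[String]): Unit = {
    val rows = 1000000L   // hypothetical row count
    val cols = 100000L    // hypothetical column count
    val nnz  = 10000000L  // hypothetical number of nonzero entries
    println(s"dense:  ${denseBytes(rows, cols)} bytes")
    println(s"sparse: ${sparseBytes(nnz)} bytes")
  }
}
```

With these made-up numbers the dense layout needs about 800 GB, while the sparse layout needs about 120 MB, which is why it matters whether the conversion keeps the rows sparse.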

Here is toIndexedRowMatrix() from the CoordinateMatrix source code:

def toIndexedRowMatrix(): IndexedRowMatrix = {
  val nl = numCols()
  if (nl > Int.MaxValue) {
    sys.error(s"Cannot convert to a row-oriented format because the number of columns $nl is " +
      "too large.")
  }
  val n = nl.toInt
  val indexedRows = entries.map(entry => (entry.i, (entry.j.toInt, entry.value)))
    .groupByKey()
    .map { case (i, vectorEntries) =>
      IndexedRow(i, Vectors.sparse(n, vectorEntries.toSeq))
    }
  new IndexedRowMatrix(indexedRows, numRows(), n)
}

The Vectors.sparse(n, vectorEntries.toSeq) call builds a sparse vector, which means only the nonzero entries are stored. Each row of the resulting IndexedRowMatrix is therefore a sparse vector. Converting that IndexedRowMatrix on to a RowMatrix uses the following source code:

def toRowMatrix(): RowMatrix = {
  new RowMatrix(rows.map(_.vector), 0L, nCols)
}

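So the RowMatrix is built from an RDD[Vector], or more precisely here, from an RDD whose elements are SparseVector instances, since SparseVector extends Vector. The data should therefore not blow up during the conversion, although I am not certain this reasoning is correct.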
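The grouping step in toIndexedRowMatrix() can be modelled locally to see why no zeros are materialized. The sketch below uses plain Scala collections rather than Spark; Entry and SparseRow are hypothetical stand-ins for MatrixEntry and for an IndexedRow that holds a sparse vector, and the sample entries are made up.

```scala
// Local model of the conversion on plain Scala collections (no Spark involved).
// Entry and SparseRow are hypothetical stand-ins, not Spark classes.
object SparseRowSketch {
  case class Entry(i: Long, j: Long, value: Double)
  case class SparseRow(index: Long, size: Int, indices: Array[Int], values: Array[Double])

  def toSparseRows(entries: Seq[Entry], numCols: Int): Seq[SparseRow] =
    entries
      .groupBy(_.i)        // plays the role of groupByKey on the row index
      .toSeq
      .sortBy(_._1)        // sorted only to make the printed order stable
      .map { case (i, rowEntries) =>
        val sorted = rowEntries.sortBy(_.j)
        // Only the column indices and values that actually occur are kept.
        SparseRow(i, numCols, sorted.map(_.j.toInt).toArray, sorted.map(_.value).toArray)
      }

  def main(args: Array[String]): Unit = {
    val entries = Seq(Entry(0, 6, 7.0), Entry(2, 3, 4.0), Entry(2, 1, 2.0))
    toSparseRows(entries, numCols = 7).foreach { r =>
      println(s"row ${r.index}: size=${r.size}, " +
        s"indices=${r.indices.mkString(",")}, values=${r.values.mkString(",")}")
    }
  }
}
```

In this model each row records its logical width (size) but stores one index and one value per nonzero entry, so the total stored values equal the number of input entries. Rows without any entries do not appear at all.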

Below is the result of converting the CoordinateMatrix from the previous post to an IndexedRowMatrix, then to a RowMatrix, and then computing the SVD.

import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}
import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.SingularValueDecomposition

scala> val indexedRowMatrix = mat.toIndexedRowMatrix()
indexedRowMatrix: org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix = org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix@65e3fe67

scala> val rowMat:RowMatrix=indexedRowMatrix.toRowMatrix()
rowMat: org.apache.spark.mllib.linalg.distributed.RowMatrix = org.apache.spark.mllib.linalg.distributed.RowMatrix@66bb8f81

scala> val svd: SingularValueDecomposition[RowMatrix, Matrix] = rowMat.computeSVD(5, computeU = true)

svd: org.apache.spark.mllib.linalg.SingularValueDecomposition[org.apache.spark.mllib.linalg.distributed.RowMatrix,org.apache.spark.mllib.linalg.Matrix] =
SingularValueDecomposition(org.apache.spark.mllib.linalg.distributed.RowMatrix@1673b484,[7.0,4.0,2.0],0.0 0.0 0.0
0.0 0.0 1.0
0.0 0.0 0.0
0.0 1.0 0.0
0.0 0.0 0.0
0.0 0.0 0.0
1.0 0.0 0.0 )

scala> val U: RowMatrix = svd.U
U: org.apache.spark.mllib.linalg.distributed.RowMatrix = org.apache.spark.mllib.linalg.distributed.RowMatrix@1673b484

scala> val s: Vector = svd.s
s: org.apache.spark.mllib.linalg.Vector = [7.0,4.0,2.0]

scala> val V: Matrix = svd.V
V: org.apache.spark.mllib.linalg.Matrix =
0.0 0.0 0.0
0.0 0.0 1.0
0.0 0.0 0.0
0.0 1.0 0.0
0.0 0.0 0.0
0.0 0.0 0.0
1.0 0.0 0.0
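For anyone who wants to rerun the pipeline outside the shell and check the row types directly, the following standalone sketch may help. It assumes Spark MLlib is on the classpath and runs in local mode; the matrix entries are hypothetical (they are not taken from the previous post), and the object and method names are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.SparseVector
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

object CoordinateSvdSketch {
  // Returns (whether every converted row is a SparseVector, the singular values).
  def run(sc: SparkContext): (Boolean, Array[Double]) = {
    // Hypothetical 3 x 7 matrix with one nonzero per row.
    val entries = sc.parallelize(Seq(
      MatrixEntry(0, 6, 7.0),
      MatrixEntry(1, 3, 4.0),
      MatrixEntry(2, 1, 2.0)))
    val mat = new CoordinateMatrix(entries, 3, 7)

    // CoordinateMatrix -> IndexedRowMatrix -> RowMatrix, as in the transcript above.
    val rowMat = mat.toIndexedRowMatrix().toRowMatrix()

    // Inspect the concrete vector class of each row.
    val allSparse = rowMat.rows.collect().forall(_.isInstanceOf[SparseVector])

    val svd = rowMat.computeSVD(3, computeU = true)
    (allSparse, svd.s.toArray)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CoordinateSvdSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (allSparse, singularValues) = run(sc)
      println(s"all rows sparse: $allSparse")
      println(s"singular values: ${singularValues.mkString(", ")}")
    } finally {
      sc.stop()
    }
  }
}
```

Checking isInstanceOf[SparseVector] on the collected rows is only practical for a small test matrix like this one; on real data, sampling a few rows with take would be the safer choice.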

0 0