CoordinateMatrix To IndexedRowMatrix or To RowMatrix then SVD
Source: Internet. Editor: 程序博客网. Time: 2024/06/15 11:44
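Judging from the Spark 1.0.0 Scala docs, MLlib's linalg package currently offers three kinds of distributed matrix: RowMatrix, IndexedRowMatrix, and CoordinateMatrix. RowMatrix has both SVD and PCA methods, IndexedRowMatrix has an SVD method, and CoordinateMatrix has no SVD method at all. Yet CoordinateMatrix is the best choice for a large sparse matrix. So what do you do if you want an SVD of one?
There is a way. CoordinateMatrix can be converted to an IndexedRowMatrix with toIndexedRowMatrix() and then decomposed. It can also be converted to a RowMatrix with toRowMatrix() (the source shows that this first converts to an IndexedRowMatrix and then to a RowMatrix), after which SVD and PCA are both available. One question is worth thinking about, though. Does the conversion take the (row, column, value) entries of the CoordinateMatrix and pad them with zeros, as if building a dense matrix? If so, a large sparse matrix could balloon during the conversion and run out of memory.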
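To put rough numbers on that worry, here is a back-of-the-envelope sketch. The matrix dimensions and the non-zero count are hypothetical, chosen only for illustration. It assumes a dense row costs 8 bytes per element and a sparse row costs one 4-byte Int index plus one 8-byte Double per non-zero, and it ignores object overhead.

```scala
// Back-of-the-envelope storage estimate; all sizes used in main are hypothetical.
object StorageEstimate {
  val BytesPerDouble = 8L
  val BytesPerInt = 4L

  // Dense rows store every element, zero or not.
  def denseBytes(numRows: Long, numCols: Long): Long =
    numRows * numCols * BytesPerDouble

  // Sparse rows store one Int index and one Double value per non-zero.
  def sparseBytes(numNonZeros: Long): Long =
    numNonZeros * (BytesPerInt + BytesPerDouble)

  def main(args: Array[String]): Unit = {
    val numRows = 1000000L      // hypothetical
    val numCols = 100000L       // hypothetical
    val numNonZeros = 10000000L // hypothetical
    println(s"dense:  ${denseBytes(numRows, numCols)} bytes")
    println(s"sparse: ${sparseBytes(numNonZeros)} bytes")
  }
}
```

With these made-up sizes the dense layout would need about 800 GB and the sparse layout about 120 MB, which is why it matters whether the conversion pads with zeros.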
Looking at the CoordinateMatrix source, toIndexedRowMatrix() is defined as:

    def toIndexedRowMatrix(): IndexedRowMatrix = {
      val nl = numCols()
      if (nl > Int.MaxValue) {
        sys.error(s"Cannot convert to a row-oriented format because the number of columns $nl is " +
          "too large.")
      }
      val n = nl.toInt
      val indexedRows = entries.map(entry => (entry.i, (entry.j.toInt, entry.value)))
        .groupByKey()
        .map { case (i, vectorEntries) =>
          IndexedRow(i, Vectors.sparse(n, vectorEntries.toSeq))  // highlighted in the original post
        }
      new IndexedRowMatrix(indexedRows, numRows(), n)
    }

The highlighted call, Vectors.sparse(n, vectorEntries.toSeq), builds a sparse vector, so only the non-zero elements are stored. Each row of the resulting IndexedRowMatrix is therefore a sparse vector. Converting that IndexedRowMatrix on to a RowMatrix uses the following source:
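    def toRowMatrix(): RowMatrix = {
      new RowMatrix(rows.map(_.vector), 0L, nCols)
    }

The RowMatrix is built from an RDD[Vector], and in this case the elements are more precisely SparseVector instances, since SparseVector extends Vector. So the conversion should not blow up the data. I am not certain this reasoning is correct.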
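One way to check this reasoning empirically is to inspect the row vectors after the conversion. The following is a minimal sketch, assuming it runs in spark-shell (where `sc` is already defined). The 4 x 7 matrix and its three entries are hypothetical and are not the matrix from the previous post.

```scala
import org.apache.spark.mllib.linalg.SparseVector
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

// Hypothetical 4 x 7 matrix with three non-zero entries; `sc` is the
// SparkContext that spark-shell provides.
val entries = sc.parallelize(Seq(
  MatrixEntry(0L, 6L, 7.0),
  MatrixEntry(1L, 3L, 4.0),
  MatrixEntry(3L, 1L, 2.0)))
val coordMat = new CoordinateMatrix(entries, 4L, 7L)

val rowMat = coordMat.toIndexedRowMatrix().toRowMatrix()

// For each row, report the vector class, its logical size, and how many
// values it physically stores.
val summary = rowMat.rows.map {
  case sv: SparseVector => (sv.getClass.getSimpleName, sv.size, sv.values.length)
  case v => (v.getClass.getSimpleName, v.size, v.toArray.length)
}.collect()

summary.foreach(println)
```

If every row reports SparseVector with far fewer stored values than its size, the conversion did not pad with zeros. Two caveats follow from the quoted source: rows with no entries never appear after groupByKey(), and toRowMatrix() drops the row indices. This only covers the conversion; computeSVD may still build dense intermediates of its own, depending on the Spark version.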
Below is the result of taking the CoordinateMatrix from the previous post (mat), converting it to an IndexedRowMatrix, then to a RowMatrix, and then computing the SVD.
import org.apache.spark.mllib.linalg.{Matrix, SingularValueDecomposition, Vector}
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}
scala> val indexedRowMatrix = mat.toIndexedRowMatrix()
indexedRowMatrix: org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix = org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix@65e3fe67
scala> val rowMat:RowMatrix=indexedRowMatrix.toRowMatrix()
rowMat: org.apache.spark.mllib.linalg.distributed.RowMatrix = org.apache.spark.mllib.linalg.distributed.RowMatrix@66bb8f81
scala> val svd: SingularValueDecomposition[RowMatrix, Matrix] = rowMat.computeSVD(5, computeU = true)
svd: org.apache.spark.mllib.linalg.SingularValueDecomposition[org.apache.spark.mllib.linalg.distributed.RowMatrix,org.apache.spark.mllib.linalg.Matrix] =
SingularValueDecomposition(org.apache.spark.mllib.linalg.distributed.RowMatrix@1673b484,[7.0,4.0,2.0],0.0 0.0 0.0
0.0 0.0 1.0
0.0 0.0 0.0
0.0 1.0 0.0
0.0 0.0 0.0
0.0 0.0 0.0
1.0 0.0 0.0 )
scala> val U: RowMatrix = svd.U
U: org.apache.spark.mllib.linalg.distributed.RowMatrix = org.apache.spark.mllib.linalg.distributed.RowMatrix@1673b484
scala> val s: Vector = svd.s
s: org.apache.spark.mllib.linalg.Vector = [7.0,4.0,2.0]
scala> val V: Matrix = svd.V
V: org.apache.spark.mllib.linalg.Matrix =
0.0 0.0 0.0
0.0 0.0 1.0
0.0 0.0 0.0
0.0 1.0 0.0
0.0 0.0 0.0
0.0 0.0 0.0
1.0 0.0 0.0
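As a sketch of how to read this output: computeSVD returns factors such that A ≈ U Σ Vᵀ, where s holds the diagonal of Σ and the columns of V are the right singular vectors. Taking the printed values at face value, and writing e_j for the standard basis vector of R^7 with a 1 at 0-based position j:

```latex
A \approx U \Sigma V^{\top}, \qquad
\Sigma = \operatorname{diag}(7,\, 4,\, 2), \qquad
V = \begin{pmatrix} e_6 & e_3 & e_1 \end{pmatrix}
\quad\Longrightarrow\quad
A e_6 = 7\,u_1, \quad A e_3 = 4\,u_2, \quad A e_1 = 2\,u_3
```

Here u_1, u_2, u_3 denote the columns of U. Under this reading, columns 6, 3, and 1 of the input matrix carry norms 7, 4, and 2 respectively. Although 5 singular values were requested, only 3 came back, which suggests the remaining ones were negligible.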