Spark MLlib 1.6 -- Dimensionality Reduction

Source: Internet. Editor: 程序博客网. Time: 2024/05/19 11:18

· Singular value decomposition (SVD)

· Performance

· SVD Example

· Principal component analysis (PCA)

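Dimensionality reduction reduces the number of variables under consideration, which lowers the amount and complexity of computation. It can remove the negligible or noisy parts of feature vectors, or compress the vectors into fewer dimensions while preserving their main structure. spark.mllib supports dimensionality reduction on the RowMatrix class.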

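6.1 Singular Value Decomposition (SVD)

Singular value decomposition factorizes a matrix A into the product of three matrices U, \Sigma, and V such that

A = U \Sigma V^T

where

1) U is an orthonormal matrix whose columns are called the left singular vectors

2) \Sigma is a diagonal matrix with non-negative diagonal entries, sorted in descending order; these diagonal entries are called the singular values

3) V is an orthonormal matrix whose columns are called the right singular vectors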
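Written out as formulas, the three conditions read as follows. The symbols \sigma_i (the individual singular values) and I (the identity matrix) are notation introduced here for illustration.

```latex
\[
A = U \Sigma V^T, \qquad U^T U = I, \qquad V^T V = I, \qquad
\Sigma = \operatorname{diag}(\sigma_1, \sigma_2, \dots), \quad
\sigma_1 \ge \sigma_2 \ge \dots \ge 0 .
\]
```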

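For large matrices we usually do not need the complete factorization. It is enough to keep the block of \Sigma that holds the nonzero (or largest) singular values, together with the corresponding leading columns of U and V. This saves storage, removes noise, and preserves the dominant low-rank structure of the matrix.

We can also choose to keep only the top k singular values, in which case the dimensions of the three factors are: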

1) U: m x k

2) \Sigma: k x k

3) V: n x k
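As a sketch of what this truncation means, the rank-k factors give an approximation A_k of A. The subscripted names U_k, \Sigma_k, V_k, A_k and the singular values \sigma_i are notation introduced here. The error formula is the standard Eckart-Young result for the Frobenius norm, which shows that dropping small singular values loses little.

```latex
\[
A \approx A_k = U_k \Sigma_k V_k^T, \qquad
U_k \in \mathbb{R}^{m \times k}, \quad
\Sigma_k \in \mathbb{R}^{k \times k}, \quad
V_k \in \mathbb{R}^{n \times k},
\]
\[
\lVert A - A_k \rVert_F = \sqrt{\sum_{i > k} \sigma_i^2},
\qquad
\text{storage: } k\,(m + n + 1) \text{ numbers instead of } m\,n .
\]
```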

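6.1.1 Performance

Assume n < m. The singular values and the right singular vectors are then derived from the eigenvalues and eigenvectors of the Gramian matrix A^T A (the singular values are the square roots of the eigenvalues). The matrix of left singular vectors U is computed by matrix multiplication as U = A (V S^(-1)), where S is the diagonal matrix of singular values, if the user requests it through the computeU parameter. The actual method is chosen automatically according to the computational cost: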
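The relationship to the Gramian matrix follows directly from the definition of the SVD. The short derivation below is a sketch that writes \Sigma for the diagonal matrix of singular values (denoted S in the formula above) and assumes the singular values being inverted are nonzero.

```latex
\[
A^T A = (U \Sigma V^T)^T (U \Sigma V^T)
      = V \Sigma \, U^T U \, \Sigma V^T
      = V \Sigma^2 V^T ,
\]
so the columns of $V$ are eigenvectors of $A^T A$ with eigenvalues $\sigma_i^2$. Moreover,
\[
A V = U \Sigma V^T V = U \Sigma
\quad\Longrightarrow\quad
U = A \, V \, \Sigma^{-1} .
\]
```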

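1) If n is small (n < 100) or k is large compared with n (k > n / 2), we first compute the Gramian matrix A^T A and then compute its top k eigenvalues and eigenvectors locally on the driver. This requires a single pass with O(n^2) storage on each executor and on the driver, and O(n^2 k) time on the driver.

2) Otherwise, we compute (A^T A)v in a distributed way and send the result to ARPACK (a library for large-scale eigenvalue problems), which computes the top k eigenvalues and eigenvectors of A^T A on the driver. This requires O(k) passes, O(n) storage on each executor, and O(n k) storage on the driver.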
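The two cases can be illustrated with a small, Spark-free Scala sketch. It is not Spark's implementation: the object name `SvdStrategySketch` and the names `chooseStrategy` and `gramianTimes` are hypothetical, the thresholds are simply the ones quoted in the list above, and a plain `Seq` of rows stands in for a distributed matrix. The second function shows why case 2 needs only O(n) storage per executor: (A^T A)v can be accumulated row by row without ever forming A^T A.

```scala
object SvdStrategySketch {
  sealed trait Strategy
  case object LocalGramian extends Strategy      // case 1: form A^T A, solve on the driver
  case object DistributedArpack extends Strategy // case 2: iterate using (A^T A) v products

  // Thresholds copied from the text (n < 100, k > n / 2); illustrative only,
  // the real library may apply additional conditions.
  def chooseStrategy(n: Int, k: Int): Strategy =
    if (n < 100 || k > n / 2) LocalGramian else DistributedArpack

  // (A^T A) v = sum over rows a_i of (a_i . v) * a_i, so A^T A is never materialized.
  def gramianTimes(rows: Seq[Array[Double]], v: Array[Double]): Array[Double] = {
    val n = v.length
    rows.foldLeft(Array.fill(n)(0.0)) { (acc, row) =>
      require(row.length == n, "every row must have the same length as v")
      val dot = (0 until n).map(j => row(j) * v(j)).sum
      Array.tabulate(n)(j => acc(j) + dot * row(j))
    }
  }
}
```

For example, with rows (1, 2) and (3, 4) and v = (1, 0), `gramianTimes` returns (10, 14), which is the first column of A^T A. In a distributed setting each partition would compute its own partial sum and the partial sums would be added together.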

6.1.2 SVD Example

spark.mllib provides SVD for row-oriented matrices through the RowMatrix class.

SingularValueDecomposition Scala docs: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.SingularValueDecomposition

import org.apache.spark.mllib.linalg.Matrix

import org.apache.spark.mllib.linalg.Vector

import org.apache.spark.mllib.linalg.SingularValueDecomposition

import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat: RowMatrix = ...

// Compute the top 20 singular values and corresponding singular vectors.

val svd: SingularValueDecomposition[RowMatrix, Matrix] = mat.computeSVD(20, computeU = true)

val U: RowMatrix = svd.U // The U factor is a RowMatrix.

val s: Vector = svd.s // The singular values are stored in a local dense vector.

val V: Matrix = svd.V // The V factor is a local dense matrix.

The same code applies to IndexedRowMatrix if U is defined as an IndexedRowMatrix.

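6.2 Principal Component Analysis (PCA)

Principal component analysis (PCA) is a statistical method that finds a rotation of the coordinate axes such that the first new coordinate has the largest possible variance, and each succeeding coordinate in turn has the largest variance possible. Projecting the original feature vectors onto the rotated axes yields a new set of feature vectors. The columns of the rotation matrix (the axis vectors) are called principal components. PCA is widely used for dimensionality reduction.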
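In formulas, a sketch of the standard formulation looks as follows. The symbols x_i (data rows), \mu (their mean), C (sample covariance), W, \Lambda, \lambda_j, and y_i are notation introduced here. The projection is written without mean subtraction so that it matches a plain matrix product of the data with the principal components; many textbook treatments subtract \mu first.

```latex
\[
\mu = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad
C = \frac{1}{m-1} \sum_{i=1}^{m} (x_i - \mu)(x_i - \mu)^T, \qquad
C = W \Lambda W^T, \quad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n \ge 0 .
\]
The top $k$ principal components are the first $k$ columns of $W$, written $W_k \in \mathbb{R}^{n \times k}$,
and each vector is projected to $k$ dimensions by
\[
y_i = W_k^T x_i \in \mathbb{R}^{k} .
\]
```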

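spark.mllib supports PCA for tall-and-skinny matrices stored in row-oriented format, and for any collection of Vectors.

The following code demonstrates how to compute principal components on a RowMatrix and use them to project the vectors into a low-dimensional space.

RowMatrix Scala docs: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix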

import org.apache.spark.mllib.linalg.Matrix

import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat: RowMatrix = ...

// Compute the top 10 principal components.

val pc: Matrix = mat.computePrincipalComponents(10) // Principal components are stored in a local dense matrix.

// Project the rows to the linear space spanned by the top 10 principal components.

val projected: RowMatrix = mat.multiply(pc)

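The following code shows how to compute principal components on a set of source vectors and use them to project the vectors into a low-dimensional space while keeping the associated labels.

PCA Scala docs: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.feature.PCA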

import org.apache.spark.mllib.regression.LabeledPoint

import org.apache.spark.mllib.feature.PCA

import org.apache.spark.rdd.RDD

val data: RDD[LabeledPoint] = ...

// Compute the top 10 principal components.

val pca = new PCA(10).fit(data.map(_.features))

// Project vectors to the linear space spanned by the top 10 principal components, keeping the label.

val projected = data.map(p => p.copy(features = pca.transform(p.features)))

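To run the example code above, follow the instructions in the Self-Contained Applications section of the Spark quick start guide, and make sure spark-mllib is included as a dependency in your build file.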

0 1