The SVD Algorithm in Practice: A Worked Example

Source: Internet. Editor: 程序博客网. Time: 2024/06/06 20:07

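I won't go into how SVD is implemented under the hood; let's first talk about what we can actually do with it. Running SVD on a data matrix gives us its singular values. Each singular value belongs to one singular direction, which is a linear combination of the original attributes, and the larger the value, the more that direction influences our judgment. When a singular value is small, we can ignore the corresponding direction and still judge things fairly accurately. This is exactly how SVD performs dimensionality reduction.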
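In symbols, the idea above can be sketched as follows. The notation ($A$ for the data matrix, $\sigma_i$ for the singular values, $u_i$ and $v_i$ for the singular vectors, $k$ for the number of directions kept) is introduced here for illustration and does not appear in the original post.

```latex
\begin{aligned}
A &= U \Sigma V^{\mathsf{T}} = \sum_{i=1}^{r} \sigma_i \, u_i v_i^{\mathsf{T}},
  \qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r \ge 0, \\
A_k &= \sum_{i=1}^{k} \sigma_i \, u_i v_i^{\mathsf{T}}, \qquad k \le r, \\
\lVert A - A_k \rVert_F^2 &= \sum_{i=k+1}^{r} \sigma_i^2 .
\end{aligned}
```

The last line is why small singular values can be dropped: the squared error of keeping only the first $k$ directions equals the sum of the squared singular values that were discarded.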
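Below we call MLlib's SVD and k-means algorithms to check how accurate SVD-style dimensionality reduction is.
1. First, run SVD on the data to analyze its singular values: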

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.linalg.{Matrix, SingularValueDecomposition, Vector, Vectors}
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    object MySVD {
      val conf = new SparkConf().setAppName("Svd").setMaster("local")

      def main(args: Array[String]): Unit = {
        test1()
      }

      def test1(): Unit = {
        val sc = new SparkContext(conf)
        // Four samples (rows), each with five features (columns).
        val data = Array(
          Vectors.dense(5.0, 1.0, 1.0, 3.0, 7.0),
          Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
          Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0),
          Vectors.dense(4.0, 1.0, 0.0, 6.0, 7.0))
        val dataRDD = sc.parallelize(data, 2)
        val mat: RowMatrix = new RowMatrix(dataRDD)

        // Compute the top 5 singular values and corresponding singular vectors.
        val svd: SingularValueDecomposition[RowMatrix, Matrix] = mat.computeSVD(5, computeU = true)
        val U: RowMatrix = svd.U // The U factor is a RowMatrix.
        val s: Vector = svd.s    // The singular values are stored in a local dense vector.
        val V: Matrix = svd.V    // The V factor is a local dense matrix.

        val collect = U.rows.collect()
        println("U factor is:")
        collect.foreach { vector => println(vector) }
        println(s"Singular values are: $s")
        println(s"V factor is:\n$V")
        sc.stop()
      }
    }

Result:

Singular values are: [18.07857954647125,2.8132737647378407,2.604276497395555,0.6842486621559452,7.432464828786868E-8]
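To put these numbers in perspective, one common way to read singular values is to look at how much of the total squared magnitude each leading direction accounts for. The sketch below is a plain Scala illustration with no Spark dependency; the object name `SingularValueEnergy` and the function `cumulativeEnergy` are hypothetical names chosen here, and the input is simply the output printed above.

```scala
object SingularValueEnergy {
  // Singular values copied from the SVD output above.
  val singularValues: Array[Double] = Array(
    18.07857954647125, 2.8132737647378407, 2.604276497395555,
    0.6842486621559452, 7.432464828786868e-8)

  /** Cumulative share of the total squared singular values
    * captured by the first 1, 2, ... singular directions. */
  def cumulativeEnergy(s: Array[Double]): Array[Double] = {
    val squared = s.map(v => v * v)
    val total = squared.sum
    // Running sums of the squared values, normalised by the total.
    squared.scanLeft(0.0)(_ + _).tail.map(_ / total)
  }

  def main(args: Array[String]): Unit = {
    cumulativeEnergy(singularValues).zipWithIndex.foreach { case (share, i) =>
      println(f"top ${i + 1} direction(s): ${share * 100}%.2f%%")
    }
  }
}
```

For this data the first direction alone accounts for well over 90% of the total, which is the quantitative version of saying that the first singular value dominates.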

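The first singular value (about 18.08) dominates. The last one (about 7.4E-8) is effectively zero, which is expected: a matrix with only 4 rows has rank at most 4, so a fifth singular direction carries no information. The fourth (about 0.68) is also small next to the first. Strictly speaking, singular values belong to singular directions rather than to individual columns, so they do not rank the original features one by one. As a rough experiment, the test below treats the second-to-last feature (the fourth column) as the one with almost no influence.
2. Next, cluster with k-means: run it once on the original data, then once more with the second-to-last feature removed: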

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    object TestKmeansSvd {
      def main(args: Array[String]): Unit = {
        println("Original data:")
        beforeSvd()
        println("Fourth column removed:")
        Svdkenmans()
      }

      // K-means on the original data with all five features.
      def beforeSvd(): Unit = runKMeans(Array(
        Vectors.dense(5.0, 1.0, 1.0, 3.0, 7.0),
        Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
        Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0),
        Vectors.dense(4.0, 1.0, 0.0, 6.0, 7.0)))

      // K-means on the same data with the fourth column removed.
      def Svdkenmans(): Unit = runKMeans(Array(
        Vectors.dense(5.0, 1.0, 1.0, 7.0),
        Vectors.dense(2.0, 0.0, 3.0, 5.0),
        Vectors.dense(4.0, 0.0, 0.0, 7.0),
        Vectors.dense(4.0, 1.0, 0.0, 7.0)))

      private def runKMeans(rows: Array[Vector]): Unit = {
        val conf = new SparkConf().setAppName("k-means").setMaster("local")
        val sc = new SparkContext(conf)
        val parsedData = sc.parallelize(rows, 2)
        val numClusters = 2    // number of clusters to split the data into
        val numIterations = 20 // maximum number of iterations
        // Train the model from the data and the parameters.
        val clusters = KMeans.train(parsedData, numClusters, numIterations)
        // Evaluate clustering by computing Within Set Sum of Squared Errors.
        val WSSSE = clusters.computeCost(parsedData)
        println("Within Set Sum of Squared Errors = " + WSSSE)
        // Predict the cluster of each row and print the labels in row order.
        val result = clusters.predict(parsedData)
        result.collect().foreach(println)
        // Stop the context so the next experiment can create its own.
        sc.stop()
      }
    }

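Both runs produce the same clustering, which suggests that the reduction loses little information here and is fairly accurate. You can also run one more test: remove the first feature instead and cluster the data again, and the result is then different.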
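Removing an original column by hand only approximates what SVD-based reduction does. A more direct variant is to project every row onto the top k right singular vectors and cluster the projected rows. The following is a minimal sketch under assumptions: the object name `SvdProjectionKMeans`, the helper `reduceAndCluster`, and the choice of k = 2 are illustrative and not from the original post. It relies on `RowMatrix.computeSVD`, `RowMatrix.multiply`, and `KMeans.train` from MLlib.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

object SvdProjectionKMeans {
  /** Projects the sample data onto its top-k right singular vectors,
    * clusters the projected rows into two groups, and returns one
    * cluster label per row. (Hypothetical helper for illustration.) */
  def reduceAndCluster(sc: SparkContext, k: Int): Array[Int] = {
    val rows = sc.parallelize(Seq(
      Vectors.dense(5.0, 1.0, 1.0, 3.0, 7.0),
      Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
      Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0),
      Vectors.dense(4.0, 1.0, 0.0, 6.0, 7.0)), 2).cache()
    val mat = new RowMatrix(rows)

    // Only V is needed for the projection, so U is not computed.
    val svd = mat.computeSVD(k, computeU = false)

    // A * V has one row per sample and k columns: the reduced data.
    val reduced = mat.multiply(svd.V).rows.cache()

    val model = KMeans.train(reduced, 2, 20)
    model.predict(reduced).collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("svd-projection").setMaster("local")
    val sc = new SparkContext(conf)
    try {
      val k = 2 // assumed number of singular directions to keep
      reduceAndCluster(sc, k).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

When comparing the printed labels with the runs above, keep in mind that k-means cluster IDs are arbitrary: two runs agree if they group the same rows together, even when the 0 and 1 labels are swapped.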

0 0