MLlib-Basics (III)



4.3 CoordinateMatrix

    A CoordinateMatrix is another distributed matrix backed by an RDD of its entries. As the name suggests, each entry is a tuple (i: Long, j: Long, value: Double), where i is the row index, j is the column index, and value is the entry value. When the matrix is very large and very sparse, a CoordinateMatrix is usually the best choice.

     A CoordinateMatrix is created from an RDD[MatrixEntry] instance, where MatrixEntry is a wrapper over (Long, Long, Double). A CoordinateMatrix can be converted to an IndexedRowMatrix.


import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

val entries: RDD[MatrixEntry] = ... // an RDD of matrix entries
// Create a CoordinateMatrix from an RDD[MatrixEntry].
val mat: CoordinateMatrix = new CoordinateMatrix(entries)

// Get its size.
val m = mat.numRows()
val n = mat.numCols()

// Convert it to an IndexedRowMatrix whose rows are sparse vectors.
val indexedRowMatrix = mat.toIndexedRowMatrix()
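
The snippet above assumes an existing RDD[MatrixEntry]. As a minimal self-contained sketch (assuming a SparkContext named sc, as in spark-shell, and a few made-up entries), the entries can also be built from an in-memory collection with sc.parallelize; the same matrix can additionally be converted to a RowMatrix, which simply drops the row indices:

import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

// a few hypothetical entries for local testing
val entries = sc.parallelize(Seq(
  MatrixEntry(0L, 0L, 1.0),
  MatrixEntry(1L, 2L, 2.0),
  MatrixEntry(4L, 5L, 3.0)))

val mat = new CoordinateMatrix(entries)
println(s"size: ${mat.numRows()} x ${mat.numCols()}") // 5 x 6, inferred from the largest indices

// besides toIndexedRowMatrix(), a CoordinateMatrix can drop its row
// indices and become a RowMatrix of sparse vectors
val rowMatrix = mat.toRowMatrix()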


We continue to use the same dataset as before for this experiment:

1    1    2
2    3    4
5    6    7

* First, load the data from HDFS. Each line of the text file is read as a String, so the result is an RDD[String].

scala> val textfile=sc.textFile("hdfs://node001:9000/spark/input/data.txt")
14/07/11 05:01:46 INFO MemoryStore: ensureFreeSpace(167504) called with curMem=0, maxMem=309225062
14/07/11 05:01:46 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 163.6 KB, free 294.7 MB)
textfile: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12

* Split each line on tabs into an array of strings, producing an RDD[Array[String]].

scala> val middle=textfile.map((arg)=>arg.split("\\t"))
middle: org.apache.spark.rdd.RDD[Array[String]] = MappedRDD[2] at map at <console>:14

* Extract a (Long, Long, Double) tuple from each array.

scala> val mid=middle.map((arg)=>(arg(0).toLong,arg(1).toLong,arg(2).toDouble))
mid: org.apache.spark.rdd.RDD[(Long, Long, Double)] = MappedRDD[3] at map at <console>:16

* Build a MatrixEntry from each tuple.

scala> val entries=mid.map((arg)=>MatrixEntry(arg._1,arg._2,arg._3))
entries: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.distributed.MatrixEntry] = MappedRDD[4] at map at <console>:19
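
The split, parse, and MatrixEntry steps can also be fused into a single map over the original RDD[String]; a sketch of that equivalent one-step version:

// equivalent to the three map steps above, written as one transformation
val entries = textfile.map { line =>
  val fields = line.split("\\t")
  MatrixEntry(fields(0).toLong, fields(1).toLong, fields(2).toDouble)
}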

* Build the CoordinateMatrix.

scala> val mat: CoordinateMatrix = new CoordinateMatrix(entries)
mat: org.apache.spark.mllib.linalg.distributed.CoordinateMatrix = org.apache.spark.mllib.linalg.distributed.CoordinateMatrix@be5b71c

* Compute its number of rows.

scala> val m = mat.numRows()
14/07/11 05:15:32 INFO FileInputFormat: Total input paths to process : 1
14/07/11 05:15:32 INFO SparkContext: Starting job: reduce at CoordinateMatrix.scala:99
14/07/11 05:15:32 INFO DAGScheduler: Got job 0 (reduce at CoordinateMatrix.scala:99) with 2 output partitions (allowLocal=false)
14/07/11 05:15:32 INFO DAGScheduler: Final stage: Stage 0(reduce at CoordinateMatrix.scala:99)
14/07/11 05:15:32 INFO DAGScheduler: Parents of final stage: List()
14/07/11 05:15:32 INFO DAGScheduler: Missing parents: List()
14/07/11 05:15:32 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[5] at map at CoordinateMatrix.scala:99), which has no missing parents
14/07/11 05:15:33 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (MappedRDD[5] at map at CoordinateMatrix.scala:99)
14/07/11 05:15:33 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
14/07/11 05:15:33 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on executor localhost: localhost (PROCESS_LOCAL)
14/07/11 05:15:33 INFO TaskSetManager: Serialized task 0.0:0 as 1936 bytes in 1 ms
14/07/11 05:15:33 INFO TaskSetManager: Starting task 0.0:1 as TID 1 on executor localhost: localhost (PROCESS_LOCAL)
14/07/11 05:15:33 INFO TaskSetManager: Serialized task 0.0:1 as 1936 bytes in 0 ms
14/07/11 05:15:33 INFO Executor: Running task ID 1
14/07/11 05:15:33 INFO Executor: Running task ID 0
14/07/11 05:15:33 INFO BlockManager: Found block broadcast_0 locally
14/07/11 05:15:33 INFO BlockManager: Found block broadcast_0 locally
14/07/11 05:15:33 INFO HadoopRDD: Input split: hdfs://node001:9000/spark/input/data.txt:9+9
14/07/11 05:15:33 INFO HadoopRDD: Input split: hdfs://node001:9000/spark/input/data.txt:0+9
14/07/11 05:15:33 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
14/07/11 05:15:33 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
14/07/11 05:15:33 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
14/07/11 05:15:33 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
14/07/11 05:15:33 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
14/07/11 05:15:33 INFO Executor: Serialized size of result for 1 is 724
14/07/11 05:15:33 INFO Executor: Serialized size of result for 0 is 724
14/07/11 05:15:33 INFO Executor: Sending result for 0 directly to driver
14/07/11 05:15:33 INFO Executor: Sending result for 1 directly to driver
14/07/11 05:15:33 INFO Executor: Finished task ID 1
14/07/11 05:15:33 INFO Executor: Finished task ID 0
14/07/11 05:15:33 INFO TaskSetManager: Finished TID 1 in 82 ms on localhost (progress: 1/2)
14/07/11 05:15:33 INFO DAGScheduler: Completed ResultTask(0, 1)
14/07/11 05:15:33 INFO DAGScheduler: Completed ResultTask(0, 0)
14/07/11 05:15:33 INFO TaskSetManager: Finished TID 0 in 91 ms on localhost (progress: 2/2)
14/07/11 05:15:33 INFO DAGScheduler: Stage 0 (reduce at CoordinateMatrix.scala:99) finished in 0.097 s
14/07/11 05:15:33 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
14/07/11 05:15:33 INFO SparkContext: Job finished: reduce at CoordinateMatrix.scala:99, took 0.170012294 s
m: Long = 6

* And its number of columns.
scala> val n = mat.numCols()
n: Long = 7
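
Note that numRows() and numCols() are inferred as the largest row and column index plus one, which is why the results above are 6 and 7 rather than 3: the data is 1-indexed and its largest indices are 5 and 6. If the true dimensions are known up front, they can also be passed to the constructor explicitly, as in this small sketch:

// dimensions given explicitly instead of being inferred from the entries
val sizedMat = new CoordinateMatrix(entries, 6L, 7L)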

* Convert it to an IndexedRowMatrix.

scala> val indexedRowMatrix = mat.toIndexedRowMatrix()
indexedRowMatrix: org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix = org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix@65e3fe67
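
To confirm that the conversion produced sparse row vectors, the rows of the IndexedRowMatrix can be collected and printed (sensible only for small data such as this example); a sketch:

// each element is an IndexedRow(index: Long, vector: Vector)
indexedRowMatrix.rows.collect().foreach { row =>
  println(s"row ${row.index}: ${row.vector}")
}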

