Spark MLlib: Introductory Examples


1. MLlib basic data types
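MLlib's basic data types split into local types that live on a single machine (dense and sparse vectors, labeled points, and local matrices) and distributed matrices backed by RDDs (such as RowMatrix and IndexedRowMatrix). The examples below walk through each in turn.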

2. Local vector examples

import org.apache.spark.mllib.linalg.{Vector, Vectors}

val vd: Vector = Vectors.dense(2.0, 0.0, 6.0)  // dense vector
println(vd(2))                                 // print the third element
val vs: Vector = Vectors.sparse(4, Array(0, 1, 2, 3), Array(9.0, 5.0, 2.0, 7.0))  // sparse vector: size, indices, values
println(vs(2))                                 // print the third element

Output:

6.0
2.0
3. Using labeled points

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Create a labeled point with a positive label and a dense feature vector.
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
// Create a labeled point with a negative label and a sparse feature vector.
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
println(pos.features)  // print the labeled point's feature vector
println(pos.label)     // print its label
println(neg.features)
println(neg.label)

Output:

[1.0,0.0,3.0]
1.0
(3,[0,2],[1.0,3.0])
0.0

A sparse vector prints as (size,[indices],[values]): here a length-3 vector holding 1.0 at position 0 and 3.0 at position 2.
4. MLUtils.loadLibSVMFile usage

Data format
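loadLibSVMFile expects LIBSVM format: one sample per line, written as label index1:value1 index2:value2 ..., with one-based feature indices that are converted to zero-based on load. The exact file is not shown; a hypothetical test.txt with three features could look like:

1 1:1.0 3:3.0
0 1:2.0 2:4.0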

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

val sparkConf = new SparkConf().setAppName("MLUtilsTest").setMaster("local[2]")
val sc = new SparkContext(sparkConf)
val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "E:\\softfile\\spark-2.1.0-bin-hadoop2.7\\data\\mllib\\test.txt")
examples.foreach(print(_))
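With the hypothetical file above, each LabeledPoint prints as (label,features) with a sparse feature vector, so print(_) would emit something like (record order is not guaranteed across partitions):

(1.0,(3,[0,2],[1.0,3.0]))(0.0,(3,[0,1],[2.0,4.0]))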
5. Local matrix usage

import org.apache.spark.mllib.linalg.{Matrices, Matrix}

// Local matrix: lay the six values out as 2 rows and 3 columns
val dense: Matrix = Matrices.dense(2, 3, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
println(dense)

Output:

1.0  3.0  5.0
2.0  4.0  6.0

Matrices.dense stores its values in column-major order, so the array fills the matrix column by column; that is why the first row reads 1.0, 3.0, 5.0.
6. Distributed matrices

1. Row matrix

Data format
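The exact contents of test2.txt are not shown; judging from the output below (2 rows, 3 columns), the file presumably holds one whitespace-separated row per line, e.g.:

1 2 3
4 5 6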


import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Row matrix
val rdd1 = sc.textFile("E:\\softfile\\spark-2.1.0-bin-hadoop2.7\\data\\mllib\\test2.txt")
  .map(_.split(" ").map(_.toDouble))   // parse each line into an Array[Double]
  .map(line => Vectors.dense(line))    // wrap each row in a dense vector
val rm = new RowMatrix(rdd1)           // build the row matrix
println(rm.numRows())                  // print the number of rows
println(rm.numCols())                  // print the number of columns

Output:

2
3
2. Indexed row matrix (same data format as above)
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

// Row matrix with row indices
val rdd1 = sc.textFile("E:\\softfile\\spark-2.1.0-bin-hadoop2.7\\data\\mllib\\test3.txt")
  .map(_.split(" ").map(_.toDouble))
  .map(line => Vectors.dense(line))
  .map(vd => new IndexedRow(vd.size, vd))  // uses vd.size as the index, so every row gets index 3
val rm = new IndexedRowMatrix(rdd1)        // build the indexed row matrix
println(rm.getClass)                       // print the type
rm.rows.foreach(println)                   // print each indexed row

Output:

class org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix
IndexedRow(3,[4.0,5.0,6.0])
IndexedRow(3,[1.0,2.0,3.0])
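Because vd.size is used as the index above, both rows print with index 3. If each row should instead carry its position in the file, a common fix (not in the original post) is zipWithIndex; a minimal sketch, reusing the same sc and file path:

val indexed = sc.textFile("E:\\softfile\\spark-2.1.0-bin-hadoop2.7\\data\\mllib\\test3.txt")
  .map(_.split(" ").map(_.toDouble))
  .map(line => Vectors.dense(line))
  .zipWithIndex()                                 // pair each row vector with its ordinal position
  .map { case (vd, idx) => IndexedRow(idx, vd) }  // use that position as the row index
val irm = new IndexedRowMatrix(indexed)
irm.rows.foreach(println)  // IndexedRow(0,[1.0,2.0,3.0]), IndexedRow(1,[4.0,5.0,6.0])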