Lesson 46: Spark 2.0 in Practice with Dataset: sort, join, joinWith, randomSplit, sample, select, groupBy, agg, col, and more

Source: Internet  Editor: 程序博客网  Date: 2024/05/29 15:20


people.json

{"name":"Michael", "age":16}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
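peopleScores.json

The program also reads peopleScores.json, which the original does not list. Judging from the Score(n, score) case class and the rows printed by joinWith, it presumably looks like the following. This is a reconstruction, not the author's actual file.

```json
{"n":"Michael", "score":88}
{"n":"Andy", "score":100}
{"n":"Justin", "score":89}
```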

 

package com.dt.spark200

import org.apache.spark.sql.SparkSession

object DataSetsops {

  // Row types matching the fields of people.json and peopleScores.json.
  case class Person(name: String, age: Long)
  case class Score(n: String, score: Long)

  def main(args: Array[String]): Unit = {

    val spark = SparkSession
      .builder()
      .appName("DatasetOps")
      .master("local")
      .config("spark.sql.warehouse.dir", "file:///G:/IMFBigDataSpark2016/IMFScalaWorkspace_spark200/Spark200/spark-warehouse")
      .getOrCreate()

    // Brings the case-class encoders and the $"column" syntax into scope.
    import spark.implicits._
    import org.apache.spark.sql.functions._

    val personDF = spark.read.json("G:\\IMFBigDataSpark2016\\spark-2.0.0-bin-hadoop2.6\\examples\\src\\main\\resources\\people.json")
    val personScoresDF = spark.read.json("G:\\IMFBigDataSpark2016\\spark-2.0.0-bin-hadoop2.6\\examples\\src\\main\\resources\\peopleScores.json")

    // Convert the untyped DataFrames into typed Datasets.
    val personDS = personDF.as[Person]
    val personScoresDS = personScoresDF.as[Score]

    // Typed join: each result row is a (Person, Score) pair.
    personDS.joinWith(personScoresDS, $"name" === $"n").show
    // Sort by age in ascending order.
    personDS.sort("age").show()

    spark.stop()
  }
}

 

 

Output

 

16/09/17 08:01:20 INFO CodeGenerator: Code generated in 20.029504 ms
+------------+------------+
|          _1|          _2|
+------------+------------+
|[16,Michael]|[Michael,88]|
|   [30,Andy]|  [Andy,100]|
| [19,Justin]| [Justin,89]|
+------------+------------+

 

16/09/17 08:01:20 INFO DAGScheduler: Job 4 finished: show at DataSetsops.scala:28, took 0.043914 s
16/09/17 08:01:20 INFO CodeGenerator: Code generated in 8.421075 ms
+---+-------+
|age|   name|
+---+-------+
| 16|Michael|
| 19| Justin|
| 30|   Andy|
+---+-------+
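The title also mentions join, which the listing does not call. As a rough sketch of how it differs from joinWith: join returns an untyped DataFrame with the columns of both sides flattened into one row, while joinWith keeps each side as an object, which is why the output above has the two columns _1 and _2. The object name JoinVsJoinWith, the local[*] master, and the in-memory rows (copied from the output above instead of read from the JSON files) are assumptions made for illustration. The spark-sql library must be on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical demo object; not part of the original lesson code.
object JoinVsJoinWith {
  case class Person(name: String, age: Long)
  case class Score(n: String, score: Long)

  // Untyped join: the result is a DataFrame with the columns of both sides.
  def flatJoin(people: Dataset[Person], scores: Dataset[Score]): DataFrame =
    people.join(scores, people("name") === scores("n"))

  // Typed join: every row is a (Person, Score) pair.
  def typedJoin(people: Dataset[Person], scores: Dataset[Score]): Dataset[(Person, Score)] =
    people.joinWith(scores, people("name") === scores("n"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("JoinVsJoinWith")
      .master("local[*]") // assumption: run locally
      .getOrCreate()
    import spark.implicits._

    // In-memory stand-ins for people.json and peopleScores.json.
    val people = Seq(Person("Michael", 16), Person("Andy", 30), Person("Justin", 19)).toDS()
    val scores = Seq(Score("Michael", 88), Score("Andy", 100), Score("Justin", 89)).toDS()

    flatJoin(people, scores).printSchema()  // flat columns from both sides
    typedJoin(people, scores).printSchema() // two struct columns, _1 and _2

    spark.stop()
  }
}
```

With the typed result, fields can be reached through the case classes, for example typedJoin(people, scores).map { case (p, s) => (p.name, s.score) } once spark.implicits._ is imported.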

 

 


  // Sort by age in descending order.
  personDS.sort($"age".desc).show()

 

+---+-------+
|age|   name|
+---+-------+
| 30|   Andy|
| 19| Justin|
| 16|Michael|
+---+-------+
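The title also lists select, groupBy, agg, and col, none of which appear in the code of this lesson. The following is a minimal sketch of how they are commonly combined, not the author's code. The object name SelectGroupAgg, the derived column ageNextYear, and the isAdult grouping with a threshold of 18 are all hypothetical choices for illustration. The rows mirror people.json, and spark-sql must be on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.{avg, col, count, max}

// Hypothetical demo object; not part of the original lesson code.
object SelectGroupAgg {
  case class Person(name: String, age: Long)

  // select + col: keep the name and derive a new column from age.
  def nextYear(people: Dataset[Person]): DataFrame =
    people.select(col("name"), (col("age") + 1).as("ageNextYear"))

  // groupBy + agg: group on a boolean expression and compute several aggregates per group.
  def statsByAdult(people: Dataset[Person]): DataFrame =
    people
      .groupBy((col("age") >= 18).as("isAdult")) // threshold 18 is an illustrative assumption
      .agg(
        count(col("name")).as("people"),
        avg(col("age")).as("avgAge"),
        max(col("age")).as("maxAge"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SelectGroupAgg")
      .master("local[*]") // assumption: run locally
      .getOrCreate()
    import spark.implicits._

    // In-memory stand-in for people.json.
    val people = Seq(Person("Michael", 16), Person("Andy", 30), Person("Justin", 19)).toDS()

    nextYear(people).show()
    statsByAdult(people).show()

    spark.stop()
  }
}
```

Both methods return a DataFrame because select with Column arguments and groupBy(...).agg(...) are untyped operations.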

 

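  // Split the rows randomly using relative weights 10:20 (normalized to 1/3 and 2/3), then show each part.
  personDS.randomSplit(Array(10,20)).foreach(dataset=>dataset.show())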
  

16/09/17 08:10:33 INFO CodeGenerator: Code generated in 11.151499 ms
+---+------+
|age|  name|
+---+------+
| 19|Justin|
| 30|  Andy|
+---+------+

 

16/09/17 08:10:33 INFO DAGScheduler: Job 3 finished: show at DataSetsops.scala:26, took 0.050779 s
+---+-------+
|age|   name|
+---+-------+
| 16|Michael|
+---+-------+
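The weights passed to randomSplit are relative: Spark normalizes them when they do not sum to 1, so Array(10,20) asks for roughly 1/3 and 2/3 of the rows. The split is random per row, which is why, with only three rows, the first part above ended up with two rows even though it has the smaller weight. The sketch below only illustrates the normalization arithmetic in plain Scala. The object and method names are hypothetical, and this is not Spark's actual implementation.

```scala
// Hypothetical helper that mirrors how relative weights translate into fractions.
object RandomSplitWeights {

  // Divide each weight by the total to get the expected fraction per split.
  def normalize(weights: Array[Double]): Array[Double] = {
    val total = weights.sum
    weights.map(_ / total)
  }

  // Cumulative boundaries in [0, 1]; split i covers the range [bounds(i), bounds(i + 1)).
  def boundaries(weights: Array[Double]): Array[Double] =
    normalize(weights).scanLeft(0.0)(_ + _)

  def main(args: Array[String]): Unit = {
    val weights = Array(10.0, 20.0)
    println(normalize(weights).mkString(", "))  // expected fractions per split
    println(boundaries(weights).mkString(", ")) // range limits for each split
  }
}
```

To get the same split on every run in Spark, pass a seed, for example personDS.randomSplit(Array(10.0, 20.0), 42L).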

 

 

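  // Sample without replacement; each row is kept with probability 0.5, so the row count varies between runs.
  personDS.sample(false,0.5).show()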

 

16/09/17 08:15:36 INFO CodeGenerator: Code generated in 9.319317 ms
+---+-------+
|age|   name|
+---+-------+
| 16|Michael|
| 30|   Andy|
+---+-------+
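sample(false, 0.5) does not return exactly half of the rows. Without replacement, each row is kept independently with probability equal to the fraction, so here two of the three rows happened to survive. The plain Scala sketch below imitates that behavior with scala.util.Random. It is a simplified model under that assumption, with hypothetical names, and Spark's own sampler is implemented differently.

```scala
import scala.util.Random

// Hypothetical, simplified model of sampling without replacement.
object BernoulliSampleSketch {

  // Keep each element independently with probability `fraction`.
  // The same seed always yields the same selection.
  def sample[A](rows: Seq[A], fraction: Double, seed: Long): Seq[A] = {
    val rng = new Random(seed)
    rows.filter(_ => rng.nextDouble() < fraction)
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq("Michael", "Andy", "Justin")
    // Different seeds may keep a different number of rows.
    (1L to 3L).foreach { seed =>
      println(s"seed=$seed -> ${sample(rows, fraction = 0.5, seed = seed)}")
    }
  }
}
```

In Spark, a reproducible sample is obtained the same way, by passing a seed as the third argument: personDS.sample(false, 0.5, 42L).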

 

 

 

 

 
