Lesson 46: Spark 2.0 in Action with Dataset: sort, join, joinWith, randomSplit, sample, select, groupBy, agg, col, etc.
people.json
{"name":"Michael", "age":16}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
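The program below also reads a peopleScores.json file whose contents are not shown in the post. Judging from the joinWith output further down (scores 88, 100, and 89 for Michael, Andy, and Justin), it presumably looks like this:

```json
{"n":"Michael", "score":88}
{"n":"Andy", "score":100}
{"n":"Justin", "score":89}
```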
package com.dt.spark200

import org.apache.spark.sql.SparkSession

object DataSetsops {
  // Case classes describing the schema of each JSON file.
  case class Person(name: String, age: Long)
  case class Score(n: String, score: Long)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("DatasetOps")
      .master("local")
      .config("spark.sql.warehouse.dir", "file:///G:/IMFBigDataSpark2016/IMFScalaWorkspace_spark200/Spark200/spark-warehouse")
      .getOrCreate()

    import spark.implicits._
    import org.apache.spark.sql.functions._

    // Read the JSON files as untyped DataFrames, then convert them to typed Datasets.
    val personDF = spark.read.json("G:\\IMFBigDataSpark2016\\spark-2.0.0-bin-hadoop2.6\\examples\\src\\main\\resources\\people.json")
    val personScoresDF = spark.read.json("G:\\IMFBigDataSpark2016\\spark-2.0.0-bin-hadoop2.6\\examples\\src\\main\\resources\\peopleScores.json")
    val personDS = personDF.as[Person]
    val personScoresDS = personScoresDF.as[Score]

    // joinWith keeps both sides typed and returns a Dataset of (Person, Score) tuples.
    personDS.joinWith(personScoresDS, $"name" === $"n").show
    // sort by the "age" column in ascending order.
    personDS.sort("age").show()

    spark.stop()
  }
}
Run output:
+------------+------------+
| _1| _2|
+------------+------------+
|[16,Michael]|[Michael,88]|
| [30,Andy]| [Andy,100]|
| [19,Justin]| [Justin,89]|
+------------+------------+
+---+-------+
|age| name|
+---+-------+
| 16|Michael|
| 19| Justin|
| 30| Andy|
+---+-------+
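The lesson title also lists join, select, and col, which the program above does not exercise. Unlike joinWith, a plain join flattens both sides into a single untyped DataFrame, whose columns select and col can then project. A minimal sketch against the same two Datasets (reusing the org.apache.spark.sql.functions._ import from the program; not part of the original run):

```scala
// Plain join: yields a DataFrame with the columns of both sides,
// which select/col can then project. Assumes personDS and personScoresDS
// as defined in the main program above.
personDS.join(personScoresDS, $"name" === $"n")
  .select(col("name"), col("age"), col("score"))
  .show()
```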
personDS.sort($"age".desc).show()
+---+-------+
|age| name|
+---+-------+
| 30| Andy|
| 19| Justin|
| 16|Michael|
+---+-------+
// randomSplit takes an Array[Double] of weights; they are normalized if they don't sum to 1,
// so Array(10.0, 20.0) splits the Dataset roughly 1/3 vs. 2/3 at random.
personDS.randomSplit(Array(10.0, 20.0)).foreach(dataset => dataset.show())
+---+------+
|age| name|
+---+------+
| 19|Justin|
| 30| Andy|
+---+------+
+---+-------+
|age| name|
+---+-------+
| 16|Michael|
+---+-------+
personDS.sample(false,0.5).show()
+---+-------+
|age| name|
+---+-------+
| 16|Michael|
| 30| Andy|
+---+-------+
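Finally, the groupBy and agg operations from the title can be sketched over the joined data. This reuses the avg and max functions from the org.apache.spark.sql.functions._ import already present in the program; it is a sketch rather than output from the original run:

```scala
// Group the joined rows by name and compute aggregates with agg.
personDS.join(personScoresDS, $"name" === $"n")
  .groupBy($"name")
  .agg(avg($"score").as("avgScore"), max($"age").as("maxAge"))
  .show()
```

Since every name is unique in this small sample, each group holds a single row; with repeated names, avg and max would aggregate across all matching rows.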