A Study of Custom Strategies in Spark Catalyst
The example below is from Herman van Hovell's talk at Spark Summit EU 2016:
```scala
import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.expressions.{Alias, EqualTo}
import org.apache.spark.sql.catalyst.plans.Inner
import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan, Range}
import org.apache.spark.sql.execution.{ProjectExec, RangeExec, SparkPlan}

// Assumes an active SparkSession named `spark` (e.g. in spark-shell).
// Join two 100-million-row ranges on id, then count the joined rows.
val tableA = spark.range(100000000).as('a)
val tableB = spark.range(100000000).as('b)
val result = tableA.join(tableB, tableA("id") === tableB("id"))
  .groupBy()
  .count()

result.count()
result.show()
result.explain(true)

// Custom physical-planning strategy: an inner equi-join of two step-1
// Ranges on their output columns can be planned as a single Range over
// the intersection of the two intervals -- no join needs to run at all.
case object IntervalJoin extends Strategy with Serializable {
  def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case Join(
        Range(start1, end1, 1, part1, Seq(o1)),
        Range(start2, end2, 1, part2, Seq(o2)),
        Inner,
        Some(EqualTo(e1, e2)))
        // The join keys must be exactly the two Range output columns.
        if ((o1 semanticEquals e1) && (o2 semanticEquals e2)) ||
           ((o1 semanticEquals e2) && (o2 semanticEquals e1)) =>
      if ((start1 <= end2) && (start2 <= end1)) { // the intervals overlap
        val start = math.max(start1, start2)
        val end = math.min(end1, end2)
        val part = math.max(part1.getOrElse(200), part2.getOrElse(200))
        val range = RangeExec(Range(start, end, 1, Some(part), o1 :: Nil))
        // Re-alias the single output column so downstream operators
        // still resolve against o1's exprId.
        val projected = ProjectExec(
          Alias(o1, o1.name)(exprId = o1.exprId) :: Nil, range)
        projected :: Nil
      } else {
        Nil // no overlap: let the built-in strategies plan it
      }
    case _ => Nil // not our pattern: fall through to the built-in strategies
  }
}

// Register the strategy, then rebuild and re-run the same query so it is
// planned again, this time with IntervalJoin in effect.
spark.experimental.extraStrategies = IntervalJoin :: Nil
val result2 = tableA.join(tableB, tableA("id") === tableB("id"))
  .groupBy()
  .count()
result2.count()
result2.show()
result2.explain(true)
```
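When Catalyst turns the optimized logical plan into a physical plan, the SparkPlanner consults spark.experimental.extraStrategies before its built-in strategies. A Strategy receives the logical plan and returns candidate SparkPlans; returning Nil means "no match here", so IntervalJoin fires only on the exact Join-of-two-Ranges shape above and leaves everything else to the defaults.

Since Spark 2.2 a strategy can also be injected when the session is built, via SparkSessionExtensions, rather than by mutating the experimental hook at runtime. A minimal sketch, assuming the IntervalJoin object above is on the classpath (the master and app name are placeholders):

```scala
import org.apache.spark.sql.SparkSession

// Build a session with IntervalJoin pre-registered (Spark 2.2+).
val session = SparkSession.builder()
  .master("local[*]")                 // placeholder
  .appName("interval-join-demo")      // placeholder
  .withExtensions(_.injectPlannerStrategy(_ => IntervalJoin))
  .getOrCreate()
```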
Output of the first run, before the custom strategy is in effect:

```text
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
+---------+
|    count|
+---------+
|100000000|
+---------+
Before optimization, runtime: 106798 ms
```
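The post does not show the code that produced these runtime lines; a minimal sketch of a timing wrapper that would print output in this shape (the timed helper and its label are assumptions, not part of the talk):

```scala
// Hypothetical helper: run a block and print its wall-clock time.
def timed[T](label: String)(body: => T): T = {
  val start = System.currentTimeMillis()
  val out = body
  println(s"$label runtime: ${System.currentTimeMillis() - start} ms")
  out
}

// For example, timing the action that triggers the join:
timed("Before optimization,")(result.count())
```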
result.explain(true) prints the plan at each phase; note the SortMergeJoin and the exchanges in the physical plan:

```text
== Parsed Logical Plan ==
Aggregate [count(1) AS count#22L]
+- Join Inner, (id#0L = id#4L)
   :- SubqueryAlias a
   :  +- Range (0, 100000000, step=1, splits=Some(8))
   +- SubqueryAlias b
      +- Range (0, 100000000, step=1, splits=Some(8))

== Analyzed Logical Plan ==
count: bigint
Aggregate [count(1) AS count#22L]
+- Join Inner, (id#0L = id#4L)
   :- SubqueryAlias a
   :  +- Range (0, 100000000, step=1, splits=Some(8))
   +- SubqueryAlias b
      +- Range (0, 100000000, step=1, splits=Some(8))

== Optimized Logical Plan ==
Aggregate [count(1) AS count#22L]
+- Project
   +- Join Inner, (id#0L = id#4L)
      :- Range (0, 100000000, step=1, splits=Some(8))
      +- Range (0, 100000000, step=1, splits=Some(8))

== Physical Plan ==
*HashAggregate(keys=[], functions=[count(1)], output=[count#22L])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#37L])
      +- *Project
         +- *SortMergeJoin [id#0L], [id#4L], Inner
            :- *Sort [id#0L ASC NULLS FIRST], false, 0
            :  +- Exchange hashpartitioning(id#0L, 200)
            :     +- *Range (0, 100000000, step=1, splits=Some(8))
            +- *Sort [id#4L ASC NULLS FIRST], false, 0
               +- ReusedExchange [id#4L], Exchange hashpartitioning(id#0L, 200)
```
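This default plan explains the 106798 ms: each side of the join is a 100-million-row Range that must be shuffled into 200 partitions (Exchange hashpartitioning) and sorted before the SortMergeJoin can merge them. The ReusedExchange only avoids recomputing the second shuffle, since both sides produce identical data; the sorts and the merge still run in full.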
Re-running the query with IntervalJoin registered returns the same result:

```text
+---------+
|    count|
+---------+
|100000000|
+---------+
```
result2.explain(true) shows that the logical plans are untouched while the physical plan has changed:

```text
== Parsed Logical Plan ==
Aggregate [count(1) AS count#22L]
+- Join Inner, (id#0L = id#4L)
   :- SubqueryAlias a
   :  +- Range (0, 100000000, step=1, splits=Some(8))
   +- SubqueryAlias b
      +- Range (0, 100000000, step=1, splits=Some(8))

== Analyzed Logical Plan ==
count: bigint
Aggregate [count(1) AS count#22L]
+- Join Inner, (id#0L = id#4L)
   :- SubqueryAlias a
   :  +- Range (0, 100000000, step=1, splits=Some(8))
   +- SubqueryAlias b
      +- Range (0, 100000000, step=1, splits=Some(8))

== Optimized Logical Plan ==
Aggregate [count(1) AS count#22L]
+- Project
   +- Join Inner, (id#0L = id#4L)
      :- Range (0, 100000000, step=1, splits=Some(8))
      +- Range (0, 100000000, step=1, splits=Some(8))

== Physical Plan ==
*HashAggregate(keys=[], functions=[count(1)], output=[count#22L])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#37L])
      +- *Project
         +- *Project [id#0L AS id#0L]
            +- *Range (0, 100000000, step=1, splits=Some(200))
```
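The parsed, analyzed, and optimized logical plans are identical to the first run, which is expected: a Strategy participates only in physical planning. The physical plan, however, no longer contains a join, exchange, or sort. IntervalJoin matched the inner equi-join of the two Ranges and planned it as one Range over their intersection, with the extra Project re-aliasing the output column; counting 100 million generated rows is far cheaper than shuffling and sorting them.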
```text
After optimization, runtime: 3587 ms

Process finished with exit code 0
```

The runtime drops from 106798 ms to 3587 ms, roughly a 30x speedup, all from a pattern match of a few dozen lines.