A study of custom strategies in Spark Catalyst



The example comes from Herman van Hovell's 2016 Spark Summit EU talk. Spark SQL exposes an experimental hook, spark.experimental.extraStrategies, through which user code can add its own Strategy objects to the physical planner. The IntervalJoin strategy below matches an inner equi-join between two Range relations and plans it as a single Range over the overlapping interval plus a Project, so the join (and its shuffle) never has to run:



    import org.apache.spark.sql.{SparkSession, Strategy}
    import org.apache.spark.sql.catalyst.expressions.{Alias, EqualTo}
    import org.apache.spark.sql.catalyst.plans.Inner
    import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan, Range}
    import org.apache.spark.sql.execution.{ProjectExec, RangeExec, SparkPlan}

    // Local session for running the example (the original code assumes an existing `spark`).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("IntervalJoinDemo")
      .getOrCreate()

    // Custom physical-planning strategy: if the plan is an inner equi-join of two
    // Range relations on their id columns, replace the whole join with a single
    // Range over the overlapping interval plus a Project that restores the column.
    case object IntervalJoin extends Strategy with Serializable {
      def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
        case Join(
            Range(start1, end1, 1, part1, Seq(o1)),
            Range(start2, end2, 1, part2, Seq(o2)),
            Inner,
            Some(EqualTo(e1, e2)))
            if ((o1 semanticEquals e1) && (o2 semanticEquals e2)) ||
               ((o1 semanticEquals e2) && (o2 semanticEquals e1)) =>
          // Only rewrite when the two ranges overlap.
          if ((start1 <= end2) && (end1 >= end2)) {
            val start = math.max(start1, start2)
            val end   = math.min(end1, end2)
            val part  = math.max(part1.getOrElse(200), part2.getOrElse(200))
            // Scan just the overlapping interval instead of joining the two ranges.
            val range = RangeExec(Range(start, end, 1, Some(part), o1 :: Nil))
            // Re-expose the range output under the original expression id so the
            // parent operators still resolve against it.
            val twoColumns = ProjectExec(
              Alias(o1, o1.name)(exprId = o1.exprId) :: Nil,
              range)
            twoColumns :: Nil
          } else {
            Nil
          }
        case _ => Nil
      }
    }

    val tableA = spark.range(100000000).as('a)
    val tableB = spark.range(100000000).as('b)

    val result = tableA.join(tableB, tableA("id") === tableB("id"))
      .groupBy()
      .count()

    // First run: default planner.
    result.count()
    result.show()
    result.explain(true)

    // Inject the custom strategy into the planner and run the same query again
    // (this produces the second set of output below).
    spark.experimental.extraStrategies = IntervalJoin :: Nil
    result.show()
    result.explain(true)
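The run times quoted in the output are not produced by the code above; a small helper like the following (the `time` name and its placement are my own, not from the talk) is one way such numbers could have been measured:

    // Hypothetical helper: measure the wall-clock time of an action in milliseconds.
    def time[T](label: String)(body: => T): T = {
      val t0 = System.currentTimeMillis()
      val out = body
      println(s"$label: ${System.currentTimeMillis() - t0} ms")
      out
    }

    // e.g. time("Run time before the optimization")(result.count())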




Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
+---------+
|    count|
+---------+
|100000000|
+---------+


 Run time before the optimization: 106798 ms
== Parsed Logical Plan ==
Aggregate [count(1) AS count#22L]
+- Join Inner, (id#0L = id#4L)
   :- SubqueryAlias a
   :  +- Range (0, 100000000, step=1, splits=Some(8))
   +- SubqueryAlias b
      +- Range (0, 100000000, step=1, splits=Some(8))


== Analyzed Logical Plan ==
count: bigint
Aggregate [count(1) AS count#22L]
+- Join Inner, (id#0L = id#4L)
   :- SubqueryAlias a
   :  +- Range (0, 100000000, step=1, splits=Some(8))
   +- SubqueryAlias b
      +- Range (0, 100000000, step=1, splits=Some(8))


== Optimized Logical Plan ==
Aggregate [count(1) AS count#22L]
+- Project
   +- Join Inner, (id#0L = id#4L)
      :- Range (0, 100000000, step=1, splits=Some(8))
      +- Range (0, 100000000, step=1, splits=Some(8))


== Physical Plan ==
*HashAggregate(keys=[], functions=[count(1)], output=[count#22L])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#37L])
      +- *Project
         +- *SortMergeJoin [id#0L], [id#4L], Inner
            :- *Sort [id#0L ASC NULLS FIRST], false, 0
            :  +- Exchange hashpartitioning(id#0L, 200)
            :     +- *Range (0, 100000000, step=1, splits=Some(8))
            +- *Sort [id#4L ASC NULLS FIRST], false, 0
               +- ReusedExchange [id#4L], Exchange hashpartitioning(id#0L, 200)
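
With the default planner, the equi-join of the two Range relations becomes a SortMergeJoin: the 100-million-row Range is hash-partitioned into 200 partitions and sorted on both sides (the second side reuses the first exchange), which is where most of the ~107 s goes. A quick programmatic way to see which join operator was chosen (a minimal sketch, not part of the original program) is to collect join nodes from the executed plan:

    import org.apache.spark.sql.execution.joins.SortMergeJoinExec

    // Count the sort-merge join operators in the physical plan that will actually run.
    val smjBefore = result.queryExecution.executedPlan.collect {
      case j: SortMergeJoinExec => j
    }
    println(s"sort-merge joins in the plan: ${smjBefore.size}")   // 1 with the default planner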
+---------+
|    count|
+---------+
|100000000|
+---------+


== Parsed Logical Plan ==
Aggregate [count(1) AS count#22L]
+- Join Inner, (id#0L = id#4L)
   :- SubqueryAlias a
   :  +- Range (0, 100000000, step=1, splits=Some(8))
   +- SubqueryAlias b
      +- Range (0, 100000000, step=1, splits=Some(8))


== Analyzed Logical Plan ==
count: bigint
Aggregate [count(1) AS count#22L]
+- Join Inner, (id#0L = id#4L)
   :- SubqueryAlias a
   :  +- Range (0, 100000000, step=1, splits=Some(8))
   +- SubqueryAlias b
      +- Range (0, 100000000, step=1, splits=Some(8))


== Optimized Logical Plan ==
Aggregate [count(1) AS count#22L]
+- Project
   +- Join Inner, (id#0L = id#4L)
      :- Range (0, 100000000, step=1, splits=Some(8))
      +- Range (0, 100000000, step=1, splits=Some(8))


== Physical Plan ==
*HashAggregate(keys=[], functions=[count(1)], output=[count#22L])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#37L])
      +- *Project
         +- *Project [id#0L AS id#0L]
            +- *Range (0, 100000000, step=1, splits=Some(200))
 Run time after the optimization: 3587 ms
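
With IntervalJoin registered, the join disappears from the physical plan altogether: the aggregate is fed by a single Range over the overlapping interval plus a Project, so there is no shuffle and no sort, which is what brings the run time from ~107 s down to ~3.6 s. As an illustrative check (the names below are my own, not from the article), one can re-plan the query, confirm that no sort-merge join remains, and reset the experimental hook afterwards:

    import org.apache.spark.sql.execution.joins.SortMergeJoinExec

    // Build the query again so it is planned with IntervalJoin registered.
    val rePlanned = tableA.join(tableB, tableA("id") === tableB("id")).groupBy().count()
    val smjAfter = rePlanned.queryExecution.executedPlan.collect { case j: SortMergeJoinExec => j }
    println(s"sort-merge joins in the plan: ${smjAfter.size}")   // expected: 0

    // extraStrategies is a plain var on ExperimentalMethods; resetting it
    // restores the default planner.
    spark.experimental.extraStrategies = Nil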


Process finished with exit code 0
