A study of custom strategies in Spark Catalyst



The example comes from Herman van Hovell's 2016 Spark Summit EU talk. Spark SQL exposes an experimental hook, spark.experimental.extraStrategies, through which user code can add its own Strategy objects to the physical planner. The IntervalJoin strategy below matches an inner equi-join between two Range relations and plans it as a single Range over the overlapping interval plus a Project, so the join (and its shuffle) never has to run:



    import org.apache.spark.sql.{SparkSession, Strategy}
    import org.apache.spark.sql.catalyst.expressions.{Alias, EqualTo}
    import org.apache.spark.sql.catalyst.plans.Inner
    import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan, Range}
    import org.apache.spark.sql.execution.{ProjectExec, RangeExec, SparkPlan}

    // Local session for running the example (the original code assumes an existing `spark`).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("IntervalJoinDemo")
      .getOrCreate()

    // Custom physical-planning strategy: if the plan is an inner equi-join of two
    // Range relations on their id columns, replace the whole join with a single
    // Range over the overlapping interval plus a Project that restores the column.
    case object IntervalJoin extends Strategy with Serializable {
      def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
        case Join(
            Range(start1, end1, 1, part1, Seq(o1)),
            Range(start2, end2, 1, part2, Seq(o2)),
            Inner,
            Some(EqualTo(e1, e2)))
            if ((o1 semanticEquals e1) && (o2 semanticEquals e2)) ||
               ((o1 semanticEquals e2) && (o2 semanticEquals e1)) =>
          // Only rewrite when the two ranges overlap.
          if ((start1 <= end2) && (end1 >= end2)) {
            val start = math.max(start1, start2)
            val end   = math.min(end1, end2)
            val part  = math.max(part1.getOrElse(200), part2.getOrElse(200))
            // Scan just the overlapping interval instead of joining the two ranges.
            val range = RangeExec(Range(start, end, 1, Some(part), o1 :: Nil))
            // Re-expose the range output under the original expression id so the
            // parent operators still resolve against it.
            val twoColumns = ProjectExec(
              Alias(o1, o1.name)(exprId = o1.exprId) :: Nil,
              range)
            twoColumns :: Nil
          } else {
            Nil
          }
        case _ => Nil
      }
    }

    val tableA = spark.range(100000000).as('a)
    val tableB = spark.range(100000000).as('b)

    val result = tableA.join(tableB, tableA("id") === tableB("id"))
      .groupBy()
      .count()

    // First run: default planner.
    result.count()
    result.show()
    result.explain(true)

    // Inject the custom strategy into the planner and run the same query again
    // (this produces the second set of output below).
    spark.experimental.extraStrategies = IntervalJoin :: Nil
    result.show()
    result.explain(true)
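The run times quoted in the output are not produced by the code above; a small helper like the following (the `time` name and its placement are my own, not from the talk) is one way such numbers could have been measured:

    // Hypothetical helper: measure the wall-clock time of an action in milliseconds.
    def time[T](label: String)(body: => T): T = {
      val t0 = System.currentTimeMillis()
      val out = body
      println(s"$label: ${System.currentTimeMillis() - t0} ms")
      out
    }

    // e.g. time("Run time before the optimization")(result.count())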




Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
+---------+
|    count|
+---------+
|100000000|
+---------+


 Run time before the optimization: 106798 ms
== Parsed Logical Plan ==
Aggregate [count(1) AS count#22L]
+- Join Inner, (id#0L = id#4L)
   :- SubqueryAlias a
   :  +- Range (0, 100000000, step=1, splits=Some(8))
   +- SubqueryAlias b
      +- Range (0, 100000000, step=1, splits=Some(8))


== Analyzed Logical Plan ==
count: bigint
Aggregate [count(1) AS count#22L]
+- Join Inner, (id#0L = id#4L)
   :- SubqueryAlias a
   :  +- Range (0, 100000000, step=1, splits=Some(8))
   +- SubqueryAlias b
      +- Range (0, 100000000, step=1, splits=Some(8))


== Optimized Logical Plan ==
Aggregate [count(1) AS count#22L]
+- Project
   +- Join Inner, (id#0L = id#4L)
      :- Range (0, 100000000, step=1, splits=Some(8))
      +- Range (0, 100000000, step=1, splits=Some(8))


== Physical Plan ==
*HashAggregate(keys=[], functions=[count(1)], output=[count#22L])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#37L])
      +- *Project
         +- *SortMergeJoin [id#0L], [id#4L], Inner
            :- *Sort [id#0L ASC NULLS FIRST], false, 0
            :  +- Exchange hashpartitioning(id#0L, 200)
            :     +- *Range (0, 100000000, step=1, splits=Some(8))
            +- *Sort [id#4L ASC NULLS FIRST], false, 0
               +- ReusedExchange [id#4L], Exchange hashpartitioning(id#0L, 200)
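
With the default planner, the equi-join of the two Range relations becomes a SortMergeJoin: the 100-million-row Range is hash-partitioned into 200 partitions and sorted on both sides (the second side reuses the first exchange), which is where most of the ~107 s goes. A quick programmatic way to see which join operator was chosen (a minimal sketch, not part of the original program) is to collect join nodes from the executed plan:

    import org.apache.spark.sql.execution.joins.SortMergeJoinExec

    // Count the sort-merge join operators in the physical plan that will actually run.
    val smjBefore = result.queryExecution.executedPlan.collect {
      case j: SortMergeJoinExec => j
    }
    println(s"sort-merge joins in the plan: ${smjBefore.size}")   // 1 with the default planner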
+---------+
|    count|
+---------+
|100000000|
+---------+


== Parsed Logical Plan ==
Aggregate [count(1) AS count#22L]
+- Join Inner, (id#0L = id#4L)
   :- SubqueryAlias a
   :  +- Range (0, 100000000, step=1, splits=Some(8))
   +- SubqueryAlias b
      +- Range (0, 100000000, step=1, splits=Some(8))


== Analyzed Logical Plan ==
count: bigint
Aggregate [count(1) AS count#22L]
+- Join Inner, (id#0L = id#4L)
   :- SubqueryAlias a
   :  +- Range (0, 100000000, step=1, splits=Some(8))
   +- SubqueryAlias b
      +- Range (0, 100000000, step=1, splits=Some(8))


== Optimized Logical Plan ==
Aggregate [count(1) AS count#22L]
+- Project
   +- Join Inner, (id#0L = id#4L)
      :- Range (0, 100000000, step=1, splits=Some(8))
      +- Range (0, 100000000, step=1, splits=Some(8))


== Physical Plan ==
*HashAggregate(keys=[], functions=[count(1)], output=[count#22L])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#37L])
      +- *Project
         +- *Project [id#0L AS id#0L]
            +- *Range (0, 100000000, step=1, splits=Some(200))
 Run time after the optimization: 3587 ms
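
With IntervalJoin registered, the join disappears from the physical plan altogether: the aggregate is fed by a single Range over the overlapping interval plus a Project, so there is no shuffle and no sort, which is what brings the run time from ~107 s down to ~3.6 s. As an illustrative check (the names below are my own, not from the article), one can re-plan the query, confirm that no sort-merge join remains, and reset the experimental hook afterwards:

    import org.apache.spark.sql.execution.joins.SortMergeJoinExec

    // Build the query again so it is planned with IntervalJoin registered.
    val rePlanned = tableA.join(tableB, tableA("id") === tableB("id")).groupBy().count()
    val smjAfter = rePlanned.queryExecution.executedPlan.collect { case j: SortMergeJoinExec => j }
    println(s"sort-merge joins in the plan: ${smjAfter.size}")   // expected: 0

    // extraStrategies is a plain var on ExperimentalMethods; resetting it
    // restores the default planner.
    spark.experimental.extraStrategies = Nil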


Process finished with exit code 0
