Spark Source Code: Expression Simplification in Logical Plan Optimization



  • 1. Constant Folding
  • 2. Simplify Filters
  • 3. Simplify Casts
  • 4. Simplify Case Conversion Expressions
  • 5. Optimize In
  • 6. Simplify Like
  • 7. Null Propagation
  • 8. Boolean Simplification
1. Constant Folding

Replaces expressions that can be statically evaluated.

Example SQL:


select 1+2+3 from t1


Optimization process:


scala> sqlContext.sql("select 1+2+3 from t1")
17/07/25 16:50:21 INFO parse.ParseDriver: Parsing command: select 1+2+3 from t1
17/07/25 16:50:21 INFO parse.ParseDriver: Parse Completed
res27: org.apache.spark.sql.DataFrame = [_c0: int]
 
scala> res27.queryExecution
res28: org.apache.spark.sql.execution.QueryExecution =
== Parsed Logical Plan ==
'Project [unresolvedalias(((1 + 2) + 3))]
+- 'UnresolvedRelation `t1`, None
 
== Analyzed Logical Plan ==
_c0: int
Project [((1 + 2) + 3) AS _c0#19]
+- Subquery t1
   +- Project [_1#0 AS name#5,_2#1 AS date#6,_3#2 AS cate#7,_4#3 AS amountSpent#8,_5#4 AS time#9]
      +- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1] at rddToDataFrameHolder at <console>:27
 
== Optimized Logical Plan ==
Project [6 AS _c0#19]
+- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1] at rddToDataFrameHolder at <console>:27
 
== Physical Plan ==
Project [6 AS _c0#19]
+- Scan ExistingRDD[_1#0,_2#1,_3#2,_4#3,_5#4]

After optimization, the Project in the logical plan has been reduced to the literal 6 (the result of 1+2+3), and the physical plan simply returns 6.

The implementing rule:

/**
  * Replaces expressions that can be statically evaluated.
  */
object ConstantFolding extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case q: LogicalPlan => q transformExpressionsDown { // transform the plan's expressions top-down
      // return literals directly to avoid re-evaluating them
      // (a Literal is itself foldable)
      case l: Literal => l
      // collapse any foldable expression into a literal via eval()
      case e if e.foldable => Literal.create(e.eval(EmptyRow), e.dataType)
    }
  }
}
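
The rule hinges on two Expression members: foldable and eval(). A minimal standalone sketch of the same idea over a toy expression tree (hypothetical Lit/Col/Plus types, no Spark required):

sealed trait Expr { def foldable: Boolean }
case class Lit(v: Int) extends Expr { val foldable = true }
case class Col(name: String) extends Expr { val foldable = false }
case class Plus(l: Expr, r: Expr) extends Expr {
  val foldable = l.foldable && r.foldable // foldable only if both sides are
}

def eval(e: Expr): Int = e match {
  case Lit(v)     => v
  case Plus(l, r) => eval(l) + eval(r)
  case Col(_)     => sys.error("cannot statically evaluate a column")
}

def constantFold(e: Expr): Expr = e match {
  case l: Lit          => l            // already a literal, leave it alone
  case e if e.foldable => Lit(eval(e)) // statically evaluate the whole subtree
  case Plus(l, r)      => Plus(constantFold(l), constantFold(r))
  case other           => other
}

// constantFold(Plus(Plus(Lit(1), Lit(2)), Lit(3)))   == Lit(6)
// constantFold(Plus(Col("a"), Plus(Lit(2), Lit(3)))) == Plus(Col("a"), Lit(5))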


2. Simplify Filters

 If a filter always evaluates to true, it is removed (e.g. where 2 > 1).
 If a filter always evaluates to false, the plan is replaced with an empty relation (e.g. where 2 < 1).

Example SQL:

select name from t1 where 2 > 1
Optimization process:
scala> sqlContext.sql("select name from t1 where 2 > 1")
17/07/25 15:50:25 INFO parse.ParseDriver: Parsing command: select name from t1 where 2 > 1
17/07/25 15:50:25 INFO parse.ParseDriver: Parse Completed
res23: org.apache.spark.sql.DataFrame = [name: string]
 
scala> res23.queryExecution
res24: org.apache.spark.sql.execution.QueryExecution =
== Parsed Logical Plan ==
'Project [unresolvedalias('name)]
+- 'Filter (2 > 1)
   +- 'UnresolvedRelation `t1`, None
 
== Analyzed Logical Plan ==
name: string
Project [name#5]
+- Filter (2 > 1)
   +- Subquery t1
      +- Project [_1#0 AS name#5,_2#1 AS date#6,_3#2 AS cate#7,_4#3 AS amountSpent#8,_5#4 AS time#9]
         +- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1] at rddToDataFrameHolder at <console>:27
 
== Optimized Logical Plan ==
Project [_1#0 AS name#5]
+- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1] at rddToDataFrameHolder at <console>:27
 
== Physical Plan ==
Project [_1#0 AS name#5]
+- Scan ExistingRDD[_1#0,_2#1,_3#2,_4#3,_5#4]

After optimization, the always-true filter 2 > 1 has been removed from the logical plan.

The implementing rule:

object SimplifyFilters extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // If the filter condition always evaluates to true, remove the filter.
    case Filter(Literal(true, BooleanType), child) => child
    // If the filter condition always evaluates to null or false,
    // replace the input with an empty relation.
    case Filter(Literal(null, _), child) => LocalRelation(child.output, data = Seq.empty)
    case Filter(Literal(false, BooleanType), child) => LocalRelation(child.output, data = Seq.empty)
  }
}
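
For the always-false branch, the same spark-shell session can be used to confirm that the plan collapses to an empty relation (a hedged probe reusing the t1 table registered above; output omitted):

scala> sqlContext.sql("select name from t1 where 2 < 1").queryExecution.optimizedPlan
// expected shape: a LocalRelation over name with no rows and no scan of t1 underneath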

3. Simplify Casts

If the expression's data type is already the target type of the cast, the Cast is removed.

Example SQL:

select cast(name as String) from t1
Optimization process:
// name is already of type String
scala> sqlContext.sql("select cast(name as String) from t1")
17/07/25 16:59:44 INFO parse.ParseDriver: Parsing command: select cast(name as String) from t1
17/07/25 16:59:44 INFO parse.ParseDriver: Parse Completed
res29: org.apache.spark.sql.DataFrame = [name: string]
 
scala> res29.queryExecution
res30: org.apache.spark.sql.execution.QueryExecution =
== Parsed Logical Plan ==
'Project [unresolvedalias(cast('name as string))]
+- 'UnresolvedRelation `t1`, None
 
== Analyzed Logical Plan ==
name: string
Project [cast(name#5 as string) AS name#20]
+- Subquery t1
   +- Project [_1#0 AS name#5,_2#1 AS date#6,_3#2 AS cate#7,_4#3 AS amountSpent#8,_5#4 AS time#9]
      +- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1] at rddToDataFrameHolder at <console>:27
 
== Optimized Logical Plan ==
// the redundant cast has been removed
Project [_1#0 AS name#20]
+- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1] at rddToDataFrameHolder at <console>:27
 
== Physical Plan ==
Project [_1#0 AS name#20]
+- Scan ExistingRDD[_1#0,_2#1,_3#2,_4#3,_5#4]

Since name is already a String, the optimizer removed the redundant cast-to-String expression.

The implementing rule:

object SimplifyCasts extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case Cast(e, dataType) if e.dataType == dataType => e
  }
}
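
The rule only fires on an exact type match; a cast that actually changes the type must survive. A hedged probe in the same session (output omitted):

scala> sqlContext.sql("select cast(name as int) from t1").queryExecution.optimizedPlan
// expected shape: the Project still contains cast(... as int)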

4. Simplify Case Conversion Expressions

For nested case-conversion expressions, the outermost conversion determines the result, so the inner conversions are removed.

Example SQL:

select upper(lower(name)) from t1
Optimization process:
scala> sqlContext.sql("select upper(lower(name)) from t1")
17/07/25 17:13:01 INFO parse.ParseDriver: Parsing command: select upper(lower(name)) from t1
17/07/25 17:13:01 INFO parse.ParseDriver: Parse Completed
res34: org.apache.spark.sql.DataFrame = [_c0: string]
 
scala> res34.queryExecution
res35: org.apache.spark.sql.execution.QueryExecution =
== Parsed Logical Plan ==
'Project [unresolvedalias('upper('lower('name)))]
+- 'UnresolvedRelation `t1`, None
 
== Analyzed Logical Plan ==
_c0: string
Project [upper(lower(name#5)) AS _c0#22]
+- Subquery t1
   +- Project [_1#0 AS name#5,_2#1 AS date#6,_3#2 AS cate#7,_4#3 AS amountSpent#8,_5#4 AS time#9]
      +- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1] at rddToDataFrameHolder at <console>:27
 
== Optimized Logical Plan ==
// only the outermost upper() remains
Project [upper(_1#0) AS _c0#22]
+- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1] at rddToDataFrameHolder at <console>:27
 
== Physical Plan ==
Project [upper(_1#0) AS _c0#22]
+- Scan ExistingRDD[_1#0,_2#1,_3#2,_4#3,_5#4]

After optimization only the outermost case conversion remains, which is equivalent to executing: select upper(name) from t1

The implementing rule:

object SimplifyCaseConversionExpressions extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case q: LogicalPlan => q transformExpressionsUp {
      // keep the outermost conversion and drop the inner one
      case Upper(Upper(child)) => Upper(child)
      case Upper(Lower(child)) => Upper(child)
      case Lower(Upper(child)) => Lower(child)
      case Lower(Lower(child)) => Lower(child)
    }
  }
}
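
Because the rule uses transformExpressionsUp, children are rewritten before their parents, so arbitrarily deep chains collapse in a single pass. A standalone toy model of that bottom-up behavior (hypothetical Attr/Upper/Lower types, no Spark required):

sealed trait StrExpr
case class Attr(name: String) extends StrExpr
case class Upper(child: StrExpr) extends StrExpr
case class Lower(child: StrExpr) extends StrExpr

def simplifyUp(e: StrExpr): StrExpr = {
  // rewrite children first (the "Up" in transformExpressionsUp)
  val rewritten = e match {
    case Upper(c) => Upper(simplifyUp(c))
    case Lower(c) => Lower(simplifyUp(c))
    case other    => other
  }
  // then apply the rule at this node: keep only the outermost conversion
  rewritten match {
    case Upper(Upper(c)) => Upper(c)
    case Upper(Lower(c)) => Upper(c)
    case Lower(Upper(c)) => Lower(c)
    case Lower(Lower(c)) => Lower(c)
    case other           => other
  }
}

// simplifyUp(Upper(Lower(Upper(Attr("name"))))) == Upper(Attr("name"))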

5. Optimize In

Rewrites an In over a list of literals into an InSet (a set membership test).

Example SQL:

select * from t1 where id in (1,1,2,2,1,2,1,2,2,2,2,2)
After optimization this behaves like a set membership test, equivalent to (note: the rewrite was not observed in a Spark 1.6.2 test environment!):
select * from t1 where id in (1,2)

The implementing rule:

/**
  * Replaces [[In (value, seq[Literal])]] with optimized version
  * [[InSet (value, HashSet[Literal])]] which is much faster
  */
object OptimizeIn extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case q: LogicalPlan => q transformExpressionsDown {
      // only fires when every element is a literal and the list has more than 10 entries
      case In(v, list) if !list.exists(!_.isInstanceOf[Literal]) && list.size > 10 =>
        val hSet = list.map(e => e.eval(EmptyRow))
        InSet(v, HashSet() ++ hSet)
    }
  }
}
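
The payoff is per-row lookup cost: In scans the literal list linearly for every input row, while InSet does a single hash probe. A rough standalone illustration in plain Scala (building the HashSet also deduplicates, which is why the twelve-element list above behaves like {1, 2}):

import scala.collection.immutable.HashSet

val list = List(1, 1, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2)
val hSet = HashSet(list: _*) // HashSet(1, 2): duplicates collapse

list.contains(2) // walks the list element by element
hSet.contains(2) // a single hash lookup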

6. Simplify Like

LIKE patterns of the following shapes are optimized so that regex matching can be avoided:

startsWith: 'abc%'
endsWith:   '%abc'
contains:   '%abc%'
equalTo:    'abc'

Example SQL:

select name from t1 where name like 'B%'
This is not executed via regex matching. Optimization process:
scala> sqlContext.sql("select name from t1 where name like 'B%'")
17/07/25 18:25:04 INFO parse.ParseDriver: Parsing command: select name from t1 where name like 'B%'
17/07/25 18:25:04 INFO parse.ParseDriver: Parse Completed
res46: org.apache.spark.sql.DataFrame = [name: string]
 
scala> res46.queryExecution
res47: org.apache.spark.sql.execution.QueryExecution =
== Parsed Logical Plan ==
'Project [unresolvedalias('name)]
+- 'Filter 'name LIKE B%
   +- 'UnresolvedRelation `t1`, None
 
== Analyzed Logical Plan ==
name: string
Project [name#5]
+- Filter name#5 LIKE B%
   +- Subquery t1
      +- Project [_1#0 AS name#5,_2#1 AS date#6,_3#2 AS cate#7,_4#3 AS amountSpent#8,_5#4 AS time#9]
         +- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1] at rddToDataFrameHolder at <console>:27
 
== Optimized Logical Plan ==
Project [_1#0 AS name#5]
+- Filter StartsWith(_1#0, B) // rewritten to a string startsWith()
   +- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1] at rddToDataFrameHolder at <console>:27
 
== Physical Plan ==
Project [_1#0 AS name#5]
+- Filter StartsWith(_1#0, B)
   +- Scan ExistingRDD[_1#0,_2#1,_3#2,_...

After optimization, the original LIKE pattern is executed as a string startsWith() operation instead of a regex match.

The implementing rule:

/**
  * Simplifies LIKE expressions that do not require regex matching.
  */
object LikeSimplification extends Rule[LogicalPlan] {
  // if guards below protect from escapes on trailing %.
  // Cases like "something\%" are not optimized, but this does not affect correctness.
  private val startsWith = "([^_%]+)%".r  // 'abc%'
  private val endsWith = "%([^_%]+)".r    // '%abc'
  private val contains = "%([^_%]+)%".r   // '%abc%'
  private val equalTo = "([^_%]*)".r      // 'abc'
 
  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case Like(l, Literal(utf, StringType)) =>
      utf.toString match {
        case startsWith(pattern) if !pattern.endsWith("\\") =>
          StartsWith(l, Literal(pattern)) // string startsWith()
        case endsWith(pattern) =>
          EndsWith(l, Literal(pattern))   // string endsWith()
        case contains(pattern) if !pattern.endsWith("\\") =>
          Contains(l, Literal(pattern))   // substring containment check
        case equalTo(pattern) =>
          EqualTo(l, Literal(pattern))    // plain string equality
        case _ =>
          Like(l, Literal.create(utf, StringType))
      }
  }
}
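
Since the four classifiers are ordinary Scala regexes, their dispatch can be checked without Spark. A small standalone sketch (the classify helper is hypothetical):

val startsWith = "([^_%]+)%".r
val endsWith   = "%([^_%]+)".r
val contains   = "%([^_%]+)%".r
val equalTo    = "([^_%]*)".r

def classify(pattern: String): String = pattern match {
  case startsWith(s) if !s.endsWith("\\") => s"StartsWith($s)"
  case endsWith(s)                        => s"EndsWith($s)"
  case contains(s) if !s.endsWith("\\")   => s"Contains($s)"
  case equalTo(s)                         => s"EqualTo($s)"
  case _                                  => "Like (falls back to regex matching)"
}

Seq("abc%", "%abc", "%abc%", "abc", "a_c%").map(classify)
// StartsWith(abc), EndsWith(abc), Contains(abc), EqualTo(abc), Like (falls back to regex matching)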


7. Null Propagation

In certain cases, replaces a null expression with a literal to stop the null from propagating through the plan.

Example SQL:

select count(null) from t1
Optimization process:
scala> sqlContext.sql("select count(null) from t1")
17/07/26 11:40:18 INFO parse.ParseDriver: Parsing command: select count(null) from t1
17/07/26 11:40:18 INFO parse.ParseDriver: Parse Completed
res8: org.apache.spark.sql.DataFrame = [_c0: bigint]
 
scala> res8.queryExecution
res10: org.apache.spark.sql.execution.QueryExecution =
== Parsed Logical Plan ==
'Project [unresolvedalias('count(null))]
+- 'UnresolvedRelation `t1`, None
 
== Analyzed Logical Plan ==
_c0: bigint
Aggregate [(count(null),mode=Complete,isDistinct=false) AS _c0#10L]
+- Subquery t1
   +- Project [_1#0 AS name#5,_2#1 AS date#6,_3#2 AS cate#7,_4#3 AS amountSpent#8,_5#4 AS time#9]
      +- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1] at rddToDataFrameHolder at <console>:27
 
== Optimized Logical Plan ==
// returns the constant 0 directly
Aggregate [0 AS _c0#10L]
+- Project
   +- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1] at rddToDataFrameHolder at <console>:27

After optimization, count(null) is replaced with the constant 0 in the logical plan, so the count can be produced without scanning the whole table.

The implementing rule:

object NullPropagation extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case q: LogicalPlan => q transformExpressionsUp {
      // count(null) is rewritten to the constant 0
      case e @ Count(Literal(null, _)) => Cast(Literal(0L), e.dataType)
      // a count over nothing but null literals is also 0
      // (nonNullLiteral is true unless the expression is Literal(null, _))
      case e @ AggregateExpression(Count(exprs), _, _) if !exprs.exists(nonNullLiteral) =>
        Cast(Literal(0L), e.dataType)
      // IS NULL on a non-nullable input is always false
      case e @ IsNull(c) if !c.nullable => Literal.create(false, BooleanType)
      // IS NOT NULL on a non-nullable input is always true
      case e @ IsNotNull(c) if !c.nullable => Literal.create(true, BooleanType)
      case ...
    }
  }
}
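
COUNT skips NULLs, so counting a literal null can only ever yield 0, and IS NULL / IS NOT NULL over a non-nullable input are constants. A toy cut of those rules over a hypothetical mini-AST (no Spark required):

sealed trait Expr { def nullable: Boolean = true }
case class Lit(v: Any) extends Expr
case class Col(name: String, override val nullable: Boolean) extends Expr
case class CountOf(child: Expr) extends Expr
case class IsNullOf(child: Expr) extends Expr
case class IsNotNullOf(child: Expr) extends Expr

def nullPropagate(e: Expr): Expr = e match {
  case CountOf(Lit(null))            => Lit(0L)    // count(null) can only ever be 0
  case IsNullOf(c) if !c.nullable    => Lit(false) // a non-nullable input is never null
  case IsNotNullOf(c) if !c.nullable => Lit(true)
  case other                         => other
}

// nullPropagate(CountOf(Lit(null)))                    == Lit(0L)
// nullPropagate(IsNullOf(Col("id", nullable = false))) == Lit(false)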

8. Boolean Simplification

When boolean expressions are combined through logical connectives (and, or, not), they are simplified using the connectives' properties (e.g. true && a > 1 simplifies to a > 1, and true || a > 1 simplifies to true).

Example SQL:

select name from t1 where 2 > 1 and time > 1
Optimization process:
scala> sqlContext.sql("select name from t1 where 2 > 1 and time > 1")
17/07/26 12:10:17 INFO parse.ParseDriver: Parsing command: select name from t1 where 2 > 1 and time > 1
17/07/26 12:10:17 INFO parse.ParseDriver: Parse Completed
res26: org.apache.spark.sql.DataFrame = [name: string]
 
scala> res26.queryExecution
res28: org.apache.spark.sql.execution.QueryExecution =
== Parsed Logical Plan ==
'Project [unresolvedalias('name)]
+- 'Filter ((2 > 1) && ('time > 1))
   +- 'UnresolvedRelation `t1`, None
 
== Analyzed Logical Plan ==
name: string
Project [name#5]
+- Filter ((2 > 1) && (time#9 > 1))
   +- Subquery t1
      +- Project [_1#0 AS name#5,_2#1 AS date#6,_3#2 AS cate#7,_4#3 AS amountSpent#8,_5#4 AS time#9]
         +- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1] at rddToDataFrameHolder at <console>:27
 
== Optimized Logical Plan ==
Project [_1#0 AS name#5]
// 2 > 1 is always true, so the AND keeps only the remaining conjunct
+- Filter (_5#4 > 1)
   +- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1] at rddToDataFrameHolder at <console>:27

After optimization, the always-true expression 2 > 1 has been removed from the AND condition in the filter.

The implementing rule:

/**
  * Simplifies boolean expressions:
  * 1. Simplifies expressions whose answer can be determined without evaluating both sides.
  * 2. Eliminates / extracts common factors.
  * 3. Merge same expressions
  * 4. Removes `Not` operator.
  */
object BooleanSimplification extends Rule[LogicalPlan] with PredicateHelper {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case q: LogicalPlan => q transformExpressionsUp {
      // And: a true operand can be dropped; a false operand makes the whole expression false
      case and @ And(left, right) => (left, right) match {
        // true && r  =>  r
        case (Literal(true, BooleanType), r) => r
        // l && true  =>  l
        case (l, Literal(true, BooleanType)) => l
        // false && r  =>  false
        case (Literal(false, BooleanType), _) => Literal(false)
        // l && false  =>  false
        case (_, Literal(false, BooleanType)) => Literal(false)
        // a && a  =>  a
        case (l, r) if l fastEquals r => l
        // a && (not(a) || b)  =>  a && b
        case (l, Or(l1, r)) if Not(l) == l1 => And(l, r)
        case (l, Or(r, l1)) if Not(l) == l1 => And(l, r)
        case (Or(l, l1), r) if l1 == Not(r) => And(l, r)
        case (Or(l1, l), r) if l1 == Not(r) => And(l, r)
        // (a || b) && (a || c)  =>  a || (b && c)
        case ...
      } // end of And(left, right)
 
      // Or: short-circuit evaluation rules
      case or @ Or(left, right) => (left, right) match {
        // true || r  =>  true (one true operand decides the result)
        case (Literal(true, BooleanType), _) => Literal(true)
        // l || true  =>  true
        case (_, Literal(true, BooleanType)) => Literal(true)
        // false || r  =>  r
        case (Literal(false, BooleanType), r) => r
        // l || false  =>  l
        case (l, Literal(false, BooleanType)) => l
        // a || a  =>  a
        case (l, r) if l fastEquals r => l
        // (a && b) || (a && c)  =>  a && (b || c)
        case ...
      } // end of Or(left, right)
 
      // eliminate Not by rewriting to the direct negation
      case not @ Not(exp) => exp match {
        // not(true)  =>  false
        case Literal(true, BooleanType) => Literal(false)
        // not(false)  =>  true
        case Literal(false, BooleanType) => Literal(true)
        // not(l > r)  =>  l <= r
        case GreaterThan(l, r) => LessThanOrEqual(l, r)
        // not(l >= r)  =>  l < r
        case GreaterThanOrEqual(l, r) => LessThan(l, r)
        // not(l < r)  =>  l >= r
        case LessThan(l, r) => GreaterThanOrEqual(l, r)
        // not(l <= r)  =>  l > r
        case LessThanOrEqual(l, r) => GreaterThan(l, r)
        // not(l || r)  =>  not(l) && not(r)
        case Or(l, r) => And(Not(l), Not(r))
        // not(l && r)  =>  not(l) || not(r)
        case And(l, r) => Or(Not(l), Not(r))
        // not(not(e))  =>  e
        case Not(e) => e
        case _ => not
      } // end of Not(exp)
 
      // if (true) a else b  =>  a
      // if (false) a else b  =>  b
      case e @ If(Literal(v, _), trueValue, falseValue) => if (v == true) trueValue else falseValue
    }
  }
}
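
The Not-elimination cases are De Morgan's laws plus comparison flipping, with the recursion stopping when Not wraps something that has no cheaper negation (mirroring case _ => not above). A standalone toy of that part (hypothetical mini-AST, no Spark required):

sealed trait Pred
case class Leaf(s: String) extends Pred
case class LitP(v: Boolean) extends Pred
case class AndP(l: Pred, r: Pred) extends Pred
case class OrP(l: Pred, r: Pred) extends Pred
case class NotP(p: Pred) extends Pred

def pushNot(p: Pred): Pred = p match {
  case NotP(LitP(v))    => LitP(!v)                                 // not(true) => false
  case NotP(AndP(l, r)) => OrP(pushNot(NotP(l)), pushNot(NotP(r)))  // De Morgan
  case NotP(OrP(l, r))  => AndP(pushNot(NotP(l)), pushNot(NotP(r)))
  case NotP(NotP(x))    => pushNot(x)                               // double negation
  case AndP(l, r)       => AndP(pushNot(l), pushNot(r))
  case OrP(l, r)        => OrP(pushNot(l), pushNot(r))
  case other            => other                                    // e.g. NotP(Leaf(...)) stays as-is
}

// pushNot(NotP(AndP(Leaf("a"), NotP(Leaf("b"))))) == OrP(NotP(Leaf("a")), Leaf("b"))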