Lesson 51: Source-Code Walkthrough of How Catalyst, Spark's New Parsing Engine, Finally Converts SQL into RDDs

Source: Internet. Editor: 程序博客网. Time: 2024/05/16 10:46

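Dataset-based code is only converted into RDDs once an action is called, and the conversion is optimized by Catalyst, Spark's new parsing engine. Catalyst is not limited to optimizing SQL: Spark's five sub-frameworks (Spark Core, Spark SQL, Spark Streaming, GraphX and MLlib) are all expected to build on Catalyst in the future.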

Source of the collect method in Dataset.scala:

    def collect(): Array[T] = collect(needCallback = true)

Stepping into collect(needCallback = true):

    private def collect(needCallback: Boolean): Array[T] = {
      def execute(): Array[T] = withNewExecutionId {
        queryExecution.executedPlan.executeCollect().map(boundEnc.fromRow)
      }

      if (needCallback) {
        withCallback("collect", toDF())(_ => execute())
      } else {
        execute()
      }
    }
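The listing does not show what withCallback does. The sketch below assumes it runs the action, measures the elapsed time and reports success or failure under the given name, which is my reading of the Spark 2.x helper rather than something the listing confirms. Everything in it (CallbackSketch, the events buffer, the fixed Array(1, 2, 3) result) is hypothetical plain Scala, not Spark code:

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.control.NonFatal

object CallbackSketch {
  // Events seen by a hypothetical listener (not a Spark type).
  val events: ArrayBuffer[String] = ArrayBuffer.empty[String]
  // Duration of the last successful action, in nanoseconds.
  var lastDurationNs: Long = -1L

  // Run `action`, report success or failure under `name`, then return or rethrow.
  def withCallback[A](name: String)(action: => A): A =
    try {
      val start = System.nanoTime()
      val result = action
      lastDurationNs = System.nanoTime() - start
      events += s"success:$name"
      result
    } catch {
      case NonFatal(e) =>
        events += s"failure:$name"
        throw e
    }

  // Same shape as Dataset.collect(needCallback): a local execute(),
  // optionally wrapped in the callback helper.
  def collect(needCallback: Boolean): Array[Int] = {
    def execute(): Array[Int] = Array(1, 2, 3)
    if (needCallback) withCallback("collect")(execute()) else execute()
  }
}
```

With needCallback = false the result is the same, but no event is recorded.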

The key line is queryExecution.executedPlan.executeCollect().map(boundEnc.fromRow), so let us look at executedPlan. executedPlan should not be used to initialize any SparkPlan; it is used only for execution.

Source of QueryExecution.scala:

    class QueryExecution(val sparkSession: SparkSession, val logical: LogicalPlan) {
      ……
      // executedPlan should not be used to initialize any SparkPlan. It should be
      // only used for execution.
      lazy val executedPlan: SparkPlan = prepareForExecution(sparkPlan)
      ……
      lazy val toRdd: RDD[InternalRow] = executedPlan.execute()
      ……
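Because executedPlan and toRdd are declared as lazy val, nothing is computed until one of them is first accessed, and each is computed at most once. Here is a minimal plain-Scala sketch of that behaviour. The LazyPipeline class and its string "plans" are hypothetical stand-ins, not Spark types:

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for QueryExecution: each phase is a lazy val
// that depends on the previous one and records when it actually runs.
class LazyPipeline(logical: String) {
  val trace: ArrayBuffer[String] = ArrayBuffer.empty[String]

  lazy val sparkPlan: String = {
    trace += "sparkPlan"
    s"physical($logical)"
  }

  lazy val executedPlan: String = {
    val in = sparkPlan // forces the previous phase first
    trace += "executedPlan"
    s"prepared($in)"
  }

  lazy val toRdd: String = {
    val in = executedPlan
    trace += "toRdd"
    s"rdd($in)"
  }
}

object LazyPipelineDemo {
  def main(args: Array[String]): Unit = {
    val qe = new LazyPipeline("scan")
    println(qe.trace.isEmpty)        // true: nothing has run yet
    println(qe.toRdd)                // forces all three phases
    println(qe.toRdd)                // cached: no phase runs again
    println(qe.trace.mkString(", ")) // sparkPlan, executedPlan, toRdd
  }
}
```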

In queryExecution.executedPlan.executeCollect(), the executeCollect method runs the query and returns the result as an array. executeCollect in turn calls byteArrayRdd.collect().

Source of executeCollect in SparkPlan.scala:

    def executeCollect(): Array[InternalRow] = {
      val byteArrayRdd = getByteArrayRdd()

      val results = ArrayBuffer[InternalRow]()
      byteArrayRdd.collect().foreach { bytes =>
        decodeUnsafeRows(bytes).foreach(results.+=)
      }
      results.toArray
    }
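executeCollect ships byte arrays to the driver and decodes them back into rows there. The sketch below imitates that round trip in plain Scala. It assumes each partition is packed into one byte array of length-prefixed records followed by an end marker. That layout, the helper names (encodePartition, decodeRows, collectAll) and the use of strings in place of UnsafeRow are assumptions for illustration, not Spark's implementation:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}
import scala.collection.mutable.ArrayBuffer

object ByteArrayCollectSketch {
  // Pack all rows of one partition into a single byte array:
  // each row is written as (length, bytes), and -1 marks the end.
  def encodePartition(rows: Iterator[String]): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val out = new DataOutputStream(bos)
    rows.foreach { row =>
      val bytes = row.getBytes("UTF-8")
      out.writeInt(bytes.length)
      out.write(bytes)
    }
    out.writeInt(-1)
    out.flush()
    bos.toByteArray
  }

  // Reverse of encodePartition: read (length, bytes) records until the end marker.
  def decodeRows(bytes: Array[Byte]): Iterator[String] = {
    val in = new DataInputStream(new ByteArrayInputStream(bytes))
    Iterator
      .continually(in.readInt())
      .takeWhile(_ >= 0)
      .map { size =>
        val buf = new Array[Byte](size)
        in.readFully(buf)
        new String(buf, "UTF-8")
      }
  }

  // Driver-side loop with the same shape as executeCollect:
  // one byte array per partition, decoded and appended in order.
  def collectAll(partitions: Seq[Iterator[String]]): Array[String] = {
    val results = ArrayBuffer[String]()
    partitions.map(encodePartition).foreach { bytes =>
      results ++= decodeRows(bytes)
    }
    results.toArray
  }
}
```

An empty partition still contributes a byte array, which holds only the end marker, so it decodes to zero rows.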

byteArrayRdd.collect() invokes the collect method in RDD.scala, which finally submits the job to the Spark cluster through sc.runJob.

Source of the collect method in RDD.scala:

    def collect(): Array[T] = withScope {
      val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
      Array.concat(results: _*)
    }
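The two lines of RDD.collect amount to "one array per partition, then concatenate". Here is a plain-Scala sketch of that idea. The runJob below is a hypothetical local stand-in that models partitions as nested sequences. It is not SparkContext.runJob:

```scala
import scala.reflect.ClassTag

object RunJobSketch {
  // Hypothetical stand-in for sc.runJob: apply `func` to every partition
  // and return one result per partition, in partition order.
  def runJob[T, U: ClassTag](partitions: Seq[Seq[T]], func: Iterator[T] => U): Array[U] =
    partitions.map(p => func(p.iterator)).toArray

  // Same two steps as RDD.collect: one array per partition, then concatenate.
  def collect[T: ClassTag](partitions: Seq[Seq[T]]): Array[T] = {
    val results = runJob(partitions, (iter: Iterator[T]) => iter.toArray)
    Array.concat(results: _*)
  }
}
```

Partition order is preserved, so the collected array lists the elements of the first partition first.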

Back in QueryExecution.scala, executedPlan.execute() is the key piece of code.

    lazy val toRdd: RDD[InternalRow] = executedPlan.execute()

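Stepping into execute in SparkPlan.scala: it returns the query result as an RDD[InternalRow]. It delegates the work to `doExecute`, which concrete SparkPlan implementations should override. The RDD[InternalRow] is therefore produced inside execute. Source of execute: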

    final def execute(): RDD[InternalRow] = executeQuery {
      doExecute()
    }

doExecute() in SparkPlan.scala is an abstract method with no body. Each concrete SparkPlan overrides it and produces the query result as an RDD[InternalRow].

    protected def doExecute(): RDD[InternalRow]
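The pairing of a final execute with an abstract doExecute is the template-method pattern: the base class owns the entry point, and each operator supplies only its own logic. Here is a minimal plain-Scala sketch. MiniPlan, RangePlan, FilterPlan and the invocation counter inside executeQuery are hypothetical and much simpler than the real SparkPlan:

```scala
object TemplateMethodSketch {
  // Hypothetical miniature of SparkPlan: `execute` is final and wraps
  // the call, while each concrete operator only supplies `doExecute`.
  abstract class MiniPlan {
    // Counts how often this node was executed (illustration only).
    var executions: Int = 0

    final def execute(): Seq[Int] = executeQuery {
      doExecute()
    }

    // Shared hook that runs around every doExecute call.
    protected def executeQuery[A](body: => A): A = {
      executions += 1
      body
    }

    protected def doExecute(): Seq[Int]
  }

  // Leaf operator: produces 0 until n.
  class RangePlan(n: Int) extends MiniPlan {
    override protected def doExecute(): Seq[Int] = 0 until n
  }

  // Unary operator: executes its child through the public entry point.
  class FilterPlan(child: MiniPlan, p: Int => Boolean) extends MiniPlan {
    override protected def doExecute(): Seq[Int] = child.execute().filter(p)
  }
}
```

Callers can only reach an operator through execute(), so the shared hook cannot be bypassed by a subclass.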

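InternalRow is the row data structure produced when the plan derived from the syntax tree is executed. Its subclasses include BaseGenericInternalRow, JoinedRow and UnsafeRow. The public org.apache.spark.sql.Row is a separate, user-facing type and not a subclass of InternalRow.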

Source of InternalRow.scala:

    abstract class InternalRow extends SpecializedGetters with Serializable {
      ……
      def setBoolean(i: Int, value: Boolean): Unit = update(i, value)
      def setByte(i: Int, value: Byte): Unit = update(i, value)
      def setShort(i: Int, value: Short): Unit = update(i, value)
      def setInt(i: Int, value: Int): Unit = update(i, value)
      def setLong(i: Int, value: Long): Unit = update(i, value)
      def setFloat(i: Int, value: Float): Unit = update(i, value)
      def setDouble(i: Int, value: Double): Unit = update(i, value)
      ……
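Every typed setter in the excerpt forwards to a single update(i, value), so a subclass only has to implement that one method. A small plain-Scala sketch of the same design follows. MiniRow and ArrayRow are hypothetical types, and the boxed-array backing store is only the simplest possible choice:

```scala
object RowSketch {
  // Hypothetical miniature of InternalRow: typed setters all funnel
  // into a single abstract `update`, so subclasses implement one method.
  abstract class MiniRow {
    def numFields: Int
    def update(i: Int, value: Any): Unit
    def get(i: Int): Any

    def setInt(i: Int, value: Int): Unit = update(i, value)
    def setLong(i: Int, value: Long): Unit = update(i, value)
    def setDouble(i: Int, value: Double): Unit = update(i, value)
  }

  // Simplest possible backing store: an array of boxed values.
  class ArrayRow(size: Int) extends MiniRow {
    private val values = new Array[Any](size)
    def numFields: Int = size
    def update(i: Int, value: Any): Unit = values(i) = value
    def get(i: Int): Any = values(i)
  }
}
```

A subclass remains free to override an individual setter with a more direct implementation.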

Internally, Dataset code is converted into RDDs through the following pipeline:

Parse SQL (Dataset) -> Analyze Logical Plan -> Optimize Logical Plan -> Generate Physical Plan -> Prepared Spark Plan -> Execute SQL -> Generate RDD

Step by step, Dataset-based code is thus converted into RDDs: the RDD is finally produced by the call to execute().
