Spark SQL中的broadcast join分析

来源：互联网发布：学编程需要什么电脑编辑：程序博客网时间：2024/06/01 13:20

　　在Spark-1.6.2中，执行相同join查询语句，broadcast join模式下，DAG和执行时间如下图所示：
1、broadcast join
（1）DAG
　　
　　这里写图片描述
　　
（2）执行时间

122 rows selected (22.709 seconds)

　　
2、非broadcast join
（1）DAG
　　
　　这里写图片描述
　　
（2）执行时间

122 rows selected (55.512 seconds)

　　对于broadcast join模式，会将小于spark.sql.autoBroadcastJoinThreshold值（默认为10M）的表广播到其他计算节点，不走shuffle过程，所以会更加高效。

一、Spark源码解析

　　源码中的基本流程如下所示：

1、`org.apache.spark.sql.execution.SparkStrategies`类

　　决定是否使用broadcast join的逻辑在SparkStrategies类中，

/*** Matches a plan whose output should be small enough to be used in broadcast join.*/object CanBroadcast {  def unapply(plan: LogicalPlan): Option[LogicalPlan] = plan match {    case BroadcastHint(p) => Some(p)    case p if sqlContext. conf.autoBroadcastJoinThreshold > 0 &&      p.statistics.sizeInBytes <= sqlContext.conf.autoBroadcastJoinThreshold => Some(p)    case _ => None  }}

　　这里面sqlContext.conf.autoBroadcastJoinThreshold由参数spark.sql.autoBroadcastJoinThreshold来设置，默认为10 * 1024 * 1024Bytes（10M）。上面这段逻辑是说，如果该参数值大于0，并且p.statistics.sizeInBytes的值比该参数值小时，就会认为该表比较小，在做join时会broadcast到各个executor上，从而避免了shuffle过程。

2、`org.apache.spark.sql.hive.HiveMetastoreCatalog`类

　　p.statistics.sizeInBytes的值，查看HiveMetastoreCatalog类文件，如下所示

@transient override lazy val statistics: Statistics = Statistics(  sizeInBytes = {    val totalSize = hiveQlTable.getParameters.get(StatsSetupConst.TOTAL_SIZE)    val rawDataSize = hiveQlTable .getParameters.get(StatsSetupConst.RAW_DATA_SIZE)    // TODO: check if this estimate is valid for tables after partition pruning.    // NOTE: getting `totalSize` directly from params is kind of hacky, but this should be    // relatively cheap if parameters for the table are populated into the metastore.  An    // alternative would be going through Hadoop's FileSystem API, which can be expensive if a lot    // of RPCs are involved.  Besides `totalSize`, there are also `numFiles`, `numRows`,    // `rawDataSize` keys (see StatsSetupConst in Hive) that we can look at in the future.    BigInt(      // When table is external,`totalSize` is always zero, which will influence join strategy      // so when `totalSize` is zero, use `rawDataSize` instead      // if the size is still less than zero, we use default size      Option (totalSize).map(_.toLong).filter(_ > 0)        .getOrElse(Option (rawDataSize).map(_.toLong).filter(_ > 0)        .getOrElse(sqlContext.conf.defaultSizeInBytes)))  })

　　会从hive的tblproperties属性中优先取出totalSize的数值，如果该值不大于0，则取rawDataSize的值，如果该值也不大于0，那么就取sqlContext.conf.defaultSizeInBytes。

3、`org.apache.spark.sql.SQLConf`类

　　那么sqlContext.conf.defaultSizeInBytes又是多少呢？这个值配置在SQLConf类中，如下所示。

private[spark] def defaultSizeInBytes: Long =  getConf(DEFAULT_SIZE_IN_BYTES , autoBroadcastJoinThreshold + 1L )

　　取DEFAUTL_SIZE_IN_BYTES的值，这个值一般需要设置的比spark.sql.autoBroadcastJoinThreshold大，以避免其他表被broadcast出去了。可以看到，默认值为autoBroadcastJoinThreshold值加1。

　　上面这一段的意思是，按顺序取hive的表统计信息中的totalSize属性和rawDataSize属性，直到取到一个大于零的值为止。如果这两个值都不大于零，那么就默认该表不能被broadcast出去。
　　

二、问题

　　从上面的过程可以看出，确定是否广播小表的决定性因素是hive的表统计信息一定要准确。并且，由于视图是没有表统计信息的，所以所有的视图在join时都不会被广播。

　　假如遇到这种需求：有一张全量快照分区表A，需要有一张表B永远指向表A的最新分区，但是又不能影响表B的broadcast join功能。

1、问题1：直接insert会慢

　　最直接的做法是使用insert overwrite table B select * from A where partition='20170322'来进行，在插入数据到B表后，会更新该表的totalSize和rawDataSize属性。

　　但是每次进行insert是一个很耗时的过程。

2、问题2：set location不更新表统计信息

　　如果想要避免这一过程的话，是不是可以用alter table B set location 'PATH_TO_TABLE_A/partition=20170322'的形式呢？可以这样做，但是这样又不会每次更新totalSize和rawDataSize属性了。

　　如果需要更新表统计信息的话，测试过了alter table B set tblproperties('totalSize'='123')语句，也不能生效。
　　如下所示，在更新前后，不仅totalSize没有变化，反而将rawDataSize给置为了-1。并且多了一些之前没有的默认值

0: jdbc:hive2://client:10000> show tblproperties shop_info;numFiles    1COLUMN_STATS_ACCURATE    truetransient_lastDdlTime    1490283932totalSize    102030numRows    2000rawDataSize    1000300: jdbc:hive2://client:10000> alter table shop_info set tblproperties('totalSize'='1234');0: jdbc:hive2://client:10000> show tblproperties shop_info;                               numFiles    1last_modified_by    hadooplast_modified_time    1490284040transient_lastDdlTime    1490284040COLUMN_STATS_ACCURATE    falsetotalSize    102030numRows    -1rawDataSize    -1

　　只能执行analyze table B compute statistics语句来跑map-reduce任务来统计了，这一过程也会比较耗时。

3、问题3：external table set location后会清除表统计信息

　　并且最好将表B建成external table形式，避免删除表B时将表A的数据删除掉。

　　在实践中发现，每次对external tale进行set location操作后，即使重新统计过表信息，它的totalSize和rawDataSize仍然会被清除掉。

# 首先按照shop_info表格式创建一个外部表，0: jdbc:hive2://client:10000> create external table test_broadcast like shop_info;# 查看表信息0: jdbc:hive2://client:10000>  show tblproperties test_broadcast;EXTERNAL    TRUEtransient_lastDdlTime    1490284194# 重定向location0: jdbc:hive2://client:10000> alter table test_broadcast set location 'hdfs://m000/user/hive/warehouse/tianchi.db/user_pay_train';# 查看表信息0: jdbc:hive2://client:10000>  show tblproperties test_broadcast;numFiles    0EXTERNAL    TRUElast_modified_by    hadooplast_modified_time    1490284413COLUMN_STATS_ACCURATE    falsetransient_lastDdlTime    1490284413numRows    -1totalSize    0rawDataSize    -1# 统计表信息0: jdbc:hive2://client:10000> analyze table test_broadcast compute statistics;# 查看表信息0: jdbc:hive2://client:10000>show tblproperties test_broadcast;numFiles    0EXTERNAL    TRUElast_modified_by    hadooplast_modified_time    1490284413transient_lastDdlTime    1490284751COLUMN_STATS_ACCURATE    truenumRows    65649782totalSize    0rawDataSize    2098020423# 再次重定向0: jdbc:hive2://client:10000> alter table test_broadcast set location 'hdfs://m000/user/hive/warehouse/tianchi.db/user_pay';# 查看表信息0: jdbc:hive2://client:10000>show tblproperties test_broadcast;numFiles    0EXTERNAL    TRUElast_modified_by    hadooplast_modified_time    1490284790transient_lastDdlTime    1490284790COLUMN_STATS_ACCURATE    falsenumRows    -1totalSize    0rawDataSize    -1

三、解决办法

　　遇到这个问题有没有觉得很棘手。

1、问题分析

　　其实方法已经在上面展示过了。我们来看一下org.apache.spark.sql.hive.HiveMetastoreCatalog类中关于statistics的注释：

// NOTE: getting totalSize directly from params is kind of hacky, but this should be
// relatively cheap if parameters for the table are populated into the metastore. An
// alternative would be going through Hadoop’s FileSystem API, which can be expensive if a lot
// of RPCs are involved. Besides totalSize, there are also numFiles, numRows,
// rawDataSize keys (see StatsSetupConst in Hive) that we can look at in the future.

　　这里说，直接获取totalSize并不是一个友好的办法，直接从这里获取totalSize只是相对比较快而已。其实可以通过HDFS的FileSystem API来获取该表在HDFS上的文件大小的。

　　看到这里，应该已经有了一个基本方法了。那就是将表B建为external table，并且每次遇到external table，直接去取hdfs上的文件大小。每次set location即可。这里唯一需要注意的是，通过HDFS FileSystem API获取文件大小是否耗时。这个可以通过下面的代码测试一下。

2、性能测试

　　测一下HDFS FileSystem API获取文件大小的耗时。

object HDFSFilesystemTest {  def main(args: Array[String ]) {    val conf = new Configuration()    conf.set("fs.default.name" , "hdfs://m000:8020")    val hiveWarehouse = "/user/hive/warehouse"    val path = new Path(hiveWarehouse)    val fs: FileSystem = path.getFileSystem(conf)    val begin = System.currentTimeMillis()    val size = fs.getContentSummary(path).getLength    println( s"$hiveWarehouse size is: $size Bytes")    val end = System.currentTimeMillis()    println( s"time consume ${end - begin} ms")  }}

　　输出结果如下，统计hive warehouse路径大小，耗时132毫秒。

/user/hive/warehouse size is: 4927963752 Bytestime consume 132 ms

3、源码修改

　　那么接下来修改一下org.apache.spark.sql.hive.HiveMetastoreCatalog中获取表大小的逻辑就可以了。

　　由于外部表会被重定向路径，或者指向路径的文件可以直接被put上来，所以统计的totalSize或者rawDataSize一般不准确。因此如果是外部表，直接获取hdfs文件大小。如果是非外部表，则按顺序取totalSize，rawDataSize的值，如果都不大于0，则通过HDFS FileSystem API获取hdfs文件大小了。

@transient override lazy val statistics: Statistics = Statistics(  sizeInBytes = {    BigInt (if ( hiveQlTable.getParameters.get("EXTERNAL" ) == "TRUE") {      try {        val hadoopConf = sqlContext.sparkContext.hadoopConfiguration        val fs: FileSystem = hiveQlTable.getPath.getFileSystem(hadoopConf)        fs.getContentSummary(hiveQlTable.getPath).getLength      } catch {        case e: IOException =>          logWarning("Failed to get table size from hdfs." , e)          sqlContext.conf.defaultSizeInBytes      }    } else {      val totalSize = hiveQlTable .getParameters.get(StatsSetupConst. TOTAL_SIZE)      val rawDataSize = hiveQlTable .getParameters.get(StatsSetupConst. RAW_DATA_SIZE)      // TODO: check if this estimate is valid for tables after partition pruning.      // NOTE: getting `totalSize` directly from params is kind of hacky, but this should be      // relatively cheap if parameters for the table are populated into the metastore.  An      // alternative would be going through Hadoop's FileSystem API, which can be expensive if a lot      // of RPCs are involved.  Besides `totalSize`, there are also `numFiles`, `numRows`,      // `rawDataSize` keys (see StatsSetupConst in Hive) that we can look at in the future.      if (totalSize != null && totalSize.toLong > 0L) {        totalSize.toLong      } else if (rawDataSize != null && rawDataSize.toLong > 0L) {        rawDataSize.toLong      } else {        try {          val hadoopConf = sqlContext.sparkContext.hadoopConfiguration          val fs: FileSystem = hiveQlTable.getPath.getFileSystem(hadoopConf)          fs.getContentSummary(hiveQlTable.getPath).getLength        } catch {          case e: IOException =>            logWarning("Failed to get table size from hdfs." , e)            sqlContext.conf.defaultSizeInBytes        }      }    })  })

　　改为之后，看了一下Spark-2中的源码，发现在org.apache.spark.sql.hive.MetastoreRelation中已经对此次进行了修改，如下所示：

  @transient override lazy val statistics: Statistics = {    catalogTable.stats.getOrElse(Statistics(      sizeInBytes = {        val totalSize = hiveQlTable.getParameters.get(StatsSetupConst.TOTAL_SIZE)        val rawDataSize = hiveQlTable.getParameters.get(StatsSetupConst.RAW_DATA_SIZE)        // TODO: check if this estimate is valid for tables after partition pruning.        // NOTE: getting `totalSize` directly from params is kind of hacky, but this should be        // relatively cheap if parameters for the table are populated into the metastore.        // Besides `totalSize`, there are also `numFiles`, `numRows`, `rawDataSize` keys        // (see StatsSetupConst in Hive) that we can look at in the future.        BigInt(          // When table is external,`totalSize` is always zero, which will influence join strategy          // so when `totalSize` is zero, use `rawDataSize` instead          // when `rawDataSize` is also zero, use `HiveExternalCatalog.STATISTICS_TOTAL_SIZE`,          // which is generated by analyze command.          if (totalSize != null && totalSize.toLong > 0L) {            totalSize.toLong          } else if (rawDataSize != null && rawDataSize.toLong > 0) {            rawDataSize.toLong          } else if (sparkSession.sessionState.conf.fallBackToHdfsForStatsEnabled) {            try {              val hadoopConf = sparkSession.sessionState.newHadoopConf()              val fs: FileSystem = hiveQlTable.getPath.getFileSystem(hadoopConf)              fs.getContentSummary(hiveQlTable.getPath).getLength            } catch {              case e: IOException =>                logWarning("Failed to get table size from hdfs.", e)                sparkSession.sessionState.conf.defaultSizeInBytes            }          } else {            sparkSession.sessionState.conf.defaultSizeInBytes          })      }    ))  }