Spark SQL中的broadcast join分析

来源:互联网 发布:学编程需要什么电脑 编辑:程序博客网 时间:2024/06/01 13:20

  在Spark-1.6.2中,执行相同join查询语句,broadcast join模式下,DAG和执行时间如下图所示:
1、broadcast join
(1)DAG
  
  这里写图片描述
  
(2)执行时间

122 rows selected (22.709 seconds)

  
2、非broadcast join
(1)DAG
  
  这里写图片描述
  
(2)执行时间

122 rows selected (55.512 seconds)

  对于broadcast join模式,会将小于spark.sql.autoBroadcastJoinThreshold值(默认为10M)的表广播到其他计算节点,不走shuffle过程,所以会更加高效。

一、Spark源码解析

  源码中的基本流程如下所示:

1、org.apache.spark.sql.execution.SparkStrategies

  决定是否使用broadcast join的逻辑在SparkStrategies类中,

/*** Matches a plan whose output should be small enough to be used in broadcast join.*/object CanBroadcast {  def unapply(plan: LogicalPlan): Option[LogicalPlan] = plan match {    case BroadcastHint(p) => Some(p)    case p if sqlContext. conf.autoBroadcastJoinThreshold > 0 &&      p.statistics.sizeInBytes <= sqlContext.conf.autoBroadcastJoinThreshold => Some(p)    case _ => None  }}

  这里面sqlContext.conf.autoBroadcastJoinThreshold由参数spark.sql.autoBroadcastJoinThreshold来设置,默认为10 * 1024 * 1024Bytes(10M)。上面这段逻辑是说,如果该参数值大于0,并且p.statistics.sizeInBytes的值比该参数值小时,就会认为该表比较小,在做join时会broadcast到各个executor上,从而避免了shuffle过程。

2、org.apache.spark.sql.hive.HiveMetastoreCatalog

  p.statistics.sizeInBytes的值,查看HiveMetastoreCatalog类文件,如下所示

@transient override lazy val statistics: Statistics = Statistics(  sizeInBytes = {    val totalSize = hiveQlTable.getParameters.get(StatsSetupConst.TOTAL_SIZE)    val rawDataSize = hiveQlTable .getParameters.get(StatsSetupConst.RAW_DATA_SIZE)    // TODO: check if this estimate is valid for tables after partition pruning.    // NOTE: getting `totalSize` directly from params is kind of hacky, but this should be    // relatively cheap if parameters for the table are populated into the metastore.  An    // alternative would be going through Hadoop's FileSystem API, which can be expensive if a lot    // of RPCs are involved.  Besides `totalSize`, there are also `numFiles`, `numRows`,    // `rawDataSize` keys (see StatsSetupConst in Hive) that we can look at in the future.    BigInt(      // When table is external,`totalSize` is always zero, which will influence join strategy      // so when `totalSize` is zero, use `rawDataSize` instead      // if the size is still less than zero, we use default size      Option (totalSize).map(_.toLong).filter(_ > 0)        .getOrElse(Option (rawDataSize).map(_.toLong).filter(_ > 0)        .getOrElse(sqlContext.conf.defaultSizeInBytes)))  })

  会从hive的tblproperties属性中优先取出totalSize的数值,如果该值不大于0,则取rawDataSize的值,如果该值也不大于0,那么就取sqlContext.conf.defaultSizeInBytes

3、org.apache.spark.sql.SQLConf

  那么sqlContext.conf.defaultSizeInBytes又是多少呢?这个值配置在SQLConf类中,如下所示。

private[spark] def defaultSizeInBytes: Long =  getConf(DEFAULT_SIZE_IN_BYTES , autoBroadcastJoinThreshold + 1L )

  取DEFAUTL_SIZE_IN_BYTES的值,这个值一般需要设置的比spark.sql.autoBroadcastJoinThreshold大,以避免其他表被broadcast出去了。可以看到,默认值为autoBroadcastJoinThreshold值加1。

  上面这一段的意思是,按顺序取hive的表统计信息中的totalSize属性和rawDataSize属性,直到取到一个大于零的值为止。如果这两个值都不大于零,那么就默认该表不能被broadcast出去。
  

二、问题

  从上面的过程可以看出,确定是否广播小表的决定性因素是hive的表统计信息一定要准确。并且,由于视图是没有表统计信息的,所以所有的视图在join时都不会被广播。

  假如遇到这种需求:有一张全量快照分区表A,需要有一张表B永远指向表A的最新分区,但是又不能影响表B的broadcast join功能。

1、问题1:直接insert会慢

  最直接的做法是使用insert overwrite table B select * from A where partition='20170322'来进行,在插入数据到B表后,会更新该表的totalSizerawDataSize属性。

  但是每次进行insert是一个很耗时的过程。

2、问题2:set location不更新表统计信息

  如果想要避免这一过程的话,是不是可以用alter table B set location 'PATH_TO_TABLE_A/partition=20170322'的形式呢?可以这样做,但是这样又不会每次更新totalSizerawDataSize属性了。

  如果需要更新表统计信息的话,测试过了alter table B set tblproperties('totalSize'='123')语句,也不能生效。
  如下所示,在更新前后,不仅totalSize没有变化,反而将rawDataSize给置为了-1。并且多了一些之前没有的默认值

0: jdbc:hive2://client:10000> show tblproperties shop_info;numFiles    1COLUMN_STATS_ACCURATE    truetransient_lastDdlTime    1490283932totalSize    102030numRows    2000rawDataSize    1000300: jdbc:hive2://client:10000> alter table shop_info set tblproperties('totalSize'='1234');0: jdbc:hive2://client:10000> show tblproperties shop_info;                               numFiles    1last_modified_by    hadooplast_modified_time    1490284040transient_lastDdlTime    1490284040COLUMN_STATS_ACCURATE    falsetotalSize    102030numRows    -1rawDataSize    -1

  只能执行analyze table B compute statistics语句来跑map-reduce任务来统计了,这一过程也会比较耗时。

3、问题3:external table set location后会清除表统计信息

  并且最好将表B建成external table形式,避免删除表B时将表A的数据删除掉。

  在实践中发现,每次对external tale进行set location操作后,即使重新统计过表信息,它的totalSizerawDataSize仍然会被清除掉。

# 首先按照shop_info表格式创建一个外部表,0: jdbc:hive2://client:10000> create external table test_broadcast like shop_info;# 查看表信息0: jdbc:hive2://client:10000>  show tblproperties test_broadcast;EXTERNAL    TRUEtransient_lastDdlTime    1490284194# 重定向location0: jdbc:hive2://client:10000> alter table test_broadcast set location 'hdfs://m000/user/hive/warehouse/tianchi.db/user_pay_train';# 查看表信息0: jdbc:hive2://client:10000>  show tblproperties test_broadcast;numFiles    0EXTERNAL    TRUElast_modified_by    hadooplast_modified_time    1490284413COLUMN_STATS_ACCURATE    falsetransient_lastDdlTime    1490284413numRows    -1totalSize    0rawDataSize    -1# 统计表信息0: jdbc:hive2://client:10000> analyze table test_broadcast compute statistics;# 查看表信息0: jdbc:hive2://client:10000>show tblproperties test_broadcast;numFiles    0EXTERNAL    TRUElast_modified_by    hadooplast_modified_time    1490284413transient_lastDdlTime    1490284751COLUMN_STATS_ACCURATE    truenumRows    65649782totalSize    0rawDataSize    2098020423# 再次重定向0: jdbc:hive2://client:10000> alter table test_broadcast set location 'hdfs://m000/user/hive/warehouse/tianchi.db/user_pay';# 查看表信息0: jdbc:hive2://client:10000>show tblproperties test_broadcast;numFiles    0EXTERNAL    TRUElast_modified_by    hadooplast_modified_time    1490284790transient_lastDdlTime    1490284790COLUMN_STATS_ACCURATE    falsenumRows    -1totalSize    0rawDataSize    -1

三、解决办法

  遇到这个问题有没有觉得很棘手。

1、问题分析

  其实方法已经在上面展示过了。我们来看一下org.apache.spark.sql.hive.HiveMetastoreCatalog类中关于statistics的注释:

// NOTE: getting totalSize directly from params is kind of hacky, but this should be
// relatively cheap if parameters for the table are populated into the metastore. An
// alternative would be going through Hadoop’s FileSystem API, which can be expensive if a lot
// of RPCs are involved. Besides totalSize, there are also numFiles, numRows,
// rawDataSize keys (see StatsSetupConst in Hive) that we can look at in the future.

  这里说,直接获取totalSize并不是一个友好的办法,直接从这里获取totalSize只是相对比较快而已。其实可以通过HDFS的FileSystem API来获取该表在HDFS上的文件大小的。

  看到这里,应该已经有了一个基本方法了。那就是将表B建为external table,并且每次遇到external table,直接去取hdfs上的文件大小。每次set location即可。这里唯一需要注意的是,通过HDFS FileSystem API获取文件大小是否耗时。这个可以通过下面的代码测试一下。

2、性能测试

  测一下HDFS FileSystem API获取文件大小的耗时。

object HDFSFilesystemTest {  def main(args: Array[String ]) {    val conf = new Configuration()    conf.set("fs.default.name" , "hdfs://m000:8020")    val hiveWarehouse = "/user/hive/warehouse"    val path = new Path(hiveWarehouse)    val fs: FileSystem = path.getFileSystem(conf)    val begin = System.currentTimeMillis()    val size = fs.getContentSummary(path).getLength    println( s"$hiveWarehouse size is: $size Bytes")    val end = System.currentTimeMillis()    println( s"time consume ${end - begin} ms")  }}

  输出结果如下,统计hive warehouse路径大小,耗时132毫秒。

/user/hive/warehouse size is: 4927963752 Bytestime consume 132 ms

3、源码修改

  那么接下来修改一下org.apache.spark.sql.hive.HiveMetastoreCatalog中获取表大小的逻辑就可以了。

  由于外部表会被重定向路径,或者指向路径的文件可以直接被put上来,所以统计的totalSize或者rawDataSize一般不准确。因此如果是外部表,直接获取hdfs文件大小。如果是非外部表,则按顺序取totalSize,rawDataSize的值,如果都不大于0,则通过HDFS FileSystem API获取hdfs文件大小了。

@transient override lazy val statistics: Statistics = Statistics(  sizeInBytes = {    BigInt (if ( hiveQlTable.getParameters.get("EXTERNAL" ) == "TRUE") {      try {        val hadoopConf = sqlContext.sparkContext.hadoopConfiguration        val fs: FileSystem = hiveQlTable.getPath.getFileSystem(hadoopConf)        fs.getContentSummary(hiveQlTable.getPath).getLength      } catch {        case e: IOException =>          logWarning("Failed to get table size from hdfs." , e)          sqlContext.conf.defaultSizeInBytes      }    } else {      val totalSize = hiveQlTable .getParameters.get(StatsSetupConst. TOTAL_SIZE)      val rawDataSize = hiveQlTable .getParameters.get(StatsSetupConst. RAW_DATA_SIZE)      // TODO: check if this estimate is valid for tables after partition pruning.      // NOTE: getting `totalSize` directly from params is kind of hacky, but this should be      // relatively cheap if parameters for the table are populated into the metastore.  An      // alternative would be going through Hadoop's FileSystem API, which can be expensive if a lot      // of RPCs are involved.  Besides `totalSize`, there are also `numFiles`, `numRows`,      // `rawDataSize` keys (see StatsSetupConst in Hive) that we can look at in the future.      if (totalSize != null && totalSize.toLong > 0L) {        totalSize.toLong      } else if (rawDataSize != null && rawDataSize.toLong > 0L) {        rawDataSize.toLong      } else {        try {          val hadoopConf = sqlContext.sparkContext.hadoopConfiguration          val fs: FileSystem = hiveQlTable.getPath.getFileSystem(hadoopConf)          fs.getContentSummary(hiveQlTable.getPath).getLength        } catch {          case e: IOException =>            logWarning("Failed to get table size from hdfs." , e)            sqlContext.conf.defaultSizeInBytes        }      }    })  })

  改为之后,看了一下Spark-2中的源码,发现在org.apache.spark.sql.hive.MetastoreRelation中已经对此次进行了修改,如下所示:

  @transient override lazy val statistics: Statistics = {    catalogTable.stats.getOrElse(Statistics(      sizeInBytes = {        val totalSize = hiveQlTable.getParameters.get(StatsSetupConst.TOTAL_SIZE)        val rawDataSize = hiveQlTable.getParameters.get(StatsSetupConst.RAW_DATA_SIZE)        // TODO: check if this estimate is valid for tables after partition pruning.        // NOTE: getting `totalSize` directly from params is kind of hacky, but this should be        // relatively cheap if parameters for the table are populated into the metastore.        // Besides `totalSize`, there are also `numFiles`, `numRows`, `rawDataSize` keys        // (see StatsSetupConst in Hive) that we can look at in the future.        BigInt(          // When table is external,`totalSize` is always zero, which will influence join strategy          // so when `totalSize` is zero, use `rawDataSize` instead          // when `rawDataSize` is also zero, use `HiveExternalCatalog.STATISTICS_TOTAL_SIZE`,          // which is generated by analyze command.          if (totalSize != null && totalSize.toLong > 0L) {            totalSize.toLong          } else if (rawDataSize != null && rawDataSize.toLong > 0) {            rawDataSize.toLong          } else if (sparkSession.sessionState.conf.fallBackToHdfsForStatsEnabled) {            try {              val hadoopConf = sparkSession.sessionState.newHadoopConf()              val fs: FileSystem = hiveQlTable.getPath.getFileSystem(hadoopConf)              fs.getContentSummary(hiveQlTable.getPath).getLength            } catch {              case e: IOException =>                logWarning("Failed to get table size from hdfs.", e)                sparkSession.sessionState.conf.defaultSizeInBytes            }          } else {            sparkSession.sessionState.conf.defaultSizeInBytes          })      }    ))  }

4、效果展示

  对一张文件大小小于spark.sql.autoBroadcastJoinThreshold的external table进行表信息统计,将totalSize和rawDataSize置为不大于0的值后。分别在改源码前后进行测试。

(1)改动前
  执行join语句,DAG执行计划如下图所示:
  
  这里写图片描述
  
  执行时间如下所示:

122 rows selected (58.862 seconds)

(2)改动后
  执行join语句,DAG执行计划如下图所示:
  
  这里写图片描述
  
  执行时间如下图所示:

122 rows selected (27.484 seconds)

  可以看到,改动源码后,走了broadcast join,并且执行时间明显缩短。

2 0