Spark Core 2.0 Compression


When Spark writes out a block, it calls SerializerManager's wrapForCompression(blockId, outputStream). That method first calls shouldCompress(blockId) to decide whether the block's output stream should be compressed.

  private def shouldCompress(blockId: BlockId): Boolean = {
    blockId match {
      case _: ShuffleBlockId => compressShuffle
      case _: BroadcastBlockId => compressBroadcast
      case _: RDDBlockId => compressRdds
      case _: TempLocalBlockId => compressShuffleSpill
      case _: TempShuffleBlockId => compressShuffle
      case _ => false
    }
  }
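
To tie the two steps together, here is a minimal sketch (a simplification, not the exact Spark source) of what the wrapping step can look like, assuming the shouldCompress and compressionCodec members shown in this post and the CompressionCodec.compressedOutputStream method:

  import java.io.OutputStream

  // Sketch: wrap the raw block output stream with the codec's compressed stream
  // only when shouldCompress(blockId) says this block type should be compressed.
  def wrapForCompression(blockId: BlockId, s: OutputStream): OutputStream = {
    if (shouldCompress(blockId)) compressionCodec.compressedOutputStream(s) else s
  }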

The defaults below show that RDD blocks are not compressed by default, while the other block types are.

  // Whether to compress broadcast variables that are stored
  private[this] val compressBroadcast = conf.getBoolean("spark.broadcast.compress", true)
  // Whether to compress shuffle output that are stored
  private[this] val compressShuffle = conf.getBoolean("spark.shuffle.compress", true)
  // Whether to compress RDD partitions that are stored serialized
  private[this] val compressRdds = conf.getBoolean("spark.rdd.compress", false)
  // Whether to compress shuffle output temporarily spilled to disk
  private[this] val compressShuffleSpill = conf.getBoolean("spark.shuffle.spill.compress", true)

  /* The compression codec to use. Note that the "lazy" val is necessary because we want to delay
   * the initialization of the compression codec until it is first used. The reason is that a Spark
   * program could be using a user-defined codec in a third party jar, which is loaded in
   * Executor.updateDependencies. When the BlockManager is initialized, user level jars hasn't been
   * loaded yet. */
  private lazy val compressionCodec: CompressionCodec = CompressionCodec.createCodec(conf)
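
These flags are ordinary SparkConf booleans, so they can be overridden when building the application's configuration. A small example (the application name is made up) that turns on RDD compression and spells out the other defaults explicitly:

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("compression-demo")              // hypothetical app name
    .set("spark.rdd.compress", "true")           // default is false
    .set("spark.broadcast.compress", "true")     // default is true
    .set("spark.shuffle.compress", "true")       // default is true
    .set("spark.shuffle.spill.compress", "true") // default is true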

The compression algorithm is chosen by CompressionCodec.createCodec(conf); as the code below shows, the default codec is lz4.

  def getCodecName(conf: SparkConf): String = {
    conf.get(configKey, DEFAULT_COMPRESSION_CODEC)
  }

  def createCodec(conf: SparkConf): CompressionCodec = {
    createCodec(conf, getCodecName(conf))
  }

val DEFAULT_COMPRESSION_CODEC = "lz4"
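
The configKey read by getCodecName is spark.io.compression.codec, so switching codecs is just another configuration setting. A sketch, assuming the built-in short names available in Spark 2.0 ("lz4", "lzf", "snappy"):

  import org.apache.spark.SparkConf

  // Override the default lz4 codec with snappy; a fully-qualified class name of a
  // custom CompressionCodec implementation can also be supplied here.
  val conf = new SparkConf().set("spark.io.compression.codec", "snappy")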


