Spark Core 2.0: SerializedShuffleHandle, UnsafeShuffleWriter, and ShuffleExternalSorter


The serialized (Tungsten) shuffle path is used only when all three of the following conditions hold (checked by SortShuffleManager.canUseSerializedShuffle, shown below):

1. dependency.serializer.supportsRelocationOfSerializedObjects, i.e. the serializer supports relocation of serialized objects (KryoSerializer does by default; the default JavaSerializer does not).

2. !dependency.aggregator.isDefined, i.e. no map-side aggregation is used.

3. numPartitions <= MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE, i.e. the number of output partitions is at most 16,777,216 (2^24, the limit imposed by PackedRecordPointer's 24-bit partition id).

  /**
   * Helper method for determining whether a shuffle should use an optimized serialized shuffle
   * path or whether it should fall back to the original path that operates on deserialized objects.
   */
  def canUseSerializedShuffle(dependency: ShuffleDependency[_, _, _]): Boolean = {
    val shufId = dependency.shuffleId
    val numPartitions = dependency.partitioner.numPartitions
    if (!dependency.serializer.supportsRelocationOfSerializedObjects) {
      log.debug(s"Can't use serialized shuffle for shuffle $shufId because the serializer, " +
        s"${dependency.serializer.getClass.getName}, does not support object relocation")
      false
    } else if (dependency.aggregator.isDefined) {
      log.debug(
        s"Can't use serialized shuffle for shuffle $shufId because an aggregator is defined")
      false
    } else if (numPartitions > MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE) {
      log.debug(s"Can't use serialized shuffle for shuffle $shufId because it has more than " +
        s"$MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE partitions")
      false
    } else {
      log.debug(s"Can use serialized shuffle for shuffle $shufId")
      true
    }
  }
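To make the three conditions concrete, here is a rough driver-side sketch (not from the original post) of which jobs can take the serialized path, assuming Kryo is configured; KryoSerializer supports relocation of serialized objects by default, while the default JavaSerializer does not.

import org.apache.spark.{SparkConf, SparkContext}

object SerializedShuffleSketch {
  def main(args: Array[String]): Unit = {
    // Kryo supports relocation of serialized objects (by default); JavaSerializer does not.
    val conf = new SparkConf()
      .setAppName("serialized-shuffle-sketch")
      .setMaster("local[2]")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    val pairs = sc.parallelize(1 to 1000).map(i => (i % 10, i))

    // A plain repartition defines no aggregator, so with Kryo and no more than
    // 16,777,216 output partitions canUseSerializedShuffle returns true -> UnsafeShuffleWriter.
    pairs.repartition(8).count()

    // reduceByKey defines an aggregator (map-side combine), so canUseSerializedShuffle
    // returns false and the shuffle falls back to the deserialized SortShuffleWriter path.
    pairs.reduceByKey(_ + _).count()

    sc.stop()
  }
}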

If the handle is a SerializedShuffleHandle, SortShuffleManager.getWriter creates an UnsafeShuffleWriter to write the data:

    handle match {
      case unsafeShuffleHandle: SerializedShuffleHandle[K @unchecked, V @unchecked] =>
        new UnsafeShuffleWriter(
          env.blockManager,
          shuffleBlockResolver.asInstanceOf[IndexShuffleBlockResolver],
          context.taskMemoryManager(),
          unsafeShuffleHandle,
          mapId,
          context,
          env.conf)
The constructor of UnsafeShuffleWriter is as follows:
  public UnsafeShuffleWriter(
      BlockManager blockManager,
      IndexShuffleBlockResolver shuffleBlockResolver,
      TaskMemoryManager memoryManager,
      SerializedShuffleHandle<K, V> handle,
      int mapId,
      TaskContext taskContext,
      SparkConf sparkConf) throws IOException {
    final int numPartitions = handle.dependency().partitioner().numPartitions();
    if (numPartitions > SortShuffleManager.MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE()) {
      throw new IllegalArgumentException(
        "UnsafeShuffleWriter can only be used for shuffles with at most " +
        SortShuffleManager.MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE() +
        " reduce partitions");
    }
    this.blockManager = blockManager;
    this.shuffleBlockResolver = shuffleBlockResolver;
    this.memoryManager = memoryManager;
    this.mapId = mapId;
    final ShuffleDependency<K, V, V> dep = handle.dependency();
    this.shuffleId = dep.shuffleId();
    this.serializer = dep.serializer().newInstance();
    this.partitioner = dep.partitioner();
    this.writeMetrics = taskContext.taskMetrics().shuffleWriteMetrics();
    this.taskContext = taskContext;
    this.sparkConf = sparkConf;
    this.transferToEnabled = sparkConf.getBoolean("spark.file.transferTo", true);
    open();
  }
The open() method, called at the end of the constructor, creates the ShuffleExternalSorter and a 1 MB serialization buffer with a serialization stream on top of it:
  private void open() throws IOException {
    assert (sorter == null);
    sorter = new ShuffleExternalSorter(
      memoryManager,
      blockManager,
      taskContext,
      INITIAL_SORT_BUFFER_SIZE,
      partitioner.numPartitions(),
      sparkConf,
      writeMetrics);
    serBuffer = new MyByteArrayOutputStream(1024 * 1024);
    serOutputStream = serializer.serializeStream(serBuffer);
  }
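What open() wires together can be sketched outside the writer as well: a SerializationStream layered on top of an in-memory buffer. In the real writer, insertRecordIntoSorter writes each record's key and value into this stream and then hands the buffered bytes to the sorter. A minimal, illustrative Scala sketch (not Spark's internal code):

import java.io.ByteArrayOutputStream
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer

// A serialization stream over an in-memory buffer, mirroring serBuffer/serOutputStream above.
val serInstance = new KryoSerializer(new SparkConf()).newInstance()
val buffer = new ByteArrayOutputStream(1024 * 1024)
val stream = serInstance.serializeStream(buffer)

stream.writeObject(("someKey", 42))   // the writer uses writeKey/writeValue per record
stream.flush()
println(s"serialized ${buffer.size()} bytes")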

ShuffleExternalSorter is an external sorter specialized for sort-based shuffle; its class-level documentation explains the design:


/**
 * An external sorter that is specialized for sort-based shuffle.
 * <p>
 * Incoming records are appended to data pages. When all records have been inserted (or when the
 * current thread's shuffle memory limit is reached), the in-memory records are sorted according to
 * their partition ids (using a {@link ShuffleInMemorySorter}). The sorted records are then
 * written to a single output file (or multiple files, if we've spilled). The format of the output
 * files is the same as the format of the final output file written by
 * {@link org.apache.spark.shuffle.sort.SortShuffleWriter}: each output partition's records are
 * written as a single serialized, compressed stream that can be read with a new decompression and
 * deserialization stream.
 * <p>
 * Unlike {@link org.apache.spark.util.collection.ExternalSorter}, this sorter does not merge its
 * spill files. Instead, this merging is performed in {@link UnsafeShuffleWriter}, which uses a
 * specialized merge procedure that avoids extra serialization/deserialization.
 */
final class ShuffleExternalSorter extends MemoryConsumer {
The constructor of ShuffleExternalSorter:
  ShuffleExternalSorter(
      TaskMemoryManager memoryManager,
      BlockManager blockManager,
      TaskContext taskContext,
      int initialSize,
      int numPartitions,
      SparkConf conf,
      ShuffleWriteMetrics writeMetrics) {
    super(memoryManager,
      (int) Math.min(PackedRecordPointer.MAXIMUM_PAGE_SIZE_BYTES, memoryManager.pageSizeBytes()),
      memoryManager.getTungstenMemoryMode());
    this.taskMemoryManager = memoryManager;
    this.blockManager = blockManager;
    this.taskContext = taskContext;
    this.numPartitions = numPartitions;
    // Use getSizeAsKb (not bytes) to maintain backwards compatibility if no units are provided
    this.fileBufferSizeBytes = (int) conf.getSizeAsKb("spark.shuffle.file.buffer", "32k") * 1024;
    this.numElementsForSpillThreshold =
      conf.getLong("spark.shuffle.spill.numElementsForceSpillThreshold", 1024 * 1024 * 1024);
    this.writeMetrics = writeMetrics;
    this.inMemSorter = new ShuffleInMemorySorter(
      this, initialSize, conf.getBoolean("spark.shuffle.sort.useRadixSort", true));
    this.peakMemoryUsedBytes = getMemoryUsage();
  }
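For reference, the three configuration keys read by this constructor can be set on the SparkConf; the values below are purely illustrative (the defaults, as shown above, are 32k, 1024 * 1024 * 1024 and true):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.file.buffer", "64k")                              // DiskBlockObjectWriter buffer size
  .set("spark.shuffle.spill.numElementsForceSpillThreshold", "5000000") // force a spill after this many records
  .set("spark.shuffle.sort.useRadixSort", "true")                       // use radix sort in ShuffleInMemorySorter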


The main purpose of MyByteArrayOutputStream is to expose the underlying buf array so that it can be accessed directly from outside the stream.

  /** Subclass of ByteArrayOutputStream that exposes `buf` directly. */
  private static final class MyByteArrayOutputStream extends ByteArrayOutputStream {
    MyByteArrayOutputStream(int size) { super(size); }
    public byte[] getBuf() { return buf; }
  }
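ByteArrayOutputStream.toByteArray() copies the internal buffer on every call; exposing buf lets the writer hand serBuffer's bytes straight to the sorter without that copy. A standalone Scala analogue of the same trick (class and value names here are illustrative, not Spark's):

import java.io.ByteArrayOutputStream

// Expose the protected internal buffer so callers can read the serialized bytes
// in place instead of paying for the copy made by toByteArray().
class ExposedByteArrayOutputStream(size: Int) extends ByteArrayOutputStream(size) {
  def getBuf: Array[Byte] = buf   // only the first size() bytes are valid
}

val out = new ExposedByteArrayOutputStream(1024)
out.write(Array[Byte](1, 2, 3))
val view = out.getBuf             // no copy
assert(view.take(out.size()).sameElements(Array[Byte](1, 2, 3)))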
UnsafeShuffleWriter.write() feeds every record to insertRecordIntoSorter and then calls closeAndWriteOutput() to merge any spill files into the final map output:

  @Override
  public void write(scala.collection.Iterator<Product2<K, V>> records) throws IOException {
    // Keep track of success so we know if we encountered an exception
    // We do this rather than a standard try/catch/re-throw to handle
    // generic throwables.
    boolean success = false;
    try {
      while (records.hasNext()) {
        insertRecordIntoSorter(records.next());
      }
      closeAndWriteOutput();
      success = true;
    } finally {
      if (sorter != null) {
        try {
          sorter.cleanupResources();
        } catch (Exception e) {
          // Only throw this error if we won't be masking another
          // error.
          if (success) {
            throw e;
          } else {
            logger.error("In addition to a failure during writing, we failed during " +
                         "cleanup.", e);
          }
        }
      }
    }
  }


ShuffleExternalSorter.insertRecord() copies the serialized record into the current data page (spilling first if the record-count threshold is exceeded) and registers a pointer plus partition id in the in-memory sorter:

  /**
   * Write a record to the shuffle sorter.
   */
  public void insertRecord(Object recordBase, long recordOffset, int length, int partitionId)
    throws IOException {

    // for tests
    assert(inMemSorter != null);
    if (inMemSorter.numRecords() >= numElementsForSpillThreshold) {
      logger.info("Spilling data because number of spilledRecords crossed the threshold " +
        numElementsForSpillThreshold);
      spill();
    }

    growPointerArrayIfNecessary();
    // Need 4 bytes to store the record length.
    final int required = length + 4;
    acquireNewPageIfNecessary(required);

    assert(currentPage != null);
    final Object base = currentPage.getBaseObject();
    final long recordAddress = taskMemoryManager.encodePageNumberAndOffset(currentPage, pageCursor);
    Platform.putInt(base, pageCursor, length);
    pageCursor += 4;
    Platform.copyMemory(recordBase, recordOffset, base, pageCursor, length);
    pageCursor += length;
    inMemSorter.insertRecord(recordAddress, partitionId);
  }
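Inside the page, each record is therefore framed as a 4-byte length header followed by the serialized payload, while the in-memory sorter only keeps an 8-byte packed pointer (partition id, page number and offset, packed by PackedRecordPointer). The framing itself can be sketched on-heap with a plain ByteBuffer standing in for a Tungsten page (names are illustrative):

import java.nio.ByteBuffer

// Minimal sketch of the [4-byte length][payload] record framing used inside a data page.
// A real page is managed off-heap (or as a long[]) by TaskMemoryManager; a ByteBuffer stands in here.
val page = ByteBuffer.allocate(1 << 20)

def appendRecord(payload: Array[Byte]): Int = {
  val offset = page.position()       // "page cursor" before the write
  page.putInt(payload.length)        // length header, 4 bytes
  page.put(payload)                  // serialized record bytes
  offset                             // a real sorter would pack this offset with page id + partition id
}

def readRecord(offset: Int): Array[Byte] = {
  val length = page.getInt(offset)
  val out = new Array[Byte](length)
  // copy `length` bytes starting right after the 4-byte header
  for (i <- 0 until length) out(i) = page.get(offset + 4 + i)
  out
}

val off = appendRecord("hello shuffle".getBytes("UTF-8"))
assert(new String(readRecord(off), "UTF-8") == "hello shuffle")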


spill() is invoked in response to memory pressure (or the record-count threshold above): it sorts and writes the in-memory records to a spill file via writeSortedFile(false), frees the data pages, and resets the in-memory sorter:

  /**
   * Sort and spill the current records in response to memory pressure.
   */
  @Override
  public long spill(long size, MemoryConsumer trigger) throws IOException {
    if (trigger != this || inMemSorter == null || inMemSorter.numRecords() == 0) {
      return 0L;
    }

    logger.info("Thread {} spilling sort data of {} to disk ({} {} so far)",
      Thread.currentThread().getId(),
      Utils.bytesToString(getMemoryUsage()),
      spills.size(),
      spills.size() > 1 ? " times" : " time");

    writeSortedFile(false);
    final long spillSize = freeMemory();
    inMemSorter.reset();
    // Reset the in-memory sorter's pointer array only after freeing up the memory pages holding the
    // records. Otherwise, if the task is over allocated memory, then without freeing the memory
    // pages, we might not be able to get memory for the pointer array.
    taskContext.taskMetrics().incMemoryBytesSpilled(spillSize);
    return spillSize;
  }


writeSortedFile() performs the in-memory sort by partition id and streams the records to a single on-disk file, recording each partition's byte length in a SpillInfo:

  /**
   * Sorts the in-memory records and writes the sorted records to an on-disk file.
   * This method does not free the sort data structures.
   *
   * @param isLastFile if true, this indicates that we're writing the final output file and that the
   *                   bytes written should be counted towards shuffle spill metrics rather than
   *                   shuffle write metrics.
   */
  private void writeSortedFile(boolean isLastFile) throws IOException {
    final ShuffleWriteMetrics writeMetricsToUse;

    if (isLastFile) {
      // We're writing the final non-spill file, so we _do_ want to count this as shuffle bytes.
      writeMetricsToUse = writeMetrics;
    } else {
      // We're spilling, so bytes written should be counted towards spill rather than write.
      // Create a dummy WriteMetrics object to absorb these metrics, since we don't want to count
      // them towards shuffle bytes written.
      writeMetricsToUse = new ShuffleWriteMetrics();
    }

    // This call performs the actual sort.
    final ShuffleInMemorySorter.ShuffleSorterIterator sortedRecords =
      inMemSorter.getSortedIterator();

    // Currently, we need to open a new DiskBlockObjectWriter for each partition; we can avoid this
    // after SPARK-5581 is fixed.
    DiskBlockObjectWriter writer;

    // Small writes to DiskBlockObjectWriter will be fairly inefficient. Since there doesn't seem to
    // be an API to directly transfer bytes from managed memory to the disk writer, we buffer
    // data through a byte array. This array does not need to be large enough to hold a single
    // record;
    final byte[] writeBuffer = new byte[DISK_WRITE_BUFFER_SIZE];

    // Because this output will be read during shuffle, its compression codec must be controlled by
    // spark.shuffle.compress instead of spark.shuffle.spill.compress, so we need to use
    // createTempShuffleBlock here; see SPARK-3426 for more details.
    final Tuple2<TempShuffleBlockId, File> spilledFileInfo =
      blockManager.diskBlockManager().createTempShuffleBlock();
    final File file = spilledFileInfo._2();
    final TempShuffleBlockId blockId = spilledFileInfo._1();
    final SpillInfo spillInfo = new SpillInfo(numPartitions, file, blockId);

    // Unfortunately, we need a serializer instance in order to construct a DiskBlockObjectWriter.
    // Our write path doesn't actually use this serializer (since we end up calling the `write()`
    // OutputStream methods), but DiskBlockObjectWriter still calls some methods on it. To work
    // around this, we pass a dummy no-op serializer.
    final SerializerInstance ser = DummySerializerInstance.INSTANCE;

    writer = blockManager.getDiskWriter(blockId, file, ser, fileBufferSizeBytes, writeMetricsToUse);

    int currentPartition = -1;
    while (sortedRecords.hasNext()) {
      sortedRecords.loadNext();
      final int partition = sortedRecords.packedRecordPointer.getPartitionId();
      assert (partition >= currentPartition);
      if (partition != currentPartition) {
        // Switch to the new partition
        if (currentPartition != -1) {
          writer.commitAndClose();
          spillInfo.partitionLengths[currentPartition] = writer.fileSegment().length();
        }
        currentPartition = partition;
        writer =
          blockManager.getDiskWriter(blockId, file, ser, fileBufferSizeBytes, writeMetricsToUse);
      }

      final long recordPointer = sortedRecords.packedRecordPointer.getRecordPointer();
      final Object recordPage = taskMemoryManager.getPage(recordPointer);
      final long recordOffsetInPage = taskMemoryManager.getOffsetInPage(recordPointer);
      int dataRemaining = Platform.getInt(recordPage, recordOffsetInPage);
      long recordReadPosition = recordOffsetInPage + 4; // skip over record length
      while (dataRemaining > 0) {
        final int toTransfer = Math.min(DISK_WRITE_BUFFER_SIZE, dataRemaining);
        Platform.copyMemory(
          recordPage, recordReadPosition, writeBuffer, Platform.BYTE_ARRAY_OFFSET, toTransfer);
        writer.write(writeBuffer, 0, toTransfer);
        recordReadPosition += toTransfer;
        dataRemaining -= toTransfer;
      }
      writer.recordWritten();
    }

    if (writer != null) {
      writer.commitAndClose();
      // If `writeSortedFile()` was called from `closeAndGetSpills()` and no records were inserted,
      // then the file might be empty. Note that it might be better to avoid calling
      // writeSortedFile() in that case.
      if (currentPartition != -1) {
        spillInfo.partitionLengths[currentPartition] = writer.fileSegment().length();
        spills.add(spillInfo);
      }
    }

    if (!isLastFile) {  // i.e. this is a spill file
      // The current semantics of `shuffleRecordsWritten` seem to be that it's updated when records
      // are written to disk, not when they enter the shuffle sorting code. DiskBlockObjectWriter
      // relies on its `recordWritten()` method being called in order to trigger periodic updates to
      // `shuffleBytesWritten`. If we were to remove the `recordWritten()` call and increment that
      // counter at a higher-level, then the in-progress metrics for records written and bytes
      // written would get out of sync.
      //
      // When writing the last file, we pass `writeMetrics` directly to the DiskBlockObjectWriter;
      // in all other cases, we pass in a dummy write metrics to capture metrics, then copy those
      // metrics to the true write metrics here. The reason for performing this copying is so that
      // we can avoid reporting spilled bytes as shuffle write bytes.
      //
      // Note that we intentionally ignore the value of `writeMetricsToUse.shuffleWriteTime()`.
      // Consistent with ExternalSorter, we do not count this IO towards shuffle write time.
      // This means that this IO time is not accounted for anywhere; SPARK-3577 will fix this.
      writeMetrics.incRecordsWritten(writeMetricsToUse.recordsWritten());
      taskContext.taskMetrics().incDiskBytesSpilled(writeMetricsToUse.bytesWritten());
    }
  }
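Since each file written here stores its partitions contiguously in ascending partition-id order, the per-partition lengths recorded above are enough to locate any partition later via a prefix sum over the lengths; the final shuffle index file written through IndexShuffleBlockResolver is based on the same idea. A small sketch of that offset arithmetic:

// Given per-partition byte lengths of one output file, the start offset of
// partition i is the sum of all preceding lengths (this is what the shuffle
// index file stores for the final merged output).
val partitionLengths = Array[Long](120L, 0L, 340L, 75L)
val offsets = partitionLengths.scanLeft(0L)(_ + _)    // 0, 120, 120, 460, 535

def segmentOf(partition: Int): (Long, Long) =
  (offsets(partition), partitionLengths(partition))   // (start offset, length) within the file

assert(segmentOf(2) == (120L, 340L))

SpillInfo itself is just this per-partition length array plus the file and its block id: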

/**
 * Metadata for a block of data written by {@link ShuffleExternalSorter}.
 */
final class SpillInfo {
  final long[] partitionLengths;
  final File file;
  final TempShuffleBlockId blockId;

  SpillInfo(int numPartitions, File file, TempShuffleBlockId blockId) {
    this.partitionLengths = new long[numPartitions];
    this.file = file;
    this.blockId = blockId;
  }
}



