Spark Source Code Analysis: Coalescing Small Partitions


Reducing the number of partitions with the coalesce function

Merging small partitions does not require a shuffle: with shuffle = false (the default), coalesce builds a narrow dependency in which each new partition simply claims several of the existing partitions.

  /**
   * Return a new RDD that is reduced into `numPartitions` partitions.
   *
   * This results in a narrow dependency, e.g. if you go from 1000 partitions
   * to 100 partitions, there will not be a shuffle, instead each of the 100
   * new partitions will claim 10 of the current partitions.
   *
   * However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
   * this may result in your computation taking place on fewer nodes than
   * you like (e.g. one node in the case of numPartitions = 1). To avoid this,
   * you can pass shuffle = true. This will add a shuffle step, but means the
   * current upstream partitions will be executed in parallel (per whatever
   * the current partitioning is).
   *
   * Note: With shuffle = true, you can actually coalesce to a larger number
   * of partitions. This is useful if you have a small number of partitions,
   * say 100, potentially with a few partitions being abnormally large. Calling
   * coalesce(1000, shuffle = true) will result in 1000 partitions with the
   * data distributed using a hash partitioner.
   */
  def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null)
      : RDD[T] = withScope {
    if (shuffle) {
      /** Distributes elements evenly across output partitions, starting from a random partition. */
      val distributePartition = (index: Int, items: Iterator[T]) => {
        var position = (new Random(index)).nextInt(numPartitions)
        items.map { t =>
          // Note that the hash code of the key will just be the key itself. The HashPartitioner
          // will mod it with the number of total partitions.
          position = position + 1
          (position, t)
        }
      } : Iterator[(Int, T)]

      // include a shuffle step so that our upstream tasks are still distributed
      new CoalescedRDD(
        new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
        new HashPartitioner(numPartitions)),
        numPartitions).values
    } else {
      new CoalescedRDD(this, numPartitions)
    }
  }
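For illustration, here is a minimal spark-shell style sketch of both code paths. It assumes an existing SparkContext named sc; the partition counts are arbitrary:

  // Sketch only: assumes a spark-shell session with an existing SparkContext `sc`.
  val rdd = sc.parallelize(1 to 10000, 1000)   // 1000 small partitions

  // shuffle = false (default): narrow dependency, each of the 100 new
  // partitions claims 10 of the current ones, and no shuffle is performed.
  val merged = rdd.coalesce(100)
  println(merged.partitions.length)            // 100

  // Drastic coalesce: pass shuffle = true so the upstream 1000 tasks
  // still run in parallel before the single output partition is built.
  val single = rdd.coalesce(1, shuffle = true)
  println(single.partitions.length)            // 1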

Repartitioning with repartition

Increasing the number of partitions requires a shuffle; repartition always performs one, since it is simply coalesce(numPartitions, shuffle = true).

  /**
   * Return a new RDD that has exactly numPartitions partitions.
   *
   * Can increase or decrease the level of parallelism in this RDD. Internally, this uses
   * a shuffle to redistribute data.
   *
   * If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
   * which can avoid performing a shuffle.
   */
  def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    coalesce(numPartitions, shuffle = true)
  }
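A short sketch of how repartition behaves, again assuming an existing SparkContext named sc:

  // Sketch only: assumes an existing SparkContext `sc`.
  val rdd = sc.parallelize(1 to 10000, 2)      // low parallelism: 2 partitions

  // repartition always shuffles; here it increases parallelism to 8.
  // Internally this is just coalesce(8, shuffle = true).
  val wide = rdd.repartition(8)
  println(wide.partitions.length)              // 8

  // When decreasing partitions, prefer coalesce, which avoids the shuffle.
  val narrow = wide.coalesce(4)
  println(narrow.partitions.length)            // 4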