Spark Optimization (1) --- Partitioning


Spark lets you adjust a job's parallelism through partitioning, which reduces the communication cost of a distributed program. Controlling how data is partitioned therefore cuts network transfer and improves performance.

This article looks at Spark tuning from the angle of partitioning.

Business scenario:

    The input data is large at the start, so it is split into many partitions; after a series of operators (filter, for example) most of those partitions end up holding very little data. With so many nearly empty partitions, the per-task thread overhead outweighs the useful work and the application's performance actually drops.

Solution:

    Reduce the number of partitions to a sensible value. Because the filtered data is much smaller, fewer partitions both cut the thread overhead and still fit within a single node's processing capacity, which improves performance.

    Call repartition or coalesce to repartition the RDD, as sketched below.
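
A minimal sketch of this pattern, assuming made-up input/output paths, a hypothetical filter condition, and illustrative partition counts (1000 before, 100 after):

import org.apache.spark.{SparkConf, SparkContext}

object CompactAfterFilter {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("compact-after-filter").setMaster("local[*]"))

    // Read with many partitions (hypothetical path and count).
    val raw = sc.textFile("hdfs:///logs/input", minPartitions = 1000)

    // A selective filter shrinks the data, but the partition count stays the same.
    val errors = raw.filter(_.contains("ERROR"))

    // coalesce with the default shuffle = false narrows 1000 partitions into 100
    // without a shuffle: each new partition claims roughly 10 of the old ones.
    val compacted = errors.coalesce(100)
    compacted.saveAsTextFile("hdfs:///logs/errors")

    sc.stop()
  }
}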


Source code and commentary:

/**
   * Return a new RDD that is reduced into `numPartitions` partitions.
   *
   * This results in a narrow dependency, e.g. if you go from 1000 partitions
   * to 100 partitions, there will not be a shuffle, instead each of the 100
   * new partitions will claim 10 of the current partitions.
   *
   * However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
   * this may result in your computation taking place on fewer nodes than
   * you like (e.g. one node in the case of numPartitions = 1). To avoid this,
   * you can pass shuffle = true. This will add a shuffle step, but means the
   * current upstream partitions will be executed in parallel (per whatever
   * the current partitioning is).
   *
   * Note: With shuffle = true, you can actually coalesce to a larger number
   * of partitions. This is useful if you have a small number of partitions,
   * say 100, potentially with a few partitions being abnormally large. Calling
   * coalesce(1000, shuffle = true) will result in 1000 partitions with the
   * data distributed using a hash partitioner.
   */
  def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null)
      : RDD[T] = withScope {
    if (shuffle) {
      /** Distributes elements evenly across output partitions, starting from a random partition. */
      val distributePartition = (index: Int, items: Iterator[T]) => {
        var position = (new Random(index)).nextInt(numPartitions)
        items.map { t =>
          // Note that the hash code of the key will just be the key itself. The HashPartitioner
          // will mod it with the number of total partitions.
          position = position + 1
          (position, t)
        }
      } : Iterator[(Int, T)]


      // include a shuffle step so that our upstream tasks are still distributed
      new CoalescedRDD(
        new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
        new HashPartitioner(numPartitions)),
        numPartitions).values
    } else {
      new CoalescedRDD(this, numPartitions)
    }
  }

  def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    coalesce(numPartitions, shuffle = true)
  }
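
To make the shuffle = true branch concrete, the following is a plain-Scala sketch (not Spark code) that mimics the key assignment done by distributePartition together with HashPartitioner: each input partition starts at a seeded random output partition and then hands its elements out round-robin.

import scala.util.Random

object DistributeSketch {
  // Mimics distributePartition + HashPartitioner for a single input partition:
  // returns the output partition chosen for each of that partition's items.
  def targetPartitions(index: Int, numItems: Int, numPartitions: Int): Seq[Int] = {
    var position = new Random(index).nextInt(numPartitions)  // random starting point per input partition
    (1 to numItems).map { _ =>
      position += 1                                           // round-robin key, as in the source above
      // HashPartitioner takes key.hashCode modulo numPartitions; an Int hashes to itself.
      ((position % numPartitions) + numPartitions) % numPartitions
    }
  }

  def main(args: Array[String]): Unit = {
    // The 8 items of input partition 0 are spread evenly across 4 output partitions.
    println(targetPartitions(index = 0, numItems = 8, numPartitions = 4))
  }
}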


As the source shows, repartition simply calls coalesce with shuffle = true. coalesce itself will be covered in more detail in a later article.
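
A quick way to see the difference in spark-shell (a sketch assuming sc is the shell's SparkContext; the element and partition counts are illustrative):

val rdd = sc.parallelize(1 to 1000000, numSlices = 8)

rdd.coalesce(4).partitions.length                    // 4  -- narrow dependency, no shuffle
rdd.coalesce(16).partitions.length                   // 8  -- without a shuffle coalesce cannot grow the partition count
rdd.coalesce(16, shuffle = true).partitions.length   // 16 -- the shuffle redistributes the data
rdd.repartition(16).partitions.length                // 16 -- same as coalesce(16, shuffle = true)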



 
