




  /**
   * Merge the values for each key using an associative and commutative reduce function. This will
   * also perform the merging locally on each mapper before sending results to a reducer, similarly
   * to a "combiner" in MapReduce.
   * Takes a partitioner and repartitions the output according to it.
   */
  def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
    combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
  }

  /**
   * Merge the values for each key using an associative and commutative reduce function. This will
   * also perform the merging locally on each mapper before sending results to a reducer, similarly
   * to a "combiner" in MapReduce. Output will be hash-partitioned with numPartitions partitions.
   * Sets a new number of partitions.
   */
  def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)] = self.withScope {
    reduceByKey(new HashPartitioner(numPartitions), func)
  }

  /**
   * Merge the values for each key using an associative and commutative reduce function. This will
   * also perform the merging locally on each mapper before sending results to a reducer, similarly
   * to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/
   * parallelism level.
   * Uses the default partitioner.
   */
  def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
    reduceByKey(defaultPartitioner(self), func)
  }

Reading further, the main logic of reduceByKey lives in the call combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner). Here is the source of that method:

  def combineByKeyWithClassTag[C](
      createCombiner: V => C,      // turn a V into a C
      mergeValue: (C, V) => C,     // merge a V into an existing C
      mergeCombiners: (C, C) => C, // merge two Cs into one
      partitioner: Partitioner,
      mapSideCombine: Boolean = true,
      serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
    require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
    // Array keys are rejected when map-side combining is enabled
    // or when a HashPartitioner is used; both cases throw an exception.
    if (keyClass.isArray) {
      if (mapSideCombine) {
        throw new SparkException("Cannot use map-side combining with array keys.")
      }
      // A hash partitioner cannot be applied to array keys
      if (partitioner.isInstanceOf[HashPartitioner]) {
        throw new SparkException("HashPartitioner cannot partition array keys.")
      }
    }
    val aggregator = new Aggregator[K, V, C](
      self.context.clean(createCombiner),
      self.context.clean(mergeValue),
      self.context.clean(mergeCombiners))
    // If the RDD is already partitioned by the given partitioner,
    // combine the values inside each partition without a shuffle.
    if (self.partitioner == Some(partitioner)) {
      self.mapPartitions(iter => {
        val context = TaskContext.get()
        new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
      }, preservesPartitioning = true)
    } else {
      // Otherwise return a new ShuffledRDD
      new ShuffledRDD[K, V, C](self, partitioner)
        .setSerializer(serializer)
        .setAggregator(aggregator)
        .setMapSideCombine(mapSideCombine)
    }
  }
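To make the roles of the three functions concrete, here is a minimal sketch in plain Scala collections. It does not use Spark: the object name CombineByKeySketch, the Partitions alias, and the helper combinePartition are made up for illustration, and a nested Seq stands in for an RDD's partitions. It only models the semantics of createCombiner, mergeValue and mergeCombiners, not Spark's Aggregator or shuffle machinery.

```scala
object CombineByKeySketch {
  // Hypothetical stand-in for an RDD: each inner Seq models one partition.
  type Partitions[K, V] = Seq[Seq[(K, V)]]

  // Fold one partition's records into per-key combiners (the map-side step).
  def combinePartition[K, V, C](
      records: Seq[(K, V)],
      createCombiner: V => C,
      mergeValue: (C, V) => C): Map[K, C] =
    records.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
      val c = acc.get(k) match {
        case Some(existing) => mergeValue(existing, v) // key already seen in this partition
        case None           => createCombiner(v)       // first value for this key
      }
      acc.updated(k, c)
    }

  // Merge the per-partition combiners of each key (the reduce-side step).
  def combineByKey[K, V, C](
      partitions: Partitions[K, V],
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] =
    partitions
      .map(p => combinePartition(p, createCombiner, mergeValue))
      .foldLeft(Map.empty[K, C]) { (acc, partial) =>
        partial.foldLeft(acc) { case (m, (k, c)) =>
          m.updated(k, m.get(k).fold(c)(mergeCombiners(_, c)))
        }
      }

  // reduceByKey is the special case C = V, with func used for both merges.
  def reduceByKey[K, V](partitions: Partitions[K, V])(func: (V, V) => V): Map[K, V] =
    combineByKey[K, V, V](partitions, v => v, func, func)
}
```

For example, calling CombineByKeySketch.reduceByKey(Seq(Seq("a" -> 1, "a" -> 2, "b" -> 1), Seq("a" -> 3, "b" -> 2, "c" -> 1)))(_ + _) sums the values per key across both modeled partitions.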


The three overloads differ only in the arguments they take and implement the same functionality. To control how the result is partitioned, use the overload that takes a partitioner; to change the number of partitions, use the overload that takes a partition count; to keep the default settings, use the overload without a partitioning argument.

  /**
   * Group the values for each key in the RDD into a single sequence. Hash-partitions the
   * resulting RDD with the existing partitioner/parallelism level. The ordering of elements
   * within each group is not guaranteed, and may even differ each time the resulting RDD is
   * evaluated.
   *
   * @note This operation may be very expensive. If you are grouping in order to perform an
   * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
   * or `PairRDDFunctions.reduceByKey` will provide much better performance.
   * Overload that uses the default settings.
   */
  def groupByKey(): RDD[(K, Iterable[V])] = self.withScope {
    groupByKey(defaultPartitioner(self))
  }
  /**
   * Group the values for each key in the RDD into a single sequence. Allows controlling the
   * partitioning of the resulting key-value pair RDD by passing a Partitioner.
   * The ordering of elements within each group is not guaranteed, and may even differ
   * each time the resulting RDD is evaluated.
   *
   * @note This operation may be very expensive. If you are grouping in order to perform an
   * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
   * or `PairRDDFunctions.reduceByKey` will provide much better performance.
   *
   * @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any
   * key in memory. If a key has too many values, it can result in an [[OutOfMemoryError]].
   * Overload that takes a partitioner.
   */
  def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
    // groupByKey shouldn't use map side combine because map side combine does not
    // reduce the amount of data shuffled and requires all map side data be inserted
    // into a hash table, leading to more objects in the old gen.
    val createCombiner = (v: V) => CompactBuffer(v)
    val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
    val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
    val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
      createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
    bufs.asInstanceOf[RDD[(K, Iterable[V])]]
  }
  /**
   * Group the values for each key in the RDD into a single sequence. Hash-partitions the
   * resulting RDD with into `numPartitions` partitions. The ordering of elements within
   * each group is not guaranteed, and may even differ each time the resulting RDD is evaluated.
   *
   * @note This operation may be very expensive. If you are grouping in order to perform an
   * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
   * or `PairRDDFunctions.reduceByKey` will provide much better performance.
   *
   * @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any
   * key in memory. If a key has too many values, it can result in an [[OutOfMemoryError]].
   * Overload that takes a number of partitions.
   */
  def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])] = self.withScope {
    groupByKey(new HashPartitioner(numPartitions))
  }



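  1. The return types differ: reduceByKey returns RDD[(K, V)], while groupByKey returns RDD[(K, Iterable[V])]. As an example, take an RDD containing (a,1),(a,2),(a,3),(b,1),(b,2),(c,1) and compute a sum with each method. reduceByKey produces the intermediate result (a,6),(b,3),(c,1), while groupByKey produces the intermediate result ((a,1)(a,2)(a,3)),((b,1)(b,2)),(c,1) (both are the intermediate results within a single partition). The groupByKey result clearly consumes more resources.
  2. The purposes differ: reduceByKey performs an aggregation (a sum, an XOR, and so on), while groupByKey mainly groups values, although an aggregation can still be applied after the grouping.
  3. The map-side handling of the values for each key differs: reduceByKey combines them on the map side (mapSideCombine defaults to true), while groupByKey passes mapSideCombine = false and combines nothing before the shuffle.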
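The example data from point 1 can be worked through with plain Scala collections. This is only a sketch of the result shapes, not Spark code: the object name ReduceVsGroupExample is made up for illustration, and a local groupBy stands in for both operations.

```scala
object ReduceVsGroupExample {
  // The sample records used in the comparison above.
  val data: Seq[(String, Int)] =
    Seq("a" -> 1, "a" -> 2, "a" -> 3, "b" -> 1, "b" -> 2, "c" -> 1)

  // Shape of a groupByKey result: every value is kept, key -> Seq[V]
  // (a Seq is an Iterable).
  val grouped: Map[String, Seq[Int]] =
    data.groupBy(_._1).map { case (k, pairs) => k -> pairs.map(_._2) }

  // Shape of a reduceByKey result: one merged value per key, key -> V.
  val reduced: Map[String, Int] =
    data.groupBy(_._1).map { case (k, pairs) => k -> pairs.map(_._2).reduce(_ + _) }

  // Summing after grouping reaches the same totals as reducing directly.
  val summedAfterGrouping: Map[String, Int] =
    grouped.map { case (k, vs) => k -> vs.sum }
}
```

Both paths end with the same sums; the difference is that the grouped form has to carry every individual value until the final sum is taken.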


val words = Array("a", "a", "a", "b", "b", "b")
val wordPairsRDD = sc.parallelize(words).map(word => (word, 1))
val wordCountsWithReduce = wordPairsRDD.reduceByKey(_ + _)  // reduceByKey
val wordCountsWithGroup = wordPairsRDD.groupByKey().map(t => (t._1, t._2.sum))  // groupByKey


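  1. Before any data moves between partitions, reduceByKey sums the values of each key inside every partition, and then reduces those per-partition results again for each key across all partitions. (Figure: the whole reduceByKey process.)
  2. groupByKey moves every key-value pair of every partition and only then computes the overall sum. This makes the traffic between cluster nodes expensive and the transfer inefficient, and because all values of a key must be held in memory, it is also the underlying cause of the out-of-memory error mentioned above.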
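The difference in shuffled data can be estimated with a small model. The sketch below is plain Scala, not Spark: the object name ShuffleCostSketch and the two-partition layout of the word pairs are assumptions made for illustration, and "records crossing the shuffle" is simply counted as the number of pairs each path would emit.

```scala
object ShuffleCostSketch {
  // Hypothetical input: the six (word, 1) pairs split over two partitions.
  val partitions: Seq[Seq[(String, Int)]] = Seq(
    Seq("a" -> 1, "a" -> 1, "b" -> 1),
    Seq("a" -> 1, "b" -> 1, "b" -> 1)
  )

  // With map-side combining (the reduceByKey path): each partition emits
  // one pre-summed record per distinct key.
  val shuffledWithCombine: Seq[(String, Int)] =
    partitions.flatMap { p =>
      p.groupBy(_._1).map { case (k, pairs) => k -> pairs.map(_._2).sum }.toSeq
    }

  // Without map-side combining (the groupByKey path): every record
  // crosses the shuffle unchanged.
  val shuffledWithoutCombine: Seq[(String, Int)] = partitions.flatten

  // Reduce-side merge, identical for both paths in this model.
  def totals(records: Seq[(String, Int)]): Map[String, Int] =
    records.groupBy(_._1).map { case (k, pairs) => k -> pairs.map(_._2).sum }
}
```

In this model the combined path emits 4 records and the uncombined path emits 6, while totals gives the same counts for both. The gap grows with the number of repeated keys per partition.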