Hadoop MapReduce: ReduceTask Execution (Part 5)

Source: Internet. Editor: 程序博客网. Time: 2024/05/05 10:59

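This part analyzes the last phase of ReduceTask: reduce. Once the copy and sort phases have finished, the input to reduce is ready. That input is supplied through Reducer.Context, which wraps the iterator produced by the sort phase and can iterate over KV pairs held both in memory and on disk. Two nested loops matter here: (1) the loop over KEYs, implemented by the Reducer class itself (see its run function); (2) the loop over VALUEs, written inside the user-defined reduce function. The user-defined reduce function consumes the data exposed by the value iterator; when that iterator has no more data, the current KEY is exhausted and the next KEY must be processed. The output of reduce is written directly to its destination, such as HDFS, and the exact location is configurable. Let us first look at how the run function in Reducer implements the KEY loop.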

    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        while (context.nextKey()) { // loop over the KEYs
            // enter the user-defined reduce function
            reduce(context.getCurrentKey(), context.getValues(), context);
        }
        cleanup(context);
    }

The logic of the nextKey function is as follows:

    /** Start processing next unique key. */
    public boolean nextKey() throws IOException, InterruptedException {
        // Skip any records of the current KEY that reduce() left unread;
        // nextKeyIsSame is false once the next record belongs to a new KEY
        while (hasMore && nextKeyIsSame) {
            nextKeyValue();
        }
        if (hasMore) {
            if (inputKeyCounter != null) {
                inputKeyCounter.increment(1);
            }
            return nextKeyValue(); // new KEY: read ahead one KV
        } else {
            return false;
        }
    }

The KV read-ahead logic is as follows:

    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (!hasMore) {
            key = null;
            value = null;
            return false;
        }
        // When the first record of a new KEY is read, nextKeyIsSame is still
        // false, so firstValue becomes true. For every later record of the
        // same KEY, firstValue becomes false.
        firstValue = !nextKeyIsSame;
        // Read the KV into the buffer and deserialize it
        DataInputBuffer next = input.getKey();
        currentRawKey.set(next.getData(), next.getPosition(),
                          next.getLength() - next.getPosition());
        buffer.reset(currentRawKey.getBytes(), 0, currentRawKey.getLength());
        key = keyDeserializer.deserialize(key);
        next = input.getValue();
        buffer.reset(next.getData(), next.getPosition(), next.getLength());
        value = valueDeserializer.deserialize(value);
        // Advance one more record and compare its KEY with the current one
        // to set nextKeyIsSame
        hasMore = input.next();
        if (hasMore) {
            next = input.getKey();
            nextKeyIsSame = comparator.compare(currentRawKey.getBytes(), 0,
                                               currentRawKey.getLength(),
                                               next.getData(),
                                               next.getPosition(),
                                               next.getLength() - next.getPosition()
                                               ) == 0;
        } else {
            nextKeyIsSame = false;
        }
        inputValueCounter.increment(1);
        return true;
    }
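To make the one-record lookahead easier to follow, here is a minimal sketch that traces the two flags record by record. It is not Hadoop code: the class LookaheadTrace, its trace method, and the use of plain String keys and int values in two parallel arrays are assumptions made for illustration. Only the flag updates mirror nextKeyValue above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the reduce context: the "sorted input" is two
// parallel arrays, and String.equals replaces the raw-bytes comparator.
class LookaheadTrace {
    private final String[] sortedKeys;
    private final int[] sortedValues;
    private int pos = 0;                  // index of the record to read next
    private boolean hasMore;
    private boolean nextKeyIsSame = false;
    private boolean firstValue = false;
    private String key;
    private int value;

    LookaheadTrace(String[] sortedKeys, int[] sortedValues) {
        this.sortedKeys = sortedKeys;
        this.sortedValues = sortedValues;
        this.hasMore = sortedKeys.length > 0;
    }

    // Same flag logic as nextKeyValue: read the current record, then peek at
    // the following key to decide whether it belongs to the same group.
    boolean nextKeyValue() {
        if (!hasMore) {
            key = null;
            return false;
        }
        firstValue = !nextKeyIsSame;
        key = sortedKeys[pos];
        value = sortedValues[pos];
        pos++;
        hasMore = pos < sortedKeys.length;
        nextKeyIsSame = hasMore && sortedKeys[pos].equals(key);
        return true;
    }

    // Reads every record and reports the flag values after each read.
    static List<String> trace(String[] keys, int[] values) {
        LookaheadTrace ctx = new LookaheadTrace(keys, values);
        List<String> out = new ArrayList<>();
        while (ctx.nextKeyValue()) {
            out.add(ctx.key + "=" + ctx.value
                    + " firstValue=" + ctx.firstValue
                    + " nextKeyIsSame=" + ctx.nextKeyIsSame);
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : trace(new String[] {"a", "a", "b"}, new int[] {1, 2, 3})) {
            System.out.println(line);
        }
    }
}
```

For the input a=1, a=2, b=3 the trace shows firstValue true on the first record of each key, and nextKeyIsSame true only while the following record carries the same key.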

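Once the KEY check is done and more data is known to exist, control enters the user-defined reduce function. We take WordCount as the example. Because the function iterates over all VALUEs that share one KEY, it receives an Iterable (the second parameter), which wraps org.apache.hadoop.mapreduce.ReduceContext.ValueIterator.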

    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        Iterator<IntWritable> iterator = values.iterator(); // obtain the value iterator
        int sum = 0;
        while (iterator.hasNext()) {         // is there another VALUE for this KEY?
            sum += iterator.next().get();    // user-defined operation: accumulate the count
        }
        context.write(key, new IntWritable(sum)); // write the result out
    }

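During the VALUE iteration, every read of a VALUE also checks whether the next record's KEY is the same as the current one and sets nextKeyIsSame accordingly. When a KEY has several VALUEs, nextKeyIsSame turning false means the current KEY is exhausted and the next KEY must be processed.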

    protected class ValueIterator implements Iterator<VALUEIN> {

        @Override
        public boolean hasNext() {
            return firstValue || nextKeyIsSame;
        }

        @Override
        public VALUEIN next() {
            // The first record of a KEY was already read by nextKey(), so
            // return it directly; note that firstValue changes state here
            if (firstValue) {
                firstValue = false;
                return value;
            }
            // if this isn't the first record and the next key is different, they
            // can't advance it here.
            if (!nextKeyIsSame) {
                throw new NoSuchElementException("iterate past last value");
            }
            // Read the next KV; see the analysis of nextKeyValue above
            try {
                nextKeyValue();
                return value;
            } catch (IOException ie) {
                throw new RuntimeException("next value iterator failed", ie);
            } catch (InterruptedException ie) {
                // this is bad, but we can't modify the exception list of java.util
                throw new RuntimeException("next value iterator interrupted", ie);
            }
        }

        @Override
        public void remove() {
            throw new UnsupportedOperationException("remove not implemented");
        }
    }
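The interplay between the KEY loop and the VALUE iterator can be sketched with plain Java collections. This is a simplified model, not Hadoop's implementation: GroupingSketch, its run method, and the limit parameter (how many values the toy reduce reads per key) are hypothetical names introduced here. The sketch shows that both loops share one cursor, and that nextKey skips whatever values the reduce step left unread.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical model of the reduce context over an in-memory sorted input.
class GroupingSketch {
    private final String[] sortedKeys;
    private final int[] sortedValues;
    private int pos = 0;
    private boolean hasMore;
    private boolean nextKeyIsSame = false;
    private boolean firstValue = false;
    private String key;
    private int value;

    GroupingSketch(String[] sortedKeys, int[] sortedValues) {
        this.sortedKeys = sortedKeys;
        this.sortedValues = sortedValues;
        this.hasMore = sortedKeys.length > 0;
    }

    boolean nextKeyValue() {
        if (!hasMore) {
            return false;
        }
        firstValue = !nextKeyIsSame;
        key = sortedKeys[pos];
        value = sortedValues[pos];
        pos++;
        hasMore = pos < sortedKeys.length;
        nextKeyIsSame = hasMore && sortedKeys[pos].equals(key);
        return true;
    }

    boolean nextKey() {
        // Drain values of the current key that the caller did not consume.
        while (hasMore && nextKeyIsSame) {
            nextKeyValue();
        }
        return hasMore && nextKeyValue();
    }

    // Same hasNext/next logic as ValueIterator, sharing the outer cursor.
    Iterator<Integer> valueIterator() {
        return new Iterator<Integer>() {
            @Override
            public boolean hasNext() {
                return firstValue || nextKeyIsSame;
            }

            @Override
            public Integer next() {
                if (firstValue) {           // already read by nextKey()
                    firstValue = false;
                    return value;
                }
                if (!nextKeyIsSame) {
                    throw new NoSuchElementException("iterate past last value");
                }
                nextKeyValue();
                return value;
            }
        };
    }

    // Toy driver: sums at most `limit` values per key (assumed parameter).
    static List<String> run(String[] keys, int[] values, int limit) {
        GroupingSketch ctx = new GroupingSketch(keys, values);
        List<String> out = new ArrayList<>();
        while (ctx.nextKey()) {                       // outer loop over KEYs
            String currentKey = ctx.key;
            Iterator<Integer> it = ctx.valueIterator();
            int sum = 0;
            int read = 0;
            while (read < limit && it.hasNext()) {    // inner loop over VALUEs
                sum += it.next();
                read++;
            }
            out.add(currentKey + ":" + sum);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] keys = {"a", "a", "a", "b", "c", "c"};
        int[] values = {1, 2, 3, 4, 5, 6};
        System.out.println(run(keys, values, Integer.MAX_VALUE)); // [a:6, b:4, c:11]
        System.out.println(run(keys, values, 1));                 // [a:1, b:4, c:5]
    }
}
```

With limit set to 1 the toy reduce reads only the first value of each key, yet the next call to nextKey still lands on the first record of the following key, because the while loop at its top drains the unread values.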

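When the reduce phase emits its output and the destination is HDFS, the data is written there directly. In this case HDFS acts as the server and the reduce task as the client; the write goes through FSDataOutputStream, so it is not analyzed further here.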