MapReduce Source Code Study (Part 2)

Source: Internet  Editor: 程序博客网  Time: 2024/05/17 01:33

Continuing the source code study: the previous post analyzed InputFormat's getSplits() method, and this one moves on to createRecordReader().

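From the method's Javadoc we can see that it creates a RecordReader for a given split, and that the framework calls the RecordReader's initialize method before the split is used. The concrete implementation lives in TextInputFormat.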

  @Override
  public RecordReader<LongWritable, Text>
    createRecordReader(InputSplit split,
                       TaskAttemptContext context) {
    String delimiter = context.getConfiguration().get(
        "textinputformat.record.delimiter");
    byte[] recordDelimiterBytes = null;
    if (null != delimiter)
      recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8);
    return new LineRecordReader(recordDelimiterBytes);
  }

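This code does very little: it reads the optional record delimiter from the configuration and constructs a LineRecordReader. The constructor is nothing special either. But recall the comment above, which says the initialize method will be called once, so let's look at initialize().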

  public void initialize(InputSplit genericSplit,
                         TaskAttemptContext context) throws IOException {
    FileSplit split = (FileSplit) genericSplit;
    Configuration job = context.getConfiguration();
    this.maxLineLength = job.getInt(MAX_LINE_LENGTH, Integer.MAX_VALUE);
    start = split.getStart();
    end = start + split.getLength();
    final Path file = split.getPath();

    // open the file and seek to the start of the split
    final FileSystem fs = file.getFileSystem(job);
    fileIn = fs.open(file);

    CompressionCodec codec = new CompressionCodecFactory(job).getCodec(file);
    if (null!=codec) {
      isCompressedInput = true;
      decompressor = CodecPool.getDecompressor(codec);
      if (codec instanceof SplittableCompressionCodec) {
        final SplitCompressionInputStream cIn =
          ((SplittableCompressionCodec)codec).createInputStream(
            fileIn, decompressor, start, end,
            SplittableCompressionCodec.READ_MODE.BYBLOCK);
        in = new CompressedSplitLineReader(cIn, job,
            this.recordDelimiterBytes);
        start = cIn.getAdjustedStart();
        end = cIn.getAdjustedEnd();
        filePosition = cIn;
      } else {
        in = new SplitLineReader(codec.createInputStream(fileIn,
            decompressor), job, this.recordDelimiterBytes);
        filePosition = fileIn;
      }
    } else {
      fileIn.seek(start);
      in = new SplitLineReader(fileIn, job, this.recordDelimiterBytes);
      filePosition = fileIn;
    }
    // If this is not the first split, we always throw away first record
    // because we always (except the last split) read one extra line in
    // next() method.
    if (start != 0) {
      start += in.readLine(new Text(), 0, maxBytesToConsume(start));
    }
    this.pos = start;
  }

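The main job of this code is to initialize the reader's fields and move the cursor to the start of the split. For every split except the first, it also discards the first line, because the previous split has already read it. Looking at the class outline, we also find several iterator-like methods: nextKeyValue(), getCurrentKey() and getCurrentValue().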
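To make the "throw away the first record, read one extra line" rule from the comment in initialize() concrete, here is a minimal plain-Java simulation. It is not Hadoop code: the class SplitLineDemo and its methods skipLine and readSplit are hypothetical names, the input is an in-memory byte array with '\n' line endings, and the loop condition `pos <= end` is my assumption about how the "one extra line" read plays out. Compression, custom delimiters and the maximum line length are all ignored.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Plain-Java simulation of the split-boundary rule; hypothetical, not Hadoop code. */
final class SplitLineDemo {

    /** Returns the index just past the next '\n' at or after pos (or data.length). */
    static int skipLine(byte[] data, int pos) {
        while (pos < data.length && data[pos] != '\n') {
            pos++;
        }
        return Math.min(pos + 1, data.length);
    }

    /** Returns the lines that the split [start, end) is responsible for. */
    static List<String> readSplit(byte[] data, int start, int end) {
        int pos = start;
        if (start != 0) {
            // Not the first split: discard the first (possibly partial) line,
            // because the previous split reads one line past its own end.
            pos = skipLine(data, pos);
        }
        List<String> lines = new ArrayList<>();
        // Assumption: a line belongs to this split if it begins at or before end.
        while (pos <= end && pos < data.length) {
            int next = skipLine(data, pos);
            int lineEnd = (data[next - 1] == '\n') ? next - 1 : next;
            lines.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = next;
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "aa\nbbbb\ncc\ndddd\n".getBytes(StandardCharsets.UTF_8);
        // Three splits whose boundaries do not all fall on line boundaries.
        System.out.println(readSplit(data, 0, 5));   // [aa, bbbb]
        System.out.println(readSplit(data, 5, 11));  // [cc, dddd]
        System.out.println(readSplit(data, 11, 16)); // []
    }
}
```

The first split reads "bbbb" even though that line runs past byte 5, and the second split skips the rest of that line before it starts reading. The third split begins exactly on a line boundary, yet it still discards "dddd", because the second split already read it as its extra line. Every line is therefore read exactly once.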

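However, nothing in this class calls those methods, so the key and value must be consumed somewhere else. Thinking back to the MapReduce process, step two comes to mind, so it is probably the Mapper class that calls them. Let's move on to the source code for step two.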

 

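1.2 Write a custom map function that processes <k1,v1> and emits <k2,v2>. The Mapper class has a map method, and as its comment says, we normally override it to add our own business logic. A simple example is word count, which splits v1 into words.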

  /**
   * Called once for each key/value pair in the input split. Most applications
   * should override this, but the default is the identity function.
   */
  @SuppressWarnings("unchecked")
  protected void map(KEYIN key, VALUEIN value,
                     Context context) throws IOException, InterruptedException {
    context.write((KEYOUT) key, (VALUEOUT) value);
  }
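As a rough illustration of the word-count case mentioned above, the sketch below shows what an overridden map might do with one <k1,v1> pair. It is plain Java so that it runs without Hadoop on the classpath: the class WordCountMapSketch and the Collector interface are hypothetical stand-ins for a Mapper subclass and for context.write, not part of the Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

/** Plain-Java sketch of a word-count map step; hypothetical, not Hadoop code. */
final class WordCountMapSketch {

    /** Hypothetical stand-in for context.write(k2, v2). */
    interface Collector {
        void write(String key, int value);
    }

    /** k1 is the offset of the line, v1 is the line; emits <word, 1> per token. */
    static void map(long offset, String line, Collector out) {
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            out.write(tokens.nextToken(), 1);
        }
    }

    public static void main(String[] args) {
        List<String> emitted = new ArrayList<>();
        map(0L, "hello hadoop hello", (k, v) -> emitted.add(k + "=" + v));
        System.out.println(emitted); // [hello=1, hadoop=1, hello=1]
    }
}
```

The map step does no aggregation: "hello" is emitted twice with a count of 1, and summing the counts is left to the reduce side.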
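We have found the definition of map, but we still need to find where it is called. Besides map, the Mapper class also has a run method, and it turns out that run is what calls map.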
  /**
   * Expert users can override this method for more complete control over the
   * execution of the Mapper.
   * @param context
   * @throws IOException
   */
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
      while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
      }
    } finally {
      cleanup(context);
    }
  }
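This is just a simple loop over the key/value pairs. We still need to track down the concrete implementation of the context methods. To start with, Context implements the MapContext interface.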

  /**
   * The <code>Context</code> passed on to the {@link Mapper} implementations.
   */
  public abstract class Context
    implements MapContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
  }

MapContext itself declares only one method, getInputSplit, so we keep looking. MapContext extends the TaskInputOutputContext interface.

public interface MapContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
  extends TaskInputOutputContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
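TaskInputOutputContext is the interface that declares the methods called in run (nextKeyValue, getCurrentKey and getCurrentValue), so the next step is to find its concrete implementation class.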

Following the type hierarchy, we finally arrive at the MapContextImpl class. Anyone familiar with Java will recognize this naming convention.



/**
 * The context that is given to the {@link Mapper}.
 * @param <KEYIN> the key input type to the Mapper
 * @param <VALUEIN> the value input type to the Mapper
 * @param <KEYOUT> the key output type from the Mapper
 * @param <VALUEOUT> the value output type from the Mapper
 */
@InterfaceAudience.Private
@InterfaceStability.Unstable
public class MapContextImpl<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
    extends TaskInputOutputContextImpl<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
    implements MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  private RecordReader<KEYIN,VALUEIN> reader;
  private InputSplit split;

  public MapContextImpl(Configuration conf, TaskAttemptID taskid,
                        RecordReader<KEYIN,VALUEIN> reader,
                        RecordWriter<KEYOUT,VALUEOUT> writer,
                        OutputCommitter committer,
                        StatusReporter reporter,
                        InputSplit split) {
    super(conf, taskid, writer, committer, reporter);
    this.reader = reader;
    this.split = split;
  }

  /**
   * Get the input split for this map.
   */
  public InputSplit getInputSplit() {
    return split;
  }

  @Override
  public KEYIN getCurrentKey() throws IOException, InterruptedException {
    return reader.getCurrentKey();
  }

  @Override
  public VALUEIN getCurrentValue() throws IOException, InterruptedException {
    return reader.getCurrentValue();
  }

  @Override
  public boolean nextKeyValue() throws IOException, InterruptedException {
    return reader.nextKeyValue();
  }
}
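In this class we can see that the context simply forwards nextKeyValue, getCurrentKey and getCurrentValue to the RecordReader, so the run method is in effect calling the RecordReader's methods.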
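The whole delegation chain traced above (run calls the context, and the context forwards to the reader) can be sketched in a few lines of plain Java. This is a simplified model under stated assumptions, not the Hadoop classes themselves: Reader, ListReader, Context and Mapper are cut-down hypothetical stand-ins for RecordReader, LineRecordReader, MapContextImpl and Mapper. The key is a line index rather than a byte offset, and setup/cleanup are left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of the run -> context -> reader chain; hypothetical, not Hadoop code. */
final class RunLoopSketch {

    /** Cut-down stand-in for RecordReader. */
    interface Reader<K, V> {
        boolean nextKeyValue();
        K getCurrentKey();
        V getCurrentValue();
    }

    /** Hypothetical reader that yields (line index, line) pairs from a list. */
    static final class ListReader implements Reader<Long, String> {
        private final List<String> lines;
        private int index = -1;

        ListReader(List<String> lines) { this.lines = lines; }

        public boolean nextKeyValue() { return ++index < lines.size(); }
        public Long getCurrentKey() { return (long) index; }
        public String getCurrentValue() { return lines.get(index); }
    }

    /** Plays the role of MapContextImpl: forwards reads to the reader, collects writes. */
    static final class Context<K, V> {
        private final Reader<K, V> reader;
        final List<String> output = new ArrayList<>();

        Context(Reader<K, V> reader) { this.reader = reader; }

        boolean nextKeyValue() { return reader.nextKeyValue(); }
        K getCurrentKey() { return reader.getCurrentKey(); }
        V getCurrentValue() { return reader.getCurrentValue(); }
        void write(Object key, Object value) { output.add(key + "\t" + value); }
    }

    /** Plays the role of Mapper: run() is the loop, map() defaults to the identity. */
    static class Mapper<K, V> {
        protected void map(K key, V value, Context<K, V> context) {
            context.write(key, value);
        }

        public void run(Context<K, V> context) {
            while (context.nextKeyValue()) {
                map(context.getCurrentKey(), context.getCurrentValue(), context);
            }
        }
    }

    /** Runs the identity mapper over the given lines and returns what it wrote. */
    static List<String> runIdentity(List<String> lines) {
        Context<Long, String> context = new Context<>(new ListReader(lines));
        new Mapper<Long, String>().run(context);
        return context.output;
    }

    public static void main(String[] args) {
        System.out.println(runIdentity(List.of("a", "b"))); // two entries: "0<TAB>a" and "1<TAB>b"
    }
}
```

Because run only ever talks to the context, swapping in a different reader changes what the mapper sees without touching the mapper itself. That is the same separation the Hadoop classes keep between InputFormat and Mapper.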
