A Brief Walkthrough of How Hadoop Runs the map Method (Based on the 1.2.1 Source)


Let's trace how a Hadoop task executes, starting from the stack trace of a thrown exception:

at org.apache.hadoop.mapred.lib.db.DBInputFormat$DBRecordReader.getSelectQuery(DBInputFormat.java:93)
at org.apache.hadoop.mapred.lib.db.DBInputFormat$DBRecordReader.<init>(DBInputFormat.java:82)
at org.apache.hadoop.mapred.lib.db.DBInputFormat.getRecordReader(DBInputFormat.java:286)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:197)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:418)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)

Here is the run method of MapRunner:
public void run(RecordReader<K1, V1> input, OutputCollector<K2, V2> output,
Reporter reporter)
throws IOException {
try {
// allocate key & value instances that are re-used for all entries
K1 key = input.createKey();
V1 value = input.createValue();

while (input.next(key, value)) {
// map pair to output
mapper.map(key, value, output, reporter);
if(incrProcCount) {
reporter.incrCounter(SkipBadRecords.COUNTER_GROUP,
SkipBadRecords.COUNTER_MAP_PROCESSED_RECORDS, 1);
}
}
} finally {
mapper.close();
}
}

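As the code above shows, run is invoked once per map task, that is, once per split: it pulls every record of that split from the RecordReader and hands each one to the user's map method.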
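One consequence of that loop is easy to miss: the same key and value instances are refilled for every record, so code that stores a reference instead of a copy ends up with the last record repeated. The following is a minimal plain-Java sketch of that effect. MutableText is a hypothetical stand-in for a mutable Writable such as Text, not the real Hadoop class, and the two helper methods are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class ReuseSketch {
    /** Hypothetical stand-in for a mutable Writable such as Text. */
    static final class MutableText {
        private String content = "";
        void set(String s) { content = s; }
        @Override public String toString() { return content; }
    }

    /** Keeps a reference to the shared instance on every iteration (the mistake). */
    static List<MutableText> keepReferences(String[] records) {
        List<MutableText> kept = new ArrayList<>();
        MutableText value = new MutableText();  // plays the role of input.createValue()
        for (String record : records) {         // plays the role of input.next(key, value)
            value.set(record);
            kept.add(value);
        }
        return kept;
    }

    /** Copies the content out before the next record overwrites the shared instance. */
    static List<String> keepCopies(String[] records) {
        List<String> kept = new ArrayList<>();
        MutableText value = new MutableText();
        for (String record : records) {
            value.set(record);
            kept.add(value.toString());
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] records = {"a", "b", "c"};
        System.out.println(keepReferences(records)); // [c, c, c]
        System.out.println(keepCopies(records));     // [a, b, c]
    }
}
```

In a real mapper the same rule applies: copy the key or value if it has to outlive the current call to map.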

InputFormat
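This interface defines two things: how the input is divided into splits, and how to obtain a RecordReader for a split. It is the controlling piece of the input side.
Splitting:
When a job is submitted, the framework calls the InputFormat's getSplits method and gets back an array of InputSplit objects. A split holds only where its data sits within the overall input (locations and length). Concrete implementations add whatever else they need: for a text file the split is a FileSplit, which carries the byte offset, the length and similar information in addition to the locations; for database input the split carries paging information; and so on. Each InputSplit is then assigned to a TaskTracker for execution.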
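To make the idea of a position-only split concrete, here is a minimal sketch that cuts a file of a given length into consecutive (start, length) ranges. It is a simplification under stated assumptions, not the FileInputFormat implementation: the Range class and computeSplits method are hypothetical names, and host locations, block sizes and minimum or maximum split sizes are all left out.

```java
import java.util.ArrayList;
import java.util.List;

class SplitSketch {
    /** Position-only description of a chunk, in the spirit of FileSplit (no data inside). */
    static final class Range {
        final long start;
        final long length;
        Range(long start, long length) { this.start = start; this.length = length; }
        @Override public String toString() { return "[start=" + start + ", length=" + length + "]"; }
    }

    /** Cuts a file of fileLength bytes into consecutive ranges of at most splitSize bytes. */
    static List<Range> computeSplits(long fileLength, long splitSize) {
        if (splitSize <= 0) throw new IllegalArgumentException("splitSize must be positive");
        List<Range> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            splits.add(new Range(start, Math.min(splitSize, fileLength - start)));
        }
        return splits;
    }

    public static void main(String[] args) {
        // A 250-byte file with 100-byte splits yields ranges of 100, 100 and 50 bytes.
        System.out.println(computeSplits(250, 100));
    }
}
```

Note that nothing is read from the file here; the ranges only say where each map task should look later.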

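Obtaining a RecordReader:
Knowing where a split lives (which nodes, starting offset, length) is still only a description; no actual data has been read yet. The next step is to fetch the real data and turn it into individual records that are passed to the user's map code.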
RecordReader<K, V> getRecordReader(InputSplit split,
JobConf job,
Reporter reporter) throws IOException;
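As the signature shows, getting a RecordReader requires the split: the location information is used to reach the real data, which is then bound to a concrete RecordReader that gets returned. For a database this probably means querying the rows described by the paging information; for a text file it means using the offset to open the corresponding file stream.
Different kinds of split generally need different RecordReader implementations to fetch the real data and break it into records.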
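For the database case, turning a split's paging information into a query can be sketched as follows. This is an illustration only: buildQuery, the table name and the column names are hypothetical, and the LIMIT ... OFFSET syntax is an assumption that holds for databases such as MySQL and PostgreSQL but not for every SQL dialect.

```java
import java.util.Arrays;
import java.util.List;

class PagedQuerySketch {
    /** Builds a SELECT that fetches only the rows described by one split. */
    static String buildQuery(String table, List<String> fields, long start, long length) {
        if (fields.isEmpty()) throw new IllegalArgumentException("at least one field is required");
        return "SELECT " + String.join(", ", fields)
                + " FROM " + table
                + " LIMIT " + length
                + " OFFSET " + start;
    }

    public static void main(String[] args) {
        // Hypothetical table and columns; the split covers 100 rows starting at row 200.
        System.out.println(buildQuery("users", Arrays.asList("id", "name"), 200, 100));
        // SELECT id, name FROM users LIMIT 100 OFFSET 200
    }
}
```

For pages not to overlap or skip rows, the query would also need a stable ORDER BY; that is omitted here to keep the sketch short.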

InputSplit
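Represents one chunk of the input. How a chunk is defined is up to the user, but it usually has a location and a length. An InputSplit does not hold the data itself, only information such as where that data is. This point is important.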

RecordReader
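Represents the way records are read out of a split; what a record looks like is again up to the user. As noted above, an InputSplit contains no real data, only a description of where the data is, so how does the RecordReader get at the actual data? getRecordReader on the InputFormat takes an InputSplit as an argument, so that must be where the real data is obtained and bound to the RecordReader.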

Let's look at the code in detail.

First, this method in TextInputFormat:
public RecordReader<LongWritable, Text> getRecordReader(
InputSplit genericSplit, JobConf job,
Reporter reporter)
throws IOException {

reporter.setStatus(genericSplit.toString());
return new LineRecordReader(job, (FileSplit) genericSplit);
}

Next, follow it into LineRecordReader:
public LineRecordReader(Configuration job,
FileSplit split) throws IOException {
this.maxLineLength = job.getInt("mapred.linerecordreader.maxlength",
Integer.MAX_VALUE);
start = split.getStart();
end = start + split.getLength();
final Path file = split.getPath();
compressionCodecs = new CompressionCodecFactory(job);
codec = compressionCodecs.getCodec(file);

// open the file and seek to the start of the split
FileSystem fs = file.getFileSystem(job);
FSDataInputStream fileIn = fs.open(split.getPath());

if (isCompressedInput()) {
decompressor = CodecPool.getDecompressor(codec);
if (codec instanceof SplittableCompressionCodec) {
final SplitCompressionInputStream cIn =
((SplittableCompressionCodec)codec).createInputStream(
fileIn, decompressor, start, end,
SplittableCompressionCodec.READ_MODE.BYBLOCK);
in = new LineReader(cIn, job);
start = cIn.getAdjustedStart();
end = cIn.getAdjustedEnd();
filePosition = cIn; // take pos from compressed stream
} else {
in = new LineReader(codec.createInputStream(fileIn, decompressor), job);
filePosition = fileIn;
}
} else {
fileIn.seek(start);
in = new LineReader(fileIn, job);
filePosition = fileIn;
}
// If this is not the first split, we always throw away first record
// because we always (except the last split) read one extra line in
// next() method.
if (start != 0) {
start += in.readLine(new Text(), 0, maxBytesToConsume(start));
}
this.pos = start;
}

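Note the field in: this is the real data. The LineRecordReader is handed the actual input stream when it is constructed, so each call to its next method can read out one record at a time in the expected format: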
/** Read a line. */
public synchronized boolean next(LongWritable key, Text value)
throws IOException {

// We always read one extra line, which lies outside the upper
// split limit i.e. (end - 1)
while (getFilePosition() <= end) {
key.set(pos);

int newSize = in.readLine(value, maxLineLength,
Math.max(maxBytesToConsume(pos), maxLineLength));
if (newSize == 0) {
return false;
}
pos += newSize;
if (newSize < maxLineLength) {
return true;
}

// line too long. try again
LOG.info("Skipped line of size " + newSize + " at pos " + (pos - newSize));
}

return false;
}
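The two comments in the code above ("throw away first record" and "read one extra line") describe a convention that lets splits cut a file at arbitrary byte positions without losing or duplicating a line. The sketch below reproduces that convention on an in-memory byte array. It is a simplified model, not the LineRecordReader code: readSplit and endOfLine are hypothetical helpers, and compression, maximum line length and '\r' handling are ignored.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class LineSplitSketch {
    /** Returns the index just past the next '\n' at or after pos, or data.length if none. */
    static int endOfLine(byte[] data, int pos) {
        while (pos < data.length && data[pos] != '\n') pos++;
        return Math.min(pos + 1, data.length);
    }

    /** Reads the lines that belong to the split [start, start + length). */
    static List<String> readSplit(byte[] data, int start, int length) {
        int end = start + length;
        int pos = start;
        if (start != 0) {
            pos = endOfLine(data, pos);            // discard the first (possibly partial) line
        }
        List<String> lines = new ArrayList<>();
        while (pos <= end && pos < data.length) {  // may read one line that starts before or at 'end' and runs past it
            int next = endOfLine(data, pos);
            int contentEnd = (data[next - 1] == '\n') ? next - 1 : next;
            lines.add(new String(data, pos, contentEnd - pos, StandardCharsets.UTF_8));
            pos = next;
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbeta\ngamma\ndelta\n".getBytes(StandardCharsets.UTF_8);
        // 8-byte splits cut "beta" and "gamma" in the middle of a line.
        for (int start = 0; start < data.length; start += 8) {
            int length = Math.min(8, data.length - start);
            System.out.println("split at " + start + ": " + readSplit(data, start, length));
        }
    }
}
```

Each line is produced by exactly one split: the split in which the line starts reads it to the end, and the following split skips the tail it finds at its own beginning.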

Now look at DBInputFormat:

/** {@inheritDoc} */
@SuppressWarnings("unchecked")
public RecordReader<LongWritable, T> getRecordReader(InputSplit split,
JobConf job, Reporter reporter) throws IOException {

Class inputClass = dbConf.getInputClass();
try {
return new DBRecordReader((DBInputSplit) split, inputClass, job);
}
catch (SQLException ex) {
throw new IOException(ex.getMessage());
}
}

Its RecordReader is a DBRecordReader. An inputClass is passed in as well; what is that? Read on.
Here is DBRecordReader:
/**
* @param split The InputSplit to read data for
* @throws SQLException
*/
protected DBRecordReader(DBInputSplit split, Class<T> inputClass, JobConf job) throws SQLException {
this.inputClass = inputClass;
this.split = split;
this.job = job;

statement = connection.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);

//statement.setFetchSize(Integer.MIN_VALUE);
results = statement.executeQuery(getSelectQuery());
}

Leaving the other details aside, the constructor simply runs a database query and stores the result in the results field. Its next method looks like this:
/** {@inheritDoc} */
public boolean next(LongWritable key, T value) throws IOException {
try {
if (!results.next())
return false;

// Set the key field value as the output key value
key.set(pos + split.getStart());

value.readFields(results);

pos ++;
} catch (SQLException e) {
throw new IOException(e.getMessage());
}
return true;
}

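This is the ordinary way of stepping through a database result set. But what is value? When a row is read into it, how does it know the order of the columns returned by the query?
To answer that, find out who passes value in. Going back to MapRunner's run method, there is this line: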
V1 value = input.createValue();
Here input is the RecordReader, which in this case is a DBRecordReader. Its createValue method is:
/** {@inheritDoc} */
public T createValue() {
return ReflectionUtils.newInstance(inputClass, job);
}
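Remember the inputClass argument passed in when the DBRecordReader was constructed? This is where it is used. It is the hook for user code: the user supplies a class that knows how to read the actual data, and that class must implement the DBWritable interface, which defines the methods for reading and writing a row: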
public interface DBWritable {

/**
* Sets the fields of the object in the {@link PreparedStatement}.
* @param statement the statement that the fields are put into.
* @throws SQLException
*/
public void write(PreparedStatement statement) throws SQLException;

/**
* Reads the fields of the object from the {@link ResultSet}.
* @param resultSet the {@link ResultSet} to get the fields from.
* @throws SQLException
*/
public void readFields(ResultSet resultSet) throws SQLException ;

}
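A user-side implementation might look like the sketch below. Everything here is illustrative: UserRecord and its two columns are hypothetical, the nested DBWritable interface is a local mirror of the Hadoop one so that the example compiles without Hadoop on the classpath, and fakeRow is a throwaway stub standing in for a real JDBC ResultSet. The column indexes assume the query selects id first and name second.

```java
import java.lang.reflect.Proxy;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

class DbRecordSketch {
    /** Local mirror of Hadoop's DBWritable, so the sketch needs no Hadoop jars. */
    interface DBWritable {
        void write(PreparedStatement statement) throws SQLException;
        void readFields(ResultSet resultSet) throws SQLException;
    }

    /** Hypothetical row type for a query that selects (id, name) in that order. */
    static final class UserRecord implements DBWritable {
        long id;
        String name;

        @Override public void write(PreparedStatement statement) throws SQLException {
            statement.setLong(1, id);
            statement.setString(2, name);
        }

        @Override public void readFields(ResultSet resultSet) throws SQLException {
            id = resultSet.getLong(1);       // the class itself knows the column order
            name = resultSet.getString(2);
        }
    }

    /** Stub ResultSet positioned on one row; only getLong and getString are supported. */
    static ResultSet fakeRow(long id, String name) {
        return (ResultSet) Proxy.newProxyInstance(
                DbRecordSketch.class.getClassLoader(),
                new Class<?>[] {ResultSet.class},
                (proxy, method, args) -> {
                    switch (method.getName()) {
                        case "getLong": return id;
                        case "getString": return name;
                        default: throw new UnsupportedOperationException(method.getName());
                    }
                });
    }

    public static void main(String[] args) throws Exception {
        // Reflection-based creation, similar in spirit to createValue().
        UserRecord value = UserRecord.class.getDeclaredConstructor().newInstance();
        // What next() does for each row: let the user class fill itself in.
        value.readFields(fakeRow(7L, "alice"));
        System.out.println(value.id + " " + value.name); // 7 alice
    }
}
```

Because the value object is created reflectively, the user class needs a no-argument constructor; the sketch relies on the implicit default one.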

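To summarize:
The input is divided into splits by calling getSplits on the InputFormat the user configured. In 1.2.1 this call is made by the job client at submission time rather than by the JobTracker itself; the JobTracker then schedules one map task per split. Using the split information together with the number of map tasks, it picks a TaskTracker for each map task, usually one that already holds the split's data, to reduce data transfer. The TaskTracker launches a child JVM to run the task, and what runs there is MapTask's run method, as the Child.main and MapTask.run frames in the stack trace show. run calls either runOldMapper or runNewMapper, and both obtain a RecordReader for the split. runOldMapper then creates the runner by calling

MapRunnable<INKEY,INVALUE,OUTKEY,OUTVALUE> runner =
ReflectionUtils.newInstance(job.getMapRunnerClass(), job);
where getMapRunnerClass is
public Class<? extends MapRunnable> getMapRunnerClass() {
return getClass("mapred.map.runner.class",
MapRunner.class, MapRunnable.class);
}
so the default runner is MapRunner. (runNewMapper does not go through a MapRunnable; it calls the run method of the new-API Mapper directly.) Once it has the runner, runOldMapper passes in the RecordReader and calls the runner's run method, which calls next on the RecordReader repeatedly to get one record at a time and passes each record to the user's map method.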
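The "look up a class name in the configuration, fall back to a default, instantiate by reflection" pattern behind getMapRunnerClass can be sketched in plain Java as follows. This is a model of the idea only: Runner, DefaultRunner, CustomRunner and the Map-based configuration are hypothetical stand-ins for MapRunnable, MapRunner, a user-supplied runner and JobConf.

```java
import java.util.HashMap;
import java.util.Map;

class RunnerLookupSketch {
    /** Hypothetical stand-in for MapRunnable. */
    interface Runner { String describe(); }

    /** Plays the role of the default MapRunner. */
    static final class DefaultRunner implements Runner {
        @Override public String describe() { return "default"; }
    }

    /** Plays the role of a user-supplied runner. */
    static final class CustomRunner implements Runner {
        @Override public String describe() { return "custom"; }
    }

    static final String KEY = "mapred.map.runner.class";

    /** Resolves the configured class, falling back to DefaultRunner when the key is unset. */
    static Class<? extends Runner> getRunnerClass(Map<String, String> conf)
            throws ClassNotFoundException {
        String name = conf.get(KEY);
        if (name == null) {
            return DefaultRunner.class;
        }
        return Class.forName(name).asSubclass(Runner.class);
    }

    /** Instantiates the resolved class through its no-argument constructor. */
    static Runner newRunner(Map<String, String> conf) throws ReflectiveOperationException {
        return getRunnerClass(conf).getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> conf = new HashMap<>();
        System.out.println(newRunner(conf).describe());   // default
        conf.put(KEY, CustomRunner.class.getName());
        System.out.println(newRunner(conf).describe());   // custom
    }
}
```

The design choice this illustrates is that the record loop itself is pluggable: a job that sets the property gets its own runner, and every other job gets the default loop.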

That is the overall execution flow.

0 0