MapReduce Input Formats Explained

Source: Internet. Editor: 程序博客网. Time: 2024/05/20 02:24

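Input flow walkthrough

Input format

From the moment a file is uploaded to HDFS until its contents reach the map function, the process can be roughly divided into 4 steps.

When a file is uploaded to HDFS it is divided into blocks. At input time all blocks are read and grouped into splits, each split corresponds to one map task, and each split is then divided into multiple records. The map function receives one record (a key and a value) per call: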

    void map(
        K1 key,      // the record's key
        V1 value,    // the record's value
        OutputCollector<K2, V2> output,
        Reporter reporter) throws IOException;

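1. Upload to HDFS

When a file is uploaded to HDFS, it is divided into blocks of the size given by dfs.block.size in the configuration files, and the blocks are then stored on the datanodes.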
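The arithmetic behind this step is simple enough to sketch. The class and method names below are hypothetical and only illustrate how many blocks a file of a given size occupies; they are not HDFS APIs.

```java
// A minimal sketch, assuming a fixed block size; not HDFS code.
final class BlockCountSketch {
    // Number of blocks needed for a file: ceil(fileSize / blockSize).
    static long blockCount(long fileSize, long blockSize) {
        return (fileSize + blockSize - 1) / blockSize;
    }

    // Size of the last block; it holds only the remaining bytes.
    static long lastBlockSize(long fileSize, long blockSize) {
        long remainder = fileSize % blockSize;
        return (fileSize > 0 && remainder == 0) ? blockSize : remainder;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 300 MB file with 128 MB blocks: two full blocks plus one of 44 MB.
        System.out.println(blockCount(300 * mb, 128 * mb));         // 3
        System.out.println(lastBlockSize(300 * mb, 128 * mb) / mb); // 44
    }
}
```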

2. Read blocks from HDFS and divide them into splits

Computing the split size:

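| Property | Type | Default | Description |
|---|---|---|---|
| mapred.min.split.size | int | 1 | Minimum size of a split |
| mapred.max.split.size | long | Long.MAX_VALUE, i.e. 9223372036854775807 | Maximum size of a split |
| dfs.block.size | long | 128M | Block size in HDFS |
| goalSize | long | totalSize/numSplit | totalSize is the total size of the files; numSplit is the number of map tasks specified by the user |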

Formula for the InputSplit size:
**hadoop1.x:** splitSize = max{ minSize , min{ goalSize , blockSize}}
**hadoop2.x:** splitSize = max{ minSize , min{ maxSize , blockSize}}
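The two formulas can be tried out with a few numbers. The following is a minimal sketch with hypothetical class and method names; it reproduces only the arithmetic above, not Hadoop's actual FileInputFormat code.

```java
// A sketch of the two split-size formulas; names are illustrative only.
final class SplitSizeSketch {
    // hadoop1.x: max{ minSize, min{ goalSize, blockSize } }
    static long oldApiSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // hadoop2.x: max{ minSize, min{ maxSize, blockSize } }
    static long newApiSplitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 128 * mb;
        // Defaults (minSize = 1, maxSize = Long.MAX_VALUE): splitSize == blockSize.
        System.out.println(newApiSplitSize(1L, Long.MAX_VALUE, blockSize) / mb);       // 128
        // maxSize below blockSize: smaller splits, hence more map tasks.
        System.out.println(newApiSplitSize(1L, 64 * mb, blockSize) / mb);              // 64
        // minSize above blockSize: larger splits, hence fewer map tasks.
        System.out.println(newApiSplitSize(256 * mb, Long.MAX_VALUE, blockSize) / mb); // 256
        // 1.x: 1 GB of input with 2 requested map tasks gives goalSize = 512 MB,
        // which is capped at blockSize.
        System.out.println(oldApiSplitSize(1024 * mb / 2, 1L, blockSize) / mb);        // 128
    }
}
```

In both versions minSize has the final say: it is the only setting that can push a split above the block size.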

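In general, to save network bandwidth and disk I/O, splitSize ends up equal to blockSize, which gives data locality. The default configuration does exactly this.
Note: blocks are cut without regard to record boundaries, so a record can straddle two blocks. The RecordReader therefore follows a convention: the incomplete first record of each InputSplit is handled by the previous InputSplit. In other words, although the formula yields splitSize = blockSize, a split does not process exactly 128M of data: it leaves the incomplete record at its start to the previous split, and it reads the remainder of its own last record from the next split.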
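The boundary convention can be sketched for newline-terminated records. The code below is a simplified, plain-Java illustration with hypothetical names; it operates on an in-memory byte array rather than HDFS and only approximates what a line-oriented RecordReader does.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// A sketch of line records crossing split boundaries; not Hadoop code.
final class SplitBoundarySketch {
    // Index just past the next '\n' at or after 'from' (or end of data).
    static int nextLineStart(byte[] data, int from) {
        int i = from;
        while (i < data.length && data[i] != '\n') {
            i++;
        }
        return Math.min(i + 1, data.length);
    }

    // Reads the records that belong to the split [start, start + length).
    static List<String> readSplit(byte[] data, int start, int length) {
        int end = start + length;
        int pos = start;
        // Every split except the first discards its first (possibly partial)
        // line, because the previous split is responsible for it.
        if (start != 0) {
            pos = nextLineStart(data, pos);
        }
        List<String> records = new ArrayList<>();
        // Keep reading while the line STARTS at or before 'end', so the last
        // line may run past the split boundary into the next split's bytes.
        while (pos <= end && pos < data.length) {
            int next = nextLineStart(data, pos);
            int lineEnd = (data[next - 1] == '\n') ? next - 1 : next;
            records.add(new String(data, pos, lineEnd - pos, StandardCharsets.US_ASCII));
            pos = next;
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbeta\ngamma\n".getBytes(StandardCharsets.US_ASCII);
        // Three splits of at most 8 bytes; "beta" and "gamma" straddle boundaries.
        System.out.println(readSplit(data, 0, 8));  // [alpha, beta]
        System.out.println(readSplit(data, 8, 8));  // [gamma]
        System.out.println(readSplit(data, 16, 1)); // []
    }
}
```

Whatever split size is chosen, every line is read exactly once, by the split that contains the byte just before the line begins.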

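InputSplit

Once the splits have been computed, each one is stored in an object, the InputSplit. An InputSplit does not hold the actual data; it only holds a reference to the split, namely its length and the locations of the datanodes that store it: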

    public abstract class InputSplit {
        public abstract long getLength();
        public abstract String[] getLocations();
    }

3. Divide into records

    public abstract class RecordReader<KEYIN, VALUEIN> implements Closeable {
        public abstract void initialize(
                InputSplit split,
                TaskAttemptContext context);
        public abstract boolean nextKeyValue();
        public abstract KEYIN getCurrentKey();
        public abstract VALUEIN getCurrentValue();
        // Returns a fraction indicating how much of the split has been read.
        public abstract float getProgress();
        public abstract void close();
    }

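The meaning of each RecordReader method is clear, so no further explanation is given.

4. Pass each record <key, value> to map in turn

Now for InputFormat. The TextInputFormat we commonly use derives from this class, so let's first look at its structure.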

    public abstract class InputFormat<K, V> {
        public abstract List<InputSplit> getSplits(JobContext context);
        public abstract RecordReader<K, V> createRecordReader(
                InputSplit split,
                TaskAttemptContext context);
    }

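This makes it clear that InputFormat is the core class for input. When we launch a job we set the input file paths, and the JobContext class stores those paths.
The getSplits function divides the input files into InputSplits and returns them in a List.
The createRecordReader function takes an InputSplit and returns a RecordReader.
It follows that there must be an outer class that calls getSplits to obtain the List and then assigns each InputSplit to the map class we override.
For a given map task, the assigned InputSplit is passed to createRecordReader to obtain a RecordReader; the RecordReader's nextKeyValue(), getCurrentKey() and getCurrentValue() are then called, and the resulting key and value are passed to the map method of the overridden map class.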
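The flow just described can be imitated in a few lines of plain Java. Everything in this sketch is a hypothetical stand-in: `Split`, `Reader`, `Format` and `MapFn` are simplified substitutes for InputSplit, RecordReader, InputFormat and the map class, `getSplits` takes a lines-per-split argument instead of a JobContext, and `runJob` plays the role of the outer class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// A sketch of the getSplits -> createRecordReader -> map loop; not Hadoop types.
final class InputFlowSketch {
    static final class Split {
        final int start, end; // line range [start, end); a reference, not the data
        Split(int start, int end) { this.start = start; this.end = end; }
    }

    interface Reader<K, V> {
        boolean nextKeyValue();
        K getCurrentKey();
        V getCurrentValue();
    }

    interface Format<K, V> {
        List<Split> getSplits(int linesPerSplit);
        Reader<K, V> createRecordReader(Split split);
    }

    interface MapFn<K, V> {
        void map(K key, V value, List<String> out);
    }

    // Key = line index, value = line content.
    static final class LinesFormat implements Format<Integer, String> {
        private final List<String> lines;
        LinesFormat(List<String> lines) { this.lines = lines; }

        public List<Split> getSplits(int linesPerSplit) {
            List<Split> splits = new ArrayList<>();
            for (int i = 0; i < lines.size(); i += linesPerSplit) {
                splits.add(new Split(i, Math.min(i + linesPerSplit, lines.size())));
            }
            return splits;
        }

        public Reader<Integer, String> createRecordReader(final Split split) {
            return new Reader<Integer, String>() {
                private int next = split.start;
                private int current = -1;
                public boolean nextKeyValue() {
                    if (next >= split.end) return false;
                    current = next++;
                    return true;
                }
                public Integer getCurrentKey() { return current; }
                public String getCurrentValue() { return lines.get(current); }
            };
        }
    }

    // The "outer class": one reader per split, one map call per record.
    static <K, V> List<String> runJob(Format<K, V> format, int linesPerSplit, MapFn<K, V> mapper) {
        List<String> out = new ArrayList<>();
        for (Split split : format.getSplits(linesPerSplit)) {
            Reader<K, V> reader = format.createRecordReader(split);
            while (reader.nextKeyValue()) {
                mapper.map(reader.getCurrentKey(), reader.getCurrentValue(), out);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        LinesFormat format = new LinesFormat(Arrays.asList("a", "b", "c"));
        MapFn<Integer, String> upper = (k, v, o) -> o.add(k + ":" + v.toUpperCase());
        System.out.println(runJob(format, 2, upper)); // [0:A, 1:B, 2:C]
    }
}
```

In a real job each split would be handled by a separate map task, possibly on a different machine; the sequential loop here only shows the order of calls.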

In Hadoop, this class is MapTask.
The relevant code is:

    private <INKEY,INVALUE,OUTKEY,OUTVALUE>
    void runOldMapper(final JobConf job,
                      final TaskSplitIndex splitIndex,
                      final TaskUmbilicalProtocol umbilical,
                      TaskReporter reporter) {
        ...
        RecordReader<INKEY,INVALUE> in = isSkipping() ?
            new SkippingRecordReader<INKEY,INVALUE>(umbilical, reporter, job) :
            new TrackedRecordReader<INKEY,INVALUE>(reporter, job);
        job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());
        ...
        MapRunnable<INKEY,INVALUE,OUTKEY,OUTVALUE> runner =
            ReflectionUtils.newInstance(job.getMapRunnerClass(), job);
        runner.run(in, new OldOutputCollector(collector, conf), reporter);
        ...
    }

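The runner obtained above is a MapRunnable (MapRunner by default), and it is this class, not RecordReader, that provides run. The body of the run function is: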

    public void run(RecordReader<K1, V1> input, OutputCollector<K2, V2> output,
                    Reporter reporter)
        throws IOException {
        try {
            // allocate key & value instances that are re-used for all entries
            K1 key = input.createKey();
            V1 value = input.createValue();

            while (input.next(key, value)) {
                // map pair to output
                mapper.map(key, value, output, reporter);
                if (incrProcCount) {
                    reporter.incrCounter(SkipBadRecords.COUNTER_GROUP,
                        SkipBadRecords.COUNTER_MAP_PROCESSED_RECORDS, 1);
                }
            }
        } finally {
            mapper.close();
        }
    }

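Input formats in Hadoop

InputFormat class hierarchy

TextInputFormat

The default file input format, used to read plain text files. The file is divided into a series of lines terminated by LF or CR. The key is the byte offset of each line within the file, of type LongWritable; the value is the content of the line, of type Text.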

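KeyValueTextInputFormat

Also used to read text files. If a line is divided into two parts by the separator (tab by default), the first part is the key and the rest is the value; if the line contains no separator, the whole line is the key and the value is empty.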
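The splitting rule can be written down directly. This is a minimal plain-Java sketch with a hypothetical helper name, assuming the line is cut at the first occurrence of the separator; it is not the Hadoop implementation.

```java
// A sketch of the key/value line-splitting rule; not Hadoop code.
final class KeyValueLineSketch {
    // Returns {key, value}. Without a separator the whole line is the key
    // and the value is the empty string.
    static String[] splitLine(String line, char separator) {
        int idx = line.indexOf(separator);
        if (idx < 0) {
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, idx), line.substring(idx + 1) };
    }

    public static void main(String[] args) {
        String[] kv = splitLine("user42\tclicked\thome", '\t');
        System.out.println(kv[0]);              // user42
        System.out.println(kv[1].length());     // 12 (the value keeps the second tab)
        String[] noSeparator = splitLine("orphan", '\t');
        System.out.println(noSeparator[0]);           // orphan
        System.out.println(noSeparator[1].isEmpty()); // true
    }
}
```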

SequenceFileInputFormat

Used to read sequence files. A sequence file is Hadoop's own binary file format for storing data. The format has two subclasses: SequenceFileAsBinaryInputFormat, which reads key and value as BytesWritable, and SequenceFileAsTextInputFormat, which reads key and value as Text.

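SequenceFileInputFilter

Reads from a sequence file only the records that satisfy a filter, which is specified with setFilterClass. Three filters are built in: RegexFilter keeps the records whose key matches a given regular expression; PercentFilter takes a parameter f and keeps the records whose record number satisfies number % f == 0; MD5Filter takes a parameter f and keeps the records for which MD5(key) % f == 0.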

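NLineInputFormat

Added in 0.18.x. It splits a file by lines, for example so that each line of the file goes to its own map. The key is the byte offset of each line (LongWritable) and the value is the content of the line (Text).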
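Splitting by line count rather than by size can be sketched as follows. The names are hypothetical, the input is an in-memory ASCII string (so one char equals one byte), and the result is a list of {start offset, length} pairs; this illustrates the idea only, not NLineInputFormat's code.

```java
import java.util.ArrayList;
import java.util.List;

// A sketch of N-lines-per-split; not Hadoop code.
final class NLineSplitSketch {
    // Returns {startOffset, lengthInBytes} for each group of n lines.
    static List<long[]> splits(String text, int n) {
        List<long[]> result = new ArrayList<>();
        long splitStart = 0;
        long pos = 0;
        int linesInSplit = 0;
        for (int i = 0; i < text.length(); i++) {
            pos = i + 1;
            if (text.charAt(i) == '\n') {
                linesInSplit++;
                if (linesInSplit == n) {
                    // n complete lines collected: close this split.
                    result.add(new long[] { splitStart, pos - splitStart });
                    splitStart = pos;
                    linesInSplit = 0;
                }
            }
        }
        // Remaining lines (fewer than n, or an unterminated last line).
        if (pos > splitStart) {
            result.add(new long[] { splitStart, pos - splitStart });
        }
        return result;
    }

    public static void main(String[] args) {
        // Three lines, two lines per split: offsets 0..4 and 5..8.
        for (long[] s : splits("a\nbb\nccc\n", 2)) {
            System.out.println(s[0] + "+" + s[1]); // 0+5, then 5+4
        }
    }
}
```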

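CombineFileInputFormat

Designed for processing large numbers of small files. FileInputFormat produces at least one split per file, whereas CombineFileInputFormat packs multiple files into one split so that each mapper has more data to process.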
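The packing idea can be illustrated with a greedy sketch. This is a deliberate simplification with hypothetical names: it fills each split up to a size limit in input order, and it ignores everything else a real implementation has to consider, such as where the files are stored.

```java
import java.util.ArrayList;
import java.util.List;

// A greedy sketch of packing small files into splits; not Hadoop code.
final class CombineSplitSketch {
    // Groups file names into splits whose total size stays within maxSplitSize.
    // A single file larger than the limit still gets a split of its own.
    static List<List<String>> pack(String[] names, long[] sizes, long maxSplitSize) {
        List<List<String>> splits = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long currentSize = 0;
        for (int i = 0; i < names.length; i++) {
            if (!current.isEmpty() && currentSize + sizes[i] > maxSplitSize) {
                splits.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(names[i]);
            currentSize += sizes[i];
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        String[] names = { "a", "b", "c", "d" };
        long[] sizes = { 40, 50, 30, 90 };
        // Four files become three splits instead of four.
        System.out.println(pack(names, sizes, 100)); // [[a, b], [c], [d]]
    }
}
```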

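MultipleInputs

Different input types, processed by different maps

When different files should be processed by different maps but by the same reduce, MultipleInputs can be used. The code below shows the general usage; for a full implementation see the second article in the references.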

    public static class Mapper1
            extends Mapper<Object, Text, Text, Text> { ... }

    public static class Mapper2
            extends Mapper<Object, Text, Text, Text> { ... }

    public static void main(String[] args) {
        ...
        // Mapper1 processes file1, Mapper2 processes file2
        MultipleInputs.addInputPath(job, new Path("file1"), TextInputFormat.class, Mapper1.class);
        MultipleInputs.addInputPath(job, new Path("file2"), KeyValueTextInputFormat.class, Mapper2.class);
        // Use FileOutputFormat to set the output path
        FileOutputFormat.setOutputPath(job, new Path(output));
        ...
    }

Outputting different types from map

First extend GenericWritable, as in the class below, which declares the set of types the map may output.

    public class MultiValueWritable extends GenericWritable {
        // Array of the classes that the map may output
        private static Class[] CLASS = new Class[] {
            IntWritable.class,
            Text.class
        };

        @Override
        protected Class<? extends Writable>[] getTypes() {
            return CLASS;
        }

        public MultiValueWritable() {
        }

        public MultiValueWritable(Writable writable) {
            set(writable);
        }
    }

With this class in place, the map and reduce can be written as follows:

    static class mapper1 extends
            Mapper<LongWritable, Text, NullWritable, MultiValueWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(NullWritable.get(), new MultiValueWritable(
                    new IntWritable(1)));
            context.write(NullWritable.get(), new MultiValueWritable(new Text(
                    "2")));
        }
    }

    static class reducer1 extends
            Reducer<NullWritable, MultiValueWritable, NullWritable, IntWritable> {
        @Override
        protected void reduce(NullWritable arg0,
                Iterable<MultiValueWritable> arg1, Context arg2)
                throws IOException, InterruptedException {
            IntWritable intw = new IntWritable();
            Text text = new Text();
            int sum = 0;
            for (MultiValueWritable multiValueWritable : arg1) {
                Writable writable = multiValueWritable.get();
                if (writable instanceof IntWritable) {
                    intw = (IntWritable) writable;
                    sum += intw.get();
                } else if (writable instanceof Text) {
                    text = (Text) writable;
                    sum += Integer.valueOf(text.toString());
                }
            }
            arg2.write(NullWritable.get(), new IntWritable(sum));
        }
    }
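The reason a single wrapper type can carry several value types is that a small type index is serialized ahead of the wrapped value, so the reading side knows what to decode. The following plain-Java sketch illustrates that idea with hypothetical names and without any Hadoop classes; the tag values 0 and 1 simply mirror the positions of the int and text types in the class array above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// A sketch of tag-prefixed serialization; not the Hadoop implementation.
final class TaggedValueSketch {
    // Writes a one-byte type tag followed by the value itself.
    static byte[] write(Object value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            if (value instanceof Integer) {
                out.writeByte(0);
                out.writeInt((Integer) value);
            } else if (value instanceof String) {
                out.writeByte(1);
                out.writeUTF((String) value);
            } else {
                throw new IllegalArgumentException("unsupported type: " + value.getClass());
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the tag first, then decodes the matching type.
    static Object read(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            byte tag = in.readByte();
            switch (tag) {
                case 0: return in.readInt();
                case 1: return in.readUTF();
                default: throw new IllegalArgumentException("unknown type tag: " + tag);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Same logic as the reducer above: add ints directly, parse strings.
    static int sum(byte[]... encodedValues) {
        int sum = 0;
        for (byte[] encoded : encodedValues) {
            Object value = read(encoded);
            if (value instanceof Integer) {
                sum += (Integer) value;
            } else if (value instanceof String) {
                sum += Integer.parseInt((String) value);
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sum(write(1), write("2"))); // 3
    }
}
```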

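Custom input format

To deepen the understanding of InputFormat, here is a common custom InputFormat that treats an entire file as a single record.

First comes the InputFormat subclass. Its job is to divide the input files into splits and hand each split to a RecordReader.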

    package testhdoop;

    import java.io.IOException;

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class myInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

        /**
         * Tells the framework whether a file may be divided into several splits.
         * Because we want to treat a whole file as one record,
         * a file must not be split, so this simply returns false.
         */
        @Override
        protected boolean isSplitable(JobContext context,
                org.apache.hadoop.fs.Path filename) {
            return false;
        }

        @Override
        public org.apache.hadoop.mapreduce.RecordReader<NullWritable, BytesWritable> createRecordReader(
                InputSplit arg0, TaskAttemptContext arg1) throws IOException,
                InterruptedException {
            myRecordReader records = new myRecordReader();
            records.initialize(arg0, arg1);
            return records;
        }
    }

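Next comes the RecordReader subclass, which processes a split. Since we want to read a whole file at once and the InputFormat above already turns each file into a single split, all that remains is to deliver the entire split as one value.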

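    package testhdoop;

    import java.io.IOException;

    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class myRecordReader extends RecordReader<NullWritable, BytesWritable> {
        private static final Log LOG = LogFactory.getLog(myRecordReader.class
                .getName());
        // Parameters passed in from myInputFormat; needed later
        private Configuration conf;
        FileSplit split;
        // Whether the file has already been read
        private boolean position = false;
        // Holds the value
        private BytesWritable value = new BytesWritable();

        @Override
        public void initialize(InputSplit arg0, TaskAttemptContext arg1)
                throws IOException, InterruptedException {
            split = (FileSplit) arg0;
            conf = arg1.getConfiguration();
        }

        @Override
        public void close() throws IOException {
        }

        @Override
        public NullWritable getCurrentKey() throws IOException,
                InterruptedException {
            return NullWritable.get();
        }

        @Override
        public BytesWritable getCurrentValue() throws IOException,
                InterruptedException {
            return value;
        }

        /**
         * Returns the fraction of the file read so far.
         * If position is false the file has not been read yet, so return 0;
         * otherwise return 1.
         */
        @Override
        public float getProgress() throws IOException, InterruptedException {
            return position ? 1f : 0f;
        }

        /**
         * Reads the file into value and sets position to true
         * to record that the file has been read.
         */
        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            if (!position) {
                /**
                 * How the LineRecordReader source opens the file:
                 * final Path file = split.getPath();
                 * final FileSystem fs =
                 * fileIn =  fs.open(file);
                 */
                byte[] buf;
                FSDataInputStream infile = null;
                try {
                    Path path = split.getPath();
                    FileSystem fileSystem = path.getFileSystem(conf);
                    infile = fileSystem.open(path);
                    int filelength = (int) split.getLength();
                    buf = new byte[filelength];
                    IOUtils.readFully(infile, buf, 0, filelength);
                    value.set(buf, 0, buf.length);
                } catch (Exception e) {
                    LOG.error("failed to read " + split.getPath(), e);
                    return false;
                } finally {
                    IOUtils.closeStream(infile);
                }
                position = true;
                return true;
            }
            return false;
        }
    }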

References:

Commonly used InputFormat and OutputFormat classes in Hadoop
http://blog.163.com/jiayouweijiewj@126/blog/static/171232177201162155746991/

Using MultipleInputs in Hadoop so that map can read files in different formats
http://blog.csdn.net/nwpuwyk/article/details/42002503

0 0