Hadoop InputFormat

Source: Internet. Editor: 程序博客网. Time: 2024/05/29 17:40

splits and records

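An input split is a chunk of the input that is processed by a single map. Each map processes exactly one split. Each split is divided into records, and each record is a key/value pair; the map processes the records one at a time. Splits and records are both logical concepts.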

/**
 * <code>InputSplit</code> represents the data to be processed by an
 * individual {@link Mapper}.
 *
 * <p>Typically, it presents a byte-oriented view on the input and is the
 * responsibility of {@link RecordReader} of the job to process this and present
 * a record-oriented view.
 *
 * @see InputFormat
 * @see RecordReader
 */
@InterfaceAudience.Public
@InterfaceStability.Stable
public abstract class InputSplit {
  /**
   * Get the size of the split, so that the input splits can be sorted by size.
   * @return the number of bytes in the split
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract long getLength() throws IOException, InterruptedException;

  /**
   * Get the list of nodes by name where the data for the split would be local.
   * The locations do not need to be serialized.
   *
   * @return a new array of the node nodes.
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract
    String[] getLocations() throws IOException, InterruptedException;

  /**
   * Gets info about which nodes the input split is stored on and how it is
   * stored at each location.
   *
   * @return list of <code>SplitLocationInfo</code>s describing how the split
   *    data is stored at each location. A null value indicates that all the
   *    locations have the data stored on disk.
   * @throws IOException
   */
  @Evolving
  public SplitLocationInfo[] getLocationInfo() throws IOException {
    return null;
  }
}

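In Java, a split is represented by an abstract class. An InputSplit has a length in bytes and a set of storage locations. A split does not contain the data itself; it is a reference to the data. The MapReduce system uses the storage locations to place map tasks as close to the split's data as possible, and uses the size to order the splits so that the largest are processed first, in an attempt to minimize the job's running time.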
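The largest-first ordering described above can be sketched in plain Java. SplitInfo and SplitOrderingDemo are hypothetical stand-ins written for this illustration, not Hadoop classes; the sketch only assumes that a split exposes a length and a list of locations, as InputSplit does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for InputSplit: a reference to data, never the data itself.
final class SplitInfo {
    final String path;        // where the data lives
    final long length;        // size of the split in bytes
    final String[] locations; // hosts on which the data is local

    SplitInfo(String path, long length, String... locations) {
        this.path = path;
        this.length = length;
        this.locations = locations;
    }
}

class SplitOrderingDemo {
    // Returns a copy of the splits ordered from largest to smallest.
    static List<SplitInfo> largestFirst(List<SplitInfo> splits) {
        List<SplitInfo> sorted = new ArrayList<>(splits);
        sorted.sort(Comparator.comparingLong((SplitInfo s) -> s.length).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        List<SplitInfo> splits = Arrays.asList(
            new SplitInfo("/data/a", 10L, "node1"),
            new SplitInfo("/data/b", 30L, "node2"),
            new SplitInfo("/data/c", 20L, "node1", "node3"));
        for (SplitInfo s : largestFirst(splits)) {
            System.out.println(s.path + " " + s.length + " " + Arrays.toString(s.locations));
        }
    }
}
```

Scheduling the largest splits first means the longest-running tasks start early, so they are less likely to be the last ones holding up the job.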

An InputFormat is responsible for creating the InputSplits and dividing them into records.

public abstract class InputFormat<K, V> {

  /**
   * Logically split the set of input files for the job.
   *
   * <p>Each {@link InputSplit} is then assigned to an individual {@link Mapper}
   * for processing.</p>
   *
   * <p><i>Note</i>: The split is a <i>logical</i> split of the inputs and the
   * input files are not physically split into chunks. For e.g. a split could
   * be <i>&lt;input-file-path, start, offset&gt;</i> tuple. The InputFormat
   * also creates the {@link RecordReader} to read the {@link InputSplit}.
   *
   * @param context job configuration.
   * @return an array of {@link InputSplit}s for the job.
   */
  public abstract
    List<InputSplit> getSplits(JobContext context
                               ) throws IOException, InterruptedException;

  /**
   * Create a record reader for a given split. The framework will call
   * {@link RecordReader#initialize(InputSplit, TaskAttemptContext)} before
   * the split is used.
   * @param split the split to be read
   * @param context the information about the task
   * @return a new record reader
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract
    RecordReader<K,V> createRecordReader(InputSplit split,
                                         TaskAttemptContext context
                                        ) throws IOException,
                                                 InterruptedException;
}

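The client that runs the job calculates the splits by calling getSplits() and then sends them to the application master. A map task obtains a RecordReader by calling createRecordReader() on the InputFormat object. A RecordReader is little more than an iterator over records: the map task uses it to generate record key/value pairs, which it then passes to the map function.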

  /**
   * Expert users can override this method for more complete control over the
   * execution of the Mapper.
   * @param context
   * @throws IOException
   */
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
      while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
      }
    } finally {
      cleanup(context);
    }
  }

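Looking at Mapper's run() method, we can see that after setup() has run, nextKeyValue() is called repeatedly on the Context to produce the key and value objects for the mapper. The key and value are retrieved from the RecordReader through the Context and passed to map(). When the reader reaches the end of the stream, nextKeyValue() returns false, and the map task runs its cleanup() method and then completes.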
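To make the iterator-like contract concrete, here is a minimal stand-alone sketch of the same loop shape. ListRecordReader and RunLoopDemo are hypothetical classes invented for this illustration; they borrow only the method names nextKeyValue(), getCurrentKey() and getCurrentValue() from the real RecordReader, and they use the record index as the key purely as an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified stand-in for a RecordReader over an in-memory list.
class ListRecordReader {
    private final List<String> records;
    private int index = -1; // position of the current record; -1 before the first call

    ListRecordReader(List<String> records) {
        this.records = records;
    }

    // Advances to the next record; returns false once the input is exhausted.
    boolean nextKeyValue() {
        if (index + 1 >= records.size()) {
            return false;
        }
        index++;
        return true;
    }

    Integer getCurrentKey() {
        return index; // assumption: the key is simply the record index
    }

    String getCurrentValue() {
        return records.get(index);
    }
}

class RunLoopDemo {
    // Same shape as Mapper.run(): setup, one map call per record, cleanup in finally.
    static List<String> run(ListRecordReader reader) {
        List<String> log = new ArrayList<>();
        log.add("setup");
        try {
            while (reader.nextKeyValue()) {
                log.add("map(" + reader.getCurrentKey() + ", " + reader.getCurrentValue() + ")");
            }
        } finally {
            log.add("cleanup");
        }
        return log;
    }

    public static void main(String[] args) {
        for (String line : run(new ListRecordReader(Arrays.asList("a", "b")))) {
            System.out.println(line);
        }
    }
}
```

Because cleanup sits in a finally block, it runs even if processing a record throws, which matches the structure of the run() method quoted above.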

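Mapper's run() method is public and may be customized by users. MultithreadedMapper is an implementation that runs mappers concurrently in a configurable number of threads (set by mapreduce.mapper.multithreadedmapper.threads). For most data processing tasks it offers no advantage over the default implementation. However, for mappers that spend a long time processing each record, for example because they contact external servers, it allows multiple mappers to run in one JVM with little contention.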

FileInputFormat

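FileInputFormat is the base class for all InputFormat implementations that use files as their data source. It provides two things: a place to define which files are included as the input to a job, and an implementation for generating splits for the input files. The job of dividing splits into records is performed by subclasses.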


FileInputFormat input paths

FileInputFormat provides four static methods for setting a job's input paths:

  /**
   * Sets the given comma separated paths as the list of inputs
   * for the map-reduce job.
   *
   * @param job the job
   * @param commaSeparatedPaths Comma separated paths to be set as
   *        the list of inputs for the map-reduce job.
   */
  public static void setInputPaths(Job job,
                                   String commaSeparatedPaths
                                   ) throws IOException {
    setInputPaths(job, StringUtils.stringToPath(
                        getPathStrings(commaSeparatedPaths)));
  }

  /**
   * Add the given comma separated paths to the list of inputs for
   *  the map-reduce job.
   *
   * @param job The job to modify
   * @param commaSeparatedPaths Comma separated paths to be added to
   *        the list of inputs for the map-reduce job.
   */
  public static void addInputPaths(Job job,
                                   String commaSeparatedPaths
                                   ) throws IOException {
    for (String str : getPathStrings(commaSeparatedPaths)) {
      addInputPath(job, new Path(str));
    }
  }

  /**
   * Set the array of {@link Path}s as the list of inputs
   * for the map-reduce job.
   *
   * @param job The job to modify
   * @param inputPaths the {@link Path}s of the input directories/files
   * for the map-reduce job.
   */
  public static void setInputPaths(Job job,
                                   Path... inputPaths) throws IOException {
    Configuration conf = job.getConfiguration();
    Path path = inputPaths[0].getFileSystem(conf).makeQualified(inputPaths[0]);
    StringBuffer str = new StringBuffer(StringUtils.escapeString(path.toString()));
    for(int i = 1; i < inputPaths.length;i++) {
      str.append(StringUtils.COMMA_STR);
      path = inputPaths[i].getFileSystem(conf).makeQualified(inputPaths[i]);
      str.append(StringUtils.escapeString(path.toString()));
    }
    conf.set(INPUT_DIR, str.toString());
  }

  /**
   * Add a {@link Path} to the list of inputs for the map-reduce job.
   *
   * @param job The {@link Job} to modify
   * @param path {@link Path} to be added to the list of inputs for
   *            the map-reduce job.
   */
  public static void addInputPath(Job job,
                                  Path path) throws IOException {
    Configuration conf = job.getConfiguration();
    path = path.getFileSystem(conf).makeQualified(path);
    String dirStr = StringUtils.escapeString(path.toString());
    String dirs = conf.get(INPUT_DIR);
    conf.set(INPUT_DIR, dirs == null ? dirStr : dirs + "," + dirStr);
  }

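The addInputPath() and addInputPaths() methods add one or more paths to the list of inputs.
The setInputPaths() methods set the entire list of paths in one go, replacing any paths set previously.
A path may represent a file, a directory, or a glob, that is, a collection of files and directories.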

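The contents of a directory specified as an input path are not processed recursively. In fact, the directory should contain only files: a subdirectory is interpreted as a file, which causes an error. One way to handle this is to use a file glob or a filter that selects only the files in the directory by name. Alternatively, set mapreduce.input.fileinputformat.input.dir.recursive to true to force the input directory to be read recursively.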

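The add and set methods specify which files to include. To exclude certain files from the input, set a filter with the setInputPathFilter() method. Even if no filter is set, FileInputFormat uses a default filter that excludes hidden files (those whose names begin with a dot or an underscore). A filter set through setInputPathFilter() acts in addition to the default filter, so a custom filter only ever sees non-hidden files.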
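The interaction between the default filter and a user-supplied one can be modelled with ordinary Java predicates. This is a simplified sketch rather than Hadoop's PathFilter API: InputFilterDemo, HIDDEN_FILE_FILTER and accepted() are hypothetical names, and file names are plain strings here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

class InputFilterDemo {
    // Models the default behaviour: reject names starting with "_" or ".".
    static final Predicate<String> HIDDEN_FILE_FILTER =
        name -> !name.startsWith("_") && !name.startsWith(".");

    // A user filter is applied on top of the default filter, never instead of it.
    static List<String> accepted(List<String> names, Predicate<String> userFilter) {
        Predicate<String> combined =
            userFilter == null ? HIDDEN_FILE_FILTER : HIDDEN_FILE_FILTER.and(userFilter);
        return names.stream().filter(combined).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> names =
            Arrays.asList("part-00000", "_SUCCESS", ".part-00000.crc", "notes.tmp");
        // Hypothetical user filter that excludes temporary files.
        Predicate<String> noTmp = name -> !name.endsWith(".tmp");
        System.out.println(accepted(names, null));
        System.out.println(accepted(names, noTmp));
    }
}
```

Since the default predicate is evaluated first in the and() chain, the user filter is never called for hidden names, which mirrors the statement that a custom filter only sees non-hidden files.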

Paths and filters can also be set through configuration properties.

FileInputFormat input splits

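FileInputFormat splits only large files, where "large" means larger than an HDFS block. The split size is normally the size of an HDFS block, but this value can be changed by setting various Hadoop properties.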


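Applications may impose a minimum input split size. Setting it to a value larger than the HDFS block size forces splits to be larger than a block. There is no good reason to do this when the data is in HDFS, because it increases the number of blocks that are not local to the map task. The maximum split size defaults to the largest value that a Java long can represent; it has an effect only when it is smaller than the block size, in which case it forces splits to be smaller than a block.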

  protected long computeSplitSize(long blockSize, long minSize,
                                  long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
  }

By default,

 minSize < blockSize < maxSize

so the split size is blockSize.
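A short worked example shows how the three values interact. The formula is copied from the computeSplitSize() method quoted above; the 128 MB block size and the class name SplitSizeDemo are assumptions made for illustration. In a real job the minimum and maximum are normally taken from the mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize properties.

```java
class SplitSizeDemo {
    static final long MB = 1024L * 1024L;

    // Same formula as FileInputFormat.computeSplitSize().
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128 * MB; // assumed block size, for illustration only

        // Defaults: tiny minimum, effectively unlimited maximum, so the block size wins.
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE) / MB);

        // Minimum larger than a block: splits grow beyond a block.
        System.out.println(computeSplitSize(blockSize, 256 * MB, Long.MAX_VALUE) / MB);

        // Maximum smaller than a block: splits shrink below a block.
        System.out.println(computeSplitSize(blockSize, 1L, 64 * MB) / MB);

        // Minimum of Long.MAX_VALUE: no file is ever large enough to be split.
        System.out.println(computeSplitSize(blockSize, Long.MAX_VALUE, Long.MAX_VALUE) == Long.MAX_VALUE);
    }
}
```

Note that the minimum takes precedence: if minSize is larger than maxSize, the outer Math.max() returns minSize.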

Small files and CombineFileInputFormat

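Hadoop works better with a small number of large files than with a large number of small files. One reason is that each split generated by FileInputFormat is all or part of a single file. If the files are very small and there are many of them, each map task processes very little input and there are a great many map tasks, each of which imposes extra overhead.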

CombineFileInputFormat alleviates this problem. Whereas FileInputFormat creates at least one split per file, CombineFileInputFormat packs many files into each split so that each mapper has more data to process.
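The packing idea can be illustrated with a deliberately simplified model. This is not the algorithm CombineFileInputFormat actually uses (the real class also takes node and rack locality into account); CombinePackingDemo and pack() are hypothetical names, and the greedy strategy is an assumption chosen only to show how many small files collapse into few splits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CombinePackingDemo {
    // Greedy model: keep adding files to the current split until the next
    // file would push it past maxSplitSize, then start a new split.
    static List<List<Long>> pack(List<Long> fileSizes, long maxSplitSize) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long size : fileSizes) {
            if (!current.isEmpty() && currentSize + size > maxSplitSize) {
                splits.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(size);
            currentSize += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        // Five small files; one split per file would mean five map tasks.
        List<Long> sizes = Arrays.asList(10L, 20L, 30L, 40L, 50L);
        System.out.println(pack(sizes, 60L));
    }
}
```

In this model a file larger than maxSplitSize still ends up in a split of its own, because a split is never left empty.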

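Of course, it is still a good idea to avoid the many-small-files case where possible, because MapReduce works best when it can operate at the transfer rate of the disks in the cluster, and processing many small files increases the number of seeks needed to run a job. Storing large numbers of small files in HDFS also wastes the namenode's memory. One way to avoid the problem is to merge the small files into one or more larger files with a SequenceFile, using the filename as the key and the file contents as the value.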

Preventing splitting

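Some applications do not want files to be split, so that a single mapper can process each input file in its entirety.
There are two ways to ensure that a file is not split. The first is to make the minimum split size larger than the largest file, or simply to set it to Long.MAX_VALUE. The second is to define a custom InputFormat that overrides isSplitable() to return false, as in the following code: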

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}

Processing a whole file as a record

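Sometimes a mapper needs access to the full contents of a file. Not splitting the file gets you part of the way there, but you still need a RecordReader that delivers the file contents as the value of the record.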

The following example is from Hadoop: The Definitive Guide:

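import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    // Never split: each file must reach a single mapper as a whole.
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        WholeFileRecordReader reader = new WholeFileRecordReader();
        reader.initialize(split, context);
        return reader;
    }
}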

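WholeFileInputFormat uses no key, represented here by NullWritable; the value is the file contents. It defines two methods: isSplitable() returns false so that input files are never split, and createRecordReader() returns a custom RecordReader implementation.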

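import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
    private FileSplit fileSplit;
    private Configuration configuration;
    private BytesWritable value = new BytesWritable();
    private boolean processed = false; // true once the single record has been returned

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        this.fileSplit = (FileSplit) split;
        this.configuration = context.getConfiguration();
    }

    // Reads the whole file into the value on the first call; false afterwards.
    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (!processed) {
            byte[] contents = new byte[(int) fileSplit.getLength()];
            Path file = fileSplit.getPath();
            FileSystem fileSystem = file.getFileSystem(configuration);
            try (FSDataInputStream inputStream = fileSystem.open(file)) {
                IOUtils.readFully(inputStream, contents, 0, contents.length);
                value.set(contents, 0, contents.length);
            }
            processed = true;
            return true;
        }
        return false;
    }

    @Override
    public NullWritable getCurrentKey() throws IOException, InterruptedException {
        return NullWritable.get();
    }

    @Override
    public BytesWritable getCurrentValue() throws IOException, InterruptedException {
        return value;
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return processed ? 1.0f : 0.0f;
    }

    @Override
    public void close() throws IOException {
        // Nothing to release: the stream is closed in nextKeyValue().
    }
}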

WholeFileRecordReader is responsible for turning a FileSplit into a single record whose key is null (NullWritable) and whose value is the contents of the file.

The following job uses this class:

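import java.io.IOException;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SmallFilesToSequenceFile extends Configured implements Tool {
    private final static String INPUT_PATH = "hdfs://hadoop:9000/hadoop/smallfiles";
    private final static String OUT_PATH = "hdfs://hadoop:9000/hadoop/smallfiles-out";

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(this.getConf());
        // Lets the cluster locate the jar that contains the job classes.
        job.setJarByClass(SmallFilesToSequenceFile.class);
        FileInputFormat.addInputPath(job, new Path(INPUT_PATH));
        FileOutputFormat.setOutputPath(job, new Path(OUT_PATH));
        job.setInputFormatClass(WholeFileInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(BytesWritable.class);
        job.setMapperClass(SequenceFileMapper.class);
        return job.waitForCompletion(true) ? 0 : 1;
    }

    static class SequenceFileMapper extends Mapper<NullWritable, BytesWritable, Text, BytesWritable> {
        private Text filenamekey;

        // The split covers exactly one file, so its path is computed once per task.
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            InputSplit split = context.getInputSplit();
            Path path = ((FileSplit) split).getPath();
            filenamekey = new Text(path.toString());
        }

        // Emits (file path, file contents) for the single record in the split.
        @Override
        protected void map(NullWritable key, BytesWritable value, Context context)
                throws IOException, InterruptedException {
            context.write(filenamekey, value);
        }
    }

    public static void main(String[] args) throws Exception {
        int code = ToolRunner.run(new SmallFilesToSequenceFile(), args);
        System.exit(code);
    }
}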

To inspect the result: the output is a sequence file, so it can be viewed with the hadoop fs -text command or read programmatically.
