HDPCD Java Review Notes (6)


Input and Output Formats


Overview of Input and Output Formats




In a MapReduce job, there are two configurable components that determine how data is read and written:

InputFormat -- The InputFormat of a job is responsible for reading the data from an InputSplit and generating a <key, value> pair for the Mapper.

OutputFormat -- The OutputFormat is responsible for writing the <key, value> pairs from the Reducer to an output file.
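
Both components are configured on the Job object in the driver. A minimal driver sketch follows; the class name FormatDemoDriver and the use of the default text formats are illustrative, not from the original notes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class FormatDemoDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "format demo");
        job.setJarByClass(FormatDemoDriver.class);
        // The InputFormat turns the input data into <key, value> pairs for the Mapper
        job.setInputFormatClass(TextInputFormat.class);
        // The OutputFormat writes the Reducer's <key, value> pairs to the output directory
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}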


The Built-in Input Formats

FileInputFormat<K,V> -- The parent class of all input formats that read data from files.

TextInputFormat<LongWritable,Text> -- The default InputFormat.

SequenceFileInputFormat<K,V> -- For reading data from a sequence file.

KeyValueTextInputFormat<Text,Text> -- Reads each line of data as a record; the key is the first token, based on a configurable delimiter.

CombineFileInputFormat<K,V> -- For controlling input splits by packing many (typically small) files into each split.

MultipleInputs -- For specifying multiple input paths, with a different InputFormat and Mapper for each path (see the sketch after this list).
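
For example, a minimal MultipleInputs sketch inside a driver where job is the Job being configured; the paths and the LogMapper/CustomerMapper classes are hypothetical:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Each input path gets its own InputFormat and Mapper.
// LogMapper and CustomerMapper stand in for Mapper subclasses.
MultipleInputs.addInputPath(job, new Path("/data/logs"),
        TextInputFormat.class, LogMapper.class);
MultipleInputs.addInputPath(job, new Path("/data/customers"),
        KeyValueTextInputFormat.class, CustomerMapper.class);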


A custom InputFormat is a great tool when working with custom data types.


Understanding InputFormats


The InputFormat class has two methods:

getSplits -- Determines the input splits.

createRecordReader -- Provides a RecordReader instance for iterating through an input split and generating <key, value> pairs.


Determining Input Splits


To write a custom InputFormat, extend the FileInputFormat class and let it determine the input splits, allowing you to worry only about the RecordReader instance.
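
A minimal sketch of that approach, reusing the CustomerKey, Customer, and CustomerReader types defined later in this section (the class name CustomerInputFormat is illustrative):

import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class CustomerInputFormat extends FileInputFormat<CustomerKey, Customer> {
    // getSplits is inherited from FileInputFormat, so only the reader is defined here
    @Override
    public RecordReader<CustomerKey, Customer> createRecordReader(InputSplit split,
            TaskAttemptContext context) {
        return new CustomerReader();
    }
}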


Defining a RecordReader



To understand how a RecordReader works, it helps to look at the Hadoop source code of the Mapper's run method, which invokes methods on the RecordReader (via the Context):

public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
}

nextKeyValue -- Invoked to determine whether there is another <key, value> pair.

getCurrentKey and getCurrentValue -- Retrieve the current <key, value> pair.


An example of a custom RecordReader:

public class CustomerReader extends RecordReader<CustomerKey, Customer> {
    private BufferedReader in;
    private FSDataInputStream fsInput;
    private CustomerKey key = new CustomerKey();
    private Customer value = new Customer();
    private long start;
    private long end;
    private long pos;

    public void initialize(InputSplit inputSplit, TaskAttemptContext context)
            throws IOException {
        FileSplit split = (FileSplit) inputSplit;
        Configuration conf = context.getConfiguration();
        Path path = split.getPath();
        FileSystem fs = path.getFileSystem(conf);
        fsInput = fs.open(path);
        in = new BufferedReader(new InputStreamReader(fsInput));
        start = split.getStart();
        end = start + split.getLength();
        // Begin progress tracking at the start of the split
        pos = start;
    }

    public boolean nextKeyValue() throws IOException {
        String line = in.readLine();
        if (line == null) {
            return false;
        }
        // Track how far we have read, for getProgress
        pos += line.length();
        // Parse one comma-delimited customer record
        String[] words = StringUtils.split(line, ',');
        key.setCustomerId(Integer.parseInt(words[0]));
        key.setZipCode(words[4]);
        value.setFirst(words[1]);
        value.setLast(words[2]);
        value.setStreetAddress(words[3]);
        return true;
    }

    public CustomerKey getCurrentKey() {
        return key;
    }

    public Customer getCurrentValue() {
        return value;
    }

    public float getProgress() {
        return Math.min(1.0f, (pos - start) / (float) (end - start));
    }

    public void close() throws IOException {
        fsInput.close();
    }
}


Handling Records that Span Splits


Consider the CustomerReader defined previously: it does not handle records that span splits. To fix this, the CustomerReader needs to keep track of where it is reading within the split.

public void initialize(InputSplit inputSplit, TaskAttemptContext context)
        throws IOException {
    FileSplit split = (FileSplit) inputSplit;
    Configuration conf = context.getConfiguration();
    Path path = split.getPath();
    FileSystem fs = path.getFileSystem(conf);
    fsInput = fs.open(path);
    // Hang on to the start and end of the split
    start = split.getStart();
    end = start + split.getLength();
    // Seek to the start of the split
    fsInput.seek(start);
    // 'in' is now an org.apache.hadoop.util.LineReader field, which can read a
    // line into a Text and report the number of bytes consumed
    in = new LineReader(fsInput, conf);
    // If we are in the middle of the file, skip the first line,
    // since it is the tail of a record owned by the previous split
    if (start != 0) {
        start += in.readLine(new Text(), 0,
                (int) Math.min(Integer.MAX_VALUE, end - start));
    }
    currentPos = start;
}

public boolean nextKeyValue() throws IOException {
    // Stop once we have read past the end of the split; the last record is
    // allowed to run one line beyond it, into the next split
    if (currentPos > end) {
        return false;
    }
    // 'line' is a Text field; readLine returns the number of bytes consumed
    currentPos += in.readLine(line);
    if (line.getLength() == 0) {
        return false;
    }
    // ...remainder of nextKeyValue method (parse 'line' into key and value)
}


Processing Many Small Files

CombineFileInputFormat packs many small files into each input split, and a CombineFileRecordReader delegates each file in the split to a per-file RecordReader:

public class CustomerCombineFileInputFormat
        extends CombineFileInputFormat<CustomerKey, Customer> {
    public RecordReader<CustomerKey, Customer> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException {
        return new CombineFileRecordReader<CustomerKey, Customer>(
                (CombineFileSplit) split, context, CustomerCombineReader.class);
    }
}
public class CustomerCombineReader extends RecordReader<CustomerKey, Customer> {
    private int index;
    private CustomerReader in;

    // CombineFileRecordReader passes in the index of the current file
    // within the combined split
    public CustomerCombineReader(CombineFileSplit split, TaskAttemptContext context,
            Integer index) throws IOException {
        this.index = index;
        in = new CustomerReader();
    }

    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException {
        CombineFileSplit cfsplit = (CombineFileSplit) split;
        // Build a FileSplit for just the one file this reader is responsible for,
        // then delegate to the ordinary CustomerReader
        FileSplit fileSplit = new FileSplit(cfsplit.getPath(index),
                cfsplit.getOffset(index), cfsplit.getLength(index),
                cfsplit.getLocations());
        in.initialize(fileSplit, context);
    }

    public boolean nextKeyValue() throws IOException {
        return in.nextKeyValue();
    }

    public CustomerKey getCurrentKey() {
        return in.getCurrentKey();
    }

    public Customer getCurrentValue() {
        return in.getCurrentValue();
    }

    public float getProgress() {
        return in.getProgress();
    }

    public void close() throws IOException {
        in.close();
    }
}

To make this combine-file input work, the size of the input splits must be specified when the job is run.

There are two options:

1. Use the setMaxSplitSize, setMinSplitSizeNode, and setMinSplitSizeRack methods of the CombineFileInputFormat class to specify a split size range.

2. Set the mapreduce.input.fileinputformat.split.maxsize, mapreduce.input.fileinputformat.split.minsize.per.node, and mapreduce.input.fileinputformat.split.minsize.per.rack properties to the desired input split sizes.

For example, option 1 applied in the input format's constructor:

public class CustomerCombineFileInputFormat
        extends CombineFileInputFormat<CustomerKey, Customer> {
    public CustomerCombineFileInputFormat() {
        setMaxSplitSize(67108864); // 64MB
    }
}
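
And a sketch of option 2, setting the same limits through driver-side configuration properties (the byte values here are just examples):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
// All three properties take sizes in bytes; 67108864 bytes = 64MB
conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 67108864L);
conf.setLong("mapreduce.input.fileinputformat.split.minsize.per.node", 33554432L);
conf.setLong("mapreduce.input.fileinputformat.split.minsize.per.rack", 33554432L);
Job job = Job.getInstance(conf, "combine small files");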


The Built-in Output Formats

FileOutputFormat<K,V> -- The abstract parent class of output formats that write to a file.

TextOutputFormat<K,V> -- For writing text; this is the default OutputFormat of a MapReduce job.

SequenceFileOutputFormat<K,V> -- For generating sequence files.

MultipleOutputs<K,V> -- For sending output to multiple destinations.

NullOutputFormat<K,V> -- Sends all output to /dev/null, which essentially means no output is generated.

LazyOutputFormat<K,V> -- The output file does not get created until a call to write. Useful if write might never be called and users do not want an empty file generated (see the sketch after this list).
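
As a one-line illustration of the last entry, LazyOutputFormat wraps another OutputFormat in the driver (a sketch, assuming a standard job variable):

import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Output files are only created once write() is actually called
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);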


Writing a Custom OutputFormat

The steps for writing a custom OutputFormat look similar to writing a custom InputFormat:

1. Write a class that extends OutputFormat; if writing output to files, this is typically accomplished by extending FileOutputFormat.

2. Implement the getRecordWriter method, which needs to return a RecordWriter instance.

3. Write a class that extends RecordWriter and define the write method.

4. The write method is invoked for each <key, value> pair.


public class CustomerOutputFormat extends FileOutputFormat<CustomerKey, Customer> {
    @Override
    public RecordWriter<CustomerKey, Customer> getRecordWriter(TaskAttemptContext context)
            throws IOException, InterruptedException {
        // Create a file to write the output to
        Path outputDir = FileOutputFormat.getOutputPath(context);
        Path file = new Path(outputDir.getName() + "/" + "Customers_"
                + context.getTaskAttemptID().getTaskID());
        FileSystem fs = file.getFileSystem(context.getConfiguration());
        FSDataOutputStream fileOut = fs.create(file);
        // Return the RecordWriter
        return new CustomerRecordWriter(fileOut);
    }
}
public class CustomerRecordWriter extends RecordWriter<CustomerKey, Customer> {
    private PrintWriter out;

    public CustomerRecordWriter(DataOutputStream fileOut) {
        out = new PrintWriter(fileOut);
    }

    @Override
    public void write(CustomerKey key, Customer value) {
        // Use a String "\t", not the char '\t': adding a char to the int
        // customer id would perform integer arithmetic instead of concatenation
        out.println(key.getCustomerId() + "\t" + key.getZipCode());
    }

    @Override
    public void close(TaskAttemptContext context) {
        out.close();
    }
}
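
To wire the custom format into a job, the driver registers it (a one-line sketch, assuming a standard job variable):

job.setOutputFormatClass(CustomerOutputFormat.class);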


The MultipleOutputs Class

The MultipleOutputs class lets a single Mapper or Reducer send records to several named outputs. For example:

public class CustomerReducer extends Reducer<CustomerKey, Customer, Text, Text> {
    private MultipleOutputs<Text, Text> outs;
    private Text outputKey = new Text();
    private Text outputValue = new Text();

    @Override
    protected void setup(Context context)
            throws IOException, InterruptedException {
        outs = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void reduce(CustomerKey key, Iterable<Customer> values,
            Context context) throws IOException, InterruptedException {
        // Iterate over all values for this key with a single iterator
        for (Customer e : values) {
            // The customer id lives on the key (see CustomerKey above);
            // Text.set takes a String, so convert the int id
            outputKey.set(String.valueOf(key.getCustomerId()));
            outputValue.set(e.getLast());
            outs.write("lastnames", outputKey, outputValue, "outputPath1");
            outputValue.set(e.getStreetAddress());
            outs.write("addresses", outputKey, outputValue, "outputPath2");
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // MultipleOutputs must be closed so its writers flush their files
        outs.close();
    }
}
The "lastnames" output will contain the last name of each customer.

The "addresses" output will contain the street address of each customer.
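
Note that each named output used by outs.write must first be declared in the driver. A sketch, assuming a standard job variable and the Text/Text output types of the reducer above:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// The names must match the first argument of outs.write(...) in the reducer
MultipleOutputs.addNamedOutput(job, "lastnames", TextOutputFormat.class,
        Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, "addresses", TextOutputFormat.class,
        Text.class, Text.class);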