Hadoop: multiple outputs from reduce


Adapted from: Hadoop in Action

In Hadoop, if you want reduce to support multiple outputs, there are two ways to implement it.

The first is to subclass MultipleTextOutputFormat and override the generateFileNameForKeyValue method.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

public static class PartitionByCountryMTOF
        extends MultipleTextOutputFormat<Text, Text> {

    protected String generateFileNameForKeyValue(Text key,
            Text value, String filename) {
        // The record is a CSV line whose fifth field is a quoted country code.
        String[] arr = value.toString().split(",", -1);
        String country = arr[4].substring(1, 3);
        // Route the record into a per-country subdirectory.
        return country + "/" + filename;
    }
}

public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    JobConf job = new JobConf(conf, MultiFile.class);

    Path in = new Path(args[0]);
    Path out = new Path(args[1]);
    FileInputFormat.setInputPaths(job, in);
    FileOutputFormat.setOutputPath(job, out);

    job.setJobName("MultiFile");
    job.setMapperClass(MapClass.class);
    job.setInputFormat(TextInputFormat.class);
    job.setOutputFormat(PartitionByCountryMTOF.class);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    job.setNumReduceTasks(0);

    JobClient.runJob(job);
    return 0;
}
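To see what the path logic produces, here is a minimal, self-contained sketch; the sample record is invented for illustration, and it assumes the fifth comma-separated field holds a quoted two-letter country code:

public class FileNameDemo {
    public static void main(String[] args) {
        // Invented sample record: the fifth field is a quoted country code.
        String value = "2008,10,30,Some City,\"US\",extra";
        String filename = "part-00000"; // the default leaf file name
        String[] arr = value.split(",", -1);
        String country = arr[4].substring(1, 3); // skip the opening quote, keep 2 chars
        System.out.println(country + "/" + filename); // prints: US/part-00000
    }
}

Because the returned string is treated as a path relative to the job's output directory, every record with country "US" lands in a US/ subdirectory.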
The limitation of this approach is obvious: the output file is determined per record, and each record can be sent to only one output file. If we need to write the same record to several files at once, it cannot do that. In that case we can use the MultipleOutputs class:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.MultipleOutputs;
import org.apache.hadoop.util.Tool;

public class MultiFile extends Configured implements Tool {

    public static class MapClass extends MapReduceBase
            implements Mapper<LongWritable, Text, NullWritable, Text> {

        private MultipleOutputs mos;
        private OutputCollector<NullWritable, Text> collector;

        public void configure(JobConf conf) {
            mos = new MultipleOutputs(conf);
        }

        public void map(LongWritable key, Text value,
                OutputCollector<NullWritable, Text> output,
                Reporter reporter) throws IOException {
            String[] arr = value.toString().split(",", -1);
            String chrono = arr[0] + "," + arr[1] + "," + arr[2];
            String geo = arr[0] + "," + arr[4] + "," + arr[5];
            // Write the same input record to two different named outputs.
            collector = mos.getCollector("chrono", reporter);
            collector.collect(NullWritable.get(), new Text(chrono));
            collector = mos.getCollector("geo", reporter);
            collector.collect(NullWritable.get(), new Text(geo));
        }

        public void close() throws IOException {
            mos.close();
        }
    }

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        JobConf job = new JobConf(conf, MultiFile.class);

        Path in = new Path(args[0]);
        Path out = new Path(args[1]);
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);

        job.setJobName("MultiFile");
        job.setMapperClass(MapClass.class);
        job.setInputFormat(TextInputFormat.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(0);

        // Register the named outputs used in the mapper.
        MultipleOutputs.addNamedOutput(job, "chrono",
                TextOutputFormat.class, NullWritable.class, Text.class);
        MultipleOutputs.addNamedOutput(job, "geo",
                TextOutputFormat.class, NullWritable.class, Text.class);

        JobClient.runJob(job);
        return 0;
    }
}

Internally, this class maintains a <name, OutputCollector> map. We register each named output in the job configuration, and then in the map or reduce method we fetch the corresponding collector with getCollector and call collect on it.
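The example above drives MultipleOutputs from a mapper; for completeness, here is a minimal reducer-side sketch following the same pattern (the class is illustrative and assumes the "geo" named output was registered with addNamedOutput as in run() above; it needs the same imports plus java.util.Iterator): create the MultipleOutputs instance in configure(), fetch the named collector in reduce(), and close everything in close().

public static class ReduceClass extends MapReduceBase
        implements Reducer<Text, Text, NullWritable, Text> {

    private MultipleOutputs mos;

    public void configure(JobConf conf) {
        mos = new MultipleOutputs(conf);
    }

    public void reduce(Text key, Iterator<Text> values,
            OutputCollector<NullWritable, Text> output,
            Reporter reporter) throws IOException {
        // Fetch the collector for the named output registered in run().
        OutputCollector<NullWritable, Text> collector =
                mos.getCollector("geo", reporter);
        while (values.hasNext()) {
            collector.collect(NullWritable.get(), values.next());
        }
    }

    public void close() throws IOException {
        mos.close(); // flush and close all named outputs
    }
}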

One final caveat: if all of the reduce output goes through named collectors, the framework's counters will report reduce output records = 0, even though output was in fact written normally.
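If you do want per-output record counts, the old-API MultipleOutputs can be asked to maintain its own counters, one per named output; a minimal sketch, to be called in run() before submitting the job:

// Ask MultipleOutputs to track a record counter for each named
// output ("chrono", "geo") in addition to the framework counters.
MultipleOutputs.setCountersEnabled(job, true);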
