MapReduce Example: Using ChainMapper


According to the API documentation:

/**
 * The ChainMapper class allows the use of multiple Mapper classes within a
 * single Map task.
 *
 * The Mapper classes are invoked in a chained (or piped) fashion: the output of
 * the first becomes the input of the second, and so on until the last Mapper;
 * the output of the last Mapper will be written to the task's output.
 *
 * The key functionality of this feature is that the Mappers in the chain do not
 * need to be aware that they are executed in a chain. This enables having
 * reusable specialized Mappers that can be combined to perform composite
 * operations within a single task.
 *
 * Special care has to be taken when creating chains: the key/values output
 * by a Mapper must be valid for the following Mapper in the chain. It is
 * assumed all Mappers and the Reduce in the chain use matching output and
 * input key and value classes, as no conversion is done by the chaining code.
 *
 * Using the ChainMapper and the ChainReducer classes it is possible to compose
 * Map/Reduce jobs that look like [MAP+ / REDUCE MAP*]. An immediate benefit
 * of this pattern is a dramatic reduction in disk IO.
 *
 * IMPORTANT: There is no need to specify the output key/value classes for the
 * ChainMapper; this is done by the addMapper for the last mapper in the chain.
 */
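The [MAP+ / REDUCE MAP*] notation means: one or more chained mappers, exactly one reducer, then zero or more mappers run inside the same reduce task. The post-reduce mappers are attached with ChainReducer.addMapper. Below is a minimal sketch of such a composition using the same old-style mapred API as the full example that follows; AMap, TheReducer, and PostMap are hypothetical classes, and each element's input key/value classes must match the previous element's output classes, since the chaining code performs no conversion.

JobConf job = new JobConf(getConf());

// MAP+: one or more pre-reduce mappers, chained in the map task.
ChainMapper.addMapper(job, AMap.class,
    LongWritable.class, Text.class,    // AMap's input key/value classes
    Text.class, IntWritable.class,     // AMap's output key/value classes
    true, new JobConf(false));

// REDUCE: exactly one reducer.
ChainReducer.setReducer(job, TheReducer.class,
    Text.class, IntWritable.class,
    Text.class, IntWritable.class,
    true, new JobConf(false));

// MAP*: zero or more post-reduce mappers, run in the reduce task after
// the reducer; the last one's output goes to the job's output.
ChainReducer.addMapper(job, PostMap.class,
    Text.class, IntWritable.class,
    Text.class, IntWritable.class,
    true, new JobConf(false));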

Example code:

package com.joey.mapred.chainjobs;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ChainJobs extends Configured implements Tool {

  // First mapper in the chain: splits each input line into tokens
  // and emits (word, 1).
  public static class TokenizerMapper extends MapReduceBase implements
      Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  // Second mapper in the chain: consumes the (word, 1) pairs produced by
  // TokenizerMapper and re-emits each pair with the key uppercased.
  public static class UppercaseMapper extends MapReduceBase implements
      Mapper<Text, IntWritable, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Text key, IntWritable value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      String line = key.toString();
      word.set(line.toUpperCase());
      output.collect(word, one);
    }
  }

  // Reducer: sums the counts for each (uppercased) word.
  public static class Reduce extends MapReduceBase implements
      Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public int run(String[] args) throws IOException {
    Configuration conf = getConf();
    JobConf job = new JobConf(conf);
    job.setJarByClass(ChainJobs.class);
    job.setJobName("TestforChainJobs");

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Chain the two mappers; each addMapper call declares that mapper's
    // input and output key/value classes.
    JobConf map1Conf = new JobConf(false);
    ChainMapper.addMapper(job, TokenizerMapper.class, LongWritable.class, Text.class,
        Text.class, IntWritable.class, true, map1Conf);

    JobConf map2Conf = new JobConf(false);
    ChainMapper.addMapper(job, UppercaseMapper.class, Text.class, IntWritable.class,
        Text.class, IntWritable.class, true, map2Conf);

    JobConf reduceConf = new JobConf(false);
    ChainReducer.setReducer(job, Reduce.class, Text.class, IntWritable.class,
        Text.class, IntWritable.class, true, reduceConf);

    JobClient.runJob(job);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new ChainJobs(), args);
    System.exit(res);
  }
}
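A note on the boolean argument passed to addMapper and setReducer: it is the byValue flag. With true, keys and values are passed between chain elements by value (serialized copies), so one element may safely reuse or modify objects without affecting the next element; with false they are passed by reference, which avoids the copies but is only safe if no element holds onto or alters a key/value after emitting it. To run the job, package the class into a jar and launch it with hadoop jar, passing the input and output paths as args[0] and args[1] (the jar name here is hypothetical):

hadoop jar chainjobs.jar com.joey.mapred.chainjobs.ChainJobs <input-dir> <output-dir>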


Input data:

BROWN CORPUS
A Standard Corpus of Present-Day Edited American
English, for use with Digital Computers.
by W. N. Francis and H. Kucera (1964)
Department of Linguistics, Brown University
Providence, Rhode Island, USA
Revised 1971, Revised and Amplified 1979
http://www.hit.uib.no/icame/brown/bcm.html
Distributed with the permission of the copyright holder,
redistribution permitted.

Output:

(1964)  1
1971,   1
1979    1
A       1
AMERICAN        1
AMPLIFIED       1
AND     2
BROWN   2
BY      1
COMPUTERS.      1
COPYRIGHT       1
CORPUS  2
DEPARTMENT      1
DIGITAL 1
DISTRIBUTED     1
EDITED  1
ENGLISH,        1
FOR     1
FRANCIS 1
H.      1
HOLDER, 1
HTTP://WWW.HIT.UIB.NO/ICAME/BROWN/BCM.HTML      1
ISLAND, 1
KUCERA  1
LINGUISTICS,    1
N.      1
OF      3
PERMISSION      1
PERMITTED.      1
PRESENT-DAY     1
PROVIDENCE,     1
REDISTRIBUTION  1
REVISED 2
RHODE   1
STANDARD        1
THE     2
UNIVERSITY      1
USA     1
USE     1
W.      1
WITH    2
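Tracing the result through the chain: TokenizerMapper emits each raw token with a count of 1, UppercaseMapper re-emits every pair with its key uppercased, and Reduce sums the counts per uppercased key. That is why, for example, OF has a count of 3, and BROWN (appearing once as "BROWN" and once as "Brown" in the input) has a count of 2.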



Run log:

14/01/11 18:52:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/01/11 18:52:10 WARN snappy.LoadSnappy: Snappy native library not loaded
14/01/11 18:52:10 INFO mapred.FileInputFormat: Total input paths to process : 1
14/01/11 18:52:10 INFO mapred.JobClient: Running job: job_201312251053_53092
14/01/11 18:52:11 INFO mapred.JobClient:  map 0% reduce 0%
14/01/11 18:52:15 INFO mapred.JobClient:  map 100% reduce 0%
14/01/11 18:52:23 INFO mapred.JobClient:  map 100% reduce 100%
14/01/11 18:52:23 INFO mapred.JobClient: Job complete: job_201312251053_53092
14/01/11 18:52:23 INFO mapred.JobClient: Counters: 28
14/01/11 18:52:23 INFO mapred.JobClient:   Job Counters
14/01/11 18:52:23 INFO mapred.JobClient:     Launched reduce tasks=1
14/01/11 18:52:23 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=7975
14/01/11 18:52:23 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/01/11 18:52:23 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/01/11 18:52:23 INFO mapred.JobClient:     Rack-local map tasks=3
14/01/11 18:52:23 INFO mapred.JobClient:     Launched map tasks=4
14/01/11 18:52:23 INFO mapred.JobClient:     Data-local map tasks=1
14/01/11 18:52:23 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=8379
14/01/11 18:52:23 INFO mapred.JobClient:   FileSystemCounters
14/01/11 18:52:23 INFO mapred.JobClient:     FILE_BYTES_READ=398
14/01/11 18:52:23 INFO mapred.JobClient:     HDFS_BYTES_READ=1423
14/01/11 18:52:23 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=281090
14/01/11 18:52:23 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=5
14/01/11 18:52:23 INFO mapred.JobClient:   Map-Reduce Framework
14/01/11 18:52:23 INFO mapred.JobClient:     Map input records=15
14/01/11 18:52:23 INFO mapred.JobClient:     Reduce shuffle bytes=416
14/01/11 18:52:23 INFO mapred.JobClient:     Spilled Records=98
14/01/11 18:52:23 INFO mapred.JobClient:     Map output bytes=294
14/01/11 18:52:23 INFO mapred.JobClient:     CPU time spent (ms)=4430
14/01/11 18:52:23 INFO mapred.JobClient:     Total committed heap usage (bytes)=1258291200
14/01/11 18:52:23 INFO mapred.JobClient:     Map input bytes=387
14/01/11 18:52:23 INFO mapred.JobClient:     Combine input records=0
14/01/11 18:52:23 INFO mapred.JobClient:     SPLIT_RAW_BYTES=448
14/01/11 18:52:23 INFO mapred.JobClient:     Reduce input records=49
14/01/11 18:52:23 INFO mapred.JobClient:     Reduce input groups=1
14/01/11 18:52:23 INFO mapred.JobClient:     Combine output records=0
14/01/11 18:52:23 INFO mapred.JobClient:     Physical memory (bytes) snapshot=959954944
14/01/11 18:52:23 INFO mapred.JobClient:     Reduce output records=1
14/01/11 18:52:23 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=4436779008
14/01/11 18:52:23 INFO mapred.JobClient:     Map output records=49



