MapReduce Example: Using ChainMapper


According to the API documentation:

/**
 * The ChainMapper class allows the use of multiple Mapper classes within a
 * single Map task.
 *
 * The Mapper classes are invoked in a chained (or piped) fashion: the output of
 * the first becomes the input of the second, and so on until the last Mapper;
 * the output of the last Mapper will be written to the task's output.
 *
 * The key functionality of this feature is that the Mappers in the chain do not
 * need to be aware that they are executed in a chain. This enables having
 * reusable specialized Mappers that can be combined to perform composite
 * operations within a single task.
 *
 * Special care has to be taken when creating chains: the key/values output
 * by a Mapper must be valid for the following Mapper in the chain. It is
 * assumed all Mappers and the Reduce in the chain use matching output and
 * input key and value classes, as no conversion is done by the chaining code.
 *
 * Using the ChainMapper and the ChainReducer classes it is possible to compose
 * Map/Reduce jobs that look like [MAP+ / REDUCE MAP*]. An immediate benefit
 * of this pattern is a dramatic reduction in disk IO.
 *
 * IMPORTANT: There is no need to specify the output key/value classes for the
 * ChainMapper; this is done by the addMapper for the last mapper in the chain.
 */
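The [MAP+ / REDUCE MAP*] notation means: one or more chained mappers, exactly one reducer, then zero or more mappers run inside the same reduce task. The post-reduce mappers are attached with ChainReducer.addMapper. Below is a minimal sketch of such a composition using the same old-style mapred API as the full example that follows; AMap, TheReducer, and PostMap are hypothetical classes, and each element's input key/value classes must match the previous element's output classes, since the chaining code performs no conversion.

JobConf job = new JobConf(getConf());

// MAP+: one or more pre-reduce mappers, chained in the map task.
ChainMapper.addMapper(job, AMap.class,
    LongWritable.class, Text.class,    // AMap's input key/value classes
    Text.class, IntWritable.class,     // AMap's output key/value classes
    true, new JobConf(false));

// REDUCE: exactly one reducer.
ChainReducer.setReducer(job, TheReducer.class,
    Text.class, IntWritable.class,
    Text.class, IntWritable.class,
    true, new JobConf(false));

// MAP*: zero or more post-reduce mappers, run in the reduce task after
// the reducer; the last one's output goes to the job's output.
ChainReducer.addMapper(job, PostMap.class,
    Text.class, IntWritable.class,
    Text.class, IntWritable.class,
    true, new JobConf(false));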

Example code:

package com.joey.mapred.chainjobs;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ChainJobs extends Configured implements Tool {

  // First mapper in the chain: splits each input line into tokens
  // and emits (word, 1).
  public static class TokenizerMapper extends MapReduceBase implements
      Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  // Second mapper in the chain: consumes the (word, 1) pairs produced by
  // TokenizerMapper and re-emits each pair with the key uppercased.
  public static class UppercaseMapper extends MapReduceBase implements
      Mapper<Text, IntWritable, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Text key, IntWritable value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      String line = key.toString();
      word.set(line.toUpperCase());
      output.collect(word, one);
    }
  }

  // Reducer: sums the counts for each (uppercased) word.
  public static class Reduce extends MapReduceBase implements
      Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public int run(String[] args) throws IOException {
    Configuration conf = getConf();
    JobConf job = new JobConf(conf);
    job.setJarByClass(ChainJobs.class);
    job.setJobName("TestforChainJobs");

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Chain the two mappers; each addMapper call declares that mapper's
    // input and output key/value classes.
    JobConf map1Conf = new JobConf(false);
    ChainMapper.addMapper(job, TokenizerMapper.class, LongWritable.class, Text.class,
        Text.class, IntWritable.class, true, map1Conf);

    JobConf map2Conf = new JobConf(false);
    ChainMapper.addMapper(job, UppercaseMapper.class, Text.class, IntWritable.class,
        Text.class, IntWritable.class, true, map2Conf);

    JobConf reduceConf = new JobConf(false);
    ChainReducer.setReducer(job, Reduce.class, Text.class, IntWritable.class,
        Text.class, IntWritable.class, true, reduceConf);

    JobClient.runJob(job);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new ChainJobs(), args);
    System.exit(res);
  }
}
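A note on the boolean argument passed to addMapper and setReducer: it is the byValue flag. With true, keys and values are passed between chain elements by value (serialized copies), so one element may safely reuse or modify objects without affecting the next element; with false they are passed by reference, which avoids the copies but is only safe if no element holds onto or alters a key/value after emitting it. To run the job, package the class into a jar and launch it with hadoop jar, passing the input and output paths as args[0] and args[1] (the jar name here is hypothetical):

hadoop jar chainjobs.jar com.joey.mapred.chainjobs.ChainJobs <input-dir> <output-dir>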


Input data:

BROWN CORPUS
A Standard Corpus of Present-Day Edited American
English, for use with Digital Computers.
by W. N. Francis and H. Kucera (1964)
Department of Linguistics, Brown University
Providence, Rhode Island, USA
Revised 1971, Revised and Amplified 1979
http://www.hit.uib.no/icame/brown/bcm.html
Distributed with the permission of the copyright holder,
redistribution permitted.

Output:

(1964)  1
1971,   1
1979    1
A       1
AMERICAN        1
AMPLIFIED       1
AND     2
BROWN   2
BY      1
COMPUTERS.      1
COPYRIGHT       1
CORPUS  2
DEPARTMENT      1
DIGITAL 1
DISTRIBUTED     1
EDITED  1
ENGLISH,        1
FOR     1
FRANCIS 1
H.      1
HOLDER, 1
HTTP://WWW.HIT.UIB.NO/ICAME/BROWN/BCM.HTML      1
ISLAND, 1
KUCERA  1
LINGUISTICS,    1
N.      1
OF      3
PERMISSION      1
PERMITTED.      1
PRESENT-DAY     1
PROVIDENCE,     1
REDISTRIBUTION  1
REVISED 2
RHODE   1
STANDARD        1
THE     2
UNIVERSITY      1
USA     1
USE     1
W.      1
WITH    2
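Tracing the result through the chain: TokenizerMapper emits each raw token with a count of 1, UppercaseMapper re-emits every pair with its key uppercased, and Reduce sums the counts per uppercased key. That is why, for example, OF has a count of 3, and BROWN (appearing once as "BROWN" and once as "Brown" in the input) has a count of 2.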



Run log:

14/01/11 18:52:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/01/11 18:52:10 WARN snappy.LoadSnappy: Snappy native library not loaded
14/01/11 18:52:10 INFO mapred.FileInputFormat: Total input paths to process : 1
14/01/11 18:52:10 INFO mapred.JobClient: Running job: job_201312251053_53092
14/01/11 18:52:11 INFO mapred.JobClient:  map 0% reduce 0%
14/01/11 18:52:15 INFO mapred.JobClient:  map 100% reduce 0%
14/01/11 18:52:23 INFO mapred.JobClient:  map 100% reduce 100%
14/01/11 18:52:23 INFO mapred.JobClient: Job complete: job_201312251053_53092
14/01/11 18:52:23 INFO mapred.JobClient: Counters: 28
14/01/11 18:52:23 INFO mapred.JobClient:   Job Counters
14/01/11 18:52:23 INFO mapred.JobClient:     Launched reduce tasks=1
14/01/11 18:52:23 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=7975
14/01/11 18:52:23 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/01/11 18:52:23 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/01/11 18:52:23 INFO mapred.JobClient:     Rack-local map tasks=3
14/01/11 18:52:23 INFO mapred.JobClient:     Launched map tasks=4
14/01/11 18:52:23 INFO mapred.JobClient:     Data-local map tasks=1
14/01/11 18:52:23 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=8379
14/01/11 18:52:23 INFO mapred.JobClient:   FileSystemCounters
14/01/11 18:52:23 INFO mapred.JobClient:     FILE_BYTES_READ=398
14/01/11 18:52:23 INFO mapred.JobClient:     HDFS_BYTES_READ=1423
14/01/11 18:52:23 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=281090
14/01/11 18:52:23 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=5
14/01/11 18:52:23 INFO mapred.JobClient:   Map-Reduce Framework
14/01/11 18:52:23 INFO mapred.JobClient:     Map input records=15
14/01/11 18:52:23 INFO mapred.JobClient:     Reduce shuffle bytes=416
14/01/11 18:52:23 INFO mapred.JobClient:     Spilled Records=98
14/01/11 18:52:23 INFO mapred.JobClient:     Map output bytes=294
14/01/11 18:52:23 INFO mapred.JobClient:     CPU time spent (ms)=4430
14/01/11 18:52:23 INFO mapred.JobClient:     Total committed heap usage (bytes)=1258291200
14/01/11 18:52:23 INFO mapred.JobClient:     Map input bytes=387
14/01/11 18:52:23 INFO mapred.JobClient:     Combine input records=0
14/01/11 18:52:23 INFO mapred.JobClient:     SPLIT_RAW_BYTES=448
14/01/11 18:52:23 INFO mapred.JobClient:     Reduce input records=49
14/01/11 18:52:23 INFO mapred.JobClient:     Reduce input groups=1
14/01/11 18:52:23 INFO mapred.JobClient:     Combine output records=0
14/01/11 18:52:23 INFO mapred.JobClient:     Physical memory (bytes) snapshot=959954944
14/01/11 18:52:23 INFO mapred.JobClient:     Reduce output records=1
14/01/11 18:52:23 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=4436779008
14/01/11 18:52:23 INFO mapred.JobClient:     Map output records=49



