MapReduce Example: Using ChainMapper
According to the API documentation:
/**
 * The ChainMapper class allows the use of multiple Mapper classes within a
 * single Map task.
 *
 * The Mapper classes are invoked in a chained (or piped) fashion: the output
 * of the first becomes the input of the second, and so on until the last
 * Mapper; the output of the last Mapper is written to the task's output.
 *
 * The key functionality of this feature is that the Mappers in the chain do
 * not need to be aware that they are executed in a chain. This enables having
 * reusable specialized Mappers that can be combined to perform composite
 * operations within a single task.
 *
 * Special care has to be taken when creating chains: the key/values output by
 * a Mapper must be valid for the following Mapper in the chain. It is assumed
 * that all Mappers and the Reduce in the chain use matching output and input
 * key and value classes, as no conversion is done by the chaining code.
 *
 * Using the ChainMapper and ChainReducer classes it is possible to compose
 * Map/Reduce jobs that look like [MAP+ / REDUCE MAP*]. An immediate benefit
 * of this pattern is a dramatic reduction in disk IO.
 *
 * IMPORTANT: There is no need to specify the output key/value classes for the
 * ChainMapper; this is done by addMapper for the last mapper in the chain.
 */
Example code:
package com.joey.mapred.chainjobs;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ChainJobs extends Configured implements Tool {

  // First mapper in the chain: splits each input line into words and emits (word, 1).
  public static class TokenizerMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  // Second mapper in the chain: consumes the first mapper's output
  // and re-emits each word in upper case.
  public static class UppercaseMapper extends MapReduceBase
      implements Mapper<Text, IntWritable, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Text key, IntWritable value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      String line = key.toString();
      word.set(line.toUpperCase());
      output.collect(word, one);
    }
  }

  // Reducer: sums the counts for each word.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public int run(String[] args) throws IOException {
    Configuration conf = getConf();
    JobConf job = new JobConf(conf);
    job.setJarByClass(ChainJobs.class);
    job.setJobName("TestforChainJobs");

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Note how each addMapper call declares its own input/output key and
    // value classes; the output classes of one stage must match the input
    // classes of the next.
    JobConf map1Conf = new JobConf(false);
    ChainMapper.addMapper(job, TokenizerMapper.class, LongWritable.class,
        Text.class, Text.class, IntWritable.class, true, map1Conf);

    JobConf map2Conf = new JobConf(false);
    ChainMapper.addMapper(job, UppercaseMapper.class, Text.class,
        IntWritable.class, Text.class, IntWritable.class, true, map2Conf);

    JobConf reduceConf = new JobConf(false);
    ChainReducer.setReducer(job, Reduce.class, Text.class, IntWritable.class,
        Text.class, IntWritable.class, true, reduceConf);

    JobClient.runJob(job);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new ChainJobs(), args);
    System.exit(res);
  }
}
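The chained pipeline is easier to follow when separated from the Hadoop machinery. The following plain-Java sketch (no Hadoop dependencies; the class and method names are my own, chosen for illustration) applies the same three stages in sequence: tokenize each line, uppercase each token, then sum the counts per key, just as the chained job does:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Illustrative, non-Hadoop simulation of the chained pipeline:
// tokenize -> uppercase -> sum counts per key.
public class ChainSimulation {

    // Stage 1 (TokenizerMapper's role): split a line into tokens.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    // Stage 2 (UppercaseMapper's role): uppercase a token.
    static String uppercase(String token) {
        return token.toUpperCase();
    }

    // Reduce stage: count occurrences of each (uppercased) token.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            for (String token : tokenize(line)) {
                counts.merge(uppercase(token), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // "Brown" and "brown" collapse to the same key because the
        // uppercase stage runs before counting.
        System.out.println(count(List.of("Brown Corpus", "brown university")));
    }
}
```

Run over the sample input below, this produces the same kind of result as the job's output, e.g. BROWN mapped to 2.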
Input data:
BROWN CORPUS
A Standard Corpus of Present-Day Edited American
English, for use with Digital Computers.
by W. N. Francis and H. Kucera (1964)
Department of Linguistics, Brown University
Providence, Rhode Island, USA
Revised 1971, Revised and Amplified 1979
http://www.hit.uib.no/icame/brown/bcm.html
Distributed with the permission of the copyright holder,
redistribution permitted.
Output:
(1964) 1
1971, 1
1979 1
A 1
AMERICAN 1
AMPLIFIED 1
AND 2
BROWN 2
BY 1
COMPUTERS. 1
COPYRIGHT 1
CORPUS 2
DEPARTMENT 1
DIGITAL 1
DISTRIBUTED 1
EDITED 1
ENGLISH, 1
FOR 1
FRANCIS 1
H. 1
HOLDER, 1
HTTP://WWW.HIT.UIB.NO/ICAME/BROWN/BCM.HTML 1
ISLAND, 1
KUCERA 1
LINGUISTICS, 1
N. 1
OF 3
PERMISSION 1
PERMITTED. 1
PRESENT-DAY 1
PROVIDENCE, 1
REDISTRIBUTION 1
REVISED 2
RHODE 1
STANDARD 1
THE 2
UNIVERSITY 1
USA 1
USE 1
W. 1
WITH 2
Run log:
14/01/11 18:52:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/01/11 18:52:10 WARN snappy.LoadSnappy: Snappy native library not loaded
14/01/11 18:52:10 INFO mapred.FileInputFormat: Total input paths to process : 1
14/01/11 18:52:10 INFO mapred.JobClient: Running job: job_201312251053_53092
14/01/11 18:52:11 INFO mapred.JobClient:  map 0% reduce 0%
14/01/11 18:52:15 INFO mapred.JobClient:  map 100% reduce 0%
14/01/11 18:52:23 INFO mapred.JobClient:  map 100% reduce 100%
14/01/11 18:52:23 INFO mapred.JobClient: Job complete: job_201312251053_53092
14/01/11 18:52:23 INFO mapred.JobClient: Counters: 28
14/01/11 18:52:23 INFO mapred.JobClient:   Job Counters
14/01/11 18:52:23 INFO mapred.JobClient:     Launched reduce tasks=1
14/01/11 18:52:23 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=7975
14/01/11 18:52:23 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/01/11 18:52:23 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/01/11 18:52:23 INFO mapred.JobClient:     Rack-local map tasks=3
14/01/11 18:52:23 INFO mapred.JobClient:     Launched map tasks=4
14/01/11 18:52:23 INFO mapred.JobClient:     Data-local map tasks=1
14/01/11 18:52:23 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=8379
14/01/11 18:52:23 INFO mapred.JobClient:   FileSystemCounters
14/01/11 18:52:23 INFO mapred.JobClient:     FILE_BYTES_READ=398
14/01/11 18:52:23 INFO mapred.JobClient:     HDFS_BYTES_READ=1423
14/01/11 18:52:23 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=281090
14/01/11 18:52:23 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=5
14/01/11 18:52:23 INFO mapred.JobClient:   Map-Reduce Framework
14/01/11 18:52:23 INFO mapred.JobClient:     Map input records=15
14/01/11 18:52:23 INFO mapred.JobClient:     Reduce shuffle bytes=416
14/01/11 18:52:23 INFO mapred.JobClient:     Spilled Records=98
14/01/11 18:52:23 INFO mapred.JobClient:     Map output bytes=294
14/01/11 18:52:23 INFO mapred.JobClient:     CPU time spent (ms)=4430
14/01/11 18:52:23 INFO mapred.JobClient:     Total committed heap usage (bytes)=1258291200
14/01/11 18:52:23 INFO mapred.JobClient:     Map input bytes=387
14/01/11 18:52:23 INFO mapred.JobClient:     Combine input records=0
14/01/11 18:52:23 INFO mapred.JobClient:     SPLIT_RAW_BYTES=448
14/01/11 18:52:23 INFO mapred.JobClient:     Reduce input records=49
14/01/11 18:52:23 INFO mapred.JobClient:     Reduce input groups=1
14/01/11 18:52:23 INFO mapred.JobClient:     Combine output records=0
14/01/11 18:52:23 INFO mapred.JobClient:     Physical memory (bytes) snapshot=959954944
14/01/11 18:52:23 INFO mapred.JobClient:     Reduce output records=1
14/01/11 18:52:23 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=4436779008
14/01/11 18:52:23 INFO mapred.JobClient:     Map output records=49