Hadoop Combiner Operation
I recently read the book Data-Intensive Text Processing with MapReduce, which is about how to design MapReduce programs. One of its examples is a Combiner design pattern, so I implemented it myself. The problem is as follows.
The input data (14 records, tab-separated so that KeyValueTextInputFormat splits key and value) looks like this:

one	3.9
one	4.0
one	3.8
two	44
two	44
two	44
three	9898
four	2323
four	2323
five	2323
six	23
six	2323
four	232
five	2323
The first column is a user and the second column is the time that user spent on a website; we want each user's average time on the site. Without a combine step, the MapReduce pseudocode (copied from the book) is:
class Mapper
    method Map(string t, integer r)
        Emit(string t, integer r)

class Reducer
    method Reduce(string t, integers [r1, r2, . . .])
        sum ← 0
        cnt ← 0
        for all integer r ∈ integers [r1, r2, . . .] do
            sum ← sum + r
            cnt ← cnt + 1
        r_avg ← sum / cnt
        Emit(string t, integer r_avg)

How do we add a combiner? Can the Combiner simply reuse the Reducer? (For the classic max-temperature example it can, because max is associative and idempotent, but here it cannot — and many real-world aggregations are like this one.) Reusing the reducer as the combiner would amount to the incorrect computation:
Mean(1, 2, 3, 4, 5) = Mean(Mean(1, 2), Mean(3, 4, 5))

which is wrong, because a mean of partial means weights the groups incorrectly. The correct pseudocode (excerpted from the book) is:
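The pitfall is easy to verify with a few lines of plain Java, independent of Hadoop (the class and method names below are just for illustration):

```java
// Check that a mean of partial means differs from the true mean
// when the groups have different sizes.
public class MeanCheck {
    static double mean(double... xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    public static void main(String[] args) {
        double trueMean = mean(1, 2, 3, 4, 5);                // 3.0
        double meanOfMeans = mean(mean(1, 2), mean(3, 4, 5)); // (1.5 + 4.0) / 2 = 2.75
        System.out.println(trueMean + " vs " + meanOfMeans);
    }
}
```

The two groups contribute 2 and 3 values respectively, so averaging their means gives each group equal weight instead of each value.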
class Mapper
    method Map(string t, integer r)
        Emit(string t, pair (r, 1))

class Combiner
    method Combine(string t, pairs [(s1, c1), (s2, c2) . . .])
        sum ← 0
        cnt ← 0
        for all pair (s, c) ∈ pairs [(s1, c1), (s2, c2) . . .] do
            sum ← sum + s
            cnt ← cnt + c
        Emit(string t, pair (sum, cnt))

class Reducer
    method Reduce(string t, pairs [(s1, c1), (s2, c2) . . .])
        sum ← 0
        cnt ← 0
        for all pair (s, c) ∈ pairs [(s1, c1), (s2, c2) . . .] do
            sum ← sum + s
            cnt ← cnt + c
        r_avg ← sum / cnt
        Emit(string t, integer r_avg)

The combiner's input and output types must line up: its input must have the same format as the mapper's output, and its output the same format as the reducer's input. That is why (sum, count) pairs appear throughout. The Java code written from this pseudocode follows.
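As a quick sanity check on the (sum, count) scheme, here is a Hadoop-free Java sketch that applies the same accumulation to the sample input (the `averages` helper is illustrative, not part of the job below):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AverageSketch {
    // key -> [sum, count]; this accumulation is associative and commutative,
    // so applying it in both the combiner and the reducer is safe
    static Map<String, Double> averages(String[][] records) {
        Map<String, double[]> acc = new LinkedHashMap<>();
        for (String[] r : records) {
            double[] sc = acc.computeIfAbsent(r[0], k -> new double[2]);
            sc[0] += Double.parseDouble(r[1]); // running sum
            sc[1] += 1;                        // running count
        }
        Map<String, Double> out = new LinkedHashMap<>();
        acc.forEach((k, sc) -> out.put(k, sc[0] / sc[1]));
        return out;
    }

    public static void main(String[] args) {
        String[][] records = {
            {"one", "3.9"}, {"one", "4.0"}, {"one", "3.8"},
            {"two", "44"}, {"two", "44"}, {"two", "44"},
            {"three", "9898"}, {"four", "2323"}, {"four", "2323"},
            {"five", "2323"}, {"six", "23"}, {"six", "2323"},
            {"four", "232"}, {"five", "2323"}
        };
        averages(records).forEach((k, v) -> System.out.println(k + "\t" + v));
    }
}
```

For example, user "one" should come out as (3.9 + 4.0 + 3.8) / 3 = 3.9.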
Driver (note: the original code registered AverageM2/AverageR2 here, which do not match the AverageM3/AverageR3 classes defined below; fixed):

package org.fansy.date922;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class AverageDriver3 {
    public static void main(String[] args) throws Exception {
        Configuration conf1 = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf1, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: AverageDriver <in> <out>");
            System.exit(2);
        }
        Job job1 = new Job(conf1, "AverageDriver job");
        job1.setInputFormatClass(KeyValueTextInputFormat.class);
        job1.setNumReduceTasks(1);
        job1.setJarByClass(AverageDriver3.class);
        job1.setMapperClass(AverageM3.class);   // must match the mapper class below
        job1.setMapOutputKeyClass(Text.class);
        job1.setMapOutputValueClass(TextPair.class);
        job1.setCombinerClass(AverageC3.class);
        job1.setReducerClass(AverageR3.class);  // must match the reducer class below
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(DoubleWritable.class);
        KeyValueTextInputFormat.addInputPath(job1, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job1, new Path(otherArgs[1]));
        if (!job1.waitForCompletion(true)) {
            System.exit(1); // exit on job failure
        }
        System.out.println("************************");
    }
}

Mapper:
package org.fansy.date922;

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class AverageM3 extends Mapper<Text, Text, Text, TextPair> {
    private TextPair newvalue = new TextPair();
    private DoubleWritable r = new DoubleWritable();
    private IntWritable number = new IntWritable(1);

    public void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        System.out.println(key.toString()); // debug output
        double shuzhi = Double.parseDouble(value.toString());
        r.set(shuzhi);
        newvalue.set(r, number);      // emit (time, 1) so the combiner can sum both parts
        context.write(key, newvalue);
    }
}

Combiner:
package org.fansy.date922;

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AverageC3 extends Reducer<Text, TextPair, Text, TextPair> {
    private DoubleWritable newvalued = new DoubleWritable();
    private IntWritable newvaluei = new IntWritable();
    private TextPair newvalue = new TextPair();

    public void reduce(Text key, Iterable<TextPair> values, Context context)
            throws IOException, InterruptedException {
        double sum = 0.0;
        int num = 0;
        for (TextPair val : values) {
            sum += val.getFirst().get();
            num += val.getSecond().get();
        }
        newvalued.set(sum);
        newvaluei.set(num);
        newvalue.set(newvalued, newvaluei); // output type matches the map output: (sum, count)
        context.write(key, newvalue);
    }
}
Reducer:
package org.fansy.date922;

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AverageR3 extends Reducer<Text, TextPair, Text, DoubleWritable> {
    private DoubleWritable newvalue = new DoubleWritable();

    public void reduce(Text key, Iterable<TextPair> values, Context context)
            throws IOException, InterruptedException {
        double sum = 0.0;
        int num = 0;
        for (TextPair val : values) {
            sum += val.getFirst().get();
            num += val.getSecond().get();
        }
        double aver = sum / num; // only the reducer divides; the combiner never does
        newvalue.set(aver);
        context.write(key, newvalue);
    }
}
TextPair:
package org.fansy.date922;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;

public class TextPair implements WritableComparable<TextPair> {
    private DoubleWritable first; // running sum
    private IntWritable second;   // running count

    public TextPair() {
        set(new DoubleWritable(), new IntWritable());
    }

    public void set(DoubleWritable first, IntWritable second) {
        this.first = first;
        this.second = second;
    }

    public DoubleWritable getFirst() {
        return first;
    }

    public IntWritable getSecond() {
        return second;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        first.readFields(in);  // must read in the same order as write()
        second.readFields(in);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        first.write(out);
        second.write(out);
    }

    @Override
    public int compareTo(TextPair o) {
        int cmp = first.compareTo(o.first);
        if (cmp != 0) {
            return cmp;
        }
        return second.compareTo(o.second);
    }
}
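TextPair.write() and readFields() simply concatenate the two Writables on the wire. The plain-Java sketch below (no Hadoop dependency; `encode` is an illustrative helper, not part of the job) shows the same byte layout: an 8-byte double followed by a 4-byte int.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class PairRoundTrip {
    // Serialize (sum, count) the way TextPair.write() does:
    // 8 bytes for the double, then 4 bytes for the int.
    static byte[] encode(double sum, int count) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeDouble(sum);
            out.writeInt(count);
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e); // cannot happen with in-memory streams
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = encode(11.7, 3);
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
        // readFields() must read in exactly the order write() wrote
        System.out.println(in.readDouble() + " / " + in.readInt()
                + " in " + bytes.length + " bytes");
    }
}
```

If readFields() read the int first, every deserialized pair would be garbage — the order of the two calls is the whole contract.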
The terminal output also confirms that the combine step really ran — note the 14 combine input records collapsing to 6 combine output records:
12/09/22 15:55:45 INFO mapred.JobClient: Job complete: job_local_0001
12/09/22 15:55:45 INFO mapred.JobClient: Counters: 22
12/09/22 15:55:45 INFO mapred.JobClient:   File Output Format Counters
12/09/22 15:55:45 INFO mapred.JobClient:     Bytes Written=65
12/09/22 15:55:45 INFO mapred.JobClient:   FileSystemCounters
12/09/22 15:55:45 INFO mapred.JobClient:     FILE_BYTES_READ=466
12/09/22 15:55:45 INFO mapred.JobClient:     HDFS_BYTES_READ=244
12/09/22 15:55:45 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=82758
12/09/22 15:55:45 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=65
12/09/22 15:55:45 INFO mapred.JobClient:   File Input Format Counters
12/09/22 15:55:45 INFO mapred.JobClient:     Bytes Read=122
12/09/22 15:55:45 INFO mapred.JobClient:   Map-Reduce Framework
12/09/22 15:55:45 INFO mapred.JobClient:     Map output materialized bytes=118
12/09/22 15:55:45 INFO mapred.JobClient:     Map input records=14
12/09/22 15:55:45 INFO mapred.JobClient:     Reduce shuffle bytes=0
12/09/22 15:55:45 INFO mapred.JobClient:     Spilled Records=12
12/09/22 15:55:45 INFO mapred.JobClient:     Map output bytes=231
12/09/22 15:55:45 INFO mapred.JobClient:     Total committed heap usage (bytes)=301727744
12/09/22 15:55:45 INFO mapred.JobClient:     CPU time spent (ms)=0
12/09/22 15:55:45 INFO mapred.JobClient:     SPLIT_RAW_BYTES=108
12/09/22 15:55:45 INFO mapred.JobClient:     Combine input records=14
12/09/22 15:55:45 INFO mapred.JobClient:     Reduce input records=6
12/09/22 15:55:45 INFO mapred.JobClient:     Reduce input groups=6
12/09/22 15:55:45 INFO mapred.JobClient:     Combine output records=6
12/09/22 15:55:45 INFO mapred.JobClient:     Physical memory (bytes) snapshot=0
12/09/22 15:55:45 INFO mapred.JobClient:     Reduce output records=6
12/09/22 15:55:45 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=0
12/09/22 15:55:45 INFO mapred.JobClient:     Map output records=14
************************

The book actually closes this example with an in-mapper combining pattern, which I did not fully understand. Its pseudocode is:
class Mapper
    method Initialize
        S ← new AssociativeArray
        C ← new AssociativeArray
    method Map(string t, integer r)
        S{t} ← S{t} + r
        C{t} ← C{t} + 1
    method Close
        for all term t ∈ S do
            Emit(term t, pair (S{t}, C{t}))
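My reading of that pseudocode, sketched as plain Java (the class and method names are illustrative, not the Hadoop API; in the real mapreduce API the Close hook corresponds to Mapper.cleanup(Context)): the mapper holds per-key (sum, count) state in associative arrays across all map() calls and emits nothing until the end of the split, so no separate combiner pass is needed.

```java
import java.util.HashMap;
import java.util.Map;

public class InMapperCombiner {
    // S and C from the pseudocode: per-key running sum and count
    final Map<String, Double> sums = new HashMap<>();
    final Map<String, Integer> counts = new HashMap<>();

    // called once per input record; accumulates but emits nothing
    void map(String key, double value) {
        sums.merge(key, value, Double::sum);
        counts.merge(key, 1, Integer::sum);
    }

    // called once after all records of the split: emit one (sum, count) per key
    void close() {
        for (String key : sums.keySet()) {
            System.out.println(key + "\t(" + sums.get(key) + ", " + counts.get(key) + ")");
        }
    }

    public static void main(String[] args) {
        InMapperCombiner m = new InMapperCombiner();
        m.map("one", 3.9);
        m.map("one", 4.0);
        m.map("two", 44);
        m.close();
    }
}
```

The upside over a regular combiner is that combining is guaranteed (Hadoop may run a configured combiner zero or more times); the downside is that the mapper must hold the per-key state in memory.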
Still working through MapReduce programming...