Hadoop Combiner Operations


I recently read Data-Intensive Text Processing with MapReduce, a book about how to design MapReduce programs. One of its examples is the combiner design pattern, so I implemented it myself. The problem is as follows.

The input data is shown below (KeyValueTextInputFormat, used in the driver, splits each line into key and value at the first tab by default):

one	3.9
one	4.0
one	3.8
two	44
two	44
two	44
three	9898
four	2323
four	2323
five	2323
six	23
six	2323
four	232
five	2323

The first column is a user and the second column is the time that user spent on a website; the goal is each user's average time on the site. For user one, for example, that is (3.9 + 4.0 + 3.8) / 3 = 3.9. Without a combine step, the MapReduce pseudocode looks like this (copied from the book):

class Mapper
    method Map(string t, integer r)
        Emit(string t, integer r)

class Reducer
    method Reduce(string t, integers [r1, r2, ...])
        sum ← 0
        cnt ← 0
        for all integer r ∈ integers [r1, r2, ...] do
            sum ← sum + r
            cnt ← cnt + 1
        r_avg ← sum / cnt
        Emit(string t, integer r_avg)
How should a combiner be added? Is the Combiner just the Reducer again? For the classic maximum-temperature example it can be, but not here, and in practice it often is not. If the Reducer were reused as the Combiner, the computation would degenerate into the incorrect identity below:

Mean(1, 2, 3, 4, 5) = Mean(Mean(1, 2), Mean(3, 4, 5))
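Working it out shows why this fails: Mean(1, 2, 3, 4, 5) = 15 / 5 = 3, but Mean(Mean(1, 2), Mean(3, 4, 5)) = Mean(1.5, 4) = 2.75. A mean of means weights each group equally rather than each value, so the two sides disagree whenever the groups have different sizes.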
The correct pseudocode is as follows (excerpted from the book):

class Mapper
    method Map(string t, integer r)
        Emit(string t, pair (r, 1))

class Combiner
    method Combine(string t, pairs [(s1, c1), (s2, c2), ...])
        sum ← 0
        cnt ← 0
        for all pair (s, c) ∈ pairs [(s1, c1), (s2, c2), ...] do
            sum ← sum + s
            cnt ← cnt + c
        Emit(string t, pair (sum, cnt))

class Reducer
    method Reduce(string t, pairs [(s1, c1), (s2, c2), ...])
        sum ← 0
        cnt ← 0
        for all pair (s, c) ∈ pairs [(s1, c1), (s2, c2), ...] do
            sum ← sum + s
            cnt ← cnt + c
        r_avg ← sum / cnt
        Emit(string t, integer r_avg)
The Combiner's input and output formats must be the same: its input has to match the Mapper's output, and its output has to match the Reducer's input, which is why pairs appear everywhere above. (The framework may also apply the combiner zero, one, or several times to a map task's output, so it must be safe to repeat.) The Java code written from this pseudocode follows.
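Concretely, the generic type parameters of the three classes below line up like this (just a summary of the code that follows, nothing extra to configure):

    Mapper<Text, Text, Text, TextPair>                  // (user, time)  -> (user, (time, 1))
    Combiner: Reducer<Text, TextPair, Text, TextPair>   // (user, pairs) -> (user, (sum, cnt))
    Reducer<Text, TextPair, Text, DoubleWritable>       // (user, pairs) -> (user, average)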

Driver:

package org.fansy.date922;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class AverageDriver3 {
    public static void main(String[] args) throws Exception {
        Configuration conf1 = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf1, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: AverageDriver3 <in> <out>");
            System.exit(2);
        }
        Job job1 = new Job(conf1, "AverageDriver3 job");
        job1.setInputFormatClass(KeyValueTextInputFormat.class);
        job1.setNumReduceTasks(1);
        job1.setJarByClass(AverageDriver3.class);
        job1.setMapperClass(AverageM3.class);   // must match the mapper class defined below
        job1.setMapOutputKeyClass(Text.class);
        job1.setMapOutputValueClass(TextPair.class);
        job1.setCombinerClass(AverageC3.class);
        job1.setReducerClass(AverageR3.class);  // must match the reducer class defined below
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(DoubleWritable.class);
        KeyValueTextInputFormat.addInputPath(job1, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job1, new Path(otherArgs[1]));
        if (!job1.waitForCompletion(true)) {
            System.exit(1); // run error, then exit
        }
        System.out.println("************************");
    }
}
Mapper:
package org.fansy.date922;

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class AverageM3 extends Mapper<Text, Text, Text, TextPair> {
    private TextPair newvalue = new TextPair();
    private DoubleWritable r = new DoubleWritable();
    private IntWritable number = new IntWritable(1);

    public void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit (user, (time, 1)): the 1 is the count that the combiner and reducer sum up.
        double shuzhi = Double.parseDouble(value.toString());
        r.set(shuzhi);
        newvalue.set(r, number);
        context.write(key, newvalue);
    }
}
Combiner:

package org.fansy.date922;

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AverageC3 extends Reducer<Text, TextPair, Text, TextPair> {
    private DoubleWritable newvalued = new DoubleWritable();
    private IntWritable newvaluei = new IntWritable();
    private TextPair newvalue = new TextPair();

    public void reduce(Text key, Iterable<TextPair> values, Context context)
            throws IOException, InterruptedException {
        // Pre-aggregate: sum the partial sums and counts, but do NOT divide here.
        double sum = 0.0;
        int num = 0;
        for (TextPair val : values) {
            sum += val.getFirst().get();
            num += val.getSecond().get();
        }
        newvalued.set(sum);
        newvaluei.set(num);
        newvalue.set(newvalued, newvaluei);
        context.write(key, newvalue);
    }
}

Reducer:
package org.fansy.date922;

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AverageR3 extends Reducer<Text, TextPair, Text, DoubleWritable> {
    private DoubleWritable newvalue = new DoubleWritable();

    public void reduce(Text key, Iterable<TextPair> values, Context context)
            throws IOException, InterruptedException {
        // The division happens only here, once all partial (sum, count) pairs are in.
        double sum = 0.0;
        int num = 0;
        for (TextPair val : values) {
            sum += val.getFirst().get();
            num += val.getSecond().get();
        }
        double aver = sum / num;
        newvalue.set(aver);
        context.write(key, newvalue);
    }
}

TextPair:

package org.fansy.date922;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;

// A (sum, count) pair. It is only used as a map output value here, so a plain
// Writable would suffice; WritableComparable is required only for key types.
public class TextPair implements WritableComparable<TextPair> {
    private DoubleWritable first;
    private IntWritable second;

    public TextPair() {
        set(new DoubleWritable(), new IntWritable());
    }

    public void set(DoubleWritable first, IntWritable second) {
        this.first = first;
        this.second = second;
    }

    public DoubleWritable getFirst() {
        return first;
    }

    public IntWritable getSecond() {
        return second;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        first.readFields(in);
        second.readFields(in);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        first.write(out);
        second.write(out);
    }

    @Override
    public int compareTo(TextPair o) {
        int cmp = first.compareTo(o.first);
        if (cmp != 0) {
            return cmp;
        }
        return second.compareTo(o.second);
    }
}

The terminal output also confirms that the combine step really ran: Combine input records=14 and Combine output records=6, so the reducer received only 6 records instead of 14:

12/09/22 15:55:45 INFO mapred.JobClient: Job complete: job_local_0001
12/09/22 15:55:45 INFO mapred.JobClient: Counters: 22
12/09/22 15:55:45 INFO mapred.JobClient:   File Output Format Counters 
12/09/22 15:55:45 INFO mapred.JobClient:     Bytes Written=65
12/09/22 15:55:45 INFO mapred.JobClient:   FileSystemCounters
12/09/22 15:55:45 INFO mapred.JobClient:     FILE_BYTES_READ=466
12/09/22 15:55:45 INFO mapred.JobClient:     HDFS_BYTES_READ=244
12/09/22 15:55:45 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=82758
12/09/22 15:55:45 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=65
12/09/22 15:55:45 INFO mapred.JobClient:   File Input Format Counters 
12/09/22 15:55:45 INFO mapred.JobClient:     Bytes Read=122
12/09/22 15:55:45 INFO mapred.JobClient:   Map-Reduce Framework
12/09/22 15:55:45 INFO mapred.JobClient:     Map output materialized bytes=118
12/09/22 15:55:45 INFO mapred.JobClient:     Map input records=14
12/09/22 15:55:45 INFO mapred.JobClient:     Reduce shuffle bytes=0
12/09/22 15:55:45 INFO mapred.JobClient:     Spilled Records=12
12/09/22 15:55:45 INFO mapred.JobClient:     Map output bytes=231
12/09/22 15:55:45 INFO mapred.JobClient:     Total committed heap usage (bytes)=301727744
12/09/22 15:55:45 INFO mapred.JobClient:     CPU time spent (ms)=0
12/09/22 15:55:45 INFO mapred.JobClient:     SPLIT_RAW_BYTES=108
12/09/22 15:55:45 INFO mapred.JobClient:     Combine input records=14
12/09/22 15:55:45 INFO mapred.JobClient:     Reduce input records=6
12/09/22 15:55:45 INFO mapred.JobClient:     Reduce input groups=6
12/09/22 15:55:45 INFO mapred.JobClient:     Combine output records=6
12/09/22 15:55:45 INFO mapred.JobClient:     Physical memory (bytes) snapshot=0
12/09/22 15:55:45 INFO mapred.JobClient:     Reduce output records=6
12/09/22 15:55:45 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=0
12/09/22 15:55:45 INFO mapred.JobClient:     Map output records=14
************************
At the end, the book also presents an in-mapper combining pattern, which I did not find very clear at first; its pseudocode is:

class Mapper
    method Initialize
        S ← new AssociativeArray
        C ← new AssociativeArray
    method Map(string t, integer r)
        S{t} ← S{t} + r
        C{t} ← C{t} + 1
    method Close
        for all term t ∈ S do
            Emit(term t, pair (S{t}, C{t}))
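The idea is to fold the combining into the mapper itself: keep in-memory associative arrays of running sums S and counts C, update them in Map, and emit one (sum, count) pair per key only when the task finishes. In the new mapreduce API, Initialize and Close correspond to setup() and cleanup(). Below is a minimal sketch of what this could look like, reusing the TextPair class above; the class name AverageIMC is my own, not from the book or the earlier code:

package org.fansy.date922;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// In-mapper combining: accumulate per-key sums and counts in memory and
// emit a single (sum, count) pair per key in cleanup() (the book's Close).
public class AverageIMC extends Mapper<Text, Text, Text, TextPair> {

    private Map<String, Double> sums;
    private Map<String, Integer> counts;

    @Override
    protected void setup(Context context) {
        // The book's Initialize: S and C start empty.
        sums = new HashMap<String, Double>();
        counts = new HashMap<String, Integer>();
    }

    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        // S{t} <- S{t} + r ; C{t} <- C{t} + 1
        String t = key.toString();
        double r = Double.parseDouble(value.toString());
        Double s = sums.get(t);
        Integer c = counts.get(t);
        sums.put(t, s == null ? r : s + r);
        counts.put(t, c == null ? 1 : c + 1);
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // Emit one partial (sum, count) per distinct key seen by this map task.
        TextPair pair = new TextPair();
        for (Map.Entry<String, Double> e : sums.entrySet()) {
            pair.set(new DoubleWritable(e.getValue()),
                     new IntWritable(counts.get(e.getKey())));
            context.write(new Text(e.getKey()), pair);
        }
    }
}

Unlike a Combiner registered with setCombinerClass(), which the framework is free to run zero or more times, this version guarantees the pre-aggregation happens exactly once per map task; the trade-off is one in-memory entry per distinct key, so it only suits jobs whose key space fits in the task's heap. The same reducer (AverageR3) works unchanged.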

Still working my way through MapReduce programming...