WordCount Explained


Reposted from: http://www.cnblogs.com/xia520pi/archive/2012/05/16/2504205.html

These notes are for organizing my own understanding.

I will add my own comments along the way.

The code is as follows:

package testMapReduce;

import java.io.File;
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    // Mapper: emits <word, 1> for every space-separated token in a line.
    public static class dataMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Alternative using StringTokenizer (commented out):
            // StringTokenizer itr = new StringTokenizer(value.toString());
            // while (itr.hasMoreTokens()) {
            //     word.set(itr.nextToken());
            //     context.write(word, one);
            // }
            String[] sp = value.toString().split(" ");
            for (int i = 0; i < sp.length; i++) {
                word.set(sp[i]);
                context.write(word, one);
            }
        }
    }

    // Reducer (also used as the combiner): sums the counts for one word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Recursively deletes a file or directory.
    // java.io.File only sees the local file system, not HDFS.
    public static void delFile(File file) {
        if (file.exists()) {
            if (file.isFile()) {
                file.delete();
            } else {
                File files[] = file.listFiles();
                for (int i = 0; i < files.length; i++) {
                    delFile(files[i]);
                }
            }
            file.delete();
        }
    }

    public static void main(String args[]) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        // Validate the arguments before otherArgs[1] is read.
        if (otherArgs.length != 2) {
            System.err.println("Usage:wordcount <in> <out>");
            System.exit(2);
        }
        // Remove a leftover output directory so the job does not fail on it.
        File out = new File(otherArgs[1]);
        if (out.isDirectory()) {
            delFile(out);
        }
        Job job = new Job(conf, "wordcount");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(dataMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The program has three main parts: the map class, the reduce class, and the main method.

The main method

In MapReduce, a Job object manages and runs one computation task, and the task's parameters are configured through methods on Job.

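Here the job is configured to use dataMapper (called TokenizerMapper in the standard Hadoop example) for the Map phase and IntSumReducer for both the Combine and Reduce phases. It also sets the output types of the Map and Reduce phases: the key type is Text and the value type is IntWritable. The input and output paths come from the command-line arguments and are set through FileInputFormat and FileOutputFormat respectively. Once these parameters are set, calling job.waitForCompletion() runs the task.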
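IntSumReducer can serve as the combiner because addition is associative and commutative: summing per-map-task partial sums gives the same total as summing every 1 directly. The following is a minimal plain-Java sketch of that idea, not Hadoop code; the class name CombinerSketch and the sample counts are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

public class CombinerSketch {
    // Same logic as IntSumReducer.reduce: add up all counts for one word.
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical counts for one word, emitted by two separate map tasks.
        List<Integer> fromMapTask1 = Arrays.asList(1, 1, 1);
        List<Integer> fromMapTask2 = Arrays.asList(1, 1);

        // Without a combiner: the reducer receives all five 1s.
        int direct = sum(Arrays.asList(1, 1, 1, 1, 1));

        // With a combiner: each map task pre-sums locally,
        // then the reducer sums the two partial sums (3 and 2).
        int combined = sum(Arrays.asList(sum(fromMapTask1), sum(fromMapTask2)));

        System.out.println(direct + " " + combined); // prints "5 5"
    }
}
```

An operation without this property, such as an average, would generally not be safe to reuse unchanged as a combiner.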

public static class dataMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // StringTokenizer itr = new StringTokenizer(value.toString());
        // while (itr.hasMoreTokens()) {
        //     word.set(itr.nextToken());
        //     context.write(word, one);
        // }
        String[] sp = value.toString().split(" ");
        for (int i = 0; i < sp.length; i++) {
            word.set(sp[i]);
            context.write(word, one);
        }
    }
}

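The Map phase extends the Mapper class from the org.apache.hadoop.mapreduce package and overrides its map method. Mapper takes four type parameters, which specify the input key type, input value type, output key type, and output value type of map. If you add two statements to map that print key and value to the console, you will see that value holds one line of the text file (lines end with a newline), and key is the byte offset of the start of that line from the beginning of the file. Each line is then split into individual words, and <word,1> is emitted as the result of map; the MapReduce framework handles everything else. The listing here splits with String.split(" "), while the commented-out alternative uses StringTokenizer. The two differ on consecutive spaces: split(" ") produces empty strings, which are then counted as a word, whereas StringTokenizer skips them.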
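To make the map input and the two tokenizing approaches concrete, here is a small plain-Java sketch that runs without Hadoop. It assumes UTF-8 text with "\n" line endings; the class name MapInputSketch, its helper methods, and the sample text are hypothetical and only imitate what the framework passes to map.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

public class MapInputSketch {
    // Byte offset at which each line starts (the key that map would receive),
    // assuming "\n" line endings and UTF-8 encoding.
    static List<Long> lineOffsets(String text) {
        List<Long> offsets = new ArrayList<>();
        long offset = 0;
        for (String line : text.split("\n")) {
            offsets.add(offset);
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for '\n'
        }
        return offsets;
    }

    // Tokenizing as in the listing: split on a single space.
    static List<String> splitOnSpace(String line) {
        return new ArrayList<>(Arrays.asList(line.split(" ")));
    }

    // Tokenizing as in the commented-out alternative: StringTokenizer.
    static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            words.add(itr.nextToken());
        }
        return words;
    }

    public static void main(String[] args) {
        String text = "hello world\nhello  hadoop\n"; // note the double space
        System.out.println(lineOffsets(text));             // [0, 12]
        System.out.println(splitOnSpace("hello  hadoop")); // [hello, , hadoop]
        System.out.println(tokenize("hello  hadoop"));     // [hello, hadoop]
    }
}
```

With split(" "), the empty string between the two spaces would be emitted as a word of its own, so StringTokenizer (or a split on "\\s+") is the safer choice for input with irregular spacing.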

public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

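The Reduce phase extends the Reducer class from the org.apache.hadoop.mapreduce package and overrides its reduce method. The framework groups the <word,1> pairs emitted by Map by key, so reduce receives input of the form <key,values>, where key is a single word and values is the list of counts for that word. The reduce method therefore only needs to iterate over values and add them up to get the total number of occurrences of the word.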
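The grouping step and the summation can be imitated in plain Java. This is only a sketch of the data flow, not how Hadoop implements the shuffle; the class name ShuffleReduceSketch, its methods, and the sample words are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShuffleReduceSketch {
    // Group the 1s emitted by map under their word.
    // A TreeMap keeps the keys sorted, loosely mirroring the sorted
    // order in which reduce is handed its keys.
    static Map<String, List<Integer>> shuffle(List<String> mapOutputWords) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String w : mapOutputWords) {
            grouped.computeIfAbsent(w, k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    // Same logic as IntSumReducer.reduce, applied to every key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("hello", "world", "hello", "hadoop");
        System.out.println(shuffle(words));         // {hadoop=[1], hello=[1, 1], world=[1]}
        System.out.println(reduce(shuffle(words))); // {hadoop=1, hello=2, world=1}
    }
}
```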

0 0