Hadoop Examples: WordCount

Source: Internet | Editor: 程序博客网 | Time: 2024/05/22 14:35

WordCount

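This post walks through the code of WordCount, one of the examples shipped with Hadoop.

Note: these are study notes. They include material gathered from the web and other sources, which is credited at the end. If an attribution is missing, please let me know so I can correct it.

Purpose

Count how many times each word occurs in the input files.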

Map

public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  // Both objects are reused for every output record.
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context
                  ) throws IOException, InterruptedException {
    // Split the line on whitespace and emit <word, 1> for each token.
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}

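1. Extends org.apache.hadoop.mapreduce.Mapper and overrides public void map(Object key, Text value, Context context) throws IOException, InterruptedException.

2. The input key is the offset of the line within the file (with the default TextInputFormat), value is one line of text, and context is the object used to write the output.

3. The output is one <word, 1> key-value pair for every word.

4. The generic parameters Object, Text, Text, IntWritable are, in order, the input key type, input value type, output key type, and output value type.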
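
To make the mapper's output concrete, here is a minimal plain-Java sketch of the tokenizing step with no Hadoop dependency. The class MapSketch and the method emitPairs are hypothetical names used only for illustration; a list of entries stands in for the calls to context.write.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical stand-in for TokenizerMapper.map, without Hadoop types.
class MapSketch {
    // Returns one (word, 1) pair per whitespace-separated token in the line.
    static List<Map.Entry<String, Integer>> emitPairs(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(new SimpleEntry<>(itr.nextToken(), 1));
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [hello=1, world=1, hello=1]
        System.out.println(emitPairs("hello world hello"));
    }
}
```

Repeated words are not merged at this stage; each occurrence produces its own pair. Because StringTokenizer splits only on whitespace by default, punctuation stays attached to the word, so "word" and "word," count as different keys.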

Combiner and Reducer

public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  // Reused for every output record.
  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values,
                     Context context
                     ) throws IOException, InterruptedException {
    // Add up all counts received for this word.
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}

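1. Extends org.apache.hadoop.mapreduce.Reducer and overrides public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException.

2. The input key is a word, and values holds the counts emitted for that word: a 1 for each occurrence produced by map, or partial sums when the combiner has already run.

3. The output is <word, count>, where count is the final result.

4. Text, IntWritable, Text, IntWritable are, in order, the input key type, input value type, output key type, and output value type, whether the class runs as the reducer or as the combiner. The same class can serve both roles because its input and output types are identical.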
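
The reason one class can act as both combiner and reducer is that addition gives the same total whether the values are summed in one pass or pre-summed in groups. The following plain-Java sketch illustrates this without Hadoop; SumSketch and its split into "map task A" and "map task B" are hypothetical and chosen only for the example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for IntSumReducer.reduce, without Hadoop types.
class SumSketch {
    // Sums all values seen for one key, as the reduce loop does.
    static int sum(List<Integer> values) {
        int sum = 0;
        for (int val : values) {
            sum += val;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Without a combiner: the reducer receives every 1 emitted by the mappers.
        int direct = sum(Arrays.asList(1, 1, 1, 1, 1));

        // With a combiner: each map task pre-sums its own output,
        // and the reducer then sums the partial counts.
        int partialA = sum(Arrays.asList(1, 1, 1)); // map task A
        int partialB = sum(Arrays.asList(1, 1));    // map task B
        int combined = sum(Arrays.asList(partialA, partialB));

        System.out.println(direct + " " + combined); // prints "5 5"
    }
}
```

An operation such as averaging would not behave this way, so a combiner is not always interchangeable with the reducer.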

Main function

public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration(); //1
  String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
  if (otherArgs.length < 2) {
    System.err.println("Usage: wordcount <in> [<in>...] <out>");
    System.exit(2);
  }
  Job job = Job.getInstance(conf, "word count"); //2
  job.setJarByClass(WordCount.class); //3
  job.setMapperClass(TokenizerMapper.class); //4
  job.setCombinerClass(IntSumReducer.class); //5
  job.setReducerClass(IntSumReducer.class); //6
  job.setOutputKeyClass(Text.class); //7
  job.setOutputValueClass(IntWritable.class); //8
  for (int i = 0; i < otherArgs.length - 1; ++i) {
    FileInputFormat.addInputPath(job, new Path(otherArgs[i])); //9
  }
  FileOutputFormat.setOutputPath(job,
      new Path(otherArgs[otherArgs.length - 1])); //10
  System.exit(job.waitForCompletion(true) ? 0 : 1); //11
}

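1. Create the conf instance that is used to build the Job instance.

2. Create the Job instance from conf.

3. Set the application's JAR by naming a class that the JAR contains.

4. Set the Mapper class.

5. Set the Combiner class.

6. Set the Reducer class.

7. Set the key type of the final output.

8. Set the value type of the final output.

9. Add the input file paths.

10. Set the output file path.

11. Wait for the Job to complete; the process exits with 0 on success and 1 on failure.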
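
The argument convention in main (every argument except the last is an input path, the last is the output path) can be sketched in plain Java as follows. ArgsSketch, inputPaths, outputPath, and the sample paths are hypothetical and exist only for this illustration; the real code passes each path to FileInputFormat and FileOutputFormat instead of returning it.

```java
import java.util.Arrays;

// Hypothetical helper mirroring how main splits its arguments.
class ArgsSketch {
    // All arguments except the last are input paths.
    static String[] inputPaths(String[] args) {
        return Arrays.copyOfRange(args, 0, args.length - 1);
    }

    // The last argument is the output path.
    static String outputPath(String[] args) {
        return args[args.length - 1];
    }

    public static void main(String[] args) {
        String[] sample = {"/in/a", "/in/b", "/out"}; // illustrative paths
        // prints [/in/a, /in/b] -> /out
        System.out.println(Arrays.toString(inputPaths(sample)) + " -> " + outputPath(sample));
    }
}
```

Both helpers assume at least two arguments, which is what the otherArgs.length < 2 check in main guarantees before the paths are read.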

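Summary

A MapReduce application consists of:

1. Deciding the final key and value types of the input and output, then writing the Mapper and the Reducer.

2. An entry function that configures the Mapper, Combiner, and Reducer, sets the final output key and value types, adds the input file paths, sets the output file path, and submits the job.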
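
As a rough end-to-end picture of these two steps, the sketch below simulates map, grouping by key, and reduce in memory with plain Java. It is only an analogy under stated assumptions: PipelineSketch and wordCount are hypothetical names, everything runs in one process, and a TreeMap stands in for the framework's step of sorting and grouping map output by key before it reaches the reducer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical in-memory imitation of the WordCount data flow.
class PipelineSketch {
    static Map<String, Integer> wordCount(List<String> lines) {
        // Map phase plus grouping: collect a 1 under each word.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                grouped.computeIfAbsent(itr.nextToken(), k -> new ArrayList<>()).add(1);
            }
        }

        // Reduce phase: sum the values collected for each word.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int val : entry.getValue()) {
                sum += val;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {hadoop=1, hello=2, world=1}
        System.out.println(wordCount(Arrays.asList("hello world", "hello hadoop")));
    }
}
```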

Setting the input and output paths

FileInputFormat.addInputPath(job, new Path(otherArgs[i]));

FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 1]));

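Submitting the job

job.submit() submits the job and returns immediately.

job.waitForCompletion(boolean verbose) submits the job if it has not been submitted yet and blocks until it finishes.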

References:

1. The code in this post is taken from the example bundled with Hadoop 2.7.0: org.apache.hadoop.examples.WordCount

0 0