Hadoop Study Notes (3): The MapReduce WordCount Program


The MapReduce WordCount Program

References:

http://www.cnblogs.com/xia520pi/archive/2012/05/16/2504205.html

http://www.cnblogs.com/taven/archive/2012/11/03.html

http://luluq1987.blog.163.com/blog/static/40790681201121934352484/

http://luluq1987.blog.163.com/blog/static/407906812011267347477/

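Program source:

Tip: WordCount may differ between Hadoop versions, and some versions contain deprecated interfaces or classes. The version used here is 0.20.2. It is best to use the example that ships with your own version as a reference; WordCount is in src/examples/org/apache/hadoop/examples under the Hadoop directory.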

package com.ptrdu.test;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    // Mapper: emits a (word, 1) pair for every token of each input line.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer (also used as the combiner): sums the counts of one word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Strip generic Hadoop options; what remains must be <in> and <out>.
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        // Exit with 0 if the job succeeds, 1 otherwise.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The program has three parts:

The TokenizerMapper class

The IntSumReducer class

A main method

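First, an overview of the MapReduce workflow:

TokenizerMapper extends the Mapper class and overrides the map method. It processes each input <key,value> pair and emits intermediate <key,value> pairs, which the framework groups by key into <key,list<value>> as the input to reduce.

IntSumReducer extends the Reducer class and overrides the reduce method, which processes the <key,list<value>> groups built from the map output.

The main function is the program's entry point. It creates a Job object named job, which represents this unit of work; the MapReduce parameters are configured through the interfaces Hadoop provides on it.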

The workflow is shown in the following figure:

[Figure: MapReduce workflow]
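The map, group-by-key, and reduce steps described above can be sketched in plain Java without Hadoop. This is only an illustration of the data flow, not how Hadoop implements it; the class WordCountFlowSketch and its methods map, shuffle, reduce, and run are hypothetical names introduced for this sketch.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical single-process illustration of the WordCount data flow.
public class WordCountFlowSketch {

    // "Map" step: emit one (word, 1) pair per token of the line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add(new SimpleEntry<>(itr.nextToken(), 1));
        }
        return pairs;
    }

    // "Shuffle" step: group all values by key, keys in sorted order.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // "Reduce" step: sum the list of counts belonging to one key.
    static int reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Run the three steps in sequence over all input lines.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> out = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(pairs).entrySet()) {
            out.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        // prints {hadoop=1, hello=2, world=1}
        System.out.println(run(Arrays.asList("hello world hello hadoop")));
    }
}
```

In a real job the map calls run in parallel on different nodes and the grouping happens across the network, but the shape of the data at each step is the same.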

Detailed source analysis:

1. The TokenizerMapper class:

public static class TokenizerMapper
        extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}

TokenizerMapper extends the Mapper class and overrides its map function. The Hadoop API documentation declares the Mapper class as:

Class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>

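Here <KEYIN,VALUEIN,KEYOUT,VALUEOUT> are the types of the input key, input value, output key, and output value, respectively.

Comparing this with TokenizerMapper shows that the input key and value types in this example are Object and Text, and the output key/value pair has the type <Text,IntWritable>.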
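To make the input types concrete: with the default TextInputFormat, map is called once per line, with the line's starting byte offset in the file as the key and the line's text as the value. The sketch below mimics that in plain Java; LineRecordSketch and records are hypothetical names, and it assumes UTF-8 text with "\n" line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of (byte offset, line) records.
public class LineRecordSketch {

    // Returns one "offset<TAB>line" string per line of the content.
    static List<String> records(String fileContent) {
        List<String> out = new ArrayList<>();
        long offset = 0;
        for (String line : fileContent.split("\n")) {
            out.add(offset + "\t" + line);
            // Advance past the line's bytes plus the '\n' terminator.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        for (String record : records("hello world\nhello hadoop\n")) {
            System.out.println(record);
        }
    }
}
```

WordCount never uses the key, which is why its mapper can declare the input key type simply as Object.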

private final static IntWritable one = new IntWritable(1);

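This line defines a constant IntWritable named one with the value 1. Every occurrence of a word is emitted with this count; its role becomes clear in the walkthrough below.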

private Text word = new Text();

A reusable Text object that holds the current word, which becomes the output key.

public void map(Object key, Text value, Context context

) throws IOException, InterruptedException

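This is the map function itself, the key function of the mapper. The framework calls it once for every input record of the map tasks running on each node. The <key,value> pairs it emits are later grouped by key into <key,list<value>> as the input for reduce. The Context context parameter collects this output; earlier versions of the API used an OutputCollector for this purpose.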

StringTokenizer itr = new StringTokenizer(value.toString());

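This prepares for tokenizing (see the StringTokenizer documentation). The Text value is first converted to a String, which is then wrapped in a StringTokenizer for the splitting that follows.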

while (itr.hasMoreTokens()) {

word.set(itr.nextToken());

context.write(word, one);

}

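The loop scans one line of text and splits it on whitespace (StringTokenizer's default delimiters are space, tab, newline, carriage return, and form feed). word.set(itr.nextToken()); copies each String token into the Text object, and context.write(word, one); emits the corresponding <key,value> pair.

Take one line of text, hello world hello hadoop. (Why one line? That depends on the configured InputFormat. This example uses the default, TextInputFormat, which treats each line of a text file as one record.) Scanning this line yields: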
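The splitting behavior can be checked in isolation with a small plain-Java sketch that uses the same StringTokenizer loop as map. TokenizerSketch and tokens are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper showing how the default StringTokenizer splits a line.
public class TokenizerSketch {

    // Collect the tokens of a line using the default delimiters.
    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(itr.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // Tabs and repeated spaces separate tokens; punctuation and case are kept.
        System.out.println(tokens("hello\tworld  Hello, hadoop"));
    }
}
```

The tokens here are hello, world, Hello, (with the comma attached), and hadoop. WordCount therefore counts "Hello," and "hello" as different words unless the text is normalized first.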

hello 1

world 1

hello 1

hadoop 1

Every value is 1 because context.write(word, one) always writes one as the value. Grouping these pairs by key gives <hello,(1,1)>, <world,(1)>, <hadoop,(1)>.

2. The IntSumReducer class:

public static class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

IntSumReducer extends the Reducer class and implements the reduce function.

public static class IntSumReducer

extends Reducer<Text,IntWritable,Text,IntWritable>

As in the API documentation, the fields of <Text,IntWritable,Text,IntWritable> again correspond to <KEYIN,VALUEIN,KEYOUT,VALUEOUT>.

private IntWritable result = new IntWritable();

This line creates result, which holds the output, that is, the value of the output <key,value> pair.

public void reduce(Text key, Iterable<IntWritable> values,

Context context

)

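These are the parameters of the reduce function. Text key and Iterable<IntWritable> values correspond to the <key,list<value>> built from the map output, and context is again used to write the output.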

int sum = 0;

for (IntWritable val : values) {

sum += val.get();

}

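This computes the count for each key. for (IntWritable val : values) visits every element of values in turn through its iterator (an Iterable has no length or index, so this is not a counting loop), and the total accumulates in sum. For the input <hello,(1,1)>, sum equals 2.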
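What the for-each loop does over an Iterable can be written out explicitly. The sketch below uses plain Integer values instead of IntWritable so that it runs without Hadoop; IterableSumSketch and its two methods are hypothetical names.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical illustration: for-each over an Iterable versus an explicit Iterator.
public class IterableSumSketch {

    // The form used in reduce: the compiler obtains an Iterator behind the scenes.
    static int sumForEach(Iterable<Integer> values) {
        int sum = 0;
        for (Integer val : values) {
            sum += val;
        }
        return sum;
    }

    // The same loop with the Iterator written out by hand.
    static int sumIterator(Iterable<Integer> values) {
        int sum = 0;
        Iterator<Integer> it = values.iterator();
        while (it.hasNext()) {
            sum += it.next();
        }
        return sum;
    }

    public static void main(String[] args) {
        Iterable<Integer> ones = Arrays.asList(1, 1); // the values for <hello,(1,1)>
        System.out.println(sumForEach(ones));  // prints 2
        System.out.println(sumIterator(ones)); // prints 2
    }
}
```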

result.set(sum);

context.write(key, result);

result.set(sum) stores sum in the IntWritable, and context.write(key, result); then writes the pair out.

3. The main function:

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
        System.err.println("Usage: wordcount <in> <out>");
        System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

The main function is the entry point of the MapReduce program.

Configuration conf = new Configuration();

This loads the configuration, that is, the parameters set up earlier in files such as core-site.xml and hdfs-site.xml.

String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();

if (otherArgs.length != 2) {

System.err.println("Usage: wordcount <in> <out>");

System.exit(2);

}

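The remaining arguments are used as paths. If there are not exactly two of them, the program prints the usage message and exits, because it needs exactly one input path and one output path.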

Job job = new Job(conf, "word count");

This creates the Job object, with word count as the name of the job.

job.setMapperClass(TokenizerMapper.class);

job.setCombinerClass(IntSumReducer.class);

job.setReducerClass(IntSumReducer.class);

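These lines set the map and reduce classes of the job. job.setCombinerClass(IntSumReducer.class); registers a combiner, which acts as a local reduce on each map task's output before that output is sent to the reducers. IntSumReducer can double as the combiner because summing partial counts and then summing the partial sums gives the same total.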
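The effect of a combiner can be illustrated with a plain-Java simulation of two map tasks. It is a sketch of the idea only: CombinerSketch, mapTask, sumByKey, and shuffled are hypothetical names, and real Hadoop decides for itself when and how often to run the combiner.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical simulation of local pre-aggregation by a combiner.
public class CombinerSketch {

    // One simulated map task: emit (word, 1) for every token of its split.
    static List<Map.Entry<String, Integer>> mapTask(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(split);
        while (itr.hasMoreTokens()) {
            pairs.add(new SimpleEntry<>(itr.nextToken(), 1));
        }
        return pairs;
    }

    // Summing logic shared by the combiner and the reducer.
    static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            sums.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return sums;
    }

    // The pairs that leave the map side, with or without a combiner.
    static List<Map.Entry<String, Integer>> shuffled(List<String> splits, boolean useCombiner) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String split : splits) {
            List<Map.Entry<String, Integer>> pairs = mapTask(split);
            if (useCombiner) {
                out.addAll(sumByKey(pairs).entrySet()); // local reduce per map task
            } else {
                out.addAll(pairs);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> splits = Arrays.asList("hello world hello hadoop", "hello hadoop hadoop");
        for (boolean useCombiner : new boolean[] {false, true}) {
            List<Map.Entry<String, Integer>> s = shuffled(splits, useCombiner);
            System.out.println("combiner=" + useCombiner
                    + " shuffled pairs=" + s.size()
                    + " result=" + sumByKey(s));
        }
    }
}
```

With these two splits, 7 pairs leave the map side without the combiner and 5 with it, while the final counts (hadoop=3, hello=3, world=1) are identical in both cases.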

job.setOutputKeyClass(Text.class);

job.setOutputValueClass(IntWritable.class);

The types of the output <k,v> pair.

FileInputFormat.addInputPath(job, new Path(otherArgs[0]));

FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

The input and output paths.

System.exit(job.waitForCompletion(true) ? 0 : 1);

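This submits the job and waits for it to finish; the true argument makes it print progress while the job runs. The program exits with 0 if the job succeeded and 1 otherwise.

P.S. This example does not set the input or output format because it uses the defaults, TextInputFormat and TextOutputFormat. For other formats, set them with job.setInputFormatClass(...) and job.setOutputFormatClass(...), passing the format class.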