WordCount, Hadoop's word-count program

Source: Internet. Editor: 程序博客网. Time: 2024/06/05 07:58

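Hadoop is a distributed-system framework implemented in Java. Its two most important parts are probably MapReduce and HDFS: the former is a programming model, the latter a storage model. Enough background, on to the point. WordCount is the classic MapReduce program, and the Hadoop website provides it; the goal here is to get that program running. The first step is of course installing Hadoop, which this post does not cover. After that it is just a matter of following the steps below, but a few problems come up along the way, and those are the focus of this post.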

Assuming Hadoop is installed, and of course Java as well, following the steps below should get the program running.

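$ vim WordCount.java : edit the WordCount.java file; its contents are the code at the end of this post;
$ vim input.txt : edit the input file input.txt; any text will do, since the program only counts words;
$ mkdir class : create a directory to hold the class files;
$ javac -classpath /opt/hadoop/hadoop/hadoop-core-1.2.1.jar -d class WordCount.java : compile the WordCount source file; the class files go into the class directory;
$ jar -cvf wordcount.jar -C class . : package the program into a jar;
$ hadoop jar wordcount.jar test.WordCount file:///home/hadoop/input.txt /tmp/output : run the program; the file:/// prefix lets you use a local file as the input file, and /tmp/output is the HDFS path that holds the output;
$ hadoop fs -cat /tmp/output/part-r-00000 : view the program's result;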
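
Step 6 passes two kinds of locations: one with an explicit file:/// scheme and one bare path. Hadoop picks the filesystem from the URI scheme, and a path without a scheme is resolved against the configured default filesystem (fs.default.name in Hadoop 1.x). The following is a minimal sketch of that distinction using only the JDK's java.net.URI; the class name UriSchemeSketch and the method describe are hypothetical names for illustration, not Hadoop APIs.

```java
import java.net.URI;

// Illustration only: shows how the two path arguments of step 6 differ as URIs.
// UriSchemeSketch and describe are hypothetical names, not part of Hadoop.
final class UriSchemeSketch {

    // Returns the URI scheme, or a placeholder when the location has no scheme.
    static String describe(String location) {
        String scheme = URI.create(location).getScheme();
        return scheme == null ? "(default filesystem)" : scheme;
    }

    public static void main(String[] args) {
        // Input argument from step 6: explicit local-file scheme.
        System.out.println(describe("file:///home/hadoop/input.txt")); // prints "file"
        // Output argument from step 6: no scheme, so the default filesystem applies.
        System.out.println(describe("/tmp/output")); // prints "(default filesystem)"
    }
}
```

If the cluster's default filesystem is HDFS, as assumed in this post, the bare /tmp/output ends up on HDFS while the input is read from the local disk.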

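If this is your first time through the steps above, there should be no big problems. If it is not the first run, or if we have modified the program, the steps may need small adjustments. Below are three problems that may come up, with their fixes: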

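Exception in thread "main" java.lang.ClassNotFoundException: WordCount : raised when running the WordCount program (step 6 above) with the bare class name; the class name must be qualified with the package name test, that is, test.WordCount;
java.lang.ClassNotFoundException: test.WordCount$Map : raised when running the WordCount program; it complains that WordCount$Map cannot be found, even though that class sits in the same folder as WordCount.class. The program on the Hadoop website does not have the job.setJarByClass(WordCount.class) line that appears in the code below, and without it this exception is thrown; adding the line fixes it;
Output path already exists : Output director : this one is easy to understand: the output folder /tmp/output already exists. Either specify a different one, or delete the folder with $ hadoop fs -rmr /tmp/output (on Hadoop 2.x and later, -rmr is deprecated in favor of hadoop fs -rm -r /tmp/output);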
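
The first two problems both come down to how the JVM names classes: a class is looked up by its fully qualified name (package included), and a nested class such as Map inside WordCount gets the binary name WordCount$Map. The following is a small sketch of that naming rule, using JDK classes so it runs without Hadoop; ClassNameSketch and canLoad are hypothetical names, and it does not reproduce Hadoop's own class loading.

```java
// Illustration only: demonstrates JVM class naming with JDK classes.
// ClassNameSketch and canLoad are hypothetical names, not part of Hadoop.
final class ClassNameSketch {

    // Returns true if a class with exactly this name can be loaded.
    static boolean canLoad(String name) {
        try {
            Class.forName(name);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Simple name without the package: not found, like "WordCount" in problem 1.
        System.out.println(canLoad("ArrayList"));            // prints "false"
        // Fully qualified name: found, like "test.WordCount".
        System.out.println(canLoad("java.util.ArrayList"));  // prints "true"
        // Nested classes use Outer$Inner, like "test.WordCount$Map" in problem 2.
        System.out.println(canLoad("java.util.Map$Entry"));  // prints "true"
        System.out.println(java.util.Map.Entry.class.getName()); // prints "java.util.Map$Entry"
    }
}
```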

Program code

package test;

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every whitespace-separated token of a line.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts emitted for one word.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // args[0] is the input path, args[1] is the output path (must not exist yet).
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");
        // +++++ Not in the official example: tells Hadoop which jar holds the job
        // classes; without it, test.WordCount$Map is not found at run time.
        job.setJarByClass(WordCount.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Exit with a non-zero status if the job fails.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
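
To check what the job should produce without touching a cluster, here is a minimal plain-Java sketch of the same map, shuffle, and reduce flow. It is only an in-memory approximation under stated assumptions: LocalWordCountSketch and its method names are hypothetical, the real shuffle runs across machines, and Hadoop orders Text keys by their bytes, which matches the String ordering used here for ASCII words.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Illustration only: an in-memory imitation of the WordCount data flow.
// All names here are hypothetical; nothing in this class uses Hadoop.
final class LocalWordCountSketch {

    // "Map" phase: emit one (word, 1) pair per whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return pairs;
    }

    // "Shuffle" phase: group the emitted values by key, with keys kept sorted.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // "Reduce" phase: sum the values collected for one key.
    static int reduce(Iterable<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs the three phases over all input lines and returns word -> count.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(pairs).entrySet()) {
            counts.put(group.getKey(), reduce(group.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello hadoop", "hello world");
        // Prints one "word<TAB>count" line per word: hadoop 1, hello 2, world 1.
        for (Map.Entry<String, Integer> entry : wordCount(lines).entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

The tab between word and count mirrors the default separator of TextOutputFormat, so for a small input.txt the printed lines can be compared against the contents of part-r-00000.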