Word Count


Source: http://tlyxy228.blog.163.com/blog/static/181090120105208322823/

map(String key1, String value1):
// key1: document name
// value1: document contents (a sequence of words)
  for each word w in value1:
    EmitIntermediate(w, "1")

reduce(String key2, Iterator values2):
// key2: a word
// values2: a list of counts (one "1" per occurrence emitted by map)
  int result = 0
  for each v in values2:
    result += ParseInt(v)
  Emit(key2, AsString(result))

Example:

Input:
hello world bye world
hello hadoop bye hadoop
bye hadoop hello hadoop
Output:
hello: 3
world: 2
bye: 3
hadoop: 4
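
Before stepping through the phases one by one, here is a minimal single-process Java sketch of the whole pipeline (hypothetical standalone code, no Hadoop involved): map emits a <word, 1> pair per word, shuffle groups the pairs by key, and reduce sums each group. Run on the input above, it prints the expected counts (in sorted key order).

import java.util.*;

// Minimal in-memory simulation of the WordCount data flow (no Hadoop).
public class WordCountSimulation {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "hello world bye world",
                "hello hadoop bye hadoop",
                "bye hadoop hello hadoop");

        // Map phase: emit a (word, 1) pair for every word in every line.
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                intermediate.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }

        // Shuffle phase: group the intermediate pairs by key (sorted here).
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : intermediate) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }

        // Reduce phase: sum the list of counts for each word.
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int count : entry.getValue()) {
                sum += count;
            }
            System.out.println(entry.getKey() + ": " + sum);
        }
    }
}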

Step-by-step walkthrough:

hello world bye world
hello hadoop bye hadoop
bye hadoop hello hadoop
1. Split

With TextInputFormat, each line of the input becomes one record: the key is the byte offset at which the line starts and the value is the line itself.

hello world bye world --> (0, "hello world bye world")

hello hadoop bye hadoop --> (22, "hello hadoop bye hadoop")

bye hadoop hello hadoop --> (46, "bye hadoop hello hadoop")
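
A minimal sketch that computes those (offset, line) records (hypothetical standalone Java; in Hadoop the InputFormat does this):

public class SplitSketch {
    public static void main(String[] args) {
        String input = "hello world bye world\n"
                     + "hello hadoop bye hadoop\n"
                     + "bye hadoop hello hadoop\n";
        long offset = 0;
        // Emit one (byte offset, line) record per input line.
        for (String line : input.split("\n")) {
            System.out.println("(" + offset + ", \"" + line + "\")");
            offset += line.length() + 1; // +1 for the trailing newline
        }
    }
}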

2. Map

hello world bye world
-->
<hello, 1>
<world, 1>
<bye, 1>
<world, 1>

hello hadoop bye hadoop
-->
<hello, 1>
<hadoop, 1>
<bye, 1>
<hadoop, 1>

bye hadoop hello hadoop
-->
<bye, 1>
<hadoop, 1>
<hello, 1>
<hadoop, 1>

{ Optional: Combine
The combiner merges each map task's intermediate output locally before it is sent over the network, reducing both the number of tuples and the network traffic. Whether a combiner is worthwhile depends partly on the data (how many duplicate keys there are) and partly on the network: when the network is fast, the combiner buys little and may buy nothing at all. In this job the combiner is the same summing function as the reducer (conf.setCombinerClass(Reduce.class) in the source below), so duplicate keys within a split collapse into one pair carrying a partial sum; a standalone sketch follows this block.

<hello, 1>
<world, 1>
<bye, 1>
<world, 1>

<hello, 1>
<hadoop, 1>
<bye, 1>
<hadoop, 1>

<bye, 1>
<hadoop, 1>
<hello, 1>
<hadoop, 1>

-->

<hello, 1>
<world, 2>
<bye, 1>

<hello, 1>
<hadoop, 2>
<bye, 1>

<bye, 1>
<hadoop, 2>
<hello, 1>
}
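
Because the combine step here is just a local reduce over one map task's output, it can be sketched as a plain merge (hypothetical standalone code, not Hadoop's combiner plumbing):

import java.util.*;

// Local combine: collapse duplicate keys within one map task's output
// by summing their counts before anything is sent over the network.
public class CombineSketch {
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            combined.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        // Output of the first map task: "hello world bye world".
        List<Map.Entry<String, Integer>> split1 = Arrays.asList(
                new AbstractMap.SimpleEntry<>("hello", 1),
                new AbstractMap.SimpleEntry<>("world", 1),
                new AbstractMap.SimpleEntry<>("bye", 1),
                new AbstractMap.SimpleEntry<>("world", 1));
        // Prints {hello=1, world=2, bye=1}: four tuples shrink to three.
        System.out.println(combine(split1));
    }
}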

3. Shuffle/Sort

<hello, 1>
<world, 1>
<bye, 1>
<world, 1>

<hello, 1>
<hadoop, 1>
<bye, 1>
<hadoop, 1>

<bye, 1>
<hadoop, 1>
<hello, 1>
<hadoop, 1>

-->

<hello, 1>
<hello, 1>
<hello, 1>

<world, 1>
<world, 1>

<bye, 1>
<bye, 1>
<bye, 1>

<hadoop, 1>
<hadoop, 1>
<hadoop, 1>
<hadoop, 1>

The intermediate pairs are partitioned by key2 into R partitions (using the same hash that decides which node runs the Reduce task for a key) and sorted, producing {key, {value}} groups that each map node stores on its local disk. A sketch of the hash-based assignment follows.
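
The sketch below mirrors the logic of Hadoop's default HashPartitioner; the demo value R = 2 is our own:

public class PartitionSketch {
    // Mask off the sign bit so the result is non-negative, then take the
    // remainder mod R; this mirrors Hadoop's default HashPartitioner.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int r = 2; // example: R = 2 reduce tasks
        for (String key : new String[] { "hello", "world", "bye", "hadoop" }) {
            System.out.println(key + " -> partition " + getPartition(key, r));
        }
    }
}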

4. Reduce
Each Reduce task reads its {key, {value}} groups and calls the application-defined reduce function once per (key, {value}) pair. Which Reduce task (and hence which node) handles a given key is decided from the intermediate key key2, e.g. hash(key2) mod R, so one reduce may fetch intermediate data from several map nodes.

<hello, 1>
<hello, 1>
<hello, 1>

<world, 1>
<world, 1>

<bye, 1>
<bye, 1>
<bye, 1>

<hadoop, 1>
<hadoop, 1>
<hadoop, 1>
<hadoop, 1>

-->
<hello, 3>
<world, 2>
<bye, 3>
<hadoop, 4>

Source code: WordCount.java

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

    // Mapper (old org.apache.hadoop.mapred API): for each input line,
    // emit a <word, 1> pair per whitespace-separated token.
    public static class Map extends MapReduceBase implements
            Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Reducer (also registered as the combiner below): sum the counts
    // collected for each word.
    public static class Reduce extends MapReduceBase implements
            Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }

            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        // The reducer doubles as the combiner: summation is associative
        // and commutative, so combining partial counts locally is safe.
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
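
The program above uses the older org.apache.hadoop.mapred API, which was later deprecated. For reference, here is a sketch of the same job against the newer org.apache.hadoop.mapreduce API (Hadoop 2.x style; the class names TokenizerMapper and IntSumReducer are our own):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountNewApi {

    // Mapper: emit a <word, 1> pair per whitespace-separated token.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer (also used as the combiner): sum the counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountNewApi.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Either version is packaged into a jar and submitted with the hadoop jar command, with the input and output paths passed as the two program arguments.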