Word Count
http://tlyxy228.blog.163.com/blog/static/181090120105208322823/
map(String key1, String value1):
  // key1: document name
  // value1: document contents (the words)
  for each word w in value1:
    EmitIntermediate(w, "1")

reduce(String key2, Iterator values2):
  // key2: a word
  // values2: a list of intermediate counts for that word
  int result = 0
  for each v in values2:
    result += ParseInt(v)
  Emit(AsString(result))
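The pseudocode above can be simulated in plain Java with no Hadoop installed; this is an illustrative sketch only, with an in-memory grouping step standing in for the framework's shuffle:

```java
import java.util.*;

public class WordCountSim {
    // map phase: emit a (word, 1) pair for every word in every line
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String line : lines)
            for (String w : line.split("\\s+"))
                out.add(new AbstractMap.SimpleEntry<>(w, 1));
        return out;
    }

    // shuffle + reduce phase: group the pairs by word, then sum the 1s
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
            "hello world bye world",
            "hello hadoop bye hadoop",
            "bye hadoop hello hadoop");
        System.out.println(reduce(map(input)));
        // prints {bye=3, hadoop=4, hello=3, world=2}
    }
}
```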
Example:
Input
hello world bye world
hello hadoop bye hadoop
bye hadoop hello hadoop
Output
hello: 3
world: 2
bye: 3
hadoop: 4
Walkthrough:
hello world bye world
hello hadoop bye hadoop
bye hadoop hello hadoop
1. Split
hello world bye world --> (key, value)
hello hadoop bye hadoop --> (key, value)
bye hadoop hello hadoop --> (key, value)
(with TextInputFormat, each key is the line's byte offset in the file and each value is the line text)
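A plain-Java sketch of this split step (illustrative only; assumes ASCII text and '\n' line endings) produces those (key, value) records:

```java
import java.util.*;

public class SplitSketch {
    // Produce (byteOffset, line) records the way a line-oriented input
    // format would for one small file (assumes ASCII and '\n' endings).
    static LinkedHashMap<Long, String> split(String file) {
        LinkedHashMap<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : file.split("\n", -1)) {
            if (!line.isEmpty()) records.put(offset, line);
            offset += line.length() + 1;  // +1 for the '\n'
        }
        return records;
    }

    public static void main(String[] args) {
        String file = "hello world bye world\nhello hadoop bye hadoop\nbye hadoop hello hadoop\n";
        split(file).forEach((k, v) -> System.out.println("(" + k + ", \"" + v + "\")"));
    }
}
```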
2. Map
hello world bye world
-->
<hello, 1>
<world, 1>
<bye, 1>
<world, 1>
hello hadoop bye hadoop
-->
<hello, 1>
<hadoop, 1>
<bye, 1>
<hadoop, 1>
bye hadoop hello hadoop
-->
<bye, 1>
<hadoop, 1>
<hello, 1>
<hadoop, 1>
{ Optional: Combine
Merge each map task's intermediate results locally into <key, list(value)> and pre-aggregate them, reducing the number of tuples and the network traffic.
(Whether to use Combine depends partly on the data (how many duplicate keys there are) and partly on network bandwidth: when the network is fast, Combine brings limited gains and may not improve performance at all.)
<hello, 1>
<world, 1>
<bye, 1>
<world, 1>
<hello, 1>
<hadoop, 1>
<bye, 1>
<hadoop, 1>
<bye, 1>
<hadoop, 1>
<hello, 1>
<hadoop, 1>
-->
(combined within each map task)
<hello, 1>
<world, 2>
<bye, 1>
<hello, 1>
<hadoop, 2>
<bye, 1>
<bye, 1>
<hadoop, 2>
<hello, 1>
}
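The effect of combining can be shown with a small plain-Java sketch (illustrative, not the Hadoop API): each map task sums its own output locally before anything crosses the network, so the twelve <word, 1> tuples shrink to nine:

```java
import java.util.*;

public class CombinerSim {
    // Combine: locally aggregate one map task's output before the shuffle.
    static Map<String, Integer> combine(String line) {
        Map<String, Integer> local = new LinkedHashMap<>();
        for (String w : line.split("\\s+"))
            local.merge(w, 1, Integer::sum);
        return local;
    }

    public static void main(String[] args) {
        String[] splits = {
            "hello world bye world",
            "hello hadoop bye hadoop",
            "bye hadoop hello hadoop"};
        int tuples = 0;
        for (String s : splits) {
            Map<String, Integer> c = combine(s);
            tuples += c.size();
            System.out.println(c);
        }
        System.out.println("tuples sent over the network: " + tuples);  // 9 instead of 12
    }
}
```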
3. Shuffle/Sort
<hello, 1>
<world, 1>
<bye, 1>
<world, 1>
<hello, 1>
<hadoop, 1>
<bye, 1>
<hadoop, 1>
<bye, 1>
<hadoop, 1>
<hello, 1>
<hadoop, 1>
-->
<hello, 1>
<hello, 1>
<hello, 1>
<world, 1>
<world, 1>
<bye, 1>
<bye, 1>
<bye, 1>
<hadoop, 1>
<hadoop, 1>
<hadoop, 1>
<hadoop, 1>
Partition the pairs by key2 into R partitions (using the same hash that assigns keys to Reduce-task nodes) and sort, producing {key2, {value2}} groups, each stored on the local disk of its map node.
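The hash(key2) mod R scheme can be sketched in plain Java; masking the sign bit, as Hadoop's default HashPartitioner does, keeps the result in [0, R) even for negative hash codes:

```java
public class PartitionSketch {
    // Assign a key to one of R reduce partitions: hash(key) mod R.
    // The & Integer.MAX_VALUE masks the sign bit so that negative
    // hashCode() values still map into [0, numReduceTasks).
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int R = 2;
        for (String w : new String[]{"hello", "world", "bye", "hadoop"})
            System.out.println(w + " -> partition " + partition(w, R));
    }
}
```

All tuples with the same word land in the same partition, which is why one reduce call sees every count for that word.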
4. Reduce
Read the {key2, {value2}} groups and call the user-defined reduce function once for each (key2, {value2}) pair. The intermediate key key2 determines which node runs the Reduce task (e.g., hash(key2) mod R), so one reduce task may read intermediate data from multiple map nodes.
<hello, 1>
<hello, 1>
<hello, 1>
<world, 1>
<world, 1>
<bye, 1>
<bye, 1>
<bye, 1>
<hadoop, 1>
<hadoop, 1>
<hadoop, 1>
<hadoop, 1>
-->
<hello, 3>
<world, 2>
<bye, 3>
<hadoop, 4>
Source code: WordCount.java
import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

    // Mapper: for each input line, emit (word, 1) for every token.
    public static class Map extends MapReduceBase implements
            Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Reducer (also used as the combiner): sum the counts for each word.
    public static class Reduce extends MapReduceBase implements
            Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        // Summing is associative and commutative, so the reducer
        // can double as the combiner.
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}