MapReduce Summary


1、MapReduce provides:

-Automatic parallelization and distribution;

-Fault tolerance;

-Status and monitoring tools;

-A clean abstraction for programmers.

（1）map (in_key, in_value) -> (out_key, intermediate_value) list:

-Records from the data source (lines of files, rows of a database, etc.) are fed into the map function as key/value pairs, e.g. (filename, line).

-map() produces one or more intermediate values from the input, each paired with an output key.

（2）reduce (out_key, intermediate_value list) -> out_value list:

-After the map phase is over, all the intermediate values for a given output key are combined into a list;

-reduce() combines those intermediate values into one or more final values for that same output key (in practice, usually only one final value per key).
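The two signatures above can be sketched in plain Java without a cluster. This is a minimal single-process illustration, not Hadoop's API: the class `MiniMapReduce`, its `run` and `wordCount` methods, and the sample input are hypothetical names chosen for this example. It walks through the three steps: map each record to (key, value) pairs, group the values by key, and reduce each group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;
import java.util.function.Function;

// Hypothetical single-process illustration; none of these names come from Hadoop.
final class MiniMapReduce {

    // Applies mapFn to every input record, groups the intermediate values by
    // output key, then calls reduceFn once per key.
    static <I, K extends Comparable<K>, V, R> Map<K, R> run(
            List<I> inputs,
            Function<I, List<Map.Entry<K, V>>> mapFn,
            BiFunction<K, List<V>, R> reduceFn) {
        Map<K, List<V>> grouped = new TreeMap<>();
        for (I input : inputs) {
            for (Map.Entry<K, V> p : mapFn.apply(input)) {
                grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
            }
        }
        Map<K, R> result = new TreeMap<>();
        for (Map.Entry<K, List<V>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduceFn.apply(e.getKey(), e.getValue()));
        }
        return result;
    }

    // Word count expressed as the two user-supplied functions.
    static Map<String, Integer> wordCount(List<String> lines) {
        // map: line -> list of (word, 1)
        Function<String, List<Map.Entry<String, Integer>>> mapFn = line -> {
            List<Map.Entry<String, Integer>> out = new ArrayList<>();
            for (String w : line.trim().split("\\s+")) {
                if (!w.isEmpty()) {
                    out.add(Map.entry(w, 1));
                }
            }
            return out;
        };
        // reduce: (word, [1, 1, ...]) -> total count
        BiFunction<String, List<Integer>, Integer> reduceFn =
                (word, ones) -> ones.stream().mapToInt(Integer::intValue).sum();
        return run(lines, mapFn, reduceFn);
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("a b a", "b c"))); // prints {a=2, b=2, c=1}
    }
}
```

The grouping step in the middle is what a real framework performs between the two phases; the programmer only writes the two functions passed to `run`.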

2、Parallelism

（1）map() functions run in parallel, creating different intermediate values from different input data sets

（2）reduce() functions also run in parallel, each working on a different output key

（3）All values are processed independently

（4）Bottleneck: the reduce phase can’t start until the map phase is completely finished.
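The points above can be sketched on a single machine, with a thread pool standing in for the worker nodes. This is only an illustration under that assumption: `ParallelPhases`, `wordCount`, and the `workers` parameter are hypothetical names, and a real cluster distributes tasks across machines rather than threads. The sketch runs one map task per input split, waits for all of them, and only then starts one reduce task per output key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration: a thread pool stands in for the cluster's workers.
final class ParallelPhases {

    static Map<String, Integer> wordCount(List<String> splits, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            // Map phase: one independent task per input split.
            List<Future<List<String>>> mapTasks = new ArrayList<>();
            for (String split : splits) {
                mapTasks.add(pool.submit(() -> {
                    List<String> words = new ArrayList<>();
                    for (String w : split.trim().split("\\s+")) {
                        if (!w.isEmpty()) {
                            words.add(w);
                        }
                    }
                    return words;
                }));
            }

            // Barrier: get() blocks until each map task is done, so the grouping
            // is complete only after the whole map phase has finished.
            Map<String, List<Integer>> grouped = new TreeMap<>();
            for (Future<List<String>> task : mapTasks) {
                for (String w : task.get()) {
                    grouped.computeIfAbsent(w, k -> new ArrayList<>()).add(1);
                }
            }

            // Reduce phase: one independent task per output key.
            Map<String, Future<Integer>> reduceTasks = new TreeMap<>();
            for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
                List<Integer> values = e.getValue();
                reduceTasks.put(e.getKey(), pool.submit(() -> {
                    int sum = 0;
                    for (int v : values) {
                        sum += v;
                    }
                    return sum;
                }));
            }
            Map<String, Integer> result = new TreeMap<>();
            for (Map.Entry<String, Future<Integer>> e : reduceTasks.entrySet()) {
                result.put(e.getKey(), e.getValue().get());
            }
            return result;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("a b a", "b c", "c c"), 3)); // prints {a=2, b=2, c=3}
    }
}
```

The loop over `mapTasks` is the bottleneck described in point (4): no reduce task is submitted until every map result has been collected.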

3、MapReduce Conclusions

（1）MapReduce has proven to be a useful abstraction in many areas

（2）Greatly simplifies large-scale computations

（3）Functional programming paradigm can be applied to large-scale applications

（4）You focus on the “real” problem; the library deals with the messy details

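4、Example: Word Count

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

    // Map(): emits (word, 1) for every whitespace-separated token in a line.
    public static class MapClass extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Reduce(): sums the counts collected for one word.
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    // Driver: args[0] is the input path, args[1] is the output path.
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WordCount.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(MapClass.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}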
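The driver registers the same Reduce class as both combiner and reducer. That works for word count because addition is associative and commutative, so partial sums computed next to each map task can be summed again by the reducer without changing the totals. The plain-Java sketch below illustrates this idea only; `CombinerSketch`, `mapAndCombine`, and `reduce` are hypothetical names, not Hadoop calls.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of why a summing reducer can double as a combiner.
final class CombinerSketch {

    // Stand-in for one map task followed by a local combine:
    // counts the words of a single input split.
    static Map<String, Integer> mapAndCombine(String split) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String w : split.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                partial.merge(w, 1, Integer::sum);
            }
        }
        return partial;
    }

    // Stand-in for the reduce step: adds up the partial sums from every split.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, n) -> total.merge(word, n, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> p1 = mapAndCombine("a b a"); // {a=2, b=1}
        Map<String, Integer> p2 = mapAndCombine("b c b"); // {b=2, c=1}
        System.out.println(reduce(List.of(p1, p2)));       // prints {a=2, b=3, c=1}
    }
}
```

With the combiner, each split sends at most one pair per distinct word to the reducer instead of one pair per occurrence, which reduces the data moved between the two phases. An operation that is not associative and commutative (an average, for example) could not reuse its reducer this way.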

5、One-time setup

-Configure hadoop-site.xml and the slaves file

-Initialize (format) the namenode

-Start the Hadoop DFS and MapReduce daemons

-Upload your data to DFS

-Run your process…

-Download your data from DFS

*A simple programming model for processing large datasets on large clusters of computers

*Fun to use: focus on the problem and let the library deal with the messy details

6、References

-Original paper (http://labs.google.com/papers/mapreduce.html)

-On Wikipedia (http://en.wikipedia.org/wiki/MapReduce)

-Hadoop - MapReduce in Java (http://lucene.apache.org/hadoop/)

-Starfish - MapReduce in Ruby (http://rufy.com/starfish/)