MapReduce

Source: Internet | Published by: 佛山编程培训学校 | Edited by: 程序博客网 | Time: 2024/04/27 20:15

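MapReduce: first map (that is, transform the data from one form into another), then merge the results (reduce). That is all there is to it.

Split the work into multiple subtasks (map) ---> then merge the results (reduce).

It also has to deal with fault tolerance: what happens if one machine goes down?

Mapping, then reduction.

MapReduce provides a framework for exactly this.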
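The "split into subtasks, then merge" idea can be sketched in plain Java without any framework. This is only a toy illustration: the class name SplitAndMerge and its methods are made up for this example, and everything runs sequentially in one JVM rather than on a cluster.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of "split into subtasks (map), then merge (reduce)".
// All names here are hypothetical; nothing below is part of Hadoop.
class SplitAndMerge {

    // Split the input into chunks of at most chunkSize elements.
    static List<List<Integer>> split(List<Integer> data, int chunkSize) {
        List<List<Integer>> chunks = new ArrayList<>();
        for (int i = 0; i < data.size(); i += chunkSize) {
            chunks.add(data.subList(i, Math.min(i + chunkSize, data.size())));
        }
        return chunks;
    }

    // "Map" step: each subtask turns one chunk into a partial sum.
    static int partialSum(List<Integer> chunk) {
        int sum = 0;
        for (int n : chunk) {
            sum += n;
        }
        return sum;
    }

    // "Reduce" step: merge the partial results into the final answer.
    static int merge(List<Integer> partials) {
        int total = 0;
        for (int p : partials) {
            total += p;
        }
        return total;
    }

    static int sum(List<Integer> data, int chunkSize) {
        List<Integer> partials = new ArrayList<>();
        for (List<Integer> chunk : split(data, chunkSize)) {
            partials.add(partialSum(chunk));
        }
        return merge(partials);
    }

    public static void main(String[] args) {
        // Chunks are [1,2,3], [4,5,6], [7]; partial sums are 6, 15, 7.
        System.out.println(sum(List.of(1, 2, 3, 4, 5, 6, 7), 3)); // prints 28
    }
}
```

Because each chunk is processed independently, a subtask that fails could simply be run again on its own chunk, which is the intuition behind the fault-tolerance question above.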

Baidu Baike

http://baike.baidu.com/view/2902.htm

"How I Explained MapReduce to My Wife", a very good explanation

http://blog.jobbole.com/1321/

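MapReduce architecture design

The instructor explained this well using the example of cooking scrambled eggs with tomatoes. The analogy could be extended further, for example to a whole restaurant.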

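Terminology:

Job: a unit of work; each computation request submitted by a user.

JobTracker: the server that users submit jobs to. It also assigns the tasks of each job and manages all the TaskTrackers.

TaskTracker: the hardworking worker bee that actually executes the tasks.

Task: every job has to be split up and handed to multiple servers to complete. Each execution unit produced by the split is a task.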
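To make the relationship between these terms concrete, here is a toy model in plain Java. It is a sketch under loud assumptions: the class ToyJobTracker, the tracker names, and the round-robin assignment are all invented for illustration and do not reflect how Hadoop's real scheduler chooses a TaskTracker.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the terminology only; NOT Hadoop's actual scheduling logic.
class ToyJobTracker {
    private final List<String> taskTrackers;

    ToyJobTracker(List<String> taskTrackers) {
        this.taskTrackers = taskTrackers;
    }

    // Split one job into numTasks tasks and hand them out round-robin
    // (round-robin is an assumption made for this sketch).
    Map<String, List<String>> assign(String jobName, int numTasks) {
        Map<String, List<String>> plan = new LinkedHashMap<>();
        for (String tracker : taskTrackers) {
            plan.put(tracker, new ArrayList<>());
        }
        for (int i = 0; i < numTasks; i++) {
            String tracker = taskTrackers.get(i % taskTrackers.size());
            plan.get(tracker).add(jobName + "-task-" + i);
        }
        return plan;
    }

    public static void main(String[] args) {
        ToyJobTracker jobTracker = new ToyJobTracker(List.of("tracker-A", "tracker-B"));
        // One job, five tasks, two task trackers.
        System.out.println(jobTracker.assign("wordcount", 5));
    }
}
```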

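MapReduce development

1. Configuration file: core-site.xml, the same as for HDFS

2. A Tool that prints every configuration property:

import java.util.Map.Entry;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class TestTool extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        // getConf() returns the Configuration that ToolRunner set on this Tool
        for (Entry<String, String> entry : getConf()) {
            System.out.println(entry.getKey() + ":" + entry.getValue());
        }
        return 0;
    }
    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new TestTool(), args));
    }
}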

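Configuration file management

Different environments may need different configuration files.

hadoop fs -conf <config file> switches to the given configuration file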

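Finding the most frequently used words in the graduation theses of the past 50 years:

1. Sequential scan on a single machine

2. Multi-threaded concurrent scan on a single machine

3. Concurrent scan across multiple machines

4. MapReduce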
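As a baseline for comparison, approach 1 (a sequential scan on one machine) might look like the sketch below. The class name TopWordsSequential is made up for this example, and two details are assumptions of the sketch rather than part of the original problem statement: words are lowercased before counting, and ties are broken alphabetically so the result is deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Approach 1: one machine, one thread, one pass over all documents.
class TopWordsSequential {

    static List<String> topWords(List<String> documents, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : documents) {
            // Assumption: case-insensitive counting, words split on non-word characters.
            for (String w : doc.toLowerCase().split("\\W+")) {
                if (!w.isEmpty()) {
                    counts.merge(w, 1, Integer::sum);
                }
            }
        }
        List<String> words = new ArrayList<>(counts.keySet());
        // Sort by descending count; break ties alphabetically.
        words.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return words.subList(0, Math.min(n, words.size()));
    }

    public static void main(String[] args) {
        List<String> docs = List.of("the cat and the dog", "The dog");
        System.out.println(topWords(docs, 2)); // prints [the, dog]
    }
}
```

This works until the corpus no longer fits comfortably on one machine, which is the motivation for the later approaches in the list.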

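The map function and the reduce function

The map function takes one key/value pair and produces a set of intermediate key/value pairs. Intermediate pairs that share the same key are passed to the same reduce call.

The reduce function takes a key and the set of values associated with it, and merges those values into a smaller set of values (usually a single value, or none).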
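The contract of the two functions can be imitated in a few lines of plain Java. This is a minimal single-JVM sketch, not Hadoop code: the class MiniMapReduce and its method names are hypothetical, and the "shuffle" step that groups values by key is done here with an in-memory TreeMap.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory imitation of map -> group by key -> reduce for word counting.
class MiniMapReduce {

    // map: one (offset, line) pair in, a list of intermediate (word, 1) pairs out.
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.split("\\W+")) {
            if (!w.isEmpty()) {
                out.add(new SimpleEntry<>(w, 1));
            }
        }
        return out;
    }

    // reduce: one key plus all of its values in, a single merged value out.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    static Map<String, Integer> run(List<String> lines) {
        // Group intermediate values by key so each key reaches exactly one reduce call.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        long offset = 0;
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(offset, line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
            offset += line.length() + 1;
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a b a", "b c"))); // prints {a=2, b=2, c=1}
    }
}
```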

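Word Counter

Computing word frequencies

Development steps:

1. Write the Mapper

Extend MapReduceBase and implement Mapper<input key, input value, output key, output value>

2. Write the Reducer

Extend MapReduceBase and implement Reducer<input key, input value, output key, output value>

3. Write a Driver class (that is, the Job) that wires the Mapper and Reducer classes together

The code is as follows: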

1. Mapper

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        // Split on non-word characters (\\W+); lowercase \\w+ would throw the words away
        for (String token : line.split("\\W+")) {
            if (token.length() > 0) {
                word.set(token); // reuse the Text and IntWritable objects
                output.collect(word, one);
            }
        }
    }
}

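2. Reducer

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        // Add up all the 1s emitted by the mappers for this word
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}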

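3. Driver

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCount extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: WordCount <input path> <output path>");
            return -1;
        }

        // Job configuration
        JobConf conf = new JobConf(getConf(), WordCount.class); // set the job class
        conf.setJobName("wordcount"); // job name

        // Set the data types of the output key/value
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(WordMapper.class); // set the Mapper class
        conf.setReducerClass(WordReducer.class); // set the Reducer class

        FileInputFormat.addInputPath(conf, new Path(args[0])); // set the input path
        FileOutputFormat.setOutputPath(conf, new Path(args[1])); // set the output path

        JobClient.runJob(conf); // submit the job and wait for it to finish
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new WordCount(), args)); // run
    }
}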

Mapper/Reducer API

Old API, in org.apache.hadoop.mapred: Interface Mapper<Input key, Input value, Output key, Output value>

New API, in org.apache.hadoop.mapreduce: Class Mapper<Input key, Input value, Output key, Output value>

Old API, in org.apache.hadoop.mapred: Interface Reducer<Input key, Input value, Output key, Output value>

New API, in org.apache.hadoop.mapreduce: Class Reducer<Input key, Input value, Output key, Output value>

The new API is a complete redesign (interfaces became classes, in a different package), which is rather painful.

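MapReduce execution modes:

1. Local (Standalone) Mode: everything runs in a single JVM, with no distribution and no HDFS; the local Linux file system is used

2. Pseudo-distributed Mode: one machine runs multiple JVM processes to simulate a cluster

3. Fully-distributed Mode: truly distributed, running across multiple machines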

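Setting the default file system for Hadoop:

Set fs.default.name in core-site.xml to one of:

file:/// (local), hdfs://localhost:9000 (pseudo-distributed), hdfs://namenode (fully distributed)

The HDFS client uses this property to determine the location of the NameNode.

mapred.job.tracker: local (local mode), localhost:9001 (pseudo-distributed), jobtracker:9001 (fully distributed)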

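Running Word Count on a cluster

1. Package the program as a jar

2. Launch it

hadoop jar <path to jar> <mainClass> -conf <config file> <input directory> <output directory>

MapReduce web UI URL: http://localhost:50030

The input directory and the output directory are on HDFS; the output directory must not exist yet.

3. Fetch the results

hadoop fs -ls <output directory>

hadoop fs -cat <output directory>/part-NNNN shows the results

The output directory also contains _SUCCESS (a marker that the job succeeded) and _logs (the logs).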

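How to break a problem down into MapReduce

More complex requirements:

In the Word Count program, compute the sum of all word frequencies

Convert any word that contains an uppercase H to lowercase

In the Word Count program, compute both the sum of all word frequencies and the number of words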
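One way to think about these requirements is that the "uppercase H" rule belongs in the map step, while the totals are a second aggregation over the word counts. The plain-Java sketch below only illustrates that decomposition; the class WordCountVariants and its methods are hypothetical, and it interprets "the number of words" as the number of distinct words, which is an assumption.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the extra requirements, run in memory rather than on Hadoop.
class WordCountVariants {

    // Map-side rule: a word containing an uppercase 'H' is converted to lowercase.
    static String normalize(String word) {
        return word.indexOf('H') >= 0 ? word.toLowerCase(Locale.ROOT) : word;
    }

    // Ordinary word count, applying normalize() to every word first.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String w : line.split("\\W+")) {
                if (!w.isEmpty()) {
                    counts.merge(normalize(w), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Second aggregation: { sum of all frequencies, number of distinct words }.
    static int[] totals(Map<String, Integer> counts) {
        int total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        return new int[] { total, counts.size() };
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = count(List.of("Hello hello World", "Hadoop World"));
        int[] t = totals(counts);
        // "Hello" and "Hadoop" contain 'H' and are lowercased; "World" is left alone.
        System.out.println(counts + " total=" + t[0] + " distinct=" + t[1]);
    }
}
```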

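Running independent jobs:

Suppose Job1 and Job2 both need to run:

JobClient.runJob(job1);

JobClient.runJob(job2); These run linearly: runJob blocks until its job completes, so job2 starts only after job1 has finished.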

My blog: http://sishuok.com/forum/blogPost/list/0/6918.html