Hadoop in Action (4): Writing MR Programs

Source: Internet | Editor: 程序博客网 | Time: 2024/05/16 05:15

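Before the main text, here is a summary of this chapter's key points:

I. MR program skeleton
The MyJob class contains:
1. Mapper and Reducer as inner classes
2. run() as the driver, which instantiates and configures the job

II. Hadoop Streaming
Interacts with programs through Unix streams

III. Extending the MR framework
Combiner: eases an overly heavy shuffle and data skew (the data is not uniformly distributed, so some reducers carry a much heavier load).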

Sample data set

1. Download the data set

Address: http://www.nber.org/patents/
wget http://www.nber.org/patents/acite75_99.zip
wget http://www.nber.org/patents/apat63_99.zip
Unzip: unzip acite75_99.zip
Upload to HDFS: hadoop fs -copyFromLocal ./cite75_99.txt /wttttt/chap4

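2. Data description

Patent citation data:
Each line: CITING, CITED (comma-separated; together the lines form the patent citation graph)
Patent description data:
Each line: patent number, grant year, grant date, application year, country, state, assignee… (some values are missing)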

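MR program skeleton

1. For each patent, find the patents that cite it

Output format:
CITED CITING1,CITING2,…
Sample code: see Appendix 1.
Each MR job is defined by a single class, MyJob. Hadoop requires the Mapper and Reducer to be static classes of their own; because these classes are very small, the template makes them inner classes of MyJob.
MyJob therefore has the following parts:
- run(): the driver. It instantiates and configures a JobConf object that describes the job, then passes it to JobClient.runJob() to start the MR job.
  - JobConf: holds all the configuration parameters the job needs to run, including the input/output paths, the Mapper class, and the Reducer class.
    We do not want to set every parameter through it, of course, so the configuration files from the Hadoop installation supply defaults and shorten the per-job setup. Users can also pass extra parameters when launching the job.
- Mapper: its core is map()
- Reducer: its core is reduce()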
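
The data flow through the skeleton can be sketched without a cluster. The following is a minimal in-memory simulation in Python, not Hadoop code: the names `map_fn`, `reduce_fn`, and `run_job` are hypothetical, the citation pairs are made up, and the "shuffle" is just a dictionary that groups values by key.

```python
from collections import defaultdict


def map_fn(citing, cited):
    """Mirrors MapClass.map() in Appendix 1: swap key and value."""
    yield cited, citing


def reduce_fn(cited, citings):
    """Mirrors Reduce.reduce(): join every citing patent with commas."""
    return cited, ",".join(citings)


def run_job(lines):
    """Tiny stand-in for the map -> shuffle -> reduce pipeline."""
    groups = defaultdict(list)
    for line in lines:
        # Split at the first comma: the part before it is the key (CITING),
        # the rest is the value (CITED).
        citing, cited = line.strip().split(",", 1)
        for key, value in map_fn(citing, cited):
            groups[key].append(value)  # "shuffle": group values by key
    # Reducers see keys in sorted order.
    return [reduce_fn(key, groups[key]) for key in sorted(groups)]


# Made-up "CITING,CITED" pairs, not real patent numbers.
sample = ["1,100", "2,100", "3,200", "4,100"]
for cited, citings in run_job(sample):
    print(cited + "\t" + citings)
```

Each printed line has the `CITED CITING1,CITING2,…` shape described above.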

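Basic wordcount program

Only the Reducer in the skeleton above needs to change, so that reduce() emits a count for each key instead of a comma-separated list.
- Change the output value type to IntWritable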
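
As a rough illustration of that change, the sketch below shows a counting reducer in plain Python. It is an analogy rather than the Java Reducer itself: `count_reduce` is a hypothetical name, the grouped input is made up, and the integer it returns plays the role of the IntWritable value.

```python
def count_reduce(cited, citings):
    """Count the values for one key instead of concatenating them."""
    count = 0
    for _ in citings:  # values arrive as an iterator, as in reduce()
        count += 1
    return cited, count


# Made-up output of the shuffle: cited patent -> list of citing patents.
grouped = {"100": ["1", "2", "4"], "200": ["3"]}
for cited in sorted(grouped):
    key, total = count_reduce(cited, iter(grouped[cited]))
    print(key + "\t" + str(total))
```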

Hadoop Streaming

Hadoop Streaming interacts with programs through Unix streams: a program reads its input from STDIN and writes its output to STDOUT.

1. Using Streaming with Unix commands

hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.2.jar -input /wttttt/chap4/cite75_99.txt -output /wttttt/chap4/output -mapper 'cut -f 2 -d ,' -reducer 'uniq'
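
To see what this job computes, the two Unix commands can be imitated in a few lines of Python. This is only a local sketch under simplifying assumptions: the function names are hypothetical, the input pairs are made up, and a plain `sorted()` stands in for the sort that Hadoop applies between the map and reduce phases.

```python
from itertools import groupby


def cut_second_field(lines):
    """Rough equivalent of `cut -f 2 -d ,`: keep the second comma-separated field."""
    for line in lines:
        text = line.rstrip("\n")
        fields = text.split(",")
        # Like cut, pass a line through unchanged if it has no delimiter.
        yield fields[1] if len(fields) > 1 else text


def uniq(sorted_lines):
    """Rough equivalent of `uniq`: collapse runs of identical adjacent lines."""
    for key, _ in groupby(sorted_lines):
        yield key


def simulate(lines):
    mapped = cut_second_field(lines)  # mapper
    shuffled = sorted(mapped)         # stand-in for Hadoop's sort
    return list(uniq(shuffled))       # reducer


# Made-up "CITING,CITED" pairs.
print(simulate(["1,100", "2,100", "3,200", "4,100"]))
```

The result is the list of distinct cited patents, which is what the job is expected to write to its output directory.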

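2. Using Streaming with scripts

Any script can be used to process a line-oriented data stream; it reads from STDIN and writes to STDOUT.
In the standard Java model, map() is passed one record at a time, whereas a Streaming program receives the entire data stream.
- Each Mapper receives a stream of all the data in its split
- Each Reducer receives the sorted stream of data assigned to it
An example Mapper is given in Appendix 2.
- Running this example: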
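
For a quick local check before involving a cluster, the mapper's logic can be exercised on in-memory lines. The sketch below is not the Hadoop command: `attribute_max` is a hypothetical function that wraps the same logic as the Appendix 2 script, and the two "splits" are made-up rows whose column 1 is numeric with one missing value.

```python
def attribute_max(lines, index):
    """Same logic as the Appendix 2 mapper, applied to any iterable of lines."""
    max_val = 0
    for line in lines:
        fields = line.strip().split(",")
        # Skip rows where the column is missing or not a plain integer.
        if index < len(fields) and fields[index].isdigit():
            max_val = max(max_val, int(fields[index]))
    return max_val


# Two made-up splits; each mapper sees only its own split.
split_a = ["p1,3,US", "p2,,US", "p3,7,JP"]
split_b = ["p4,5,DE", "p5,2,US"]

local_maxima = [attribute_max(split_a, 1), attribute_max(split_b, 1)]
print(local_maxima)       # [7, 5]
print(max(local_maxima))  # 7
```

Because every mapper emits only the maximum of its own split, a further step is still needed to take the maximum across all mapper outputs, as the last line does here.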

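Using a Combiner to improve performance

MR has bottlenecks:
1. With 1 billion input records, the mappers emit 1 billion key/value pairs, all of which are shuffled across the network;
2. Data skew: if the country field of the patent data set is used as the key, the data is far from uniformly distributed and the vast majority of keys are the United States. Most intermediate key/value pairs then end up at a single reducer and overwhelm it.
To address these bottlenecks, Hadoop extends the MR framework with a Combiner, which aggregates each mapper's output locally before the shuffle.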
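
The effect of a Combiner on shuffle volume can be illustrated with a small simulation. This is a sketch in plain Python, not Hadoop's Combiner API: the function names are hypothetical and the country codes are made-up records chosen to be skewed towards "US".

```python
from collections import Counter


def map_split(records):
    """Mapper: emit (country, 1) for every record in one split."""
    return [(country, 1) for country in records]


def combine(pairs):
    """Combiner: sum the counts per key locally, before anything is shuffled."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def reduce_all(pairs):
    """Reducer: sum whatever counts arrive for each key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Two made-up splits, heavily skewed towards one key.
splits = [["US", "US", "US", "JP"], ["US", "US", "DE", "US"]]

without_combiner = [p for s in splits for p in map_split(s)]
with_combiner = [p for s in splits for p in combine(map_split(s))]

print(len(without_combiner), len(with_combiner))  # 8 4
print(reduce_all(with_combiner) == reduce_all(without_combiner))  # True
```

Half as many pairs cross the "network" here, and the reducer for the hot key receives one partial sum per split instead of one pair per record. The result is unchanged only because summing is associative and commutative; an operation such as a mean cannot be combined this way without carrying extra state (for example a sum and a count).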

Code appendix

Appendix 1

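import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJob extends Configured implements Tool {

    // Mapper as a static inner class
    public static class MapClass extends MapReduceBase
        implements Mapper<Text, Text, Text, Text> {

        public void map(Text key, Text value,
                        OutputCollector<Text, Text> output,
                        Reporter reporter) throws IOException {
            // Swap key and value: (CITING, CITED) becomes (CITED, CITING)
            output.collect(value, key);
        }
    }

    // Reducer as a static inner class
    public static class Reduce extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

        public void reduce(Text key, Iterator<Text> values,
                           OutputCollector<Text, Text> output,
                           Reporter reporter) throws IOException {
            // Join every CITING patent that shares this CITED key
            StringBuilder csv = new StringBuilder();
            while (values.hasNext()) {
                if (csv.length() > 0) csv.append(",");
                csv.append(values.next().toString());
            }
            output.collect(key, new Text(csv.toString()));
        }
    }

    // Driver: instantiate and configure the job, then submit it
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        JobConf job = new JobConf(conf, MyJob.class);

        Path in = new Path(args[0]);
        Path out = new Path(args[1]);
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);

        job.setJobName("MyJob");
        job.setMapperClass(MapClass.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormat(KeyValueTextInputFormat.class);
        job.setOutputFormat(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // Split each input line into key and value at the first comma.
        // This is the older property name; newer releases name it
        // mapreduce.input.keyvaluelinerecordreader.key.value.separator.
        job.set("key.value.separator.in.input.line", ",");

        JobClient.runJob(job);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new MyJob(), args);
        System.exit(res);
    }
}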

Appendix 2

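#!/usr/bin/env python
import sys

index = int(sys.argv[1])  # zero-based index of the column to scan
max_val = 0               # renamed from "max" to avoid shadowing the built-in

for line in sys.stdin:    # read every record of this mapper's split from STDIN
    fields = line.strip().split(",")
    # Skip rows where the column is missing or not a plain integer
    if index < len(fields) and fields[index].isdigit():
        val = int(fields[index])
        if val > max_val:
            max_val = val

print(max_val)            # emit the split's maximum once the stream ends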

1 0