Hadoop MapReduce
MapReduce is a programming model for data processing. The model itself is fairly simple, but writing useful programs with it is not.
A Hadoop MapReduce job runs in two phases: a map phase and a reduce phase. Each phase takes key-value pairs as input and produces key-value pairs as output, with the types chosen by the programmer. The programmer also supplies the implementations of the two functions: the map function and the reduce function.
The following uses plain text files as input to illustrate the details:
map function
Input: the key is the byte offset of the start of the line within the file (starting from 0); the value is the line of text at that offset.
Output: key and value types of your choosing.
reduce function
Input: the types must match the output types of the map function.
Output: key and value types of your choosing.
Note: the output types of the map function and the reduce function are usually the same, in which case calling JobConf's setOutputKeyClass() and setOutputValueClass() is enough;
if they differ, the map output types must additionally be set with setMapOutputKeyClass() and setMapOutputValueClass().
MapReduce execution flow:
Hadoop first divides the input data into equal-sized chunks known as input splits (splits for short). Hadoop creates one map task per split, and that task runs the user-defined map function over every record in the split. Hadoop then sorts the map output, groups it by key, and copies it to the node(s) where the reduce tasks run; there the grouped data is merged and passed to the reduce function, and the reduce output is written out.
Note: map output is written to the local disk of the map node, while reduce output is written to HDFS.
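The map → sort/shuffle → reduce flow described above can be previewed with plain Java collections. This is only an illustrative simulation (no Hadoop involved; the class and method names here are invented for this sketch), using the classic word-count job:

```java
import java.util.*;

public class ShuffleSketch {
    // Simulate map + shuffle + reduce for a toy word-count job.
    static TreeMap<String, Integer> wordCount(String[] lines) {
        // map phase: emit a (word, 1) pair for every word in every line
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        for (String line : lines)
            for (String word : line.split(" "))
                mapOutput.add(new AbstractMap.SimpleEntry<>(word, 1));

        // shuffle/sort phase: group values by key; TreeMap keeps keys sorted,
        // mimicking the sorted, grouped input the framework hands each reducer
        TreeMap<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapOutput)
            groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());

        // reduce phase: one call per key, folding that key's grouped values
        TreeMap<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : groups.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            result.put(e.getKey(), sum);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(new String[]{"a b", "b c", "c a"}));
        // prints {a=2, b=2, c=2}
    }
}
```

In a real job the three phases run on different machines and the "grouping" step involves a network copy, but the data flow is the same.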
Example: suppose we have three files stored under /usr/zhy/input in HDFS, and we want to remove the duplicate records.
file1
2012-3-1 a
2012-3-2 b
2012-3-3 c
2012-3-1 a
2012-3-2 b
2012-3-3 c
file2
2012-3-2 b
2012-3-3 c
2012-3-4 d
file3
2012-3-5 a
2012-3-3 c
2012-3-4 d
2012-3-5 a
As shown above, some of the records are duplicates.
The expected output is:
2012-3-1 a
2012-3-2 b
2012-3-3 c
2012-3-4 d
2012-3-5 a
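The dedup strategy used below is to emit each whole line as a key (with an empty value) and let the shuffle collapse duplicate keys. That idea can be previewed with plain Java collections (a simulation only; DedupSketch and its method are invented for illustration, not Hadoop code):

```java
import java.util.*;

public class DedupSketch {
    // Emitting (line, "") in map and just the key in reduce is equivalent
    // to collecting the lines into a sorted set of distinct keys.
    static List<String> dedup(List<String> lines) {
        TreeSet<String> keys = new TreeSet<>();       // shuffle: sorts and groups identical keys
        for (String line : lines)
            if (!line.trim().isEmpty())               // map: skip blank lines
                keys.add(line);                       // map: emit the line itself as the key
        return new ArrayList<>(keys);                 // reduce: output each distinct key once
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
            "2012-3-1 a", "2012-3-2 b", "2012-3-3 c", "2012-3-1 a",
            "2012-3-2 b", "2012-3-3 c", "2012-3-2 b", "2012-3-3 c",
            "2012-3-4 d", "2012-3-5 a", "2012-3-3 c", "2012-3-4 d", "2012-3-5 a");
        for (String line : dedup(input)) System.out.println(line);
        // prints the five distinct records, sorted
    }
}
```

Running this over the combined contents of file1, file2, and file3 yields exactly the five expected lines, because the shuffle delivers each distinct key to the reducer only once per group.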
package com.test;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class RemoveRepeat {

    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
            String line = value.toString();                   // the whole line of text
            if (line != null && !"".equals(line.trim()))      // skip blank lines
                output.collect(new Text(line), new Text("")); // use the line itself as the key, with an empty value
        }
    }

    public static class Reduce extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
            output.collect(key, new Text("")); // emit each distinct key exactly once
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(RemoveRepeat.class);
        conf.setJobName("removerepeat");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class); // the reducer can double as a combiner here
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
Package the class above into remove.jar,
copy it to the execlib directory under the Hadoop installation on the virtual machine,
and run:
zhy@ubuntu:~/Desktop/hadoop-1.0.3$ bin/hadoop jar execlib/remove.jar com.test.RemoveRepeat /usr/zhy/input /usr/zhy/output/remove
12/10/22 00:11:46 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/10/22 00:11:47 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/10/22 00:11:47 WARN snappy.LoadSnappy: Snappy native library not loaded
12/10/22 00:11:47 INFO mapred.FileInputFormat: Total input paths to process : 3
12/10/22 00:11:47 INFO mapred.JobClient: Running job: job_201210212344_0005
12/10/22 00:11:48 INFO mapred.JobClient: map 0% reduce 0%
12/10/22 00:12:02 INFO mapred.JobClient: map 66% reduce 0%
12/10/22 00:12:08 INFO mapred.JobClient: map 100% reduce 0%
12/10/22 00:12:20 INFO mapred.JobClient: map 100% reduce 100%
12/10/22 00:12:25 INFO mapred.JobClient: Job complete: job_201210212344_0005
12/10/22 00:12:25 INFO mapred.JobClient: Counters: 30
12/10/22 00:12:25 INFO mapred.JobClient: Job Counters
12/10/22 00:12:25 INFO mapred.JobClient: Launched reduce tasks=1
12/10/22 00:12:25 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=26395
12/10/22 00:12:25 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/10/22 00:12:25 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/10/22 00:12:25 INFO mapred.JobClient: Launched map tasks=3
12/10/22 00:12:25 INFO mapred.JobClient: Data-local map tasks=3
12/10/22 00:12:25 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=15077
12/10/22 00:12:25 INFO mapred.JobClient: File Input Format Counters
12/10/22 00:12:25 INFO mapred.JobClient: Bytes Read=132
12/10/22 00:12:25 INFO mapred.JobClient: File Output Format Counters
12/10/22 00:12:25 INFO mapred.JobClient: Bytes Written=60
12/10/22 00:12:25 INFO mapred.JobClient: FileSystemCounters
12/10/22 00:12:25 INFO mapred.JobClient: FILE_BYTES_READ=132
12/10/22 00:12:25 INFO mapred.JobClient: HDFS_BYTES_READ=414
12/10/22 00:12:25 INFO mapred.JobClient: FILE_BYTES_WRITTEN=86229
12/10/22 00:12:25 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=60
12/10/22 00:12:25 INFO mapred.JobClient: Map-Reduce Framework
12/10/22 00:12:25 INFO mapred.JobClient: Map output materialized bytes=144
12/10/22 00:12:25 INFO mapred.JobClient: Map input records=12
12/10/22 00:12:25 INFO mapred.JobClient: Reduce shuffle bytes=144
12/10/22 00:12:25 INFO mapred.JobClient: Spilled Records=18
12/10/22 00:12:25 INFO mapred.JobClient: Map output bytes=144
12/10/22 00:12:25 INFO mapred.JobClient: Total committed heap usage (bytes)=497430528
12/10/22 00:12:25 INFO mapred.JobClient: CPU time spent (ms)=3410
12/10/22 00:12:25 INFO mapred.JobClient: Map input bytes=132
12/10/22 00:12:25 INFO mapred.JobClient: SPLIT_RAW_BYTES=282
12/10/22 00:12:25 INFO mapred.JobClient: Combine input records=0
12/10/22 00:12:25 INFO mapred.JobClient: Reduce input records=0
12/10/22 00:12:25 INFO mapred.JobClient: Reduce input groups=5
12/10/22 00:12:25 INFO mapred.JobClient: Combine output records=9
12/10/22 00:12:25 INFO mapred.JobClient: Physical memory (bytes) snapshot=470962176
12/10/22 00:12:25 INFO mapred.JobClient: Reduce output records=5
12/10/22 00:12:25 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1497731072
12/10/22 00:12:25 INFO mapred.JobClient: Map output records=12
View the result:
zhy@ubuntu:~/Desktop/hadoop-1.0.3$ bin/hadoop fs -cat /usr/zhy/output/remove/*
2012-3-1 a
2012-3-2 b
2012-3-3 c
2012-3-4 d
2012-3-5 a
cat: File does not exist: /usr/zhy/output/remove/_logs
(This error is harmless: the job creates a _logs subdirectory alongside the part files, and hadoop fs -cat cannot read it; the actual results live in the part-* files.)