Hadoop Programming Primer – Study Notes 5: Reduce-side Join
Reduce-side join
Account information file accounts.txt
Fields (tab-separated): account ID, first name, last name, account type, date opened
001	John	Allen	Standard	2012-03-15
002	Abigail	Smith	Premium	2004-07-13
003	April	Steven	Standard	2010-12-20
004	Nasser	Hafez	Premium	2001-04-23
Sales records file sales.txt
Fields (tab-separated): buyer's account ID, purchase amount, purchase date
001	35.99	2012-03-15
002	12.49	2004-07-02
004	13.42	2005-12-20
003	499.99	2010-12-20
002	21.99	2006-11-30
001	78.95	2012-04-02
002	93.45	2008-09-10
001	9.99	2012-05-17
Task: group the sales by account and report, for each user, the purchase count and total purchase amount. Output: user name, purchase count, and total amount.
To do this, the two files above must be joined on account ID. The strength of a reduce-side join is that it is simple to implement; the weakness is that all of the data must travel through the shuffle phase to the reducers, so a large dataset adds significant transfer overhead.
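Before looking at the Hadoop version, the idea can be sketched in plain Java (the class name ReduceSideJoinSketch and the in-memory maps are illustrative, not part of the article's job): each record is tagged with its source file, tagged values are grouped by account ID (standing in for the shuffle), and one pass per group combines both record types, exactly as the reducer below will.

```java
import java.util.*;

// Minimal in-memory sketch of a reduce-side join (hypothetical helper class,
// not part of the Hadoop job): tag records by source, group by account ID,
// then combine each group.
public class ReduceSideJoinSketch {

    // Returns name -> {purchase count, purchase total} for each account.
    public static Map<String, double[]> join(String[] accounts, String[] sales) {
        // "Map" + "shuffle": tag each record with its source and group by account ID.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : sales) {
            String[] p = line.split("\t");
            grouped.computeIfAbsent(p[0], k -> new ArrayList<>()).add("sales\t" + p[1]);
        }
        for (String line : accounts) {
            String[] p = line.split("\t");
            grouped.computeIfAbsent(p[0], k -> new ArrayList<>()).add("accounts\t" + p[1]);
        }
        // "Reduce": one pass per account ID, combining both record types.
        Map<String, double[]> result = new LinkedHashMap<>();
        for (List<String> values : grouped.values()) {
            String name = "";
            int count = 0;
            double total = 0.0;
            for (String v : values) {
                String[] p = v.split("\t");
                if (p[0].equals("sales")) { count++; total += Double.parseDouble(p[1]); }
                else if (p[0].equals("accounts")) name = p[1];
            }
            result.put(name, new double[]{count, total});
        }
        return result;
    }

    public static void main(String[] args) {
        String[] accounts = {
            "001\tJohn\tAllen\tStandard\t2012-03-15",
            "002\tAbigail\tSmith\tPremium\t2004-07-13",
            "003\tApril\tSteven\tStandard\t2010-12-20",
            "004\tNasser\tHafez\tPremium\t2001-04-23" };
        String[] sales = {
            "001\t35.99\t2012-03-15", "002\t12.49\t2004-07-02",
            "004\t13.42\t2005-12-20", "003\t499.99\t2010-12-20",
            "002\t21.99\t2006-11-30", "001\t78.95\t2012-04-02",
            "002\t93.45\t2008-09-10", "001\t9.99\t2012-05-17" };
        for (Map.Entry<String, double[]> e : join(accounts, sales).entrySet())
            System.out.printf("%s\t%d\t%.2f%n", e.getKey(), (int) e.getValue()[0], e.getValue()[1]);
    }
}
```

In Hadoop the grouping step is performed by the framework's shuffle and sort across the cluster; the sketch only illustrates the data flow.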
Implementing a reduce-side join with MultipleInputs
Define two mappers: SalesRecordMapper processes sales.txt and AccountRecordMapper processes accounts.txt. Both emit the account ID as the key.
SalesRecordMapper output: <account ID, "sales\t<purchase amount>"> (the purchase count is tallied later, in the reducer)
AccountRecordMapper output: <account ID, "accounts\t<first name>">
ReduceJoinReducer then joins the two record types on account ID; see the code.
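To make the tagging concrete, the snippet below (a standalone demo; the class TagDemo is illustrative, not part of the job) mirrors what each mapper emits for one input line. Note that because only parts[1] is kept, the sales date, last name, account type, and open date are all dropped before the shuffle.

```java
// Standalone demo of the key/value pairs the two mappers emit
// (hypothetical class, mirroring the mapper logic in ReduceJoin.java).
public class TagDemo {
    // What SalesRecordMapper emits for one line of sales.txt.
    static String[] tagSales(String line) {
        String[] parts = line.split("\t");
        return new String[]{ parts[0], "sales\t" + parts[1] };
    }
    // What AccountRecordMapper emits for one line of accounts.txt.
    static String[] tagAccount(String line) {
        String[] parts = line.split("\t");
        return new String[]{ parts[0], "accounts\t" + parts[1] };
    }
    public static void main(String[] args) {
        String[] kv1 = tagSales("001\t35.99\t2012-03-15");
        String[] kv2 = tagAccount("001\tJohn\tAllen\tStandard\t2012-03-15");
        System.out.println(kv1[0] + " -> " + kv1[1]); // 001 -> sales	35.99
        System.out.println(kv2[0] + " -> " + kv2[1]); // 001 -> accounts	John
    }
}
```

The "sales"/"accounts" prefix is what lets the reducer tell the two record types apart after they are mixed together in one value list.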
ReduceJoin.java
import java.io.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;

public class ReduceJoin {

    // Emits <account ID, "sales\t<amount>"> for each sales record.
    public static class SalesRecordMapper extends Mapper<Object, Text, Text, Text> {
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t");
            context.write(new Text(parts[0]), new Text("sales\t" + parts[1]));
        }
    }

    // Emits <account ID, "accounts\t<name>"> for each account record.
    public static class AccountRecordMapper extends Mapper<Object, Text, Text, Text> {
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t");
            context.write(new Text(parts[0]), new Text("accounts\t" + parts[1]));
        }
    }

    // Joins the two record types on account ID and aggregates the sales.
    public static class ReduceJoinReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String name = "";
            double total = 0.0;
            int count = 0;
            for (Text t : values) {
                String[] parts = t.toString().split("\t");
                if (parts[0].equals("sales")) {
                    count++;
                    total += Float.parseFloat(parts[1]);
                } else if (parts[0].equals("accounts")) {
                    name = parts[1];
                }
            }
            String str = String.format("%d\t%f", count, total);
            context.write(new Text(name), new Text(str));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "Reduce-side join");
        job.setJarByClass(ReduceJoin.class);
        job.setReducerClass(ReduceJoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // Route each input path to its own mapper class.
        MultipleInputs.addInputPath(job, new Path(args[0]),
                TextInputFormat.class, SalesRecordMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]),
                TextInputFormat.class, AccountRecordMapper.class);
        Path outputPath = new Path(args[2]);
        FileOutputFormat.setOutputPath(job, outputPath);
        outputPath.getFileSystem(conf).delete(outputPath, true);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Running the test
hadoop dfs -mkdir sales
hadoop dfs -put sales.txt sales/sales.txt
hadoop dfs -mkdir accounts
hadoop dfs -put accounts.txt accounts/accounts.txt
hadoop jar ReduceJoin.jar ReduceJoin sales accounts outputs

[hadoop@cld-srv-01 ch05]$ hadoop dfs -cat outputs/part-r-00000
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
15/12/08 12:25:17 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
John	3	124.929998
Abigail	3	127.929996
April	1	499.989990
Nasser	1	13.420000
[hadoop@cld-srv-01 ch05]$
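The totals end in .929998 rather than .93 because the reducer parses each amount with Float.parseFloat: single-precision floats cannot represent values like 35.99 exactly, and the tiny errors accumulate. A standalone sketch (the class FloatPrecisionDemo is illustrative, not part of the job) reproduces the effect with John's three purchases and shows that parsing with Double.parseDouble would print the expected value:

```java
// Demonstrates why the job output shows 124.929998 instead of 124.930000:
// Float.parseFloat loses precision that %f then makes visible.
public class FloatPrecisionDemo {
    public static void main(String[] args) {
        String[] amounts = {"35.99", "78.95", "9.99"}; // John's three purchases
        double viaFloat = 0.0, viaDouble = 0.0;
        for (String a : amounts) {
            viaFloat += Float.parseFloat(a);    // what the reducer above does
            viaDouble += Double.parseDouble(a); // higher-precision alternative
        }
        System.out.println(String.format("%f", viaFloat));  // 124.929998, as in the job output
        System.out.println(String.format("%f", viaDouble)); // 124.930000
    }
}
```

Switching the reducer to Double.parseDouble (or formatting with "%.2f") would make the printed totals match the two-decimal input amounts.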