Hadoop Programming Primer - Study Notes 5: Reduce-side Join


Reduce-side join

Account information file: accounts.txt

Fields (tab-delimited): account ID, name, type, account-opening date. (In this sample the first and last names are themselves separated by a tab, which is why only the first name shows up in the final output.)

001	John	Allen	Standard	2012-03-15
002	Abigail	Smith	Premium	2004-07-13
003	April	Steven	Standard	2010-12-20
004	Nasser	Hafez	Premium	2001-04-23

Sales record file: sales.txt

Fields (tab-delimited): buyer's account ID, purchase amount, purchase date

001	35.99	2012-03-15
002	12.49	2004-07-02
004	13.42	2005-12-20
003	499.99	2010-12-20
002	21.99	2006-11-30
001	78.95	2012-04-02
002	93.45	2008-09-10
001	9.99	2012-05-17


The task: for each account, count the user's purchases and sum the purchase amounts, then output the user name, the number of purchases, and the total amount.

To do this, the two files above have to be joined on the account ID. The advantage of a reduce-side join is that it is simple to implement; the drawback is that all of the input data has to travel through the shuffle phase to reach the reducers, so the transfer overhead grows with the data volume.

Implementing a reduce-side join with MultipleInputs

Two mappers are defined: SalesRecordMapper handles sales.txt and AccountRecordMapper handles accounts.txt. Both mappers emit the account ID as their output key.

SalesRecordMapper output: <account ID, "sales\t" + purchase amount>

AccountRecordMapper output: <account ID, "accounts\t" + account holder's name>

ReduceJoinReducer then joins the two kinds of records on the account ID; see the code below.
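For example, with the sample data above, everything for account 001 arrives at a single reduce() call grouped under the key 001 (the order of the values within a key is not guaranteed), roughly:

001 -> [ "accounts\tJohn", "sales\t35.99", "sales\t78.95", "sales\t9.99" ]

The reducer counts the "sales" values, sums their amounts, and takes the user name from the "accounts" value.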

ReduceJoin.java

import java.io.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;

public class ReduceJoin {

    // Reads sales.txt and emits <account ID, "sales\t<purchase amount>">
    public static class SalesRecordMapper extends Mapper<Object, Text, Text, Text> {
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String record = value.toString();
            String[] parts = record.split("\t");
            context.write(new Text(parts[0]), new Text("sales\t" + parts[1]));
        }
    }

    // Reads accounts.txt and emits <account ID, "accounts\t<name>">
    public static class AccountRecordMapper extends Mapper<Object, Text, Text, Text> {
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String record = value.toString();
            String[] parts = record.split("\t");
            context.write(new Text(parts[0]), new Text("accounts\t" + parts[1]));
        }
    }

    // Joins the two record types on the account ID: counts and sums the
    // "sales" values, and takes the user name from the "accounts" value.
    public static class ReduceJoinReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String name = "";
            double total = 0.0;
            int count = 0;

            for (Text t : values) {
                String[] parts = t.toString().split("\t");
                if (parts[0].equals("sales")) {
                    count++;
                    total += Float.parseFloat(parts[1]);
                } else if (parts[0].equals("accounts")) {
                    name = parts[1];
                }
            }
            String str = String.format("%d\t%f", count, total);
            context.write(new Text(name), new Text(str));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "Reduce-side join");
        job.setJarByClass(ReduceJoin.class);
        job.setReducerClass(ReduceJoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Each input path is bound to its own mapper class
        MultipleInputs.addInputPath(job, new Path(args[0]),
            TextInputFormat.class, SalesRecordMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]),
            TextInputFormat.class, AccountRecordMapper.class);

        Path outputPath = new Path(args[2]);
        FileOutputFormat.setOutputPath(job, outputPath);
        // Remove any previous output so the job can be rerun
        outputPath.getFileSystem(conf).delete(outputPath, true);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
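A side note on the totals in the run below: values such as 124.929998 (rather than 124.93) come from accumulating the amounts with Float.parseFloat, so each amount is first rounded to float precision and the rounding error is carried into the double sum. A minimal standalone sketch (not part of the job, using account 001's amounts) reproduces the effect:

public class FloatPrecisionDemo {
    public static void main(String[] args) {
        double total = 0.0;
        // account 001's purchase amounts from sales.txt
        for (String amount : new String[] { "35.99", "78.95", "9.99" }) {
            total += Float.parseFloat(amount);  // float rounding error survives the widening to double
        }
        System.out.printf("%d\t%f%n", 3, total);  // prints: 3	124.929998
    }
}

Parsing with Double.parseDouble instead would print 124.930000; the job code above keeps the original Float.parseFloat behaviour.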

Run and test

hadoop dfs -mkdir sales
hadoop dfs -put sales.txt sales/sales.txt
hadoop dfs -mkdir accounts
hadoop dfs -put accounts.txt accounts/accounts.txt
hadoop jar ReduceJoin.jar ReduceJoin sales accounts outputs
hadoop dfs -cat outputs/part-r-00000

[hadoop@cld-srv-01 ch05]$ hadoop dfs -cat outputs/part-r-00000
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

15/12/08 12:25:17 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
John	3	124.929998
Abigail	3	127.929996
April	1	499.989990
Nasser	1	13.420000
[hadoop@cld-srv-01 ch05]$
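The DEPRECATED warning only concerns the old hadoop dfs wrapper script; on Hadoop 2.x the same file operations can be issued with hdfs dfs (a sketch, assuming the same local files and HDFS paths as above):

hdfs dfs -mkdir -p sales accounts
hdfs dfs -put sales.txt sales/sales.txt
hdfs dfs -put accounts.txt accounts/accounts.txt
hadoop jar ReduceJoin.jar ReduceJoin sales accounts outputs
hdfs dfs -cat outputs/part-r-00000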

