A Left Outer Join Solution in MapReduce/Hadoop


The problem:

Suppose there are two kinds of data: users and transactions. The user data contains each user's location, and the transaction data carries the user's identity but no direct location information. The users and transactions data are given as:

users(user_id,location_id)

transactions(transaction_id,product_id,user_id,quantity,amount)

Our goal is to compute, for each product, the number of distinct user locations.


Input:

users.txt

u1 UT
u2 GA
u3 CA
u4 CA
u5 GA



transactions.txt

t1 p3 u1 3 330
t2 p1 u2 1 440
t3 p1 u1 3 600
t4 p2 u2 10 1000
t5 p4 u4 9 90
t6 p1 u1 4 120
t7 p4 u1 8 160
t8 p4 u5 2 40




How do we perform the left outer join?

The left outer join is performed on user_id.

Some example SQL queries related to this left outer join:


Query 1: select * from transactions left outer join users on transactions.user_id = users.user_id

Result:

t1 p3 u1 3 330 u1 UT
t2 p1 u2 1 440 u2 GA
t3 p1 u1 3 600 u1 UT
t4 p2 u2 10 1000 u2 GA
t5 p4 u4 9 90 u4 CA
t6 p1 u1 4 120 u1 UT
t7 p4 u1 8 160 u1 UT
t8 p4 u5 2 40 u5 GA


Query 2: select product_id, location_id from transactions left outer join users on transactions.user_id = users.user_id

Result:

p3 UT
p1 GA
p1 UT
p2 GA
p4 CA
p1 UT
p4 UT
p4 GA


Query 3: select product_id, count(location_id) from transactions left outer join users on transactions.user_id = users.user_id group by product_id

p1 3
p2 1
p3 1
p4 3


Query 4: select product_id, count(distinct location_id) from transactions left outer join users on transactions.user_id = users.user_id group by product_id

p1 2
p2 1
p3 1
p4 3



The MapReduce left outer join implementation:

Phase 1: find every product sold and its associated locations, similar to SQL Query 2.

Phase 2: find every product sold and the number of distinct locations associated with it.



Phase 1: find every product sold and its associated locations (similar to SQL Query 2)

This phase completes the left outer join with a single MapReduce job. It uses two mappers (one for the users data, one for the transactions data), and the job's reducer emits key-value pairs whose key is product_id and whose value is location_id.
A question arises: don't two mappers imply two separate jobs with two reducers? If the two mappers (and the two reducers) could not run concurrently, most of the parallelism we want from a big-data job would be lost. So can multiple mappers be used within one job?
Yes: the MultipleInputs class lets a single job use multiple mappers.
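In the driver (step 6 below), each input path is bound to its own mapper class with MultipleInputs; the two calls, extracted from that driver, are:

MultipleInputs.addInputPath(job, transactions, TextInputFormat.class, LeftJoinTransactionMapper.class);
MultipleInputs.addInputPath(job, users, TextInputFormat.class, LeftJoinUserMapper.class);

Here transactions and users are the Path objects pointing to transactions.txt and users.txt.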

First, a first version of the two mappers:
The transaction map() reads (transaction_id, product_id, user_id, quantity, amount) and emits the key-value pair (user_id, product_id).
The user map() reads (user_id, location_id) and emits the key-value pair (user_id, location_id).
Another question: when both mappers feed the same reducer, how does the reducer tell (user_id, product_id) apart from (user_id, location_id), i.e. whether a value is a product_id or a location_id?


Solution: attach a tag to both the key and the value to tell them apart. For the same user_id, the transaction map() tags the key with "2" and the user map() tags it with "1"; location_id values are tagged with "L" and product_id values with "P".
The two mappers, version 2, are therefore:
The transaction map() reads (transaction_id, product_id, user_id, quantity, amount) and emits the key-value pair (Pair(user_id, "2"), Pair("P", product_id)).
The user map() reads (user_id, location_id) and emits the key-value pair (Pair(user_id, "1"), Pair("L", location_id)).


The next question: now that the key-value pairs have this custom format, Hadoop's default partitioner and sort order no longer partition and sort them the way we need.
Solution: a custom partitioner and custom comparators.
How should records be partitioned? By user_id — that is, by the left element (user_id) of the keys Pair(user_id, "2") and Pair(user_id, "1") in the two kinds of pairs (Pair(user_id, "2"), Pair("P", product_id)) and (Pair(user_id, "1"), Pair("L", location_id)).
How should keys be sorted? As in a secondary sort: first by user_id, and for equal user_ids by the tag, "1" before "2", so that a user's location always reaches the reducer before that user's products (see the comparator sketch below).
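In the driver below, this ordering is obtained simply by plugging in PairOfStrings.Comparator as the sort comparator, which orders pairs by their left element and then their right element. For readers who prefer to see the rule spelled out, an equivalent hand-written sort comparator would look roughly like this sketch (SecondarySortComparator is a hypothetical class name, not part of the code later in this post):

package LeftOutJoin_hadoop;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import edu.umd.cloud9.io.pair.PairOfStrings;

// Sort by user_id first, then by the "1"/"2" tag, so that for each user the
// location record ("1") reaches the reducer before the product records ("2").
public class SecondarySortComparator extends WritableComparator {

    protected SecondarySortComparator() {
        super(PairOfStrings.class, true);  // createInstances = true so keys are deserialized
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        PairOfStrings first = (PairOfStrings) a;
        PairOfStrings second = (PairOfStrings) b;
        int cmp = first.getLeftElement().compareTo(second.getLeftElement());   // user_id
        if (cmp != 0) {
            return cmp;
        }
        return first.getRightElement().compareTo(second.getRightElement());    // "1" before "2"
    }
}

It would be registered with job.setSortComparatorClass(SecondarySortComparator.class) in place of PairOfStrings.Comparator.class.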

Phase 1 therefore produces a collection of (product_id, location_id) key-value pairs.



Phase 2: count the distinct locations
This phase is straightforward: it takes the output of phase 1 as its input; map() emits product_id as the key and location_id as the value, so all pairs with the same product_id are routed to the same reduce() call; the reducer drops duplicate values and counts what remains (a Set handles both the de-duplication and the counting).
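With the sample data, for instance, the reduce() call for p1 receives the locations (UT, UT, GA) and writes 2, while the call for p4 receives (UT, CA, GA) and writes 3.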






Code implementation:

1. LeftJoinUserMapper, the user mapper class
package LeftOutJoin_hadoop;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.commons.lang.StringUtils;
import edu.umd.cloud9.io.pair.PairOfStrings;

public class LeftJoinUserMapper
    extends Mapper<LongWritable, Text, PairOfStrings, PairOfStrings> {

   PairOfStrings outputKey = new PairOfStrings();
   PairOfStrings outputValue = new PairOfStrings();

   public void map(LongWritable key, Text value, Context context)
       throws java.io.IOException, InterruptedException {
      String[] tokens = StringUtils.split(value.toString(), "\t");
      if (tokens.length == 2) {
         // tokens[0] = user_id
         // tokens[1] = location_id
         // to make sure location arrives before products
         outputKey.set(tokens[0], "1");    // set user_id
         outputValue.set("L", tokens[1]);  // set location_id
         context.write(outputKey, outputValue);
      }
   }
}


2. LeftJoinTransactionMapper, the transaction mapper class

package LeftOutJoin_hadoop;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.commons.lang.StringUtils;
import edu.umd.cloud9.io.pair.PairOfStrings;

public class LeftJoinTransactionMapper
    extends Mapper<LongWritable, Text, PairOfStrings, PairOfStrings> {

   PairOfStrings outputKey = new PairOfStrings();
   PairOfStrings outputValue = new PairOfStrings();

   /**
    * @param key: system generated, ignored here
    * @param value: <transaction_id><TAB><product_id><TAB><user_id><TAB><quantity><TAB><amount>
    */
   @Override
   public void map(LongWritable key, Text value, Context context)
       throws java.io.IOException, InterruptedException {
      String[] tokens = StringUtils.split(value.toString(), "\t");
      System.out.println("tokens size:" + tokens.length);
      String productID = tokens[1];
      String userID = tokens[2];
      // make sure products arrive at a reducer after location
      outputKey.set(userID, "2");
      outputValue.set("P", productID);
      context.write(outputKey, outputValue);
   }
}


3. SecondarySortPartitioner, the partitioner class

package LeftOutJoin_hadoop;

import org.apache.hadoop.mapreduce.Partitioner;
import edu.umd.cloud9.io.pair.PairOfStrings;

public class SecondarySortPartitioner
    extends Partitioner<PairOfStrings, Object> {

    @Override
    public int getPartition(PairOfStrings key,
                            Object value,
                            int numberOfPartitions) {
       // partition only by the natural key (user_id, the left element);
       // masking with Integer.MAX_VALUE keeps the hash non-negative
       return (key.getLeftElement().hashCode() & Integer.MAX_VALUE) % numberOfPartitions;
    }
}

4. SecondarySortGroupComparator, the grouping comparator

package LeftOutJoin_hadoop;

import org.apache.hadoop.io.RawComparator;
import org.apache.hadoop.io.DataInputBuffer;
import edu.umd.cloud9.io.pair.PairOfStrings;

public class SecondarySortGroupComparator
    implements RawComparator<PairOfStrings> {

    /**
     * Group only by userID
     */
    @Override
    public int compare(PairOfStrings first, PairOfStrings second) {
       return first.getLeftElement().compareTo(second.getLeftElement());
    }

    /**
     * Byte-level compare used by the framework: deserialize both keys
     * and delegate to the object compare above.
     */
    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
       DataInputBuffer buffer = new DataInputBuffer();
       PairOfStrings a = new PairOfStrings();
       PairOfStrings b = new PairOfStrings();
       try {
          buffer.reset(b1, s1, l1);
          a.readFields(buffer);
          buffer.reset(b2, s2, l2);
          b.readFields(buffer);
          return compare(a, b);
       }
       catch (Exception ex) {
          return -1;
       }
    }
}

5. LeftJoinReducer, the reducer

package LeftOutJoin_hadoop;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import edu.umd.cloud9.io.pair.PairOfStrings;
import java.util.Iterator;

public class LeftJoinReducer
    extends Reducer<PairOfStrings, PairOfStrings, Text, Text> {

   Text productID = new Text();
   // default location, used when a user_id has no matching user record
   // (this is what gives the join its left outer semantics)
   Text locationID = new Text("undefined");

   @Override
   public void reduce(PairOfStrings key, Iterable<PairOfStrings> values, Context context)
       throws java.io.IOException, InterruptedException {
      System.out.println("key=" + key);
      Iterator<PairOfStrings> iterator = values.iterator();
      System.out.println("values");
      if (iterator.hasNext()) {
         // firstPair must be the location pair
         PairOfStrings firstPair = iterator.next();
         System.out.println("firstPair=" + firstPair.toString());
         if (firstPair.getLeftElement().equals("L")) {
            locationID.set(firstPair.getRightElement());
         }
      }

      while (iterator.hasNext()) {
         // the remaining elements must be product pairs
         PairOfStrings productPair = iterator.next();
         System.out.println("productPair=" + productPair.toString());
         productID.set(productPair.getRightElement());
         context.write(productID, locationID);
      }
   }
}

6. LeftJoinDriver, the main driver

package LeftOutJoin_hadoop;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import edu.umd.cloud9.io.pair.PairOfStrings;

public class LeftJoinDriver {

  public static void main(String[] args) throws Exception {
      Path transactions = new Path("input/transactions.txt");  // input
      Path users = new Path("input/users.txt");                // input
      Path output = new Path("output/1");                      // output

      Configuration conf = new Configuration();
      Job job = new Job(conf);
      job.setJarByClass(LeftJoinDriver.class);
      job.setJobName("Phase-1: Left Outer Join");

      // "secondary sort" is handled by setting the following 3 plug-ins:
      // 1. how the mapper-generated keys will be partitioned
      job.setPartitionerClass(SecondarySortPartitioner.class);
      // 2. how the natural keys (generated by mappers) will be grouped
      job.setGroupingComparatorClass(SecondarySortGroupComparator.class);
      // 3. how PairOfStrings will be sorted
      job.setSortComparatorClass(PairOfStrings.Comparator.class);

      job.setReducerClass(LeftJoinReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(Text.class);
      job.setOutputFormatClass(TextOutputFormat.class);

      // define multiple mappers: one for users and one for transactions
      MultipleInputs.addInputPath(job, transactions, TextInputFormat.class, LeftJoinTransactionMapper.class);
      MultipleInputs.addInputPath(job, users, TextInputFormat.class, LeftJoinUserMapper.class);

      job.setMapOutputKeyClass(PairOfStrings.class);
      job.setMapOutputValueClass(PairOfStrings.class);
      FileOutputFormat.setOutputPath(job, output);

      if (job.waitForCompletion(true)) {
         return;
      }
      else {
         throw new Exception("Phase-1: Left Outer Join Job Failed");
      }
  }
}

Output of phase 1:

p4 UT
p1 UT
p1 UT
p3 UT
p2 GA
p1 GA
p4 CA
p4 GA


Phase 2 code:


package LeftOutJoin_hadoop;

import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import java.io.IOException;

public class LocationCountMapper
    extends Mapper<LongWritable, Text, Text, Text> {

   private Text outputKey = new Text();
   private Text outputValue = new Text();

   @Override
   public void map(LongWritable key, Text value, Context context)
       throws IOException, InterruptedException {
      System.out.println(value);
      String[] tokens = value.toString().split("\t");
      System.out.println("tokens:" + tokens.length);
      outputKey.set(tokens[0]);
      outputValue.set(tokens[1]);
      context.write(outputKey, outputValue);
   }
}


package LeftOutJoin_hadoop;

import java.io.IOException;
import java.util.Set;
import java.util.HashSet;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class LocationCountReducer
    extends Reducer<Text, Text, Text, LongWritable> {

    @Override
    public void reduce(Text productID, Iterable<Text> locations, Context context)
        throws IOException, InterruptedException {
        // collect the distinct locations for this product
        Set<String> set = new HashSet<String>();
        for (Text location : locations) {
           set.add(location.toString());
        }
        // emit the number of distinct locations
        context.write(productID, new LongWritable(set.size()));
    }
}

package LeftOutJoin_hadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LocationCountDriver {

   public static void main(String[] args) throws Exception {
      Path input = new Path("output/1/");
      Path output = new Path("output/2");

      Configuration conf = new Configuration();
      Job job = new Job(conf);
      job.setJarByClass(LocationCountDriver.class);
      job.setJobName("Phase-2: LocationCountDriver");

      FileInputFormat.addInputPath(job, input);
      job.setInputFormatClass(TextInputFormat.class);

      job.setMapperClass(LocationCountMapper.class);
      job.setReducerClass(LocationCountReducer.class);
      job.setMapOutputKeyClass(Text.class);
      job.setMapOutputValueClass(Text.class);
      job.setOutputFormatClass(TextOutputFormat.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(LongWritable.class);

      FileOutputFormat.setOutputPath(job, output);
      if (job.waitForCompletion(true)) {
         return;
      }
      else {
         throw new Exception("LocationCountDriver Failed");
      }
   }
}
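To run the two phases in order, something like the following would work (a sketch; leftjoin.jar is just a placeholder for whatever jar you package these classes into):

hadoop jar leftjoin.jar LeftOutJoin_hadoop.LeftJoinDriver
hadoop jar leftjoin.jar LeftOutJoin_hadoop.LocationCountDriver

Both drivers hard-code their input and output paths (input/, output/1, output/2), so no command-line arguments are needed.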


Output of phase 2:

p1 2
p2 1
p3 1
p4 3
