Map-side join in MapReduce

When a large table is joined with a small table, a map-side join is a good choice. This approach runs entirely in the map phase; no reduce phase is needed.

Applicable scenarios:
1 - The small table is small enough to be held in memory without overflowing the JVM heap;
2 - The join is an inner join, or a left outer join with the large table on the left.

Principle:
Create a HashMap in the mapper class, load the small table's file into the HashMap in setup(), and then join it against each value passed to map() (a record from the large file).
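In outline, the mapper follows the pattern sketched below (the class name and field names here are illustrative; the complete, runnable job is given in the example that follows):

import java.io.IOException;
import java.util.HashMap;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Skeleton of the map-side join pattern (illustrative only; see the full example below).
public class JoinMapperSketch extends Mapper<Object, Text, Text, Text> {
    private final HashMap<String, String> smallTable = new HashMap<String, String>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Read the small table here (e.g. from the distributed cache) and fill smallTable:
        // key = the join column, value = the remaining columns.
    }

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(":::");
        String matched = smallTable.get(fields[1]);   // probe the in-memory table
        if (matched != null) {                        // inner join: emit only matching rows
            context.write(new Text(fields[0]), new Text(matched));
        }
    }
}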



Example:

The two input data sets are as follows:
-------------------------------------
author:     mapping of author id to author name
-------------------------------------
s201002017:::R. R. Thomys
s201002023:::Klaus R. Dittrich
s201002024:::Wolfgang Gentzsch
s201002025:::Rainer König
s201002018:::Georg Walch
s201002019:::Hans J. Becker
s201002020:::Hagen Vogel
s201002011:::Jan-Peter Hazebrouck
s201002012:::Herbert Löthe
s201002015:::Matthias Rinschede
s201002016:::Heiner Fuhrmann
s201002021:::Norbert Braun
s201002022:::H. Henseler
s201002026:::Richard Vahrenkamp
s201002013:::Roman Winkler
s201002027:::Niels Grabe
s201002014:::Marianne Winslett

----------------------------------------
book:     book title and author id
----------------------------------------
<linux study>:::s201002017
<linux study>:::s201002023
<linux study>:::s201002024
<hadoop study>:::s201002024
<hadoop study>:::s201002023
<English second publish>:::s201002025
<data structure>:::s201002018
<hbase study>:::s201002019
<hbase study>:::s201002020
<hive study>:::s201002016
<zookeeper>:::s201002030
<zookeeper>:::s201002038
<java>:::s201002040
<factory>:::s201002041
<deep study in python>:::s201002020
<how to learn shell>:::s201002033
<J2EE learn>:::s201002030
<made in china>:::s201002039

The field delimiter is ":::", and the goal is an inner join.
The code is as follows:
package com.inspur.mapreduce.join;

/*************************************
 * @author: caolch
 * @date: 2013-12-31
 * @note: join implemented entirely in the mapper; the small table is read into memory
 *************************************/

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MapJoin extends Configured implements Tool {

    public static class myMapper extends Mapper<Object, Text, Text, Text> {

        private HashMap<String, String> authorMap = new HashMap<String, String>();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] tokens = value.toString().split(":::");
            if (tokens.length < 2) {
                return; // skip blank or malformed lines
            }
            // look up the author id of this book in the in-memory author table
            String joinData = authorMap.get(tokens[1]);
            if (joinData != null) {
                context.write(new Text(tokens[0]), new Text(joinData));
            }
        }

        // setup() runs before map(); load the small (author) table into memory here
        @Override
        public void setup(Context context) throws IOException, InterruptedException {
            // get the local paths of the files placed in the distributed cache
            Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            // read the cached author file into the in-memory HashMap
            if (cacheFiles != null && cacheFiles.length > 0) {
                String line;
                String[] tokens;
                for (Path path : cacheFiles) {
                    if (path.toString().contains("author")) {
                        BufferedReader br = new BufferedReader(new FileReader(path.toString()));
                        try {
                            while ((line = br.readLine()) != null) {
                                tokens = line.split(":::", 2);
                                authorMap.put(tokens[0], tokens[1]);
                            }
                        } finally {
                            br.close();
                        }
                    }
                }
            }
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        Job job = new Job(conf, "MapJoin");
        job.setJarByClass(MapJoin.class);
        job.setMapperClass(myMapper.class);
        job.setNumReduceTasks(0); // map-only job: no reduce phase

        /* add the small-table file(s) to the distributed cache */
        Path cachefilePath = new Path(args[0]);
        FileSystem hdfs = FileSystem.get(conf);
        FileStatus fileStatus = hdfs.getFileStatus(cachefilePath);
        // the first argument may be a single file or a directory
        if (fileStatus.isDir() == false) {
            // a single file: add it to the cache directly
            DistributedCache.addCacheFile(cachefilePath.toUri(), job.getConfiguration());
        }
        if (fileStatus.isDir() == true) {
            // a directory: list its contents and add every file to the cache
            for (FileStatus fs : hdfs.listStatus(cachefilePath)) {
                DistributedCache.addCacheFile(fs.getPath().toUri(), job.getConfiguration());
            }
        }

        Path in = new Path(args[1]);
        Path out = new Path(args[2]);
        // set the input/output paths and formats
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new MapJoin(), args);
        System.exit(res);
    }
}
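
To run the job, package the class into a jar and pass three arguments: the path of the small author table (the file or directory to be placed in the distributed cache), the input path of the large book table, and the output path. A typical invocation looks like the following (the jar name and HDFS paths are placeholders, not prescribed by the article):

hadoop jar mapjoin.jar com.inspur.mapreduce.join.MapJoin /input/author /input/book /output

Because the job is map-only, the joined records are written straight to the output directory as tab-separated key/value pairs (the TextOutputFormat default). With the sample data above, only books whose author id exists in the author table are emitted, for example:

<linux study>	R. R. Thomys
<hadoop study>	Wolfgang Gentzsch
<hive study>	Heiner Fuhrmann

Records such as <zookeeper>:::s201002030 are dropped because their author id has no match in the author table, which is exactly the inner-join semantics.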

