Map side join

1. Principle:

Reduce-side join exists because the map phase cannot see all of the join fields for a given key: records sharing the same key may end up in different map tasks. Reduce-side join is also quite inefficient, since the shuffle phase has to move a large amount of data across the network. Map-side join is an optimization for the following scenario: of the two tables being joined, one is very large while the other is small enough to fit entirely in memory. In that case we can ship a copy of the small table to every map task and keep it in memory (for example in a hash table), and then scan only the large table: for each key/value record of the large table, look the key up in the hash table and, if a match exists, emit the joined record. To distribute these copies, Hadoop provides the DistributedCache class, which is used as follows:

(1) Call the static method DistributedCache.addCacheFile() to register the file to be distributed. Its argument is the file's URI (for a file on HDFS, something like hdfs://jobtracker:50030/home/XXX/file). Before the job starts, the JobTracker fetches this URI list and copies the corresponding files to the local disk of every TaskTracker.

(2) In the task, call DistributedCache.getLocalCacheFiles() to obtain the local paths of the cached files, then read them with the standard file I/O API (see the sketch below).
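A minimal sketch of these two steps, assuming the same imports as the full listing in section 3; the path /test/decli/user is simply the small table used later in this article:

// Driver side: register the small table before submitting the job.
Configuration conf = new Configuration();
DistributedCache.addCacheFile(new Path("/test/decli/user").toUri(), conf);

// Task side, e.g. inside Mapper.setup(): read the local copy of each cached file.
Path[] cached = DistributedCache.getLocalCacheFiles(conf);
for (Path p : cached) {
    BufferedReader in = new BufferedReader(new FileReader(p.toString()));
    String line;
    while ((line = in.readLine()) != null) {
        // parse the line and put it into an in-memory map, e.g. uid -> name
    }
    in.close();
}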

2. Environment:

The test files used by this example and their locations on HDFS:

hadoop fs -ls /test/decli
Found 4 items
-rw-r--r--   2 root supergroup        152 2013-03-06 02:05 /test/decli/login
drwxr-xr-x   - root supergroup          0 2013-03-06 02:45 /test/decli/output
-rw-r--r--   2 root supergroup         12 2013-03-06 02:12 /test/decli/sex
-rw-r--r--   2 root supergroup         72 2013-03-06 02:44 /test/decli/user

The contents of the test files:

root@master 192.168.120.236 02:58:03 ~/test/table >
cat login  # login table: check that the uid column is valid, then look up the user's name and gender and count their visits
1       0       20121213 
2       0       20121213 
3       1       20121213 
4       1       20121213 
1       0       20121114 
2       0       20121114 
3       1       20121114 
4       1       20121114 
1       0       20121213 
1       0       20121114
9       0       20121114
root@master 192.168.120.236 02:58:08 ~/test/table >
cat sex # gender table
0       男
1       女
root@master 192.168.120.236 02:58:13 ~/test/table >
cat user # user attribute table
1       张三    hubei 
3       王五    tianjin 
4       赵六    guangzhou 
2       李四    beijing 
root@master 192.168.120.236 02:58:16 ~/test/table >

Hadoop version of the test environment:

echo $HADOOP_HOME
/work/hadoop-0.20.203.0

Enough talk; here is the code:

3. Code:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
 
public class MultiTableJoin extends Configured implements Tool {
    public static class MapClass extends Mapper<LongWritable, Text, Text, Text> {

        // In-memory caches for the data in the sex and user files
        private Map<String, String> userMap = new HashMap<String, String>();
        private Map<String, String> sexMap = new HashMap<String, String>();

        private Text oKey = new Text();
        private Text oValue = new Text();
        private String[] kv;

        // Runs once per map task, before any call to map()
        @Override
        protected void setup(Context context) throws IOException,
                InterruptedException {
            BufferedReader in = null;

            try {
                // Get the files cached for the current job
                Path[] paths = DistributedCache.getLocalCacheFiles(context
                        .getConfiguration());
                String uidNameAddr = null;
                String sidSex = null;
                for (Path path : paths) {
                    if (path.toString().contains("user")) {
                        in = new BufferedReader(new FileReader(path.toString()));
                        while (null != (uidNameAddr = in.readLine())) {
                            userMap.put(uidNameAddr.split("\t", -1)[0],
                                    uidNameAddr.split("\t", -1)[1]);
                        }
                    } else if (path.toString().contains("sex")) {
                        in = new BufferedReader(new FileReader(path.toString()));
                        while (null != (sidSex = in.readLine())) {
                            sexMap.put(sidSex.split("\t", -1)[0], sidSex.split(
                                    "\t", -1)[1]);
                        }
                    }
                }
            } catch (IOException e) {
                e.printStackTrace();
            } finally {
                try {
                    if (in != null) {
                        in.close();
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {

            kv = value.toString().split("\t");
            // map-side join: filter out unwanted records during the map phase
            if (userMap.containsKey(kv[0]) && sexMap.containsKey(kv[1])) {
                oKey.set(userMap.get(kv[0]) + "\t" + sexMap.get(kv[1]));
                oValue.set("1");
                context.write(oKey, oValue);
            }
        }

    }
 
    public static class Reduce extends Reducer<Text, Text, Text, Text> {

        private Text oValue = new Text();

        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            int sumCount = 0;

            for (Text val : values) {
                sumCount += Integer.parseInt(val.toString());
            }
            oValue.set(String.valueOf(sumCount));
            context.write(key, oValue);
        }

    }
 
    public int run(String[] args) throws Exception {
        Job job = new Job(getConf(), "MultiTableJoin");

        job.setJobName("MultiTableJoin");
        job.setJarByClass(MultiTableJoin.class);
        job.setMapperClass(MapClass.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        String[] otherArgs = new GenericOptionsParser(job.getConfiguration(),
                args).getRemainingArgs();

        // Cache the files named by the 1st and 2nd path arguments (the sex and
        // user tables; with the run command below, otherArgs[0] is the class name)
        DistributedCache.addCacheFile(new Path(otherArgs[1]).toUri(), job
                .getConfiguration());
        DistributedCache.addCacheFile(new Path(otherArgs[2]).toUri(), job
                .getConfiguration());

        FileInputFormat.addInputPath(job, new Path(otherArgs[3]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[4]));

        return job.waitForCompletion(true) ? 0 : 1;
    }
 
    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new MultiTableJoin(),
                args);
        System.exit(res);
    }

}

Run command:

hadoop jar MultiTableJoin.jar MultiTableJoin /test/decli/sex /test/decli/user /test/decli/login /test/decli/output

4. Results:

Output of the run:

root@master 192.168.120.236 02:47:18 ~/test/table >
hadoop fs -cat /test/decli/output/*|column -t
cat: File does not exist: /test/decli/output/_logs
张三  男  4
李四  男  2
王五  女  2
赵六  女  2
root@master 192.168.120.236 02:47:26 ~/test/table >

