Multi-Source MapReduce Jobs (Part 1): Reduce-Side Joins


Scenario: join multiple tables, i.e. the MapReduce equivalent of the following SQL:

select customers.*, orders.* from customers
join orders
on customers.id = orders.id


We implement the join with Hadoop's DataJoin contrib package by extending three classes:

1. DataJoinMapperBase
2. DataJoinReducerBase
3. TaggedMapOutput


How it works:

1. On the map side, each input value is wrapped in a TaggedMapOutput, which bundles the data source (tag) with the value.
2. The map output is therefore no longer a plain value but a tagged record: record = data source (tag) + value.
3. combine() on the reduce side receives one combination: records from different data sources that share the same group key.
4. Within a single combine() call, each data source contributes at most one record.

The worked grouping shown after the sample data below makes this concrete.

The example code follows.

Data sources:

Customers.txt

1,wuminggang,13575468248
2,liujiannan,18965235874
3,wangbo,15986854789
4,tom,15698745862

Orders.txt

3,A,99,2013-03-05
1,B,89,2013-02-05
2,C,69,2013-03-09
3,D,56,2013-06-07
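
To make the tagging and grouping concrete, the map output for group keys 3 and 4 looks roughly like this (tags are abbreviated to the bare file names; in practice DataJoinMapperBase derives the tag from the full input file path reported by map.input.file):

key=3  [tag=Customers.txt, value="3,wangbo,15986854789"]
key=3  [tag=Orders.txt,    value="3,A,99,2013-03-05"]
key=3  [tag=Orders.txt,    value="3,D,56,2013-06-07"]
key=4  [tag=Customers.txt, value="4,tom,15698745862"]

All records with the same group key reach the same reducer, where they are regrouped by tag before combine() is called (see the reducer section below).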

Custom class: TaggedWritable

package com.hadoop.data.join;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.contrib.utils.join.TaggedMapOutput;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

/*
 * TaggedMapOutput is an abstract type that bundles a tag with the record data.
 * It is used here as the output value type of DataJoinMapperBase; since it must
 * implement Writable, we implement the two serialization methods ourselves.
 */
public class TaggedWritable extends TaggedMapOutput {

    private Writable data;

    // Hadoop instantiates this class via reflection, so a no-argument
    // constructor is required (see the exception discussed at the end).
    public TaggedWritable() {
        this.tag = new Text();
    }

    public TaggedWritable(Writable data) {
        this.tag = new Text(); // the tag is set later via setTag()
        this.data = data;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        tag.readFields(in);
        // Read the concrete class name of the payload so we know what to deserialize.
        String dataClz = in.readUTF();
        if (this.data == null
                || !this.data.getClass().getName().equals(dataClz)) {
            try {
                this.data = (Writable) ReflectionUtils.newInstance(
                        Class.forName(dataClz), null);
            } catch (ClassNotFoundException e) {
                e.printStackTrace();
            }
        }
        data.readFields(in);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        tag.write(out);
        // Write the concrete class name of the payload before the payload itself.
        out.writeUTF(this.data.getClass().getName());
        data.write(out);
    }

    @Override
    public Writable getData() {
        return data;
    }
}
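
A quick way to convince yourself that the serialization above works is to round-trip an instance through a byte array. This test class is not part of the original post and its name is illustrative:

package com.hadoop.data.join;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;

import org.apache.hadoop.io.Text;

// Hypothetical helper: serializes a TaggedWritable and reads it back to confirm
// that the no-arg constructor and the class-name bookkeeping behave as expected.
public class TaggedWritableRoundTrip {
    public static void main(String[] args) throws Exception {
        TaggedWritable original = new TaggedWritable(new Text("3,wangbo,15986854789"));
        original.setTag(new Text("Customers.txt"));

        // Serialize to a byte array, much like the shuffle would.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));

        // Deserialize through the no-argument constructor, as Hadoop does via reflection.
        TaggedWritable copy = new TaggedWritable();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));

        System.out.println(copy.getTag() + " -> " + copy.getData());
        // expected: Customers.txt -> 3,wangbo,15986854789
    }
}

The no-argument constructor exercised here is exactly what Hadoop relies on when it deserializes values on the reduce side.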

Mapper class: JoinMapper

package com.hadoop.data.join;

import org.apache.hadoop.contrib.utils.join.DataJoinMapperBase;
import org.apache.hadoop.contrib.utils.join.TaggedMapOutput;
import org.apache.hadoop.io.Text;

public class JoinMapper extends DataJoinMapperBase {

    // Called when the task starts to produce the tag for this input;
    // here we simply use the input file name as the tag.
    @Override
    protected Text generateInputTag(String inputFile) {
        System.out.println("inputFile = " + inputFile);
        return new Text(inputFile);
    }

    // The delimiter is hard-coded as ','; more generally, the user should be
    // able to configure both the delimiter and the group key.
    // Extracts the group key (the first field, i.e. the id).
    @Override
    protected Text generateGroupKey(TaggedMapOutput record) {
        String tag = ((Text) record.getTag()).toString();
        System.out.println("tag = " + tag);
        String line = ((Text) record.getData()).toString();
        String[] tokens = line.split(",");
        return new Text(tokens[0]);
    }

    // Returns a TaggedWritable carrying whatever Text tag we want.
    @Override
    protected TaggedMapOutput generateTaggedMapOutput(Object value) {
        TaggedWritable retv = new TaggedWritable((Text) value);
        retv.setTag(this.inputTag); // do not forget to tag the current record
        return retv;
    }
}
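
As noted in the comment above, the ',' delimiter is hard-coded. A minimal sketch of how the mapper side could be made configurable, assuming a hypothetical job property join.delimiter (this class is not part of the original post):

package com.hadoop.data.join;

import java.util.regex.Pattern;

import org.apache.hadoop.contrib.utils.join.TaggedMapOutput;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;

// Illustrative variant of JoinMapper: the field delimiter comes from the job
// configuration instead of being hard-coded. "join.delimiter" is an assumption.
public class ConfigurableJoinMapper extends JoinMapper {

    private String delimiter = ",";

    @Override
    public void configure(JobConf job) {
        super.configure(job); // DataJoinMapperBase sets up inputFile and inputTag here
        this.delimiter = job.get("join.delimiter", ",");
    }

    @Override
    protected Text generateGroupKey(TaggedMapOutput record) {
        String line = ((Text) record.getData()).toString();
        // Pattern.quote so delimiters like "|" are not treated as regular expressions.
        return new Text(line.split(Pattern.quote(delimiter))[0]);
    }
}

A complete version would read the same property in the reducer as well, since combine() below also splits records on the delimiter.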

Reducer class: JoinReducer

package com.hadoop.data.join;

import org.apache.hadoop.contrib.utils.join.DataJoinReducerBase;
import org.apache.hadoop.contrib.utils.join.TaggedMapOutput;
import org.apache.hadoop.io.Text;

public class JoinReducer extends DataJoinReducerBase {

    // The two argument arrays always have the same length, which is at most
    // the number of data sources.
    @Override
    protected TaggedMapOutput combine(Object[] tags, Object[] values) {
        if (tags.length < 2)
            return null; // this check is what makes the join an inner join
        String joinedStr = "";
        for (int i = 0; i < values.length; i++) {
            if (i > 0)
                joinedStr += ","; // use ',' to separate the fields of the two source records
            TaggedWritable tw = (TaggedWritable) values[i];
            String line = ((Text) tw.getData()).toString();
            // Split the record into two parts and drop the group key from the front.
            String[] tokens = line.split(",", 2);
            joinedStr += tokens[1];
        }
        TaggedWritable retv = new TaggedWritable(new Text(joinedStr));
        retv.setTag((Text) tags[0]); // tag the combined record; the group key becomes the final output key
        return retv;
    }
}
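
With the sample data, combine() behaves roughly as follows (values abbreviated, tags shown as file names):

group key 3, call 1: tags = [Customers.txt, Orders.txt], values = ["3,wangbo,15986854789", "3,A,99,2013-03-05"]  ->  "wangbo,15986854789,A,99,2013-03-05"
group key 3, call 2: tags = [Customers.txt, Orders.txt], values = ["3,wangbo,15986854789", "3,D,56,2013-06-07"]  ->  "wangbo,15986854789,D,56,2013-06-07"
group key 4:         tags = [Customers.txt] only, so tags.length < 2 and combine() returns null

This is why customer 4, who has no orders, does not appear in the final output.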

Driver class: DataJoinDriver

package com.hadoop.data.join;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class DataJoinDriver extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        if (args.length != 2) {
            System.err.println("Usage: DataJoin <input path> <output path>");
            System.exit(-1);
        }
        Path in = new Path(args[0]);
        Path out = new Path(args[1]);
        JobConf job = new JobConf(conf, DataJoinDriver.class);
        job.setJobName("DataJoin");
        FileSystem hdfs = FileSystem.get(conf);
        FileInputFormat.setInputPaths(job, in);
        // Delete the output directory if it already exists.
        if (hdfs.exists(out)) {
            hdfs.delete(out, true);
        }
        FileOutputFormat.setOutputPath(job, out);
        job.setMapperClass(JoinMapper.class);
        job.setReducerClass(JoinReducer.class);
        job.setInputFormat(TextInputFormat.class);
        job.setOutputFormat(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(TaggedWritable.class);
        JobClient.runJob(job);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new DataJoinDriver(), args);
        System.exit(res);
    }
}
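
To run the job, the classes above are typically packaged into a jar and submitted with something like hadoop jar datajoin-example.jar com.hadoop.data.join.DataJoinDriver /user/hadoop/join/in /user/hadoop/join/out, where the jar name and HDFS paths are placeholders. Note that DataJoinMapperBase, DataJoinReducerBase and TaggedMapOutput live in the DataJoin contrib jar (hadoop-datajoin-*.jar in older distributions), which must also be visible to the tasks, for example via the -libjars option that ToolRunner/GenericOptionsParser already supports in this driver.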

And with that, we are done.

The output:

1       wuminggang,13575468248,B,89,2013-02-05
2       liujiannan,18965235874,C,69,2013-03-09
3       wangbo,15986854789,A,99,2013-03-05
3       wangbo,15986854789,D,56,2013-06-07
Note: the parts highlighted in red in the original post (the no-argument constructor of TaggedWritable and the class name written by write() and read back by readFields()) must be included; some reference books leave them out. Without them, using DataJoin for a reduce-side join over multiple data sources fails with the following exception:

java.lang.RuntimeException: java.lang.NoSuchMethodException: com.hadoop.reducedatajoin.ReduceDataJoin$TaggedWritable.<init>()
   at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
   at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:62)
   at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
   at org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:1271)
   at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:1211)
   at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:249)
   at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:245)
   at org.apache.hadoop.contrib.utils.join.DataJoinReducerBase.regroup(DataJoinReducerBase.java:106)
   at org.apache.hadoop.contrib.utils.join.DataJoinReducerBase.reduce(DataJoinReducerBase.java:129)
   at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:519)
   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)
Caused by: java.lang.NoSuchMethodException: com.hadoop.reducedatajoin.ReduceDataJoin$TaggedWritable.<init>()
   at java.lang.Class.getConstructor0(Unknown Source)
   at java.lang.Class.getDeclaredConstructor(Unknown Source)
   at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:109)
   ... 11 more

Solution:

http://stackoverflow.com/questions/10201500/hadoop-reduce-side-join-using-datajoin

  You need a default constructor for TaggedWritable (Hadoop uses reflection to create this object, and requires a default constructor (no args)).

  You also have a problem in that your readFields method, you call data.readFields(in) on the writable interface - but has no knowledge of the actual runtime class of data.

  I suggest you either write out the data class name before outputting the data object itself, or look into the GenericWritable class (you'll need to extend it to define the set of allowable writable classes that can be used).

  So you could amend as follows:

In short: TaggedWritable needs a default (no-argument) constructor, because Hadoop uses reflection to create the object. There is also a problem in readFields(): data.readFields(in) is called through the Writable interface without knowing the actual runtime class of data. The fix is either to write out the class name before the data object itself (which is what the TaggedWritable above does), or to use the GenericWritable class, extending it to declare the set of Writable classes it may wrap. A sketch of the GenericWritable variant follows.
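
For completeness, here is a minimal sketch of the GenericWritable alternative mentioned in that answer (the class name JoinGenericWritable is illustrative, not from the original post). GenericWritable serializes a compact index into the declared type array instead of a full class name, so no hand-written class-name bookkeeping is needed:

package com.hadoop.data.join;

import org.apache.hadoop.io.GenericWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// Illustrative sketch: the set of allowed payload classes is declared once,
// and GenericWritable handles the (de)serialization of the wrapped object.
public class JoinGenericWritable extends GenericWritable {

    @SuppressWarnings("unchecked")
    private static final Class<? extends Writable>[] TYPES =
            (Class<? extends Writable>[]) new Class[] { Text.class };

    // Required no-arg constructor for Hadoop's reflection-based instantiation.
    public JoinGenericWritable() {
    }

    public JoinGenericWritable(Writable instance) {
        set(instance);
    }

    @Override
    protected Class<? extends Writable>[] getTypes() {
        return TYPES;
    }
}

In this sketch only Text payloads are allowed; further Writable classes would simply be added to TYPES.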

Advantages: a common, straightforward approach when the amount of data to be joined is relatively small.

Disadvantages: not very efficient. All records are reshuffled, and most of the reshuffled data is then discarded again on the reduce side; filtering out the unneeded data already in the map phase would improve efficiency. See Multi-Source MapReduce Jobs (Part 2).