Hadoop MapReduce Tips
While writing MapReduce programs with Hadoop I ran into a number of problems. By searching on Google and combining what I found with my own understanding of Hadoop, I solved them one by one.
Custom Writable
Hadoop places requirements on the Key and Value types used in MapReduce: simply put, these types must support Hadoop's serialization. To make serialization fast, Hadoop provides serializable counterparts for the common Java primitive types, such as IntWritable and LongWritable, plus the Text type for String. These built-in types, however, are not enough for the business cases you actually meet. In that situation you need to define a custom Writable class that can serve as a Job's Key or Value (strictly speaking, a type used as a Key must also implement WritableComparable so that keys can be sorted) while still expressing the business logic.
Suppose I have already crawled book data from Douban, including each book's Title and the reader-defined Tags, and stored it as JSON in text files. Now I want to extract the parts I am interested in, for instance the Tag list of a given book, including the number of times each Tag has been applied. These values can serve as vectors, the raw input for the data analysis that follows. For the Map, I want to read the JSON file and produce, for every book, its Title together with the information of a single Tag. As the Map output I want my own type, BookTag, which holds nothing but the Tag's name and its count:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class BookTag implements Writable {
    private String name;
    private int count;

    public BookTag() {
        count = 0;
    }

    public BookTag(String name, int count) {
        this.name = name;
        this.count = count;
    }

    @Override
    public void write(DataOutput dataOutput) throws IOException {
        if (dataOutput != null) {
            // String fields go through Text's static helpers, not DataOutput directly
            Text.writeString(dataOutput, name);
            dataOutput.writeInt(count);
        }
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        if (dataInput != null) {
            name = Text.readString(dataInput);
            count = dataInput.readInt();
        }
    }

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public int getCount() { return count; }
    public void setCount(int count) { this.count = count; }

    @Override
    public String toString() {
        return "BookTag{" + "name='" + name + '\'' + ", count=" + count + '}';
    }
}
Note that in write() and readFields() the String field is handled quite differently from Int, Long, and the other primitive types: it must go through the static helper methods of Text.
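The original post does not show the Mapper itself; below is a minimal sketch of what it might look like, assuming Jackson is on the classpath and a hypothetical one-JSON-document-per-line layout such as {"title": "...", "tags": [{"name": "...", "count": 3}]}. The class name BookMap and the JSON field names are illustrative, not from the original data:

import java.io.IOException;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BookMap extends Mapper<LongWritable, Text, Text, BookTag> {
    private final ObjectMapper jsonMapper = new ObjectMapper();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each input line is assumed to be one JSON document, e.g.
        // {"title":"Dune","tags":[{"name":"scifi","count":3}]}
        JsonNode book = jsonMapper.readTree(value.toString());
        Text title = new Text(book.get("title").asText());
        for (JsonNode tag : book.get("tags")) {
            // Emit one <Title, BookTag> pair per tag of the book
            context.write(title, new BookTag(tag.get("name").asText(),
                                             tag.get("count").asInt()));
        }
    }
}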
For each book, the Map output may contain duplicate BookTag entries (entries whose Tag name is the same), whereas I need the total count per Tag to use as the analysis vector. The Reduce input can therefore be <Text, Iterable<BookTag>>, but the output should merge all entries that share a Tag. For this I introduced the BookTags class, which internally maintains a Map of BookTag and likewise has to implement Writable. Since BookTags wraps a collection, its implementation differs slightly:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.Writable;

public class BookTags implements Writable {
    private Map<String, BookTag> tags = new HashMap<String, BookTag>();

    @Override
    public void write(DataOutput dataOutput) throws IOException {
        // Write the size first so readFields() knows how many entries to expect
        dataOutput.writeInt(tags.size());
        for (BookTag tag : tags.values()) {
            tag.write(dataOutput);
        }
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        // Reset state first: Hadoop may reuse the same instance across records
        tags.clear();
        int size = dataInput.readInt();
        for (int i = 0; i < size; i++) {
            BookTag tag = new BookTag();
            tag.readFields(dataInput);
            tags.put(tag.getName(), tag);
        }
    }

    public void add(BookTag tag) {
        String tagName = tag.getName();
        if (tags.containsKey(tagName)) {
            // Merge duplicate tags by accumulating their counts
            BookTag bookTag = tags.get(tagName);
            bookTag.setCount(bookTag.getCount() + tag.getCount());
        } else {
            tags.put(tagName, tag);
        }
    }

    @Override
    public String toString() {
        StringBuilder resultTags = new StringBuilder();
        for (BookTag tag : tags.values()) {
            resultTags.append(tag.toString());
            resultTags.append("|");
        }
        return resultTags.toString();
    }
}
In fact, for a custom Writable that nests a collection like this, the nested type itself implements Writable, so you can simply delegate to the nested type's write() and readFields(); the only difference is that the collection's size must be written to the DataOutput first, so that the collection can be reconstructed element by element when reading it back. This is really just the Composite pattern.
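A quick way to convince yourself that the composite serialization works is a round trip through a byte buffer. A minimal sketch, using only the classes defined above (not part of the job itself):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class BookTagsRoundTrip {
    public static void main(String[] args) throws IOException {
        BookTags original = new BookTags();
        original.add(new BookTag("scifi", 2));
        original.add(new BookTag("scifi", 3)); // merged: count becomes 5
        original.add(new BookTag("classic", 1));

        // Serialize exactly as Hadoop would, via write(DataOutput)
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        original.write(new DataOutputStream(buffer));

        // Deserialize into a fresh instance via readFields(DataInput)
        BookTags restored = new BookTags();
        restored.readFields(new DataInputStream(
                new ByteArrayInputStream(buffer.toByteArray())));

        // Prints the merged tags, e.g. BookTag{name='scifi', count=5}|BookTag{name='classic', count=1}|
        System.out.println(restored);
    }
}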
The Strange Behavior of Iterable
In the reduce() method I need to iterate over the incoming Iterable in order to accumulate the counts of duplicate Tags. While iterating I noticed something odd: in the final output, every book ended up with identical Tag information. Debugging the Reduce job showed that each time the iteration advanced to the next element, the newest value overwrote the objects obtained before it, so they all became the same object. A Google search revealed that this is Hadoop's (admittedly strange) deliberate behavior: the Iterable's next() always returns the same, reused object. The fix is to create a new object during iteration and put that copy into the collection we keep, as the loop body below shows:
public static class BookReduce extends Reducer<Text, BookTag, Text, BookTags> {
    @Override
    public void reduce(Text key, Iterable<BookTag> values, Context context)
            throws IOException, InterruptedException {
        BookTags bookTags = new BookTags();
        for (BookTag tag : values) {
            // Copy the value: Hadoop hands back the same reused object on every iteration
            bookTags.add(new BookTag(tag.getName(), tag.getCount()));
        }
        context.write(key, bookTags);
    }
}
One lesson learned here: when writing MapReduce programs, debugging helps you locate problems quickly. To debug, create an input folder under the project's root directory, place the source data files in it, and point the run/debug program arguments at it.
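For completeness, here is a minimal sketch of the job driver the article otherwise omits, wired up for the book-tag job so it can be run or debugged locally as just described. BookMap is the hypothetical mapper sketched earlier and BookReduce is the reducer shown above; adjust class names and paths to your project:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BookTagDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "book tags");
        job.setJarByClass(BookTagDriver.class);
        job.setMapperClass(BookMap.class);
        job.setReducerClass(BookReduce.class);

        // Map and Reduce emit different value types, so both must be declared
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(BookTag.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(BookTags.class);

        // e.g. args = { "input", "output" } when debugging from the project root
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}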
How to Unit Test
We can write unit tests for MapReduce jobs as well. Besides mocking with Mockito, I find that MRUnit does a better job of verifying MapReduce tasks. MRUnit supplies a corresponding Driver for Map and for Reduce, namely MapDriver and ReduceDriver. When writing a test case, we only need to give the Driver its Input and expected Output and then call the Driver's runTest() method to check whether the task behaves as expected; the expectation is expressed in terms of the produced output. Taking WordCounter as an example, the unit test looks like this:
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Before;
import org.junit.Test;

public class WordCounterTest {
    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;
    private ReduceDriver<Text, IntWritable, Text, IntWritable> reduceDriver;

    @Before
    public void setUp() {
        WordCounter.Map tokenizerMapper = new WordCounter.Map();
        WordCounter.Reduce reducer = new WordCounter.Reduce();
        mapDriver = MapDriver.newMapDriver(tokenizerMapper);
        reduceDriver = ReduceDriver.newReduceDriver(reducer);
    }

    @Test
    public void should_execute_tokenizer_map_job() throws IOException {
        mapDriver.withInput(new LongWritable(12), new Text("I am Bruce Bruce"));
        // Expected outputs must be declared in emission order
        mapDriver.withOutput(new Text("I"), new IntWritable(1));
        mapDriver.withOutput(new Text("am"), new IntWritable(1));
        mapDriver.withOutput(new Text("Bruce"), new IntWritable(1));
        mapDriver.withOutput(new Text("Bruce"), new IntWritable(1));
        mapDriver.runTest();
    }

    @Test
    public void should_execute_reduce_job() throws IOException {
        List<IntWritable> values = new ArrayList<IntWritable>();
        values.add(new IntWritable(1));
        values.add(new IntWritable(3));
        reduceDriver.withInput(new Text("Bruce"), values);
        reduceDriver.withOutput(new Text("Bruce"), new IntWritable(4));
        reduceDriver.runTest();
    }
}
Chaining Jobs
Hadoop's ChainMapper and ChainReducer make it fairly easy to chain multiple Map jobs and a Reduce job together. For example, we can split WordCounter into two Map tasks, Tokenizer and UpperCaser, with the Reduce running last. Unfortunately, ChainMapper and ChainReducer do not seem to support the new API: the Map and Reduce classes to be chained must extend MapReduceBase and implement the corresponding old-style Mapper or Reducer interfaces. (Note: the code below is largely taken from a Stack Overflow post.)
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ChainWordCounter extends Configured implements Tool {

    public static class Tokenizer extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    public static class UpperCaser extends MapReduceBase
            implements Mapper<Text, IntWritable, Text, IntWritable> {
        public void map(Text key, IntWritable count,
                        OutputCollector<Text, IntWritable> collector, Reporter reporter)
                throws IOException {
            collector.collect(new Text(key.toString().toUpperCase()), count);
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> collector, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            result.set(sum);
            collector.collect(key, result);
        }
    }

    public int run(String[] args) throws Exception {
        JobConf jobConf = new JobConf(getConf(), ChainWordCounter.class);
        FileInputFormat.setInputPaths(jobConf, new Path(args[0]));
        Path outputDir = new Path(args[1]);
        FileOutputFormat.setOutputPath(jobConf, outputDir);
        // Remove stale output from previous runs
        outputDir.getFileSystem(getConf()).delete(outputDir, true);

        JobConf tokenizerMapConf = new JobConf(false);
        ChainMapper.addMapper(jobConf, Tokenizer.class,
                LongWritable.class, Text.class, Text.class, IntWritable.class,
                true, tokenizerMapConf);

        JobConf upperCaserMapConf = new JobConf(false);
        ChainMapper.addMapper(jobConf, UpperCaser.class,
                Text.class, IntWritable.class, Text.class, IntWritable.class,
                true, upperCaserMapConf);

        JobConf reduceConf = new JobConf(false);
        ChainReducer.setReducer(jobConf, Reduce.class,
                Text.class, IntWritable.class, Text.class, IntWritable.class,
                true, reduceConf);

        JobClient.runJob(jobConf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int ret = ToolRunner.run(new Configuration(), new ChainWordCounter(), args);
        System.exit(ret);
    }
}
I wonder when this mechanism will get proper support for the new API.
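For what it's worth, later Hadoop releases (2.x onward) do ship new-API versions of these classes in the org.apache.hadoop.mapreduce.lib.chain package. The sketch below shows how the same chain could be wired up against that API; it assumes Tokenizer, UpperCaser, and Reduce have been rewritten against the new org.apache.hadoop.mapreduce.Mapper/Reducer base classes, and the signatures should be verified against your Hadoop version:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NewApiChainWordCounter {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "chain word counter");
        job.setJarByClass(NewApiChainWordCounter.class);

        // Tokenizer, UpperCaser and Reduce are assumed to be new-API rewrites
        // (extending org.apache.hadoop.mapreduce.Mapper / Reducer)
        ChainMapper.addMapper(job, Tokenizer.class,
                LongWritable.class, Text.class, Text.class, IntWritable.class,
                new Configuration(false));
        ChainMapper.addMapper(job, UpperCaser.class,
                Text.class, IntWritable.class, Text.class, IntWritable.class,
                new Configuration(false));
        ChainReducer.setReducer(job, Reduce.class,
                Text.class, IntWritable.class, Text.class, IntWritable.class,
                new Configuration(false));

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}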