Data-Intensive Text Processing with MapReduce: MapReduce Algorithm Design


MapReduce Algorithm Design

In-Mapper Combining

Main idea: perform local aggregation manually with an associative array (a Java Map) held inside the Mapper, i.e. implement the Combiner within the Mapper itself.
Example: WordCount


Reason:
1. Hadoop's Combiner mechanism runs regardless of the key distribution: if most keys have only a single value, the Combiner achieves nothing, yet sorting and other work still happen before it runs.
2. Object serialization and deserialization between the map output and the Combiner consume system resources.
Disadvantages:
1. It breaks the functional-programming model of MapReduce, but the efficiency gain is large.
2. Memory bottleneck: the in-mapper Map may accumulate so many keys that it no longer fits entirely in memory.
Mitigation: flush (emit and clear) the Map after every N key-value pairs, or whenever its size reaches a threshold; however, suitable values for N and the size threshold are hard to determine.
Summary: With local aggregation (either combiners or in-mapper combining), we substantially reduce the number of values associated with frequently-occurring terms, which alleviates the reduce straggler problem.
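A minimal sketch of the pattern for WordCount (illustrative code, not the book's listing; class and variable names are assumptions): counts are accumulated in a HashMap inside the mapper and emitted once in cleanup(), instead of emitting (token, 1) for every token.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// In-mapper combining: aggregate counts locally and emit them once per map task.
public class InMapperWordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        for (String token : value.toString().split("\\s+")) {
            Integer c = counts.get(token);
            counts.put(token, c == null ? 1 : c + 1); // aggregate instead of emitting (token, 1)
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // emit the aggregated counts once, when the map task finishes
        Text word = new Text();
        IntWritable count = new IntWritable();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            word.set(e.getKey());
            count.set(e.getValue());
            context.write(word, count);
        }
    }
}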

The Correctness of the Aggregated Algorithm

Main idea: some aggregation functions (e.g. the mean) are not associative and commutative when combined naively, so the computation must be transformed into an equivalent one that is.
Example: Mean


Optimize: combine this with the in-mapper combining above to further improve efficiency.
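For example, combining partial means directly is wrong: mean(mean(1, 2), 3) = 2.25, but mean(1, 2, 3) = 2. The standard fix is to propagate partial (sum, count) pairs and divide only at the end. A minimal sketch (illustrative; SumCountWritable is an assumed helper type, not from the book's code):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical Writable carrying a partial (sum, count) pair.
class SumCountWritable implements Writable {
    long sum;
    long count;

    @Override
    public void write(DataOutput out) throws IOException { out.writeLong(sum); out.writeLong(count); }

    @Override
    public void readFields(DataInput in) throws IOException { sum = in.readLong(); count = in.readLong(); }
}

// Mapper and Combiner emit (key, (sum, count)); combining is (s1, c1) + (s2, c2) = (s1 + s2, c1 + c2),
// which is associative and commutative, so the Combiner may run any number of times.
public class MeanReducer extends Reducer<Text, SumCountWritable, Text, DoubleWritable> {
    private DoubleWritable outValue = new DoubleWritable();

    @Override
    protected void reduce(Text key, Iterable<SumCountWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0, count = 0;
        for (SumCountWritable v : values) { // merge the partial aggregates
            sum += v.sum;
            count += v.count;
        }
        outValue.set((double) sum / count); // the division happens exactly once, in the reducer
        context.write(key, outValue);
    }
}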

Pairs and Strips

Main idea: Pairs: represent the key as a composite of the two co-occurring items. Strips (called "stripes" in the book): for a given key, represent the value as an associative array (a Map).
Examples: counting word co-occurrences in a corpus.
Pairs:



Strips:


Discussion:

1. Pairs generates far more key-value pairs than Strips; Strips packs the data more compactly, but involves more serialization and deserialization work.

2. Both Pairs and Strips can be optimized with a Combiner, but the Combiner is more effective for Strips (far fewer distinct keys, so more opportunity to combine).

3. The Pairs approach records individual co-occurrence events, whereas Strips records, for a given event a, all events that co-occur with it.

4. A compromise is to first split the co-occurring words into a fixed number of buckets (strips), and then use pairs within each bucket.


Pairs and Strips code:

package PairsAndStrips;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

import java.io.IOException;

// InputFormat that delivers each input file as a single record: key = NullWritable, value = the file's bytes.
public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false; // never split, so one map() call sees the whole file
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        return new WholeFileRecordReader(); // initialize() is invoked by the framework
    }
}

package PairsAndStrips;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

import java.io.IOException;

// Reads an entire file into one BytesWritable value.
public class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {

    private FileSplit split;
    private TaskAttemptContext context;
    private boolean taskProcessed = false;
    private BytesWritable value = new BytesWritable();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
        this.split = (FileSplit) split;
        this.context = context;
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (!taskProcessed) {
            // read the whole file into the value buffer
            byte[] bytes = new byte[(int) split.getLength()];
            FileSystem fileSystem = FileSystem.get(context.getConfiguration());
            FSDataInputStream in = fileSystem.open(split.getPath());
            try {
                IOUtils.readFully(in, bytes, 0, bytes.length);
            } finally {
                IOUtils.closeStream(in);
            }
            value.set(bytes, 0, bytes.length);
            taskProcessed = true;
            return true;
        }
        return false;
    }

    @Override
    public NullWritable getCurrentKey() throws IOException, InterruptedException {
        return NullWritable.get();
    }

    @Override
    public BytesWritable getCurrentValue() throws IOException, InterruptedException {
        return value;
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return taskProcessed ? 1 : 0;
    }

    @Override
    public void close() throws IOException {
        // nothing to close; the stream is closed in nextKeyValue()
    }
}

package PairsAndStrips;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Pairs approach to word co-occurrence: emit ((left, right), 1) for every pair of adjacent words.
public class Pairs extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new Pairs(), args);
        System.exit(exitCode);
    }

    // Composite key holding the co-occurring word pair.
    public static class Pair implements WritableComparable<Pair> {
        private Text left;
        private Text right;

        public Pair() {
            left = new Text();
            right = new Text();
        }

        public Text getLeft() { return left; }
        public void setLeft(Text left) { this.left = left; }
        public Text getRight() { return right; }
        public void setRight(Text right) { this.right = right; }

        @Override
        public int compareTo(Pair that) {
            int cmp = this.getLeft().compareTo(that.getLeft());
            if (cmp != 0) {
                return cmp;
            }
            return this.getRight().compareTo(that.getRight());
        }

        @Override
        public void write(DataOutput out) throws IOException {
            left.write(out);
            right.write(out);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            left.readFields(in);
            right.readFields(in);
        }

        @Override
        public int hashCode() {
            return left.hashCode() * 13 + right.hashCode();
        }

        @Override
        public boolean equals(Object obj) {
            if (obj instanceof Pair) {
                Pair that = (Pair) obj;
                return this.getLeft().equals(that.getLeft()) && this.getRight().equals(that.getRight());
            }
            return false;
        }

        @Override
        public String toString() {
            return left.toString() + "-" + right.toString();
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "Pairs");
        job.setJarByClass(Pairs.class);
        job.setInputFormatClass(WholeFileInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Pair.class);
        job.setOutputValueClass(IntWritable.class);
        WholeFileInputFormat.addInputPath(job, new Path(args[0]));
        TextOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static class MyMapper extends Mapper<NullWritable, BytesWritable, Pair, IntWritable> {
        private Pair outKey = new Pair();
        private IntWritable outValue = new IntWritable(1);

        @Override
        protected void map(NullWritable key, BytesWritable value, Context context)
                throws IOException, InterruptedException {
            String[] tokens = new String(value.getBytes(), 0, value.getLength()).split(" ");
            // emit one (pair, 1) for every pair of adjacent tokens
            for (int i = 0; i < tokens.length - 1; i++) {
                outKey.setLeft(new Text(tokens[i]));
                outKey.setRight(new Text(tokens[i + 1]));
                context.write(outKey, outValue);
            }
        }
    }

    public static class MyReducer extends Reducer<Pair, IntWritable, Pair, IntWritable> {
        private IntWritable outValue = new IntWritable();

        @Override
        protected void reduce(Pair key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0; // total co-occurrence count for this pair
            for (IntWritable value : values) {
                sum += value.get();
            }
            outValue.set(sum);
            context.write(key, outValue);
        }
    }
}

package PairsAndStrips;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Strips approach to word co-occurrence: for each word, emit an associative array (strip) mapping
// its right neighbours to counts; the reducer merges the strips element-wise.
public class Strips extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new Strips(), args);
        System.exit(exitCode);
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "Strips");
        job.setJarByClass(Strips.class);
        job.setInputFormatClass(WholeFileInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        TextOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(StripsMapper.class);
        job.setReducerClass(StripsReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Strip.class);
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static class StripsMapper extends Mapper<NullWritable, BytesWritable, Text, Strip> {
        private HashMap<String, Strip> wordStripHashMap = new HashMap<String, Strip>();
        private Strip strip;
        private Text outkey = new Text();
        private Strip outvalue = new Strip();

        @Override
        protected void map(NullWritable key, BytesWritable value, Context context)
                throws IOException, InterruptedException {
            String[] tokens = new String(value.getBytes(), 0, value.getLength()).trim().split(" ");
            // build one strip per word: strip[next word] += 1 (in-mapper combining across the whole file)
            for (int i = 0; i < tokens.length - 1; i++) {
                if (!wordStripHashMap.containsKey(tokens[i])) {
                    wordStripHashMap.put(tokens[i], new Strip());
                }
                strip = wordStripHashMap.get(tokens[i]);
                mapPlus(strip.getStripMap(), tokens[i + 1]);
            }
            // emit one (word, strip) pair per distinct word
            for (Map.Entry<String, Strip> entry : wordStripHashMap.entrySet()) {
                outkey.set(entry.getKey());
                outvalue.set(entry.getValue());
                context.write(outkey, outvalue);
            }
        }

        private Integer mapPlus(HashMap<String, Integer> hashMap, String key) {
            if (hashMap.containsKey(key)) {
                return hashMap.put(key, hashMap.get(key) + 1);
            } else {
                return hashMap.put(key, 1);
            }
        }
    }

    public static class StripsReducer extends Reducer<Text, Strip, Text, Strip> {
        private Strip strip = new Strip();

        @Override
        protected void reduce(Text key, Iterable<Strip> values, Context context)
                throws IOException, InterruptedException {
            // merge all strips for this word element-wise
            for (Strip value : values) {
                for (Map.Entry<String, Integer> entry : value.getStripMap().entrySet()) {
                    mapPlus(strip, entry.getKey(), entry.getValue());
                }
            }
            context.write(key, strip);
            strip.clear(); // the Strip instance is reused across reduce() calls
        }

        private Integer mapPlus(Strip strip, String key, int num) {
            HashMap<String, Integer> hashMap = strip.getStripMap();
            if (hashMap.containsKey(key)) {
                return hashMap.put(key, hashMap.get(key) + num);
            } else {
                return hashMap.put(key, num);
            }
        }
    }
}

// A strip: an associative array from neighbouring word to count, serialized as (size, key/value pairs).
class Strip implements WritableComparable<Strip> {

    private HashMap<String, Integer> stripMap;

    public Strip() {
        stripMap = new HashMap<String, Integer>();
    }

    @Override
    public int compareTo(Strip that) {
        return this.size() > that.size() ? 1 : this.size() < that.size() ? -1 : 0;
    }

    @Override
    public int hashCode() {
        return stripMap.hashCode() * 131 + stripMap.size();
    }

    @Override
    public boolean equals(Object obj) {
        return obj instanceof Strip && stripMap.equals(((Strip) obj).getStripMap());
    }

    @Override
    public String toString() {
        StringBuilder out = new StringBuilder();
        for (String s : stripMap.keySet()) {
            out.append(s).append("-").append(stripMap.get(s)).append("||");
        }
        return out.toString();
    }

    HashMap<String, Integer> getStripMap() {
        return stripMap;
    }

    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeInt(stripMap.size());
        for (String s : stripMap.keySet()) {
            dataOutput.writeUTF(s);
            dataOutput.writeInt(stripMap.get(s));
        }
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        stripMap.clear(); // the same instance is reused during deserialization, so clear() first
        int size = dataInput.readInt();
        for (int i = 0; i < size; i++) {
            stripMap.put(dataInput.readUTF(), dataInput.readInt());
        }
    }

    public int size() {
        return stripMap.size();
    }

    public Integer put(String key, int value) {
        return stripMap.put(key, value); // may return null for a new key, hence Integer rather than int
    }

    public int get(Object key) {
        return stripMap.get(key);
    }

    public void set(Strip value) {
        this.stripMap = value.getStripMap();
    }

    public void clear() {
        stripMap.clear();
    }
}

Main issues encountered while writing the code: Strip must clear() its map when deserializing (readFields), because the same object is reused; and put(K, V) must return the wrapper object V (Integer), not int, otherwise unboxing the null returned for a previously absent key throws a NullPointerException.

Computing Relative Frequencies

MainIdea: exploit Hadoop's internal machinery to implement order inversion and improve computational efficiency.
Example: computing the relative co-occurrence frequency matrix over a corpus.
 
Strips: straightforward; at the reduce side, first sum the strip to obtain the marginal (analogous to a marginal probability distribution), then divide each count by it to get the relative frequencies.

Pairs: implemented with Hadoop's own machinery. Whenever the map emits a (left, right) pair with count 1, it also emits a companion pair (left, *) with count 1 that is used to count the total occurrences of left. A custom partitioner sends all keys with the same left word to the same reducer, and a custom sort order places (left, *) before every other key with the same left word. The reducer therefore receives input in the following order:

The first key-value group gives the marginal; dividing each subsequent count by that marginal yields the relative frequency.
The code is as follows:
package PairsAndStrips;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Pairs with order inversion: the special (left, *) key sorts first and carries the marginal count,
// so the reducer can turn raw co-occurrence counts into relative frequencies on the fly.
public class PairRelativeCal extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new PairRelativeCal(), args);
        System.exit(exitCode);
    }

    public static class Pair implements WritableComparable<Pair> {
        private Text left;
        private Text right;

        public Pair() {
            left = new Text();
            right = new Text();
        }

        public Text getLeft() { return left; }
        public void setLeft(Text left) { this.left = left; }
        public Text getRight() { return right; }
        public void setRight(Text right) { this.right = right; }

        @Override
        public int compareTo(Pair that) {
            int cmp = this.getLeft().compareTo(that.getLeft());
            if (cmp != 0) {
                return cmp;
            }
            // the special "*" pair sorts before all other pairs sharing the same left word
            if (this.getRight().toString().equals("*") && that.getRight().toString().equals("*")) {
                return 0;
            } else if (this.getRight().toString().equals("*")) {
                return -1;
            } else if (that.getRight().toString().equals("*")) {
                return 1;
            } else {
                return this.getRight().compareTo(that.getRight());
            }
        }

        @Override
        public void write(DataOutput out) throws IOException {
            left.write(out);
            right.write(out);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            left.readFields(in);
            right.readFields(in);
        }

        @Override
        public int hashCode() {
            return left.hashCode() * 13 + right.hashCode();
        }

        @Override
        public boolean equals(Object obj) {
            if (obj instanceof Pair) {
                Pair that = (Pair) obj;
                return this.getLeft().equals(that.getLeft()) && this.getRight().equals(that.getRight());
            }
            return false;
        }

        @Override
        public String toString() {
            return left.toString() + "-" + right.toString();
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "PairRelativeCal");
        job.setJarByClass(PairRelativeCal.class);
        job.setInputFormatClass(WholeFileInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setMapperClass(MyMapper.class);
        job.setPartitionerClass(MyPartitioner.class);
        job.setReducerClass(MyReducer.class);
        job.setMapOutputKeyClass(Pair.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Pair.class);
        job.setOutputValueClass(DoubleWritable.class);
        WholeFileInputFormat.addInputPath(job, new Path(args[0]));
        TextOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static class MyMapper extends Mapper<NullWritable, BytesWritable, Pair, IntWritable> {
        private Pair outKey = new Pair();
        private IntWritable outValue = new IntWritable(1);

        @Override
        protected void map(NullWritable key, BytesWritable value, Context context)
                throws IOException, InterruptedException {
            String[] tokens = new String(value.getBytes(), 0, value.getLength()).trim().split(" ");
            for (int i = 0; i < tokens.length - 1; i++) {
                // the actual co-occurrence pair
                outKey.setLeft(new Text(tokens[i]));
                outKey.setRight(new Text(tokens[i + 1]));
                context.write(outKey, outValue);
                // the companion (left, *) pair used to accumulate the marginal count
                outKey.setRight(new Text("*"));
                context.write(outKey, outValue);
            }
        }
    }

    // Partition on the left word only, so (left, *) and all (left, right) pairs meet in one reducer.
    public static class MyPartitioner extends Partitioner<Pair, IntWritable> {
        @Override
        public int getPartition(Pair pair, IntWritable intWritable, int numPartitions) {
            return (pair.getLeft().hashCode() & Integer.MAX_VALUE) % numPartitions; // mask keeps it non-negative
        }
    }

    public static class MyReducer extends Reducer<Pair, IntWritable, Pair, DoubleWritable> {
        private DoubleWritable outValue = new DoubleWritable();
        private double count; // marginal count for the current left word, set by the (left, *) group

        @Override
        protected void reduce(Pair key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            if (key.getRight().toString().equals("*")) {
                count = sum;               // remember the marginal; emit nothing
            } else {
                outValue.set(sum / count); // relative frequency = count(left, right) / count(left, *)
                context.write(key, outValue);
            }
        }
    }
}



Discussion: Strips suffers from a memory bottleneck, whereas Pairs does not. The figure below is the book's summary of the Pairs method, which matches the implementation above.


Secondary Sorting

MainIdea:"value-to-key conversion" design pattern:The basic idea is to move part of the value into the intermediate key to form a composite key, and let the MapReduce execution framework handle the sorting.(原本按key排序后再按照value排序,将key和value组合成新的key,利用hadoop内部机制排序)
 
Discussion: there is no memory bottleneck, but more keys are produced; the pattern makes full use of Hadoop's key-sorting machinery. An alternative implementation, mentioned in the book, is secondary sorting based on a grouping comparator. A sketch follows.
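A minimal sketch of value-to-key conversion (illustrative; all class names are assumptions, not from the book's code): the part of the value to sort on is pulled into a composite key, and a partitioner routes on the natural key only; with an additional grouping comparator on the natural key, all values for one key arrive in a single reduce() call, already sorted.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Partitioner;

// Composite key = (naturalKey, sortValue); the framework sorts by both fields.
public class CompositeKey implements WritableComparable<CompositeKey> {
    Text naturalKey = new Text();
    IntWritable sortValue = new IntWritable();

    @Override
    public void write(DataOutput out) throws IOException { naturalKey.write(out); sortValue.write(out); }

    @Override
    public void readFields(DataInput in) throws IOException { naturalKey.readFields(in); sortValue.readFields(in); }

    @Override
    public int compareTo(CompositeKey o) {            // sort by natural key first, then by the moved value
        int cmp = naturalKey.compareTo(o.naturalKey);
        return cmp != 0 ? cmp : sortValue.compareTo(o.sortValue);
    }
}

// Partition on the natural key only, so every value of a key reaches the same reducer,
// already sorted by sortValue when it arrives.
class NaturalKeyPartitioner extends Partitioner<CompositeKey, Text> {
    @Override
    public int getPartition(CompositeKey key, Text value, int numPartitions) {
        return (key.naturalKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}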

Relational Joins

Reduce-side Join

MainIdea: repartition the two datasets to be joined by the join key, then perform the join in the reduce phase.
Example:
one-to-one: each S record matches exactly one T record; use the join key (foreign key) as the intermediate key and the remaining attributes as the value.
one-to-many: one S record matches many T records; apply the value-to-key pattern, making the source tag part of the key, defining a custom sort order so that the record from S precedes the records from T, and defining a custom partitioner on the join key only (see the sketch after this list).

many-to-many: S and T are in a many-to-many relationship; the approach is the same as one-to-many, with the records of the smaller relation sorted first and buffered, which reintroduces a memory bottleneck.
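A rough sketch of the one-to-many reduce-side join (illustrative, not the book's listing). It assumes the sort order described above guarantees that the single S record arrives first in each group, and that the source tag is also carried in the value so the reducer can tell the relations apart:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Mappers over S emit (joinKey, "S" + '\t' + record); mappers over T emit (joinKey, "T" + '\t' + record).
// Value-to-key conversion plus a grouping comparator on the join key makes the S record come first.
public class ReduceSideJoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text joinKey, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String sRecord = null;
        for (Text v : values) {
            String[] parts = v.toString().split("\t", 2); // [source tag, payload]
            if ("S".equals(parts[0])) {
                sRecord = parts[1];                        // buffer only the single S record
            } else if (sRecord != null) {
                context.write(joinKey, new Text(sRecord + "\t" + parts[1])); // join S with each T
            }
        }
    }
}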


Map-Side Join

MainIdea: the inputs to a map-side join are usually the outputs of other MapReduce jobs; the S and T inputs must be partitioned over the same key ranges and be sorted within each partition. (If the join were chained directly after another job's reducers, a suitable partitioner would have to be defined; if the data is read from files, a custom InputFormat implementation may be needed.)

Memory-Backed Join


MainIdea: write the smaller dataset S to the distributed cache, build a HashMap from it during the Mapper's setup phase, then process the larger dataset with map(), probing the HashMap to perform the join. When the HashMap cannot fit entirely in memory, a distributed in-memory store (e.g. memcached) can be used instead. A sketch follows.
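A minimal sketch of the memory-backed (map-side hash) join (illustrative; the class name, the tab-separated record format, and the assumption that the small dataset was added to the distributed cache via DistributedCache.addCacheFile are all mine, not the book's):

package PairsAndStrips;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Memory-backed join: load the small dataset S into a HashMap in setup(),
// then stream the large dataset T through map() and probe the HashMap.
public class MemoryBackedJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> sTable = new HashMap<String, String>();

    @Override
    protected void setup(Context context) throws IOException {
        // assumed format of the cached S file: one "key<TAB>value" record per line
        Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] parts = line.split("\t", 2);
            sTable.put(parts[0], parts[1]);
        }
        reader.close();
    }

    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        String[] parts = record.toString().split("\t", 2); // T records assumed to be key<TAB>value too
        String sValue = sTable.get(parts[0]);
        if (sValue != null) {                              // inner join: emit only matching keys
            context.write(new Text(parts[0]), new Text(sValue + "\t" + parts[1]));
        }
    }
}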


Summary

1. Construct complex keys and values.
2. Make full use of setup and close (cleanup in the new API).
3. Keep track of which source each map/reduce input record comes from.
4. Control the sort order of intermediate results.
5. Control the partitioning of intermediate results.
6. Construct appropriate auxiliary key-value pairs, such as the (key, *)-n pairs used for relative frequencies in the Pairs approach.

To be continued.

