The MapReduce TopN Problem


Analysis: how to implement a WordCount-style TopN problem with MapReduce.
Sample data:

1   A   10
2   A   40
3   B   30
4   C   20
5   B   10
6   D   40
7   A   30
8   C   20
9   B   10
10  D   40
11  C   30
12  D   20

Key difficulties:

(1) The TreeSet approach on the reduce side
(2) Iterating over the values Iterable in reduce
Extension: the reduce-side values can only be traversed once

A relatively simple approach is to use the built-in TreeMap or TreeSet. Both are backed by a red-black tree and keep keys in sorted order internally, but every insertion of a new element costs more than the corresponding heap adjustment. To find the Top N largest elements, you build a min-heap, whose defining property is that the root is the smallest element. The heap never needs to be fully sorted; whenever the root is replaced by a new element, a single heapify restores the min-heap property.
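To make the heap idea concrete, here is a minimal standalone sketch (not part of the original post; the class name and sample values are made up) that keeps the Top 3 elements using java.util.PriorityQueue, which is a min-heap by default:

import java.util.PriorityQueue;

public class MinHeapTopN {
    public static void main(String[] args) {
        int n = 3;
        int[] data = {10, 40, 30, 20, 10, 40};
        // PriorityQueue is a min-heap by default: peek() returns the smallest element
        PriorityQueue<Integer> heap = new PriorityQueue<Integer>(n);
        for (int v : data) {
            if (heap.size() < n) {
                heap.offer(v);
            } else if (v > heap.peek()) {
                // New element is larger than the smallest of the current Top N:
                // replace the root; the heap re-heapifies in O(log n)
                heap.poll();
                heap.offer(v);
            }
        }
        System.out.println(heap); // the 3 largest values, in internal heap order
    }
}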

When no comparator is supplied, TreeMap sorts entries by key in ascending order by default.
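A quick illustration of that default ordering (a made-up example, class name included):

import java.util.TreeMap;

public class TreeMapOrderDemo {
    public static void main(String[] args) {
        // No Comparator supplied, so keys are kept in ascending natural order
        TreeMap<Integer, String> map = new TreeMap<Integer, String>();
        map.put(30, "B");
        map.put(10, "A");
        map.put(40, "D");
        System.out.println(map);            // {10=A, 30=B, 40=D}
        System.out.println(map.firstKey()); // 10 (smallest key)
        System.out.println(map.lastKey());  // 40 (largest key)
    }
}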

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.TreeSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class TopN implements Tool {

    public static class mapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Input line format: id \t letter \t value
            String[] strings = value.toString().split("\t");
            context.write(new Text(strings[1].trim()),
                    new LongWritable(Long.parseLong(strings[2].trim())));
        }
    }

    public static class reduce extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        public void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            // TreeSet keeps values in ascending order (note: duplicates are dropped)
            TreeSet<Long> tSet = new TreeSet<Long>();
            for (LongWritable value : values) {
                tSet.add(value.get());
                // Keep only the 3 largest values: evict the smallest inside the
                // loop, otherwise at most one element would ever be removed
                if (tSet.size() > 3) {
                    tSet.remove(tSet.first());
                }
            }
            for (Long num : tSet) {
                context.write(key, new LongWritable(num));
            }
        }
    }

    static String input = "";
    static String output = "";

    public int run(String[] str) throws IOException, URISyntaxException,
            ClassNotFoundException, InterruptedException {
        input = str[0];
        output = str[1];
        Configuration conf = new Configuration();
        FileSystem file = FileSystem.get(new URI(input), conf);
        Path outPath = new Path(output);
        if (file.exists(outPath)) {
            file.delete(outPath, true);
        }
        Job job = Job.getInstance(conf);
        job.setJarByClass(TopN.class);
        FileInputFormat.setInputPaths(job, input);
        job.setInputFormatClass(TextInputFormat.class);
        job.setMapperClass(mapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setPartitionerClass(HashPartitioner.class);
        job.setNumReduceTasks(4);
        job.setReducerClass(reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileOutputFormat.setOutputPath(job, outPath);
        job.setOutputFormatClass(TextOutputFormat.class);
        // Submit the job and block until it finishes
        job.waitForCompletion(true);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        ToolRunner.run(new TopN(), args);
    }

    public Configuration getConf() {
        return null;
    }

    public void setConf(Configuration arg0) {
    }
}
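For reference, running the job with something like hadoop jar topn.jar TopN <input> <output> (the jar name here is only an illustration) keeps at most the three largest values per key, in ascending order. Because TreeSet drops duplicate values, the combined output of the four part files on the sample data would be:

A   10
A   30
A   40
B   10
B   30
C   20
C   30
D   20
D   40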
Extension: the reduce-side Iterable can only be traversed once (a single forward pass)

Although the reduce method is invoked many times, the framework allocates only one key object and one value object and reuses them across all calls and all iteration steps, refilling their contents each time; the references you see never change, only the data inside them. So if you need to keep a key or a value beyond the current iteration step, you must copy the contained value out or clone the object.

public void reduce(Text host, Iterator<CrawlDatum> values,
        OutputCollector<Text, CrawlDatum> output, Reporter reporter)
        throws IOException {
    List<CrawlDatum> cache = new LinkedList<CrawlDatum>();
    // First pass: process each value and cache a deep copy, because the
    // framework reuses the same CrawlDatum instance on every next() call
    while (values.hasNext()) {
        CrawlDatum datum = values.next();
        doSomethingWithValue(datum);
        CrawlDatum copy = new CrawlDatum();
        copy.set(datum);
        cache.add(copy);
    }
    // Second pass over the cached copies
    for (CrawlDatum value : cache) {
        doSomethingElseThatCantBeDoneInFirstLoop(value);
    }
}
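For completeness, the same caching pattern in the newer org.apache.hadoop.mapreduce API used by the TopN job above might look like this (a minimal sketch; the class name and the choice to simply re-emit the cached values are illustrative):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TwoPassReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    public void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        // First pass: the framework refills the SAME LongWritable instance on
        // every step, so copy the primitive value out instead of keeping the reference
        List<Long> cache = new ArrayList<Long>();
        for (LongWritable value : values) {
            cache.add(value.get());
        }
        // Second pass over the cached copies
        for (Long v : cache) {
            context.write(key, new LongWritable(v));
        }
    }
}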

Reference blogs:
http://blog.csdn.net/zeb_perfect/article/details/53335207
On the single-pass iteration of the reduce-side Iterable:
http://www.wangzhe.tech/MapReduce/MapReduce%E4%B8%ADreduce%E9%98%B6%E6%AE%B5iterator%E5%A6%82%E4%BD%95%E9%81%8D%E5%8E%86%E4%B8%A4%E9%81%8D%E5%92%8C%E6%89%80%E9%81%87%E5%88%B0%E7%9A%84%E9%97%AE%E9%A2%98/2016/07/13/