A Simple MapReduce Example, Part 2: Sorting Numbers Across Multiple Files



Parallel algorithms can compute far more than simple counts; WordCount is just one relatively easy example. For many others, see the MapReduce-based parallel algorithm design material I uploaded earlier.

Today let's implement a simple sorting example. I'll keep the walkthrough brief, since the detailed flow is already spelled out in the comments of my WordCount write-up.

The input is a set of files (file1, file2, ...) each containing numbers. The idea is to first bucket the numbers by range, say 100-200 in one bucket, 200-300 in the next, and so on, and hand each bucket to its own reducer; once they all finish, concatenating the outputs in order gives the fully sorted result.
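To make the flow concrete, here's a tiny invented example (the numbers are mine, not from the original post). Suppose file1 contains 2, 32, 654 and file2 contains 26, 54, 92, one number per line. The shuffle sorts the keys, and (with a single reducer, so the running rank never restarts) the output pairs each rank with its number:

1	2
2	26
3	32
4	54
5	92
6	654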

I won't belabor the details; here's the code.

Mapper

package sort;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Created by zhangguanlong on 2017/11/15.
 *
 * Emits each parsed number as the key with a count of 1 as the value;
 * the shuffle phase then sorts the IntWritable keys for us.
 */
public class SortMapper extends Mapper<Object, Text, IntWritable, IntWritable> {

    private static IntWritable data = new IntWritable();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each input line holds exactly one integer.
        String line = value.toString();
        data.set(Integer.parseInt(line));
        context.write(data, new IntWritable(1));
    }
}

Reducer

package sort;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * Created by zhangguanlong on 2017/11/15.
 *
 * Keys arrive already sorted; for each occurrence of a number we emit
 * (rank, number), where rank is a running line counter.
 */
public class SortReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {

    // Running rank. Note it restarts at 1 in every reducer task, so the
    // rank is only continuous within one output partition.
    private static IntWritable linenum = new IntWritable(1);

    @Override
    public void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // One output line per occurrence, so duplicate numbers are kept.
        for (IntWritable val : values) {
            context.write(linenum, key);
            linenum = new IntWritable(linenum.get() + 1);
        }
    }
}
Runner. For convenience I put the partitioner in this class too. By proper design principles it should live in its own file, but I wrote it this way because it feels more comfortable. Maybe I'm a fake programmer 0.0

package sort;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Created by zhangguanlong on 2017/11/15.
 */
public class SortRunner {

    /**
     * Range partitioner: splits [0, MaxNumber] into numPartitions equal-width
     * buckets, so partition i holds keys in [bound * i, bound * (i + 1)).
     * Every key in partition i is smaller than every key in partition i + 1,
     * which is what makes the concatenated outputs globally sorted.
     */
    public static class Partition extends Partitioner<IntWritable, IntWritable> {
        @Override
        public int getPartition(IntWritable key, IntWritable value, int numPartitions) {
            int MaxNumber = 65223;                     // assumed upper bound of the input keys
            int bound = MaxNumber / numPartitions + 1; // width of each bucket
            // The original loop never returned the last partition and sent
            // out-of-range keys to partition 0; direct division fixes both
            // (keys are assumed non-negative, per the demo's input rules).
            return Math.min(key.get() / bound, numPartitions - 1);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Sort"); // new Job(conf, "Sort") is deprecated
        job.setJarByClass(SortRunner.class);
        job.setMapperClass(SortMapper.class);
        job.setPartitionerClass(Partition.class);
        job.setReducerClass(SortReducer.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/wc/sort1/"));
        FileOutputFormat.setOutputPath(job, new Path("/wc/sort2/"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
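To see what the bucket arithmetic does, here is a tiny standalone sketch (my own illustration, not part of the original post) that prints where a few sample keys land when numPartitions = 3:

package sort;

// Hypothetical helper, only to illustrate the range-partitioning arithmetic.
public class PartitionDemo {
    public static void main(String[] args) {
        int maxNumber = 65223;
        int numPartitions = 3;
        int bound = maxNumber / numPartitions + 1; // 21742
        int[] samples = {0, 150, 21741, 21742, 50000, 65223};
        for (int k : samples) {
            // Same formula as Partition.getPartition above.
            System.out.println(k + " -> partition " + Math.min(k / bound, numPartitions - 1));
        }
        // Prints partitions 0, 0, 0, 1, 2, 2: lower keys land in lower-numbered
        // partitions, so concatenating the outputs in order is globally sorted.
    }
}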
The paths written in the code point into HDFS, so we first need to upload the input files there.
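For example, assuming the input files are named file1 and file2 (the names here are just for illustration), the upload might look like this:

[hadoop@zhang ~]$ hadoop fs -mkdir -p /wc/sort1
[hadoop@zhang ~]$ hadoop fs -put file1 file2 /wc/sort1/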

Note that the input files start out on the local Linux file system; if this step is unfamiliar, look up the basic Hadoop and Linux shell commands. Then run the job:

[hadoop@zhang ~]$ hadoop jar SProject.jar sort.SortRunner

Note that because the output directory is hardcoded, the job will fail with an error if /wc/sort2 already exists.
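If you hit that error, the simplest fix (assuming the old output is disposable) is to delete the directory before rerunning:

[hadoop@zhang ~]$ hadoop fs -rm -r /wc/sort2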

If the job starts successfully, you'll see the familiar map/reduce progress output scroll by in the terminal.


Also note that the files must contain nothing but numbers, one per line, with no blank lines or stray spaces, since this is only a simple demo.
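If you'd rather have the mapper tolerate blank lines and stray whitespace instead of crashing, a minimal hardening sketch (my addition, not in the original code) would replace SortMapper.map like this:

@Override
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    String line = value.toString().trim(); // strip surrounding whitespace
    if (line.isEmpty()) {
        return;                            // skip blank lines instead of throwing
    }
    data.set(Integer.parseInt(line));
    context.write(data, new IntWritable(1));
}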

Once the job finishes, take a look at the results.
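The reducers write their results to files named part-r-00000, part-r-00001, ... in the output directory, so you can print everything with:

[hadoop@zhang ~]$ hadoop fs -cat /wc/sort2/part-r-*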



over