MapReduce Sort

Source: Internet. Editor: 程序博客网. Time: 2024/06/05 08:38

Sorting in MapReduce falls into four kinds: ordinary sort, partial sort, total (global) sort, and secondary sort.

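1. Ordinary sort

(1) MapReduce has sorting built in: map output is sorted by key during the shuffle.

(2) Text is a poor key type for numeric sorting: Text keys are compared byte by byte, so "10" sorts before "9".

(3) Keys of any type that implements WritableComparable, such as IntWritable and LongWritable, can be sorted; use these when numeric order is needed.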
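The difference between text order and numeric order can be illustrated without Hadoop. The sketch below is plain Java, not Hadoop code: the class name KeyOrderDemo and its methods are made up for illustration, and String stands in for Text (for ASCII digits the two compare the same way).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class KeyOrderDemo {
    // Sorts the values as strings: the order a text key would give.
    static List<String> sortAsText(List<String> values) {
        List<String> copy = new ArrayList<>(values);
        Collections.sort(copy);
        return copy;
    }

    // Parses the values first and sorts them as numbers: the order a numeric key would give.
    static List<Long> sortAsNumbers(List<String> values) {
        List<Long> parsed = new ArrayList<>();
        for (String v : values) {
            parsed.add(Long.parseLong(v));
        }
        Collections.sort(parsed);
        return parsed;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("9", "100", "10");
        System.out.println(sortAsText(input));    // [10, 100, 9]
        System.out.println(sortAsNumbers(input)); // [9, 10, 100]
    }
}
```

This is why the job later in this post parses each line into a LongWritable key instead of passing the Text line through as the key.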

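2. Partial sort

The map and reduce phases sort by key by default. If a total order is not required, simply write the results out: each output file holds records sorted by key, but the files are not ordered relative to one another.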
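To make "sorted within each file, but not across files" concrete, here is a small plain-Java simulation. It is only a sketch: PartialSortDemo is a made-up name, the lists stand in for reducer output files, and the partition formula imitates hash partitioning rather than calling any Hadoop class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class PartialSortDemo {
    // Assigns each key to a partition by hash, then sorts every partition on its own.
    static List<List<Long>> partialSort(List<Long> keys, int numPartitions) {
        List<List<Long>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (long key : keys) {
            int p = (Long.hashCode(key) & Integer.MAX_VALUE) % numPartitions;
            partitions.get(p).add(key);
        }
        for (List<Long> partition : partitions) {
            Collections.sort(partition);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<Long> keys = Arrays.asList(7L, 2L, 9L, 4L, 1L, 6L);
        // Each inner list is sorted, but concatenating them is not: [[2, 4, 6], [1, 7, 9]]
        System.out.println(partialSort(keys, 2));
    }
}
```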

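3. Total (global) sort

(1) A Hadoop job does not produce globally sorted output by default, yet sorting an entire data set is a very common requirement in large-scale data processing.

(2) The most direct way to sort a large data set with Hadoop is to feed the whole input to the mappers, have them pass the records through without processing, and send everything to a single reducer. Hadoop's shuffle then sorts all the data, and the reducer writes it straight out. With only one reducer, however, the final step is not parallel.

(3) The main idea for a parallel total sort is to split the data by key range. When sorting integers, for example, keys in [0, 10000] go to partition 0 and keys in (10000, 20000] go to partition 1. If the data is evenly distributed, every partition holds roughly the same amount, which is the ideal case. Real data is often unevenly distributed (data skew), so fixed ranges like these no longer fit, and some help is needed: a sampler.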
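The range idea, and what skew does to it, can be sketched in plain Java. The class RangePartitionDemo, the boundary array, and the sample keys below are all illustrative assumptions, not part of the original project.

```java
import java.util.Arrays;

class RangePartitionDemo {
    // Returns the index of the first range whose inclusive upper bound covers the key;
    // keys above every bound fall into one extra, last partition.
    static int partitionFor(long key, long[] upperBounds) {
        for (int i = 0; i < upperBounds.length; i++) {
            if (key <= upperBounds[i]) {
                return i;
            }
        }
        return upperBounds.length;
    }

    // Counts how many keys land in each partition.
    static int[] countPerPartition(long[] keys, long[] upperBounds) {
        int[] counts = new int[upperBounds.length + 1];
        for (long key : keys) {
            counts[partitionFor(key, upperBounds)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] upperBounds = {10000, 20000};
        // Skewed keys: most of them are small.
        long[] keys = {5, 17, 42, 300, 9999, 10000, 15000, 25000};
        System.out.println(Arrays.toString(countPerPartition(keys, upperBounds))); // [6, 1, 1]
    }
}
```

Because every key in partition 0 is smaller than every key in partition 1, concatenating the sorted partitions gives a total order. The uneven counts show why fixed boundaries suit only evenly distributed data.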

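Create a file named data that holds one integer per line, in random order. No sample is shown here; the code follows.

Create a project named TestSort with a package com.sort.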

MyMapper.java:

package com.sort;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<LongWritable, Text, LongWritable, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each input line holds one integer; emit it as the key so the shuffle sorts it.
        String[] values = value.toString().split("\\s+");
        context.write(new LongWritable(Long.parseLong(values[0])), NullWritable.get());
    }
}

MyReducer.java:

package com.sort;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class MyReducer extends Reducer<LongWritable, NullWritable, LongWritable, NullWritable> {
    @Override
    protected void reduce(LongWritable key, Iterable<NullWritable> values, Context context)
            throws IOException, InterruptedException {
        // Keys arrive already sorted. Write the key once per occurrence so that
        // duplicate numbers in the input are kept in the sorted output.
        for (NullWritable ignored : values) {
            context.write(key, NullWritable.get());
        }
    }
}

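MyPartioner.java (the file name must match the public class name, which is spelled MyPartioner here and in TestSort):

package com.sort;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Partitioner;

public class MyPartioner extends Partitioner<LongWritable, NullWritable> {
    @Override
    public int getPartition(LongWritable key, NullWritable value, int numPartitions) {
        // Fixed key ranges: key <= 100 -> 0, 100 < key < 1000 -> 1, key >= 1000 -> 2.
        // The modulo keeps the index valid when fewer than three reducers are configured.
        if (key.get() <= 100) {
            return 0 % numPartitions;
        }
        if (key.get() > 100 && key.get() < 1000) {
            return 1 % numPartitions;
        }
        return 2 % numPartitions;
    }
}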

TestSort.java:

package com.sort;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class TestSort {
    public static void main(String[] args)
            throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: TestSort <in> <out>");
            System.exit(2);
        }

        // new Job(conf, name) is deprecated in Hadoop 2 and later,
        // where Job.getInstance(conf, "Test sort") replaces it.
        Job job = new Job(conf, "Test sort");
        job.setJarByClass(TestSort.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);

        // Three reducers, one per key range defined in MyPartioner.
        job.setPartitionerClass(MyPartioner.class);
        job.setNumReduceTasks(3);

        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(NullWritable.class);

        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Package the project as a jar and run it:

$ hadoop jar TestSort.jar com.sort.TestSort /input /output

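(1) Hadoop provides the Sampler interface, which returns a set of sample keys; this is Hadoop's sampler abstraction.

(2) Hadoop provides the TotalOrderPartitioner class, which can be used to implement a total sort.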

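Using the sampler:

// In this fragment conf is presumably a JobConf (old mapred API) and input is
// the job's input Path; neither is declared here.
// Partition with TotalOrderPartitioner, which reads the file produced by the sampler.
conf.setPartitionerClass(TotalOrderPartitioner.class);

InputSampler.RandomSampler<IntWritable, NullWritable> sampler =
        new InputSampler.RandomSampler<IntWritable, NullWritable>(0.1, 10000, 10);

Path partitionFile = new Path(input, "_partitions");
TotalOrderPartitioner.setPartitionFile(conf, partitionFile);
InputSampler.writePartitionFile(conf, sampler);

// The partition file is usually shared through the distributed cache.
URI partitionURI = new URI(partitionFile.toString() + "#_partitions");
DistributedCache.addCacheFile(partitionURI, conf);
DistributedCache.createSymlink(conf);

// As the code shows, sampling runs on the client that submits the job, before the map phase starts.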
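What a sampler buys can be shown with a simplified plain-Java sketch: take a sample of the keys, sort it, and use evenly spaced entries as partition boundaries. This illustrates the idea only and is not Hadoop's actual implementation; the class SampledSplitDemo, its methods, and the sample values are assumptions made for the example.

```java
import java.util.Arrays;

class SampledSplitDemo {
    // Picks numPartitions - 1 split points at evenly spaced ranks of the sorted sample.
    static long[] splitPoints(long[] sample, int numPartitions) {
        long[] sorted = sample.clone();
        Arrays.sort(sorted);
        long[] splits = new long[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            splits[i - 1] = sorted[i * sorted.length / numPartitions];
        }
        return splits;
    }

    // The partition index is the number of split points that are <= key.
    static int partitionFor(long key, long[] splits) {
        int p = 0;
        while (p < splits.length && key >= splits[p]) {
            p++;
        }
        return p;
    }

    public static void main(String[] args) {
        // Skewed sample: most keys are small.
        long[] sample = {5, 17, 42, 300, 9999, 10000, 15000, 25000, 8};
        long[] splits = splitPoints(sample, 3);
        System.out.println(Arrays.toString(splits)); // [42, 10000]

        int[] counts = new int[3];
        for (long key : sample) {
            counts[partitionFor(key, splits)]++;
        }
        System.out.println(Arrays.toString(counts)); // [3, 3, 3]
    }
}
```

Because the boundaries come from the data instead of fixed ranges, the partitions stay balanced even though the keys are skewed.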

4. Secondary sort

Example input:

key1 1

key2 2

key3 3

key2 1

key1 3

...

Intermediate result (composite key, then value):

<key1,1> 1

<key1,3> 3

<key2,1> 1

<key2,2> 2

<key3,3> 3

...

Sorted result:

key1 1

key1 3

key2 1

key2 2

key3 3

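...

(1) MapReduce sorts by key by default, so the value to be ordered is folded into a composite key, as in the intermediate result above.

(2) Main idea:

① Override Partitioner to partition on the original key only, so that all records sharing that key reach the same reducer; this forms the first level of the sort.

② Implement a WritableComparator with your own comparison logic (original key first, then value); this forms the second level of the sort.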
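The two steps can be tried out in plain Java on the example data above. This is a sketch of the logic only, without Hadoop classes: SecondarySortDemo, Pair, partitionFor, and COMPOSITE_ORDER are hypothetical names standing in for a composite key, a custom Partitioner, and a custom comparator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class SecondarySortDemo {
    // Hypothetical composite key: the original key plus the value to order by.
    static final class Pair {
        final String first;
        final int second;

        Pair(String first, int second) {
            this.first = first;
            this.second = second;
        }

        @Override
        public String toString() {
            return first + " " + second;
        }
    }

    // Step 1: partition on the original key only, so equal keys share a partition.
    static int partitionFor(Pair key, int numPartitions) {
        return (key.first.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Step 2: compare the original key first, then the value.
    static final Comparator<Pair> COMPOSITE_ORDER =
            Comparator.comparing((Pair p) -> p.first).thenComparingInt(p -> p.second);

    // Sorts a copy of the records with the composite order and renders each as a line.
    static List<String> secondarySort(List<Pair> records) {
        List<Pair> copy = new ArrayList<>(records);
        copy.sort(COMPOSITE_ORDER);
        List<String> lines = new ArrayList<>();
        for (Pair p : copy) {
            lines.add(p.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        List<Pair> input = Arrays.asList(
                new Pair("key1", 1), new Pair("key2", 2), new Pair("key3", 3),
                new Pair("key2", 1), new Pair("key1", 3));
        for (String line : secondarySort(input)) {
            System.out.println(line);
        }
    }
}
```

Running main prints the five records in the order listed under the sorted result. In a real job the same comparison would live in the composite key's compareTo or in a WritableComparator registered on the job.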

0 0