Secondary Sort with MapReduce

Source: Internet  Editor: 程序博客网  Time: 2024/06/07 03:05

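1. Requirements

Given a set of order records, use MapReduce's own sorting mechanism so that, within each order, the record with the largest transaction amount is listed first.
Data source (tab-separated):
Order id	Product id	Amount
Order_0000001	Pdt_01	222.8
Order_0000001	Pdt_05	25.8
Order_0000002	Pdt_03	522.8
Order_0000002	Pdt_04	122.4
Order_0000002	Pdt_05	722.4
Order_0000003	Pdt_01	222.8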

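2. Approach

1) Use the order id together with the transaction amount as a composite key (in the form of a bean). All order records read in the map phase can then be partitioned by order id (via a Partitioner), sorted by amount (via the compareTo method of WritableComparable), and sent to reduce.
2) On the reduce side, use a GroupingComparator to gather the <k,v> pairs that share an order id into one group, then output them directly.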
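The two steps above can be imitated in plain Java without Hadoop. The following is only a rough sketch under stated assumptions: SecondarySortSketch, OrderKey and sortAndGroup are hypothetical names invented for illustration, and an in-memory sort stands in for the shuffle. It orders keys by order id and then by amount in descending order, and collects the amounts per order id in that order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; no Hadoop classes are involved.
class SecondarySortSketch {

    // Hypothetical stand-in for the composite key (orderId, price).
    static final class OrderKey {
        final String orderId;
        final double price;

        OrderKey(String orderId, double price) {
            this.orderId = orderId;
            this.price = price;
        }
    }

    // Ordering of the composite key: orderId ascending, then price descending.
    static final Comparator<OrderKey> SORT_ORDER = (a, b) -> {
        int cmp = a.orderId.compareTo(b.orderId);
        if (cmp == 0) {
            cmp = -Double.compare(a.price, b.price);
        }
        return cmp;
    };

    // Sorts a copy of the keys, then collects the prices per orderId in sorted order.
    static Map<String, List<Double>> sortAndGroup(List<OrderKey> keys) {
        List<OrderKey> sorted = new ArrayList<>(keys);
        sorted.sort(SORT_ORDER);
        Map<String, List<Double>> groups = new LinkedHashMap<>();
        for (OrderKey key : sorted) {
            groups.computeIfAbsent(key.orderId, id -> new ArrayList<>()).add(key.price);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<OrderKey> input = Arrays.asList(
                new OrderKey("Order_0000001", 222.8),
                new OrderKey("Order_0000001", 25.8),
                new OrderKey("Order_0000002", 522.8),
                new OrderKey("Order_0000002", 122.4),
                new OrderKey("Order_0000002", 722.4),
                new OrderKey("Order_0000003", 222.8));
        sortAndGroup(input).forEach((id, prices) -> System.out.println(id + "\t" + prices));
    }
}
```

Because the sort has already placed the largest amount first within each order id, the first element of each list is that order's maximum.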

3. Code

1) The OrderBean class, which implements the WritableComparable interface
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

public class OrderBean implements WritableComparable<OrderBean> {
    private Text orderId;
    private DoubleWritable price;

    // No-arg constructor, needed so the framework can create instances by reflection
    public OrderBean() {
    }

    public OrderBean(Text orderId, DoubleWritable price) {
        set(orderId, price);
    }

    public void set(Text orderId, DoubleWritable price) {
        this.orderId = orderId;
        this.price = price;
    }

    public Text getOrderId() {
        return orderId;
    }

    public void setOrderId(Text orderId) {
        this.orderId = orderId;
    }

    public DoubleWritable getPrice() {
        return price;
    }

    public void setPrice(DoubleWritable price) {
        this.price = price;
    }

    // Serialization: the field order must match readFields
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(orderId.toString());
        out.writeDouble(price.get());
    }

    // Deserialization: read the fields in the order they were written
    @Override
    public void readFields(DataInput in) throws IOException {
        String readUTF = in.readUTF();
        double readDouble = in.readDouble();
        this.orderId = new Text(readUTF);
        this.price = new DoubleWritable(readDouble);
    }

    @Override
    public int compareTo(OrderBean o) {
        int cmp = this.orderId.compareTo(o.getOrderId());
        if (cmp == 0) { // same orderId
            cmp = -this.price.compareTo(o.getPrice()); // descending, largest first
        }
        return cmp;
    }

    @Override
    public String toString() {
        return this.orderId.toString() + "\t" + this.price.get();
    }
}
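The write and readFields methods of OrderBean only work if both handle the fields in the same order. As a rough illustration, the sketch below performs the same writeUTF/writeDouble round trip using only java.io streams. RoundTripSketch and its two methods are hypothetical names, not part of the original job, and plain String/double values stand in for Text/DoubleWritable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Plain-Java illustration only; mirrors the field order used by OrderBean.
class RoundTripSketch {

    // Writes orderId, then price, in the same order as OrderBean.write.
    static byte[] serialize(String orderId, double price) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(orderId);
            out.writeDouble(price);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads orderId, then price, in the same order as OrderBean.readFields,
    // and formats them the way OrderBean.toString does.
    static String deserialize(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String orderId = in.readUTF();
            double price = in.readDouble();
            return orderId + "\t" + price;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = serialize("Order_0000001", 222.8);
        System.out.println(deserialize(data));
    }
}
```

Swapping the two read calls in deserialize would misinterpret the bytes, which is why the read order has to follow the write order.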

2) The Mapper class
// Extract the orderId and the transaction amount, store them in the bean, and emit the bean as the key
static class SecondarySortMapper extends Mapper<LongWritable, Text, OrderBean, NullWritable> {
    OrderBean ob = new OrderBean();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // Fields are tab-separated: orderId, product id, amount
        String[] fields = line.split("\t");
        String orderId = fields[0];
        double price = Double.parseDouble(fields[2]);
        ob.set(new Text(orderId), new DoubleWritable(price));
        context.write(ob, NullWritable.get());
    }
}

3) The Partitioner class
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Partitioner;

// Route each bean to a reduce task according to its orderId
public class SecondarySortPartitioner extends Partitioner<OrderBean, NullWritable> {
    @Override
    public int getPartition(OrderBean key, NullWritable value, int numPartitions) {
        // Beans with the same id are sent to the same partition.
        // The number of partitions equals the number of reduce tasks set by the user.
        return (key.getOrderId().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
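The expression in getPartition masks the hash with Integer.MAX_VALUE before taking the remainder, so the result is never negative. The sketch below applies the same arithmetic in plain Java. PartitionSketch and its methods are hypothetical names, and String.hashCode is used as an assumption in place of Text.hashCode, so the partition numbers it yields need not match those of the actual job.

```java
// Plain-Java illustration only; String.hashCode stands in for Text.hashCode.
class PartitionSketch {

    // Same arithmetic as getPartition, applied to a raw hash value.
    static int partitionForHash(int hash, int numPartitions) {
        // Clearing the sign bit keeps the remainder in [0, numPartitions).
        return (hash & Integer.MAX_VALUE) % numPartitions;
    }

    // Convenience overload for a plain String id.
    static int partitionFor(String orderId, int numPartitions) {
        return partitionForHash(orderId.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        // Without the mask, a negative hash gives a negative remainder in Java.
        System.out.println(-7 % 3);
        // With the mask, the result is a valid partition index.
        System.out.println(partitionForHash(-7, 3));
        // Equal ids always map to the same partition.
        System.out.println(partitionFor("Order_0000001", 3) == partitionFor("Order_0000001", 3));
    }
}
```

Since the partition depends only on the order id, every record of an order reaches the same reduce task regardless of its amount.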

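4) The GroupingComparator class
import org.apache.hadoop.io.WritableComparator;

/*
 * The GroupingComparator decides how records are grouped when reduce is called.
 *
 * How reduce works:
 * The reduce task receives the keys emitted by the map phase together with the values
 * (a collection) merged during the shuffle phase. After it has processed the current
 * <key, values>, it checks whether the key of the next record belongs to the same group
 * as the current key. If it does, the reduce task keeps processing that record within
 * the same group. If it does not, the current reduce call ends.
 *
 * Without grouping by a GroupingComparator, the records of one logical group would each
 * be handled separately in the reduce method, and some data might have to be carried over
 * between calls, which adds complexity.
 * The purpose of setting a GroupingComparator is therefore to reduce that complexity.
 */
public class SecondarySortGC extends WritableComparator {
    // Pass in the class of the bean used as the key, and ask the framework
    // to create instances of it by reflection
    protected SecondarySortGC() {
        super(OrderBean.class, true);
    }

    @Override
    public int compare(Object a, Object b) {
        OrderBean abean = (OrderBean) a;
        OrderBean bbean = (OrderBean) b;
        // When comparing two beans, compare only their orderId
        return abean.getOrderId().compareTo(bbean.getOrderId());
    }
}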
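The grouping behaviour described in the comment above can be sketched in plain Java: walk the already sorted keys and start a new group whenever the comparator reports that the next key differs from the previous one. This is only an illustration under stated assumptions. GroupingSketch, group, BY_ORDER_ID and orderIdOf are hypothetical names, and keys are modelled as "orderId<TAB>amount" strings rather than OrderBean objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java illustration only; it does not use Hadoop's WritableComparator.
class GroupingSketch {

    // Returns the part of the key before the tab, i.e. the order id.
    static String orderIdOf(String key) {
        return key.substring(0, key.indexOf('\t'));
    }

    // Grouping rule that looks only at the order id.
    static final Comparator<String> BY_ORDER_ID =
            (a, b) -> orderIdOf(a).compareTo(orderIdOf(b));

    // Splits an already sorted list into runs of keys that the comparator treats as equal.
    static <K> List<List<K>> group(List<K> sortedKeys, Comparator<? super K> grouping) {
        List<List<K>> groups = new ArrayList<>();
        List<K> current = null;
        K previous = null;
        for (K key : sortedKeys) {
            // A non-zero comparison with the previous key closes the current group.
            if (current == null || grouping.compare(previous, key) != 0) {
                current = new ArrayList<>();
                groups.add(current);
            }
            current.add(key);
            previous = key;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> sorted = Arrays.asList(
                "Order_0000001\t222.8", "Order_0000001\t25.8",
                "Order_0000002\t722.4", "Order_0000002\t522.8", "Order_0000002\t122.4",
                "Order_0000003\t222.8");
        for (List<String> g : group(sorted, BY_ORDER_ID)) {
            System.out.println(g.size() + " record(s), first: " + g.get(0));
        }
    }
}
```

With a rule that compares whole keys instead of only the order id, every distinct record in this sketch ends up in a group of its own, which is the situation a grouping comparator is meant to avoid.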

5) The Reducer class

static class SecondarySortReducer extends Reducer<OrderBean, NullWritable, OrderBean, NullWritable> {
    @Override
    protected void reduce(OrderBean key, Iterable<NullWritable> values, Context context)
            throws IOException, InterruptedException {
        // Emit the key of the group directly; the value carries no data
        context.write(key, NullWritable.get());
    }
}

6) The main method

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf);
    job.setJarByClass(SecondarySort.class);

    // Wire up the mapper, reducer, partitioner and grouping comparator
    job.setMapperClass(SecondarySortMapper.class);
    job.setReducerClass(SecondarySortReducer.class);
    job.setPartitionerClass(SecondarySortPartitioner.class);
    job.setGroupingComparatorClass(SecondarySortGC.class);

    job.setOutputKeyClass(OrderBean.class);
    job.setOutputValueClass(NullWritable.class);

    FileInputFormat.setInputPaths(job, new Path("H:/大数据/mapreduce/secondarysort/input"));
    FileOutputFormat.setOutputPath(job, new Path("H:/大数据/mapreduce/secondarysort/output"));

    // Three reduce tasks, hence three partitions and three output files
    job.setNumReduceTasks(3);
    job.waitForCompletion(true);
}

4. Output

1)part-r-00000
Order_0000003	222.8

2)part-r-00001
Order_0000001	222.8
Order_0000001	25.8

3)part-r-00002
Order_0000002	722.4
Order_0000002	522.8
Order_0000002	122.4


