(Hadoop Study Notes 2) Implementing Secondary Sort in MapReduce

Source: Internet. Editor: 程序博客网. Time: 2024/06/06 09:11

Translated from: blog.ditullio.fr/2015/12/28/hadoop-basics-secondary-sort-in-mapreduce/

Data source: the donor dataset. See the official site for details (http://data.donorschoose.org/open-data/overview/).

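Query requirements: return the donation id, donor state, donor city and total donation amount, where

- donor state: ascending lexicographic order, case-insensitive

- donor city: ascending lexicographic order, case-insensitive

- total: descending numeric order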

Query SQL:

    SELECT donation_id, donor_state, donor_city, total
    FROM donations
    WHERE donor_state IS NOT NULL AND donor_city IS NOT NULL
    ORDER BY lower(donor_state) ASC, lower(donor_city) ASC, total DESC;
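As a reference point for what the MapReduce job has to reproduce, the WHERE and ORDER BY clauses can be sketched in plain Java without Hadoop. The `Row` type, its field names and the sample rows are assumptions made for illustration only; they are not part of the dataset schema.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

/** Plain-Java sketch (no Hadoop) of the WHERE and ORDER BY clauses of the query. */
final class OrderBySketch {

    /** Hypothetical row type; the fields mirror the SQL columns. */
    static final class Row {
        final String donationId;
        final String donorState;
        final String donorCity;
        final double total;

        Row(String donationId, String donorState, String donorCity, double total) {
            this.donationId = donationId;
            this.donorState = donorState;
            this.donorCity = donorCity;
            this.total = total;
        }
    }

    /** lower(donor_state) ASC, lower(donor_city) ASC, total DESC. */
    static final Comparator<Row> ORDER =
            Comparator.comparing((Row r) -> r.donorState.toLowerCase(Locale.ROOT))
                    .thenComparing((Row r) -> r.donorCity.toLowerCase(Locale.ROOT))
                    .thenComparing((Row a, Row b) -> Double.compare(b.total, a.total));

    /** Drops rows with a missing state or city, sorts the rest, returns the ids in order. */
    static List<String> query(List<Row> rows) {
        return rows.stream()
                .filter(r -> r.donorState != null && r.donorCity != null)
                .sorted(ORDER)
                .map(r -> r.donationId)
                .collect(Collectors.toList());
    }

    /** Made-up sample rows, including mixed case and one row without a state. */
    static List<Row> sample() {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row("d1", "TX", "Dallas", 5.0));
        rows.add(new Row("d2", "az", "Phoenix", 10.0));
        rows.add(new Row("d3", "tx", "dallas", 7.0));
        rows.add(new Row("d4", "TX", "Houston", 10.0));
        rows.add(new Row("d5", null, "Austin", 3.0));
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(query(sample())); // [d2, d3, d1, d4]
    }
}
```

The three chained comparators correspond one to one to the three ORDER BY terms; the descending term simply swaps the arguments of `Double.compare`.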

Understanding the shuffle process

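1. map() receives the (key, value) pairs of a split through the InputFormat as its input (inputKey, inputValue). Once one pair has been processed, Mapper.run() calls nextKeyValue() to fetch the next (inputKey, inputValue) from the InputFormat. MapReduce starts one Mapper instance per split, and that instance's map() is called once for every (key, value) pair in the split.

2. map() filters and projects the inputKey/inputValue pairs and sends the results to a partition chosen by the Partitioner. The default is HashPartitioner, which takes the key's hashCode() modulo the number of Reducers to obtain the partition number of each key. This spreads the key/value pairs across partitions as randomly as possible. Note that all key/value pairs with the same key must go to the same partition so that they are processed by the same Reducer.

3. Before being written to disk, the data is sorted by key: by default with the custom key's compareTo(), or with the comparator set through job.setSortComparatorClass() if one is specified. All data of a partition is written to the same temporary file on local disk (not HDFS).

4. Each Reducer fetches its partition from all Mappers. The partition data is either written to local disk or, when small enough, kept in memory. This step is called "shuffling", because the data of all partitions is regrouped.

5. The Reducer merges the data fetched from the partitions (in memory, or spilled to disk if too large) and sorts it again with compareTo() or the sort comparator, producing (key, value) pairs ordered by key.

6. After the merge, the Reducer applies the group comparator (set with job.setGroupingComparatorClass()) to the sorted (key, value) pairs and combines adjacent pairs that it considers equal into (key, list<values>). Note that the group comparator only combines adjacent records and cannot combine records that are not adjacent; in other words, it acts locally, not globally.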
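Steps 2 to 6 can be imitated on in-memory data with a few lines of plain Java. This is only a sketch under simplifying assumptions: keys are Strings, values are Integers, there is no spilling or network transfer, and a `TreeMap` performs the sort and the default grouping in one go. The partition formula is the one HashPartitioner uses; the class and method names are made up for the example.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory imitation of partition -> sort -> group; not Hadoop code. */
final class ShuffleSketch {

    /** Step 2: non-negative hash modulo the number of reducers. */
    static int partitionOf(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /**
     * Steps 2-6: result maps partition number -> (key -> values),
     * with keys in sorted order inside each partition.
     */
    static Map<Integer, TreeMap<String, List<Integer>>> shuffle(
            List<Map.Entry<String, Integer>> mapOutput, int numReducers) {
        Map<Integer, TreeMap<String, List<Integer>>> partitions = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapOutput) {
            int p = partitionOf(kv.getKey(), numReducers);
            partitions.computeIfAbsent(p, x -> new TreeMap<>())
                    .computeIfAbsent(kv.getKey(), x -> new ArrayList<>())
                    .add(kv.getValue());
        }
        return partitions;
    }

    /** Made-up map output: key "b" appears twice and must land in one partition. */
    static List<Map.Entry<String, Integer>> sample() {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        out.add(new SimpleEntry<>("b", 1));
        out.add(new SimpleEntry<>("a", 2));
        out.add(new SimpleEntry<>("b", 3));
        out.add(new SimpleEntry<>("c", 4));
        return out;
    }

    public static void main(String[] args) {
        // "a" and "c" have odd hash codes (97, 99), "b" an even one (98).
        System.out.println(shuffle(sample(), 2)); // {0={b=[1, 3]}, 1={a=[2], c=[4]}}
    }
}
```

Each inner map entry corresponds to one reduce(key, values) call on the Reducer that owns the partition.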

Secondary sort

"Secondary Sort is a technique to control how* a reducer's input pairs are passed to the reduce function.

*how = in which order (Sort Comparator) and which way of grouping values based on key (Group Comparator)"

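In essence, secondary sort uses a custom key (one that also contains the values to sort by) to sort a dataset on several dimensions. This example builds a custom key for the donations dataset, filters and projects the records in map(), and sorts them by donor state, donor city and total.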

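- Custom key: the new key is built from the following 3 fields, where

donor state: String. First sort field, ascending lexicographic order.

donor city: String. Second sort field, ascending lexicographic order.

total: double. Third sort field, descending numeric order.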

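- Sort Comparator

Sort comparator, implementation 1: implement compareTo() in the custom key to compare two keys. this.key.compareTo(arg0.key) returns a value greater than, equal to, or less than zero when this.key is respectively greater than, equal to, or less than arg0.key. Note in particular that this.key.compareTo(arg0.key) produces ascending order, whereas arg0.key.compareTo(this.key) produces descending order.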

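    @Override
    public int compareTo(Donator arg0) {
        // Compare field by field; a later field is consulted only on a tie.
        int result = state.compareTo(arg0.getState());                     // state ascending
        result = (result != 0) ? result : city.compareTo(arg0.getCity()); // city ascending
        // Arguments swapped for descending order; Double.compare returns 0 on a tie,
        // so the donation id below is still reached as the final tie-breaker.
        result = (result != 0) ? result : Double.compare(arg0.getTotal(), total);
        result = (result != 0) ? result : donationid.compareTo(arg0.getDonationid());
        return result;
    }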
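The sign convention is easy to get wrong for the numeric field, so here is a small plain-Java sketch of it. It compares the swapped-argument `Double.compare` form with a ternary of the shape `other > this ? 1 : -1`, which never returns 0 and therefore never lets a later tie-breaker field run. The method names are illustrative only.

```java
/** Sign conventions for an ascending and a descending numeric sort field. */
final class CompareSignSketch {

    /** Ascending: negative when thisTotal sorts before otherTotal. */
    static int ascending(double thisTotal, double otherTotal) {
        return Double.compare(thisTotal, otherTotal);
    }

    /** Descending: same call with the arguments swapped; 0 on a tie. */
    static int descending(double thisTotal, double otherTotal) {
        return Double.compare(otherTotal, thisTotal);
    }

    /** Ternary form: cannot return 0, so equal totals look "different". */
    static int ternaryDescending(double thisTotal, double otherTotal) {
        return otherTotal > thisTotal ? 1 : -1;
    }

    public static void main(String[] args) {
        System.out.println(descending(10.0, 5.0));       // -1: 10.0 sorts before 5.0
        System.out.println(descending(5.0, 5.0));        // 0: the next field decides
        System.out.println(ternaryDescending(5.0, 5.0)); // -1: the next field is skipped
    }
}
```

With the ternary form, comparing two keys with equal totals yields -1 in both directions, which also breaks the symmetry a comparison method is expected to have.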

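Sort comparator, implementation 2: create a subclass of WritableComparator and override compare() to compare two keys. Then register it in run() with job.setSortComparatorClass(Doncomparator.class);. Note in particular that the subclass needs a no-argument constructor that calls super(Donator.class, true), so that WritableComparator can create the key instances it deserializes into while sorting.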

    // Needs org.apache.hadoop.io.WritableComparable and org.apache.hadoop.io.WritableComparator.
    public static class Doncomparator extends WritableComparator {
        public Doncomparator() {
            // true: let WritableComparator create Donator instances for deserialization.
            super(Donator.class, true);
        }

        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            Donator lhs = (Donator) a;
            Donator rhs = (Donator) b;
            int result = lhs.getState().compareTo(rhs.getState());                     // state ascending
            result = (result != 0) ? result : lhs.getCity().compareTo(rhs.getCity()); // city ascending
            // Arguments swapped for descending order; returns 0 on a tie.
            result = (result != 0) ? result : Double.compare(rhs.getTotal(), lhs.getTotal());
            result = (result != 0) ? result : lhs.getDonationid().compareTo(rhs.getDonationid());
            return result;
        }
    }

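Sort comparator, implementation 3: create a WritableComparator subclass that works as a raw comparator. With implementation 2, when the Reducer merges data during the shuffle it has to sort it: the serialized map output fetched from the Map partitions is deserialized into key objects, the keys are compared, and the data is serialized again and written to the Reducer's local disk as Reducer input. This serialization and deserialization costs extra computing resources. Implementation 3 reads the relevant field values straight from the serialized bytes and compares them, which skips the deserialization step and saves processing time.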

    public static class Doncomparator extends WritableComparator {
        public Doncomparator() {
            super(Donator.class, true);
        }

        @Override
        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
            /*
             Serialized layout is a list of (len, value) entries, where len is an
             unsigned short (2 bytes) and value is the field content. The offsets
             below assume the fields were written in the order donationid, state,
             city (Strings) followed by total (double). Numeric fields can be read
             with readInt, readDouble or readFloat; String fields are compared
             directly with compareBytes(b1, s1, l1, b2, s2, l2).
             */
            int strlen11 = readUnsignedShort(b1, s1);
            int strlen21 = readUnsignedShort(b2, s2);
            int strlen12 = readUnsignedShort(b1, s1 + 2 + strlen11);
            int strlen22 = readUnsignedShort(b2, s2 + 2 + strlen21);
            int strlen13 = readUnsignedShort(b1, s1 + 2 + strlen11 + 2 + strlen12);
            int strlen23 = readUnsignedShort(b2, s2 + 2 + strlen21 + 2 + strlen22);
            double double1 = readDouble(b1, s1 + 2 + strlen11 + 2 + strlen12 + 2 + strlen13);
            double double2 = readDouble(b2, s2 + 2 + strlen21 + 2 + strlen22 + 2 + strlen23);
            // state ascending
            int result = compareBytes(b1, s1 + 2 + strlen11 + 2, strlen12,
                                      b2, s2 + 2 + strlen21 + 2, strlen22);
            // city ascending
            result = (result != 0) ? result
                    : compareBytes(b1, s1 + 2 + strlen11 + 2 + strlen12 + 2, strlen13,
                                   b2, s2 + 2 + strlen21 + 2 + strlen22 + 2, strlen23);
            // total descending; Double.compare returns 0 on a tie
            result = (result != 0) ? result : Double.compare(double2, double1);
            // donation id ascending
            result = (result != 0) ? result
                    : compareBytes(b1, s1 + 2, strlen11, b2, s2 + 2, strlen21);
            return result;
        }
    }

In addition, this comparator has to be registered in the Donator class:

    static {
        WritableComparator.define(Donator.class, new Doncomparator());
    }
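The byte-level comparison can be tried outside Hadoop with `java.io.DataOutputStream`, whose writeUTF() also stores a String as a 2-byte length followed by its bytes. The sketch below assumes the key is serialized as donationId, state, city (writeUTF) and then total (writeDouble); that field order is inferred from the offsets used in the comparator, not stated by the source. `compareBytes` here is a hand-written stand-in for the WritableComparator helper of the same name.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

/** Compares serialized keys without turning them back into objects. */
final class RawCompareSketch {

    /** Assumed layout: [len][id][len][state][len][city][8-byte total]. */
    static byte[] serialize(String donationId, String state, String city, double total) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeUTF(donationId);
            out.writeUTF(state);
            out.writeUTF(city);
            out.writeDouble(total);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Unsigned lexicographic comparison of two byte ranges. */
    static int compareBytes(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        for (int i = 0; i < Math.min(l1, l2); i++) {
            int diff = (b1[s1 + i] & 0xff) - (b2[s2 + i] & 0xff);
            if (diff != 0) return diff;
        }
        return l1 - l2;
    }

    /** Returns {idLen, stateOff, stateLen, cityOff, cityLen, totalOff}. */
    static int[] layout(ByteBuffer buf) {
        int idLen = buf.getShort(0) & 0xffff;
        int stateOff = 2 + idLen + 2;                 // first byte of the state text
        int stateLen = buf.getShort(stateOff - 2) & 0xffff;
        int cityOff = stateOff + stateLen + 2;        // first byte of the city text
        int cityLen = buf.getShort(cityOff - 2) & 0xffff;
        int totalOff = cityOff + cityLen;             // first byte of the double
        return new int[] {idLen, stateOff, stateLen, cityOff, cityLen, totalOff};
    }

    /** state ASC, city ASC, total DESC, id ASC, all read from raw bytes. */
    static int compare(byte[] a, byte[] b) {
        ByteBuffer x = ByteBuffer.wrap(a);
        ByteBuffer y = ByteBuffer.wrap(b);
        int[] p = layout(x);
        int[] q = layout(y);
        int r = compareBytes(a, p[1], p[2], b, q[1], q[2]);
        if (r == 0) r = compareBytes(a, p[3], p[4], b, q[3], q[4]);
        if (r == 0) r = Double.compare(y.getDouble(q[5]), x.getDouble(p[5]));
        if (r == 0) r = compareBytes(a, 2, p[0], b, 2, q[0]);
        return r;
    }

    public static void main(String[] args) {
        byte[] seven = serialize("d1", "TX", "Dallas", 7.0);
        byte[] five = serialize("d2", "TX", "Dallas", 5.0);
        System.out.println(compare(seven, five) < 0); // true: 7.0 sorts before 5.0
    }
}
```

Because the bytes are compared as stored, the ordering is case-sensitive; a case-insensitive order would require normalizing the case of state and city before they are serialized.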

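- Partitioner

In the MapReduce framework the default partitioner is HashPartitioner, which takes the hashCode() of the key object modulo the number of Reducers to determine the partition. A custom key should therefore override hashCode(), since the hashCode() inherited from Object is typically derived from the object's identity rather than its content. Partitioning can be customized either by overriding hashCode() or by implementing getPartition() in a custom Partitioner.

Note that a badly defined getPartition() leads to data skew: some Reducers receive far too much data, which hurts MapReduce performance.

With multiple Reducers, all (key, value) pairs with the same key must be sent to the same Reducer. In the simplest case, use a custom Partitioner on the key that acts as a NaturalKeyPartitioner, similar to HashPartitioner.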

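    public static class Dopartition extends Partitioner<Donator, NullWritable> {
        @Override
        public int getPartition(Donator key, NullWritable value, int numPartitions) {
            // Partition on the natural key (state) only, so that all records
            // of one state reach the same Reducer.
            String str = key.getState();
            // Mask the sign bit: hashCode() may be negative, and a negative
            // partition number is invalid.
            return (str.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }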

It has to be registered in run() with job.setPartitionerClass(Dopartition.class);
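One detail worth checking in any hash-based getPartition() is the sign of the result: in Java, `%` keeps the sign of the dividend, so a negative hashCode() yields a negative partition number. A minimal plain-Java sketch of the difference follows; the method names are made up, and the masking expression is the one HashPartitioner uses.

```java
/** Why the hash is masked before taking the remainder. */
final class PartitionIndexSketch {

    /** Plain remainder: negative for a negative hash. */
    static int naiveIndex(int hash, int numPartitions) {
        return hash % numPartitions;
    }

    /** Clears the sign bit first, so the result is always in [0, numPartitions). */
    static int maskedIndex(int hash, int numPartitions) {
        return (hash & Integer.MAX_VALUE) % numPartitions;
    }

    /** Partition for a natural key such as the donor state. */
    static int partitionForState(String state, int numPartitions) {
        return maskedIndex(state.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        System.out.println(naiveIndex(-7, 4));  // -3
        System.out.println(maskedIndex(-7, 4)); // 1
    }
}
```

Masking is preferable to Math.abs(), because Math.abs(Integer.MIN_VALUE) is still negative.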

- Group Comparator

The Group Comparator determines how adjacent values in each Reducer are combined. Suppose Reducer 0 holds the following (key, value) pairs.

Pair   Key (CompositeKey)                            Value (DonationWritable)
A      [state="AZ", city="Phoenix", total=10.00]     DonationWritable@abd30
B      [state="TX", city="Dallas", total=7.00]       DonationWritable@51f123
C      [state="TX", city="Dallas", total=5.00]       DonationWritable@00t87
D      [state="TX", city="Houston", total=10.00]     DonationWritable@057n1

The grouping options and their results are as follows.

Grouping                  Calls to reduce(key, [values])
Default                   reduce(A.key, [A.value])
                          reduce(B.key, [B.value])
                          reduce(C.key, [C.value])
                          reduce(D.key, [D.value])
Group by "state,city"     reduce(A.key, [A.value])
                          reduce(B.key, [B.value, C.value])
                          reduce(D.key, [D.value])
Group by "state"          reduce(A.key, [A.value])
                          reduce(B.key, [B.value, C.value, D.value])

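Explanation:

1) Grouping by (state, city, total): all keys differ, so each pair gets its own reduce() call.

2) Grouping by (state, city): the keys of B and C share the same state and city, so B and C are combined while A and D stay separate.

3) Grouping by state: the keys of B, C and D share the same state, so B, C and D are combined while A stays on its own.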

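The Group Comparator is very useful when aggregating data. First, compareTo() sorts all keys, so that the (key, value) pairs in the Reducer are fully ordered. Then the Group Comparator merges adjacent (key, value) pairs according to the compare() method defined in a WritableComparator.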
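The three grouping results for pairs A to D can be reproduced with a short plain-Java sketch that walks the sorted keys once and starts a new group whenever the grouping comparator reports a difference from the previous key. The `Key` class and the comparator constants are illustrative stand-ins, not the Hadoop types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Grouping adjacent, already sorted keys with different comparators. */
final class GroupingSketch {

    /** Stand-in for the composite key; label is the pair name A..D. */
    static final class Key {
        final String label;
        final String state;
        final String city;
        final double total;

        Key(String label, String state, String city, double total) {
            this.label = label;
            this.state = state;
            this.city = city;
            this.total = total;
        }
    }

    static final Comparator<Key> BY_STATE = Comparator.comparing((Key k) -> k.state);
    static final Comparator<Key> BY_STATE_CITY = BY_STATE.thenComparing((Key k) -> k.city);
    /** Same fields as the sort order: state, city, total descending. */
    static final Comparator<Key> FULL =
            BY_STATE_CITY.thenComparing((Key a, Key b) -> Double.compare(b.total, a.total));

    /** The four pairs from the table, already in sorted order. */
    static final List<Key> SORTED = Arrays.asList(
            new Key("A", "AZ", "Phoenix", 10.00),
            new Key("B", "TX", "Dallas", 7.00),
            new Key("C", "TX", "Dallas", 5.00),
            new Key("D", "TX", "Houston", 10.00));

    /** One inner list per reduce() call; only neighbours are ever compared. */
    static List<List<String>> groupAdjacent(List<Key> sorted, Comparator<Key> grouping) {
        List<List<String>> groups = new ArrayList<>();
        Key previous = null;
        for (Key k : sorted) {
            if (previous == null || grouping.compare(previous, k) != 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(k.label);
            previous = k;
        }
        return groups;
    }

    public static void main(String[] args) {
        System.out.println(groupAdjacent(SORTED, FULL));          // [[A], [B], [C], [D]]
        System.out.println(groupAdjacent(SORTED, BY_STATE_CITY)); // [[A], [B, C], [D]]
        System.out.println(groupAdjacent(SORTED, BY_STATE));      // [[A], [B, C, D]]
    }
}
```

Since only neighbours are compared, grouping works as intended only when the sort order already places the records of one group next to each other.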

If no Group Comparator is set, the sort comparator is used for grouping by default. To set one, call job.setGroupingComparatorClass() in run().

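Secondary sort controls how the input data of each Reducer is sorted and grouped. A global ordering of the data across all Reducers has to be established with TotalOrderPartitioner.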

0 0