MapReduce Secondary Sorting

Source: Internet | Editor: 程序博客网 | Time: 2024/06/03 20:32

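In this post we continue implementing the algorithms described in the book Data-Intensive Text Processing with MapReduce. The other posts in this series are:

Data-Intensive Text Processing with MapReduce
Data-Intensive Text Processing with MapReduce - Local Aggregation, Part II
Building a Co-Occurrence Matrix with MapReduce (translator's note: for Co-Occurrence Matrix, see Wikipedia or Baidu)
MapReduce Algorithms - Order Inversion

This post covers secondary sorting, which the book introduces in Chapter 3. As you know, Hadoop automatically sorts the data emitted by the Mappers by key before handing it to the Reducers. But what if we also want the values sorted? The answer is secondary sorting. With a small change to the format of the key object, secondary sorting lets the value take part in the sort phase as well. There are two ways to implement it.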

Chinese translation by AlfredCheung, 3 years ago

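The first approach has the Reducer buffer all the values for a given key and then sort them inside the Reducer. Because the Reducer must hold every value for that key in memory, this can lead to an out-of-memory error.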
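
As a rough illustration of this first approach, the sketch below buffers every value for one key in a list and sorts it in memory. It is plain Java with no Hadoop dependency; the class and method names (InReducerSortSketch, sortValuesForKey) are made up for this example, and in a real job the Iterable would be the values handed to reduce().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not from the original post: mimics what a Reducer
// would do if it sorted the values of one key by itself.
final class InReducerSortSketch {

    // Copies every value into memory, then sorts ascending.
    // The whole group must fit in the heap, which is the weakness
    // described above.
    static List<Integer> sortValuesForKey(Iterable<Integer> values) {
        List<Integer> buffered = new ArrayList<>();
        for (Integer value : values) {
            buffered.add(value);
        }
        Collections.sort(buffered);
        return buffered;
    }

    public static void main(String[] args) {
        // Temperatures for a single year-month key, in arrival order.
        List<Integer> januaryTemps = Arrays.asList(-78, -206, 11, -150);
        System.out.println(sortValuesForKey(januaryTemps));
    }
}
```

The memory cost grows with the largest group, so a single heavily populated key is enough to exhaust the heap.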

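The second approach adds part of the value, or the whole value, to the original key to form a composite key. Each approach has its strengths: the first may be faster (at the risk of running out of memory), while the second hands the sorting over to the MapReduce framework, which fits the Hadoop/MapReduce design better. This post uses the second approach. We will write a Partitioner so that all records with the same key (the original key, without the added part) go to the same Reducer, and a Comparator so that the data is grouped by the original key once it reaches the Reducer.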

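Value to Key Conversion

Creating the composite key is straightforward. We first work out which parts of the value should count toward the sort and add them to the key. Then we change the key class's compareTo method, or the Comparator class, so that sorting uses the composite key. To illustrate, we revisit the weather dataset and add the temperature to the key (the original key is the year and month combined). The result is a list of the coldest day of each month. This example was inspired by the secondary sort example in Hadoop: The Definitive Guide. There may be better ways to reach this goal, but it is good enough to demonstrate secondary sorting.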
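
The TemperaturePair class itself is not listed in this post. The following is a minimal sketch of what such a composite key might look like, written in plain Java so that it runs without Hadoop. The class name CompositeKeySketch, its fields, and the roundTrip helper are assumptions; a real Hadoop key would implement WritableComparable, whose write and readFields methods take the same java.io.DataOutput and java.io.DataInput types used here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical composite key: natural key (yearMonth) plus the value
// part (temperature) that should influence the sort order.
final class CompositeKeySketch implements Comparable<CompositeKeySketch> {
    private String yearMonth = "";
    private int temperature;

    CompositeKeySketch() { }

    CompositeKeySketch(String yearMonth, int temperature) {
        this.yearMonth = yearMonth;
        this.temperature = temperature;
    }

    String getYearMonth() { return yearMonth; }
    int getTemperature() { return temperature; }

    // Same shape as Hadoop's Writable.write: serialize both fields.
    void write(DataOutput out) throws IOException {
        out.writeUTF(yearMonth);
        out.writeInt(temperature);
    }

    // Same shape as Hadoop's Writable.readFields: read the fields back
    // in the order they were written.
    void readFields(DataInput in) throws IOException {
        yearMonth = in.readUTF();
        temperature = in.readInt();
    }

    // Natural key first, then temperature ascending.
    @Override
    public int compareTo(CompositeKeySketch other) {
        int result = yearMonth.compareTo(other.yearMonth);
        return result != 0 ? result : Integer.compare(temperature, other.temperature);
    }

    // Serializes the key to bytes and reads it back into a new instance.
    static CompositeKeySketch roundTrip(CompositeKeySketch key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            key.write(new DataOutputStream(bytes));
            CompositeKeySketch copy = new CompositeKeySketch();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Two keys for the same month compare by temperature, while keys for different months compare by year-month alone, which is the ordering the rest of the post relies on.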

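Mapper Code

Our Mapper code already combines the year and month into the key; now the temperature has to go in as well. Since the value has moved into the key, the Mapper emits a NullWritable as its value instead of the temperature.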

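import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SecondarySortingTemperatureMapper extends Mapper<LongWritable, Text, TemperaturePair, NullWritable> {

    // Reused across map() calls to avoid allocating a new key per record.
    private TemperaturePair temperaturePair = new TemperaturePair();
    private NullWritable nullValue = NullWritable.get();
    // Marker used in the data for a missing temperature reading.
    private static final int MISSING = 9999;

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        // Year and month sit at a fixed offset in each record.
        String yearMonth = line.substring(15, 21);

        int tempStartPosition = 87;

        // Skip a leading plus sign before parsing the number.
        if (line.charAt(tempStartPosition) == '+') {
            tempStartPosition += 1;
        }

        int temp = Integer.parseInt(line.substring(tempStartPosition, 92));

        if (temp != MISSING) {
            temperaturePair.setYearMonth(yearMonth);
            temperaturePair.setTemperature(temp);
            context.write(temperaturePair, nullValue);
        }
    }
}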
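
To see the fixed-width parsing in isolation, here is a small plain-Java sketch that applies the same offsets to a synthetic record. The record built by syntheticRecord is a made-up placeholder that only fills the columns the Mapper reads; it is not a real line from the weather dataset, and the class name is hypothetical.

```java
import java.util.Arrays;

final class RecordParsingSketch {
    static final int MISSING = 9999;

    // Builds a 93-character filler line with yearMonth at columns 15-20
    // and a sign plus four digits at columns 87-91 (0-based).
    static String syntheticRecord(String yearMonth, String signedTemp) {
        char[] chars = new char[93];
        Arrays.fill(chars, '0');
        yearMonth.getChars(0, 6, chars, 15);
        signedTemp.getChars(0, 5, chars, 87);
        return new String(chars);
    }

    static String parseYearMonth(String line) {
        return line.substring(15, 21);
    }

    // Mirrors the Mapper: skip a leading '+', then parse up to column 92.
    static int parseTemperature(String line) {
        int start = 87;
        if (line.charAt(start) == '+') {
            start += 1;
        }
        return Integer.parseInt(line.substring(start, 92));
    }

    public static void main(String[] args) {
        String line = syntheticRecord("190101", "-0206");
        System.out.println(parseYearMonth(line) + " " + parseTemperature(line));
    }
}
```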

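So far we have added the temperature to the key, which sets the stage for secondary sorting. Now we need some code that takes the temperature into account when sorting. There are two options: write a Comparator class, or change the compareTo method of the TemperaturePair class (TemperaturePair implements WritableComparable). The former is generally recommended, but since TemperaturePair was written specifically to demonstrate secondary sorting, we chose the latter here.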

    @Override
    public int compareTo(TemperaturePair temperaturePair) {
        // Order by year-month first; break ties on temperature.
        int compareValue = this.yearMonth.compareTo(temperaturePair.getYearMonth());
        if (compareValue == 0) {
            compareValue = temperature.compareTo(temperaturePair.getTemperature());
        }
        return compareValue;
    }

To sort in descending order, just multiply the result by -1. With the sorting part done, the Partitioner comes next.
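
One detail worth spelling out: negating the whole result reverses the year-month order as well as the temperature order. If only the temperatures should run from warmest to coldest within each month, the reversal can be limited to the temperature comparison. The sketch below shows that variant with a plain Java comparator; the class and its Entry type are hypothetical stand-ins for TemperaturePair.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class DescendingTemperatureSketch {

    // Hypothetical stand-in for the composite key.
    static final class Entry {
        final String yearMonth;
        final int temperature;

        Entry(String yearMonth, int temperature) {
            this.yearMonth = yearMonth;
            this.temperature = temperature;
        }

        @Override
        public String toString() {
            return yearMonth + " " + temperature;
        }
    }

    // Year-month ascending, temperature descending within a month.
    static final Comparator<Entry> MONTH_ASC_TEMP_DESC = (a, b) -> {
        int result = a.yearMonth.compareTo(b.yearMonth);
        if (result == 0) {
            // Swapping the arguments reverses only this part of the order.
            result = Integer.compare(b.temperature, a.temperature);
        }
        return result;
    };

    // Sorts a copy of the input and renders each entry as a text line.
    static List<String> sortedLines(List<Entry> entries) {
        List<Entry> copy = new ArrayList<>(entries);
        copy.sort(MONTH_ASC_TEMP_DESC);
        List<String> lines = new ArrayList<>();
        for (Entry entry : copy) {
            lines.add(entry.toString());
        }
        return lines;
    }
}
```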

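Partitioner Code

To make sure that only the original key decides which Reducer a record is sent to (translator's note: the value part of the composite key is used only for sorting), we need to write a Partitioner. The code is simple: only yearMonth goes into the calculation of the target Reducer.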

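import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Partitioner;

public class TemperaturePartitioner extends Partitioner<TemperaturePair, NullWritable> {
    @Override
    public int getPartition(TemperaturePair temperaturePair, NullWritable nullWritable, int numPartitions) {
        // Partition on yearMonth only. Masking the sign bit keeps the result
        // non-negative even when hashCode() is negative.
        return (temperaturePair.getYearMonth().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}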

With the Partitioner in place, records with the same year and month are guaranteed to reach the same Reducer. Next we need to deal with grouping.
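
The partitioning rule can be tried out without a cluster. This sketch applies a partition function that looks at the natural key only; it masks the sign bit before taking the remainder, in the way Hadoop's default HashPartitioner does, because a negative partition number is not valid. The class name and the use of plain String keys are assumptions made for the example.

```java
final class NaturalKeyPartitionSketch {

    // Chooses a partition from the natural key alone, so records that
    // differ only in temperature always land in the same partition.
    static int partitionFor(String yearMonth, int numPartitions) {
        return (yearMonth.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int reducers = 4;
        for (String yearMonth : new String[] {"190101", "190102", "190103"}) {
            System.out.println(yearMonth + " -> partition " + partitionFor(yearMonth, reducers));
        }
    }
}
```

Because the temperature never enters the calculation, every record of a given month maps to the same partition number regardless of its reading.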

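Grouping Comparator

When the data reaches the Reducer it is grouped by key. We need grouping to depend only on the original key, which we do with a custom GroupingComparator. This Comparator uses only the yearMonth field of the TemperaturePair class.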

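import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class YearMonthGroupingComparator extends WritableComparator {
    public YearMonthGroupingComparator() {
        // true: have WritableComparator create key instances for deserialization.
        super(TemperaturePair.class, true);
    }

    @Override
    public int compare(WritableComparable tp1, WritableComparable tp2) {
        // Compare yearMonth only, so all temperatures of a month form one group.
        TemperaturePair temperaturePair = (TemperaturePair) tp1;
        TemperaturePair temperaturePair2 = (TemperaturePair) tp2;
        return temperaturePair.getYearMonth().compareTo(temperaturePair2.getYearMonth());
    }
}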
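
To see how sorting and grouping interact, the following plain-Java sketch imitates the shuffle for a single Reducer: records are sorted by the composite key, then walked in order and split into groups whenever the year-month changes. Keeping only the first record of each group yields the coldest temperature per month. This is a simulation under simplifying assumptions (everything in memory, hypothetical class and method names), not how Hadoop implements the shuffle.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class SecondarySortSimulation {

    // Hypothetical stand-in for the composite key.
    static final class Pair {
        final String yearMonth;
        final int temperature;

        Pair(String yearMonth, int temperature) {
            this.yearMonth = yearMonth;
            this.temperature = temperature;
        }
    }

    // Sort order: year-month, then temperature ascending.
    static final Comparator<Pair> SORT_ORDER =
            Comparator.comparing((Pair p) -> p.yearMonth)
                      .thenComparingInt(p -> p.temperature);

    // Returns one "yearMonth temperature" line per group, using the
    // first (coldest) record of each group.
    static List<String> coldestPerMonth(List<Pair> mapOutput) {
        List<Pair> sorted = new ArrayList<>(mapOutput);
        sorted.sort(SORT_ORDER);

        List<String> result = new ArrayList<>();
        String currentGroup = null;
        for (Pair pair : sorted) {
            // Grouping looks at the year-month only.
            if (!pair.yearMonth.equals(currentGroup)) {
                currentGroup = pair.yearMonth;
                result.add(pair.yearMonth + " " + pair.temperature);
            }
        }
        return result;
    }
}
```

The sort comparator decides the order of records inside a group, while the grouping rule decides where one group ends and the next begins; the two are deliberately different.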

Results

Here are the results of our secondary sort:

new-host-2:sbin bbejeck$ hdfs dfs -cat secondary-sort/part-r-00000
190101 -206
190102 -333
190103 -272
190104 -61
190105 -33
190106 44
190107 72
190108 44
190109 17
190110 -33
190111 -217
190112 -300

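Conclusion

Sorting by value may not be an everyday need, but it is a good tool to have when the need arises. Looking at how the Partitioner and the grouping comparator work has also given us some insight into Hadoop's internals. Thank you for your time.

Resources

Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer
Hadoop: The Definitive Guide by Tom White
Source code and tests for this post
Hadoop API
MRUnit, for unit testing Apache Hadoop MapReduce jobs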

0 0

MapReduce 二级排序

AlfredCheung翻译于 3年前

AlfredCheung翻译于 3年前

从值到key的转换

AlfredCheung翻译于 3年前

Mapper代码

AlfredCheung翻译于 3年前

AlfredCheung翻译于 3年前

Partitioner代码

AlfredCheung翻译于 3年前

分组比较器

AlfredCheung翻译于 3年前

结果

结论

资源

AlfredCheung
翻译于 3年前

AlfredCheung
翻译于 3年前

AlfredCheung
翻译于 3年前

AlfredCheung
翻译于 3年前

AlfredCheung
翻译于 3年前

AlfredCheung
翻译于 3年前

AlfredCheung
翻译于 3年前