setPartitionerClass, setOutputKeyComparatorClass and setOutputValueGroupingComparator

from http://autofei.wordpress.com/2012/10/18/setpartitionerclass-setoutputkeycomparatorclass-and-setoutputvaluegroupingcomparator/

The Partitioner decides which reducer each mapper output record goes to, based on the mapper output key. Normally, each distinct key forms its own group (one Iterator on the reducer side). Sometimes, however, we want different keys to land in the same group. That is the job of the Output Value Grouping Comparator, which decides how mapper output is grouped for the reducer. As a mental model, think of it as the GROUP BY condition in SQL. I will give a detailed example for time series analysis later. The Output Key Comparator is used during the sort stage to order the mapper output keys.

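The above looks pretty straightforward. But there is one thing to remember: if you use setOutputValueGroupingComparator, all the keys in the same group appear as a single key on the reducer side, even though they were different in the mapper output. The reducer is called once per group, and the key it receives is the first key of that group in sort order.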
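The interplay of the three hooks can be imitated in plain Java without a cluster. The following is only a sketch under assumptions: the class `SecondarySortSketch`, the key type `YearTemp` and the sample data are made up for illustration, and none of it is Hadoop API or the code from the download. It partitions keys by year, sorts them by year ascending and number descending, and then keeps the first key of each year group, which is what a reducer would see.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SecondarySortSketch {
    /** Hypothetical composite mapper output key: (year, temp). */
    static final class YearTemp {
        final int year;
        final int temp;
        YearTemp(int year, int temp) { this.year = year; this.temp = temp; }
        @Override public String toString() { return "(" + year + ", " + temp + ")"; }
    }

    /** Partitioner role: pick the reducer from the year only. */
    static int partition(YearTemp key, int numReducers) {
        return (key.year & Integer.MAX_VALUE) % numReducers;
    }

    /** Sort comparator role: year ascending, then temp descending. */
    static final Comparator<YearTemp> SORT =
        Comparator.comparingInt((YearTemp k) -> k.year)
                  .thenComparing(Comparator.comparingInt((YearTemp k) -> k.temp).reversed());

    /** Grouping comparator role: keys with the same year form one group. */
    static final Comparator<YearTemp> GROUP = Comparator.comparingInt((YearTemp k) -> k.year);

    /** One simulated reducer: sort its keys, then keep the first key of every group. */
    static List<YearTemp> reduce(List<YearTemp> keysForThisReducer) {
        List<YearTemp> sorted = new ArrayList<>(keysForThisReducer);
        sorted.sort(SORT);
        List<YearTemp> firstKeys = new ArrayList<>();
        YearTemp previous = null;
        for (YearTemp key : sorted) {
            // A new group starts whenever the grouping comparator reports a difference.
            if (previous == null || GROUP.compare(previous, key) != 0) {
                firstKeys.add(key);
            }
            previous = key;
        }
        return firstKeys;
    }

    /** Full simulated shuffle: partition the mapper output, then reduce each partition. */
    static Map<Integer, List<YearTemp>> run(List<YearTemp> mapperOutput, int numReducers) {
        Map<Integer, List<YearTemp>> partitions = new TreeMap<>();
        for (YearTemp key : mapperOutput) {
            partitions.computeIfAbsent(partition(key, numReducers), p -> new ArrayList<>()).add(key);
        }
        Map<Integer, List<YearTemp>> result = new TreeMap<>();
        partitions.forEach((p, keys) -> result.put(p, reduce(keys)));
        return result;
    }

    public static void main(String[] args) {
        List<YearTemp> mapperOutput = Arrays.asList(
            new YearTemp(1900, 35), new YearTemp(1900, 12), new YearTemp(1901, 7),
            new YearTemp(1901, 40), new YearTemp(1900, 20));
        System.out.println(run(mapperOutput, 2));
    }
}
```

With two simulated reducers, all 1900 keys go to reducer 0 and all 1901 keys to reducer 1. Each reducer ends up with one key per year, (1900, 35) and (1901, 40), because the sort order puts the largest number first in its group.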

You can download the example from: https://www.assembla.com/spaces/autofei_public/documents

record.txt is the input (three fields: year, a random number, place)
MaxTemperatureUsingSecondarySort.java is the main Hadoop code
IntPair.java is the mapper output key object
output.txt is the output

You will notice that in the output, every record of the same year now carries the same number: the maximum for that year.

Note: the code is modified from the book “Hadoop: The Definitive Guide”.

Note: in the new MapReduce API, setOutputKeyComparatorClass corresponds to setSortComparatorClass.

setOutputValueGroupingComparator corresponds to setGroupingComparatorClass.
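For reference, here is a minimal sketch of how the three hooks might be wired together with the new API (org.apache.hadoop.mapreduce.Job). It is not the code from the download: the class names `SecondarySortWiring`, `YearPartitioner`, `SortComparator` and `YearGroupingComparator` are hypothetical. The sketch also assumes the mapper output key is a Text of the form "year<TAB>number" rather than the IntPair used in the original example. Mapper, reducer and input/output paths are left out.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Partitioner;

public class SecondarySortWiring {
    // Assumed key layout for this sketch: Text "year<TAB>number".
    static int field(Object key, int index) {
        return Integer.parseInt(key.toString().split("\t")[index]);
    }

    /** Sends all keys of one year to the same reducer. */
    public static class YearPartitioner extends Partitioner<Text, NullWritable> {
        @Override
        public int getPartition(Text key, NullWritable value, int numPartitions) {
            return (field(key, 0) & Integer.MAX_VALUE) % numPartitions;
        }
    }

    /** Sort order: year ascending, then number descending. */
    public static class SortComparator extends WritableComparator {
        public SortComparator() { super(Text.class, true); }
        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            int byYear = Integer.compare(field(a, 0), field(b, 0));
            return byYear != 0 ? byYear : Integer.compare(field(b, 1), field(a, 1));
        }
    }

    /** Grouping: keys of the same year share one reduce() call. */
    public static class YearGroupingComparator extends WritableComparator {
        public YearGroupingComparator() { super(Text.class, true); }
        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            return Integer.compare(field(a, 0), field(b, 0));
        }
    }

    /** Builds a Job with the three hooks set (new-API method names). */
    public static Job configure(Configuration conf) throws IOException {
        Job job = Job.getInstance(conf, "secondary sort sketch");
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setPartitionerClass(YearPartitioner.class);                // same name in the old API
        job.setSortComparatorClass(SortComparator.class);              // old: setOutputKeyComparatorClass
        job.setGroupingComparatorClass(YearGroupingComparator.class);  // old: setOutputValueGroupingComparator
        return job;
    }
}
```

Parsing the Text key on every comparison keeps the sketch short. A dedicated key type with a raw byte comparator, like the IntPair in the download, would be the usual choice when performance matters.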

0 0