Hadoop Series: Using Aggregate
1. Introduction to aggregate
aggregate is a package provided by Hadoop for common counting and aggregation computations.
Generally speaking, in order to implement an application using the Map/Reduce model, the developer needs to implement Map and Reduce functions (and possibly a Combine function). However, for many applications related to counting and statistics computing, these functions have very similar characteristics. Hadoop provides a package implementing those patterns. In particular, the package provides a generic mapper class, a reducer class, a combiner class, and a set of built-in value aggregators. It also provides a generic utility class, ValueAggregatorJob, that offers a static function for creating map/reduce jobs.
In Streaming, the aggregate package is commonly used as the reducer to perform aggregation statistics.
2. aggregate class summary
- DoubleValueSum: This class implements a value aggregator that sums up a sequence of double values. (Like LongValueSum, it can be used to compute Top-K statistics.)
- LongValueMax: This class implements a value aggregator that maintains the maximum of a sequence of long values.
- LongValueMin: This class implements a value aggregator that maintains the minimum of a sequence of long values.
- LongValueSum: This class implements a value aggregator that sums up a sequence of long values.
- StringValueMax: This class implements a value aggregator that maintains the biggest of a sequence of strings.
- StringValueMin: This class implements a value aggregator that maintains the smallest of a sequence of strings.
- UniqValueCount: This class implements a value aggregator that dedupes a sequence of objects.
- UserDefinedValueAggregatorDescriptor: This class implements a wrapper that allows plugging in a user-defined value aggregator.
3. Using aggregate in Streaming
Add a control prefix to each line of the mapper task's output, in the form:
function:key\tvalue
e.g.:
LongValueSum:key\tvalue
In addition, set -reducer=aggregate. The reducer then applies the function class named in the prefix to all values sharing the same key; for example, setting the function to LongValueSum sums the values for each key.
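Any of the built-in aggregators from section 2 can be selected this way, line by line. Below is a minimal Python sketch of a mapper emitting several aggregator types at once; the record fields (user, page, latency_ms) are hypothetical illustration data, not from the original article:

```python
# Sketch of aggregate-style mapper output for several built-in
# aggregator functions. The fields (user, page, latency_ms) are
# hypothetical illustration data.
def emit_tokens(user, page, latency_ms):
    # Total hits per page: the aggregate reducer sums the 1s.
    yield "LongValueSum:hits_" + page + "\t1"
    # Worst latency per page: the reducer keeps the maximum.
    yield "LongValueMax:max_latency_" + page + "\t" + str(latency_ms)
    # Distinct visitors per page: the reducer dedupes the values.
    yield "UniqValueCount:users_" + page + "\t" + user

for token in emit_tokens("alice", "index.html", 120):
    print(token)
```

Each emitted line picks its own aggregator, so one mapper can feed several statistics through a single aggregate reducer.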
Below is a Python example:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper myAggregatorForKeyCount.py \
    -reducer aggregate \
    -file myAggregatorForKeyCount.py \
    -jobconf mapred.reduce.tasks=12

The Python program myAggregatorForKeyCount.py:

#!/usr/bin/python
import sys

def generateLongCountToken(id):
    return "LongValueSum:" + id + "\t" + "1"

def main(argv):
    line = sys.stdin.readline()
    try:
        while line:
            line = line[:-1]
            fields = line.split("\t")
            print generateLongCountToken(fields[0])
            line = sys.stdin.readline()
    except EOFError:
        return None

if __name__ == "__main__":
    main(sys.argv)
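To make the reducer's effect concrete, here is a small local simulation of what -reducer aggregate computes for LongValueSum tokens. This is a sketch of the observable result only, under the assumption that the reducer strips the function prefix, groups by key, and sums; it is not Hadoop's actual implementation:

```python
from collections import defaultdict

def simulate_long_value_sum(mapper_lines):
    # Strip the "LongValueSum:" function prefix, group by key,
    # and sum the values -- the net effect of -reducer aggregate
    # for this aggregator type.
    sums = defaultdict(int)
    for line in mapper_lines:
        token, value = line.split("\t")
        function, key = token.split(":", 1)
        if function == "LongValueSum":
            sums[key] += int(value)
    return dict(sums)

# What myAggregatorForKeyCount.py would emit for three input records.
mapper_output = [
    "LongValueSum:apple\t1",
    "LongValueSum:banana\t1",
    "LongValueSum:apple\t1",
]
print(simulate_long_value_sum(mapper_output))  # {'apple': 2, 'banana': 1}
```

Running the streaming job above produces the same per-key counts, written as key\tcount lines in myOutputDir.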