Hadoop Series: Using Aggregate
1. Introduction to aggregate
aggregate is a package provided by Hadoop for common counting and aggregation computations.
Generally speaking, in order to implement an application using the Map/Reduce model, the developer needs to implement Map and Reduce functions (and possibly a Combine function). However, for many applications related to counting and statistics computing, these functions have very similar characteristics. Hadoop provides a package implementing those patterns. In particular, the package provides a generic mapper class, a reducer class, a combiner class, and a set of built-in value aggregators. It also provides a generic utility class, ValueAggregatorJob, that offers a static function for creating map/reduce jobs.
In Streaming, the aggregate package is commonly used as the reducer to perform aggregation statistics.
2. aggregate class summary
- DoubleValueSum: This class implements a value aggregator that sums up a sequence of double values. (Like LongValueSum, it can be used to compute Top-K statistics.)
- LongValueMax: This class implements a value aggregator that maintains the maximum of a sequence of long values.
- LongValueMin: This class implements a value aggregator that maintains the minimum of a sequence of long values.
- LongValueSum: This class implements a value aggregator that sums up a sequence of long values.
- StringValueMax: This class implements a value aggregator that maintains the biggest of a sequence of strings.
- StringValueMin: This class implements a value aggregator that maintains the smallest of a sequence of strings.
- UniqValueCount: This class implements a value aggregator that dedupes a sequence of objects.
- UserDefinedValueAggregatorDescriptor: This class implements a wrapper that allows plugging in a user-defined value aggregator.
3. Using aggregate in Streaming
Add a control prefix to each line of the mapper task's output, in the form:
function:key\tvalue
e.g.:
LongValueSum:key\tvalue
In addition, set -reducer=aggregate. The reducer then applies the function class named in the prefix to all values sharing the same key; for example, setting the function to LongValueSum sums the values for each key.
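Any of the built-in aggregators from section 2 can be selected this way, line by line. Below is a minimal Python sketch of a mapper emitting several aggregator types at once; the record fields (user, page, latency_ms) are hypothetical illustration data, not from the original article:

```python
# Sketch of aggregate-style mapper output for several built-in
# aggregator functions. The fields (user, page, latency_ms) are
# hypothetical illustration data.
def emit_tokens(user, page, latency_ms):
    # Total hits per page: the aggregate reducer sums the 1s.
    yield "LongValueSum:hits_" + page + "\t1"
    # Worst latency per page: the reducer keeps the maximum.
    yield "LongValueMax:max_latency_" + page + "\t" + str(latency_ms)
    # Distinct visitors per page: the reducer dedupes the values.
    yield "UniqValueCount:users_" + page + "\t" + user

for token in emit_tokens("alice", "index.html", 120):
    print(token)
```

Each emitted line picks its own aggregator, so one mapper can feed several statistics through a single aggregate reducer.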
Below is a Python example:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper myAggregatorForKeyCount.py \
    -reducer aggregate \
    -file myAggregatorForKeyCount.py \
    -jobconf mapred.reduce.tasks=12

The Python program myAggregatorForKeyCount.py:

#!/usr/bin/python
import sys

def generateLongCountToken(id):
    return "LongValueSum:" + id + "\t" + "1"

def main(argv):
    line = sys.stdin.readline()
    try:
        while line:
            line = line[:-1]
            fields = line.split("\t")
            print generateLongCountToken(fields[0])
            line = sys.stdin.readline()
    except EOFError:
        return None

if __name__ == "__main__":
    main(sys.argv)
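To make the reducer's effect concrete, here is a small local simulation of what -reducer aggregate computes for LongValueSum tokens. This is a sketch of the observable result only, under the assumption that the reducer strips the function prefix, groups by key, and sums; it is not Hadoop's actual implementation:

```python
from collections import defaultdict

def simulate_long_value_sum(mapper_lines):
    # Strip the "LongValueSum:" function prefix, group by key,
    # and sum the values -- the net effect of -reducer aggregate
    # for this aggregator type.
    sums = defaultdict(int)
    for line in mapper_lines:
        token, value = line.split("\t")
        function, key = token.split(":", 1)
        if function == "LongValueSum":
            sums[key] += int(value)
    return dict(sums)

# What myAggregatorForKeyCount.py would emit for three input records.
mapper_output = [
    "LongValueSum:apple\t1",
    "LongValueSum:banana\t1",
    "LongValueSum:apple\t1",
]
print(simulate_long_value_sum(mapper_output))  # {'apple': 2, 'banana': 1}
```

Running the streaming job above produces the same per-key counts, written as key\tcount lines in myOutputDir.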