Hadoop Streaming 实战: aggregate
来源:互联网 发布:淘宝客推广在哪里展示 编辑:程序博客网 时间:2024/05/21 03:25
1. aggregate概述
aggregate是Hadoop提供的一个软件包,其用来做一些通用的计算和聚合。
Generally speaking, in order to implement an application using Map/Reduce model, the developer needs to implement Map and Reduce functions (and possibly Combine function). However, for a lot of applications related to counting and statistics computing, these functions have very similarcharacteristics. This provides a package implementing those patterns. In particular,the package provides a generic mapper class,a reducer class and a combiner class, and a set of built-in value aggregators.It also provides a generic utility class, ValueAggregatorJob, that offers a static function that creates map/reduce jobs。
在Streaming中通常使用Aggregate包作为reducer来做聚合统计。
2. aggregate class summary
DoubleValueSum
This class implements a value aggregator that sums up a sequence of double values.
LongValueMax
This class implements a value aggregator that maintain the maximum of a sequence of long values.
LongValueMin
This class implements a value aggregator that maintain the minimum of a sequence of long values.
LongValueSum
This class implements a value aggregator that sums up a sequence of long values.
StringValueMax
This class implements a value aggregator that maintain the biggest of a sequence of strings.
StringValueMin
This class implements a value aggregator that maintain the smallest of a sequence of strings.
UniqValueCount
This class implements a value aggregator that dedupes a sequence of objects.
UserDefinedValueAggregatorDescriptor
This class implements a wrapper for a user defined value aggregator descriptor.
ValueAggregatorBaseDescriptor
This class implements the common functionalities of the subclasses of ValueAggregatorDescriptor class.
ValueAggregatorCombiner<K1 extends WritableComparable,V1 extends Writable>
This class implements the generic combiner of Aggregate.
ValueAggregatorJob
This is the main class for creating a map/reduce job using Aggregate framework.
ValueAggregatorJobBase<K1 extends WritableComparable,V1 extends Writable>
This abstract class implements some common functionalities of the the generic mapper, reducer and combiner classes of Aggregate.
ValueAggregatorMapper<K1 extends WritableComparable,V1 extends Writable>
This class implements the generic mapper of Aggregate.
ValueAggregatorReducer<K1 extends WritableComparable,V1 extends Writable>
This class implements the generic reducer of Aggregate.
ValueHistogram
This class implements a value aggregator that computes the histogram of a sequence of strings
3. streaming中使用aggregate
在mapper任务的输出中添加控制,如下:
function:key\tvalue
eg:
LongValueSum:key\tvalue
此外,置-reducer = aggregate。此时,Reducer使用aggregate中对应的function类对相同key的value进行操作,例如,设置function为LongValueSum则将对每个键值对应的value求和。
4. 实例1(value求和)
测试文件test.txt
a 15 1
a 17 1
a 18 1
a 19 1
a 19 1
a 19 1
a 19 1
b 20 1
c 15 1
c 15 1
d 16 1
a 16 1
mapper程序:
#include <iostream>
#include <string>
using namespace std;
int main(int argc, char** argv)
{
string a,b,c;
while(cin >> a >> b >> c)
{
cout << "LongValueSum:"<< a << "\t" << b << endl;
}
return 0;
}
运行:
$hadoop streaming -input /app/test.txt -output /app/test -mapper ./mapper -reducer aggregate -file mapper -jobconf mapred.reduce.tasks=1 -jobconf mapre.job.name="test"
输出:
a 142
b 20
c 30
d 16
5. 实例2(强大ValueHistogram)
ValueHistogram是aggregate package中最强大的类,基于每个键,对其value做以下统计
1)唯一值个数
2)最小值个数
3)中位置个数
4)最大值个数
5)平均值个数
6)标准方差
上述例子基础上修改mapper.cpp为:
#include <iostream>
#include <string>
using namespace std;
int main(int argc, char** argv)
{
string a,b,c;
while(cin >> a >> b >> c)
{
cout << "ValueHistogram:"<< a << "\t" << b << endl;
}
return 0;
}
运行命令同上
运行结果:
a 5 1 1 4 1.6 1.2
b 1 1 1 1 1.0 0.0
c 1 2 2 2 2.0 0.0
d 1 1 1 1 1.0 0.0
- Hadoop Streaming 实战: aggregate
- Hadoop Streaming 实战: aggregate
- hadoop Streaming之aggregate
- Hadoop --Aggregate 包使用 Streaming
- hadoop学习;Streaming,aggregate;combiner
- Hadoop Streaming 实战: grep
- Hadoop Streaming 实战: grep
- Hadoop Streaming 实战: bash脚本
- Hadoop Streaming 实战: 传递环境变量
- Hadoop Streaming 实战: bash脚本
- Hadoop Streaming 实战: 传递环境变量
- Hadoop Streaming 实战: 二次排序
- Hadoop Streaming 实战: 传递环境变量
- Hadoop Streaming 实战: 文件分发与打包
- Hadoop Streaming 实战: 输出文件分割
- Hadoop Streaming 实战: 实用Partitioner类KeyFieldBasedPartitioner
- Hadoop Streaming 实战: 文件分发与打包
- Hadoop Streaming 实战: 输出文件分割
- JAVA缓存
- hadoop平台运行python代码
- 15 内部类和ArrayList的实现
- java 虚拟机
- 1151. 魔板[Speical judge]
- Hadoop Streaming 实战: aggregate
- 利用dispatch_once创建单例
- 从创业失败中学到的七条教训
- 图像的边缘提取
- LDAP连接域控,并且从域控中获取信息的列子
- Hadoop运行流程分析
- 《Windows核心编程》读书笔记——作业
- 蝴蝶会飞的人生伤感日志发布:人生就像一场戏,谁在为我编排
- js操作XML