Hive UDF Tutorial (Part 3)
Hive UDF Tutorial (Part 1)
Hive UDF Tutorial (Part 2)
Hive UDF Tutorial (Part 3)
1. UDAF
The previous two parts covered basic UDFs and UDTFs. This part covers the most complex kind: user-defined aggregate functions (UDAFs). A UDAF accepts zero or more columns from zero or more rows and returns a single value, like sum() or count(). To implement a UDAF, we need to work with the following classes:
- org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver
- org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator
The resolver checks the argument types and returns an appropriate evaluator; the only method to override on it is:

```java
public GenericUDAFEvaluator getEvaluator(TypeInfo[] parameters) throws SemanticException;
```

The main processing logic, however, lives in the Evaluator. We need to extend GenericUDAFEvaluator and implement the following methods:
```java
// Input and output types are both described by ObjectInspectors
public ObjectInspector init(Mode m, ObjectInspector[] parameters) throws HiveException;

// The AggregationBuffer holds intermediate aggregation state
abstract AggregationBuffer getNewAggregationBuffer() throws HiveException;

// Reset the AggregationBuffer for reuse
public void reset(AggregationBuffer agg) throws HiveException;

// Process one input row
public void iterate(AggregationBuffer agg, Object[] parameters) throws HiveException;

// Return the partial aggregation of the data processed so far
public Object terminatePartial(AggregationBuffer agg) throws HiveException;

// Merge a partial aggregation into the buffer
public void merge(AggregationBuffer agg, Object partial) throws HiveException;

// Return the final result
public Object terminate(AggregationBuffer agg) throws HiveException;
```
Before diving into the implementation, let's look at the UDAF enum GenericUDAFEvaluator.Mode. Mode has four values:
- PARTIAL1: the Mapper phase. From raw data to partial aggregation; iterate() and terminatePartial() are called.
- PARTIAL2: the Combiner phase, which merges Mapper output on the Mapper side. From partial aggregation to partial aggregation; merge() and terminatePartial() are called.
- FINAL: the Reducer phase. From partial aggregation to full aggregation; merge() and terminate() are called.
- COMPLETE: this mode appears when the MapReduce job has only Mappers and no Reducers, so the Mapper emits the final result directly. From raw data to full aggregation; iterate() and terminate() are called.
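The way these four modes chain the evaluator methods can be hard to visualize, so here is a minimal plain-Java sketch (no Hive dependencies) that simulates the PARTIAL1 → FINAL flow for a "collect"-style aggregation. The method names mirror GenericUDAFEvaluator, and the aggregation buffer is simply a List<String>; this is an illustration of the lifecycle, not real Hive code.

```java
import java.util.ArrayList;
import java.util.List;

public class ModeFlowSketch {

    // PARTIAL1 (Mapper): iterate() over raw rows, then emit terminatePartial()
    static List<String> mapperPhase(List<String> rawRows) {
        List<String> buffer = new ArrayList<>();        // getNewAggregationBuffer()
        for (String row : rawRows) {
            buffer.add(row);                            // iterate()
        }
        return buffer;                                  // terminatePartial()
    }

    // PARTIAL2/FINAL (Combiner/Reducer): merge() the partial results,
    // then emit terminatePartial() or terminate()
    static List<String> reducePhase(List<List<String>> partials) {
        List<String> buffer = new ArrayList<>();        // getNewAggregationBuffer()
        for (List<String> partial : partials) {
            buffer.addAll(partial);                     // merge()
        }
        return buffer;                                  // terminate()
    }

    public static void main(String[] args) {
        // Two mappers each produce a partial aggregation...
        List<String> partial1 = mapperPhase(List.of("John Doe", "Mary Smith"));
        List<String> partial2 = mapperPhase(List.of("Todd Jones"));
        // ...and the reducer merges them into the final result
        List<String> result = reducePhase(List.of(partial1, partial2));
        System.out.println(result); // [John Doe, Mary Smith, Todd Jones]
    }
}
```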
2. Example
Let's walk through an example that collects the values of a column into a list; combined with the concat_ws() function, it reproduces the behavior of MySQL's group_concat(). The code is as follows:
```java
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.parse.SemanticException;
import org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.StandardListObjectInspector;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;

@Description(name = "collect",
        value = "_FUNC_(col) - The parameter is a column name. "
                + "The return value is a set of the column.",
        extended = "Example:\n"
                + "  > SELECT _FUNC_(col) from src;")
public class GenericUDAFCollect extends AbstractGenericUDAFResolver {

    private static final Log LOG = LogFactory.getLog(GenericUDAFCollect.class.getName());

    public GenericUDAFCollect() {
    }

    @Override
    public GenericUDAFEvaluator getEvaluator(TypeInfo[] parameters)
            throws SemanticException {
        if (parameters.length != 1) {
            throw new UDFArgumentTypeException(parameters.length - 1,
                    "Exactly one argument is expected.");
        }
        if (parameters[0].getCategory() != ObjectInspector.Category.PRIMITIVE) {
            throw new UDFArgumentTypeException(0,
                    "Only primitive type arguments are accepted but "
                            + parameters[0].getTypeName() + " was passed as parameter 1.");
        }
        return new GenericUDAFCollectEvaluator();
    }

    @SuppressWarnings("deprecation")
    public static class GenericUDAFCollectEvaluator extends GenericUDAFEvaluator {

        private PrimitiveObjectInspector inputOI;
        private StandardListObjectInspector internalMergeOI;
        private StandardListObjectInspector loi;

        @Override
        public ObjectInspector init(Mode m, ObjectInspector[] parameters)
                throws HiveException {
            super.init(m, parameters);
            if (m == Mode.PARTIAL1 || m == Mode.COMPLETE) {
                // Raw input: a primitive column; output: a list of that primitive
                inputOI = (PrimitiveObjectInspector) parameters[0];
                return ObjectInspectorFactory.getStandardListObjectInspector(
                        (PrimitiveObjectInspector) ObjectInspectorUtils
                                .getStandardObjectInspector(inputOI));
            } else if (m == Mode.PARTIAL2 || m == Mode.FINAL) {
                // Input is a partial aggregation: a list of primitives
                internalMergeOI = (StandardListObjectInspector) parameters[0];
                inputOI = (PrimitiveObjectInspector) internalMergeOI.getListElementObjectInspector();
                loi = ObjectInspectorFactory.getStandardListObjectInspector(inputOI);
                return loi;
            }
            return null;
        }

        static class ArrayAggregationBuffer implements AggregationBuffer {
            List<Object> container;
        }

        @Override
        public AggregationBuffer getNewAggregationBuffer() throws HiveException {
            ArrayAggregationBuffer ret = new ArrayAggregationBuffer();
            reset(ret);
            return ret;
        }

        @Override
        public void reset(AggregationBuffer agg) throws HiveException {
            ((ArrayAggregationBuffer) agg).container = new ArrayList<Object>();
        }

        @Override
        public void iterate(AggregationBuffer agg, Object[] param)
                throws HiveException {
            Object p = param[0];
            if (p != null) {
                putIntoList(p, (ArrayAggregationBuffer) agg);
            }
        }

        @Override
        public void merge(AggregationBuffer agg, Object partial)
                throws HiveException {
            ArrayAggregationBuffer myAgg = (ArrayAggregationBuffer) agg;
            ArrayList<Object> partialResult =
                    (ArrayList<Object>) this.internalMergeOI.getList(partial);
            for (Object obj : partialResult) {
                putIntoList(obj, myAgg);
            }
        }

        @Override
        public Object terminatePartial(AggregationBuffer agg)
                throws HiveException {
            ArrayAggregationBuffer myAgg = (ArrayAggregationBuffer) agg;
            ArrayList<Object> list = new ArrayList<Object>();
            list.addAll(myAgg.container);
            return list;
        }

        @Override
        public Object terminate(AggregationBuffer agg) throws HiveException {
            ArrayAggregationBuffer myAgg = (ArrayAggregationBuffer) agg;
            ArrayList<Object> list = new ArrayList<Object>();
            list.addAll(myAgg.container);
            return list;
        }

        public void putIntoList(Object param, ArrayAggregationBuffer myAgg) {
            Object pCopy = ObjectInspectorUtils.copyToStandardObject(param, this.inputOI);
            myAgg.container.add(pCopy);
        }
    }
}
```
Compile the code, package it into a jar, add the jar to Hive's CLASSPATH, and create the function collect(). We again use the employee table from Part 1:

```
hive (mydb)> ADD jar /root/experiment/hive/hive-0.0.1-SNAPSHOT.jar;
hive (mydb)> CREATE TEMPORARY FUNCTION collect AS "edu.wzm.hive.udaf.GenericUDAFCollect";
hive (mydb)> SELECT collect(name) FROM employee;
Query ID = root_20160117221111_c8b88dc9-170c-4957-b665-15b99eb9655a
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1453096763931_0001, Tracking URL = http://master:8088/proxy/application_1453096763931_0001/
Kill Command = /root/install/hadoop-2.4.1/bin/hadoop job -kill job_1453096763931_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-01-17 22:11:49,360 Stage-1 map = 0%, reduce = 0%
2016-01-17 22:12:01,388 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.76 sec
2016-01-17 22:12:16,830 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.95 sec
MapReduce Total cumulative CPU time: 2 seconds 950 msec
Ended Job = job_1453096763931_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1  Cumulative CPU: 2.95 sec  HDFS Read: 1040 HDFS Write: 80 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 950 msec
OK
["John Doe","Mary Smith","Todd Jones","Bill King","Boss Man","Fred Finance","Stacy Accountant"]
Time taken: 44.302 seconds, Fetched: 1 row(s)
```
Next, combine concat_ws(',', collect(name)) with GROUP BY to get the effect of MySQL's group_concat(). The following query lists employees with the same salary together:

```
hive (mydb)> SELECT salary, concat_ws(',', collect(name)) FROM employee GROUP BY salary;
Query ID = root_20160117222121_dedd4981-e050-4aac-81cb-c449639c721b
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1453096763931_0003, Tracking URL = http://master:8088/proxy/application_1453096763931_0003/
Kill Command = /root/install/hadoop-2.4.1/bin/hadoop job -kill job_1453096763931_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-01-17 22:21:59,627 Stage-1 map = 0%, reduce = 0%
2016-01-17 22:22:07,207 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.2 sec
2016-01-17 22:22:14,700 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.8 sec
MapReduce Total cumulative CPU time: 2 seconds 800 msec
Ended Job = job_1453096763931_0003
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1  Cumulative CPU: 2.8 sec  HDFS Read: 1040 HDFS Write: 131 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 800 msec
OK
60000.0     Bill King,Stacy Accountant
70000.0     Todd Jones
80000.0     Mary Smith
100000.0    John Doe
150000.0    Fred Finance
200000.0    Boss Man
Time taken: 24.928 seconds, Fetched: 6 row(s)
```
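To make the final step concrete, here is a small plain-Java illustration of what concat_ws(',', collect(name)) computes for one salary group: collect() gathers the names into a list, and concat_ws() joins the elements with the separator, much like Java's String.join for non-null strings. This is an analogy sketch, not Hive's actual implementation.

```java
import java.util.List;

public class ConcatWsSketch {

    // Joins list elements with a separator, analogous to Hive's
    // concat_ws(sep, array) for non-null string elements
    static String concatWs(String sep, List<String> values) {
        return String.join(sep, values);
    }

    public static void main(String[] args) {
        // The 60000.0 salary group from the query output above
        List<String> collected = List.of("Bill King", "Stacy Accountant");
        System.out.println(concatWs(",", collected)); // Bill King,Stacy Accountant
    }
}
```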
3. The UDAF Pattern
When implementing a UDAF, the key methods to implement are:
- init(): called when the UDAF's Evaluator is instantiated; it specifies the input and output data types.
- iterate(): processes an input row and puts it into the in-memory aggregation buffer (AggregationBuffer); this is the typical Mapper-side work.
- terminatePartial(): called when a round of iterate() calls finishes; it returns the partial aggregation so far, similar to a Combiner.
- merge(): receives the results of terminatePartial() and merges these partial aggregations together.
- terminate(): returns the final result.
Both merge() and terminate() run on the Reducer side.
The AggregationBuffer stores intermediate and final results. Because we define our own AggregationBuffer, we can aggregate data of any type.
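Since the AggregationBuffer is a class you define yourself, it can carry whatever intermediate state an aggregate needs, not just a list. As a hypothetical illustration (not part of the tutorial's code), an average-style UDAF would keep a running sum and count in its buffer, because an average cannot be merged from partial averages alone:

```java
public class AverageBufferSketch {

    // What the inner AggregationBuffer class of an avg-style UDAF might hold:
    // the running sum and count, which merge correctly across partials
    static class AvgBuffer {
        double sum;
        long count;
    }

    // iterate(): fold one input value into the buffer
    static void iterate(AvgBuffer buf, double value) {
        buf.sum += value;
        buf.count++;
    }

    // merge(): combine a partial aggregation into the buffer
    static void merge(AvgBuffer into, AvgBuffer partial) {
        into.sum += partial.sum;
        into.count += partial.count;
    }

    // terminate(): compute the final result from the buffer
    static double terminate(AvgBuffer buf) {
        return buf.count == 0 ? 0.0 : buf.sum / buf.count;
    }

    public static void main(String[] args) {
        // Two "mappers" each build a partial buffer...
        AvgBuffer m1 = new AvgBuffer();
        iterate(m1, 60000.0);
        iterate(m1, 70000.0);
        AvgBuffer m2 = new AvgBuffer();
        iterate(m2, 80000.0);
        // ...and the "reducer" merges them before terminating
        AvgBuffer finalBuf = new AvgBuffer();
        merge(finalBuf, m1);
        merge(finalBuf, m2);
        System.out.println(terminate(finalBuf)); // 70000.0
    }
}
```

Note the design point: merging (sum, count) pairs is associative, so the same buffer works in the Combiner and Reducer phases alike.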
The full source code is hosted on GitHub: https://github.com/GatsbyNewton/hive_udf