Hive UDF Tutorial (Part 3)


Hive UDF Tutorial (Part 1)

Hive UDF Tutorial (Part 2)

Hive UDF Tutorial (Part 3)

1. UDAF

The previous two parts covered the basic UDF and the UDTF. This part covers the most complex kind, the user-defined aggregate function (UDAF). A UDAF takes zero or more columns over zero or more rows and returns a single value, as sum() and count() do. To implement a UDAF, we need to work with the following classes:

  • org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver
  • org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator
AbstractGenericUDAFResolver checks the input parameters and decides which evaluator to use. In AbstractGenericUDAFResolver only one method needs to be implemented:
public GenericUDAFEvaluator getEvaluator(TypeInfo[] parameters) throws SemanticException;

The main processing logic, however, lives in the evaluator. We need to extend GenericUDAFEvaluator and implement the following methods:

// Both input and output are described by ObjectInspectors
public ObjectInspector init(Mode m, ObjectInspector[] parameters) throws HiveException;
// The AggregationBuffer holds the intermediate results of the aggregation
abstract AggregationBuffer getNewAggregationBuffer() throws HiveException;
// Reset the AggregationBuffer
public void reset(AggregationBuffer agg) throws HiveException;
// Process one input record
public void iterate(AggregationBuffer agg, Object[] parameters) throws HiveException;
// Emit a partial aggregation of the data processed so far
public Object terminatePartial(AggregationBuffer agg) throws HiveException;
// Merge a partial aggregation into the buffer
public void merge(AggregationBuffer agg, Object partial) throws HiveException;
// Emit the final result
public Object terminate(AggregationBuffer agg) throws HiveException;

Before walking through the processing, let's first look at the UDAF enum GenericUDAFEvaluator.Mode. Mode has four values; a sketch of the call sequence each mode implies follows the list:

  1. PARTIAL1: the Mapper phase. Goes from raw data to a partial aggregation; iterate() and terminatePartial() are called.
  2. PARTIAL2: the Combiner phase, which merges map output on the Mapper side. Goes from partial aggregation to partial aggregation; merge() and terminatePartial() are called.
  3. FINAL: the Reducer phase. Goes from partial aggregations to the full aggregation; merge() and terminate() are called.
  4. COMPLETE: this phase appears when the job has only Mappers and no Reducer, so the Mapper emits the final result directly. Goes from raw data to the full aggregation; iterate() and terminate() are called.
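
To make the four modes concrete, below is a rough sketch of the call sequence Hive drives in each mode. The driver method, its parameters, and the sample arrays are invented purely for illustration; only the evaluator methods themselves are real Hive API.

import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator.AggregationBuffer;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator.Mode;

// Illustrative only: a condensed view of the method calls each Mode implies.
public class ModeFlowSketch {

    @SuppressWarnings("deprecation")
    static Object run(GenericUDAFEvaluator eval, Mode mode,
                      Object[][] rows, Object[] partials) throws HiveException {
        AggregationBuffer buf = eval.getNewAggregationBuffer();
        if (mode == Mode.PARTIAL1 || mode == Mode.COMPLETE) {
            // Raw input: every row goes through iterate().
            for (Object[] row : rows) {
                eval.iterate(buf, row);
            }
        } else {
            // PARTIAL2 / FINAL: the input is already partially aggregated, so merge().
            for (Object partial : partials) {
                eval.merge(buf, partial);
            }
        }
        // PARTIAL1 / PARTIAL2 emit a partial result; FINAL / COMPLETE emit the final one.
        return (mode == Mode.PARTIAL1 || mode == Mode.PARTIAL2)
                ? eval.terminatePartial(buf)
                : eval.terminate(buf);
    }
}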

2. Example

Let's now look at an example that collects the values of a column into a list; together with concat_ws() it reproduces the behavior of MySQL's group_concat() function. The code is as follows:

package edu.wzm.hive.udaf;

import java.util.ArrayList;
import java.util.List;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.parse.SemanticException;
import org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.StandardListObjectInspector;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;

@Description(name = "collect",
    value = "_FUNC_(col) - The parameter is a column name. "
        + "The return value is a set of the column.",
    extended = "Example:\n"
        + " > SELECT _FUNC_(col) from src;")
public class GenericUDAFCollect extends AbstractGenericUDAFResolver {

    private static final Log LOG = LogFactory.getLog(GenericUDAFCollect.class.getName());

    public GenericUDAFCollect() {
    }

    @Override
    public GenericUDAFEvaluator getEvaluator(TypeInfo[] parameters) throws SemanticException {
        // Accept exactly one argument, and it must be of a primitive type.
        if (parameters.length != 1) {
            throw new UDFArgumentTypeException(parameters.length - 1,
                "Exactly one argument is expected.");
        }
        if (parameters[0].getCategory() != ObjectInspector.Category.PRIMITIVE) {
            throw new UDFArgumentTypeException(0,
                "Only primitive type arguments are accepted but "
                    + parameters[0].getTypeName() + " was passed as parameter 1.");
        }
        return new GenericUDAFCollectEvaluator();
    }

    @SuppressWarnings("deprecation")
    public static class GenericUDAFCollectEvaluator extends GenericUDAFEvaluator {

        private PrimitiveObjectInspector inputOI;
        private StandardListObjectInspector internalMergeOI;
        private StandardListObjectInspector loi;

        @Override
        public ObjectInspector init(Mode m, ObjectInspector[] parameters) throws HiveException {
            super.init(m, parameters);
            if (m == Mode.PARTIAL1 || m == Mode.COMPLETE) {
                // Input is the raw column value; output is a list of standard objects.
                inputOI = (PrimitiveObjectInspector) parameters[0];
                return ObjectInspectorFactory.getStandardListObjectInspector(
                    (PrimitiveObjectInspector) ObjectInspectorUtils.getStandardObjectInspector(inputOI));
            } else if (m == Mode.PARTIAL2 || m == Mode.FINAL) {
                // Input is a partial aggregation (a list); output is again a list.
                internalMergeOI = (StandardListObjectInspector) parameters[0];
                inputOI = (PrimitiveObjectInspector) internalMergeOI.getListElementObjectInspector();
                loi = ObjectInspectorFactory.getStandardListObjectInspector(inputOI);
                return loi;
            }
            return null;
        }

        static class ArrayAggregationBuffer implements AggregationBuffer {
            List<Object> container;
        }

        @Override
        public AggregationBuffer getNewAggregationBuffer() throws HiveException {
            ArrayAggregationBuffer ret = new ArrayAggregationBuffer();
            reset(ret);
            return ret;
        }

        @Override
        public void reset(AggregationBuffer agg) throws HiveException {
            ((ArrayAggregationBuffer) agg).container = new ArrayList<Object>();
        }

        @Override
        public void iterate(AggregationBuffer agg, Object[] param) throws HiveException {
            Object p = param[0];
            if (p != null) {
                putIntoList(p, (ArrayAggregationBuffer) agg);
            }
        }

        @Override
        public void merge(AggregationBuffer agg, Object partial) throws HiveException {
            ArrayAggregationBuffer myAgg = (ArrayAggregationBuffer) agg;
            ArrayList<Object> partialResult = (ArrayList<Object>) this.internalMergeOI.getList(partial);
            for (Object obj : partialResult) {
                putIntoList(obj, myAgg);
            }
        }

        @Override
        public Object terminate(AggregationBuffer agg) throws HiveException {
            ArrayAggregationBuffer myAgg = (ArrayAggregationBuffer) agg;
            ArrayList<Object> list = new ArrayList<Object>();
            list.addAll(myAgg.container);
            return list;
        }

        @Override
        public Object terminatePartial(AggregationBuffer agg) throws HiveException {
            ArrayAggregationBuffer myAgg = (ArrayAggregationBuffer) agg;
            ArrayList<Object> list = new ArrayList<Object>();
            list.addAll(myAgg.container);
            return list;
        }

        public void putIntoList(Object param, ArrayAggregationBuffer myAgg) {
            Object pCopy = ObjectInspectorUtils.copyToStandardObject(param, this.inputOI);
            myAgg.container.add(pCopy);
        }
    }
}

Compile and package the code, add the resulting jar to the CLASSPATH, create the function collect(), and then query the employee table used in Part 1:

hive (mydb)> ADD jar /root/experiment/hive/hive-0.0.1-SNAPSHOT.jar;
hive (mydb)> CREATE TEMPORARY FUNCTION collect AS "edu.wzm.hive.udaf.GenericUDAFCollect";
hive (mydb)> SELECT collect(name) FROM employee;
Query ID = root_20160117221111_c8b88dc9-170c-4957-b665-15b99eb9655a
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1453096763931_0001, Tracking URL = http://master:8088/proxy/application_1453096763931_0001/
Kill Command = /root/install/hadoop-2.4.1/bin/hadoop job  -kill job_1453096763931_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-01-17 22:11:49,360 Stage-1 map = 0%,  reduce = 0%
2016-01-17 22:12:01,388 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.76 sec
2016-01-17 22:12:16,830 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 2.95 sec
MapReduce Total cumulative CPU time: 2 seconds 950 msec
Ended Job = job_1453096763931_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 2.95 sec   HDFS Read: 1040 HDFS Write: 80 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 950 msec
OK
["John Doe","Mary Smith","Todd Jones","Bill King","Boss Man","Fred Finance","Stacy Accountant"]
Time taken: 44.302 seconds, Fetched: 1 row(s)

Next, combine concat_ws(',', collect(name)) with GROUP BY to get the effect of MySQL's group_concat(); the following query lists the employees that share the same salary:

hive (mydb)> SELECT salary, concat_ws(',', collect(name)) FROM employee GROUP BY salary;
Query ID = root_20160117222121_dedd4981-e050-4aac-81cb-c449639c721b
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1453096763931_0003, Tracking URL = http://master:8088/proxy/application_1453096763931_0003/
Kill Command = /root/install/hadoop-2.4.1/bin/hadoop job  -kill job_1453096763931_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-01-17 22:21:59,627 Stage-1 map = 0%,  reduce = 0%
2016-01-17 22:22:07,207 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.2 sec
2016-01-17 22:22:14,700 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 2.8 sec
MapReduce Total cumulative CPU time: 2 seconds 800 msec
Ended Job = job_1453096763931_0003
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 2.8 sec   HDFS Read: 1040 HDFS Write: 131 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 800 msec
OK
60000.0     Bill King,Stacy Accountant
70000.0     Todd Jones
80000.0     Mary Smith
100000.0    John Doe
150000.0    Fred Finance
200000.0    Boss Man
Time taken: 24.928 seconds, Fetched: 6 row(s)

3. The UDAF pattern

When implementing a UDAF, the main methods to implement are:

  • init(): called when the UDAF's evaluator is instantiated; it specifies the input and output data types.
  • iterate(): processes the input records and writes them into the in-memory aggregation buffer (AggregationBuffer); this is essentially the Mapper work.
  • terminatePartial(): called when a round of iterate() calls has finished; it returns the partial aggregation accumulated so far, similar to a Combiner.
  • merge(): receives the results of terminatePartial() and merges those partial results together.
  • terminate(): returns the final result.
iterate() and terminatePartial() both run on the Mapper side.
merge() and terminate() both run on the Reducer side.
The AggregationBuffer stores the intermediate or final results. By defining our own AggregationBuffer, we can aggregate data of any type; a sketch of such a buffer follows below.
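
As an illustration of that last point, here is a minimal, hypothetical buffer, not part of the collect() example above, that an average-style UDAF might use: instead of a list it keeps a running sum and a row count.

import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;

// Hypothetical example: the buffer shape is entirely up to us.
@SuppressWarnings("deprecation")
public class AverageAggregationBuffer implements GenericUDAFEvaluator.AggregationBuffer {
    double sum;   // running total of the values seen so far
    long count;   // number of non-null rows aggregated so far

    void add(double value) {
        sum += value;
        count++;
    }

    // Returning null mirrors SQL semantics: the average of zero rows is NULL.
    Double currentResult() {
        return count == 0 ? null : sum / count;
    }
}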

The source code is hosted on GitHub: https://github.com/GatsbyNewton/hive_udf


