Hive Learning: User-Defined Functions (UDF)


    In earlier lessons, as well as in later practice and everyday work, Hive's built-in functions have come up many times: ordinary functions such as cast and lower, and aggregate functions such as max and min. Beyond these built-ins, Hive also lets users define their own functions when the built-in ones do not meet business needs.

    Now let's learn how to write a custom function with the Hive API. To define a Hive UDF, simply extend the org.apache.hadoop.hive.ql.exec.UDF class and define one or more evaluate methods in the subclass. During query processing, each use of the function instantiates one instance of the class, and evaluate is called once for every input row. Let's first look at the definition of Hive's built-in sin function, and then define a function of our own.

/**
 * UDFSin.
 */
@Description(name = "sin",
    value = "_FUNC_(x) - returns the sine of x (x is in radians)",
    extended = "Example:\n "
    + "  > SELECT _FUNC_(0) FROM src LIMIT 1;\n" + "  0")
@VectorizedExpressions({FuncSinLongToDouble.class, FuncSinDoubleToDouble.class})
public class UDFSin extends UDFMath {
  private final DoubleWritable result = new DoubleWritable();

  public UDFSin() {
  }

  /**
   * Take Sine of a.
   */
  public DoubleWritable evaluate(DoubleWritable a) {
    if (a == null) {
      return null;
    } else {
      result.set(Math.sin(a.get()));
      return result;
    }
  }
}

    As the code above shows, the definition of the built-in sin function is quite simple: just extend the UDF class and implement evaluate, and the evaluate implementation itself only calls Math.sin and returns the result. Note that the parameters and return value of evaluate must be types that Hive can serialize, but you do not need to worry much about what callers pass to evaluate, because Hive performs type conversions automatically. In Hive, null is a valid value for any type, which differs from Java, where primitive types are not objects and therefore cannot be null.
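    To make the "one or more evaluate methods" rule and the null handling more concrete, here is a minimal sketch that is not part of the original article; the class name StringOrIntLength and its behavior are made up for illustration, but it follows the same pattern as UDFSin: Writable parameter types, a reused result object, and null in, null out. Hive picks the evaluate overload whose signature matches (or can be converted from) the argument types used in the query.

package learning;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

// Hypothetical example: overloaded evaluate methods let the same
// function accept either a string or an integer argument.
public class StringOrIntLength extends UDF {

  private final IntWritable result = new IntWritable();

  // Called for string arguments, e.g. SELECT my_len(name) FROM t;
  public IntWritable evaluate(Text s) {
    if (s == null) {
      return null;          // null in, null out
    }
    result.set(s.toString().length());
    return result;
  }

  // Called for integer arguments, e.g. SELECT my_len(age) FROM t;
  public IntWritable evaluate(IntWritable n) {
    if (n == null) {
      return null;
    }
    result.set(String.valueOf(n.get()).length());
    return result;
  }
}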

    Next, let's define a function that converts an HTTP status code into its textual description, for example mapping 200 to "OK" (success). The source code is below:

package learning;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

@Description(name = "castStatusToDes",
    value = "_FUNC_(x) - returns the description of x (x is status code of HTTP)",
    extended = "Example:\n "
    + "  > SELECT _FUNC_(200) FROM src LIMIT 1;\n" + "  OK")
public class CastStatusToDes extends UDF {
  public Text evaluate(IntWritable status) {
    if (status == null)
      return null;
    else if (status.get() == 200)
      return new Text("OK");
    else
      return new Text("Others");
  }
}
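    One small difference from UDFSin is that CastStatusToDes allocates a new Text object for every input row, while the built-in function reuses a single result object. If that allocation ever matters, the evaluate method could be written along the lines of the following sketch (a variant for illustration, not the article's code; the class name CastStatusToDesReuse is made up):

package learning;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

// Variant that reuses Text instances instead of allocating one per row,
// mirroring how UDFSin reuses its DoubleWritable result.
public class CastStatusToDesReuse extends UDF {

  private final Text ok = new Text("OK");
  private final Text others = new Text("Others");

  public Text evaluate(IntWritable status) {
    if (status == null) {
      return null;
    }
    return status.get() == 200 ? ok : others;
  }
}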

    The statements below create a temporary function named cast_http in Hive (this assumes the compiled class has already been made available to the session, for example by packaging it into a jar and loading it with ADD JAR or via the auxiliary library path) and then use DESCRIBE FUNCTION to see how cast_http is used:

hive> CREATE  TEMPORARY  FUNCTION cast_http as 'learning.CastStatusToDes';
OK
Time taken: 0.054 seconds
hive> describe function cast_http;
OK
cast_http(x) - returns the description of x (x is status code of HTTP)
Time taken: 0.762 seconds, Fetched: 1 row(s)
hive> describe function extended cast_http;
OK
cast_http(x) - returns the description of x (x is status code of HTTP)
Example:
  > SELECT cast_http(200) FROM src LIMIT 1;
  OK
Time taken: 0.097 seconds, Fetched: 4 row(s)

    The actual execution result is as follows:

hive> SELECT cast_http(200) FROM  ccp  LIMIT 1;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201409091422_0002, Tracking URL = http://hadoop:50030/jobdetails.jsp?jobid=job_201409091422_0002
Kill Command = /home/hadoop/hadoop-1.2.1/libexec/../bin/hadoop job  -kill job_201409091422_0002
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2014-09-09 15:04:15,097 Stage-1 map = 0%,  reduce = 0%
2014-09-09 15:04:25,203 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 6.93 sec
2014-09-09 15:04:29,250 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.93 sec
MapReduce Total cumulative CPU time: 6 seconds 930 msec
Ended Job = job_201409091422_0002
MapReduce Jobs Launched:
Job 0: Map: 2   Cumulative CPU: 6.93 sec   HDFS Read: 9242 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 930 msec
OK
OK
Time taken: 31.337 seconds, Fetched: 1 row(s)

    From the code and test results above, writing a custom function turns out to be less complicated than it might seem; once you understand the rules, writing the code is as easy as routine work. Writing more complex functions, such as aggregate functions, requires more technique and differs considerably from this example; that will be covered later. For how to deploy a UDF, see "Hive学习之部署UDF的四种方法" (Hive Learning: Four Ways to Deploy a UDF).


