Hive_4. DML -- Functions & Operators & Transactions

Today I'd like to summarize the commonly used functions in Hive, along with transaction operations.

To further manipulate data, we can use expressions, operators, and functions in Hive to transform it. The Hive wiki provides a comprehensive list of expressions and functions (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF). Rather than repeating all of them here, I'll share the ones I find most useful in day-to-day work.

Hive defines relational operators, arithmetic operators, logical operators, and complex type constructors. The relational, logical, and arithmetic operators behave like the standard operators in SQL/Java, so I won't repeat them here. For operators on complex data types, refer to the partitioned-table example in Hive_4. DML -- 数据转换.
Functions in Hive fall into the following categories:

  • Mathematical functions: mainly used to perform mathematical calculations, such as RAND() and E().
  • Collection functions: used to find the size, keys, and values of complex types, such as SIZE(Array<T>).
  • Type conversion functions: the CAST and BINARY functions are commonly used to convert one data type into another (see the sketch after this list).
  • Date functions: used for date-related calculations, such as YEAR(string date) and MONTH(string date).
  • Conditional functions: used to check a condition and return a value accordingly, such as COALESCE, IF, and CASE WHEN.
  • String functions: used to perform string operations, such as UPPER(string A) and TRIM(string A).
  • Aggregate functions: used to perform aggregations, such as SUM() and COUNT(*).
  • Table-generating functions: used to turn a single input row into multiple output rows, such as EXPLODE(MAP) and JSON_TUPLE(jsonString, k1, k2, ...).
  • Custom functions (UDFs): extend Hive's functionality with user-written Java code; these will be covered in a dedicated Hive UDF post.
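
Since CAST and BINARY don't appear in the examples later on, here is a minimal sketch of type conversion run against the same employee table; the results in the comments are what I would expect, not captured from a session:

--A quick sketch of CAST and BINARY (expected results shown in comments)
SELECT CAST('100' AS INT) + 1  AS int_result,   -- string '100' cast to INT, then + 1 -> 101
       CAST(3.14 AS STRING)    AS str_result,   -- double cast to STRING -> '3.14'
       BINARY('Toronto')       AS bin_result    -- string cast to BINARY
FROM employee LIMIT 1;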

You can list Hive's built-in functions and UDFs by typing the following commands in the Hive CLI:

<span style="font-size:12px;">SHOW FUNCTIONS; --列出所有函数DESCRIBE FUNCTION <function_name>; --具体函数的细节 DESCRIBE FUNCTION EXTENDED <function_name>; --更多细节 </span>

Next, let's look at how to use these functions, along with some tips and tricks:

Tips for complex data type functions:
The SIZE() function calculates the size of a MAP, an ARRAY, or a nested MAP/ARRAY. It returns -1 when the size is unknown. For example:
<span style="font-size:12px;">jdbc:hive2://> SELECT work_place, skills_score, depart_title . . . . . . .> FROM employee;+----------------------+--------------------+-------------------------------------+| work_place | skills_score | depart_title |+----------------------+--------------------+-------------------------------------+|["Montreal","Toronto"]|{"DB":80} |{"Product":["Developer","Lead"]} ||["Montreal"] |{"Perl":85} |{"Product":["Lead"],"Test":["Lead"]} ||["New York"] |{"Python":80} |{"Test":["Lead"],"COE":["Architect"]}||["Vancouver"] |{"Sales":89,"HR":94}|{"Sales":["Lead"]} |+----------------------+--------------------+-------------------------------------+4 rows selected (0.084 seconds)jdbc:hive2://> SELECT SIZE(work_place) AS array_size, . . . . . . .> SIZE(skills_score) AS map_size, . . . . . . .> SIZE(depart_title) AS complex_size, . . . . . . .> SIZE(depart_title["Product"]) AS nest_size . . . . . . .> FROM employee;+-------------+-----------+---------------+------------+| array_size | map_size | complex_size | nest_size |+-------------+-----------+---------------+------------+| 2 | 1 | 1 | 2 || 1 | 1 | 2 | 1 || 1 | 1 | 2 | -1 || 1 | 2 | 1 | -1 |+-------------+-----------+---------------+------------+4 rows selected (0.062 seconds)</span>

The ARRAY_CONTAINS statement checks whether an array contains a given value, returning TRUE or FALSE. The SORT_ARRAY statement sorts the contents of an array in ascending order. For example:

<span style="font-size:12px;">jdbc:hive2://> SELECT ARRAY_CONTAINS(work_place, 'Toronto') AS is_Toronto,. . . . . . .> SORT_ARRAY(work_place) AS sorted_array . . . . . . .> FROM employee;+-------------+-------------------------+| is_toronto | sorted_array |+-------------+-------------------------+| true | ["Montreal","Toronto"] || false | ["Montreal"] || false | ["New York"] || false | ["Vancouver"] |+-------------+-------------------------+4 rows selected (0.059 seconds)</span>
Tips for date functions:
FROM_UNIXTIME(UNIX_TIMESTAMP()) works like SYSDATE in Oracle: it returns the current date and time of the Hive server. For example:
<span style="font-size:12px;">jdbc:hive2://> SELECT . . . . . . .> FROM_UNIXTIME(UNIX_TIMESTAMP()) AS current_time . . . . . . .> FROM employee LIMIT 1;+----------------------+| current_time |+----------------------+| 2014-11-15 19:28:29 |+----------------------+1 row selected (0.047 seconds)</span>

UNIX_TIMESTAMP() can be used to compare two dates, or to convert a string date into seconds so it can be sorted with ORDER BY, e.g. ORDER BY UNIX_TIMESTAMP(string_date, 'dd-MM-yyyy') (a sketch of the ORDER BY use follows the example below). For example:

--Compare two different dates
<span style="font-size:12px;">jdbc:hive2://> SELECT (UNIX_TIMESTAMP ('2015-01-21 18:00:00') . . . . . . .> - UNIX_TIMESTAMP('2015-01-10 11:00:00'))/60/60/24 . . . . . . .> AS daydiff FROM employee LIMIT 1;+---------------------+| daydiff |+---------------------+| 11.291666666666666 |+---------------------+1 row selected (0.093 seconds)</span>

The TO_DATE statement strips the hours, minutes, and seconds from a date-time value. It is handy for checking whether a date-time column falls within a date range, e.g. WHERE TO_DATE(update_datetime) BETWEEN '2014-11-01' AND '2014-11-30'. For example:

<span style="font-size:12px;">jdbc:hive2://> SELECT TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP())) . . . . . . .> AS current_date FROM employee LIMIT 1;+---------------+| current_date |+---------------+| 2014-11-15 |+---------------+1 row selected (0.153 seconds)</span>
CASE with different data types:
Before Hive 0.13.0, the data types of the THEN and ELSE branches of a CASE expression had to match; otherwise an exception was raised, for example when the expected type was bigint but the ELSE branch returned an int. In those versions the problem could be worked around with IF, as sketched below.
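
A minimal sketch of the IF workaround described above, assuming IF tolerates the mixed branch types as stated (the values are illustrative only):

--Pre-0.13.0 workaround: express the branching with IF instead of CASE WHEN ... THEN ... ELSE
SELECT IF(1 IS NULL, 'TRUE', 0) AS if_result FROM employee LIMIT 1;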
Hive 0.13.0 removed this restriction, as the following example shows:
<span style="font-size:12px;">jdbc:hive2://> SELECT . . . . . . .> CASE WHEN 1 IS NULL THEN 'TRUE' ELSE 0 END . . . . . . .> AS case_result FROM employee LIMIT 1;+--------------+| case_result |+--------------+| 0 |+--------------+1 row selected (0.063 seconds)</span>
Parsing & searching tips -- LATERAL VIEW: splitting one row into multiple rows
The LATERAL VIEW statement works with a user-defined table-generating function (UDTF), such as EXPLODE(), to flatten a map or array column. EXPLODE can be used together with LATERAL VIEW on both array and map types. If the exploded column contains a NULL value, the whole row is filtered out, which is what happens to Steven's row in the example below.
Since Hive 0.12.0, LATERAL VIEW OUTER can be used to keep the rows that would otherwise be dropped because of NULL values.
In practice, LATERAL VIEW is most often combined with the SPLIT and EXPLODE UDTFs to break one column into multiple rows, which can then be aggregated. The examples in this post already cover the typical uses of LATERAL VIEW; you can also refer to the following post.
Word count in Hive with LATERAL VIEW explode:
http://www.codesec.net/view/149295.html
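
A minimal word-count sketch along those lines, assuming a hypothetical table docs(line STRING) that holds one line of text per row:

--Split each line into words, explode to one word per row, then count each word
SELECT word, COUNT(*) AS cnt
FROM docs
LATERAL VIEW explode(split(line, ' ')) t AS word
GROUP BY word;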
--Prepare the data
<span style="font-size:12px;">jdbc:hive2://> INSERT INTO TABLE employee. . . . . . .> SELECT 'Steven' AS name, array(null) as work_place,. . . . . . .> named_struct("sex","Male","age",30) as sex_age, . . . . . . .> map("Python",90) as skills_score, . . . . . . .> map("R&D",array('Developer')) as depart_title. . . . . . .> FROM employee LIMIT 1;No rows affected (28.187 seconds)jdbc:hive2://> SELECT name, work_place, skills_score . . . . . . .> FROM employee;+----------+-------------------------+-----------------------+| name | work_place | skills_score |+----------+-------------------------+-----------------------+| Michael | ["Montreal","Toronto"] | {"DB":80} || Will | ["Montreal"] | {"Perl":85} || Shelley | ["New York"] | {"Python":80} || Lucy | ["Vancouver"] | {"Sales":89,"HR":94} || Steven | NULL | {"Python":90} |+----------+-------------------------+-----------------------+5 rows selected (0.053 seconds)</span>

--LATERAL VIEW drops the rows whose exploded value is NULL
<span style="font-size:12px;">jdbc:hive2://> SELECT name, workplace, skills, score. . . . . . .> FROM employee. . . . . . .> LATERAL VIEW explode(work_place) wp    -- 在这里会将数组字段拆分成多行(若含有null值,直接过滤掉整行). . . . . . .> AS workplace. . . . . . .> LATERAL VIEW explode(skills_score) ss . . . . . . .> AS skills, score;+----------+------------+---------+--------+| name | workplace | skills | score |+----------+------------+---------+--------+| Michael | Montreal | DB | 80 || Michael | Toronto | DB | 80 || Will | Montreal | Perl | 85 || Shelley | New York | Python | 80 || Lucy | Vancouver | Sales | 89 || Lucy | Vancouver | HR | 94 |+----------+------------+---------+--------+6 rows selected (24.733 seconds)</span>

--LATERAL VIEW OUTER keeps the rows whose exploded value is NULL
<span style="font-size:12px;">jdbc:hive2://> SELECT name, workplace, skills, score. . . . . . .> FROM employee. . . . . . .> LATERAL VIEW OUTER explode(work_place) wp   -- 在这里会将数组字段拆分成多行. . . . . . .> AS workplace. . . . . . .> LATERAL VIEW explode(skills_score) ss . . . . . . .> AS skills, score;+----------+------------+---------+--------+| name | workplace | skills | score |+----------+------------+---------+--------+| Michael | Montreal | DB | 80 || Michael | Toronto | DB | 80 || Will | Montreal | Perl | 85 || Shelley | New York | Python | 80 || Lucy | Vancouver | Sales | 89 || Lucy | Vancouver | HR | 94 || Steven | None | Python | 90 |+----------+------------+---------+--------+7 rows selected (24.573 seconds)</span>

String reversal function: REVERSE
The REVERSE statement reverses the order of the characters in a string. The SPLIT statement tokenizes a string using a specified separator. The following example extracts the file name from a Linux path:

<span style="font-size:12px;">jdbc:hive2://> SELECT. . . . . . .> reverse(split(reverse('/home/user/employee.txt'),'/')[0]). . . . . . .> AS linux_file_name FROM employee LIMIT 1;+------------------+| linux_file_name |+------------------+| employee.txt |+------------------+1 row selected (0.1 seconds)</span>

While explode outputs each element of an array or map as a separate row, collect_set and collect_list do the opposite: they gather the values from every row into a single array. collect_set removes duplicates from the result, whereas collect_list does not. For example:

<span style="font-size:12px;">jdbc:hive2://> SELECT collect_set(work_place[0]) . . . . . . .> AS flat_workplace0 FROM employee;+--------------------------------------+| flat_workplace0 |+--------------------------------------+| ["Vancouver","Montreal","New York"] |+--------------------------------------+1 row selected (43.455 seconds)jdbc:hive2://> SELECT collect_list(work_place[0]) . . . . . . .> AS flat_workplace0 FROM employee;+-------------------------------------------------+| flat_workplace0 |+-------------------------------------------------+| ["Montreal","Montreal","New York","Vancouver"] |+-------------------------------------------------+1 row selected (45.488 seconds)</span>
Virtual columns:
In Hive, virtual columns are a special kind of function. Hive currently provides two: INPUT__FILE__NAME, the name of the input file for a mapper task, and BLOCK__OFFSET__INSIDE__FILE, the current global file offset (for block-compressed files, this is the file offset of the first byte of the current block). They are very helpful when diagnosing queries that return unexpected or NULL results. The examples below show how virtual columns expose the physical location of the data in HDFS, which is especially useful for bucketed and partitioned tables:
--Simple examples (taken from: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VirtualColumns)
select INPUT__FILE__NAME, key, BLOCK__OFFSET__INSIDE__FILE from src;
select key, count(INPUT__FILE__NAME) from src group by key order by key;
select * from src where BLOCK__OFFSET__INSIDE__FILE > 12000 order by key;
--Virtual columns on a bucketed table
<span style="font-size:12px;">jdbc:hive2://> SELECT INPUT__FILE__NAME, . . . . . . .> BLOCK__OFFSET__INSIDE__FILE AS OFFSIDE . . . . . . .> FROM employee_id_buckets;+---------------------------------------------------------+----------+| input__file__name | offside |+---------------------------------------------------------+----------+| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0 | 0 || hdfs://hive_warehouse_URI/employee_id_buckets/000000_0 | 55 || hdfs://hive_warehouse_URI/employee_id_buckets/000000_0 | 120 || hdfs://hive_warehouse_URI/employee_id_buckets/000000_0 | 175 || hdfs://hive_warehouse_URI/employee_id_buckets/000000_0 | 240 || hdfs://hive_warehouse_URI/employee_id_buckets/000000_0 | 295 || hdfs://hive_warehouse_URI/employee_id_buckets/000000_0 | 360 || hdfs://hive_warehouse_URI/employee_id_buckets/000000_0 | 415 || hdfs://hive_warehouse_URI/employee_id_buckets/000000_0 | 480 || hdfs://hive_warehouse_URI/employee_id_buckets/000000_0 | 535 || hdfs://hive_warehouse_URI/employee_id_buckets/000000_0 | 592 || hdfs://hive_warehouse_URI/employee_id_buckets/000000_0 | 657 || hdfs://hive_warehouse_URI/employee_id_buckets/000000_0 | 712 || hdfs://hive_warehouse_URI/employee_id_buckets/000000_0 | 769 || hdfs://hive_warehouse_URI/employee_id_buckets/000000_0 | 834 || hdfs://hive_warehouse_URI/employee_id_buckets/000001_0 | 0 || hdfs://hive_warehouse_URI/employee_id_buckets/000001_0 | 57 || hdfs://hive_warehouse_URI/employee_id_buckets/000001_0 | 122 || hdfs://hive_warehouse_URI/employee_id_buckets/000001_0 | 177 || hdfs://hive_warehouse_URI/employee_id_buckets/000001_0 | 234 || hdfs://hive_warehouse_URI/employee_id_buckets/000001_0 | 291 || hdfs://hive_warehouse_URI/employee_id_buckets/000001_0 | 348 || hdfs://hive_warehouse_URI/employee_id_buckets/000001_0 | 405 || hdfs://hive_warehouse_URI/employee_id_buckets/000001_0 | 462 || hdfs://hive_warehouse_URI/employee_id_buckets/000001_0 | 517 |+---------------------------------------------------------+----------+25 rows selected (0.073 seconds)</span>

--Virtual columns on a partitioned table (partitioned by date)
<span style="font-size:12px;">jdbc:hive2://> SELECT INPUT__FILE__NAME FROM employee_partitioned;+-------------------------------------------------------------------------+| input__file__name |+-------------------------------------------------------------------------+|hdfs://warehouse_URI/employee_partitioned/year=2010/month=1/000000_0     |hdfs://warehouse_URI/employee_partitioned/year=2012/month=11/000000_0    |hdfs://warehouse_URI/employee_partitioned/year=2014/month=12/employee.txt|hdfs://warehouse_URI/employee_partitioned/year=2014/month=12/employee.txt|hdfs://warehouse_URI/employee_partitioned/year=2014/month=12/employee.txt|hdfs://warehouse_URI/employee_partitioned/year=2014/month=12/employee.txt|hdfs://warehouse_URI/employee_partitioned/year=2015/month=01/000000_0    |hdfs://warehouse_URI/employee_partitioned/year=2015/month=01/000000_0    |hdfs://warehouse_URI/employee_partitioned/year=2015/month=01/000000_0    |hdfs://warehouse_URI/employee_partitioned/year=2015/month=01/000000_0    +-------------------------------------------------------------------------+10 rows selected (0.47 seconds)</span>

Functions not covered in the Hive wiki:
The following are a few useful functions that the Hive wiki does not document:

--Functions for checking NULL values
<span style="font-size:12px;">jdbc:hive2://> SELECT work_place, isnull(work_place) is_null, . . . . . . .> isnotnull(work_place) is_not_null FROM employee;+-------------------------+----------+--------------+| work_place | is_null | is_not_null |+-------------------------+----------+--------------+| ["Montreal","Toronto"] | false | true || ["Montreal"] | false | true || ["New York"] | false | true || ["Vancouver"] | false | true || NULL | true | false |+-------------------------+----------+--------------+5 rows selected (0.058 seconds)</span>

--assert_true: throws an exception if the condition is false
<span style="font-size:12px;">jdbc:hive2://> SELECT assert_true(work_place IS NULL) . . . . . . .> FROM employee;Error: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: ASSERT_TRUE(): assertion failed. (state=,code=0)</span>

--elt(n, str1, str2, ...): returns the n-th string in the list
<span style="font-size:12px;">jdbc:hive2://> SELECT elt(2,'New York','Montreal','Toronto'). . . . . . .> FROM employee LIMIT 1;+-----------+| _c0 |+-----------+| Montreal |+-----------+1 row selected (0.055 seconds)</span>

--Returns the name of the current Hive database (since Hive 0.13.0)
<span style="font-size:12px;">jdbc:hive2://> SELECT current_database();+----------+| _c0 |+----------+| default |+----------+1 row selected (0.057 seconds)</span>

Transactions

Before Hive 0.13.0, Hive did not support row-level transactions. As a result, individual rows could not be updated, inserted, or deleted; data could only be overwritten at the table or partition level. This made concurrent reads and writes, as well as data cleaning, quite cumbersome.

Starting with Hive 0.13.0, Hive fully supports row-level transactions with ACID (Atomicity, Consistency, Isolation, and Durability) semantics. Transactions are supported only on bucketed tables stored in the ORC (Optimized Row Columnar) file format. For more on Hive ACID, see the translated post: Hive ACID 特性 - 0.13.0.

To enable transactions in Hive, the following parameters need to be set:

<span style="font-size:12px;">SET hive.support.concurrency = true;SET hive.enforce.bucketing = true;SET hive.exec.dynamic.partition.mode = nonstrict;SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;SET hive.compactor.initiator.on = true;SET hive.compactor.worker.threads = 1;</span>
The SHOW TRANSACTIONS command, added in Hive 0.13.0, displays the open and aborted transactions in the system:
<span style="font-size:12px;">jdbc:hive2://> SHOW TRANSACTIONS;+-----------------+--------------------+-------+-----------+| txnid | state | user | host |+-----------------+--------------------+-------+-----------+| Transaction ID | Transaction State | User | Hostname |+-----------------+--------------------+-------+-----------+1 row selected (15.209 seconds)</span>

Starting with Hive 0.14.0, the INSERT ... VALUES, UPDATE, and DELETE commands support row-level operations with the following syntax:

<span style="font-size:12px;">INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)] VALUES values_row [, values_row …];UPDATE tablename SET column = value [, column = value ...] [WHERE expression]DELETE FROM tablename [WHERE expression]</span>

