A Deep Dive into Hive


Creating Hive Tables


Method 1: a standard SQL-style CREATE TABLE

create table if not exists default.bf_log_20150913(
  ip string comment 'remote ip address',    -- column comment
  user string,
  req_url string comment 'user request url'
)
comment 'Beifeng Web Access Logs'           -- table comment
row format delimited fields terminated by ' '
stored as textfile                          -- textfile is the default storage format
location '/user/beifeng/hive/warehouse/bf_log_20150913';  -- managed tables usually omit location

Note: by default, data is stored under /user/beifeng/hive/warehouse/<database>/<table>; tables in the default database live directly under /user/beifeng/hive/warehouse/<table>.
Method 2: create a child table from selected columns of another table (CTAS)
create table if not exists default.bf_log_20150913_sa
as select ip, user from default.bf_log_20150913;
Method 3: create an empty table with the same layout as another table (LIKE)
create table if not exists default.bf_log_20150914
like default.bf_log_20150913;
Other common operations:

Show databases:
show databases;
show databases like 'db_hive*';

Switch to a database:
use db_hive_01;

Show database details:
desc database db_hive_03;
desc database extended db_hive_03;

Show table details:
desc formatted <table_name>;

Drop a database:
drop database if exists db_hive_03;          -- fails if the database still contains tables
drop database if exists db_hive_03 cascade;  -- cascading drop: also removes the tables inside
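For completeness, creating a database follows the same pattern (a minimal sketch; db_hive_03 and the location path are just illustrative names):

create database if not exists db_hive_03;
create database if not exists db_hive_03 location '/user/beifeng/hive/custom_db';  -- optional custom location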

Hive Data Types

Numeric Types
  TINYINT (1-byte signed integer, from -128 to 127)
  SMALLINT (2-byte signed integer, from -32,768 to 32,767)
  INT (4-byte signed integer, from -2,147,483,648 to 2,147,483,647)
  BIGINT (8-byte signed integer, from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807)
  FLOAT (4-byte single precision floating point number)
  DOUBLE (8-byte double precision floating point number)
  DOUBLE PRECISION (alias for DOUBLE, only available starting with Hive 2.2.0)
  DECIMAL (introduced in Hive 0.11.0 with a precision of 38 digits; Hive 0.13.0 introduced user-definable precision and scale)

Date/Time Types
  TIMESTAMP (only available starting with Hive 0.8.0)
  DATE (only available starting with Hive 0.12.0)

String Types (commonly used)
  STRING (the most common choice)
  VARCHAR (only available starting with Hive 0.12.0)
  CHAR (only available starting with Hive 0.13.0)

Misc Types
  BOOLEAN
  BINARY (only available starting with Hive 0.8.0)

Complex Types
  arrays: ARRAY<data_type> (negative values and non-constant expressions are allowed as of Hive 0.14)
  maps: MAP<primitive_type, data_type> (negative values and non-constant expressions are allowed as of Hive 0.14)
  structs: STRUCT<col_name : data_type [COMMENT col_comment], ...>
  union: UNIONTYPE<data_type, data_type, ...> (only available starting with Hive 0.7.0)
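The complex types are easiest to understand in a table definition. A minimal sketch (the table name, columns, and delimiters are hypothetical, not from the original text):

create table employee_complex(
  name string,
  subordinates array<string>,                 -- stored in the file as e.g. tom,jerry
  deductions map<string, float>,              -- stored as e.g. tax:0.2,insurance:0.05
  address struct<street:string, city:string>
)
row format delimited fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':';

Elements are then accessed as subordinates[0], deductions['tax'], and address.city.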

External Tables in Hive


A managed (internal) table, also called managed_table, is stored under /user/hive/warehouse by default; a location can be specified but usually is not. Dropping a managed table deletes both the data and the metadata.
An external table (external_table) lets you specify its directory with location at creation time (you usually should). Dropping an external table deletes only the metadata; the data files are left in place.

CREATE EXTERNAL TABLE page_view(
  viewTime INT,
  userid BIGINT,
  page_url STRING,
  referrer_url STRING,
  ip STRING COMMENT 'IP Address of the User',
  country STRING COMMENT 'country of origination'
)
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
STORED AS TEXTFILE
LOCATION '<hdfs_location>';
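To see the managed-vs-external difference in practice, a minimal sketch (the table names are hypothetical):

create table managed_demo(id int);                 -- managed table
create external table external_demo(id int)
location '/user/beifeng/hive/warehouse/external_demo';

drop table managed_demo;    -- data and metadata are both gone
drop table external_demo;   -- metadata is gone, but the files under the location remain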

Partitioned Tables


Partitioned tables are commonly used together with external tables in production. A partitioned table corresponds to a dedicated directory on the HDFS file system, and that directory holds all the data files of the partition. A partition in Hive is simply a subdirectory: it splits a large data set into smaller ones according to business needs. When a query selects only the partitions it needs via expressions in the where clause, it becomes much more efficient.

create external table if not exists default.emp_partition(
  empno int,
  ename string,
  job string,
  mgr int,
  hiredate string,
  sal double,
  comm double,
  deptno int
)
partitioned by (month string)                  -- single-level partition
-- partitioned by (month string, day string)   -- two-level partition
row format delimited fields terminated by '\t';

Load data:
load data local inpath '/opt/datas/emp.txt' into table default.emp_partition partition(month='201509');
-- for a two-level partition: partition(month='201509', day='30')

Query:
select * from emp_partition where month = '201509';
-- for a two-level partition: ... and day = '30'

Combine partitions with union:
select * from emp_partition where month = '201509'
union
select * from emp_partition where month = '201508'
union
select * from emp_partition where month = '201507';
-- this runs as 3 + 2 MapReduce jobs
Note: for a non-partitioned managed table, you can upload data files directly into /user/hive/warehouse/... and queries will see them. This does not work for a partitioned table, because Hive keeps metadata about each partition. See the following example.
Non-partitioned table (direct upload works):
create table if not exists default.dept_nopart(
  deptno int,
  dname string,
  loc string
)
row format delimited fields terminated by '\t';
-- stored by default at /user/hive/warehouse/dept_nopart

Upload the data:
dfs -put /opt/datas/dept.txt /user/hive/warehouse/dept_nopart;
Query: the data is visible.

Partitioned table (direct upload is not enough):
create table if not exists default.dept_part(
  deptno int,
  dname string,
  loc string
)
partitioned by (day string)
row format delimited fields terminated by '\t';

Create the partition directory and upload the data:
dfs -mkdir -p /user/hive/warehouse/dept_part/day=20150913;
dfs -put /opt/datas/dept.txt /user/hive/warehouse/dept_part/day=20150913;
Query: no data (the partition has no metadata yet).

Repair the partition metadata:
msck repair table dept_part;
-- or:
alter table dept_part add partition(day='20150913');

Query again:
select * from dept_part;  -- now returns data
List the partitions of a table:
show partitions dept_part;
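Related maintenance commands follow the same pattern (a hedged sketch; the day value is hypothetical):

alter table dept_part add partition(day='20150914');   -- register an empty partition
alter table dept_part drop partition(day='20150914');  -- remove a partition (and, for a managed table, its data)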

Loading Data into Hive


Method 1: LOAD DATA, using the following syntax

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]

Where the source file lives ([LOCAL]):
local filesystem: load data local ...
HDFS: load data ...

Whether existing table data is replaced ([OVERWRITE]): with OVERWRITE the table's data is overwritten; without it, the data is appended.

Examples:
-- load a local file into a Hive table
load data local inpath '/opt/datas/emp.txt' into table default.emp;
-- load an HDFS file into Hive (after loading, the source file is moved, i.e. removed from its original HDFS location)
load data inpath '/user/beifeng/hive/datas/emp.txt' into table default.emp;
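Since [OVERWRITE] is described above but not demonstrated, a minimal sketch:

-- replace whatever is already in the table
load data local inpath '/opt/datas/emp.txt' overwrite into table default.emp;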
Method 2: load data with insert after creating the table (or directly with CTAS)
create table default.emp_ci like default.emp;
insert into table default.emp_ci select * from default.emp;

Or, as CTAS:
create table if not exists default.bf_log_20150913_sa
as select ip, user from default.bf_log_20150913;
Method 3: point the table at existing data with location when creating it (common for external tables)
create external table if not exists default.emp_ext2(
  empno int,
  ename string,
  job string,
  mgr int,
  hiredate string,
  sal double,
  comm double,
  deptno int
)
row format delimited fields terminated by '\t'
location '/user/beifeng/hive/warehouse/emp_ext2';

Exporting Data from Hive


Method 1: insert overwrite directory

-- export to a local directory with the default delimiter
insert overwrite local directory '/opt/datas/hive_exp_emp'
select * from default.emp;

-- export to a local directory with custom delimiters
insert overwrite local directory '/opt/datas/hive_exp_emp'
row format delimited fields terminated by '\t'
collection items terminated by '\n'
select * from default.emp;

-- export to HDFS with the default delimiter
insert overwrite directory '/user/beifeng/hive/hive_exp_emp'
select * from default.emp;
Method 2: run the query with hive -e and redirect the output
bin/hive -e "select * from default.emp;" > /opt/datas/hive_exp_emp
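For longer queries, the same trick works with a script file via -f (a hedged variant; /opt/datas/emp.sql is a hypothetical script path):

bin/hive -f /opt/datas/emp.sql > /opt/datas/hive_exp_emp.txt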
Method 3: Sqoop (commonly used; covered in a later chapter)
sqoop
    hdfs/hive -> rdbms
    rdbms -> hdfs/hive/hbase
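As a preview of that chapter, a hedged sketch of what such a transfer can look like (the connection string, credentials, and paths are all hypothetical):

sqoop export \
  --connect jdbc:mysql://hostname:3306/test_db \
  --username root --password 123456 \
  --table emp \
  --export-dir /user/hive/warehouse/emp \
  --input-fields-terminated-by '\t'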

Method 4: export / import

-- export: copy a Hive table out to HDFS
export table default.emp to '/user/beifeng/hive/export/emp_exp';

-- import: load previously exported data from HDFS into Hive
create table db_hive.emp like default.emp;
import table db_hive.emp from '/user/beifeng/hive/export/emp_exp';

Common Hive Queries


Simple queries

select * from emp;
select t.empno, t.ename, t.deptno from emp t;
select * from emp limit 5;
Common predicates and aggregates: = / >= / <= / between ... and ..., is null / is not null, in / not in, max / min / count / sum / avg.
select t.empno, t.ename, t.deptno from emp t where t.sal between 800 and 1500;
select t.empno, t.ename, t.deptno from emp t where t.comm is null;
show functions;                      -- list the built-in functions
desc function extended max;          -- explain what a function means and how to use it
select count(*) cnt from emp;
group by
-- average salary per department
select t.deptno, avg(t.sal) avg_sal from emp t group by t.deptno;
-- highest salary per job within each department
select t.deptno, t.job, max(t.sal) max_sal from emp t group by t.deptno, t.job;
having
where filters individual rows;
having filters the results of a grouping.
-- departments whose average salary is greater than 2000
select deptno, avg(sal) avg_sal from emp group by deptno having avg_sal > 2000;
join: combining two tables
-- equi join (join ... on)
select e.empno, e.ename, d.deptno, d.dname from emp e join dept d on e.deptno = d.deptno;
-- left join: all rows from the left table are kept
select e.empno, e.ename, d.deptno, d.dname from emp e left join dept d on e.deptno = d.deptno;
-- right join: all rows from the right table are kept
select e.empno, e.ename, d.deptno, d.dname from emp e right join dept d on e.deptno = d.deptno;
-- full join
select e.empno, e.ename, d.deptno, d.dname from emp e full join dept d on e.deptno = d.deptno;
order by vs. sort by

order by performs a global sort over the entire result set (through a single reducer). sort by only sorts within each reducer, so the overall result is not globally ordered.

-- set the number of reducers
set mapreduce.job.reduces = 3;
select * from emp sort by empno desc;
insert overwrite local directory '/opt/datas/sortby-res'
select * from emp sort by empno asc;
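For contrast, the globally sorted variant (a minimal sketch):

-- globally sorted, regardless of the reducer setting above
select * from emp order by empno desc;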
distribute by
Similar to the partitioner in MapReduce: it controls which reducer each row goes to, and is used together with sort by.

insert overwrite local directory '/opt/datas/distby-res'
select * from emp distribute by deptno sort by empno asc;
-- note: distribute by must appear before sort by
cluster by
When the distribute by and sort by columns are the same, cluster by can replace the pair.

insert overwrite local directory '/opt/datas/clustby-res'
select * from emp cluster by empno;
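The statement above is equivalent to spelling out both clauses (ascending order only; cluster by cannot sort descending):

insert overwrite local directory '/opt/datas/clustby-res'
select * from emp distribute by empno sort by empno;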

Hive UDF Programming


A UDF (user-defined function) lets users extend HiveQL with their own functions.

show functions;                     -- list Hive's built-in functions
desc function split;                -- short description of a function
desc function extended split;       -- detailed description with usage examples
Development steps:
1、Extend org.apache.hadoop.hive.ql.exec.UDF.
2、Implement an evaluate method; evaluate supports overloading.
Notes:
1、A UDF must have a return type. It may return null, but the return type cannot be void.
2、Prefer Hadoop types such as Text/LongWritable in UDFs; plain Java types are not recommended.
package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public final class Lower extends UDF {
  // Lower-cases the input; returns null when the input is null.
  public Text evaluate(final Text s) {
    if (s == null) { return null; }
    return new Text(s.toString().toLowerCase());
  }
}
Registering the custom UDF in Hive
Method 1: add jar plus a temporary function

1、Export the class as a jar, e.g. hiveudf.jar.
2、Load it in Hive: add jar /opt/datas/hiveudf.jar;
3、Register it: create temporary function my_lower as "com.example.hive.udf.Lower";
   (On Hive 2.1 the AS clause must be the fully qualified class name, com.example.hive.udf.Lower; with just the package name Hive keeps reporting that the class cannot be found. This cost the author a whole afternoon.)
4、show functions; now lists my_lower.
5、Use it: select ename, my_lower(ename) lowername from emp limit 5;
Method 2: create a permanent function from a jar on HDFS
CREATE FUNCTION self_lower AS 'com.example.hive.udf.Lower' USING JAR 'hdfs://hadoop-senior.ibeifeng.com:8020/user/beifeng/hive/jars/hiveudf.jar';


