Basic Hive Operations (Key Points)

Source: Internet  Editor: 程序博客网  Time: 2024/05/21 14:03

Basic Hive table operations
1. Create a managed table
create table [if not exists] db01.student(
id int,
name string,
age int,
...
)
row format delimited fields terminated by '\t';

2. Load data
load data [local] inpath 'filepath' [overwrite] into table tableName;

3. Create an external table
create external table [if not exists] db01.student(
id int,
name string,
age int,
...
)
row format delimited fields terminated by '\t'
location 'hdfspath';

Note: the path after location must be an HDFS directory; it cannot be a file name.

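4. Partitioned tables
   select * from tableName where id=?;
   Advantages of a partitioned table:
   1. Queries that filter on the partition column read only the matching partitions, so they run faster
   2. The files are organized in a more manageable structure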

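Log files
   20161025
   20161026
   1. Load each day's log file into a Hive table
   2. Compute the useful statistics from the logs every day (pv, uv, ip)
   3. Partition the log files by date
   4. Restrict the range of data being analyzed by specifying partitions
Requirement: partition the student table by province
Single-level partitioned table: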
create table if not exists db01.student_par(
id int,
name string,
age int
)
partitioned by (province string)
row format delimited fields terminated by '\t';

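Creating the table only declares the partition column; it does not specify any partition values.
The partition values are specified when the data is loaded.
Syntax: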
LOAD DATA [LOCAL] INPATH 'filepath'
[OVERWRITE] INTO TABLE tablename
[PARTITION (partcol1=val1, partcol2=val2 ...)]

load data local inpath '/opt/data/stu.txt'
overwrite into table db01.student_par
partition (province='jiangsu');

load data local inpath '/opt/data/stu2.txt'
overwrite into table db01.student_par
partition (province='zhejiang');

View partition information:
   show partitions tableName;

Query a single partition:
   select * from student_par where province='jiangsu';

Create a two-level partitioned table
create table student_par2(
id int,
name string,
age int
)
partitioned by (province string, city string)
row format delimited fields terminated by '\t';

Load data into the two-level partitioned table
load data local inpath '/opt/data/stu.txt'
overwrite into table student_par2
partition (province='jiangsu',city='xuzhou');

load data local inpath '/opt/data/stu2.txt'
overwrite into table student_par2
partition (province='shandong',city='jinan');

load data local inpath '/opt/data/stu3.txt'
overwrite into table student_par2
partition (province='America',city='Los');

load data local inpath '/opt/data/stu4.txt'
overwrite into table student_par2
partition (province='America',city='NewYork');

Query the two-level partitioned table
   select * from student_par2
   where province='America' and city='NewYork';
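
The speed advantage comes from the directory layout: each partition lives in its own column=value subdirectory, so a filter on the partition columns can select directories before any file is read. The Python sketch below only imitates that idea. `partition_dir` and `prune` are hypothetical helper names, and the warehouse path is an assumption for illustration.

```python
def partition_dir(table_dir, spec):
    """Build the directory for a partition spec, e.g. .../province=America/city=Los."""
    parts = [f"{col}={val}" for col, val in spec]
    return "/".join([table_dir.rstrip("/")] + parts)


def prune(partitions, **wanted):
    """Keep only the partitions whose column values match every filter."""
    return [p for p in partitions
            if all(dict(p).get(col) == val for col, val in wanted.items())]


# The four partitions loaded into student_par2 above.
partitions = [
    [("province", "jiangsu"), ("city", "xuzhou")],
    [("province", "shandong"), ("city", "jinan")],
    [("province", "America"), ("city", "Los")],
    [("province", "America"), ("city", "NewYork")],
]

table_dir = "/user/hive/warehouse/student_par2"  # assumed table location
selected = prune(partitions, province="America", city="NewYork")
for spec in selected:
    print(partition_dir(table_dir, spec))
```

Only one of the four directories survives the filter, which is why the query does not need to scan the other partitions.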

==========================================

Create a table, then upload the data directly
create table if not exists db01.student2(
id int,
name string,
age int
)
row format delimited fields terminated by '\t';

hdfs dfs -put stu.txt /user/hive/warehouse/db01.db/student2

Once the table exists, uploading a file directly into its directory also loads the data.

Create a partitioned table:
create table student_par3(
id int,
name string,
age int
)
partitioned by (province string, city string)
row format delimited fields terminated by '\t';

Create the corresponding directories:
hdfs dfs -mkdir -p /user/hive/warehouse/db01.db/student_par3/province=America/city=Los;
hdfs dfs -mkdir -p /user/hive/warehouse/db01.db/student_par3/province=America/city=NewYork

Store the data:
hdfs dfs -put stu3.txt /user/hive/warehouse/db01.db/student_par3/province=America/city=Los;
hdfs dfs -put stu4.txt /user/hive/warehouse/db01.db/student_par3/province=America/city=NewYork

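Add the partition metadata (add partition); directories created by hand are not visible as partitions until they are registered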
alter table student_par3 add partition (province='America',city='Los');
alter table student_par3 add partition (province='America',city='NewYork');

==========================================================================
External partitioned table
create external table student_ext_par(
id int,
name string,
age int
)
partitioned by (province string, city string)
row format delimited fields terminated by '\t';

Upload the data to HDFS
hdfs dfs -mkdir -p /nicole/input/student_ext_par/province=America/city=Los;
hdfs dfs -mkdir -p /nicole/input/student_ext_par/province=America/city=NewYork

hdfs dfs -put stu3.txt /nicole/input/student_ext_par/province=America/city=Los;
hdfs dfs -put stu4.txt /nicole/input/student_ext_par/province=America/city=NewYork

Associate the data with the partitions
alter table student_ext_par add partition (province='America',city='Los') location '/nicole/input/student_ext_par/province=America/city=Los';
alter table student_ext_par add partition (province='America',city='NewYork') location '/nicole/input/student_ext_par/province=America/city=NewYork';

alter table student_ext_par add partition (province='America',city='Miami') location '/nicole/input/student_ext_par/test1';
alter table student_ext_par add partition (province='America',city='Las') location '/nicole/input/student_ext_par/test2';
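Although the partition spec of an external table does not have to match the HDFS path it points to,
it is still advisable to create HDFS paths that mirror the partitions, because that makes them easier to manage.

Command to drop a partition: drop partition
alter table student_ext_par drop partition (province='America',city='Las');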

======================================================
Creating and loading Hive tables
Method 1:
create table s1 (
id int,
name string
)
row format delimited fields terminated by '\t';

Load data
load data [local] inpath 'path' [overwrite] into table s1;

Method 2:
create table student_like like student;

Load data
load data [local] inpath 'path' [overwrite] into table student_like;

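Method 3:
create table student_as as select * from student;
This creates the table and loads the data in a single step.
It is often used to create a temporary table.

Example: create a temporary table from emp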
create table emp_as as select empno as no,empname as name, empjob as job from emp;

Method 4:
The table must exist before data can be inserted with an insert statement
create table emp_insert(
no int,
name string,
job string
)
row format delimited fields terminated by '\t';

insert into table tableName           -- append
insert overwrite table tableName      -- overwrite

insert into table emp_insert select empno,empname,empjob from emp;
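
The difference between the two insert forms can be pictured with plain Python lists. This is a rough sketch of the semantics only, not of how Hive stores data; the function names and the sample rows are made up for illustration.

```python
def insert_into(table_rows, new_rows):
    """insert into: keep the existing rows and append the new ones."""
    return list(table_rows) + list(new_rows)


def insert_overwrite(table_rows, new_rows):
    """insert overwrite: discard the existing rows, keep only the new ones."""
    return list(new_rows)


# Made-up (no, name, job) rows standing in for emp_insert.
existing = [(7369, "SMITH", "CLERK")]
incoming = [(7499, "ALLEN", "SALESMAN")]

appended = insert_into(existing, incoming)
replaced = insert_overwrite(existing, incoming)
print(len(appended), len(replaced))
```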

========================================================

Ways to import data into a Hive table

Method 1: from the local file system to Hive
   load data local inpath 'path/file' [overwrite] into table table_name;

Method 2: from HDFS to Hive
   load data inpath 'path/file' into table table_name;

Method 3: load the data with as when creating the table
   create table db_01.emp_as as select * from emp;

Method 4: load with the insert command
   insert into table table_name select * from emp;
   insert overwrite table table_name select * from emp;

Method 5: specify location when creating the table
   create table table_name(...)
   partitioned by ...
   row format ...
   location "";

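Ways to export data from Hive:
Method 1: export to the local file system
   insert overwrite local directory 'localpath' <query>;
Example:
insert overwrite local directory "/opt/data/hive" select * from emp;
insert overwrite local directory '/opt/data/hive/aaa' row format delimited fields terminated by '\t' select * from emp;
Notes:
1. After exporting this way, you have to change into the target directory to see the files
2. The output file 000000_0 replaces all other files in the directory
3. If no row format clause specifies a delimiter, the fields are separated by the default delimiter (\001, i.e. Ctrl-A)
4. The directory may be given in either single or double quotes

Method 2: export to HDFS
insert overwrite directory "/nicole/input/emp/temp" select * from emp;

Known issue:
   Version 0.13.1 does not support specifying a delimiter when exporting directly to HDFS

Method 3:
$ bin/hive -e "select * from db01.emp" > /opt/data/hive/emp.txt

Method 4:
   Export with sqoop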


Exercises:
drop table if exists emp;
create table emp(
empno int,
empname string,
empjob string,
managerno int,
empdate string,
empsalary double,
empreward double,
deptno int
)
row format delimited fields terminated by '\t';

load data local inpath '/opt/data/emp.txt' overwrite into table emp;

Tables: emp, dept
1. Find the highest salary in each department
   select max(empsalary),deptno from emp group by deptno;
2. Find the highest salary and the department name for each department
select
max(e.empsalary) salary,e.deptno,d.deptname
from emp as e
join
dept as d
on
e.deptno=d.deptno
group by
e.deptno,d.deptname;

3. Show the department name, the highest salary in the department, and the city where the department is located
select
max(e.empsalary) salary,e.deptno,d.deptname,d.deptcity
from emp as e
join
dept as d
on
e.deptno=d.deptno
group by
e.deptno,d.deptname,d.deptcity;

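4. Show the department name and the highest salary in the department, keeping only departments whose highest salary is at least 3000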
select
max(e.empsalary) salary,e.deptno,d.deptname,d.deptcity
from emp as e
join
dept as d
on
e.deptno=d.deptno
group by
e.deptno,d.deptname,d.deptcity
having salary >= 3000;

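5. Among employees whose bonus is below 10000, find the highest base salary per department,
and show the department name for departments where that salary is at least 3000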
select
max(e.empsalary) salary,e.deptno,d.deptname,d.deptcity
from emp as e
join
dept as d
on e.deptno=d.deptno
where e.empreward < 10000
group by
e.deptno,d.deptname,d.deptcity
having salary >= 3000;
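
Exercise 5 combines where and having, which act at different stages: where removes individual rows before grouping, while having removes whole groups after aggregation. The Python sketch below walks through those stages on made-up rows; the sample data and the function name are assumptions for illustration, and the join with dept is left out to keep it short.

```python
# Made-up (empname, empsalary, empreward, deptno) rows.
emp = [
    ("a", 5000.0, 200.0, 10),
    ("b", 8000.0, 20000.0, 10),
    ("c", 2500.0, 100.0, 20),
    ("d", 2800.0, 300.0, 20),
]


def max_salary_by_dept(rows, reward_limit, min_max_salary):
    # WHERE: drop individual rows before any grouping happens.
    kept = [r for r in rows if r[2] < reward_limit]
    # GROUP BY deptno with max(empsalary).
    best = {}
    for _name, salary, _reward, deptno in kept:
        best[deptno] = max(salary, best.get(deptno, salary))
    # HAVING: drop whole groups based on the aggregated value.
    return {d: s for d, s in best.items() if s >= min_max_salary}


result = max_salary_by_dept(emp, reward_limit=10000, min_max_salary=3000)
print(result)
```

Employee b has the highest salary in department 10 but is removed by the where stage, so the reported maximum for that department is 5000.0; department 20 is removed by the having stage.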

join
   select * from emp join dept;       -- Cartesian product, m*n rows

inner join

left join
   select * from emp left join dept on emp.deptno=dept.deptno;

right join
   select * from emp right join dept on emp.deptno=dept.deptno;
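
How the join types differ is easiest to see on a few rows. The following Python sketch mimics the result sets with nested loops; it is not how Hive executes joins, and the sample rows and function names are invented for illustration. None stands in for NULL.

```python
emp = [("SMITH", 20), ("ALLEN", 30), ("KING", 50)]            # (empname, deptno)
dept = [(20, "RESEARCH"), (30, "SALES"), (40, "OPERATIONS")]  # (deptno, deptname)


def cross_join(left, right):
    """join without a condition: every left row paired with every right row."""
    return [l + r for l in left for r in right]


def inner_join(left, right):
    """Only pairs whose deptno matches on both sides."""
    return [(name, dno, dname)
            for name, dno in left
            for rno, dname in right if dno == rno]


def left_join(left, right):
    """Every left row is kept; unmatched rows get None for the right columns."""
    out = []
    for name, dno in left:
        matches = [dname for rno, dname in right if rno == dno]
        out.extend((name, dno, m) for m in matches or [None])
    return out


def right_join(left, right):
    """Every right row is kept; unmatched rows get None for the left columns."""
    out = []
    for rno, dname in right:
        matches = [name for name, dno in left if dno == rno]
        out.extend((m, rno, dname) for m in matches or [None])
    return out


print(len(cross_join(emp, dept)))
print(left_join(emp, dept))
print(right_join(emp, dept))
```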

==========================================================================
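explain
   Shows the execution plan of a query

Syntax: explain <query>
   explain select * from emp right join dept on emp.deptno=dept.deptno;

Commonly used functions:
hive (db01)> show functions;              -- list all functions
hive (db01)> desc function max;           -- show a function's description
hive (db01)> desc function extended sum;  -- show a function's detailed description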

concat: string concatenation function
hive (db01)> select concat(empname,empjob) from emp;
hive (db01)> select concat(empname,"_",empjob) from emp;

substr: substring function
hive (db01)> select substr(empdate,1,4) from emp;
   1: start at the first character (positions are 1-based)
   4: take a substring of length 4
Date and time functions
day
month
year
hour
   hive (db01)> select hour("2010-10-10 10:11:12");
minute
second
hive (db01)> select year(empdate),month(empdate),day(empdate),hour(empdate) from emp;
_c0     _c1     _c2     _c3
1980    12      17      NULL
1981    2       20      NULL
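If the value contains the component it is extracted; otherwise NULL is returned (empdate has no time part here, so hour is NULL)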

Synonyms: dayofmonth
date is a string in the format of 'yyyy-MM-dd HH:mm:ss' or 'yyyy-MM-dd'.

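unix_timestamp function
   Converts a date-time to the number of seconds since 1970-01-01 00:00:00 UTC
hive (db01)> select unix_timestamp("2016-10-26 15:23:30");
   1477466610

from_unixtime function
   Converts a Unix timestamp to a date-time (format: 2016-10-26 15:23:30)
hive (db01)> select from_unixtime(1477466610);
   2016-10-26 15:23:30

cast function
   A log record supplies a value in milliseconds: 1477466610456 ms
   Dividing by 1000 and casting to int yields whole seconds:
   cast(1477466610456/1000 as int)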
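
The three functions chain together naturally: truncate milliseconds to seconds, then format the seconds as a date-time. The Python sketch below reproduces the numbers above. It assumes the session time zone is UTC+8, which is the offset the sample values are consistent with; the function names simply mirror the Hive ones and are not Hive APIs.

```python
from datetime import datetime, timedelta, timezone

TZ = timezone(timedelta(hours=8))  # assumed session time zone (UTC+8)
FMT = "%Y-%m-%d %H:%M:%S"


def unix_timestamp(text):
    """Date-time string -> seconds since 1970-01-01 00:00:00 UTC."""
    return int(datetime.strptime(text, FMT).replace(tzinfo=TZ).timestamp())


def from_unixtime(seconds):
    """Seconds since the epoch -> date-time string in the assumed time zone."""
    return datetime.fromtimestamp(seconds, TZ).strftime(FMT)


millis = 1477466610456
seconds = int(millis / 1000)  # like cast(... as int): the fraction is dropped

print(seconds)
print(from_unixtime(seconds))
print(unix_timestamp("2016-10-26 15:23:30"))
```

With a different session time zone the same timestamp would format to a different wall-clock time, so the offset is worth checking before comparing results.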


case when
Syntax:
case
when condition then result
when condition then result
...
else result
end

Example:
select empname,
case
when empreward>=10000 then "rich"
when empreward<10000 and empreward >=5000 then "just so so"
else "poor"
end
from emp;
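
A case expression is evaluated top to bottom and returns the result of the first branch whose condition is true, falling back to else. The same logic as an if chain in Python, as a sketch; `reward_level` is a made-up name, and None models a NULL empreward, for which no comparison is true.

```python
def reward_level(empreward):
    """Mirror the case expression: the first matching branch wins, else is the fallback."""
    if empreward is None:
        # Comparisons with NULL are never true, so the else branch applies.
        return "poor"
    if empreward >= 10000:
        return "rich"
    if 5000 <= empreward < 10000:
        return "just so so"
    return "poor"


for value in (20000.0, 5000.0, 300.0, None):
    print(value, reward_level(value))
```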

hiveserver2    a server built on the Thrift framework

Edit the hive-site.xml configuration file

<property>
<name>hive.server2.thrift.port</name>
<value>10000</value>
</property>


<property>
<name>hive.server2.thrift.bind.host</name>
<value>nicole02.com.cn</value>
</property>


<property>
<name>hive.server2.long.polling.timeout</name>
<value>5000</value>
</property>


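Start hiveserver2
$ bin/hive --service hiveserver2   # start the server

$ bin/beeline       # start the client
beeline> help       # list the available commands
   > !connect jdbc:hive2://192.168.234.150:10000
                   # connect to the database
Enter username for jdbc:hive2://192.168.234.150:10000: nicole
Enter password for jdbc:hive2://192.168.234.150:10000: *****
   # Note: the username and password are those of the Linux user
   # Leaving them empty also connects, but without permissions

   # After connecting, the prompt looks like this:
0: jdbc:hive2://192.168.234.150:10000>
   # Commands are entered the same way as in the hive CLI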

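2.2.2 Drop Partitions
ALTER TABLE table_name DROP partition_spec, partition_spec,...

ALTER TABLE c02_clickstat_fatdt1 DROP PARTITION (dt='20101202');
2.2.3 Rename Table
ALTER TABLE table_name RENAME TO new_table_name
This command renames a table. The location of the data and the partition names do not change. In other words, the old table name is not "released", and changes made through the old table alter the new table's data. (This describes early Hive releases; later releases also move a managed table's HDFS directory on rename.)
2.2.4 Change Column
ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|AFTER column_name]
This command can change a column's name, data type, comment, position, or any combination of these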
Eg:

2.2.5 Add/Replace Columns
ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)

ADD appends new columns after all existing columns (but before the partition columns); REPLACE replaces all of the table's columns.
Eg:
hive> desc xi;
OK
id      int
cont    string
dw_ins_date     string
Time taken: 0.061 seconds
hive> create table xibak like xi;
OK
Time taken: 0.157 seconds
hive> alter table xibak replace columns (ins_date string);
OK
Time taken: 0.109 seconds
hive> desc xibak;
OK
ins_date        string
2.3 Create View
CREATE VIEW [IF NOT EXISTS] view_name [ (column_name [COMMENT column_comment], ...) ]
[COMMENT view_comment]
[TBLPROPERTIES (property_name = property_value, ...)]
AS SELECT ...

0 0