Hive Partitioned Tables


I. Hive Syntax

1.1 Creating a table (MANAGED_TABLE):

create table student(id bigint,name string) row format delimited fields terminated by '\t';

Creating the table writes the following records into the metastore database: a row for the new table in the TBLS table (table name, type MANAGED_TABLE), one row per column in the COLUMNS_V2 table, and the storage descriptor (HDFS location, input/output format) in the SDS table.
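If you want to look at these records yourself, you can query the metastore database directly. A minimal sketch, assuming a MySQL-backed metastore with the default schema (exact columns vary between Hive versions):

-- run against the MySQL database that backs the Hive metastore
SELECT t.TBL_ID, t.TBL_NAME, t.TBL_TYPE, s.LOCATION
FROM TBLS t JOIN SDS s ON t.SD_ID = s.SD_ID
WHERE t.TBL_NAME = 'student';

-- columns of the table, in declaration order
SELECT c.COLUMN_NAME, c.TYPE_NAME, c.INTEGER_IDX
FROM TBLS t
JOIN SDS s ON t.SD_ID = s.SD_ID
JOIN COLUMNS_V2 c ON s.CD_ID = c.CD_ID
WHERE t.TBL_NAME = 'student'
ORDER BY c.INTEGER_IDX;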



1.2 Loading data:
load data local inpath '/root/student.txt' into table student;

1.3 Querying data
hive> select * from student;
OK
1       zhangsan
2       wangwu
3       lisi
NULL    NULL
Time taken: 0.318 seconds, Fetched: 4 row(s)
hive> select * from student limit 2;
OK
1       zhangsan
2       wangwu
Time taken: 0.087 seconds, Fetched: 2 row(s)


Note: neither of these queries triggers a MapReduce job.
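This is because a plain select * (optionally with limit) can be answered by a simple fetch task that just reads the table's files. Which queries qualify is controlled by a Hive setting (shown as a sketch; the property exists since Hive 0.10 and its default differs by version):

hive> set hive.fetch.task.conversion;
hive> set hive.fetch.task.conversion=more;

With the value more, simple select/filter/limit queries are also served without launching MapReduce.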


hive> select sum(id) from student;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1494075968732_0002, Tracking URL = http://heres04:8088/proxy/application_1494075968732_0002/
Kill Command = /heres/hadoop-2.2.0/bin/hadoop job  -kill job_1494075968732_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2017-05-06 21:26:09,717 Stage-1 map = 0%,  reduce = 0%
2017-05-06 21:26:19,428 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.71 sec
2017-05-06 21:26:28,987 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.4 sec
MapReduce Total cumulative CPU time: 3 seconds 400 msec
Ended Job = job_1494075968732_0002
MapReduce Jobs Launched: 
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 3.4 sec   HDFS Read: 232 HDFS Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 400 msec
OK
6
Time taken: 31.431 seconds, Fetched: 1 row(s)


Note: this query does launch a MapReduce job.

1.4 External tables (EXTERNAL_TABLE)
Building an external table: the data exists first (it is already in HDFS), and we then create a table that points at that directory. Note that hadoop filesystem commands can be run from the hive client as well.
Any file placed into that directory becomes queryable from the hive client. This also applies to managed tables: whether a table is managed or external, data dropped into its directory can be queried; partitioned tables are the exception (see section II).
hive> dfs -ls /;
Found 11 items
-rw-r--r--   3 root supergroup         59 2017-04-18 16:55 /a.txt
drwxr-xr-x   - root supergroup          0 2017-04-30 15:15 /hbase
-rw-r--r--   3 root supergroup      27605 2017-04-18 10:44 /install.log
-rw-r--r--   3 root supergroup      27605 2017-04-18 15:47 /log
drwxr-xr-x   - root supergroup          0 2017-04-23 17:31 /sqoop
drwx------   - root supergroup          0 2017-05-04 16:41 /tmp
drwxr-xr-x   - root supergroup          0 2017-05-04 19:14 /user
drwxr-xr-x   - root supergroup          0 2017-04-18 18:46 /wca
drwxr-xr-x   - root supergroup          0 2017-04-18 18:53 /wcc
drwxr-xr-x   - root supergroup          0 2017-04-18 18:09 /wcout
drwxr-xr-x   - root supergroup          0 2017-04-18 19:53 /wcs
hive> dfs -mkdir /data;
hive> dfs -put /root/student.txt /data/a.txt;
hive> dfs -put /root/student.txt /data/b.txt;

hive> create external table ext_student(id int,name string) row format delimited fields terminated by '\t' location '/data' ;

OK
Time taken: 0.338 seconds
hive> select * from ext_student;
OK
1       zhangsan
2       wangwu
3       lisi
NULL    NULL
1       zhangsan
2       wangwu
3       lisi
NULL    NULL
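A hedged aside on what EXTERNAL means here: dropping an external table removes only the metastore entry, while the files under /data remain in HDFS; dropping a managed table such as student would also delete its data directory. A quick way to verify (a sketch; ext_student is not needed later):

hive> drop table ext_student;
hive> dfs -ls /data;

The second command should still show a.txt and b.txt.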


II. Creating a Partitioned Table

hive> create table beauties (id bigint, name string, size double) partitioned by (nation string) row format delimited fields terminated by '\t';

hive> load data local inpath '/root/b.c' into table beauties partition(nation='china'); 


Copying data from file:/root/b.c
Copying file: file:/root/b.c
Loading data to table default.beauties partition (nation=china)
Partition default.beauties{nation=china} stats: [numFiles=1, numRows=0, totalSize=45, rawDataSize=0]
OK

Time taken: 0.739 seconds


hive> select * from beauties;
OK
1       glm     34.0    china
2       lina    30.9    china
3       liu     45.0    china
4       bing    56.56   china
Time taken: 0.139 seconds, Fetched: 4 row(s)
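The load created one HDFS directory per partition under the table's directory; a partition is simply a subdirectory named key=value. To see it from the hive client (a sketch, assuming the default warehouse location /user/hive/warehouse):

hive> dfs -ls /user/hive/warehouse/beauties;

This should list a single entry, /user/hive/warehouse/beauties/nation=china, containing the loaded file.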


hive> dfs -mkdir /beauty/nation=Japan;  
hive> dfs -put /root/b.j /beauty/nation=Japan; 
hive> select * from beauties;
OK
1       glm     34.0    china
2       lina    30.9    china
3       liu     45.0    china
4       bing    56.56   china
Time taken: 0.089 seconds, Fetched: 4 row(s)


Why did the Japan rows not show up? Because the metastore has no record of that partition: the files are in HDFS, but the corresponding partition entry (in the metastore's PARTITIONS table) is missing. It has to be added explicitly:


hive> alter table beauties add partition (nation='Japan') location "/beauty/nation=Japan";   -- register the partition in the metastore
OK
Time taken: 0.185 seconds
hive> select * from beauties;
OK
1       bgyjy   56.6565 Japan
2       jzmb    23.232  Japan
3       ewrwe   43.9    Japan
1       glm     34.0    china
2       lina    30.9    china
3       liu     45.0    china
4       bing    56.56   china

Time taken: 0.104 seconds, Fetched: 7 row(s)


hive> select * from beauties where nation='Japan';
OK
1       bgyjy   56.6565 Japan
2       jzmb    23.232  Japan
3       ewrwe   43.9    Japan

Time taken: 0.253 seconds, Fetched: 3 row(s)
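To check which partitions the metastore now knows about, rather than inferring it from select results, list them directly; after the alter table above this should show both nation=china and nation=Japan:

hive> show partitions beauties;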

III. Importing Data from MySQL and Analyzing It with Hive

1. Add hive to the environment variables
vim /etc/profile
Append to PATH: :/heres/apache-hadoop.../bin
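A minimal sketch of what the profile change might look like (the install directory /heres/apache-hive-0.13.0-bin is only an assumed example; use your actual path):

# lines appended to /etc/profile (install directory is an assumed example)
export HIVE_HOME=/heres/apache-hive-0.13.0-bin
export PATH=$PATH:$HIVE_HOME/bin

After saving, run source /etc/profile so the current shell picks up the change.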

2. Create two tables in hive
create table trade_detail (id bigint, account string, income double, expenses double, time string) row format delimited fields terminated by '\t';
create table user_info (id bigint, account string, name  string, age int) row format delimited fields terminated by '\t';

3. Import the data from MySQL directly into hive (sqoop first writes the data to HDFS, then has hive load it, which is why hive must be on the PATH). Go into the bin directory under the sqoop directory and run the following:
sqoop import --connect jdbc:mysql://192.168.1.10:3306/itcast --username root --password 123 --table trade_detail --hive-import --hive-overwrite --hive-table trade_detail --fields-terminated-by '\t'
sqoop import --connect jdbc:mysql://192.168.1.10:3306/itcast --username root --password 123 --table user_info --hive-import --hive-overwrite --hive-table user_info --fields-terminated-by '\t'
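Once both sqoop jobs finish, a quick sanity check from the hive client (assuming the imports succeeded):

hive> show tables;
hive> select count(*) from trade_detail;
hive> select * from user_info limit 5;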


4. Create a result table that stores the result of the following aggregation query (CTAS):
create table result
row format delimited fields terminated by '\t'
as
select t2.account, t2.name, t1.income, t1.expenses, t1.surplus
from user_info t2
join (select account,
             sum(income) as income,
             sum(expenses) as expenses,
             sum(income - expenses) as surplus
      from trade_detail
      group by account) t1
on (t1.account = t2.account);
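The subquery totals income, expenses and surplus (income minus expenses) per account in trade_detail; the outer join then attaches each account's name from user_info, and the whole result is stored as the new table. To inspect it afterwards:

hive> select * from result limit 10;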

IV. Statement Summary

1. Create a user table
create table user (id int, name string) row format delimited fields terminated by '\t';
2. Load data from the local file system into Hive
load data local inpath '/root/user.txt' into table user;
3. Create an external table
create external table stubak (id int, name string) row format delimited fields terminated by '\t' location '/stubak';

4. Create a partitioned table
Difference between a regular table and a partitioned table: when large amounts of data keep being added, create a partitioned table so that queries can be limited to the relevant partitions.
create table book (id bigint, name string) partitioned by (pubdate string) row format delimited fields terminated by '\t'; 

Load data into the partitioned table
load data local inpath './book.txt' overwrite into table book partition (pubdate='2010-08-22');
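The payoff of partitioning shows up when querying: a filter on the partition column only scans the matching directory instead of the whole table. For example, with the book table loaded above:

hive> select * from book where pubdate='2010-08-22';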
