Hive Partitioned Tables


I. Hive Syntax

1.1 Creating a table (MANAGED_TABLE):

create table student(id bigint,name string) row format delimited fields terminated by '\t';

Creating the table writes the following records into the metastore database: a row for the new table in the TBLS table (table name, type MANAGED_TABLE), one row per column in the COLUMNS_V2 table, and the storage descriptor (HDFS location, input/output format) in the SDS table.
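If you want to look at these records yourself, you can query the metastore database directly. A minimal sketch, assuming a MySQL-backed metastore with the default schema (exact columns vary between Hive versions):

-- run against the MySQL database that backs the Hive metastore
SELECT t.TBL_ID, t.TBL_NAME, t.TBL_TYPE, s.LOCATION
FROM TBLS t JOIN SDS s ON t.SD_ID = s.SD_ID
WHERE t.TBL_NAME = 'student';

-- columns of the table, in declaration order
SELECT c.COLUMN_NAME, c.TYPE_NAME, c.INTEGER_IDX
FROM TBLS t
JOIN SDS s ON t.SD_ID = s.SD_ID
JOIN COLUMNS_V2 c ON s.CD_ID = c.CD_ID
WHERE t.TBL_NAME = 'student'
ORDER BY c.INTEGER_IDX;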



1.2 Loading data:
load data local inpath '/root/student.txt' into table student;

1.3 Querying data
hive> select * from student;
OK
1       zhangsan
2       wangwu
3       lisi
NULL    NULL
Time taken: 0.318 seconds, Fetched: 4 row(s)
hive> select * from student limit 2;
OK
1       zhangsan
2       wangwu
Time taken: 0.087 seconds, Fetched: 2 row(s)


Note: neither of these queries triggers a MapReduce job.
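This is because a plain select * (optionally with limit) can be answered by a simple fetch task that just reads the table's files. Which queries qualify is controlled by a Hive setting (shown as a sketch; the property exists since Hive 0.10 and its default differs by version):

hive> set hive.fetch.task.conversion;
hive> set hive.fetch.task.conversion=more;

With the value more, simple select/filter/limit queries are also served without launching MapReduce.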


hive> select sum(id) from student;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1494075968732_0002, Tracking URL = http://heres04:8088/proxy/application_1494075968732_0002/
Kill Command = /heres/hadoop-2.2.0/bin/hadoop job  -kill job_1494075968732_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2017-05-06 21:26:09,717 Stage-1 map = 0%,  reduce = 0%
2017-05-06 21:26:19,428 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.71 sec
2017-05-06 21:26:28,987 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.4 sec
MapReduce Total cumulative CPU time: 3 seconds 400 msec
Ended Job = job_1494075968732_0002
MapReduce Jobs Launched: 
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 3.4 sec   HDFS Read: 232 HDFS Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 400 msec
OK
6
Time taken: 31.431 seconds, Fetched: 1 row(s)


Note: this query does launch a MapReduce job.

1.4 External tables (EXTERNAL_TABLE)
Building an external table: the data exists first (it is already in HDFS), and we then create a table that points at that directory. Note that hadoop filesystem commands can be run from the hive client as well.
Any file placed into that directory becomes queryable from the hive client. This also applies to managed tables: whether a table is managed or external, data dropped into its directory can be queried; partitioned tables are the exception (see section II).
hive> dfs -ls /;
Found 11 items
-rw-r--r--   3 root supergroup         59 2017-04-18 16:55 /a.txt
drwxr-xr-x   - root supergroup          0 2017-04-30 15:15 /hbase
-rw-r--r--   3 root supergroup      27605 2017-04-18 10:44 /install.log
-rw-r--r--   3 root supergroup      27605 2017-04-18 15:47 /log
drwxr-xr-x   - root supergroup          0 2017-04-23 17:31 /sqoop
drwx------   - root supergroup          0 2017-05-04 16:41 /tmp
drwxr-xr-x   - root supergroup          0 2017-05-04 19:14 /user
drwxr-xr-x   - root supergroup          0 2017-04-18 18:46 /wca
drwxr-xr-x   - root supergroup          0 2017-04-18 18:53 /wcc
drwxr-xr-x   - root supergroup          0 2017-04-18 18:09 /wcout
drwxr-xr-x   - root supergroup          0 2017-04-18 19:53 /wcs
hive> dfs -mkdir /data;
hive> dfs -put /root/student.txt /data/a.txt;
hive> dfs -put /root/student.txt /data/b.txt;

hive> create external table ext_student(id int,name string) row format delimited fields terminated by '\t' location '/data' ;

OK
Time taken: 0.338 seconds
hive> select * from ext_student;
OK
1       zhangsan
2       wangwu
3       lisi
NULL    NULL
1       zhangsan
2       wangwu
3       lisi
NULL    NULL
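A hedged aside on what EXTERNAL means here: dropping an external table removes only the metastore entry, while the files under /data remain in HDFS; dropping a managed table such as student would also delete its data directory. A quick way to verify (a sketch; ext_student is not needed later):

hive> drop table ext_student;
hive> dfs -ls /data;

The second command should still show a.txt and b.txt.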


II. Creating a Partitioned Table

hive> create table beauties (id bigint, name string, size double) partitioned by (nation string) row format delimited fields terminated by '\t';

hive> load data local inpath '/root/b.c' into table beauties partition(nation='china'); 


Copying data from file:/root/b.c
Copying file: file:/root/b.c
Loading data to table default.beauties partition (nation=china)
Partition default.beauties{nation=china} stats: [numFiles=1, numRows=0, totalSize=45, rawDataSize=0]
OK

Time taken: 0.739 seconds


hive> select * from beauties;
OK
1       glm     34.0    china
2       lina    30.9    china
3       liu     45.0    china
4       bing    56.56   china
Time taken: 0.139 seconds, Fetched: 4 row(s)
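The load created one HDFS directory per partition under the table's directory; a partition is simply a subdirectory named key=value. To see it from the hive client (a sketch, assuming the default warehouse location /user/hive/warehouse):

hive> dfs -ls /user/hive/warehouse/beauties;

This should list a single entry, /user/hive/warehouse/beauties/nation=china, containing the loaded file.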


hive> dfs -mkdir /beauty/nation=Japan;  
hive> dfs -put /root/b.j /beauty/nation=Japan; 
hive> select * from beauties;
OK
1       glm     34.0    china
2       lina    30.9    china
3       liu     45.0    china
4       bing    56.56   china
Time taken: 0.089 seconds, Fetched: 4 row(s)


Why did the Japan rows not show up? Because the metastore has no record of that partition: the files are in HDFS, but the corresponding partition entry (in the metastore's PARTITIONS table) is missing. It has to be added explicitly:


hive> alter table beauties add partition (nation='Japan') location "/beauty/nation=Japan";   -- register the partition in the metastore
OK
Time taken: 0.185 seconds
hive> select * from beauties;
OK
1       bgyjy   56.6565 Japan
2       jzmb    23.232  Japan
3       ewrwe   43.9    Japan
1       glm     34.0    china
2       lina    30.9    china
3       liu     45.0    china
4       bing    56.56   china

Time taken: 0.104 seconds, Fetched: 7 row(s)


hive> select * from beauties where nation='Japan';
OK
1       bgyjy   56.6565 Japan
2       jzmb    23.232  Japan
3       ewrwe   43.9    Japan

Time taken: 0.253 seconds, Fetched: 3 row(s)
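To check which partitions the metastore now knows about, rather than inferring it from select results, list them directly; after the alter table above this should show both nation=china and nation=Japan:

hive> show partitions beauties;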

III. Importing Data from MySQL and Analyzing It with Hive

1. Add hive to the environment variables
vim /etc/profile
Append to PATH: :/heres/apache-hadoop.../bin
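A minimal sketch of what the profile change might look like (the install directory /heres/apache-hive-0.13.0-bin is only an assumed example; use your actual path):

# lines appended to /etc/profile (install directory is an assumed example)
export HIVE_HOME=/heres/apache-hive-0.13.0-bin
export PATH=$PATH:$HIVE_HOME/bin

After saving, run source /etc/profile so the current shell picks up the change.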

2. Create two tables in hive
create table trade_detail (id bigint, account string, income double, expenses double, time string) row format delimited fields terminated by '\t';
create table user_info (id bigint, account string, name  string, age int) row format delimited fields terminated by '\t';

3. Import the data from MySQL directly into hive (sqoop first writes the data to HDFS, then has hive load it, which is why hive must be on the PATH). Go into the bin directory under the sqoop directory and run the following:
sqoop import --connect jdbc:mysql://192.168.1.10:3306/itcast --username root --password 123 --table trade_detail --hive-import --hive-overwrite --hive-table trade_detail --fields-terminated-by '\t'
sqoop import --connect jdbc:mysql://192.168.1.10:3306/itcast --username root --password 123 --table user_info --hive-import --hive-overwrite --hive-table user_info --fields-terminated-by '\t'
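Once both sqoop jobs finish, a quick sanity check from the hive client (assuming the imports succeeded):

hive> show tables;
hive> select count(*) from trade_detail;
hive> select * from user_info limit 5;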


4. Create a result table that stores the result of the following aggregation query (CTAS):
create table result
row format delimited fields terminated by '\t'
as
select t2.account, t2.name, t1.income, t1.expenses, t1.surplus
from user_info t2
join (select account,
             sum(income) as income,
             sum(expenses) as expenses,
             sum(income - expenses) as surplus
      from trade_detail
      group by account) t1
on (t1.account = t2.account);
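The subquery totals income, expenses and surplus (income minus expenses) per account in trade_detail; the outer join then attaches each account's name from user_info, and the whole result is stored as the new table. To inspect it afterwards:

hive> select * from result limit 10;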

IV. Statement Summary

1. Create a user table
create table user (id int, name string) row format delimited fields terminated by '\t';
2. Load data from the local file system into Hive
load data local inpath '/root/user.txt' into table user;
3. Create an external table
create external table stubak (id int, name string) row format delimited fields terminated by '\t' location '/stubak';

4. Create a partitioned table
Difference between a regular table and a partitioned table: when large amounts of data keep being added, create a partitioned table so that queries can be limited to the relevant partitions.
create table book (id bigint, name string) partitioned by (pubdate string) row format delimited fields terminated by '\t'; 

Load data into the partitioned table
load data local inpath './book.txt' overwrite into table book partition (pubdate='2010-08-22');
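The payoff of partitioning shows up when querying: a filter on the partition column only scans the matching directory instead of the whole table. For example, with the book table loaded above:

hive> select * from book where pubdate='2010-08-22';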
