6-Hive Partitions

Source: Internet  Editor: 程序博客网  Time: 2024/06/11 12:05

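1. Partition concepts

Hive organizes tables into partitions. Partitioning is a mechanism for coarsely dividing a table based on the values of its partition columns.
Using partitions speeds up queries that touch only a slice of the data.
Partitions are defined when the table is created, using the PARTITIONED BY clause.
The clause takes a list of column definitions.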
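
As a minimal sketch of the idea, assume a hypothetical table logs partitioned by a date column and a country column (the table and its column names are illustrative and do not come from the original text). A query that filters on the partition columns only has to read the matching subdirectories:

```sql
-- Hypothetical table, used only for illustration.
CREATE TABLE IF NOT EXISTS logs (ts BIGINT, line STRING)
PARTITIONED BY (dt STRING, country STRING);

-- The filter on dt and country lets Hive skip every other partition
-- (partition pruning) instead of scanning the whole table.
SELECT ts, line
FROM logs
WHERE dt = '2024-06-11' AND country = 'CN';
```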

2. Static partitions:

2.1 Creating partitions

hive (test)> create table linux (id int, context string)
           > partitioned by (date_code int)
           > row format delimited
           > fields terminated by '#'
           > stored as textfile;
OK
Time taken: 0.41 seconds
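
To check that the table was created with the intended partition column, statements like the following can be run against the linux table. This is only a sketch; the exact output layout depends on the Hive version.

```sql
-- Lists the regular columns followed by the partition column(s).
DESCRIBE linux;

-- Lists the partitions registered in the metastore; this is empty
-- until data is loaded or a partition is added.
SHOW PARTITIONS linux;
```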

2.2 Loading data

# Specify the partition when inserting data from another table
hive (test)> insert into table linux partition(date_code=4) select * from syslog;

# Specify the partition when loading data from a file
hive (test)> load data local inpath '/home/saligia/tmp/syslog1.bak' into table linux partition(date_code=3);

# Manually create a partition directory on HDFS and put data into it
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/mysql/warehouse/test.db/linux/date_code=2
$ $HADOOP_HOME/bin/hadoop fs -put syslog2.bak /user/hive/mysql/warehouse/test.db/linux/date_code=2
hive (test)> select id from linux where date_code = 2;

# The query above returns no rows, because the partition has not been registered in the metastore
hive (test)> alter table linux add partition(date_code = 2);

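Notes:

At the file system level, a partition is simply a subdirectory nested under the table directory.
The columns defined in the PARTITIONED BY clause are full-fledged columns of the table, called "partition columns". However, the data files do not contain values for these columns, because the values are derived from the directory names.
A directory created manually on HDFS is not visible to queries right away; it must first be registered in the metastore metadata.
This can be done with alter table add partition(…);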
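
When several partition directories were created by hand, adding them one at a time becomes tedious. Hive also provides MSCK REPAIR TABLE, which scans the table directory and registers partition directories that are missing from the metastore. The sketch below assumes the linux table and the date_code=2 directory from the steps above; behavior may vary between Hive versions and deployments.

```sql
-- Register every partition directory under the table location
-- that the metastore does not yet know about.
MSCK REPAIR TABLE linux;

-- Confirm that date_code=2 is now listed.
SHOW PARTITIONS linux;

-- The partition column can be selected like any other column,
-- even though its value comes from the directory name.
SELECT id, context, date_code FROM linux WHERE date_code = 2 LIMIT 5;
```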

2.3 Dynamic partitions:

Prerequisites:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
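
The nonstrict mode allows every partition column to be resolved dynamically. In strict mode, at least one partition column must be given a static value. The following sketch shows a mixed static and dynamic insert; the tables logs and raw_logs and their columns are hypothetical and serve only to illustrate a two-level partition.

```sql
-- Hypothetical tables, for illustration only.
CREATE TABLE IF NOT EXISTS logs (ts BIGINT, line STRING)
PARTITIONED BY (dt STRING, country STRING);

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=strict;

-- dt is static and country is dynamic. Static partition columns must
-- come before dynamic ones in the PARTITION clause.
-- raw_logs is assumed to have the columns ts, line, and country.
INSERT INTO TABLE logs PARTITION (dt = '2024-06-11', country)
SELECT ts, line, country FROM raw_logs;
```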

Creating partitions:

The CREATE statement is no different from the one used for static partitions.

hive (test)> create table dyn_linux(id int, context string)
           > partitioned by (date_code int)
           > row format delimited
           > fields terminated by '#'
           > stored as textfile;
OK
Time taken: 0.263 seconds

Importing data:

When importing data, the partition clause is followed by the partition column name rather than a value.

hive (test)> insert into table dyn_linux partition(date_code) select id,context,date_code from linux;
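
In a dynamic partition insert, the partition column is matched by position rather than by name, so it has to come last in the SELECT list, as date_code does above. One way to check the result, sketched here against the dyn_linux table, is to list the partitions that were created and count the rows in each:

```sql
-- Each distinct date_code value in the source should now appear
-- as its own partition of dyn_linux.
SHOW PARTITIONS dyn_linux;

-- Row counts per partition, to compare against the source table.
SELECT date_code, count(*) AS row_count
FROM dyn_linux
GROUP BY date_code;
```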

0 0