Bucketed tables and partitioned tables

Source: Internet. Editor: 程序博客网. Time: 2024/05/22 06:40

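Partitioned tables

How do you create a partitioned table? Take an ordinary create table statement and add a partitioned by clause naming the partition column after the column list, e.g.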

    create table tblName (
      id int comment 'ID',
      name string comment 'name'
    ) partitioned by (dt date comment 'create time')
    row format delimited
    fields terminated by '\t';

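Dynamic partitions

With dynamic partitioning you do not create the partitions in advance; Hive creates the partition directories itself when the data is written.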
create table if not exists mytable09
(id int,
name String)
partitioned by(class String)
row format delimited fields terminated by "\t"
;
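Data cannot be put into dynamic partitions with load data; it can only be written into the partitioned table with insert into ... select ...

This requires a staging table (the source table): upload the raw data into that table first.

There are two modes, selected with hive.exec.dynamic.partition.mode:
strict: not every partition column may be dynamic (at least one static partition must be specified)
nonstrict: all partition columns may be dynamic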
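
As a rough sketch of the flow described above (staging table first, then insert ... select), the statements below enable dynamic partitioning and let Hive create one partition per distinct class value. The staging table name mytable09_src and the local file path are hypothetical and only there for illustration; hive.exec.dynamic.partition and hive.exec.dynamic.partition.mode are the standard Hive settings for this feature.

```sql
-- Enable dynamic partitioning; nonstrict lets every partition column be dynamic.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- Hypothetical staging (source) table: the data columns plus the partition column.
create table if not exists mytable09_src
(id int,
name string,
class string)
row format delimited fields terminated by "\t";

-- Hypothetical local file containing tab-separated id, name, class.
load data local inpath '/tmp/students.txt' into table mytable09_src;

-- The dynamic partition column (class) must come last in the select list.
insert into table mytable09 partition(class)
select id, name, class from mytable09_src;
```

Running show partitions mytable09; afterwards should list one partition for each class value found in the staging table.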

How do you load data into a (static) partition?

load data local inpath linux_fs_path into table tblName partition(dt='2015-12-12');

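Some common partition operations:

Query the data in one partition: select * from tblName where dt='2015-12-13'; (the partition column behaves like one more condition in the where clause)

Create a partition manually: alter table tblName add partition(dt='2015-12-13');

List the partitions of a table: show partitions tblName;

Drop a partition (for a managed table the data is deleted along with it): alter table tblName drop partition(dt='2015-12-12');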

How do you create a table with several partition columns?

It works much like the single-partition case:

    create table tblName (
      id int comment 'ID',
      name string comment 'name'
    ) partitioned by (year int comment 'admission year', school string comment 'school name')
    row format delimited
    fields terminated by '\t';
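
For illustration, loading into and querying a table with two partition columns might look like the sketch below. It assumes the table defined above; the local file path is hypothetical. Both partition columns have to be given in the partition clause.

```sql
-- Hypothetical local path; each (year, school) pair maps to its own
-- nested directory, e.g. year=2015/school=crxy under the table directory.
load data local inpath '/tmp/crxy_2015.txt'
into table tblName partition(year=2015, school='crxy');

-- Filtering on the partition columns restricts the scan to that partition.
select * from tblName where year=2015 and school='crxy';
```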

A partition can also be pointed at data that already sits on HDFS:

alter table tblName partition(year='2015', school='crxy') set location hdfs_uri;

Note:

The partition must already exist, and the location must be an absolute HDFS path.
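
A minimal sketch of the two requirements in the note, assuming the multi-partition table from earlier. The HDFS URI is a made-up placeholder; substitute your own NameNode host, port and path.

```sql
-- The partition has to exist before its location can be changed.
alter table tblName add if not exists partition(year=2015, school='crxy');

-- Hypothetical absolute HDFS URI (scheme, host and port included).
alter table tblName partition(year=2015, school='crxy')
set location 'hdfs://namenode:8020/data/students/2015/crxy';
```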

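Bucketed tables

A bucketed table hashes the bucketing column of each row and stores the rows in different files according to the result. If you look at the contents of each bucket file, you can see that the target bucket is determined by taking the hash value modulo the number of buckets.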
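
To make the modulo rule concrete, the query below is an illustrative sketch only. It assumes a hypothetical source table tbl_other with an int column id, a 3-bucket layout, and the older hashing scheme in which an int hashes to its own value; newer Hive versions may use a different hash function for bucketing, so the numbers need not match the actual bucket files.

```sql
-- Hypothetical source table tbl_other with an int column id.
-- pmod keeps the result in the range 0..2 for a 3-bucket table.
select id, pmod(hash(id), 3) as bucket_no
from tbl_other;
```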

How do you create a bucketed table?

create table tblName_bucket(id int) clustered by (id) into 3 buckets;

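Explanation:

clustered by: the column used for bucketing

into x buckets: split the data into x buckets

How do you load data?

load data cannot be used for this, because it only moves files into place without bucketing them; the data has to be inserted from another table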

insert into table tblName_bucket select * from tbl_other;

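Note: before inserting, bucketing enforcement has to be switched on, otherwise the inserted data will not be split into buckets! (On Hive 2.0 and later bucketing is always enforced, so this setting is only needed on older versions.)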

set hive.enforce.bucketing=true;

Main uses of bucketed tables:

Data sampling

Speeding up certain queries, such as joins on the bucketing column
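
As a small example of the sampling use case, Hive's tablesample clause can read a single bucket instead of the whole table. The sketch assumes the 3-bucket table tblName_bucket created above.

```sql
-- Read only the first of the 3 buckets, using id as the bucketing column.
select * from tblName_bucket tablesample(bucket 1 out of 3 on id);
```

Because the table is already bucketed on id into 3 buckets, this sample can be served from one bucket file rather than a full scan.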

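Note:

Pay particular attention to this: clustered by and sorted by in the table definition do not affect how data is inserted. The user is therefore responsible for inserting the data correctly, including its bucketing and sorting.

'set hive.enforce.bucketing = true' automatically adjusts the number of reduce tasks to match the number of buckets.

Alternatively, you can set mapred.reduce.tasks yourself to match the number of buckets.

Using 'set hive.enforce.bucketing = true' is the recommended approach.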
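
For comparison, the manual alternative mentioned above might look like the following sketch. It assumes the 3-bucket table tblName_bucket from earlier and a hypothetical source table tbl_other that has an id column.

```sql
-- Manual alternative: match the reducer count to the 3 buckets by hand.
set mapred.reduce.tasks=3;

-- cluster by id distributes rows to reducers by id,
-- so each reducer writes one bucket file.
insert overwrite table tblName_bucket
select id from tbl_other cluster by id;
```

If the reducer count and the bucket count drift apart, the files no longer line up with the declared buckets, which is why the automatic setting is the safer choice where it is available.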
