Hive-分桶与排序

来源:互联网 发布:艾弗森对乔丹比赛数据 编辑:程序博客网 时间:2024/05/21 14:51

创建带分桶的表

建表语句

 CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)  COMMENT 'A bucketed copy of user_info'  PARTITIONED BY(ds STRING)  CLUSTERED BY(user_id) INTO 256 BUCKETS;

以上分桶表以字段user_id进行分桶,以下说明如何在分桶的表中导入分桶数据

插入数据

  • 方法1:设置hive.enforce.bucketing = true
    set hive.enforce.bucketing = true;
 FROM user_id INSERT OVERWRITE TABLE user_info_bucketed PARTITION (ds='2009-02-25') SELECT userid, firstname, lastname WHERE ds='2009-02-25';

hive.enforce.bucketing = true会在insert的过程中根据bucket的大小(256),bucket依赖的字段(user_id),自动设置reduce的个数(256),否则,你需要自己手工写如下语句进行bucket表数据的导入

  • 方法2:手工设置reduce的个数,并且对分桶字段进行cluster by
    set mapred.reduce.tasks=256;
 FROM user_id INSERT OVERWRITE TABLE user_info_bucketed PARTITION (ds='2009-02-25') SELECT userid,firstname,lastname WHERE ds='2009-02-25' cluster by user_id

分桶+sorted表

 CREATE TABLE page_view(viewTime INT, userid BIGINT,    page_url STRING, referrer_url STRING,    ip STRING COMMENT 'IP Address of the User') COMMENT 'This is the page view table' PARTITIONED BY(dt STRING, country STRING) CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS ROW FORMAT DELIMITED  FIELDS TERMINATED BY '\001'  COLLECTION ITEMS TERMINATED BY '\002'  MAP KEYS TERMINATED BY '\003' STORED AS SEQUENCEFILE;

抽样查询

针对分桶表进行抽样

  • 语法:
table_sample: TABLESAMPLE (BUCKET x OUT OF y [ON colname])

注意点:
TABLESAMPLE语句可以跟在select…from…之后
桶的编号x从1开始计数
colname代表从哪一列对表进行抽样,可以是原始表中的任一列,或者是rand(),rand()代表对整行抽样,而不是某一列。

  • 举例:
建表语句 查询语句 效果说明 create table () clustered by (user_id) into 32 buckets select * from .. TABLESAMPLE(BUCKET 3 OUT OF 16 ON user_id)
select * from .. TABLESAMPLE(BUCKET 3 OUT OF 64 ON user_id)
select * from .. TABLESAMPLE(BUCKET 3 OUT OF 32 ON user_id)
    建表语句有 clustered by,查询时sample字段与create的cluster by字段一致,可以直接返回对应桶的数据
    取32/16=2个桶,分别是第3,3+16=19个桶
    取32/64=1/2个桶,第3个桶的一半
    取32/32=1个桶,第3个
同上 select * from .. TABLESAMPLE(BUCKET 3 OUT OF 32 ON user_name)
select * from .. TABLESAMPLE(BUCKET 3 OUT OF 32 ON rand()) 查询时sample字段与create的cluster by字段一致,需进行全表扫描,然后确认抽样数据 ,其中,rand(),是根据整条记录进行抽样 create table () select * from .. TABLESAMPLE(BUCKET 3 OUT OF 32 ON id) 建表时没有分桶,查询时需要全表扫描

参考url

  • LanguageManual DDL BucketedTables
  • Bucketed Sorted Tables
  • https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling
0 0
原创粉丝点击