Dajiangtai on Hive (Later Series, Part 2)

Source: Internet. Editor: 程序博客网. Time: 2024/04/30 11:09

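Bucket operations

A Hive table can be divided into partitions and buckets (BUCKET). Bucketing is declared with the CLUSTERED BY clause of a table or partition, and the data inside each bucket can be ordered with SORTED BY.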

Buckets serve two main purposes:

1) Data sampling;

2) Speeding up certain queries, for example a Map-Side Join.

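Note in particular that CLUSTERED BY and SORTED BY do not affect how data is loaded. This means the user is responsible for loading the data correctly, including bucketing and sorting it. Running 'set hive.enforce.bucketing=true' lets Hive automatically set the number of reducers in the final stage to match the number of buckets. Alternatively, the user can set mapred.reduce.tasks manually to match the bucket count. The first approach is recommended: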

hive> set hive.enforce.bucketing=true;
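To make the loading requirement concrete, here is a minimal Python sketch of what a bucketed, sorted write has to produce. It assumes that the bucket of a row is the hash of the CLUSTERED BY column modulo the bucket count, and that the hash of an INT value is the value itself; the function names and the sample rows are hypothetical and only serve the illustration.

```python
def bucket_for(id_value: int, num_buckets: int) -> int:
    """Return the bucket index for an INT clustering column.

    Assumption: the hash of an INT is the value itself, masked to be
    non-negative before the modulus is taken.
    """
    return (id_value & 0x7FFFFFFF) % num_buckets


def write_buckets(rows, num_buckets):
    """Distribute (id, age, name) rows into buckets, then sort each by age."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_for(row[0], num_buckets)].append(row)
    # Mirrors "sorted by (age)": ordering holds inside each bucket only.
    return [sorted(bucket, key=lambda r: r[1]) for bucket in buckets]


# Hypothetical (id, age, name) rows, not taken from the article's tables.
rows = [(1, 30, "zhou"), (2, 30, "yan"), (3, 20, "chen"), (4, 80, "li")]
buckets = write_buckets(rows, 2)
for index, bucket in enumerate(buckets):
    print(index, bucket)
```

Each inner list corresponds to one bucket file. With the setting enabled, Hive does this distribution itself by running one reducer per bucket.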

A worked example follows.

1) Create a temporary table student_tmp and load data into it.

hive> desc student_tmp;
hive> select * from student_tmp;

2) Create the student table.

hive> create table student(id int, age int, name string)
    > partitioned by (stat_date string)
    > clustered by (id) sorted by (age) into 2 buckets
    > row format delimited fields terminated by ',';

3) Set the configuration property.

hive> set hive.enforce.bucketing=true;

4) Insert the data.

hive> from student_tmp
    > insert overwrite table student partition(stat_date='2015-01-19')
    > select id,age,name where stat_date='2015-01-18' sort by age;

5) Inspect the file directory.

$ hadoop fs -ls /usr/hive/warehouse/student/stat_date=2015-01-19/

6) View the sampled data.

hive> select * from student tablesample(bucket 1 out of 2 on id);

tablesample is the sampling clause. Its syntax is:

tablesample(bucket x out of y)

y must be a multiple or a factor of the table's total number of buckets.
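The selection rule can be sketched in Python as follows. This is an illustration under assumptions, not Hive's implementation: it assumes a row belongs to sample bucket x when the hash of the sampling column modulo y equals x - 1, that the hash of an INT is the value itself, and it uses made-up rows.

```python
def tablesample(rows, x, y, key=lambda row: row[0]):
    """Keep the rows that fall into bucket x (1-based) out of y buckets."""
    if not 1 <= x <= y:
        raise ValueError("x must be between 1 and y")
    return [row for row in rows if (key(row) & 0x7FFFFFFF) % y == x - 1]


# Hypothetical (id, age, name) rows with ids 1..8.
rows = [(i, 20 + i, f"name{i}") for i in range(1, 9)]

sample_half = tablesample(rows, 1, 2)      # about 1/2 of the rows
sample_quarter = tablesample(rows, 1, 4)   # about 1/4 of the rows

print([row[0] for row in sample_half])
print([row[0] for row in sample_quarter])
```

Under this rule the sample covers roughly 1/y of the table, which is why y is tied to the bucket count: when y equals the number of buckets, the sample corresponds to exactly one bucket file.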

Hive complex types

Hive provides the following complex data types:

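1) Structs: the fields inside a struct are accessed with dot (.) notation. For example, if column c has type STRUCT{a INT; b INT}, field a is accessed as c.a.

2) Map (key-value pairs): a value is accessed with ['key name']. For example, if a map M contains a 'group' -> gid pair, the gid value is retrieved with M['group'].

3) Array: all elements of an array have the same type, and indexes start at 0. For example, if array A holds ['a','b','c'], then A[1] is 'b'.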
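For readers who think in a general-purpose language, the three access patterns map onto familiar Python structures. The sketch below is only an analogy; the field values and the gid 1001 are made up.

```python
from typing import NamedTuple


class C(NamedTuple):
    """Stands in for a column of type STRUCT{a INT; b INT}."""
    a: int
    b: int


c = C(a=1, b=2)          # struct: fields read with dot notation
M = {"group": 1001}      # map: values read by key (1001 is a made-up gid)
A = ["a", "b", "c"]      # array: same-typed elements, 0-based index

print(c.a, M["group"], A[1])
```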

1. Using Struct

1) Create the table

hive> create table student_test(id INT, info struct<name:STRING, age:INT>)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    > COLLECTION ITEMS TERMINATED BY ':';

'FIELDS TERMINATED BY': the separator between fields. 'COLLECTION ITEMS TERMINATED BY': the separator between the items inside one field.

2) Load the data

$ cat test5.txt
1,zhou:30
2,yan:30
3,chen:20
4,li:80

hive> LOAD DATA LOCAL INPATH '/home/hadoop/djt/test5.txt' INTO TABLE student_test;

3) Query the data

hive> select info.age from student_test;
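As a rough illustration of how the two delimiters carve up each line of test5.txt, the following Python sketch parses the same rows and reads the age field. The parser is a hypothetical stand-in for Hive's delimited row format, not its actual code.

```python
from typing import NamedTuple


class Info(NamedTuple):
    """Stands in for struct<name:STRING, age:INT>."""
    name: str
    age: int


def parse_student(line, field_sep=",", item_sep=":"):
    """Split a line into id and info, then split info into its items."""
    id_text, info_text = line.split(field_sep, 1)
    name, age = info_text.split(item_sep)
    return int(id_text), Info(name, int(age))


lines = ["1,zhou:30", "2,yan:30", "3,chen:20", "4,li:80"]
students = [parse_student(line) for line in lines]

# Equivalent in spirit to: select info.age from student_test;
ages = [info.age for _, info in students]
print(ages)  # [30, 30, 20, 80]
```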

2. Using Array

1) Create the table

hive> create table class_test(name string, student_id_list array<INT>)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ','
    > COLLECTION ITEMS TERMINATED BY ':';

2) Load the data

$ cat test6.txt
034,1:2:3:4
035,5:6
036,7:8:9:10

hive> LOAD DATA LOCAL INPATH '/home/work/data/test6.txt' INTO TABLE class_test;

3) Query

hive> select student_id_list[3] from class_test;
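Index 3 is the fourth element, and the row for 035 has only two. The Python sketch below parses the rows of test6.txt and models an out-of-range index as None, on the assumption that Hive returns NULL in that case. The helper names are hypothetical.

```python
def parse_class(line, field_sep=",", item_sep=":"):
    """Split a line into the class name and a list of INT student ids."""
    name, ids_text = line.split(field_sep, 1)
    return name, [int(item) for item in ids_text.split(item_sep)]


def element_at(values, index):
    """Return values[index], or None (standing in for NULL) if out of range."""
    return values[index] if 0 <= index < len(values) else None


lines = ["034,1:2:3:4", "035,5:6", "036,7:8:9:10"]
classes = [parse_class(line) for line in lines]

# Equivalent in spirit to: select student_id_list[3] from class_test;
fourth = [element_at(ids, 3) for _, ids in classes]
print(fourth)  # [4, None, 10]
```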

3. Using Map

1) Create the table

hive> create table employee(id string, perf map<string, int>)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '\t'
    > COLLECTION ITEMS TERMINATED BY ','
    > MAP KEYS TERMINATED BY ':';

'MAP KEYS TERMINATED BY': the separator between a key and its value.

2) Load the data (the id and perf columns are separated by a tab)

$ cat test7.txt
1       job:80,team:60,person:70
2       job:60,team:80
3       job:90,team:70,person:100

hive> LOAD DATA LOCAL INPATH '/home/work/data/test7.txt' INTO TABLE employee;

3) Query

hive> select perf['person'] from employee;
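The three delimiters nest: a tab separates the columns, commas separate the map entries, and a colon separates each key from its value. The Python sketch below parses the rows of test7.txt that way and models a missing key as None, on the assumption that Hive returns NULL for a key that is absent. The parser is a hypothetical illustration.

```python
def parse_employee(line, field_sep="\t", item_sep=",", key_sep=":"):
    """Split a line into the employee id and a dict of performance scores."""
    emp_id, perf_text = line.split(field_sep, 1)
    perf = {}
    for item in perf_text.split(item_sep):
        key, value = item.split(key_sep, 1)
        perf[key] = int(value)
    return emp_id, perf


lines = [
    "1\tjob:80,team:60,person:70",
    "2\tjob:60,team:80",
    "3\tjob:90,team:70,person:100",
]
employees = [parse_employee(line) for line in lines]

# Equivalent in spirit to: select perf['person'] from employee;
person = [perf.get("person") for _, perf in employees]
print(person)  # [70, None, 100]
```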

0 0