Notes on Hive Table Join Operations

Source: Internet | Published by: 银联数据待遇 | Editor: 程序博客网 | Time: 2024/04/30 11:15

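1. Hive JOIN supports only equi-joins (this applies to versions before 2.2.0; later versions accept more complex join conditions in the ON clause).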

2. Hive JOIN (at least before version 2.2.0) does not support the OR predicate in the ON clause.

3. A partition filter placed in the ON clause does not take effect in an outer join, although it does work in an inner join.
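
One way to avoid depending on how ON-clause filters behave is to filter each table in a subquery before the join. The following is a minimal sketch only: it assumes both tables are partitioned by a hypothetical column dt, which does not appear in the original examples.

```sql
-- Assumption: vod_test and device are both partitioned by a hypothetical column dt.
-- Each side is restricted to one partition before the join,
-- so the filter applies regardless of the join type.
SELECT v.dev_mac, d.dev_mac AS matched_mac
FROM (SELECT dev_mac FROM vod_test WHERE dt = '2024-04-30') v
LEFT OUTER JOIN
     (SELECT dev_mac FROM device WHERE dt = '2024-04-30') d
ON v.dev_mac = d.dev_mac;
```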

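4. Hive does not support IN and NOT IN subqueries (this applies to versions before 0.13; later versions accept them in the WHERE clause).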

For IN:

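In Hive, IN can be implemented with LEFT SEMI JOIN. Note, however, that with this approach the SELECT and WHERE clauses cannot reference columns of the right-hand table.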

Example: select distinct v.dev_mac from vod_test v left semi join device d on v.dev_mac=d.dev_mac;

It can of course also be implemented with LEFT OUTER JOIN:

select distinct v.dev_mac from vod_test v left outer join device d on v.dev_mac=d.dev_mac where d.dev_mac is not null;

For NOT IN:

In Hive, NOT IN can be implemented with LEFT OUTER JOIN:

select distinct v.dev_mac from vod_test v left outer join device d on v.dev_mac=d.dev_mac where d.dev_mac is null;
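
On Hive 0.13 or later, where IN and NOT IN subqueries are accepted in the WHERE clause, the same two queries can be written directly. This is a sketch that assumes such a version. Under standard SQL semantics the NOT IN form can behave differently from the join rewrite when dev_mac contains NULLs, so the subquery below filters them out explicitly.

```sql
-- Assumes Hive 0.13 or later; on older versions use the join rewrites instead.

-- IN: dev_mac values in vod_test that also appear in device
SELECT DISTINCT v.dev_mac
FROM vod_test v
WHERE v.dev_mac IN (SELECT d.dev_mac FROM device d);

-- NOT IN: dev_mac values in vod_test that do not appear in device
SELECT DISTINCT v.dev_mac
FROM vod_test v
WHERE v.dev_mac NOT IN (SELECT d.dev_mac FROM device d WHERE d.dev_mac IS NOT NULL);
```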

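5. A JOIN written without an ON condition is executed by Hive as a Cartesian product. Unlike an RDBMS, Hive does not optimize such a query into an inner join, and the query cannot be executed in parallel, so it is very slow. Take particular care to avoid this.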
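
The contrast can be sketched with the tables used earlier. Whether the first form is really executed as a full Cartesian product depends on the Hive version and its optimizer settings, so treat this as an illustration of the pattern to avoid, not as a statement about every release.

```sql
-- Pattern to avoid: the join condition appears only in WHERE.
-- Older Hive versions may build the full Cartesian product first and filter afterwards.
SELECT DISTINCT v.dev_mac
FROM vod_test v JOIN device d
WHERE v.dev_mac = d.dev_mac;

-- Preferred: the join condition is stated in ON, so Hive plans an ordinary equi-join.
SELECT DISTINCT v.dev_mac
FROM vod_test v JOIN device d
ON v.dev_mac = d.dev_mac;
```

In Hive versions that honor it, setting hive.mapred.mode=strict makes Hive reject Cartesian products outright. Strict mode also imposes other restrictions, so check them before enabling it.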

6. Hive map-side join

Hive can load a relatively small table into memory and perform the join in the map phase.

Example: select /*+ mapjoin(d) */ distinct v.dev_mac from vod_test v left outer join device d on v.dev_mac=d.dev_mac;

This hint syntax has been deprecated since v0.7, although it still works. Two properties can be set instead:

set hive.auto.convert.join=true; (default false in Hive 0.7 through 0.10; true from 0.11 onward)

set hive.mapjoin.smalltable.filesize=25000000; (the default value, in bytes)

The first property enables the feature, so that Hive automatically converts a join into a map-side join when appropriate.

The second property sets the maximum size of a small table that qualifies for this optimization.

Note that Hive does not support this optimization for RIGHT OUTER JOIN and FULL OUTER JOIN.
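
Put together, a session that relies on automatic conversion instead of the hint might look like the sketch below. It assumes the device table is smaller than the configured threshold. EXPLAIN is used only to inspect the plan, and the exact operator names in its output vary by Hive version.

```sql
-- Enable automatic conversion to a map-side join.
SET hive.auto.convert.join=true;
-- Tables below this size (in bytes) are treated as small tables.
SET hive.mapjoin.smalltable.filesize=25000000;

-- No hint is needed; Hive decides whether device is small enough.
SELECT DISTINCT v.dev_mac
FROM vod_test v
LEFT OUTER JOIN device d
ON v.dev_mac = d.dev_mac;

-- Inspect the plan to see whether a map join was chosen.
EXPLAIN
SELECT DISTINCT v.dev_mac
FROM vod_test v
LEFT OUTER JOIN device d
ON v.dev_mac = d.dev_mac;
```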

7. ORDER BY, SORT BY, DISTRIBUTE BY, and CLUSTER BY

ORDER BY: uses a single reducer, so it is slow, but the result is globally ordered.

SORT BY: the data within each reducer is ordered.

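DISTRIBUTE BY: controls how rows are divided among the reducers. By default, rows are distributed by the hash of the map output key, so with SORT BY alone the outputs of different reducers overlap noticeably, at least in terms of sort order. To send all rows with the same value in a column to the same reducer, use DISTRIBUTE BY on that column.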

Hive requires DISTRIBUTE BY to be written before SORT BY.

CLUSTER BY: if SORT BY and DISTRIBUTE BY use the same columns, CLUSTER BY can be used in their place.
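
The four clauses can be compared side by side on the vod_test table. This is a sketch only: the column play_time is hypothetical and is included just to give the queries a second sort key.

```sql
-- Assumption: vod_test has a hypothetical column play_time.

-- Global order: all rows pass through a single reducer.
SELECT dev_mac, play_time FROM vod_test ORDER BY dev_mac;

-- Per-reducer order: each reducer's output is sorted, the overall output is not.
SELECT dev_mac, play_time FROM vod_test SORT BY dev_mac;

-- Rows with the same dev_mac go to the same reducer, then are sorted within it.
-- DISTRIBUTE BY must come before SORT BY.
SELECT dev_mac, play_time FROM vod_test
DISTRIBUTE BY dev_mac
SORT BY dev_mac, play_time;

-- Shorthand for DISTRIBUTE BY dev_mac SORT BY dev_mac.
SELECT dev_mac, play_time FROM vod_test CLUSTER BY dev_mac;
```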

8. The cast() type conversion function

For example:

cast(salary as float)
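
In a full query the expression might be used as follows. The employees table and its name column are hypothetical, and salary is assumed to be stored as a string. When a value cannot be converted, cast() returns NULL instead of raising an error, so such rows simply fail the comparison.

```sql
-- Assumption: employees(name STRING, salary STRING) is a hypothetical table.
-- Rows whose salary cannot be parsed as a number yield NULL and are filtered out.
SELECT name, salary
FROM employees
WHERE cast(salary AS FLOAT) < 100000.0;
```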

9. Sampling queries

Example: select * from device tablesample(bucket 1 out of 1000 on rand());
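
Because rand() produces different values on every run, the query above returns a different sample each time. Bucketing on a column instead gives a repeatable sample, as in this sketch (the bucket numbers are arbitrary):

```sql
-- Repeatable sample: rows are assigned to buckets by the hash of dev_mac,
-- so the same rows are returned on every run.
SELECT * FROM device TABLESAMPLE(BUCKET 3 OUT OF 10 ON dev_mac);
```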

10. Block sampling

select * from device tablesample(0.1 percent);

The smallest sampling unit is one block, so when a file is smaller than the block size, all of its data is still returned.
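
Besides a percentage, the Hive language manual also describes sampling by total input size and by row count. The sketch below assumes a Hive version that includes these forms (they were added after percentage sampling). The row-count form takes the given number of rows from each input split, not from the table as a whole.

```sql
-- Sample roughly 100 MB of input (still rounded to whole blocks).
SELECT * FROM device TABLESAMPLE(100M);

-- Take the first 10 rows from each input split.
SELECT * FROM device TABLESAMPLE(10 ROWS);
```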

0 0