Hive Data Queries: Complex SQL

Source: Internet | Editor: 程序博客网 | Time: 2024/06/05 16:17

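Sorting and Aggregating

When the data set is small, order by is all you need. Because order by produces a total ordering, the final sort has to run in a single reducer.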

from records

select year,temperature

order by year asc,temperature desc;

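If the data set is large and a total ordering is not required, it is enough to sort the rows within each reducer. In the query below, distribute by sends all rows with the same year to the same reducer, and sort by then orders the rows inside each reducer.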

from records

select year,temperature

distribute by year

sort by year asc,temperature desc;
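To make the difference from a total sort concrete, here is a minimal plain-Python sketch of what distribute by followed by sort by does. It is only a simulation, not Hive's implementation: the sample rows are hypothetical, and `int(year) % num_reducers` stands in for Hive's real hash partitioner.

```python
from collections import defaultdict

# Hypothetical sample rows: (year, temperature). Not taken from the post's data.
records = [
    ("1950", 0), ("1950", 22), ("1950", -11),
    ("1949", 111), ("1949", 78),
]

def distribute_and_sort(rows, num_reducers):
    """Mimic: distribute by year sort by year asc, temperature desc."""
    partitions = defaultdict(list)
    for year, temperature in rows:
        # distribute by: rows with the same year land on the same reducer.
        # int(year) % n is an assumed stand-in for Hive's hash partitioner.
        partitions[int(year) % num_reducers].append((year, temperature))
    # sort by: each reducer sorts only its own rows (year asc, temperature desc).
    return {
        reducer: sorted(part, key=lambda r: (r[0], -r[1]))
        for reducer, part in partitions.items()
    }

result = distribute_and_sort(records, num_reducers=2)
```

Each reducer's output is sorted, but concatenating the outputs of all reducers does not necessarily give a globally sorted result, which is exactly what separates sort by from order by.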

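When the distribute by and sort by columns are the same and the default ascending order is acceptable, cluster by is a shorthand for both: rows with the same value go to the same reducer and are sorted in ascending order there.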

from records

select year,temperature

cluster by year;

--------------------------------------------------

from records

select year,temperature

cluster by year,temperature;

MapReduce Scripts

Joins

Inner Joins

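Hive compiles a join into a MapReduce job. The mappers emit the columns of the join condition as the output key and the source record as the value; the reducers then combine the records from the two tables that share the same key to produce the joined rows.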
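The following plain-Python sketch illustrates this reduce-side join idea. It is a simplified model under stated assumptions, not Hive's actual code: the temperatures are hypothetical sample values, and the tags "L" and "R" are illustrative markers for the source table.

```python
from collections import defaultdict

# Hypothetical rows. records: (year, temperature); records2: (year, name).
records = [("1990", 10), ("1990", 25), ("1991", 7), ("2000", 3)]
records2 = [("1990", "ruishenh0"), ("1991", "ruishenh1"), ("1992", "ruishenh2")]

def map_phase(rows, tag):
    # Emit (join key, (source tag, full record)) for every input row.
    for row in rows:
        yield row[0], (tag, row)

def reduce_side_join(left, right):
    # Shuffle: group the values from both mappers by join key.
    groups = defaultdict(list)
    for key, value in list(map_phase(left, "L")) + list(map_phase(right, "R")):
        groups[key].append(value)
    # Reduce: pair every left record with every right record of the same key.
    joined = []
    for key in sorted(groups):
        lefts = [rec for tag, rec in groups[key] if tag == "L"]
        rights = [rec for tag, rec in groups[key] if tag == "R"]
        joined.extend(l + r for l in lefts for r in rights)
    return joined

joined = reduce_side_join(records, records2)
```

Keys that appear in only one table ("1992" and "2000" here) produce no output, which is the inner-join behaviour.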

Data preparation

Data file /root/hcr/tmp/sample2.txt:

1990 ruishenh0

1992 ruishenh2

1991 ruishenh1

1993 ruishenh3

1994 ruishenh4

1995 ruishenh5

1996 ruishenh6

1997 ruishenh7

1998 ruishenh8

create table records2 (year string, name string) row format delimited fields terminated by '\t';

load data local inpath '/root/hcr/tmp/sample2.txt' overwrite into table records2;

join on

select records.*,records2.*

from records join records2 on (records.year=records2.year);

A join on clause in Hive can combine several conditions, for example a join b on a.id=b.aid and a.type=b.atype

select records.*,records2.*

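from records join records2 on (records.year=records2.year and records.quality!=1);

Hive also supports joins across more than two tables:

select r1.year,r2.name,r2.year,r4.y,r4.standard from records2 r2 join records r1 on (r1.year=r2.year) join records4 r4 on (r4.y=r2.year);

However, this statement failed when executed; the cause still needs to be tracked down (TODO).

Note that in a join clause the largest table should generally be placed last, because Hive buffers the rows of the earlier tables in the reducers and streams the last one through.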

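Outer Joins

A left outer join keeps every row of the left table; where no row of the right table matches, the right table's columns are null.

select * from records r left outer join records2 r2 on r.year=r2.year;

A right outer join keeps every row of the right table; where no row of the left table matches, the left table's columns are null.

select * from records r right outer join records2 r2 on r.year=r2.year;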
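The null padding is easy to see in a small plain-Python sketch. The rows are hypothetical samples, None plays the role of SQL null, and the helper functions are illustrative only (they assume both tables are non-empty and that the join key is the first column).

```python
# Hypothetical rows. records: (year, temperature); records2: (year, name).
records = [("1990", 10), ("2000", 3)]
records2 = [("1990", "ruishenh0"), ("1992", "ruishenh2")]

def left_outer_join(left, right):
    """Keep every left row; pad with None when no right row matches."""
    result = []
    for l in left:
        matches = [r for r in right if r[0] == l[0]]
        if matches:
            result.extend(l + r for r in matches)
        else:
            result.append(l + (None,) * len(right[0]))
    return result

def right_outer_join(left, right):
    # Mirror image: swap the roles, then restore the left-then-right column order.
    n = len(right[0])
    return [row[n:] + row[:n] for row in left_outer_join(right, left)]

left_result = left_outer_join(records, records2)
right_result = right_outer_join(records, records2)
```

In `left_result` the unmatched year "2000" survives with None in the records2 columns; in `right_result` the unmatched year "1992" survives with None in the records columns.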

Semi Joins

select * from records2 r left semi join records r2 on r.year=r2.year;
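A left semi join returns the rows of the left table that have at least one match in the right table, and each such row appears only once. A minimal plain-Python sketch, using hypothetical sample rows and an illustrative helper name:

```python
# Hypothetical rows. records2: (year, name) is the left table, as in the query above;
# records: (year, temperature) is the right table.
records2 = [("1990", "ruishenh0"), ("1991", "ruishenh1"), ("1992", "ruishenh2")]
records = [("1990", 10), ("1990", 25), ("1992", 5)]

def left_semi_join(left, right):
    # Only the existence of a matching key matters, so a set of keys is enough.
    right_keys = {row[0] for row in right}
    # Output contains left-table columns only, one row per matching left row.
    return [row for row in left if row[0] in right_keys]

semi = left_semi_join(records2, records)
```

Year "1990" matches two rows of records, yet the left row is returned once. An inner join would have returned it twice.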

Map Joins with the /*+MAPJOIN(records2)*/ hint

From records r join records2 r2 on r.year=r2.year

select /*+MAPJOIN(records2)*/ r2.*,r.*;
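The idea behind a map join is that the small table is loaded into memory on every mapper, so the join finishes in the map phase with no reduce step. The sketch below models it as a simple hash join in plain Python; the rows are hypothetical and the function name is illustrative, not part of Hive.

```python
from collections import defaultdict

# Hypothetical rows. records is treated as the big table, records2 as the small one.
records = [("1990", 10), ("1991", 7), ("2000", 3)]
records2 = [("1990", "ruishenh0"), ("1991", "ruishenh1")]

def map_join(big_rows, small_rows):
    # Build phase: load the small table into an in-memory hash table.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[0]].append(row)
    # Probe phase: stream the big table and look up each key, no shuffle needed.
    for big in big_rows:
        for small in lookup.get(big[0], []):
            # Column order follows the query above: r2.* first, then r.*
            yield small + big

mapjoined = list(map_join(records, records2))
```

This only pays off when the hinted table is small enough to fit in each mapper's memory.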

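Subqueries

A subquery is a select statement embedded in another SQL statement. Hive's support for subqueries is limited: it only allows a subquery in the from clause of a select statement (releases from 0.13 on also accept some in/exists subqueries in the where clause).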

from

(

From records r

select r.year,MAX(r.temperature) as max_temperature

where r.temperature !=9999 and (r.quality=0 or r.quality=1 or r.quality=2)

group by r.year

) mt

select mt.year,avg(mt.max_temperature)

group by mt.year ;

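Because the outer query refers to the subquery's columns, the subquery must be given an alias (mt in the example above). In addition, the column names returned by the subquery must be unique; for example, it cannot return both records.year and records2.year under the same name.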
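The two-level query above can be read as two aggregation passes, where the result of the inner pass (mt) feeds the outer one. A plain-Python sketch with hypothetical sample rows and illustrative function names:

```python
from collections import defaultdict

# Hypothetical rows: (year, temperature, quality).
records = [
    ("1949", 111, 1), ("1949", 78, 1), ("1949", 9999, 1),
    ("1950", 0, 1), ("1950", 22, 1), ("1950", 30, 5),
]

def inner_query(rows):
    # where temperature != 9999 and quality in (0, 1, 2); group by year; max()
    by_year = defaultdict(list)
    for year, temperature, quality in rows:
        if temperature != 9999 and quality in (0, 1, 2):
            by_year[year].append(temperature)
    return [(year, max(temps)) for year, temps in sorted(by_year.items())]

def outer_query(mt):
    # group by mt.year; avg(mt.max_temperature)
    by_year = defaultdict(list)
    for year, max_temperature in mt:
        by_year[year].append(max_temperature)
    return [(year, sum(v) / len(v)) for year, v in sorted(by_year.items())]

mt = inner_query(records)   # plays the role of the aliased subquery
result = outer_query(mt)
```

Because the inner query already returns one row per year, the outer average over the same grouping equals that year's maximum. The average only becomes interesting when the inner query groups by a finer key than the outer one.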

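Views

A view in Hive is a virtual table defined by a saved query, much like a stored SQL statement; it is not materialized. You also cannot load or insert data into the underlying base table through a view.

Creating a view

create view max_records as

select r.year,MAX(r.temperature) as max_temperature

From records r

where r.temperature !=9999 and (r.quality=0 or r.quality=1 or r.quality=2)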

group by r.year ;

Querying the view

Select * from max_records;

Reproducing the subquery example above with the view:

select year,avg(max_temperature)

from max_records

group by year;

39 2