Hive sorting: cluster by column = distribute by column + sort by column

Source: Internet. Editor: 程序博客网. Time: 2024/05/01 13:56

(1) order by and sort by:

Both can be combined with limit to cap the number of rows returned, which is how a top-N query over the data is written.
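As a minimal sketch of the top-N pattern described above, assuming a hypothetical table named weather with columns year and temperature (the table name is not from the original post):

```sql
-- Assumption: weather(year, temperature) is a hypothetical table.
-- order by produces one total ordering over all rows,
-- so limit 10 keeps the 10 highest temperatures overall.
from weather
select year, temperature
order by temperature desc
limit 10;
```

Replacing order by with sort by in the same query would order rows within each reducer rather than across the whole result.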

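(2) distribute by:

sort by produces one sorted file per reducer. In some cases you need to control which reducer a particular row goes to, usually so that a later aggregation can be performed. This is where Hive's distribute by comes in: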

From table
select year, temperature
distribute by year
sort by year asc, temperature desc;

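The query above performs a partial (per-reducer) sort. It sorts the weather data by year and temperature while guaranteeing that all rows with the same year end up in the same reducer partition (that is, in the same output file). As you can see, distribute by is often used together with sort by.

Note that Hive requires distribute by to be written before sort by.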
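To see the per-reducer files for yourself, something like the following sketch may help. The table name weather, the reducer count, and the output path are all assumptions chosen for illustration, and the exact property name depends on the Hadoop version in use.

```sql
-- Assumption: weather(year, temperature) is a hypothetical table.
-- Request 3 reducers so that several output files are produced
-- (older Hadoop versions use mapred.reduce.tasks instead).
set mapreduce.job.reduces=3;

-- Hypothetical local output path; each reducer writes one file there.
insert overwrite local directory '/tmp/weather_sorted'
select year, temperature
from weather
distribute by year
sort by year asc, temperature desc;
```

Each file should then contain complete years only, ordered by year ascending and, within a year, by temperature descending. There is no ordering guarantee across files.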

(3) cluster by:

In short: cluster by column = distribute by column + sort by column (note that both apply to the same column, and the sort uses the default ASC order).

That is, for the example above:

From table    
select year, temperature    
cluster by year;  

is equivalent to:

From table    
select year, temperature    
distribute by year    
sort by year; 

Of course, this drops the requirement to sort by temperature.
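It may be tempting to recover the temperature ordering by listing both columns in cluster by. The sketch below, again using a hypothetical weather table, shows what that shorthand expands to and why it is not the same as the distribute by year query from section (2):

```sql
-- Assumption: weather(year, temperature) is a hypothetical table.
-- "cluster by year, temperature" is shorthand for the query below.
from weather
select year, temperature
distribute by year, temperature
sort by year, temperature;
```

Here the rows are distributed on the pair (year, temperature), so rows with the same year can land in different reducers, and both columns are sorted ascending. To keep each year in one reducer with temperature descending, the explicit distribute by year plus sort by year asc, temperature desc form is still needed.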

0 0