sort by in Hive


Besides order by, Hive also has a sort by operation. Both perform sorting, but there are important differences between them.
Let's reuse the table from the earlier order by article to illustrate.
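For readers without that article at hand, the table can be recreated with something like the following (a hypothetical sketch; the original only shows the table's contents, an int id and a string name, and the input path here is made up):

hive> create table test09 (id int, name string);
hive> load data local inpath '/tmp/test09.txt' into table test09;   -- fields in the file separated by Hive's default ^A delimiter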

Test case
hive> select * from test09;
OK
100 tom
200 mary
300 kate
400 tim
Time taken: 0.061 seconds

hive> select * from test09 sort by id;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 2
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapred.reduce.tasks=
Starting Job = job_201105020924_0068, Tracking URL = http://hadoop00:50030/jobdetails.jsp?jobid=job_201105020924_0068
Kill Command = /home/hjl/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=hadoop00:9001 -kill job_201105020924_0068
2011-05-03 05:39:21,389 Stage-1 map = 0%, reduce = 0%
2011-05-03 05:39:23,410 Stage-1 map = 50%, reduce = 0%
2011-05-03 05:39:25,430 Stage-1 map = 100%, reduce = 0%
2011-05-03 05:39:30,470 Stage-1 map = 100%, reduce = 50%
2011-05-03 05:39:32,493 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201105020924_0068
OK
100 tom
300 kate
200 mary
400 tim
Time taken: 17.783 seconds

The result looks much like order by, but sort by is not affected by the hive.mapred.mode setting: it runs in both strict and nonstrict mode, whereas in strict mode order by is rejected unless a limit clause is given.
The line "Number of reduce tasks not specified. Defaulting to jobconf value of: 2" above shows that this query launched 2 reducers.
What sort by actually guarantees is that each reducer's output is sorted; as the result above shows, there is no total order across reducers. A total order can still be obtained cheaply by running a single merge pass over the already-sorted files.
That scales far better than order by, which funnels everything through a single reducer.
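You can check the strict-mode difference directly (a sketch of a session; the exact error text Hive prints for the rejected query varies by version):

hive> set hive.mapred.mode=strict;
hive> select * from test09 order by id;           -- rejected: strict mode requires a limit clause with order by
hive> select * from test09 order by id limit 10;  -- accepted
hive> select * from test09 sort by id;            -- accepted: sort by is not restricted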

Writing the result out to files makes this much easier to see.

hive> insert overwrite local directory '/home/hjl/sunwg/qqq' select * from test09 sort by id;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 2
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapred.reduce.tasks=
Starting Job = job_201105020924_0069, Tracking URL = http://hadoop00:50030/jobdetails.jsp?jobid=job_201105020924_0069
Kill Command = /home/hjl/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=hadoop00:9001 -kill job_201105020924_0069
2011-05-03 05:41:27,913 Stage-1 map = 0%, reduce = 0%
2011-05-03 05:41:30,939 Stage-1 map = 100%, reduce = 0%
2011-05-03 05:41:37,993 Stage-1 map = 100%, reduce = 50%
2011-05-03 05:41:41,023 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201105020924_0069
Copying data to local directory /home/hjl/sunwg/qqq
Copying data to local directory /home/hjl/sunwg/qqq
4 Rows loaded to /home/hjl/sunwg/qqq
OK
Time taken: 18.496 seconds

[hjl@sunwg src]$ ll /home/hjl/sunwg/qqq
total 8
-rwxrwxrwx 1 hjl hjl 17 May 3 05:41 attempt_201105020924_0069_r_000000_0
-rwxrwxrwx 1 hjl hjl 17 May 3 05:41 attempt_201105020924_0069_r_000001_0

Two files were produced this time; let's inspect the contents of each.

[hjl@sunwg src]$ cat /home/hjl/sunwg/qqq/attempt_201105020924_0069_r_000000_0
100tom
300kate
[hjl@sunwg src]$ cat /home/hjl/sunwg/qqq/attempt_201105020924_0069_r_000001_0
200mary
400tim

As you can see, each file is internally sorted. (The two columns look glued together because Hive writes its default field delimiter, the non-printable character \001, between them.)
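Since each reducer's file is already sorted, a total order can be recovered with a single merge pass rather than a full re-sort, for example with GNU sort's merge mode (a sketch; -m merges pre-sorted inputs and -n compares the leading numeric id):

[hjl@sunwg src]$ sort -m -n /home/hjl/sunwg/qqq/attempt_201105020924_0069_r_000000_0 /home/hjl/sunwg/qqq/attempt_201105020924_0069_r_000001_0
100tom
200mary
300kate
400tim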

Both order by and sort by can sort data; which one to use depends on the situation. When the data set is small, order by is fine; when the data is very large, sort by is usually the better choice.
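Finally, note the knob already quoted in the job output above: forcing a single reducer makes sort by write one globally sorted file, so for that query it behaves just like order by (a sketch):

hive> set mapred.reduce.tasks=1;
hive> select * from test09 sort by id;   -- with one reducer the output is globally sorted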


Reposted from http://www.oratea.net/?p=624
