distribute by in Hive
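In Hive, distribute by controls how the map side splits rows among the reducers.
Hive distributes rows according to the column(s) listed after distribute by and the number of reducers. By default it uses a hash algorithm.
To test distribute by, the job must run with more than one reducer; otherwise the effect of distribute by is not visible.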
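The hash-based distribution described above can be sketched in a few lines of Python. This is only a simplified model, not Hive's actual code. It assumes that id is a string column, that a string key is hashed with a Java-style 31-based rolling hash over its bytes, and that the reducer index is the non-negative hash modulo the number of reducers. The function names are made up for illustration.

```python
from collections import defaultdict


def java_style_hash(text: str) -> int:
    """31-based rolling hash over UTF-8 bytes, wrapped to a signed 32-bit int.

    Assumption: this stands in for how a string key is hashed.
    """
    h = 0
    for b in text.encode("utf-8"):
        signed = b - 256 if b > 127 else b  # Java bytes are signed
        h = (31 * h + signed) & 0xFFFFFFFF  # keep 32 bits
    return h - 0x100000000 if h >= 0x80000000 else h


def reducer_for(key: str, num_reducers: int) -> int:
    """Assumed rule: clear the sign bit, then take the hash modulo the reducer count."""
    return (java_style_hash(key) & 0x7FFFFFFF) % num_reducers


def distribute(rows, num_reducers):
    """Group rows by the reducer chosen for their first column (the id)."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[reducer_for(row[0], num_reducers)].append(row)
    return dict(buckets)


rows = [("100", "tom"), ("200", "mary"), ("300", "kate"), ("400", "tim")]

print(distribute(rows, 2))  # rows are split across reducers 0 and 1
print(distribute(rows, 1))  # a single reducer receives every row
```

With two reducers the model puts 200 and 400 on reducer 0 and 100 and 300 on reducer 1. With one reducer every row ends up in the same place, which is why a single-reducer test shows nothing. The model says nothing about the order of rows within a reducer.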
hive> select * from test09;
OK
100 tom
200 mary
300 kate
400 tim
Time taken: 0.061 seconds
hive> insert overwrite local directory '/home/hjl/sunwg/ooo' select * from test09 distribute by id;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 2
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201105020924_0070, Tracking URL = http://hadoop00:50030/jobdetails.jsp?jobid=job_201105020924_0070
Kill Command = /home/hjl/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=hadoop00:9001 -kill job_201105020924_0070
2011-05-03 06:12:36,644 Stage-1 map = 0%, reduce = 0%
2011-05-03 06:12:37,656 Stage-1 map = 50%, reduce = 0%
2011-05-03 06:12:39,673 Stage-1 map = 100%, reduce = 0%
2011-05-03 06:12:44,713 Stage-1 map = 100%, reduce = 50%
2011-05-03 06:12:46,733 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201105020924_0070
Copying data to local directory /home/hjl/sunwg/ooo
Copying data to local directory /home/hjl/sunwg/ooo
4 Rows loaded to /home/hjl/sunwg/ooo
OK
Time taken: 17.663 seconds
The first run distributes by the id column. The result is as follows:
[hjl@sunwg src]$ cat /home/hjl/sunwg/ooo/attempt_201105020924_0070_r_000000_0
400tim
200mary
[hjl@sunwg src]$ cat /home/hjl/sunwg/ooo/attempt_201105020924_0070_r_000001_0
300kate
100tom
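This time we distribute in a different way, using the result of length(id). Because the id values of all these rows have the same length, they should all be sent to the same reducer.
hive> insert overwrite local directory '/home/hjl/sunwg/lll' select * from test09 distribute by length(id);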
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 2
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201105020924_0071, Tracking URL = http://hadoop00:50030/jobdetails.jsp?jobid=job_201105020924_0071
Kill Command = /home/hjl/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=hadoop00:9001 -kill job_201105020924_0071
2011-05-03 06:15:21,430 Stage-1 map = 0%, reduce = 0%
2011-05-03 06:15:24,454 Stage-1 map = 100%, reduce = 0%
2011-05-03 06:15:31,509 Stage-1 map = 100%, reduce = 50%
2011-05-03 06:15:34,539 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201105020924_0071
Copying data to local directory /home/hjl/sunwg/lll
Copying data to local directory /home/hjl/sunwg/lll
4 Rows loaded to /home/hjl/sunwg/lll
OK
Time taken: 20.632 seconds
Now check whether the result matches our expectation:
[hjl@sunwg src]$ cat /home/hjl/sunwg/lll/attempt_201105020924_0071_r_000000_0
[hjl@sunwg src]$ cat /home/hjl/sunwg/lll/attempt_201105020924_0071_r_000001_0
100tom
200mary
300kate
400tim
The file attempt_201105020924_0071_r_000000_0 contains no records; all of the records are in attempt_201105020924_0071_r_000001_0.
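Why one file is empty can be illustrated with a simplified model of hash partitioning. It is a sketch, not Hive's implementation. It assumes that an integer key hashes to its own value and that the reducer index is the non-negative hash modulo the number of reducers. The function name is hypothetical.

```python
def reducer_for_int(key: int, num_reducers: int) -> int:
    """Assumed rule: an integer key hashes to itself; clear the sign bit, then take the modulo."""
    return (key & 0x7FFFFFFF) % num_reducers


ids = ["100", "200", "300", "400"]
num_reducers = 2

# distribute by length(id): the key is the length of the id, which is 3 for every row
assignment = {i: reducer_for_int(len(i), num_reducers) for i in ids}
print(assignment)  # {'100': 1, '200': 1, '300': 1, '400': 1}
```

Every row has the same key, so every row maps to the same reducer whatever the hash function is. Under the assumptions above that reducer is number 1, which leaves reducer 0 with an empty output file.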
Reposted from http://www.oratea.net/?p=626