hive SQL优化之distribute by和sort by
来源:互联网 发布:诚信通软件下载 编辑:程序博客网 时间:2024/06/05 03:04
最近在优化hiveSQL,
下面是一段排序,分组后取每组第一行记录的SQL
- INSERT
OVERWRITE pt='${SRCTIME}')TABLE t_wa_funnel_distinct_temp PARTITION ( - SELECT
-
bussiness_id, -
cookie_id, -
session_id, -
funnel_id, -
group_first(funnel_name) funnel_name, -
step_id, -
group_first(step_name) step_name, -
group_first(log_type) log_type, -
group_first(url_pattern) url_pattern, -
group_first(url) url, -
group_first(refer) refer, -
group_first(log_time) log_time, -
group_first(is_new_visitor) is_new_visitor, -
group_first(is_mobile_traffic) is_mobile_traffic, -
group_first(is_bounce) is_bounce, -
group_first(campaign_name) campaign_name, -
group_first(group_name) group_name, -
group_first(slot_name) slot_name, -
group_first(source_type) source_type, -
group_first(next_page) next_page, -
group_first(continent) continent, -
group_first(sub_continent_region) sub_continent_region, -
group_first(country) country, -
group_first(region) region, -
group_first(city) city, -
group_first(language) language, -
group_first(browser) browser, -
group_first(os) os, -
group_first(screen_color) screen_color, -
group_first(screen_resolution) screen_resolution, -
group_first(flash_version) flash_version, -
group_first(java) java, -
group_first(host) host - FROM
- (
* -
FROM r_wa_funnel -
WHERE pt='${SRCTIME}' -
ORDER BY bussiness_id, cookie_id, session_id, funnel_id, step_id, log_time ASC - )
t1 - GROUP
BY pt, bussiness_id, cookie_id, session_id, funnel_id, step_id;
group_first: 自定义函数,用户取每组第一个字段
${SRCTIME}: 由外部oozie调度传入, 作为时间分区,精确到小时.eg: 2011.11.01.21
下面在hive上以SRCTIME = 2011.11.01.21 执行以上SQL. 2011.11.01.21小时分区记录数有10435486
执行时间:
从上面可以看出,reduce阶段只有一个reduce, 这是因为ORDER
从业务需求来看, 只要按
OK, 这样可以采用hive提供的distribute by 和 sort by,这样可以充分利用hadoop资源, 在多个
reduce中局部按log_time 排序
优化有的hive代码:
- INSERT
OVERWRITE pt='2011.11.01.21')TABLE t_wa_funnel_distinct PARTITION ( - SELECT
-
bussiness_id, -
cookie_id, -
session_id, -
funnel_id, -
group_first(funnel_name) funnel_name, -
step_id, -
group_first(step_name) step_name, -
group_first(log_type) log_type, -
group_first(url_pattern) url_pattern, -
group_first(url) url, -
group_first(refer) refer, -
group_first(log_time) log_time, -
group_first(is_new_visitor) is_new_visitor, -
group_first(is_mobile_traffic) is_mobile_traffic, -
group_first(is_bounce) is_bounce, -
group_first(campaign_name) campaign_name, -
group_first(group_name) group_name, -
group_first(slot_name) slot_name, -
group_first(source_type) source_type, -
group_first(next_page) next_page, -
group_first(continent) continent, -
group_first(sub_continent_region) sub_continent_region, -
group_first(country) country, -
group_first(region) region, -
group_first(city) city, -
group_first(language) language, -
group_first(browser) browser, -
group_first(os) os, -
group_first(screen_color) screen_color, -
group_first(screen_resolution) screen_resolution, -
group_first(flash_version) flash_version, -
group_first(java) java, -
group_first(host) host - FROM
- (
* -
FROM r_wa_funnel -
WHERE pt='2011.11.01.21' -
distribute by bussiness_id, cookie_id, session_id, funnel_id, step_id sort by log_time ASC - )
t1 - GROUP
BY bussiness_id, cookie_id, session_id, funnel_id, step_id;
执行时间:
第一个需要执行6:43, 而优化有只要执行0:35秒,性能得到大幅提升
转自:http://3199782.blog.51cto.com/3189782/703873
0 0
- hive SQL优化之distribute by和sort by
- hive SQL优化之distribute by和sort by
- hive SQL优化之distribute by和sort by
- hive SQL优化之distribute by和sort by
- hive SQL优化之distribute by和sort by
- hive SQL优化之distribute by和sort by .
- hive SQL优化之distribute by和sort by
- Hive优化----distribute by和sort by
- hive中order by 和sort by与distribute by
- hive中order by、distribute by、sort by和cluster by的区别和联系
- hive Sort By/Order By/Cluster By/Distribute By
- hive中的order by+sort by+distribute by+cluster by
- Hive Sort by/Order By/Cluster By/Distribute By
- hive sort by,order by ,distribute by,cluster by
- Hive order by/sort by/distribute by/cluster by作用
- hive中的order by , sort by, distribute by, cluster by
- Hive中order by,sort by,distribute by,cluster by
- hive中order by,distribute by,sort by,cluster by
- 正则表达式
- Java设计模式------单例模式
- 随机森林、Boost和GBDT
- ORACLE RAC安装过程碰到问题处理汇总
- C#.net免客户端访问Oracle数据库
- hive SQL优化之distribute by和sort by
- k近邻模型
- 系统拆分解耦利器之消息队列---RabbitMQ-Persistence Configuration
- Android中UI控件的详细介绍(一)——Android中UI的总体分类即通用属性
- 用ffmpeg创建多个program的ts
- 懒惰学习和急切学习
- MySql存储过程常见问题
- HTTP协议入门
- javascript网页特效——window.open