HAWQ vs. Hive Query Performance Comparison Test
Source: Internet | Editor: 程序博客网 | Posted: 2024/06/18 05:26
I. Objective
This test simulates a typical application scenario with a realistic data volume to measure and compare query performance across HAWQ internal tables, HAWQ external tables, and Hive.

II. Hardware Environment
1. A Hadoop cluster of four VMware virtual machines.
2. Each machine is configured as follows:
(1) 15K RPM SAS disk, 100 GB
(2) Intel(R) Xeon(R) E5-2620 v2 @ 2.10GHz, two dual-core CPUs
(3) 8 GB RAM, 8 GB swap
(4) 10000 Mb/s virtual NIC
III. Software Environment
1. Linux: CentOS release 6.4, kernel 2.6.32-358.el6.x86_64
2. Ambari: 2.4.1
3. Hadoop: HDP 2.5.0
4. Hive (Hive on Tez): 2.1.0
5. HAWQ: 2.1.1.0
6. HAWQ PXF: 3.1.1
IV. Data Model
1. Table structure
The test simulates an application that records page-click data. The data model contains five dimension tables (date, page, browser, referrer, and status code) and one page-click fact table. The table structures and relationships are shown in Figure 1.

Figure 1
2. Row counts
The row count of each table is shown in Table 1.

Table name        Rows
----------------  -----------
page_click_fact   100,000,000
page_dim          200,000
referrer_dim      1,000,000
browser_dim       20,000
status_code_dim   70
date_dim          366

Table 1

V. Create the Tables and Generate Data
1. Create the Hive database and tables
create database test;
use test;

create table browser_dim (
    browser_sk bigint, browser_nm varchar(100), browser_version_no varchar(100),
    flash_version_no varchar(100), flash_enabled_flg int, java_version_no varchar(100),
    platform_desc string, java_enabled_flg int, java_script_enabled_flg int,
    cookies_enabled_flg int, user_language_cd varchar(100),
    screen_color_depth_no varchar(100), screen_size_txt string)
row format delimited fields terminated by ',' stored as orc;

create table date_dim (
    cal_dt date, day_in_cal_yr_no int, day_of_week_no int, start_of_month_dt date,
    start_of_quarter_dt date, start_of_week_dt date, start_of_year_dt date)
row format delimited fields terminated by ',' stored as orc;

create table page_dim (
    page_sk bigint, domain_nm varchar(200), reachability_cd string,
    page_desc string, protocol_nm varchar(20))
row format delimited fields terminated by ',' stored as orc;

create table referrer_dim (
    referrer_sk bigint, referrer_txt string, referrer_domain_nm varchar(200))
row format delimited fields terminated by ',' stored as orc;

create table status_code_dim (
    status_cd varchar(100), client_error_flg int, status_cd_desc string,
    server_error_flg int)
row format delimited fields terminated by ',' stored as orc;

create table page_click_fact (
    visitor_id varchar(100), detail_tm timestamp, page_click_dt date, page_sk bigint,
    client_session_dt date, previous_page_sk bigint, referrer_sk bigint,
    next_page_sk bigint, status_cd varchar(100), browser_sk bigint,
    bytes_received_cnt bigint, bytes_sent_cnt bigint, client_detail_tm timestamp,
    entry_point_flg int, exit_point_flg int, ip_address varchar(20),
    query_string_txt string, seconds_spent_on_page_cnt int, sequence_no int,
    requested_file_txt string)
row format delimited fields terminated by ',' stored as orc;

Note: the Hive tables use the ORC storage format.
2. Generate the Hive table data with a Java program
The HDFS file sizes of the ORC-compressed tables are as follows:

2.2 M    /apps/hive/warehouse/test.db/browser_dim
641      /apps/hive/warehouse/test.db/date_dim
4.1 G    /apps/hive/warehouse/test.db/page_click_fact
16.1 M   /apps/hive/warehouse/test.db/page_dim
22.0 M   /apps/hive/warehouse/test.db/referrer_dim
1.1 K    /apps/hive/warehouse/test.db/status_code_dim
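The Java generator itself is not shown in the article; as an illustration only, here is a minimal Python sketch that produces comma-delimited rows matching the date_dim schema above. The 2016 start date and the Monday-based day-of-week convention are assumptions, chosen only because the table holds 366 rows (a leap year).

```python
# Hypothetical sketch of dimension-data generation (the article's Java
# generator is not shown). Emits comma-delimited rows for date_dim:
# cal_dt, day_in_cal_yr_no, day_of_week_no, start_of_month_dt,
# start_of_quarter_dt, start_of_week_dt, start_of_year_dt.
from datetime import date, timedelta

def date_dim_rows(start: date, days: int) -> list[str]:
    rows = []
    for i in range(days):
        d = start + timedelta(days=i)
        rows.append(",".join([
            d.isoformat(),                                  # cal_dt
            str(d.timetuple().tm_yday),                     # day_in_cal_yr_no
            str(d.isoweekday()),                            # day_of_week_no (Mon=1, assumed)
            d.replace(day=1).isoformat(),                   # start_of_month_dt
            date(d.year, 3 * ((d.month - 1) // 3) + 1, 1).isoformat(),  # start_of_quarter_dt
            (d - timedelta(days=d.weekday())).isoformat(),  # start_of_week_dt (Monday)
            date(d.year, 1, 1).isoformat(),                 # start_of_year_dt
        ]))
    return rows

rows = date_dim_rows(date(2016, 1, 1), 366)  # 366 rows, matching Table 1
print(rows[0])  # → 2016-01-01,1,5,2016-01-01,2016-01-01,2015-12-28,2016-01-01
```

The resulting text lines can be loaded into a delimited staging table and then inserted into the ORC table; the fact table would be generated the same way at much larger scale.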
3. Analyze the Hive tables
analyze table date_dim compute statistics;
analyze table browser_dim compute statistics;
analyze table page_dim compute statistics;
analyze table referrer_dim compute statistics;
analyze table status_code_dim compute statistics;
analyze table page_click_fact compute statistics;
4. Create the HAWQ external tables
create schema ext;
set search_path=ext;

create external table date_dim (
    cal_dt date, day_in_cal_yr_no int4, day_of_week_no int4, start_of_month_dt date,
    start_of_quarter_dt date, start_of_week_dt date, start_of_year_dt date)
location ('pxf://hdp1:51200/test.date_dim?profile=hiveorc')
format 'custom' (formatter='pxfwritable_import');

create external table browser_dim (
    browser_sk int8, browser_nm varchar(100), browser_version_no varchar(100),
    flash_version_no varchar(100), flash_enabled_flg int, java_version_no varchar(100),
    platform_desc text, java_enabled_flg int, java_script_enabled_flg int,
    cookies_enabled_flg int, user_language_cd varchar(100),
    screen_color_depth_no varchar(100), screen_size_txt text)
location ('pxf://hdp1:51200/test.browser_dim?profile=hiveorc')
format 'custom' (formatter='pxfwritable_import');

create external table page_dim (
    page_sk int8, domain_nm varchar(200), reachability_cd text, page_desc text,
    protocol_nm varchar(20))
location ('pxf://hdp1:51200/test.page_dim?profile=hiveorc')
format 'custom' (formatter='pxfwritable_import');

create external table referrer_dim (
    referrer_sk int8, referrer_txt text, referrer_domain_nm varchar(200))
location ('pxf://hdp1:51200/test.referrer_dim?profile=hiveorc')
format 'custom' (formatter='pxfwritable_import');

create external table status_code_dim (
    status_cd varchar(100), client_error_flg int4, status_cd_desc text,
    server_error_flg int4)
location ('pxf://hdp1:51200/test.status_code_dim?profile=hiveorc')
format 'custom' (formatter='pxfwritable_import');

create external table page_click_fact (
    visitor_id varchar(100), detail_tm timestamp, page_click_dt date, page_sk int8,
    client_session_dt date, previous_page_sk int8, referrer_sk int8, next_page_sk int8,
    status_cd varchar(100), browser_sk int8, bytes_received_cnt int8,
    bytes_sent_cnt int8, client_detail_tm timestamp, entry_point_flg int4,
    exit_point_flg int4, ip_address varchar(20), query_string_txt text,
    seconds_spent_on_page_cnt int4, sequence_no int4, requested_file_txt text)
location ('pxf://hdp1:51200/test.page_click_fact?profile=hiveorc')
format 'custom' (formatter='pxfwritable_import');

Note: the HAWQ external tables use the PXF protocol and point at the corresponding Hive tables.
5. Create the HAWQ internal tables
set search_path=public;

create table date_dim (
    cal_dt date, day_in_cal_yr_no int4, day_of_week_no int4, start_of_month_dt date,
    start_of_quarter_dt date, start_of_week_dt date, start_of_year_dt date)
with (compresstype=snappy, appendonly=true);

create table browser_dim (
    browser_sk int8, browser_nm varchar(100), browser_version_no varchar(100),
    flash_version_no varchar(100), flash_enabled_flg int, java_version_no varchar(100),
    platform_desc text, java_enabled_flg int, java_script_enabled_flg int,
    cookies_enabled_flg int, user_language_cd varchar(100),
    screen_color_depth_no varchar(100), screen_size_txt text)
with (compresstype=snappy, appendonly=true);

create table page_dim (
    page_sk int8, domain_nm varchar(200), reachability_cd text, page_desc text,
    protocol_nm varchar(20))
with (compresstype=snappy, appendonly=true);

create table referrer_dim (
    referrer_sk int8, referrer_txt text, referrer_domain_nm varchar(200))
with (compresstype=snappy, appendonly=true);

create table status_code_dim (
    status_cd varchar(100), client_error_flg int4, status_cd_desc text,
    server_error_flg int4)
with (compresstype=snappy, appendonly=true);

create table page_click_fact (
    visitor_id varchar(100), detail_tm timestamp, page_click_dt date, page_sk int8,
    client_session_dt date, previous_page_sk int8, referrer_sk int8, next_page_sk int8,
    status_cd varchar(100), browser_sk int8, bytes_received_cnt int8,
    bytes_sent_cnt int8, client_detail_tm timestamp, entry_point_flg int4,
    exit_point_flg int4, ip_address varchar(20), query_string_txt text,
    seconds_spent_on_page_cnt int4, sequence_no int4, requested_file_txt text)
with (compresstype=snappy, appendonly=true);

Note: the internal table definitions are equivalent to the Hive tables and use snappy-compressed row-oriented storage.
6. Load data into the HAWQ internal tables
insert into date_dim select * from hcatalog.test.date_dim;
insert into browser_dim select * from hcatalog.test.browser_dim;
insert into page_dim select * from hcatalog.test.page_dim;
insert into referrer_dim select * from hcatalog.test.referrer_dim;
insert into status_code_dim select * from hcatalog.test.status_code_dim;
insert into page_click_fact select * from hcatalog.test.page_click_fact;

Note: the Hive tables are queried directly through HCatalog and the results are inserted into the HAWQ internal tables. The HDFS file sizes of the snappy-compressed tables are as follows:
6.2 K    /hawq_data/16385/177422/177677
3.3 M    /hawq_data/16385/177422/177682
23.9 M   /hawq_data/16385/177422/177687
39.3 M   /hawq_data/16385/177422/177707
1.8 K    /hawq_data/16385/177422/177726
7.9 G    /hawq_data/16385/177422/177731
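The two size listings make a side point worth noting: assuming the 7.9 GB file corresponds to page_click_fact (by far the largest table), the snappy-compressed row store holds the same fact data in roughly twice the space of columnar ORC. A quick check of the ratio:

```python
# Fact-table footprint from the listings above: columnar ORC under Hive
# versus snappy-compressed row storage in HAWQ (mapping of the HAWQ file
# to page_click_fact is an assumption based on its size).
orc_gb = 4.1     # /apps/hive/warehouse/test.db/page_click_fact
snappy_gb = 7.9  # /hawq_data/16385/177422/177731

ratio = snappy_gb / orc_gb
print(f"row-store/ORC size ratio: {ratio:.2f}x")  # → row-store/ORC size ratio: 1.93x
```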
7. Analyze the HAWQ internal tables
analyze date_dim;
analyze browser_dim;
analyze page_dim;
analyze referrer_dim;
analyze status_code_dim;
analyze page_click_fact;
VI. Run the Queries
Run the following five queries against the Hive tables, the HAWQ external tables, and the HAWQ internal tables, recording the execution time of each.

1. Most-visited directories on support.sas.com in a given week
-- Hive query
select top_directory, count(*) as unique_visits
from (select distinct visitor_id, substr(requested_file_txt, 1, 10) top_directory
        from page_click_fact, page_dim, browser_dim
       where domain_nm = 'support.sas.com'
         and flash_enabled_flg = 1
         and weekofyear(detail_tm) = 19
         and year(detail_tm) = 2017
     ) directory_summary
group by top_directory
order by unique_visits;

-- HAWQ query: equivalent to the Hive query, with extract() replacing
-- Hive's weekofyear() and year() functions.
select top_directory, count(*) as unique_visits
from (select distinct visitor_id, substr(requested_file_txt, 1, 10) top_directory
        from page_click_fact, page_dim, browser_dim
       where domain_nm = 'support.sas.com'
         and flash_enabled_flg = 1
         and extract(week from detail_tm) = 19
         and extract(year from detail_tm) = 2017
     ) directory_summary
group by top_directory
order by unique_visits;
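The rewrite relies on Hive's weekofyear() and HAWQ's extract(week from ...) both following ISO-8601 week numbering, so the two predicates select the same calendar span. As an illustration (using Python's ISO-week implementation, not the engines themselves), week 19 of 2017 covers May 8-14:

```python
# Hive's weekofyear() and extract(week from ...) both use ISO-8601 week
# numbering; Python's isocalendar() follows the same convention, so we can
# enumerate the days the week-19 predicate matches.
from datetime import date, timedelta

week19 = [date(2017, 1, 1) + timedelta(days=i)
          for i in range(365)
          if (date(2017, 1, 1) + timedelta(days=i)).isocalendar()[:2] == (2017, 19)]

print(week19[0], "-", week19[-1])  # → 2017-05-08 - 2017-05-14
```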
2. Pages visited from www.google.com, by month
-- Hive query
select domain_nm, requested_file_txt, count(*) as unique_visitors, month
from (select distinct domain_nm, requested_file_txt, visitor_id,
             month(detail_tm) as month
        from page_click_fact, page_dim, referrer_dim
       where domain_nm = 'support.sas.com'
         and referrer_domain_nm = 'www.google.com'
     ) visits_pp_ph_summary
group by domain_nm, requested_file_txt, month
order by domain_nm, requested_file_txt, unique_visitors desc, month asc;

-- HAWQ query: equivalent to the Hive query, with extract() replacing
-- Hive's month() function.
select domain_nm, requested_file_txt, count(*) as unique_visitors, month
from (select distinct domain_nm, requested_file_txt, visitor_id,
             extract(month from detail_tm) as month
        from page_click_fact, page_dim, referrer_dim
       where domain_nm = 'support.sas.com'
         and referrer_domain_nm = 'www.google.com'
     ) visits_pp_ph_summary
group by domain_nm, requested_file_txt, month
order by domain_nm, requested_file_txt, unique_visitors desc, month asc;
3. Search-string counts on support.sas.com for a given year
-- Hive query
select query_string_txt, count(*) as count
from page_click_fact, page_dim
where query_string_txt <> ''
  and domain_nm = 'support.sas.com'
  and year(detail_tm) = 2017
group by query_string_txt
order by count desc;

-- HAWQ query: equivalent to the Hive query, with extract() replacing
-- Hive's year() function.
select query_string_txt, count(*) as count
from page_click_fact, page_dim
where query_string_txt <> ''
  and domain_nm = 'support.sas.com'
  and extract(year from detail_tm) = 2017
group by query_string_txt
order by count desc;
4. Number of visitors to each page using the Safari browser
-- Hive query
select domain_nm, requested_file_txt, count(*) as unique_visitors
from (select distinct domain_nm, requested_file_txt, visitor_id
        from page_click_fact, page_dim, browser_dim
       where domain_nm = 'support.sas.com'
         and browser_nm like '%Safari%'
         and weekofyear(detail_tm) = 19
         and year(detail_tm) = 2017
     ) uv_summary
group by domain_nm, requested_file_txt
order by unique_visitors desc;

-- HAWQ query: equivalent to the Hive query, with extract() replacing
-- Hive's weekofyear() and year() functions.
select domain_nm, requested_file_txt, count(*) as unique_visitors
from (select distinct domain_nm, requested_file_txt, visitor_id
        from page_click_fact, page_dim, browser_dim
       where domain_nm = 'support.sas.com'
         and browser_nm like '%Safari%'
         and extract(week from detail_tm) = 19
         and extract(year from detail_tm) = 2017
     ) uv_summary
group by domain_nm, requested_file_txt
order by unique_visitors desc;
5. Pages on support.sas.com viewed for more than 10 seconds in a given week
-- Hive query
select domain_nm, requested_file_txt, count(*) as unique_visits
from (select distinct domain_nm, requested_file_txt, visitor_id
        from page_click_fact, page_dim
       where domain_nm = 'support.sas.com'
         and weekofyear(detail_tm) = 19
         and year(detail_tm) = 2017
         and seconds_spent_on_page_cnt > 10
     ) visits_summary
group by domain_nm, requested_file_txt
order by unique_visits desc;

-- HAWQ query: equivalent to the Hive query, with extract() replacing
-- Hive's weekofyear() and year() functions.
select domain_nm, requested_file_txt, count(*) as unique_visits
from (select distinct domain_nm, requested_file_txt, visitor_id
        from page_click_fact, page_dim
       where domain_nm = 'support.sas.com'
         and extract(week from detail_tm) = 19
         and extract(year from detail_tm) = 2017
         and seconds_spent_on_page_cnt > 10
     ) visits_summary
group by domain_nm, requested_file_txt
order by unique_visits desc;
VII. Results
Query execution times for Hive, the HAWQ external tables, and the HAWQ internal tables are compared in Table 2. Each figure is the average of three runs.

Query   Hive (s)   HAWQ external (s)   HAWQ internal (s)
1       74.337     304.134             19.232
2       169.521    150.882             3.446
3       73.482     101.216             18.565
4       66.367     359.778             1.217
5       60.341     118.329             2.789
Table 2

As the comparison in Figure 2 shows, HAWQ internal tables are far faster than Hive on Tez (roughly 4 to 50 times, depending on the query). The same queries, however, run slowly against HAWQ external tables over Hive. Analytical queries should therefore run on HAWQ internal tables whenever possible. If external tables are unavoidable, keep their data volume as small as possible and keep the queries simple, avoiding aggregation, grouping, sorting, and other expensive operations on external tables.
Figure 2
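The speedup range can be reproduced directly from the figures in Table 2; the measured ratios run from about 3.9x to 54.5x, consistent with the rough 4-50x characterization above:

```python
# Speedup of HAWQ internal tables over Hive on Tez, from Table 2.
hive    = [74.337, 169.521, 73.482, 66.367, 60.341]
hawq_in = [19.232, 3.446, 18.565, 1.217, 2.789]

speedups = [h / q for h, q in zip(hive, hawq_in)]
for i, s in enumerate(speedups, start=1):
    print(f"query {i}: {s:.1f}x")
print(f"range: {min(speedups):.1f}x - {max(speedups):.1f}x")  # → range: 3.9x - 54.5x
```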