TPC-H测试数据表生成,以及在Impala中的使用
来源:互联网 发布:淘宝网退回旧版本 编辑:程序博客网 时间:2024/06/10 10:52
一、下载TPCH生成工具
官方网址:http://www.tpc.org/tpc_documents_current_versions/current_specifications.asp
本文中安装的是:
TPC-H 2.17.1 pdf Download TPCH_Tools.zip
可通过wget、curl来获取安装包。
【注】获取包地址,需要填写Email信息进行申请,无伤大雅,照做。请确保,您的邮箱可以收到自动发送的包含连接地址的邮件,
这个地址才是真正的下载地址。
二、安装TPCH生成工具
下载完成后,您可能看到这样的文件:
<span style="color:#CC0000;">c81a6b51-9f9c-4a5b-8979-9bf85688767b-tpc-h-tool.zip</span>
这个就是TCPH的安装工具,so,使用unzip工具解压之。
unzip c81a6b51-9f9c-4a5b-8979-9bf85688767b-tpc-h-tool.zip
而后,得到两个文件夹:
第一个,无用,忽略之。进入第二个。
<span style="font-size:12px;">[root@node20 tpch_2_17_0]# cd dbgen/[root@node20 dbgen]# lsanswers bm_utils.c build.o dbgen driver.o dsstypes.h makefile permute.o qgen queries rnd.c rng64.h speed_seed.o tpcd.h update_release.shbcd2.c bm_utils.o check_answers dbgen.dsp dss.ddl HISTORY makefile.suite PORTING.NOTES qgen.c README rnd.h rng64.o tests tpch.dsw variantsbcd2.h BUGS column_split.sh dists.dss dss.h load_stub.c permute.c print.c qgen.o reference rnd.o shared.h text.c tpch.sln varsub.cbcd2.o build.c config.h driver.c dss.ri load_stub.o permute.h print.o qgen.vcproj release.h rng64.c speed_seed.c text.o tpch.vcproj varsub.o</span>
[root@node20 dbgen]# ./dbgen -help你会看到相应的参数信息。详情,自己看。
三、生成指定的数据库文件
1. 编辑Makefile文件,具体的参数配置可参考如下示例。
<span style="font-size:12px;">[root@node20 dbgen]# cp makefile.suite makefile</span>
<span style="font-size:12px;">[root@node20 dbgen]# vim makefile</span>CC = gccDATABASE= SQLSERVERMACHINE = LINUXWORKLOAD = TPCHCFLAGS = -g -DDBNAME=\"dss\" -D$(MACHINE) -D$(DATABASE) -D$(WORKLOAD) -DRNG_TEST -D_FILE_OFFSET_BITS=64LDFLAGS = -O
执行make。
2.执行dbgen,生成orders数据表,大小为5G。
<span style="font-size:12px;">[root@node20 dbgen]# ./dbgen -vf -s 5 -T OTPC-H Population Generator (Version 2.17.0)Copyright Transaction Processing Performance Council 1994 - 2010Generating data for order table/Preloading text ... 100%done.</span>
<span style="font-size:12px;">[root@node20 dbgen]# ls -lh *.tbl-rw-r--r-- 1 root root 830M May 11 14:01 orders.tbl</span><span style="font-size:14px;"></span>
四、加载数据表到Impala中
实际上就是,将tpch-tool生成的
*.tbl文件加载到hadoop系统中,然后,通过impala-shell来创建相应的表。
1.clone脚本工程,用于自动完成加载工作。
参考网址:https://github.com/kj-ki/tpc-h-impala
2.将利用TPCH工具生成的数据表文件,移动到该工程目录的data/文件夹下。
<span style="font-size:12px;">[root@node20 data]# mv ../../../tpch_2_17_0/dbgen/*.tbl ./[root@node20 data]# lslineitem.tbl orders.tbl tpch_prepare_data.sh</span>
3.加载数据到Hadoop系统中。
<span style="font-size:12px;">[root@node20 data]# ./tpch_prepare_data.sh mkdir: `/tpch': File existsmkdir: `/tpch/customer': File existsmkdir: `/tpch/lineitem': File existsmkdir: `/tpch/nation': File existsmkdir: `/tpch/orders': File existsmkdir: `/tpch/part': File existsmkdir: `/tpch/partsupp': File existsmkdir: `/tpch/region': File existsmkdir: `/tpch/supplier': File existscopyFromLocal: `customer.tbl': No such file or directorycopyFromLocal: `/tpch/lineitem/lineitem.tbl': File existscopyFromLocal: `nation.tbl': No such file or directorycopyFromLocal: `part.tbl': No such file or directorycopyFromLocal: `partsupp.tbl': No such file or directorycopyFromLocal: `region.tbl': No such file or directorycopyFromLocal: `supplier.tbl': No such file or directory</span>
这里,之前已经执行过该脚本,导入lineitem数据表,所以相应的文件夹已经存在,无碍。
查看导入情况:
<span style="font-size:12px;">[root@node20 data]# hadoop fs -ls /tpchFound 8 itemsdrwxr-xr-x - root supergroup 0 2016-05-06 12:10 /tpch/customerdrwxr-xr-x - root supergroup 0 2016-05-06 15:42 /tpch/lineitemdrwxr-xr-x - root supergroup 0 2016-05-06 12:10 /tpch/nationdrwxr-xr-x - root supergroup 0 2016-05-11 14:14 /tpch/ordersdrwxr-xr-x - root supergroup 0 2016-05-06 12:10 /tpch/partdrwxr-xr-x - root supergroup 0 2016-05-06 12:11 /tpch/partsuppdrwxr-xr-x - root supergroup 0 2016-05-06 12:11 /tpch/regiondrwxr-xr-x - root supergroup 0 2016-05-06 12:11 /tpch/supplier[root@node20 data]# hadoop fs -ls /tpch/ordersFound 1 items-rw-r--r-- 3 root supergroup 870187306 2016-05-11 14:14 /tpch/orders/orders.tbl[root@node20 data]# hadoop fs -ls /tpch/lineitemFound 2 itemsdrwxr-xr-x - root supergroup 0 2016-05-06 15:42 /tpch/lineitem/lineitem-rw-r--r-- 3 root supergroup 7775727688 2016-05-06 12:12 /tpch/lineitem/lineitem.tbl</span>
【注】如果你遇到如下异常情况,请确保impala所需的环境变量已经导入。执行source ${IMPALA_HOME}/bin/impala-config.sh
<span style="font-size:12px;">[root@node20 data]# ./tpch_prepare_data.sh -mkdir: java.net.UnknownHostException: node1.hadoop.comUsage: hadoop fs [generic options] -mkdir [-p] <path> ...-mkdir: java.net.UnknownHostException: node1.hadoop.comUsage: hadoop fs [generic options] -mkdir [-p] <path> ...-mkdir: java.net.UnknownHostException: node1.hadoop.comUsage: hadoop fs [generic options] -mkdir [-p] <path> ...-mkdir: java.net.UnknownHostException: node1.hadoop.comUsage: hadoop fs [generic options] -mkdir [-p] <path> ...-mkdir: java.net.UnknownHostException: node1.hadoop.comUsage: hadoop fs [generic options] -mkdir [-p] <path> ...-mkdir: java.net.UnknownHostException: node1.hadoop.comUsage: hadoop fs [generic options] -mkdir [-p] <path> ...-mkdir: java.net.UnknownHostException: node1.hadoop.comUsage: hadoop fs [generic options] -mkdir [-p] <path> ...-mkdir: java.net.UnknownHostException: node1.hadoop.comUsage: hadoop fs [generic options] -mkdir [-p] <path> ...-mkdir: java.net.UnknownHostException: node1.hadoop.comUsage: hadoop fs [generic options] -mkdir [-p] <path> ...^C-copyFromLocal: java.net.UnknownHostException: node1.hadoop.comUsage: hadoop fs [generic options] -copyFromLocal [-f] [-p] <localsrc> ... <dst>-copyFromLocal: java.net.UnknownHostException: node1.hadoop.comUsage: hadoop fs [generic options] -copyFromLocal [-f] [-p] <localsrc> ... <dst>-copyFromLocal: java.net.UnknownHostException: node1.hadoop.comUsage: hadoop fs [generic options] -copyFromLocal [-f] [-p] <localsrc> ... <dst>-copyFromLocal: java.net.UnknownHostException: node1.hadoop.comUsage: hadoop fs [generic options] -copyFromLocal [-f] [-p] <localsrc> ... <dst>-copyFromLocal: java.net.UnknownHostException: node1.hadoop.comUsage: hadoop fs [generic options] -copyFromLocal [-f] [-p] <localsrc> ... <dst>-copyFromLocal: java.net.UnknownHostException: node1.hadoop.comUsage: hadoop fs [generic options] -copyFromLocal [-f] [-p] <localsrc> ... <dst>-copyFromLocal: java.net.UnknownHostException: node1.hadoop.comUsage: hadoop fs [generic options] -copyFromLocal [-f] [-p] <localsrc> ... <dst></span>
4. 生成表文本文件的数据表。
各类表的参考描述如下:
The official TPC-H specification: http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.1.pdf
直接执行如下语句,会一直卡住。
<span style="font-size:12px;">[root@node20 tpc-h-impala]# ./tpch_benchmark.sh ************************************************ TPC-H benchmark on Impala ************************************************ See benchmark.log for more details of query errors.Executing Trial #1 of 1 trial(s)...Running query: q1_pricing_summary_report</span>
SO,直接查看该文件,可知自动导入表到impala是通过tpch_prepare/文件夹中的.hive文件来完成的。
分析可知,直接打开impala-shell,手动创建外部表。
<span style="font-size:12px;">[node20:21000] > create external table orders (O_ORDERKEY INT, O_CUSTKEY INT, O_ORDERSTATUS STRING, O_TOTALPRICE DOUBLE, O_ORDERDATE STRING, O_ORDERPRIORITY STRING, O_CLERK STRING, O_SHIPPRIORITY INT, O_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch/orders';Query: create external table orders (O_ORDERKEY INT, O_CUSTKEY INT, O_ORDERSTATUS STRING, O_TOTALPRICE DOUBLE, O_ORDERDATE STRING, O_ORDERPRIORITY STRING, O_CLERK STRING, O_SHIPPRIORITY INT, O_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch/orders'Fetched 0 row(s) in 0.04s[node20:21000] > select count(*) from orders;Query: select count(*) from orders+----------+| count(*) |+----------+| 7500000 |+----------+Fetched 1 row(s) in 5.03s</span>
其它表的插入过程类似,在.hive文件中检索相应的创建表的sql语句。
5.生成parquet格式的数据表。
<span style="font-size:12px;">[node20:21000] > create database tpch_parquet;Query: create database tpch_parquetFetched 0 row(s) in 0.10s[node20:21000] > show databases;Query: show databases+------------------+----------------------------------------------+| name | comment |+------------------+----------------------------------------------+| _impala_builtins | System database for Impala builtin functions || default | Default Hive database || tpch_parquet | |+------------------+----------------------------------------------+Fetched 3 row(s) in 0.01s[node20:21000] > create table tpch_parquet.orders_pq stored as parquet as select * from orders;Query: create table tpch_parquet.orders_pq stored as parquet as select * from orders+-------------------------+| summary |+-------------------------+| Inserted 7500000 row(s) |+-------------------------+Fetched 1 row(s) in 10.62s[node20:21000] > desc tpch_parquet.orders_pq;Query: describe tpch_parquet.orders_pq+-----------------+--------+---------+| name | type | comment |+-----------------+--------+---------+| o_orderkey | int | || o_custkey | int | || o_orderstatus | string | || o_totalprice | double | || o_orderdate | string | || o_orderpriority | string | || o_clerk | string | || o_shippriority | int | || o_comment | string | |+-----------------+--------+---------+Fetched 9 row(s) in 0.01s[node20:21000] > show table stats tpch_parquet.orders_pq;Query: show table stats tpch_parquet.orders_pq+-------+--------+----------+--------------+-------------------+---------+-------------------+-----------------------------------------------------------------+| #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location |+-------+--------+----------+--------------+-------------------+---------+-------------------+-----------------------------------------------------------------+| -1 | 3 | 269.18MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://localhost:20500/test-warehouse/tpch_parquet.db/orders_pq |+-------+--------+----------+--------------+-------------------+---------+-------------------+-----------------------------------------------------------------+Fetched 1 row(s) in 0.01s</span>
- TPC-H测试数据表生成,以及在Impala中的使用
- TPC-H linux下生成数据表
- TPC-H DBGEN使用方法_ 生成数据表(Linux_Windows)
- 使用TPC-H对Hive测试
- PostgreSQL TPC-H测试
- TPC-H 使用
- 转载:TPC-H 使用
- windows环境下使用TPC-H生成数据集
- 使用TPC-H测试oracle&infobright&postgresql的性能
- tpc-c 基准测试简介以及程序的安装、使用
- TPC-H tools的使用
- 大数据基准测试 TPC-H、TPC-DS
- Hive的TPC-DS和TPC-H基准测试工具
- windows :TPC-H测试的数据库模式生成及其数据导入MySQL的教程记录
- 利用TPC-H为MYSQL生成数据
- Greenplum平台上的TPC-H测试
- [笔记] 大家一起来测试,benchmark起来(MySQL下的TPC-C,TPC-H,TPC-W)
- TPC-H工具使用:生成数据集及转换到PostgreSQL数据库中
- hdu 3836 Equivalent Sets
- 先验概率、后验概率以及共轭先验
- 信号与系统3:单位阶跃,单位响应以及系统性质
- 你不敢孤独,谈什么卓越?
- Android从菜鸟到笨鸟:ViewPager+Handler实现广告栏无限循环
- TPC-H测试数据表生成,以及在Impala中的使用
- 15分钟帮你轻松理解 JS闭包
- cordova的白名单插件所形成的问题 Please add one when using the cordova-plugin-whitelist plugin
- 防安全漏洞开发规范
- 使用Vagrant在Windows下部署开发环境
- 使用getBackground().setAlpha,导致其他布局背景透明度都改变的问题
- 怎样写出优秀的研究论文?
- 文章标题
- OC中属性self.a与_a访问的区别