TPC-H测试数据表生成,以及在Impala中的使用

来源:互联网 发布:淘宝网退回旧版本 编辑:程序博客网 时间:2024/06/10 10:52

一、下载TPCH生成工具

官方网址:http://www.tpc.org/tpc_documents_current_versions/current_specifications.asp

本文中安装的是:

TPC-H 2.17.1 pdf Download TPCH_Tools.zip

可通过wget、curl来获取安装包。

【注】获取包地址,需要填写Email信息进行申请,无伤大雅,照做。请确保,您的邮箱可以收到自动发送的包含连接地址的邮件,

这个地址才是真正的下载地址。

二、安装TPCH生成工具

下载完成后,您可能看到这样的文件:

<span style="color:#CC0000;">c81a6b51-9f9c-4a5b-8979-9bf85688767b-tpc-h-tool.zip</span>

这个就是TCPH的安装工具,so,使用unzip工具解压之。

unzip c81a6b51-9f9c-4a5b-8979-9bf85688767b-tpc-h-tool.zip

而后,得到两个文件夹:

第一个,无用,忽略之。进入第二个。

<span style="font-size:12px;">[root@node20 tpch_2_17_0]# cd dbgen/[root@node20 dbgen]# lsanswers  bm_utils.c  build.o          dbgen      driver.o  dsstypes.h   makefile        permute.o      qgen         queries    rnd.c    rng64.h       speed_seed.o  tpcd.h       update_release.shbcd2.c   bm_utils.o  check_answers    dbgen.dsp  dss.ddl   HISTORY      makefile.suite  PORTING.NOTES  qgen.c       README     rnd.h    rng64.o       tests         tpch.dsw     variantsbcd2.h   BUGS        column_split.sh  dists.dss  dss.h     load_stub.c  permute.c       print.c        qgen.o       reference  rnd.o    shared.h      text.c        tpch.sln     varsub.cbcd2.o   build.c     config.h         driver.c   dss.ri    load_stub.o  permute.h       print.o        qgen.vcproj  release.h  rng64.c  speed_seed.c  text.o        tpch.vcproj  varsub.o</span>
[root@node20 dbgen]# ./dbgen -help
你会看到相应的参数信息。详情,自己看。

三、生成指定的数据库文件

1. 编辑Makefile文件,具体的参数配置可参考如下示例。

<span style="font-size:12px;">[root@node20 dbgen]# cp makefile.suite makefile</span>
<span style="font-size:12px;">[root@node20 dbgen]# vim makefile</span>CC      = gccDATABASE= SQLSERVERMACHINE = LINUXWORKLOAD = TPCHCFLAGS  = -g -DDBNAME=\"dss\" -D$(MACHINE) -D$(DATABASE) -D$(WORKLOAD) -DRNG_TEST -D_FILE_OFFSET_BITS=64LDFLAGS = -O

执行make

2.执行dbgen,生成orders数据表,大小为5G。

<span style="font-size:12px;">[root@node20 dbgen]# ./dbgen -vf -s 5 -T OTPC-H Population Generator (Version 2.17.0)Copyright Transaction Processing Performance Council 1994 - 2010Generating data for order table/Preloading text ... 100%done.</span>

<span style="font-size:12px;">[root@node20 dbgen]# ls -lh *.tbl-rw-r--r-- 1 root root 830M May 11 14:01 orders.tbl</span><span style="font-size:14px;"></span>

四、加载数据表到Impala中

实际上就是,将tpch-tool生成的

*.tbl
文件加载到hadoop系统中,然后,通过impala-shell来创建相应的表。


1.clone脚本工程,用于自动完成加载工作。

参考网址:https://github.com/kj-ki/tpc-h-impala

2.将利用TPCH工具生成的数据表文件,移动到该工程目录的data/文件夹下。

<span style="font-size:12px;">[root@node20 data]# mv ../../../tpch_2_17_0/dbgen/*.tbl ./[root@node20 data]# lslineitem.tbl  orders.tbl  tpch_prepare_data.sh</span>

3.加载数据到Hadoop系统中。

<span style="font-size:12px;">[root@node20 data]# ./tpch_prepare_data.sh mkdir: `/tpch': File existsmkdir: `/tpch/customer': File existsmkdir: `/tpch/lineitem': File existsmkdir: `/tpch/nation': File existsmkdir: `/tpch/orders': File existsmkdir: `/tpch/part': File existsmkdir: `/tpch/partsupp': File existsmkdir: `/tpch/region': File existsmkdir: `/tpch/supplier': File existscopyFromLocal: `customer.tbl': No such file or directorycopyFromLocal: `/tpch/lineitem/lineitem.tbl': File existscopyFromLocal: `nation.tbl': No such file or directorycopyFromLocal: `part.tbl': No such file or directorycopyFromLocal: `partsupp.tbl': No such file or directorycopyFromLocal: `region.tbl': No such file or directorycopyFromLocal: `supplier.tbl': No such file or directory</span>

这里,之前已经执行过该脚本,导入lineitem数据表,所以相应的文件夹已经存在,无碍。

查看导入情况:

<span style="font-size:12px;">[root@node20 data]# hadoop fs -ls /tpchFound 8 itemsdrwxr-xr-x   - root supergroup          0 2016-05-06 12:10 /tpch/customerdrwxr-xr-x   - root supergroup          0 2016-05-06 15:42 /tpch/lineitemdrwxr-xr-x   - root supergroup          0 2016-05-06 12:10 /tpch/nationdrwxr-xr-x   - root supergroup          0 2016-05-11 14:14 /tpch/ordersdrwxr-xr-x   - root supergroup          0 2016-05-06 12:10 /tpch/partdrwxr-xr-x   - root supergroup          0 2016-05-06 12:11 /tpch/partsuppdrwxr-xr-x   - root supergroup          0 2016-05-06 12:11 /tpch/regiondrwxr-xr-x   - root supergroup          0 2016-05-06 12:11 /tpch/supplier[root@node20 data]# hadoop fs -ls /tpch/ordersFound 1 items-rw-r--r--   3 root supergroup  870187306 2016-05-11 14:14 /tpch/orders/orders.tbl[root@node20 data]# hadoop fs -ls /tpch/lineitemFound 2 itemsdrwxr-xr-x   - root supergroup          0 2016-05-06 15:42 /tpch/lineitem/lineitem-rw-r--r--   3 root supergroup 7775727688 2016-05-06 12:12 /tpch/lineitem/lineitem.tbl</span>

【注】如果你遇到如下异常情况,请确保impala所需的环境变量已经导入。执行source ${IMPALA_HOME}/bin/impala-config.sh

<span style="font-size:12px;">[root@node20 data]# ./tpch_prepare_data.sh -mkdir: java.net.UnknownHostException: node1.hadoop.comUsage: hadoop fs [generic options] -mkdir [-p] <path> ...-mkdir: java.net.UnknownHostException: node1.hadoop.comUsage: hadoop fs [generic options] -mkdir [-p] <path> ...-mkdir: java.net.UnknownHostException: node1.hadoop.comUsage: hadoop fs [generic options] -mkdir [-p] <path> ...-mkdir: java.net.UnknownHostException: node1.hadoop.comUsage: hadoop fs [generic options] -mkdir [-p] <path> ...-mkdir: java.net.UnknownHostException: node1.hadoop.comUsage: hadoop fs [generic options] -mkdir [-p] <path> ...-mkdir: java.net.UnknownHostException: node1.hadoop.comUsage: hadoop fs [generic options] -mkdir [-p] <path> ...-mkdir: java.net.UnknownHostException: node1.hadoop.comUsage: hadoop fs [generic options] -mkdir [-p] <path> ...-mkdir: java.net.UnknownHostException: node1.hadoop.comUsage: hadoop fs [generic options] -mkdir [-p] <path> ...-mkdir: java.net.UnknownHostException: node1.hadoop.comUsage: hadoop fs [generic options] -mkdir [-p] <path> ...^C-copyFromLocal: java.net.UnknownHostException: node1.hadoop.comUsage: hadoop fs [generic options] -copyFromLocal [-f] [-p] <localsrc> ... <dst>-copyFromLocal: java.net.UnknownHostException: node1.hadoop.comUsage: hadoop fs [generic options] -copyFromLocal [-f] [-p] <localsrc> ... <dst>-copyFromLocal: java.net.UnknownHostException: node1.hadoop.comUsage: hadoop fs [generic options] -copyFromLocal [-f] [-p] <localsrc> ... <dst>-copyFromLocal: java.net.UnknownHostException: node1.hadoop.comUsage: hadoop fs [generic options] -copyFromLocal [-f] [-p] <localsrc> ... <dst>-copyFromLocal: java.net.UnknownHostException: node1.hadoop.comUsage: hadoop fs [generic options] -copyFromLocal [-f] [-p] <localsrc> ... <dst>-copyFromLocal: java.net.UnknownHostException: node1.hadoop.comUsage: hadoop fs [generic options] -copyFromLocal [-f] [-p] <localsrc> ... <dst>-copyFromLocal: java.net.UnknownHostException: node1.hadoop.comUsage: hadoop fs [generic options] -copyFromLocal [-f] [-p] <localsrc> ... <dst></span>

4. 生成表文本文件的数据表。

各类表的参考描述如下:

The official TPC-H specification: http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.1.pdf

直接执行如下语句,会一直卡住。

<span style="font-size:12px;">[root@node20 tpc-h-impala]# ./tpch_benchmark.sh ************************************************          TPC-H benchmark on Impala          ************************************************                                               See benchmark.log for more details of query errors.Executing Trial #1 of 1 trial(s)...Running query: q1_pricing_summary_report</span>

SO,直接查看该文件,可知自动导入表到impala是通过tpch_prepare/文件夹中的.hive文件来完成的。
分析可知,直接打开impala-shell,手动创建外部表。

<span style="font-size:12px;">[node20:21000] > create external table orders (O_ORDERKEY INT, O_CUSTKEY INT, O_ORDERSTATUS STRING, O_TOTALPRICE DOUBLE, O_ORDERDATE STRING, O_ORDERPRIORITY STRING, O_CLERK STRING, O_SHIPPRIORITY INT, O_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch/orders';Query: create external table orders (O_ORDERKEY INT, O_CUSTKEY INT, O_ORDERSTATUS STRING, O_TOTALPRICE DOUBLE, O_ORDERDATE STRING, O_ORDERPRIORITY STRING, O_CLERK STRING, O_SHIPPRIORITY INT, O_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch/orders'Fetched 0 row(s) in 0.04s[node20:21000] > select count(*) from orders;Query: select count(*) from orders+----------+| count(*) |+----------+| 7500000  |+----------+Fetched 1 row(s) in 5.03s</span>

其它表的插入过程类似,在.hive文件中检索相应的创建表的sql语句。

5.生成parquet格式的数据表。

<span style="font-size:12px;">[node20:21000] > create database tpch_parquet;Query: create database tpch_parquetFetched 0 row(s) in 0.10s[node20:21000] > show databases;Query: show databases+------------------+----------------------------------------------+| name             | comment                                      |+------------------+----------------------------------------------+| _impala_builtins | System database for Impala builtin functions || default          | Default Hive database                        || tpch_parquet     |                                              |+------------------+----------------------------------------------+Fetched 3 row(s) in 0.01s[node20:21000] > create table tpch_parquet.orders_pq stored as parquet as select * from orders;Query: create table tpch_parquet.orders_pq stored as parquet as select * from orders+-------------------------+| summary                 |+-------------------------+| Inserted 7500000 row(s) |+-------------------------+Fetched 1 row(s) in 10.62s[node20:21000] > desc tpch_parquet.orders_pq;Query: describe tpch_parquet.orders_pq+-----------------+--------+---------+| name            | type   | comment |+-----------------+--------+---------+| o_orderkey      | int    |         || o_custkey       | int    |         || o_orderstatus   | string |         || o_totalprice    | double |         || o_orderdate     | string |         || o_orderpriority | string |         || o_clerk         | string |         || o_shippriority  | int    |         || o_comment       | string |         |+-----------------+--------+---------+Fetched 9 row(s) in 0.01s[node20:21000] > show table stats tpch_parquet.orders_pq;Query: show table stats tpch_parquet.orders_pq+-------+--------+----------+--------------+-------------------+---------+-------------------+-----------------------------------------------------------------+| #Rows | #Files | Size     | Bytes Cached | Cache Replication | Format  | Incremental stats | Location                                                        |+-------+--------+----------+--------------+-------------------+---------+-------------------+-----------------------------------------------------------------+| -1    | 3      | 269.18MB | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://localhost:20500/test-warehouse/tpch_parquet.db/orders_pq |+-------+--------+----------+--------------+-------------------+---------+-------------------+-----------------------------------------------------------------+Fetched 1 row(s) in 0.01s</span>


0 0