dsdgen + dsqgen Installation


1. Install VMware 9 with CentOS 6.4 i386. (If you use VirtualBox instead, give the VM more memory and video memory, otherwise the graphical installer will not appear; in my test, VirtualBox also crashed on reboot after the install.)

2. Install gcc and g++ (on CentOS the g++ package is gcc-c++):

   yum -y install gcc gcc-c++

3. Download dsdgen: search for tpc-ds on the official TPC site; the download link is sent to the email address you register. Then unzip the archive.

4. Run make in the tools folder.

5. Generate the data:

   dbgen2 -scale 100 -dir /tmp

Everything else is covered in the guide quoted below:

TPC-DS “how to” Guide

 

Introduction

The purpose of this guide is to describe the most common tasks necessary to implement a TPC-DS benchmark.  The target audience is individuals who want to install, populate, run and analyze the database, queries and data maintenance workloads for TPC-DS.

 

This guide does not discuss anything related to actually publishing TPC-DS results (e.g.: tuning, data layout, pricing, auxiliary data structures, auditing, etc.).

 

For more details, see the official TPC-DS specification.

 

Change Log

Date           Version   Comments
Feb. 8, 2007   1.0       Initial draft for kit 1.1.52 (Doug Johnson/Netezza)
Feb. 22, 2007  1.1       Added comments from Meikel Poess/Oracle and Shirley Wang/MS

 

How to download the latest kit

The kit is downloadable from www.tpc.org. Use the link to the TPC-DS benchmark.

 

What’s in the kit

There are hundreds of files in the kit but only a few that you have to read or modify. Here’s the list of files that you should read:

  1. Specification.doc: latest specification document

 

Here’s the list of files that you will probably have to modify:

  1. tpcds.sql: file with SQL to create the data warehouse tables. You will probably need to add allocation or distribution information.
  2. tpcds_source.sql: file with SQL to create the source/staging tables. You will probably need to add allocation or distribution information.
  3. Makefile.suite: file with Unix-style make commands to build the dbgen2 and qgen2 binaries for a target platform. You will need to specify the execution environment (see make/build section below).
  4. vcproj files: files used to build the tools for Windows Visual C++.
  5. query_templates: folder with query template files (.tpl). You don’t manually change these files but the qgen2 program transforms them into executable SQL.
  6. TBD: folder with sample ANSI SQL “create view” statements used to refresh the data warehouse during the data maintenance phase.

 

How to make/build the binaries

For AIX, LINUX, HPUX, NCR and Solaris

  1. Copy Makefile.suite to Makefile
  2. Edit Makefile and find the line containing “OS = “
  3. Read the comments and append your target OS. For example: “OS = LINUX”
  4. Run “make”
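Steps 1–3 above can be scripted. The sketch below fabricates a one-line stand-in for Makefile.suite so that it is self-contained and runnable anywhere; in the real kit you would instead run it in the directory that contains the actual Makefile.suite, then finish with step 4 (“make”):

```shell
# Steps 1-3 as a script. The stand-in Makefile.suite is fabricated here only
# so the sketch runs on its own; use the kit's real file in practice.
cd "$(mktemp -d)"
printf 'OS = \n' > Makefile.suite          # stand-in for the kit's file
cp Makefile.suite Makefile                 # step 1: copy to Makefile
sed -i 's/^OS *=.*/OS = LINUX/' Makefile   # steps 2-3: set the target OS
grep '^OS' Makefile                        # prints: OS = LINUX
```

The sed line assumes GNU sed (as shipped with CentOS); on other platforms, edit the “OS = ” line by hand as the guide describes.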

 

For Windows

  1. Install Microsoft Visual Studio 2005
  2. Open the entire solution by double-clicking on dbgen2.sln (you might see an error saying “project file grammar.vcproj” has failed to load; you can safely ignore this error)
  3. From list of projects, right click on dbgen2 and select “build” (or from top menu, Build -> Build Solution).  This will build mkheader and distcomp before building dbgen2.
  4. Repeat step 3 for building qgen2
  5. To cross compile for X64 and IA64 on X86 platforms, install Microsoft Visual Studio 2005 “Team Suite” SKU, modify the target platform from Build -> Configuration Manager, and repeat steps 3 & 4.

 

How to generate the load data

The dbgen2 utility generates input data to (a) load the initial data warehouse and (b) “refresh” the data warehouse for the data maintenance workload. This section describes how to generate the load data.

 

Run “dbgen2 -h” for the help info. Note that many of the options are “advanced” and usually not needed.

 

Example to generate the load data files for a 100GB database in the /tmp directory:

 

   dbgen2 -scale 100 -dir /tmp

 

The official scale factors are 100GB, 300GB, 1TB, 3TB, 10TB, 30TB and 100TB.

 

The output files will be of the form “<table>.csv”. Even though the file suffix is “.csv”, the default field delimiter is ‘|’. Use the “-delimiter ‘<c>’” option to change delimiters.

 

Since dbgen2 only generates 200-300GB/hour (on a 2-3GHz x86 processor), it is useful to run multiple parallel streams when generating large amounts of data. Here’s an example for generating 100GB with 4 parallel streams simultaneously on Linux/Unix:

 

     dbgen2 -scale 100 -dir /tmp -parallel 4 -child 1 &
     dbgen2 -scale 100 -dir /tmp -parallel 4 -child 2 &
     dbgen2 -scale 100 -dir /tmp -parallel 4 -child 3 &
     dbgen2 -scale 100 -dir /tmp -parallel 4 -child 4 &
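The four commands above can also be wrapped in a loop that waits for every child to finish. A minimal sketch; the leading `echo` makes it a dry run that only prints each command, so remove it to invoke the real dbgen2:

```shell
# Dry run: print the dbgen2 command for each of the 4 child streams.
# Remove 'echo' (keeping the '&' and the final 'wait') to generate data.
for child in 1 2 3 4; do
  echo dbgen2 -scale 100 -dir /tmp -parallel 4 -child "$child" &
done
wait   # returns once all background children have exited
```

The `wait` at the end is what makes the loop safe to embed in a larger script: nothing after it runs until every child stream is done.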

 

Note that dbgen2 always reads the “tpcds.idx” file, so if you run it from somewhere other than the “kit” directory, then you need to copy tpcds.idx to the current directory.

 

How to load the data

Run the loader provided with your DBMS to load the dbgen2-generated data files into the data warehouse tables.

 

Note that the default delimiter is ‘|’, so you may need to specify a different delimiter with dbgen2 or the loader if the defaults don’t match.

 

Also, the default “null” value is “||”, so if your loader expects (for example) “|NULL|”, then you will need to override the loader’s value for nulls.
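If your loader cannot be told that an empty field means NULL, the generated files can be rewritten first. A minimal sed sketch (the literal token NULL is an assumption about what your loader expects; the substitution runs twice because adjacent empty fields overlap, and empty fields at the very start or end of a line would still need extra handling):

```shell
# Rewrite '||' (empty field) to '|NULL|'. Two passes catch adjacent nulls.
printf 'a|||b\n' | sed 's/||/|NULL|/g; s/||/|NULL|/g'
# prints: a|NULL|NULL|b
# For real files, apply in place, e.g.:
#   sed -i 's/||/|NULL|/g; s/||/|NULL|/g' /tmp/*.csv
```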

 

Unlike TPC-H, the load time is a component of the QphDS@SF performance metric.

 

How to generate query SQL from templates

The “qgen2” utility is used to transform the query templates (see query_templates/*.tpl) into executable SQL for your target DBMS. The unmodified templates are not executable.

 

Run “qgen2 -h” for the help info. Note that many of the options are “advanced” and usually not needed.

 

Since some common SQL features do not have ANSI standard forms (e.g. “LIMIT” and “BEGIN/COMMIT”), the qgen2 utility must be told which “dialect” to use. The following “dialect templates” are supported: db2.tpl, netezza.tpl, oracle.tpl, sqlserver.tpl. The following example generates a SQL file (named query_0.sql) from the query99 template for a 100GB database using Oracle syntax.

 

   qgen2 -query99.tpl -directory query_templates -dialect oracle -scale 100
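The benchmark uses all 99 templates, so the call above is typically looped, with the flags copied unchanged from the example. A sketch; the `echo` makes it a dry run that prints each command, so remove it to run qgen2 from the kit directory:

```shell
# Dry run: print the qgen2 invocation for each of the 99 query templates.
for q in $(seq 1 99); do
  echo qgen2 -query${q}.tpl -directory query_templates -dialect oracle -scale 100
done
```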

 

Note that qgen2 also reads the “tpcds.idx” file (in the “kit” directory) and the “ansi.tpl” file (in “kit/query_templates”), so run qgen2 from the kit directory or copy both files to the directory you run it from.

 

How to run the queries

You can, of course, run the queries any way you want but the “official” method is to run N concurrent query streams where N is a function of the database size. Unlike TPC-H, the TPC-DS query workload does not have a single-stream query component. The relationship between database size (SF) and query streams (N) is:

 

SF         N
100        7
300        9
1,000      11
3,000      13
10,000     15
30,000     17
100,000    19
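For example, at SF=100 the required seven streams could be launched concurrently with a loop like the one below. The `isql -db` invocation style follows the examples later in this guide; the per-stream SQL file names are hypothetical, and the `echo` makes it a dry run:

```shell
# Dry run: print the command that would start each of N=7 concurrent
# query streams for a 100GB (SF=100) database. Remove 'echo' to execute;
# the query_stream_<n>.sql files are assumed to have been built with qgen2.
N=7
for s in $(seq 1 $N); do
  echo isql -db tpcds100 -f query_stream_${s}.sql &
done
wait   # the measured interval ends when the slowest stream finishes
```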

 

 

How to generate the refresh data

Run “dbgen2 -h” for the help info. Note that many of the options are “advanced” and usually not needed.

 

Example to generate the refresh data files in /tmp for the 3rd “update” stream:

 

   dbgen2 -scale 100 -dir /tmp -update 3

 

The output files will be of the form “s_<table>_<stream>.csv”. The default field delimiter is ‘|’. Use the “-delimiter ‘<c>’” option to change delimiters.

 

Since dbgen2 only generates 200-300GB/hour (on a 2-3GHz x86 processor), it is useful to run multiple parallel streams when generating large amounts of data. Here’s an example for the 3rd stream/child of 10 parallel streams:

 

   dbgen2 -scale 100 -dir /tmp -update 3 -parallel 10 -child 3

 

Note that dbgen2 always reads the “tpcds.idx” file, so if you run it from somewhere other than the “kit” directory, then you need to copy tpcds.idx to the current directory.

 

How to run the data maintenance workload

The rules for the data maintenance phase (a.k.a. “refresh”) are intentionally “open-ended” because the natural methods for implementation are DBMS-specific. The specification describes the methods in general and the kit contains some sample SQL, but be warned that implementing the refresh workload on your system will require some time (think weeks, not days) from someone who has read the specification and knows what they’re doing.

 

As with the query streams, you can run the refresh streams any way you want. Officially, the specification defines the same number of streams for both query and refresh operations. Unlike the query streams, the refresh streams are run serially. Within a single stream, you can run the individual DM functions serially or in parallel.

 

In general, the process is:

 

  1. Create the staging tables (optional). Example: “isql -db tpcds100 -f tpcds_source.sql”
  2. Load the staging tables using the DBMS-supplied loader.
  3. Run whatever else your database requires (e.g. grant permissions, generate statistics, etc.) to prepare the tables.
  4. Create sequences needed for the “history keeping” tables: call_center, item, store, web_page and web_site. See the specification for definition of “history keeping”.
  5. Create the refresh views. The kit contains sample ANSI SQL that will almost certainly require some changes.
  6. Run the refresh functions. There are five types of refresh functions – corresponding to the type of table and type of data maintenance operations. See the specification for details about the functions. In general, the methods involve running UPDATE, INSERT and DELETE commands using the views created above to refresh the data warehouse tables.

 

Here’s an example using a generic shell script and SQL.

 

MAX=`nzsql tpcds1 -At -c "(SELECT max(cc_call_center_sk)+1 FROM call_center)"`

isql -db $DB -c "CREATE SEQUENCE cc_seq START WITH $MAX"

 

 



