Configure DB2 remote connectivity with WebSphere DataStage Enterprise Edition

来源:互联网 发布:centos netsnmp 编辑:程序博客网 时间:2024/06/14 13:17
 

Introduction

WebSphere DataStage is one the foremost leaders in the ETL (Extract,Transform, and Load) market space. One of the great advantages of thistool is its scalability, as it is capable of parallel processing on anSMP, MPP or cluster environment. Although DataStage Enterprise Edition(DS/EE) provides many types of plug-in stages to connect to DB2,including DB2 API, DB2 load, and dynamic RDBMS, only DB2 EnterpriseStage is designed to support parallel processing for maximumscalability and performance.

The DB2 Data Partitioning Feature (DPF) offers thenecessary scalability to distribute a large database over multiplepartitions (logical or physical). ETL processing of a large bulk ofdata across whole tables is very time-expensive using traditionalplug-in stages. DB2 Enterprise Stage however provides a parallelexecution engine, using direct communication with each databasepartition to achieve the best possible performance.

DB2 Enterprise Stage with DPF communication architecture


Figure 1. DS/EE remote DB2 communication architecture
DS/EE remote DB2 communication architecture

Asyou see in Figure 1, the DS/EE primary server can be separate from theDB2 coordinate node. Although a 32-bit DB2 client still must beinstalled, it’s different from the typical remote DB2 access whichrequires only DB2 client for connectivity. It can be used to pre-querythe DB2 instance and determine partitioning of source or target table.On the DB2 server, every DB2 DPF partition must have the DS/EE engineinstalled. In addition, the DS/EE engine and libraries must beinstalled in the same location on all DS/EE servers and DB2 servers.

The following principles are important in understanding how this framework works:

  • DataStage conductor node uses local DB2 environment variables to determine DB2 instance.
  • DataStagereads the DB2nodes.cfg file to determine each DB2 partition.DB2nodes.cfg file is copied from DB2 server node and can be put anylocation of sqllib subdirectory on DS/EE server. One DS/EE environmentvariable $APT_DB2INSTANCE_HOME can be used to specify this location ofsqllib.
  • DataStage scans the current parallelconfiguration file specified by environment variable $APT_CONFIG_FILE.Each fastname property of this file must have a match with the nodename of DB2nodes.cfg.
  • DataStage conductor node queries local DB2 instance using the DB2 client to determine table partition information.
  • DataStagestarts up processes across ETL and DB2 nodes in the cluster. DB2/UDBEnterprise stage passes data to/from each DB2 node through theDataStage parallel framework, not the DB2 client. The parallelexecution instance can be examined from the job monitor of theDataStage Director.


Back to top

Environment used in our example


Figure 2. Example topology
Example topology

Inour example, we use 2 machines with RedHat Enterprise Linux 3.0operating system for testing, one with 2 CPUs and 1G memory for the DB2server, another with 1 CPU with 1G memory for DS/EE server. In the DB2server, we have 2 database partitions which can be configured viaDB2nodes.cfg, while in DS/EE server; the engine configuration filetells us which nodes are used to execute DataStage jobs concurrently.

Thefollowing are steps we followed to successfully configure remote DB2instance using DS/EE DB2 Enterprise Stage. We will begin this exercisefrom scratch, including DB2 server configuration, DS/EE installationand configuration.



Back to top

Installation and configuration steps for the DB2 server

  1. Install DB2 Enterprise Server Edition (with DPF) and create a DB2 instance at Stage164 node.
  2. Configure rsh service and remote authority file.
  3. Create sample database and check distribution of tables.
  4. Create DS/EE users on all members of both nodes.

If DB2 DPF environments are installed and configured, you can skip step 1 and step 3

Step 1. Install DB2 Enterprise Server and create DB2 instance at Stage164 node

Check your DB2 version before installing DB2 ESE on Stage164 node.For our example we used V8.1 fix pack 7. For DPF feature, you must haveanother separate license. Pay attention to Linux kernel parameterswhich can potentially affect DB2 installation. Please follow the DB2installation guide.

  1. Before installation, create DB2 group and DB2 users.
    [root@stage164 home]# groupadd –g db2grp1
    [root@stage164 home]# groupadd –g db2fgrp1
    [root@stage164 home]# useradd –g db2grp1 db2inst1
    [root@stage164 home]# useradd –g db2fgrp1 db2fenc1
    [root@stage164 home]# passwd db2inst1

  2. Createinstance. Install DB2, then create the instance using the GUI orcommand line. If using the command line, switch to DB2 install pathwith root user and issue the command below to create one DB2 instancethe users created in the previous step as parameters.
    [root@stage164 home]# cd /opt/IBM/db2/V8.1/instance/
    [root@stage164 instance]# ./db2icrt -u db2fenc1 db2inst1

  3. Confirm db2inst1 instance was created successfully. If it failed, please refer to the official DB2 installation documentation.
    [root@stage164 instance]# su – db2inst1
    [db2inst1@stage164 db2inst1]$ db2start
    05-19-2006 03:56:01 0 0 SQL1063N DB2START processing was successful.
    SQL1063N DB2START processing was successful.

  4. Check to confirmthat DBM SVCENAME configuration parameter was configured successfully.If it is not set, the client has no way to connect to the DB2 server.In addition, the TCPIP communication protocol also must be set.
    [db2inst1@stage164 db2inst1]$ db2 get dbm cfg | grep -i svcename
    TCP/IP Service name (SVCENAME) = 50000
    [db2inst1@stage164 db2inst1]$ db2set DB2COMM=TCPIP

Step 2. Configure remote shell (rsh) service and remote authority file.

For the DPF environment, DB2 needs the remote shell utility tocommunicate and execute commands between each partition. Rsh utilitycan be used for inter-partition communication; OpenSSH utility isanother option for inter-partition communication that protects securecommunication. For simplicity, we will not cover it in this article.

  1. Check whether rsh server has installed. If not, download it and issue "rpm –ivh rsh.server-xx.rpm" to install it.
    [root@stage164 /]# rpm -qa | grep -i rsh
    rsh-0.17-17
    rsh-server-0.17-17

  2. Confirm rsh service can be started successfully.
    [root@stage164 /]#service xinetd start
    [root@stage164 /]#netstat –na | grep 514

  3. Create or modify filefor authority users to execute remote commandsYou can create (or edit if it already exists) an /etc/hosts.equiv file.This first column of this file is the machine name, and the second isthe instance owner. For example, the following means only db2inst1 userhas authority to execute commands on Stage164 using rsh:
    .
    Stage164 db2inst1

  4. Check whether rshworks correctly or not by issuing below command using DB2inst1 user. Ifthe date doesn't show correctly, that means there is still aconfiguration problem.
    [db2inst1@stage164 db2inst1]$ rsh stage164 date
    Thu May 18 23:26:03 CST 2006

Step 3. Create DPF partitions and create sample database

  1. Editthe database partition configuration file (DB2nodes.cfg) under<DB2HOME>/sqllib. In this example, we have 2 logical partitionson Stage164 host.
    0  stage164  0
    1 stage164 1

  2. Restart DB2 instance and be sure both partitions can be started successfully.
    [db2inst1@stage164 db2inst1]$ db2stop force
    05-18-2006 23:32:08 0 0 SQL1064N DB2STOP processing was successful.
    SQL1064N DB2STOP processing was successful.
    [db2inst1@stage164 db2inst1]$ db2start
    05-18-2006 23:32:18 1 0 SQL1063N DB2START processing was successful.
    05-18-2006 23:32:18 0 0 SQL1063N DB2START processing was successful.
    SQL1063N DB2START processing was successful.

  3. Create sample database and check the data distribution. Accordingto result, the total row count of table department is 9, and 4 of 9 isdistributed into partition 0, while 5 of 9 into partition 1 accordingto partition key deptno.
    [db2inst1@stage164 db2inst1]$ db2sampl
    [db2inst1@stage164 db2inst1]$ db2 connect to sample
    [db2inst1@stage164 db2inst1]$ db2 "select count(*) from department"
    1
    -----------
    9
    1 record(s) selected.
    [db2inst1@stage164 db2inst1]$ db2 "select count(*) from department where
    dbpartitionnum(deptno)=0"
    1
    -----------
    4
    1 record(s) selected.
    [db2inst1@stage164 db2inst1]$ db2 "select count(*) from department where
    dbpartitionnum(deptno)=1"
    1
    -----------
    5
    1 record(s) selected.

Step 4. Create DS/EE users and configure the to access the DB2 database

If DS/EE users and groups have been created on the DS/EE node, thencreate the same users and groups on the DB2 server node. In any case,make sure you have the same DS/EE users and groups on these twomachines.

  1. Create DS/EE user/groups at DB2 server. In this example, they are dsadmin/dsadmin. Also add DS/EE user to DB2 instance group.
    [root@stage164 home]# groupadd -g 501 dsadmin
    [root@stage164 home]# useradd –u 501 –g dsadmin –G db2grp1 db2inst1
    [root@stage164 home]# passwd dsadmin

  2. Add an entry in /etc/hosts.equiv file which was created in Step2.3. This gives dsadmin authority to execute some commands on Stage164.
    Stage164 db2inst1
    Stage164 dsadmin

  3. Add DB2 profile environment variable at <DSEngine_HOME>/.bashrc file (for example, <DSEngine_HOME> = /home/dsadmin).
    . /home/db2inst1/sqllib/db2profile

  4. Be sure dsadmin user can connect to sample db successfully.
    # su - dsadm

    $ db2 connect to sample

    Database Connection Information

    Database server = DB2/6000 8.2.3
    SQL authorization ID = DSADM
    Local database alias = SAMPLE
    $



Back to top

Installation and configuration steps for the DS/EE node

  • Install DataStage Enterprise Edition(DS/EE) and DB2 client
  • Add DB2 library and instance home at DS/EE configuration file
  • Catalog sample db to DS/EE using dsadmin
  • Copy DB2nodes.cfg from DB2 server to DS/EE and configure environment variable.
  • NFS configuration, export /home/dsadmin/Ascential/
  • Verify DB2 operator library and execute DB2setup.sh and DB2grants.sh
  • Create or modify DS/EE configuration file
  • Restart DS/EE server

Now let's walk through the process in detail.

Step 1. Install DataStage Enterprise Edition(DS/EE) and DB2 client

First, DS/EE users and groups need to be created in advance. In thisexample, the user is dsadmin, group dsadmin. If DS/EE not installed,follow the WebSphere DataStage install guide. We assume the software isinstalled on DSHOME variable which is/home/dsadmin/Ascential/DataStage/DSEngine.Then, install the DB2 client and create one client instance at DS/EEnode.

Step 2. Add DB2 library and instance home at DS/EE configuration file

The dsenv configuration file, located under DSHOME directory, is one ofthe most important configuration files in DS/EE. It contains theenvironment variables and library path. At this step, we will add DB2library to LD_LIBRARY_PATH so that DS/EE engine can connect to DB2.

Note: PXEngine library should precede DB2 library for LD_LIBRARY_PATH environment path.

Configure the dsenv file as follows:


PATH=$PATH:/home/dsadmin/Ascential/DataStage/PXEngine/bin:/home/dsadmin/Ascential/
DataStage/DSEngine/bin

# for DB2 configuration
DB2DIR=/opt/IBM/db2/V8.1; export DB2DIR
DB2INSTANCE=db2inst1; export DB2INSTANCE
INSTHOME=/home/db2inst1; export INSTHOME
DB2PATH=/opt/IBM/db2/V8.1; export DB2PATH

LD_LIBRARY_PATH=$LD_LIBRARY_PATH: /home/dsadmin/Ascential/DataStage/PXEngine/lib:
$DB2DIR/lib:$INSTHOME/sqllib/lib; export LD_LIBRARY_PATH

PATH=$PATH:$INSTHOME/sqllib/bin:$INSTHOME/sqllib/adm; export PATH

You can add this dsenv file to dsadmin .bashrc file(/home/dsadmin/.bashrc) to avoid executing it manually every time. Whatyou need do is to exit the dsadmin user and re-logon to make it executeand take effect.

. /home/dsadmin/Ascential/DataStage/DSEngine/dsenv

Step 3. Catalog remote sample db to DS/EE using dsadmin

  1. Catalog remote sample database from DB2 EE (Stage164) to DS/EE using dsadmin.
    [dsadmin@transfer dsadmin]$ db2 CATALOG TCPIP NODE stage164 REMOTE stage164 SERVER 50000
    [dsadmin@transfer dsadmin]$ db2 CATALOG DB sample AS samp_02 AT NODE stage164

  2. Configure rsh utility according to Step 2of "Installation and configuration steps for the DB2 server." Be suredsadmin user at transfer can execute remote commands for Stage164 usingrsh.
    [dsadmin@transfer dsadmin]$ rsh stage164 date
    Thu May 19 10:22:09 CST 2006

Step 4. Copy DB2nodes.cfg from DB2 server to DS/EE and configure environment variable.

Copy the DB2nodes.cfg file from the DB2 server to one directory ofDS/EE. This file tells DS/EE engine how many DB2 partitions there arein the DB2 server. Then create one environment variableAPT_DB2INSTANCE_HOME by DataStage Administrator to point to thedirectory. This variable can be specified at the project level or thejob level.

Step 5. NFS configuration, export /home/dsadmin/Ascential/

First, add 2 machine names into /etc/hosts file at both nodes toidentify one another’s network name. Then, share the DS/EE wholedirectory to the DB2 server so that each partition can communicate withDS/EE.

  1. At DS/EE node, export /home/dsadmin/Ascential directory.This can be done by adding an entry in /etc/exports file, it will allowusers from stage164 machine to mount /home/dsadmin/Ascential directorywith read/write authority.
    /home/dsadmin/Ascential   stage164(rw,sync)

  2. Once you have changed /etc/export file, you must notify NFS daremodprocess to reload changes. Or you can stop and restart this nfsdprocess by issuing following commands:
    [root@transfer /]# service nfs start
    Starting NFS services: [ OK ]
    Starting NFS quotas: [ OK ]
    Starting NFS daemon: [ OK ]
    Starting NFS mountd: [ OK ]

  3. Then at DB2 server, create one directory called/home/dsadmin/Ascential, it’s same with DS/EE server, then mount thisdirectory to remote DS/EE directory.
    [root@stage164 home]# mount -t nfs -o rw transfer:/home/dsadmin/Ascential /home/dsadmin/
    Ascential

    You can check mounted files as follows:

    [root@stage164 home]# df -k
    Filesystem 1K-blocks Used Available Use% Mounted on
    /dev/sda1 7052464 3491352 3202868 53% /
    transfer:/home/dsadmin/Ascential
    7052464 6420892 273328 96% /home/dsadmin/Ascential

    To avoid mounting it every time when machine restart, you can alsoadd this entry into file /etc/fstab to mount this directoryautomatically:

    transfer:/home/dsadmin/Ascential /home/dsadmin/Ascential nfs defaults 0 0

Step 6. Verify DB2 operator library and execute DB2setup.sh and DB2grants.sh

  1. Execute DB2setup.sh script ,which located in $PXHOME/bin.Note, you may have a problem for remote DB2 instances; you will need tochange the connect userid and password.
      db2 connect to samp_02 user dsadmin using passw0rd
    db2 bind ${APT_ORCHHOME}/bin/db2esql.bnd datetime ISO blocking all grant public
    # this statement must be run from /instance_dir/bnd
    cd ${INSTHOME}/sqllib/bnd
    db2 bind @db2ubind.lst blocking all grant public
    db2 bind @db2cli.lst blocking all grant public
    db2 connect reset
    db2 terminate

  2. Execute DB2grants.sh
    db2 connect to samp_2 user dsadmin using passw0rd
    db2 grant bind, execute on package dsadm.db2esql to group dsadmin
    db2 connect reset
    db2 terminate

Step 7. Create or modify DS/EE configuration file

DS/EE provides parallel engine configuration files. DataStage learnsabout the shape and size of the system from the configuration file. Itorganizes the resources needed for a job according to what is definedin the configuration file. The DataStage configuration file needs tocontain the node on which DataStage and the DB2 client are installedand the nodes of the remote computer where the DB2 server is installed.

The following is one example. For more detail info of engineconfiguration file, please refer to the "Parallel job developmentguide."

{
node "node1"
{
fastname "transfer"
pools ""
resource disk "/home/dsadmin/Ascential/DataStage/Datasets" {pools ""}
resource scratchdisk "/home/dsadmin/Ascential/DataStage/Scratch" {pools ""}
}
node "node2"
{
fastname "stage164"
pools ""
resource disk "/tmp" {pools ""}
resource scratchdisk "/tmp" {pools ""}
}
}

Step 8. Restart DS/EE server and test connectivity

At this point you have completed all configurations on both nodes. Restart DS/EE server by issuing the commands below:

[dsadmin@transfer bin]$ uv -admin –stop
[dsadmin@transfer bin]$ uv -admin -start

Note: after stopping the DS/EE engine, you need to exit dsadmin andre-logon, and the dsenv configuration file will be executed. Also, besure the time interval between stop and start is longer than 30 secondsin order for the changed configuration to take effect.

Next, we will test remote connectivity using DataStage Designer. Choose Import plug-in table definition. The following window will appear. Click Next. If it imports successfully, that means the remote DB2 connectivity configuration has succeeded.


Figure 3. DB2 Enterprise stage job
DB2 Enterprise stage job


Back to top

Develop one DB2 enterprise job on DS/EE

Inthis part, we will develop one parallel job with DB2 Enterprise Stageusing DataStage Designer. This job is very simple because it justdemonstrates how to extract DB2 department table data to one sequentialfile.


Figure 4. Import DSDB2 Meta Data
Import DSDB2 Meta Data

Double-clickthe DB2 Enterprise stage icon, and set the following properties to theDB2 Enterprise stage. For detailed information, please reference the"Parallel job developer’s guide."


Figure 5. DS/EE DB2 Enterprise stage properties
DS/EE DB2 Enterprise stage properties
  • Client Instance Name: Set this to the DB2 client instance name. If you set this property, DataStage assumes you require remote connection.
  • Server:Optionally set this to the instance name of the DB2 server. Otherwiseuse the DB2 environment variable, DB2INSTANCE, to identify the instancename of the DB2 server.
  • Client Alias DB Name: Setthis to the DB2 client’s alias database name for the remote DB2 serverdatabase. This is required only if the client’s alias is different fromthe actual name of the remote server database.
  • Database:Optionally set this to the remote server database name. Otherwise usethe environment variables APT_DBNAME or APT_DB2DBDFT to identify thedatabase.
  • User: Enter the user name for connecting to DB2. This is required for a remote connection.
  • Password: Enter the password for connecting to DB2. This is required for a remote connection.

Addthe following two environment variables into this job via DataStageManager. APT_DB2INSTANCE_HOME defines DB2nodes.cfg location, whileAPT_CONFIG_FILE specifies the engine configuration file.


Figure 6. Job properties set
Job properties set


Back to top

Performance comparison between Enterprise Stage and API Stage

Inthis part, we will execute the jobs developed above by DataStageDirector and compare the performance between DS/EE Enterprise Stage andAPI Stage. The following is another job with DB2 API Stage.


Figure 7. DB2 API stage
DB2 API stage

o generate a quantity of test data, we created the following stored procedure:


CREATE PROCEDURE insert_department( IN count)
language sql
begin
declare number int;
declare str varchar(10);
declare deptno char(10);
set number=1;

while ( number>count)
do
set deptno=char( mod(number, 100) );
insert into department values( deptno, 'deptname', 'mgr', 'dep', 'location');
if( mod(number, 2000)=0) then
commit;
end if;
set number=number+1 ;
end while;
end@

Execute the stored procedure:

DB2 –td@ -f emp_resume.sql
DB2 call emp_resume( 5000000)

Then, we execute these 2jobs against 100,000, 1M and 5M rows via DataStage Director and observethe result using the job monitor. The following screenshots are testresults with DB2 Enterprise Stage and DB2 API Stage.


Figure 8. 100,000 records (DB2 Enterprise Stage)
100,000 records (DB2 Enterprise Stage)

Figure 9. 100,000 records (DB2 API stage)
100,000 records (DB2 API stage)

Figure 10. 1,000,000 records (DB2 Enterprise Stage)
1,000,000 records (DB2 Enterprise Stage)

Figure 11. 1,000.000 records (DB2 API Stage)
1,000.000 records (DB2 API Stage)

Figure 12. 5,000,000 records (DB2 Enterprise Stage)
5,000,000 records (DB2 Enterprise Stage)

Figure 13. 5,000,000 records (DB2 API Stage)
5,000,000 records (DB2 API Stage)

Figure 14. Compare performance between Enterprise Stage and API Stage
Compare performance between Enterprise Stage and API Stage

InFigure 8, there are 2 nodes are executing ETL with DB2 EnterpriseStage, while with DB2 API Stage there is 1 node. The ETL processing ofEnterprise Stage per second is above 2 times to API Stage. Furthermore,with data growth, Enterprise Stage has greater advantage because ofparallel performance.



Back to top

Limitation

DB2Enterprise Stage has a great parallel performance over the other DB2plug-in stages using a DB2 DPF environments, however, it requires thehardware and operating system of ETL server, and the DB2 nodes must bethe same. Consequently, it’s not a replacement for other DB2 plug-instages, especially in heterogeneous environments.



Back to top

Conclusion

Thisarticle has described how to configure remote connectivity for DS/EEDB2 Enterprise Stage using step-by-step instructions. In addition, weprovided a performance comparison between Enterprise Stage and DB2 APIStage using two DS/EE DataStage jobs.



Resources

  • 1.Ascential Developer Site

  • 2.DataStage Enterprise Edition Configure Parallel Remote DB2 Connectivity
原创粉丝点击