High-performance solution to feeding a data warehouse with real-time data, Part 2: Explore the integration options with staging tables and WebSphere MQ messages

Integrating InfoSphere Replication Server and InfoSphere DataStage to trickle feed the data warehouse

Feeding a data warehouse with changes from the source database can be very expensive. If the extraction is only done with SQL, there is no way to easily identify the rows that have been changed. IBM InfoSphere™ Replication Server can detect changed data by reading only the database log. This series shows how to use InfoSphere Replication Server to efficiently extract only the changed data and how to pass the changes to IBM InfoSphere DataStage® to feed the data warehouse. Part 1 of the 2-part series provided an overview of these products and how they can work together. In this Part 2, explore two integration options: using WebSphere® MQ messages with InfoSphere Event Publisher and using staging tables.

Anand Krishniyer (akrishni@us.ibm.com), Staff Software Engineer, InfoSphere Replication, IBM

Anand Krishniyer is a staff engineer with the InfoSphere Replication development organization. As a member of the admin team, his responsibilities include developing line items as well as providing technical assistance to customers with respect to installation, setup, and configuration of the Replication Server. Prior to his current role, Anand worked as a project lead for the Integration and Tools team for a process management company, Savvion (now part of Progress Software).



Tony Lee (tonylee@us.ibm.com), Senior Certified IT Specialist, IBM

Tony Lee is a certified senior IT specialist in the InfoSphere Replication Center of Competency (COC), a team within the InfoSphere Replication development organization. As a member of the COC, Tony has provided technical assistance to customers and business partners on InfoSphere replication technologies. Tony has many years of consulting experience with various IBM replication technologies, covering SQL replication, Q replication, and, most recently, InfoSphere Change Data Capture. Prior to his current role, Tony provided Information Management technical consultation to customers and partners for nearly a decade, covering a wide range of topics, from DB2 tuning, to Information Server and Master Data Management. Prior to becoming a consultant, Tony worked in many different roles in the Information Management area, ranging from management to development.



James Yau (jamesyau@us.ibm.com), Technical Solution Architect, InfoSphere Information Server, IBM

James Yau is a senior solution architect certified on the InfoSphere Information Server DataStage product. Currently, he is part of the InfoSphere Technology Enablement organization responsible for Information Server Boot Camp content development and delivery. James has many years of consulting experience with the Information Server Suite of products, including InfoSphere DataStage, QualityStage, Information Analyzer, and FastTrack. Prior to his current role, James was part of a Business Partner Technical Enablement team, in which he was the technical program manager for InfoSphere Information Server. His role included course content development and delivery with various delivery vehicles, such as instructor-led, on-line, and self-paced learning. In the past, James worked in many different roles, ranging from software developer to marketing manager, both in IBM and outside of IBM.



02 September 2010

Also available in Chinese and Spanish

Before you start

About this tutorial

The Part 1 article of this series addressed the technologies of the InfoSphere Replication Server and DataStage products and the different ways of integrating the two to feed the warehouse. It also covered the pros and cons of the various integration options. In this Part 2 tutorial, explore two specific integration options: using MQ messages and using staging tables. This tutorial takes you through the setup and configuration of each of these integration options with screen shots and step-by-step instructions. This tutorial does not dive into the details of how to write a DataStage job or of how to configure replication, but instead concentrates on the integration techniques.

Prerequisites

The scenarios described in the tutorial can be performed in the following environment:

Operating system and hardware
  • AIX®, Version 5.3, operating system with 64-bit Common Hardware Reference Platform (CHRP) architecture hardware
  • Windows XP Professional Service Pack 3 with 32-bit Intel processor
Software
  • IBM InfoSphere Replication Server 9.7
  • IBM Information Server 8.1 Server for AIX (with Connectors rollup patch 2 for DB2® connector)
  • IBM Information Server 8.1 Client for Windows® (with Connectors rollup patch 2 for DB2 connector)
  • WebSphere MQ, Version 7 (for the MQ scenario)

Real-time integration with InfoSphere Event Publisher

Setting up Event Publisher

Part 1 discussed the technologies of the InfoSphere Replication Server and InfoSphere DataStage products and the different ways of integrating the two to feed the data warehouse. It also provided pros and cons for the various integration options. Review Part 1 if you need to before you begin to set up and configure the integration options.

Figure 1 illustrates a sample real-time integration setup. The Event Publisher configuration, where the source data resides, is on a Windows system (svl-akrishni2). The operational data store (ODS) on which InfoSphere DataStage runs is on an AIX system (scoobydoo). The data warehouse location does not matter and can be configured on an entirely different system.

Figure 1. Event publication integration details
Shows the source database on svl-akrishni2, output through the database log to Q Capture and the queue manager, connecting to the queue manager on scoobydoo, then to the DataStage job on scoobydoo

The Q Capture program captures the changes in the Product and Inventory source tables from the log and publishes them to a WebSphere MQ Queue. This queue is referred to as the send queue (SALES_DATA_TO_DATASTAGE) and is defined in a queue manager QMEP on the Windows system svl-akrishni2. This QMEP queue manager sends the messages across to a queue on the AIX system. This queue is referred to as the receive queue (SALES_DATA_TO_DATASTAGE, which is the same name as the send queue) and is defined in a queue manager QMDS on the AIX system.

The event publication is configured so that the message is in comma-separated value (CSV) format, and each message contains a single row operation, such as INSERT, UPDATE, or DELETE. Listing 1 shows a sample MQ message in the queue.

Listing 1. Sample MQ message in the queue
10,"IBM","2009355","021255984000","AKRISHNI","PRODUCT","REPL","0000:0000:0000:0000:5ae5",
"0000:87a0:c204:0000:0000","2009-12-21-10.12.52",,0000,"100-201-01","Ice Scraper, 
Windshield 4 inch",3.99,,,,,"100-201-01","Ice Scraper1, Windshield 4inch",3.99,,,,"
<product pid=""100-201-01"" ><description><name>Ice Scraper,Windshield
4 inch</name><details> Basic Ice Scraper 4 inches wide, foam handle
</details><price>3.99</price></description></product>"

As described in Part 1, the first 12 columns in the message are replication headers, and the remaining columns are the change data captured from the source table. The headers in columns 6 and 7 are important to your setup and configuration. The header in column 6 is the name of the source table, and the header in column 7 is a 4-character code for the SQL operation that caused the row change: ISRT for insert, REPL for update, and DLET for delete. For a detailed explanation of all the headers in the message, refer to the IBM InfoSphere Event Publisher information in the DB2 V9.7 Information Center (see Resources).

The DataStage job does the following:

  1. Reads the messages from the receive queue on the AIX system
  2. Parses the messages
  3. Transforms the data
  4. Writes to data sets, which are operating system files created by the Data Set stage

The change data corresponding to each of the three types of row operations (INSERT, UPDATE, or DELETE) for a particular source table is written to a different data set. So for each source table, the DataStage job generates three data sets containing the insert, update, and delete data, respectively.

Note: The example assumes there are no dependencies between the messages, such as referential constraints, so an INSERT message can be processed independently of a DELETE message. If there are dependencies across the messages, all the messages need to be written to a single data set to preserve the ordering, and the DataStage job itself needs to ensure that messages are processed in the correct order.

Configuring Event Publisher

The high-level steps for the Event Publisher configuration are as follows:

  1. Create the DB2 objects (the source tables)

    Note: This first step is only needed for this example scenario; it is not a step in the configuration setup. Normally, the DB2 objects would already exist, and you would identify the source tables and determine the subset of data to be replicated.
  2. Create the Event Publisher setup by doing the following:
    1. Set up MQ
    2. Set up Event Publisher
    3. Operate Event Publisher
  3. Create the DataStage setup by doing the following:
    1. Configure MQ Connector
    2. Configure the first transformer stage
    3. Configure the second transformer stage
    4. Configure the Data Set stages
  4. Test the real-time configuration setup by doing the following:
    1. Start Event Publisher
    2. Import the DataStage job
    3. Compile the DataStage job
    4. Run the test script to introduce a change in the source data
    5. Run the DataStage job
    6. Verify the data in the data sets to show they were populated

The following sections describe the example in detail. All the scripts needed to configure this sample are included with this tutorial in the Download section.

Create the DB2 objects

For the sample setup, you will use a provided database called SALES and import the PRODUCT and INVENTORY tables, which are provided as ixf files with the download. These are the source tables for the scenario. Alternatively, you can either use the PRODUCT and INVENTORY tables that come with the DB2 V9.7 sample database, or you can import the tables into another database of your choice.

To import the Product and Inventory tables, change directories to setupDB in the downloaded attachment, where you will find the product.ixf file and the inventory.ixf file. Open a DB2 command window by entering db2cmd, and run the following commands:

import from product.ixf of ixf create into product
import from inventory.ixf of ixf create into inventory
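Note that the import commands run against the currently connected database. A minimal end-to-end sequence in the DB2 command window, assuming you are importing into the provided SALES database (adjust the database name if you chose a different one), looks like this:

db2 connect to SALES
db2 import from product.ixf of ixf create into product
db2 import from inventory.ixf of ixf create into inventory
db2 terminate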

Setting up the WebSphere MQ queue managers

Begin creating the Event Publisher setup by setting up WebSphere MQ as follows:

  1. Create the WebSphere MQ queue manager QMEP on the source system svl-akrishni2 by issuing the command Crtmqm.exe QMEP.
  2. Start the queue manager by issuing the command Strmqm.exe QMEP.
  3. Create the rest of the MQ objects on the source system by issuing the command runmqsc.exe QMEP < mqObjectsPublisher.in

    The script mqObjectsPublisher.in is available under the RealTimeInt\setupEP directory of the download. This script creates all the MQ objects on the source system (queues, channel, and listener).

  4. Create the queue manager and the MQ objects on the target (subscriber) system by entering the following:
    crtmqm QMDS
    strmqm QMDS
    runmqsc QMDS < mqObjectsSubscriber.in

    The script mqObjectsSubscriber.in is available under the RealTimeInt\setupEP directory of the download that is available with this tutorial.
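The definitions themselves are in the download scripts, but the following MQSC sketch shows the kinds of objects such scripts create. Everything other than the queue name SALES_DATA_TO_DATASTAGE and the queue managers QMEP and QMDS (the channel, transmission queue, listener, and admin/restart queue names, host name, and ports) is an assumption for illustration only; the actual scripts in the download may differ:

* On QMEP (source system): transmission queue, remote queue definition for the
* send queue, sender channel to QMDS, and a listener
DEFINE QLOCAL('QMDS') USAGE(XMITQ)
DEFINE QREMOTE('SALES_DATA_TO_DATASTAGE') RNAME('SALES_DATA_TO_DATASTAGE') +
       RQMNAME('QMDS') XMITQ('QMDS')
DEFINE CHANNEL('QMEP.TO.QMDS') CHLTYPE(SDR) TRPTYPE(TCP) +
       CONNAME('scoobydoo(1414)') XMITQ('QMDS')
DEFINE LISTENER('QMEP.LSTR') TRPTYPE(TCP) PORT(1415) CONTROL(QMGR)
START LISTENER('QMEP.LSTR')
* Q Capture also needs administration and restart queues (names are assumptions)
DEFINE QLOCAL('ASN.ADMINQ')
DEFINE QLOCAL('ASN.RESTARTQ')

* On QMDS (target system): the local receive queue, receiver channel, and listener
DEFINE QLOCAL('SALES_DATA_TO_DATASTAGE')
DEFINE CHANNEL('QMEP.TO.QMDS') CHLTYPE(RCVR) TRPTYPE(TCP)
DEFINE LISTENER('QMDS.LSTR') TRPTYPE(TCP) PORT(1414) CONTROL(QMGR)
START LISTENER('QMDS.LSTR')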

Setting up Event Publisher

Complete the following steps to set up Event Publisher.

  1. Create the control tables required for the publication setup by running the following script with the ASNCLP program (the command-line replication administration tool) in a DB2 command window: asnclp -f crtCtlTables.in
  2. Create the queue map and the publications by entering asnclp -f crtQMapAndPublications.in

    The crtCtlTables.in and crtQMapAndPublications.in scripts are available under the RealTimeInt\setupEP directory of the download. The queue map is a mapping of the MQ objects to Q replication objects, and the publications cover the Product and the Inventory tables.

Operating Event Publisher

Start Event Publisher (the Q Capture program) by issuing the following command in a DB2 command window: asnqcap CAPTURE_SERVER=SALES CAPTURE_SCHEMA=ASN

This starts the Q Capture program, which starts publishing the changes in the Product and the Inventory tables. You can also use the startQCapture.bat script, provided in the RealTimeInt\setupEP directory of this tutorial's download.
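To verify that Q Capture is running before you move on, you can query its status. This assumes the default ASN capture schema used above; the output format varies by version:

asnqccmd CAPTURE_SERVER=SALES CAPTURE_SCHEMA=ASN STATUS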

Setting up the DataStage job

In the DataStage job, you use a WebSphere MQ connector to read the messages from the receive queue and two transformer stages to parse the messages, as shown in Figure 2. The first transformer stage is used as a selection control to separate the messages related to a particular publication, that is, a particular source table. The second transformer stage is used as a selection control to separate the messages related to a particular operation (insert, update, or delete) for each table. The separated data, which corresponds to each row change in the source table, is written to the respective data set for the insert, update, or delete operation. Use the Data Set stage to write the processed data, because data sets are DataStage's native file format: a single header file references the actual data, which can be partitioned across multiple DataStage parallel partitions. This format is readily usable by DataStage in parallel environments and therefore gives the best performance.

Figure 2. Event Publication DataStage job
Event Publication DataStage Job

The actual transformation processing that the main DataStage job performs can use these data sets as input. Typically the DataStage processing separates the data based on the row operation (insert, update, or delete). However, if there are referential constraints with the source data, the change data needs to be written to a single data set, instead of three different data sets.

The following sections describe in more detailed steps how to set up the DataStage job.

Configuring the MQ Connector

The following list shows the configuration of the MQ Connector:

  • Connection properties
    • Mode = server
    • Queue manager = QMDS
    • Username = db2inst1
  • Usage properties
    • Queue name = SALES_DATA_TO_DATASTAGE
    • Access mode = As in queue definition
    • Wait time = 1
    • Message quantity = -1
    • Message read mode = Delete

Following are some notes about the configuration elements:

Wait time
Use this property to specify the maximum number of seconds to wait for a new message to arrive on the input queue. The default value is -1, which specifies an indefinite wait. For the example, set it to 1 second.
Message quantity
Use this property to specify the number of messages (not rows) to retrieve from the input queue. For the example, set it to -1. A value of -1 specifies an indefinite number of messages. A value of 0 specifies no messages. You can specify integers between -1 and 999999999 for the message quantity. For the example, the message quantity is -1 and the wait time is 1 so that the MQ Connector can fetch all the messages that have arrived in that queue.
Message read mode
Use this property to specify how messages are read in the current transaction. For the example, set this to Delete (destructive read). You can also choose the Move to work queue option from the drop-down list if the messages need to be preserved.

Create a column called PayLoad of type Varchar with size 1000, as shown in Figure 3. Because this is a text message, the size refers to the number of characters. For the example, only the portion of the message that follows the message header is important.

Figure 3. Configuration of the MQ Connector
Configuration of the MQ Connector - Column name = PayLoad

Configuring the first transformer stage

Configure the SeparatePayLoad_Based_On_TableName transformer stage as shown in Figure 4.

Figure 4. Configuration of the first transformer stage
Shows PayLoad definition, explained below

Note the following about the example configuration:

  • Define a stage variable called SCHEMANAME, and set it to the fifth column of the input PayLoad.
  • Define a stage variable called TABLENAME, and set it to the sixth column of the input PayLoad.
  • In Figure 4, there are two output links from the SeparatePayLoad_Based_On_TableName transformer stage. Because each of these links goes to another transformer stage, use the SCHEMANAME and TABLENAME stage variables as a selection control to separate the messages related to the Product and Inventory tables.

Configure the stage constraints, as shown in Figure 5.

Figure 5. First transformer stage constraints
shows double quotes for constraints

Note that all non-numeric data in the messages is enclosed in double quotes. So when you do a comparison using the TABLENAME stage variable, enclose the actual table name (PRODUCT or INVENTORY) in an extra pair of double quotes.

This transformer stage separates the messages according to the source table name. The sample message is shown in Listing 2.

Listing 2. Sample message
10,"IBM","2009355","021255984000","ADMIN","PRODUCT","REPL","0000:0000:0000:0000:5ae5",
"0000:87a0:c204:0000:0000","2009-12-21-10.12.52",,0000,"100-201-01","Ice Scraper, 
Windshield 4 inch",3.99,,,,,"100-201-01","Ice Scraper1 Windshield 4inch",3.99,,,,"
<productpid=""100-201-01""><description><name>Ice Scraper,
Windshield 4 inch</name><details>Basic Ice Scraper 4 inches wide, foam
handle</details><price>3.99</price></description></product>"

Note the following about the sample message:

  • The stage variable SCHEMANAME is defined as Field(Read_MQPayLoad.PayLoad, ",", 5). The fifth column in a message is the schema name.
  • The stage variable TABLENAME is defined as Field(Read_MQPayLoad.PayLoad, ",", 6). The sixth column in a message is the table name.
  • All non-numeric data in the messages is enclosed in double quotes. So when you define the SCHEMANAME or TABLENAME constraint, enclose the names in an extra pair of double quotes.
  • The seventh column in a message is always the type of operation: ISRT, REPL, or DLET for insert, update, or delete, respectively.
  • The source data starts from column 13.
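Putting these pieces together, the stage-variable derivations and link constraints for this transformer stage look roughly like the following sketch. The input link name Read_MQPayLoad is taken from the job; verify the exact quoting of the comparison literals against your own messages:

SCHEMANAME derivation:  Field(Read_MQPayLoad.PayLoad, ",", 5)
TABLENAME derivation:   Field(Read_MQPayLoad.PayLoad, ",", 6)

Constraint on the link to the Product transformer:    TABLENAME = '"PRODUCT"'
Constraint on the link to the Inventory transformer:  TABLENAME = '"INVENTORY"'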

Configuring the second transformer stage

After you separate the messages related to the PRODUCT and the INVENTORY tables, configure the transformer stages SeparatePayLoad_Based_On_I_U_D_Operation1 and SeparatePayLoad_Based_On_I_U_D_Operation2 to parse the row operations data (insert, update and delete). Figure 6 shows how the SeparatePayLoad_Based_On_I_U_D_Operation1 transformer stage is configured to parse the columns in the PRODUCT table.

Figure 6. Configuration of the second transformer stage
Configuration of the second transformer stage

Note the following about the configuration of the second transformer stage:

  • Define a stage variable called OPERATION, and set it to the seventh column of the input PayLoad.
  • You have three output links from the SeparatePayLoad_Based_On_I_U_D_Operation1 transformer stage, as shown in Figure 2. Because each of these links goes to a Data Set stage, use the stage variable OPERATION as a selection control to separate messages related to insert, update, and delete operations.
  • Define all the columns in the Product table, and map them to the individual columns in the PayLoad. For example, for the insert operation, map the columns PID, NAME, PRICE, PROMOPRICE, PROMOSTART, and PROMOEND to the columns in the PayLoad, starting at location 19.
  • For the insert operation, the before values of all the columns in the Product table are empty, so the columns from 13 to 18 are empty.
  • For the delete operation, the after values for all columns are empty, so the columns 19 and higher are empty.
  • Update operations have both before and after values.
  • Non-numeric column values are enclosed in double quotes in the MQ message.

Figure 7 shows the column mapping for the update operation.

Figure 7. Column mapping for update operation
Column Mapping for Update Operation

Note the following about the column mapping for the update operation:

  • Figure 7 shows the column mappings for the update operation (operation keyword REPL).
  • Locations 13 to 18 in the PayLoad hold the before-image values, and locations 19 to 24 hold the after-image values.

Figure 8 shows how the operations data is separated in the Inventory table.

Figure 8. Separating the operations data in the Inventory table
Separating the operations data in the 'Inventory' table

Note that the SeparatePayLoad_Based_On_I_U_D_Operation2 transformer stage is configured in the same way to separate the operations data for the Inventory table.

Configuring the Data Set stages

Configure the insert, update, and delete Data Set stages as described in this section. The figures that follow show the Data Set stages for the Product table; repeat the same steps for the Inventory table. Figure 9 shows how to set the Data Set properties.

Figure 9. Data set properties
DataSet Properties

Note the following about setting Data Set properties:

  • Figure 9 shows the Data Set stage properties for the data from the insert operation in the Product table.
  • The update policy is set to Append. Make sure that the update policy for all the Data Sets is set to Append.
  • Set the file property to the appropriate name with the full path.
  • The file in the file property will be created if it does not exist. However, the directories specified in the path must already exist in the file system.

Figure 10 shows the column configuration of Data Set stage Inserts_DataSet1.

Figure 10. Column configuration of the insert data set
Column Configuration of the Insert Data Set

Figure 11 shows the column configuration of the update data set.

Figure 11. Column configuration of the update data set
Column Configuration of the Update Data Set

Note the following about the column configuration of the update data set:

  • Figure 11 shows the column configuration of the Update_DataSet1 Data Set stage.
  • Before and after image values are captured for update operations.

Figure 12 shows the column configuration of the delete data set.

Figure 12. Column configuration of the deletes data set
Column Configuration of the Deletes Data Set

Note the following about the delete data set:

  • Figure 12 shows the column definitions for the Deletes_DataSet1.
  • These are before image values in the Product table.

Steps for testing the real-time configuration setup

The following steps describe at a high level what is needed to test the real-time configuration setup. The steps are described in more detail in the following sections.

  1. Start Event Publisher
  2. Import the DataStage job
  3. Compile the DataStage job
  4. Run the test script to introduce a change in the source data
  5. Run the DataStage job
  6. Verify the data in the data sets to show they were populated

Starting Event Publisher

Complete the following steps to start Event Publisher.

  1. Open a DB2 command window and cd to the RealTimeInt\setupEP directory from the download.
  2. Start the Event Publisher by running the startQCapture.bat script.
  3. Make sure that you get the ASN0572I message that the program initialized successfully, as shown in Figure 13.
Figure 13. Start Q Capture in DB2 cmd window
Start Q Capture in Db2 cmd window

Importing the DataStage job

Complete the following steps to import the DataStage job.

  1. Start the DataStage Designer, and import the RealTimeDataInt.dsx job by browsing to the RealTimeInt directory from the download.
  2. Click OK in the DataStage Import window to complete the import of the job as shown in Figures 14 and 15.
Figure 14. Import DataStage job
Import DataStage Job
Figure 15. More Import DataStage job
Import DataStage Job Contd.

Compiling the DataStage job

Complete the following steps to compile the DataStage job.

  1. If you used a different queue manager or receive queue name, change these values in the MQ Connector stage by double-clicking the stage to get to the properties page.
  2. If you run the job on a Windows® platform, change the File properties of all the Data Set stages: Inserts_DataSet1, Updates_DataSet1, Deletes_DataSet1, Inserts_DataSet2, Updates_DataSet2, and Deletes_DataSet2.
  3. Compile the DataStage job by clicking Compile in the DataStage Designer menu, as shown in the Figure 16.
Figure 16. Compile the DataStage job
Compile the DataStage Job

Running the test script to introduce changes in the source data

Complete the following steps to run the test script.

  1. Open a DB2 command window, and cd to the RealTimeInt\setupEP.
  2. Run the script updateSourceTables.sql. This script does an insert, update, and delete on the Product and the Inventory tables.
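The exact statements are in the download, but conceptually the script issues one change of each type per table, along the lines of the following sketch. The column lists and values here are hypothetical; use the provided updateSourceTables.sql rather than this sketch:

connect to SALES;
insert into PRODUCT (PID, NAME, PRICE) values ('100-999-99', 'Sample product', 1.99);
update PRODUCT set PRICE = 4.49 where PID = '100-201-01';
delete from PRODUCT where PID = '100-999-99';
insert into INVENTORY (PID, QUANTITY) values ('100-999-99', 10);
update INVENTORY set QUANTITY = 25 where PID = '100-201-01';
delete from INVENTORY where PID = '100-999-99';
connect reset;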

Running the DataStage job

Complete the following steps to run the DataStage job.

  1. Start the DataStage Director by clicking the All Programs > IBM Information Server > IBM WebSphere DataStage and QualityStage Director shortcut.
  2. Run the job by clicking the Run icon on the DataStage Director menu, as shown in Figure 17.
Figure 17. Run the DataStage job
Run the DataStage Job

Verifying the data in the data sets to show they were populated

View the job status log by right-clicking on the job and clicking View Log from the DataStage Director, as shown in Figure 18.

Figure 18. View job log
View job log

You can also verify that the job ran successfully from the Designer, which shows the data flow links in green along with the number of rows processed between stages. Figure 19 shows that six messages were processed: one message for each insert, update, and delete operation on the Product and the Inventory tables.

Figure 19. Check job status
Check Job Status

You can view the data set by right-clicking on the data set and clicking View data. Figure 20 shows the data in the Updates_DataSet1 for the Product table.

Figure 20. Updates data set
Updates DataSet

Similarly you can verify the propagated change data by viewing the data in each of the data sets.


Staged data integration with SQL replication

Setting up SQL replication

In the example staged integration setup, the SQL replication configuration that contains the source data is on a Windows system (svl-akrishni2). The operational data store (ODS) on which InfoSphere DataStage runs is on the AIX system scoobydoo. The data warehouse is not shown in this setup; its location does not matter, and it can be configured on an entirely different system.

The SQL Capture program captures the changes to the source tables (the Product and Inventory tables in the SALES database in the example setup) from the log and populates the change data (CD) tables in the same source database, SALES. The SQL Apply program reads the CD tables from the source database and populates the staging tables (CCD tables) in the target database, STAGEDB. The SALES database and SQL Capture run on the Windows system (svl-akrishni2), and STAGEDB and the SQL Apply program run on the AIX system. The connection to the SALES database is cataloged on the target system (the scoobydoo AIX system). SQL Apply is configured to be event-driven: it applies the changes to the target CCD tables in response to an event, that is, a row in the IBMSNAP_SUBS_EVENT control table with an updated current timestamp.

The DataStage job running on the AIX system extracts from the CCD table in database STAGEDB and writes to three different data sets, depending whether the operation is of type insert, update, or delete. If the extract from the CCD table is successful, the job proceeds to prune the CCD table. If all the extracts and prunes are successful for all CCD tables, the event in the IBMSNAP_SUBS_EVENT table that triggers the SQL apply is updated so that SQL apply can process another replication cycle.

Note that the SQL Apply program wakes up every five minutes to check for updates in the IBMSNAP_SUBS_EVENT table. So if you schedule the staged integration DataStage job to run automatically through a scheduler (either the DataStage Director or an external scheduler), there should be at least a five-minute interval between consecutive cycles. An interval of less than five minutes between two consecutive cycles might result in data loss, because the DataStage job might prune the CCD table while the SQL Apply program is still applying changes to it.
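The event itself is just a row in the ASN.IBMSNAP_SUBS_EVENT control table whose EVENT_NAME matches the event name the subscription set was defined with (DSEXTRACTDONE in this example; see Listing 4 later in this tutorial). If the provided setup scripts do not already seed that row in your environment, it can be created with a statement along these lines, shown here only as a hedged example:

db2 connect to STAGEDB
db2 "insert into ASN.IBMSNAP_SUBS_EVENT (EVENT_NAME, EVENT_TIME, END_OF_PERIOD) values ('DSEXTRACTDONE', CURRENT TIMESTAMP, CURRENT TIMESTAMP)"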

SQL replication configuration

The high-level steps for the SQL replication configuration are as follows:

  1. Create DB2 objects

    Note: This first step is only needed for this example scenario; it is not a step in the configuration setup. Normally, the DB2 objects would already exist, and you would identify the source tables and determine the subset of data to be replicated.
  2. Set up SQL replication and create SQL replication objects
  3. Operate the SQL capture and apply
  4. Set up DataStage

The following sections describe the example in detail. All the scripts needed to configure this sample are included with this tutorial in the Download section.

Create the DB2 objects

For the sample setup, you will use a provided database called SALES and import the PRODUCT and INVENTORY tables, which are ixf files provided with the download. These are the source tables for the scenario. Alternatively, you can either use the PRODUCT and INVENTORY tables that come with the DB2 V9.7 sample database, or you can import the tables onto another database of your choice.

To import the Product and Inventory tables, change directories to setupDB in the downloaded attachment, where you will find the product.ixf file and the inventory.ixf file. Open a DB2 command window by entering db2cmd, and run the following commands:

import from product.ixf of ixf create into product
import from inventory.ixf of ixf create into inventory

You can create a target database STAGEDB on the target system, such as the example scoobydoo on an AIX system, or you can use a database of your choice as the SQL replication target database.
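For example, assuming DB2 defaults on the AIX system, creating the target database is a single command:

db2 create database STAGEDB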

Setting up SQL replication and creating the SQL replication objects

You can set up the SQL replication configuration by using the replication administration GUI tool (the Replication Center) or by using the asnclp command-line tool. The following steps describe how to set up SQL replication using the command-line tool.

  1. To create the SQL replication objects, catalog the connection to the SALES source database on the target system, which is the scoobydoo AIX system.
  2. Run all the scripts from the target system. For the example, catalog the connection to the SALES database on the target system by issuing the following commands on the target system:
    db2 catalog tcpip node srcnode remote IPADDRESS server DB2_PORT
    db2 catalog db sales as sales at node srcnode

    IPADDRESS is the IP address of the source system, such as svl-akrishni2, and DB2_PORT is the port number of the database manager instance on the source system.

  3. Enter asnclp -f crtCtlTablesCaptureServer.asnclp in a DB2 command window on the target system. This script creates the capture control tables on the SALES source database.
  4. Enter asnclp -f crtCtlTablesApplyCtlServer.asnclp in a DB2 command window on the target system. This script creates the apply control tables on the STAGEDB target database.
  5. Enter asnclp -f crtRegistration.in in a DB2 command window on the target system. This script registers the Product and Inventory source tables.
  6. Enter asnclp -f crtSubscriptionSetAndAddMembers.in in a DB2 command window on the target system. This script creates a subscription set and adds the two members (one for each source table).
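As a quick sanity check after running these scripts, you can list the registrations that were created. This is only a verification query against the standard SQL replication control table IBMSNAP_REGISTER in the default ASN schema; adjust the schema if yours differs:

db2 connect to SALES
db2 "select SOURCE_OWNER, SOURCE_TABLE, CD_OWNER, CD_TABLE from ASN.IBMSNAP_REGISTER"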

Operating the SQL capture and apply

Complete the following steps to run the SQL capture and apply.

  1. Start the SQL Capture program by entering asncap CAPTURE_SERVER=SALES in a DB2 command window on the source system. The SQL Capture program starts capturing the changes to the Product and the Inventory tables. Alternatively, you can use the startSQLCapture.bat script provided in the StagedInt\setupSQLRep directory of the download.
  2. Start the SQL Apply program by entering asnapply CONTROL_SERVER=STAGEDB APPLY_QUAL=AQ00 in a DB2 command window on the target system. The SQL Apply program starts applying the changes to the Product and the Inventory tables onto the CCD tables. Alternatively, you can use the startSQLApply.sh or startSQLApply.bat script provided in the StagedInt\setupSQLRep directory of the download.
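To confirm that SQL Apply completes its cycles, one option is to look at the apply trail control table on STAGEDB. The query below is a sketch against the standard SQL replication control tables; column names can vary slightly between versions:

db2 connect to STAGEDB
db2 "select APPLY_QUAL, SET_NAME, STATUS, LASTRUN from ASN.IBMSNAP_APPLYTRAIL order by LASTRUN desc"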

Setting up DataStage

For the example, the DataStage job for staged integration uses a sequence job, as shown in Figure 21, to sequence different jobs (parallel jobs and operating system commands) to extract and prune the CCD tables and to trigger SQL apply.

Figure 21. Staged integration: DataStage job
Staged Integration: DataStage Job

The job activity stages Extract_From_StagedTable1 and Extract_From_StagedTable2 (one for each CCD table) each call a parallel job that extracts from a CCD table and appends the data to three different data sets based on the type of operation (insert, update, or delete). If the extract job is successful, an execute command stage runs a script that prunes the CCD table. If the prune step is successful for all the CCD tables, another execute command stage runs a script that updates the event in the IBMSNAP_SUBS_EVENT control table on the STAGEDB target database; a sequencer stage ensures that this happens only when all branches succeed. If either the extract job or the prune job is not successful, a terminator stage stops all the jobs.

The StageDataIntSQLRep is the master sequence job that controls the flow of all the jobs in staged integration. The job sequence starts with two job activity stages: Extract_From_StagedTable1 and Extract_From_StagedTable2. These job activity stages call the parallel jobs extractFromProduct_CCD and extractFromInventory_CCD, respectively, to extract data from the CCD tables.

Complete the following steps to configure the job sequence.

  1. Enter two triggers of expression type Conditional OK and Otherwise, respectively, on the Job tab of the job activity stage. The trigger Extract_OK leads to the prune step, and the trigger Extract_Not_OK leads to the termination of the job. Figure 22 shows the Triggers tab of the Extract_From_StagedTable1 job activity.
Figure 22. Triggers tab of the job activity stage
Triggers Tab of Job Activity Stage
  2. In the terminator stage, select the Send STOP requests to all running jobs option, as shown in Figure 23, so that all running jobs are terminated if a failure occurs.
Figure 23. Termination activity
Termination Activity
  3. Run the script prune_ProductCCDTable.sh, as shown in Listing 3, to prune the CCD target table for the Product source table. Note that you need to set DB2PATH to the DB2 installation directory before you use the Windows version of the script, prune_ProductCCDTable.bat, in the StagedInt\scripts\Win directory of the download.
Listing 3. Execute command activity script
cd /home/db2inst1
. ./sqllib/db2profile
db2 connect to STAGEDB user admin using temp4fvt
db2 "delete from admin.PRODUCT_CCD"

As with the extract step, the prune step has two triggers. The trigger Prune_OK1 leads to the event signal step through a job sequencer stage, and Prune_Not_OK leads to the termination of the job. Figure 24 shows two triggers of expression type conditional OK and Otherwise, respectively.

Figure 24. Triggers for the Prune Step
Triggers for the Prune Step

The Prune_OK trigger from the Prune_From_StagedTable1 execute command leads to a job sequencer in All mode, as shown in Figure 25. This sequencer ensures that the event trigger is fired only if the extract and prune steps of all the CCD tables for all the members of a subscription set are successful.

Figure 25. Prune OK trigger for the prune step
Prune OK Trigger for the Prune Step

Listing 4 shows the event trigger script continueSQLApply.sh that signals SQL Apply to continue. Note that you need to set DB2PATH to the DB2 installation directory before you use the Windows version of the script, continueSQLApply.bat, in the StagedInt\scripts\Win directory of the download.

Listing 4. The continueSQLApply.sh script
cd /home/db2inst1
. ./sqllib/db2profile
db2 connect to STAGEDB user db2inst1 using temp4now
db2 "update ASN.IBMSNAP_SUBS_EVENT set EVENT_TIME=CURRENT_TIMESTAMP,
                    END_OF_PERIOD=CURRENT_TIMESTAMP WHERE EVENT_NAME='
                    DSEXTRACTDONE'"

Creating a parallel job to extract from CCD

The parallel jobs extractFromProduct_CCD and extractFromInventory_CCD extract from the staged tables using a DB2 connector. The jobs use a transformer stage to separate the rows based on the insert, update, or delete operation that was performed on the data. The jobs write the data into three different data sets: one each for insert, update and delete.

Applications that consume the output of the DataStage job are expected to read these data sets. Typically the DataStage processing separates the data based on the insert, update, or delete row operation. However, if there are referential constraints with the source data, the change data needs to be written to a single data set instead of three different data sets.

Figure 26 shows the parallel job used in the example.

Figure 26. Parallel job that extracts from Product CCD table
Parallel Job that extracts from Product CCD table

The following sections describe the steps for you to create the parallel job.

Creating table definitions

Complete the following steps to create a table definition.

  1. Create a table definition for the PRODUCT_CCD and INVENTORY_CCD tables (the CCD targets for the Product and Inventory source tables) by selecting Import > Table Definitions > Start Connector Import Wizard, as shown in Figure 27.
Figure 27. Create table definition: Start Connector Import wizard
Create Table Definition: Start Connector Import Wizard
  2. In the Connector Import Wizard, select the DB2 Connector option, as shown in Figure 28.
Figure 28. Create table definition: DB2 connector
Create Table Definition: DB2 Connector
  3. In the Connection details window, enter the instance name, the database name, the user ID, and the password, as shown in Figure 29.
Figure 29. Create table definition: Connection details
Create Table Definition: Connection Details
  4. Optionally test the connection to ensure the values you entered are correct.
  5. In the data source dialog, select the database name from the drop-down menu, such as STAGEDB (db2inst1), as shown in Figure 30.
Figure 30. Create table definition: Database name
Create Table Definition: Database Name
  6. Click the schema name under which the target CCD tables PRODUCT_CCD and INVENTORY_CCD were created, as shown in Figure 31.
Figure 31. Create table definition: Schema name
Create Table Definition: Schema Name
  7. Select the PRODUCT_CCD table in the resulting selection dialog, and click Import to import the table definition.
  8. Repeat the steps to import the table definition for the INVENTORY_CCD table.

Configure the DB2 connector

Complete the following steps to configure the DB2 connector.

  1. Open the properties window of the DB2 connector, as shown in Figure 32.
Figure 32. Properties of DB2 connector
Properties of DB2 Connector
  2. Build the SQL select statement by clicking Build and Build new SQL (DB2 UDB SQL 8.2 Syntax), and leave the default values for the rest of the properties, as shown in Figure 33.
Figure 33. DB2 connector: SQL builder
DB2 Connector: SQL Builder
  3. In the SQL Builder window, browse to Table Definition on the left pane, and browse to the PRODUCT_CCD table under the database STAGEDB.
  4. Drag and drop the PRODUCT_CCD table onto the right pane.
  5. Select all the columns (the before-image, after-image, and IBMSNAP columns), and drag and drop them onto the Select Columns pane.
  6. Click OK for the SQL builder to build the simple select statement.

Configure the transformer stage Process_I_U_D_Operations_Data

After you collect the data from all the columns in the PRODUCT_CCD table, configure the transformer stage Process_I_U_D_Operations_Data to separate the data based on the insert, update, or delete row operations. The row operation is stored as I, U, or D in the IBMSNAP_OPERATION column of the CCD table, and you can use this value to separate the row data.

Complete the following steps to configure the transformer stage Process_I_U_D_Operations_Data.

  1. Set the constraint on each of the output links from the transformer stage to check whether the operation is I, U, or D for each row of data.
  2. Define a stage variable called OPERATION, and set the derivation to the input IBMSNAP_OPERATION column by dragging the IBMSNAP_OPERATION column from the select_from_CCD input onto the Derivation column of the stage variable, as shown in Figure 34.
Figure 34. Configuration of the transformer stage
Configuration of the Transformer Stage
  3. Define the constraints for each of the output links from the transformer stage, as shown in Figure 35.
Figure 35. Transformer stage constraints definition
Transformer Stage Constraints Definition
  4. Map the columns in the transformer stage for the PRODUCT_CCD table, as shown in Figure 36.
Figure 36. Column mapping in the transformer stage for the PRODUCT_CCD table
Column Mapping in the Transformer Stage for PRODUCT CCD Table

Note the following about the column mapping:

  • Define all the columns from the PRODUCT table as the output, and map the same columns from the input Select_From_CCD link.
  • The row data for update operations has both before-image and after-image values.
  • The row data for insert operations has valid after-image values; the before-image columns are empty.
  • The row data for delete operations has valid before-image values; the after-image columns are empty.
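For reference, the constraints on the three output links described above reduce to simple equality tests on the OPERATION stage variable (the input link name Select_From_CCD comes from the job design):

OPERATION derivation:            Select_From_CCD.IBMSNAP_OPERATION

Constraint on the inserts link:  OPERATION = 'I'
Constraint on the updates link:  OPERATION = 'U'
Constraint on the deletes link:  OPERATION = 'D'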

Configure the data set stages

Complete the following steps to configure the data set stages.

  1. Open the data set stage properties window for the data from an insert operation in the Product table. Note that the update policy is Append.
Figure 37. Data set properties for insert data sets
Data Set Properties for Inserts DataSet
  2. Check that the update policy for all the data sets is Append, as shown in Figure 38.
Figure 38. Column definitions of the insert data set
Column Definitions of Data Set Stage Inserts_DataSet

Figure 39 shows the column definitions of the Update_DataSet Data Set stage. Note that the before and after-image values are captured for update operations.

Figure 39. Column definitions of the updates data set
Column Definitions of the Updates DataSet

Figure 40 shows the column definitions for the Deletes_DataSet1. Note that these are before-image values in the Product table.

Figure 40. Column definitions of the deletes data set
Column Definitions of the Deletes DataSet
  3. Follow the same steps to complete the insert, update, and delete data sets for the Inventory table.

Testing the staged data configuration setup

Complete the following steps to test the staged data configuration setup.

  1. Start SQL replication, as shown in Figure 41.
Figure 41. Start the SQL Capture DB2 Cmd Window
Start SQL Capture DB2 Cmd Window

Note the following about starting SQL replication:

  • The Capture console output shows that the two registrations are in an inactive state.
  • The registrations become active only when the SQL Apply program is started and there is a handshake between SQL Capture and SQL Apply.
  2. Start the DataStage Designer, as shown in Figure 42.
Figure 42. Start capture console
Start Capture Console
  3. Import the DataStage job StagedDataInt.dsx by browsing to the StagedDataInt directory from the download and clicking OK in the DataStage Import window, as shown in Figure 43.
Figure 43. Import Staged Integration DataStage Job
Import Staged Integration DataStage Job

After you import the dsx file, you see three jobs in the StagedInt directory.

  4. Open the extractFromProduct_CCD parallel job.
  5. Make sure that the database name (if it is different from STAGEDB), username, and password are set correctly in the DB2 connector stage.
  6. Make sure that all three data sets point to existing files with the path set correctly.
  7. Save and compile the job by clicking the green icon in the Designer.
  8. Open the extractFromInventory_CCD parallel job, and repeat the same process.
  9. Open the sequence job StagedDataIntSQLRep.
  10. Make sure that all the job activities and exec command stages point to the correct jobs and scripts, respectively, as shown in Figure 44.
  11. Save and compile the job.
Figure 44. Compile staged integration job
Compile Staged Integration Job
  12. Open a DB2 command window, cd to the StageDataInt\setupSQLRep directory, and run the script updateSourceTables.sql. This script does an insert, update, and delete on the Product and the Inventory tables.
  13. Open a DB2 command window, connect to the target database STAGEDB, and enter the following SQL statements:
    db2 select count(*) from ADMIN.PRODUCT_CCD
    db2 select count(*) from ADMIN.INVENTORY_CCD

    Both of these statements should return a count of three records, because you introduced three changes (an insert, an update, and a delete) to each of the source tables.

  14. Run the DataStage job by clicking the Run icon on the DataStage Director menu, as shown in Figure 45. You can start the DataStage Director from the All Programs > IBM Information Server > IBM WebSphere DataStage and QualityStage Director shortcut.
Figure 45. Run staged integration job
Run Staged Integration Job
  15. Check the job status. Figure 46 shows that the jobs finished successfully.
Figure 46. Check job status
Check Job Status
  16. View the job logs, as shown in Figure 47.
Figure 47. View the job log
View Job Log
  17. View the data in the data sets to confirm that they were populated, as shown in Figure 48.
Figure 48. View data set
View DataSet
  18. View the data in the Updates_DataSet for the updates in the PRODUCT_CCD table (the CCD target table for the Product source table), as shown in Figure 49.
Figure 49. View data set for updates
View DataSet Contd
  19. Verify that the END_OF_PERIOD in the IBMSNAP_SUBS_EVENT table is updated with the current timestamp, as shown in Figure 50.
Figure 50. Verify that the continue SQL apply event is updated in the IBMSNAP_SUBS_EVENT
Verify continue SQL Apply event is updated in the IBMSNAP_SUBS_EVENT
  20. The DataStage log file in Figure 51 shows that the Execute Command Activity ContinueSQLApply job ran successfully.
Figure 51. Job log
Job Log

Conclusion

Part 1 addressed the technologies of the InfoSphere Replication Server and DataStage products and the different ways of integrating the two to feed the data warehouse. This Part 2 detailed two specific integration options: using MQ messages and using staging tables. It also walked through the setup and configuration of each option with screen shots and step-by-step instructions. With the information in this series, your enterprise can integrate InfoSphere Replication Server and InfoSphere Information Server's DataStage to deliver a high-performance solution for feeding the data warehouse.


Download

Description: Scripts for sample configurations
Name: scripts-paper-mastercopy.zip
Size: 486KB

Resources

Get products and technologies

  • Build your next development project with IBM trial software, available for download directly from developerWorks.
