High-performance solution to feeding a data warehouse with real-time data, Part 2: Explore the integration options with staging tables and WebSphere MQ messages

Integrating InfoSphere Replication Server and InfoSphere DataStage to trickle feed the data warehouse

Anand Krishniyer (akrishni@us.ibm.com), Staff Software Engineer, InfoSphere Replication, IBM
Anand Krishniyer is a staff engineer with the InfoSphere Replication development organization. As a member of the admin team, his responsibilities include developing line items as well as providing technical assistance to customers with respect to installation, setup, and configuration of the Replication Server. Prior to his current role, Anand worked as a project lead for the Integration and Tools team for a process management company, Savvion (now part of Progress Software).
Tony Lee (tonylee@us.ibm.com), Senior Certified IT Specialist, IBM
Tony Lee is a certified senior IT specialist in the InfoSphere Replication Center of Competency (COC), a team within the InfoSphere Replication development organization. As a member of the COC, Tony has provided technical assistance to customers and business partners on InfoSphere replication technologies. Tony has many years of consulting experience with various IBM replication technologies, covering SQL replication, Q replication, and, most recently, InfoSphere Change Data Capture. Prior to his current role, Tony provided Information Management technical consultation to customers and partners for nearly a decade, covering a wide range of topics, from DB2 tuning, to Information Server and Master Data Management. Prior to becoming a consultant, Tony worked in many different roles in the Information Management area, ranging from management to development.
James Yau (jamesyau@us.ibm.com), Technical Solution Architect, InfoSphere Information Server, IBM
James Yau is a senior solution architect certified on the InfoSphere Information Server DataStage product. Currently, he is part of the InfoSphere Technology Enablement organization responsible for Information Server Boot Camp content development and delivery. James has many years of consulting experience with the Information Server Suite of products, including InfoSphere DataStage, QualityStage, Information Analyzer, and FastTrack. Prior to his current role, James was part of a Business Partner Technical Enablement team, in which he was the technical program manager for InfoSphere Information Server. His role included course content development and delivery with various delivery vehicles, such as instructor-led, on-line, and self-paced learning. In the past, James worked in many different roles, ranging from software developer to marketing manager, both in IBM and outside of IBM.

Summary:  Feeding a data warehouse with changes from the source database can be very expensive. If the extraction is only done with SQL, there is no way to easily identify the rows that have been changed. IBM InfoSphere™ Replication Server can detect changed data by reading only the database log. This series shows how to use InfoSphere Replication Server to efficiently extract only the changed data and how to pass the changes to IBM InfoSphere DataStage® to feed the data warehouse. Part 1 of the 2-part series provided an overview of these products and how they can work together. In this Part 2, explore two integration options: using WebSphere® MQ messages with InfoSphere Event Publisher and using staging tables.


Date:  02 Sep 2010
Level:  Intermediate

Staged data integration with SQL replication

Setting up SQL replication

In the example staged integration setup, the SQL replication configuration that contains the source data is on a Windows system (svl-akrishni2). The operational data store (ODS) on which InfoSphere DataStage runs is on the AIX system scoobydoo. The data warehouse itself is not shown in this setup; its location does not matter, and it can be configured on an entirely different system.

The SQL capture program captures the changes in the source tables (the Product and Inventory tables from database SALES in the example setup) from the log and populates the change data (CD) table on the same source database SALES. The SQL apply reads the CD table from the source database and populates the staging table (CCD table) on the target database STAGEDB. The SALES database and the SQL capture are on the Windows system (svl-akrishni2), and STAGEDB and the SQL apply program are on the AIX system. The connection to the SALES database is cataloged on the target system (the scoobydoo AIX system). The SQL apply is configured to be event-driven. The SQL apply applies the changes to the target CCD table in response to an event, such as a row in the IBMSNAP_SUBS_EVENT control table with an updated current timestamp.

The DataStage job running on the AIX system extracts from the CCD table in database STAGEDB and writes to three different data sets, depending whether the operation is of type insert, update, or delete. If the extract from the CCD table is successful, the job proceeds to prune the CCD table. If all the extracts and prunes are successful for all CCD tables, the event in the IBMSNAP_SUBS_EVENT table that triggers the SQL apply is updated so that SQL apply can process another replication cycle.

Note that the SQL apply program wakes up every five minutes to check for updates in the IBMSNAP_SUBS_EVENT table. So if you schedule the Staged Integration DataStage job to run automatically through a scheduler (either the DataStage Director or an external scheduler), there should be a five-minute interval between consecutive cycles. An interval of less than five minutes between two consecutive Staged Integration DataStage job cycles might result in data loss, because the DataStage job might prune the CCD table while the SQL apply program is applying changes to it.
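
The scheduling details for a subscription set live in the ASN.IBMSNAP_SUBS_SET control table on the apply control server: EVENT_NAME names the event the set waits for, and SLEEP_MINUTES holds any interval-based timing. As a quick way to see how your set is scheduled, you could run a query like the following against STAGEDB (a minimal sketch; verify the apply qualifier and set name against your own configuration):

    db2 connect to STAGEDB
    db2 "select APPLY_QUAL, SET_NAME, SLEEP_MINUTES, EVENT_NAME from ASN.IBMSNAP_SUBS_SET"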


SQL replication configuration

The high-level steps for the SQL replication configuration are as follows:

  1. Create DB2 objects

    Note: This first step is only needed for this example scenario; it is not a step in the configuration setup. Normally, the DB2 objects would already exist, and you would identify the source tables and determine the subset of data to be replicated.
  2. Set up SQL replication and create SQL replication objects
  3. Operate the SQL capture and apply
  4. Set up DataStage

The following sections describe the example in detail. All the scripts needed to configure this sample are included with this tutorial in the Download section.

Create the DB2 objects

For the sample setup, you will use a provided database called SALES and import the PRODUCT and INVENTORY tables from the ixf files provided with the download. These are the source tables for the scenario. Alternatively, you can either use the PRODUCT and INVENTORY tables that come with the DB2 V9.7 sample database, or you can import the tables into another database of your choice.

To import the Product and Inventory tables, change directories to setupDB in the downloaded attachment, where you will find the product.ixf file and the inventory.ixf file. Open a DB2 command window by entering db2cmd, and run the following commands:

import from product.ixf of ixf create into product
import from inventory.ixf of ixf create into inventory

You can create a target database STAGEDB on the target system, such as the example scoobydoo on an AIX system, or you can use a database of your choice as the SQL replication target database.
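
If either database does not yet exist, creating it is a single command. The following is a minimal sketch that assumes default database settings; run the first command on the source system (before the import commands above) and the second on the target system:

    db2 create database SALES
    db2 create database STAGEDB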

Setting up SQL replication and creating SQL replication objects

You can set up the SQL replication configuration by using the replication administration GUI tool (the Replication Center) or by using the asnclp command-line tool. The following steps describe how to set up SQL replication using the command-line tool.

  1. To create the SQL replication objects and run all of the scripts from the target system, catalog the connection to the SALES source database on the target system, which is the scoobydoo AIX system.
  2. For the example, catalog the connection to the SALES database by issuing the following commands on the target system:
    db2 catalog tcpip node srcnode remote IPADDRESS server DB2_PORT
    db2 catalog db sales as sales at node srcnode
    

    IPADDRESS is the IP address of the source system, such as svl-akrishni2, and DB2_PORT is the port number of the database manager instance on the source system.

  3. Enter asnclp -f crtCtlTablesCaptureServer.asnclp in a DB2 command window on the target system. The crtCtlTablesCaptureServer.asnclp script creates the capture control tables on the SALES source database.
  4. Enter asnclp -f crtCtlTablesApplyCtlServer.asnclp in a DB2 command window on the target system. The crtCtlTablesApplyCtlServer.asnclp script creates the apply control tables on the STAGEDB target database.
  5. Enter asnclp -f crtRegistration.asnclp in a DB2 command window on the target system. The crtRegistration.asnclp script registers the Product and Inventory source tables (a sketch of the general shape of such a registration script follows this list).
  6. Enter asnclp -f crtSubscriptionSetAndAddMembers.asnclp in a DB2 command window on the target system. The crtSubscriptionSetAndAddMembers.asnclp script creates a subscription set and adds the two members (one for each source table).
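
The asnclp scripts themselves are included in the download, so you do not need to write them by hand. For orientation only, a registration script generally has the shape sketched below; the ADMIN schema and the CD table names (ADMIN.CDPRODUCT and ADMIN.CDINVENTORY) are illustrative assumptions and may differ from what the provided scripts actually use:

    ASNCLP SESSION SET TO SQL REPLICATION;
    SET SERVER CAPTURE TO DB SALES;
    SET OUTPUT CAPTURE SCRIPT "crtRegistration.sql";
    SET RUN SCRIPT NOW STOP ON SQL ERROR ON;
    CREATE REGISTRATION (ADMIN.PRODUCT) DIFFERENTIAL REFRESH STAGE ADMIN.CDPRODUCT;
    CREATE REGISTRATION (ADMIN.INVENTORY) DIFFERENTIAL REFRESH STAGE ADMIN.CDINVENTORY;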

Operating the SQL capture and apply

Complete the following steps to run the SQL capture and apply.

  1. Start the SQL capture program by entering asncap CAPTURE_SERVER=SALES in a DB2 command window on the source system. The SQL capture program starts capturing the changes in the Product and the Inventory tables. Alternatively, you can use the startSQLCapture.bat script provided in the StagedInt\setupSQLRep directory of the download.
  2. Start the SQL apply program by entering asnapply CONTROL_SERVER=STAGEDB APPLY_QUAL=AQ00 in a DB2 command window on the target system. The SQL apply program starts applying the changes in the Product and the Inventory tables onto the CCD tables. Alternatively, you can use the startSQLApply.sh or startSQLApply.bat script provided in the StagedInt\setupSQLRep directory of the download.
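
To confirm that both programs are running before you move on, you can query their status: asnccmd runs against the capture server on the source system, and asnacmd runs against the apply control server on the target system. The following is a sketch using the server and qualifier names from this example:

    asnccmd CAPTURE_SERVER=SALES STATUS
    asnacmd CONTROL_SERVER=STAGEDB APPLY_QUAL=AQ00 STATUS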

Setting up DataStage

For the example, the DataStage job for staged integration uses a sequence job, as shown in Figure 21, to sequence different jobs (parallel jobs and operating system commands) to extract and prune the CCD tables and to trigger SQL apply.


Figure 21. Staged integration: DataStage job

The job activity stages Extract_From_StagedTable1 and Extract_From_StagedTable2 (one for each CCD table) each call a parallel job that extracts from a CCD table and appends the data to three different data sets, depending on whether the row operation is an insert, update, or delete. If the extract job is successful, an execute command stage is used to execute a script that prunes the CCD table. If the prune step is successful for all the CCD tables, an execute command stage executes a script that updates the event in the IBMSNAP_SUBS_EVENT control table on the STAGEDB target database; a sequencer stage ensures that this happens only after all branches succeed. If either the extract job or the prune job is not successful, a terminator stage stops all the jobs.

StagedDataIntSQLRep is the master sequence job that controls the flow of all the jobs in staged integration. The job sequence starts with two job activity stages: Extract_From_StagedTable1 and Extract_From_StagedTable2. These job activity stages call the parallel jobs extractFromProduct_CCD and extractFromInventory_CCD, respectively, to extract data from the CCD tables.

Complete the following steps to configure the stages of the sequence job.

  1. On the Triggers tab of the job activity stage, enter two triggers of expression type Conditional OK and Otherwise, respectively. The trigger Extract_OK leads to the prune step, and the trigger Extract_Not_OK leads to the termination of the job. Figure 22 shows the Triggers tab of the Extract_From_StagedTable1 job activity.

Figure 22. Triggers tab of the job activity stage
  2. In the terminator activity stage, select the Send STOP requests to all running jobs option, as shown in Figure 23, so that all running jobs are stopped when a step fails.

Figure 23. Termination activity
  3. Run the script prune_ProductCCDTable.sh, as shown in Listing 3, to prune the CCD target table for the Product source table. Note that you need to set DB2PATH to the DB2 installation directory before you use the Windows version of the script, prune_ProductCCDTable.bat, found in the StagedInt\scripts\Win directory of the download.

Listing 3. Execute command activity script
cd /home/db2inst1
# set up the DB2 command-line environment for this shell
. ./sqllib/db2profile
# connect to the target database and remove the already-processed rows from the CCD table
db2 connect to STAGEDB user admin using temp4fvt
db2 "delete from admin.PRODUCT_CCD"

As with the extract step, the prune step has two triggers. The trigger Prune_OK1 leads to the event signal step through a job sequencer stage, and Prune_Not_OK leads to the termination of the job. Figure 24 shows two triggers of expression type conditional OK and Otherwise, respectively.


Figure 24. Triggers for the Prune Step

The Prune_OK trigger from the Prune_From_StagedTable1 execute command leads to a job sequencer in All mode, as shown in Figure 25. This sequencer ensures that the event trigger is fired only if the extract and prune steps of all the CCD tables for all the members of a subscription set are successful.


Figure 25. Prune OK trigger for the prune step

Listing 4 shows the event trigger script continueSQLApply.sh that signals SQL apply to continue. Note that you need to set DB2PATH to the DB2 installation directory before you use the Windows version of the script, continueSQLApply.bat, found in the StagedInt\scripts\Win directory of the download.


Listing 4. The continueSQLApply.sh script
cd /home/db2inst1
# set up the DB2 command-line environment for this shell
. ./sqllib/db2profile
db2 connect to STAGEDB user db2inst1 using temp4now
# post the DSEXTRACTDONE event so that SQL apply starts the next replication cycle
db2 "update ASN.IBMSNAP_SUBS_EVENT set EVENT_TIME=CURRENT_TIMESTAMP,
     END_OF_PERIOD=CURRENT_TIMESTAMP WHERE EVENT_NAME='DSEXTRACTDONE'"
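
Listing 4 updates an existing row for the DSEXTRACTDONE event. If that row has not been created yet (the setup scripts in the download may already create it), posting the event for the first time is an insert rather than an update; the following is a sketch:

    db2 connect to STAGEDB
    db2 "insert into ASN.IBMSNAP_SUBS_EVENT (EVENT_NAME, EVENT_TIME, END_OF_PERIOD)
         values ('DSEXTRACTDONE', CURRENT TIMESTAMP, CURRENT TIMESTAMP)"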


Creating a parallel job to extract from CCD

The parallel jobs extractFromProduct_CCD and extractFromInventory_CCD extract from the staged tables using a DB2 connector. The jobs use a transformer stage to separate the rows based on the insert, update, or delete operation that was performed on the data. The jobs write the data into three different data sets: one each for insert, update and delete.

Downstream applications are expected to consume these data sets. Typically, the DataStage processing separates the data based on the insert, update, or delete row operation. However, if there are referential constraints on the source data, the change data needs to be written to a single data set instead of three different data sets, so that related changes can be applied in their original order.
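
In that single-data-set case, the IBMSNAP control columns in the CCD table preserve the original commit order, so the extract can be sorted to keep related changes in sequence. The following is a sketch of such an ordered extract against the example CCD table, assuming the ADMIN schema used elsewhere in this article:

    db2 "select * from ADMIN.PRODUCT_CCD order by IBMSNAP_COMMITSEQ, IBMSNAP_INTENTSEQ"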

Figure 26 shows the parallel job used in the example.


Figure 26. Parallel job that extracts from Product CCD table

The following sections describe the steps for you to create the parallel job.

Creating table definitions

Complete the following steps to create a table definition.

  1. Create a table definition for the PRODUCT_CCD and INVENTORY_CCD tables (the CCD targets for the Product and Inventory source tables) by selecting Import > Table Definitions > Start Connector Import Wizard, as shown in Figure 27.

Figure 27. Create table definition: Start Connector Import wizard
  2. In the Connector Import Wizard, select the DB2 Connector option, as shown in Figure 28.

Figure 28. Create table definition: DB2 connector
  3. In the Connection details window, enter the instance name, the database name, the user ID, and the password, as shown in Figure 29.

Figure 29. Create table definition: Connection details
  4. Optionally test the connection to ensure the values you entered are correct.
  5. In the datasource dialog, select the database name from the drop-down menu, such as STAGEDB (db2inst1), as shown in Figure 30.

Figure 30. Create table definition: Database name
  6. Click the schema name under which the target CCD tables PRODUCT_CCD and INVENTORY_CCD were created, as shown in Figure 31.

Figure 31. Create table definition: Schema name
  7. Select the PRODUCT_CCD table in the resulting selection dialog, and click Import to import the table definition.
  8. Repeat the steps for importing the table definition for the INVENTORY_CCD table.

Configure the DB2 connector

Complete the following steps to configure the DB2 connector.

  1. Open the properties window of the DB2 connector, as shown in Figure 32.

Figure 32. Properties of DB2 connector
  2. Build the SQL select statement by clicking Build and Build new SQL (DB2 UDB SQL 8.2 Syntax), and leave the default values for the rest of the properties, as shown in Figure 33.

Figure 33. DB2 connector: SQL builder
  3. In the SQL Builder window, browse to Table Definition on the left pane, and browse to the PRODUCT_CCD table under the database STAGEDB.
  4. Drag and drop the PRODUCT_CCD table onto the right pane.
  5. Select all the columns (the before-image, after-image, and IBMSNAP columns), and drag and drop them onto the Select Columns pane.
  6. Click OK for the SQL builder to build the simple select statement.

Configure the transformer stage Process_I_U_D_Operations_Data

After you collect the data from all the columns in the PRODUCT_CCD table, configure the transformer stage Process_I_U_D_Operations_Data to separate the data based on the insert, update, or delete row operation. The row operation is stored as I, U, or D in the IBMSNAP_OPERATION column in the CCD table, so you can use that value to separate the row data.
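
The same separation can be expressed as simple filters on IBMSNAP_OPERATION, which is a handy way to sanity-check the CCD contents outside DataStage. The following is a sketch against the example CCD table, assuming the ADMIN schema:

    db2 connect to STAGEDB
    db2 "select * from ADMIN.PRODUCT_CCD where IBMSNAP_OPERATION = 'I'"
    db2 "select * from ADMIN.PRODUCT_CCD where IBMSNAP_OPERATION = 'U'"
    db2 "select * from ADMIN.PRODUCT_CCD where IBMSNAP_OPERATION = 'D'"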

Complete the following steps to configure the transformer stage Process_I_U_D_Operations_Data.

  1. Set the constraint on each of the output links from the transformer stage to check whether the operation is I, U, or D for each row data.
  2. Define a stage variable called OPERATION, and set the derivation to the input IBMSNAP_OPERATION column by dragging the IBMSNAP_OPERATION column from the select_from_CCD input onto the Derivation column of the stage variable, as shown in Figure 34.

Figure 34. Configuration of the transformer stage
  3. Define the constraints for each of the output links from the transformer stage, as shown in Figure 35.

Figure 35. Transformer stage constraints definition
  4. Map the columns in the transformer stage for the PRODUCT_CCD table, as shown in Figure 36.

Figure 36. Column mapping in the transformer stage for the PRODUCT_CCD table

Note the following about the column mapping:

  • Define all the columns from the PRODUCT table as the output, and map the same columns from the input Select_From_CCD link.
  • The row data for update operations has both before-image and after-image values.
  • The row data for insert operations has valid after-image values; the before-image columns are empty.
  • The row data for delete operations has valid before-image values.

Configure the data set stages

Complete the following steps to configure the data set stages.

  1. Open the data set stage properties window for the data set that receives insert operations on the Product table, as shown in Figure 37. Note that the update policy is Append.

Figure 37. Data set properties for insert data sets
  2. Check that the update policy for all the data sets is Append, as shown in Figure 38.

Figure 38. Column definitions of the insert data set

Figure 39 shows the column definitions of the Update_DataSet Data Set stage. Note that both before-image and after-image values are captured for update operations.


Figure 39. Column definitions of the updates data set

Figure 40 shows the column definitions for the Deletes_DataSet1. Note that these are before-image values in the Product table.


Figure 40. Column definitions of the deletes data set
  3. Follow the same steps to complete the insert, update, and delete data sets for the Inventory table.

Testing the staged data configuration setup

Complete the following steps to test the staged data configuration setup.

  1. Start SQL replication, as shown in Figure 41.

Figure 41. Start the SQL Capture DB2 Cmd Window

Note the following about starting SQL replication:

  • The start capture console shows that two registrations are in an inactive state.
  • The registrations become active only when the SQL apply program is started and there is a handshake between SQL capture and SQL apply.
  2. Start the DataStage Designer, as shown in Figure 42.

Figure 42. Start capture console
  3. Import the DataStage job StagedDataInt.dsx by browsing to the StagedDataInt directory from the download and clicking OK in the DataStage Import window, as shown in Figure 43.

Figure 43. Import Staged Integration DataStage Job

After you import the dsx file, you see three jobs in the StagedInt directory.

  4. Open the extractFromProduct_CCD parallel job.
  5. Make sure that the database name (if it is different from STAGEDB), username, and password are set correctly in the DB2 connector stage.
  6. Make sure that all three data sets point to existing files with the path set correctly.
  7. Save and compile the job by clicking the green icon in the designer.
  8. Open the extractFromInventory_CCD parallel job, and repeat the same process.
  9. Open the sequence job StagedDataIntSQLRep.
  10. Make sure that all the job activities and exec command stages point to the correct jobs and scripts, respectively, as shown in Figure 44.
  11. Save and compile the job.

Figure 44. Compile staged integration job
  12. Open a DB2 command window, cd to the StagedInt\setupSQLRep directory, and run the script updateSourceTables.sql. This script does an insert, an update, and a delete on both the Product and the Inventory tables.
  13. Open a DB2 command window, connect to the target database STAGEDB, and enter the following SQL statements:
    db2 select count(*) from ADMIN.PRODUCT_CCD
    db2 select count(*) from ADMIN.INVENTORY_CCD
    

    Both of these statements should return a count of three records, because you introduced three changes to the source data (an insert, an update, and a delete) in both of the source tables.

  14. Run the DataStage job by clicking the Run icon on the DataStage Director menu, as shown in Figure 45. You can start the DataStage Director from the All Programs > IBM Information Server > IBM WebSphere DataStage and QualityStage Director shortcut.

Figure 45. Run staged integration job
  15. Check the job status. Figure 46 shows that the jobs finished successfully.

Figure 46. Check job status
  16. View the job logs, as shown in Figure 47.

Figure 47. View the job log
  17. View the data in the data sets to confirm that they were populated, as shown in Figure 48.

Figure 48. View data set
  18. View the data in the Updates_DataSet for the updates in the PRODUCT_CCD table (CCD target table for source table Product), as shown in Figure 49.

Figure 49. View data set for updates
  19. Verify that the END_OF_PERIOD column in the IBMSNAP_SUBS_EVENT table is updated with the current timestamp, as shown in Figure 50.

Figure 50. Verify that the continue SQL apply event is updated in the IBMSNAP_SUBS_EVENT
  20. The DataStage job log in Figure 51 shows that the Execute Command activity ContinueSQLApply ran successfully.

Figure 51. Job log
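
After a successful run, you can also verify the end state directly in DB2: the CCD tables should be empty again because the job pruned them, and the DSEXTRACTDONE event row should carry the new timestamps. The following checks are a sketch, assuming the ADMIN schema used in this example:

    db2 connect to STAGEDB
    db2 "select count(*) from ADMIN.PRODUCT_CCD"
    db2 "select count(*) from ADMIN.INVENTORY_CCD"
    db2 "select EVENT_TIME, END_OF_PERIOD from ASN.IBMSNAP_SUBS_EVENT where EVENT_NAME='DSEXTRACTDONE'"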

Conclusion

Part 1 addressed the technologies behind the InfoSphere Replication Server and DataStage products and the different ways of integrating the two to feed the data warehouse. This Part 2 detailed two specific integration options: using MQ messages and using staging tables. It also walked through the setup and configuration of both options, with screenshots and step-by-step instructions. With the information in this series, your enterprise can easily integrate InfoSphere Replication Server and InfoSphere Information Server's DataStage to deliver a high-performance solution for feeding the data warehouse.
