Staged data integration with SQL replication
In the example staged integration setup, the SQL replication configuration that contains the source data is on a Windows system (svl-akrishni2). The operational data store (ODS) on which InfoSphere DataStage runs is on the AIX system scoobydoo. The data warehouse is not shown in this setup; its location does not matter, and it can be configured on an entirely different system.
The SQL capture program captures the changes in the source tables (the Product and Inventory tables from database SALES in the example setup) from the log and populates the change data (CD) table on the same source database SALES. The SQL apply reads the CD table from the source database and populates the staging table (CCD table) on the target database STAGEDB. The SALES database and the SQL capture are on the Windows system (svl-akrishni2), and STAGEDB and the SQL apply program are on the AIX system. The connection to the SALES database is cataloged on the target system (the scoobydoo AIX system). The SQL apply is configured to be event-driven. The SQL apply applies the changes to the target CCD table in response to an event, such as a row in the IBMSNAP_SUBS_EVENT control table with an updated current timestamp.
The DataStage job running on the AIX system extracts from the CCD table in database STAGEDB and writes to three different data sets, depending on whether the operation is an insert, an update, or a delete. If the extract from the CCD table is successful, the job proceeds to prune the CCD table. If all the extracts and prunes are successful for all CCD tables, the event in the IBMSNAP_SUBS_EVENT table that triggers the SQL apply is updated so that SQL apply can process another replication cycle.
Note that the SQL apply program wakes up every five minutes to check for updates in the IBMSNAP_SUBS_EVENT table. So if you schedule the Staged Integration DataStage job to run automatically through a scheduler (either the DataStage Director or an external scheduler), there should be an interval of at least five minutes between consecutive cycles. An interval of less than five minutes between two consecutive Staged Integration DataStage job cycles might result in data loss, because the DataStage job might prune the CCD table while the SQL apply program is still applying changes to it.
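The scheduling rule above can be expressed as a simple check. This sketch is only an illustration (the function name is invented); the five-minute value comes from the SQL apply wake-up interval just described:

```python
# Illustrative check of the scheduling rule: consecutive Staged
# Integration DataStage job cycles must be at least as far apart as the
# SQL apply wake-up interval, or the prune step might delete CCD rows
# that SQL apply is still applying.
APPLY_WAKEUP_SECONDS = 5 * 60  # SQL apply checks IBMSNAP_SUBS_EVENT every 5 minutes

def safe_cycle_interval(seconds_between_cycles: int) -> bool:
    """Return True if the scheduler interval avoids the prune/apply race."""
    return seconds_between_cycles >= APPLY_WAKEUP_SECONDS

print(safe_cycle_interval(600))  # 10-minute interval: True
print(safe_cycle_interval(120))  # 2-minute interval: False, risks data loss
```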
The high-level steps for the SQL replication configuration are as follows:
- Create DB2 objects
Note: This first step is only needed for this example scenario; it is not a step in the configuration setup. Normally, the DB2 objects would already exist, and you would identify the source tables and determine the subset of data to be replicated.
- Set up SQL replication and create SQL replication objects
- Operate the SQL capture and apply
- Set up DataStage
The following sections describe the example in detail. All the scripts needed to configure this sample are included with this tutorial in the Download section.
For the sample setup, you will use a provided database called SALES and import the PRODUCT and INVENTORY tables, which are ixf files provided with the download. These are the source tables for the scenario. Alternatively, you can either use the PRODUCT and INVENTORY tables that come with the DB2 V9.7 sample database, or you can import the tables onto another database of your choice.
To import the Product and Inventory tables, change directories to setupDB in the downloaded attachment, where you will find the product.ixf file and the inventory.ixf file. Open a DB2 command window by entering db2cmd, and run the following commands:
import from product.ixf of ixf create into product
import from inventory.ixf of ixf create into inventory
You can create a target database STAGEDB on the target system, such as the example scoobydoo on an AIX system, or you can use a database of your choice as the SQL replication target database.
You can set up the SQL replication configuration by using the replication administration GUI tool (the Replication Center) or by using the asnclp command-line tool. The following steps describe how to set up SQL replication by using the command-line tool.
- To create the SQL replication objects, catalog the connection to the SALES source database on the target system, which is the scoobydoo AIX system.
- Run all the scripts from the target system. For the example, catalog the connection to the SALES database on the target system by issuing the following commands:
db2 catalog tcpip node srcnode remote IPADDRESS server DB2_PORT
db2 catalog db sales as sales at node srcnode
IPADDRESS is the IP address of the source system, such as svl-akrishni2, and DB2_PORT is the port number of the database manager instance on the source system.
- Enter asnclp -f crtCtlTablesCaptureServer.asnclp in a DB2 command window on the target system. The crtCtlTablesCaptureServer.asnclp script creates the capture control tables on the SALES source database.
- Enter asnclp -f crtCtlTablesApplyCtlServer.asnclp in a DB2 command window on the target system. The crtCtlTablesApplyCtlServer.asnclp script creates the apply control tables on the STAGEDB target database.
- Enter asnclp -f crtRegistration.asnclp in a DB2 command window on the target system. The crtRegistration.asnclp script registers the Product and Inventory source tables.
- Enter asnclp -f crtSubscriptionSetAndAddMembers.asnclp in a DB2 command window on the target system. The crtSubscriptionSetAndAddMembers.asnclp script creates a subscription set and adds the two members (one for each source table).
Complete the following steps to run the SQL capture and apply.
- Start the SQL capture program by entering asncap CAPTURE_SERVER=SALES in a DB2 command window on the source system. The SQL capture program starts capturing the changes in the Product and the Inventory tables. Alternatively, you can use the startSQLCapture.bat script provided in the StagedInt\setupSQLRep directory of the download.
- Start the SQL apply program by entering asnapply CONTROL_SERVER=STAGEDB APPLY_QUAL=AQ00 in a DB2 command window on the target system. The SQL apply program starts applying the changes in the Product and the Inventory tables to the CCD tables. Alternatively, you can use the startSQLApply.sh or startSQLApply.bat script provided in the StagedInt\setupSQLRep directory of the download.
For the example, the DataStage job for staged integration uses a sequence job, as shown in Figure 21, to sequence different jobs (parallel jobs and operating system commands) to extract and prune the CCD tables and to trigger SQL apply.
Figure 21. Staged integration: DataStage job
The job activity stages Extract_From_StagedTable1 and Extract_From_StagedTable2 (one for each CCD table) each call a parallel job that extracts from a CCD table and appends the data to three different data sets based on whether the operation is an insert, an update, or a delete. If the extract job is successful, an execute command stage runs a script that prunes the CCD table. If the prune step is successful for all the CCD tables, an execute command stage runs a script that updates the event in the IBMSNAP_SUBS_EVENT control table on the STAGEDB target database; a sequencer stage ensures that this happens only when all branches succeed. If either the extract job or the prune job fails, a terminator stage stops all the jobs.
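The control flow just described can be sketched outside DataStage as ordinary code. This is only an illustration of the sequencing logic (the function names are invented for the sketch); in the job itself the work is done by the job activity, execute command, sequencer, and terminator stages:

```python
# Sketch of the staged-integration sequence: extract each CCD table,
# prune it on success, and fire the SQL apply event only if every
# extract and prune succeeded (the role of the All-mode sequencer).
def run_staged_sequence(ccd_tables, extract, prune, signal_apply, terminate):
    for table in ccd_tables:
        if not extract(table):   # job activity stage (parallel extract job)
            terminate()          # terminator stage stops all jobs
            return False
        if not prune(table):     # execute command stage (prune script)
            terminate()
            return False
    signal_apply()               # update the IBMSNAP_SUBS_EVENT row
    return True
```

For example, calling `run_staged_sequence(["PRODUCT_CCD", "INVENTORY_CCD"], ...)` signals apply only when both extracts and both prunes succeed.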
The StageDataIntSQLRep is the master sequence job that controls the flow of all the jobs in staged integration. The job sequence starts with two job activity stages: Extract_From_StagedTable1 and Extract_From_StagedTable2. These job activity stages call the parallel jobs extractFromProduct_CCD and extractFromInventory_CCD, respectively, to extract data from the CCD tables.
Complete the following steps to configure the job sequence.
- Define two triggers, of expression type Conditional OK and Otherwise respectively, on the Triggers tab of the job activity stage. The Extract_OK trigger leads to the prune step, and the Extract_Not_OK trigger leads to the termination of the job. Figure 22 shows the Triggers tab of the Extract_From_StagedTable1 job activity.
Figure 22. Triggers tab of the job activity stage
- Select the Send STOP requests to all running jobs option for the terminator stage, as shown in Figure 23, so that all running jobs are terminated when a failure occurs.
Figure 23. Termination activity
- Run the script prune_ProductCCDTable.sh, as shown in Listing 3, to prune the CCD target table for the Product source table. Note that you need to set DB2PATH to the DB2 installation directory before you use the Windows version of the script, prune_ProductCCDTable.bat, in the StagedInt\scripts\Win directory of the download.
Listing 3. Execute command activity script
cd /home/db2inst1
. ./sqllib/db2profile
db2 connect to STAGEDB user admin using temp4fvt
db2 "delete from admin.PRODUCT_CCD"
As with the extract step, the prune step has two triggers. The trigger Prune_OK1 leads to the event signal step through a job sequencer stage, and Prune_Not_OK leads to the termination of the job. Figure 24 shows two triggers of expression type conditional OK and Otherwise, respectively.
Figure 24. Triggers for the Prune Step
The Prune_OK trigger from the Prune_From_StagedTable1 execute command stage leads to a job sequencer in All mode, as shown in Figure 25. This sequencer ensures that the event trigger is fired only if the extract and prune steps of all the CCD tables for all the members of a subscription set are successful.
Figure 25. Prune OK trigger for the prune step
Listing 4 shows the event trigger script continueSQLApply.sh that triggers the SQL apply. Note that you need to set DB2PATH to the DB2 installation directory before you use the Windows version of the script, continueSQLApply.bat, in the StagedInt\scripts\Win directory of the download.
Listing 4. The continueSQLApply.sh script
cd /home/db2inst1
. ./sqllib/db2profile
db2 connect to STAGEDB user db2inst1 using temp4now
db2 "update ASN.IBMSNAP_SUBS_EVENT set EVENT_TIME=CURRENT_TIMESTAMP, END_OF_PERIOD=CURRENT_TIMESTAMP WHERE EVENT_NAME='DSEXTRACTDONE'"
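For an event-driven subscription set, SQL apply processes the set when it finds a row in IBMSNAP_SUBS_EVENT whose EVENT_TIME has been reached; END_OF_PERIOD bounds the committed changes that are applied. A rough sketch of that decision follows (illustrative only, not the actual apply implementation):

```python
from datetime import datetime, timedelta

def event_is_due(event_row: dict, now: datetime) -> bool:
    """Illustrative check: an event-driven subscription set is processed
    once the posted EVENT_TIME has passed; changes committed up to
    END_OF_PERIOD are then eligible to be applied."""
    return event_row["EVENT_TIME"] <= now

now = datetime(2010, 1, 1, 12, 0, 0)
posted = {"EVENT_NAME": "DSEXTRACTDONE",
          "EVENT_TIME": now - timedelta(minutes=1),
          "END_OF_PERIOD": now - timedelta(minutes=1)}
print(event_is_due(posted, now))  # True: apply runs another cycle
```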
The parallel jobs extractFromProduct_CCD and extractFromInventory_CCD extract from the staged tables using a DB2 connector. The jobs use a transformer stage to separate the rows based on the insert, update, or delete operation that was performed on the data. The jobs write the data into three different data sets: one each for insert, update and delete.
Applications that consume the data from the DataStage job are expected to consume these data sets. Typically the DataStage processing separates the data based on the insert, update, or delete row operation. However, if there are referential constraints with the source data, the change data needs to be written to a single data set instead of three different data sets.
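The single-data-set alternative can be sketched as follows. The column names follow the CCD convention (IBMSNAP_OPERATION for the row operation, IBMSNAP_COMMITSEQ for commit ordering), but the code is only an illustration of why ordering matters under referential constraints:

```python
# Illustrative: with referential constraints, emit one data set that
# preserves source commit order instead of three per-operation sets,
# so that, for example, a parent insert is consumed before the child
# insert that references it.
def to_single_dataset(rows):
    return sorted(rows, key=lambda r: r["IBMSNAP_COMMITSEQ"])

rows = [
    {"IBMSNAP_COMMITSEQ": 2, "IBMSNAP_OPERATION": "I", "TABLE": "INVENTORY"},
    {"IBMSNAP_COMMITSEQ": 1, "IBMSNAP_OPERATION": "I", "TABLE": "PRODUCT"},
]
print([r["TABLE"] for r in to_single_dataset(rows)])  # ['PRODUCT', 'INVENTORY']
```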
Figure 26 shows the parallel job used in the example.
Figure 26. Parallel job that extracts from Product CCD table
The following sections describe the steps for you to create the parallel job.
Complete the following steps to create a table definition.
- Create a table definition for the PRODUCT_CCD and INVENTORY_CCD tables (the CCD targets for the Product and Inventory source tables) by selecting Import > Table Definitions > Start Connector Import Wizard, as shown in Figure 27.
Figure 27. Create table definition: Start Connector Import wizard
- In the Connector Import Wizard, select the DB2 Connector option, as shown in Figure 28.
Figure 28. Create table definition: DB2 connector
- In the Connection details window, enter the instance name, the database name, the user ID, and the password, as shown in Figure 29.
Figure 29. Create table definition: Connection details
- Optionally test the connection to ensure the values you entered are correct.
- In the data source dialog, select the database name, such as STAGEDB (db2inst1), from the drop-down menu, as shown in Figure 30.
Figure 30. Create table definition: Database name
- Click the schema name under which the target CCD tables PRODUCT_CCD and INVENTORY_CCD were created, as shown in Figure 31.
Figure 31. Create table definition: Schema name
- Select the PRODUCT_CCD table in the resulting selection dialog, and click Import to import the table definition.
- Repeat the steps for importing the table definition for the INVENTORY_CCD table.
Complete the following steps to configure the DB2 connector.
- Open the properties window of the DB2 connector, as shown in Figure 32.
Figure 32. Properties of DB2 connector
- Build the SQL select statement by clicking Build and Build new SQL (DB2 UDB SQL 8.2 Syntax), and leave the default values for the rest of the properties, as shown in Figure 33.
Figure 33. DB2 connector: SQL builder
- In the SQL Builder window, browse to Table Definition on the left pane, and browse to the PRODUCT_CCD table under the database STAGEDB.
- Drag and drop the PRODUCT_CCD table onto the right pane.
- Select all the columns (before the image, after the image, and the IBMSNAP columns), and drag and drop onto the Select Columns pane.
- Click OK for the SQL builder to build the simple select statement.
After you collect the data from all the columns in the PRODUCT_CCD table, configure the transformer stage Process_I_U_D_Operations to separate the data based on the insert, update, or delete row operations. The row operation is recorded in the IBMSNAP_OPERATION column of the CCD table, and the transformer stage uses this value to separate the row data.
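The separation the transformer performs can be sketched as plain code. This is an illustration of the logic only; in the job it is implemented with the OPERATION stage variable and one constraint per output link:

```python
# Illustrative version of the Process_I_U_D_Operations transformer:
# route each CCD row to the insert, update, or delete output based on
# the IBMSNAP_OPERATION column ('I', 'U', or 'D').
def split_by_operation(rows):
    datasets = {"I": [], "U": [], "D": []}
    for row in rows:
        datasets[row["IBMSNAP_OPERATION"]].append(row)
    return datasets

sample = [{"IBMSNAP_OPERATION": "I"}, {"IBMSNAP_OPERATION": "U"},
          {"IBMSNAP_OPERATION": "D"}]
out = split_by_operation(sample)
print(len(out["I"]), len(out["U"]), len(out["D"]))  # 1 1 1
```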
Complete the following steps to configure the transformer stage Process_I_U_D_Operations_Data.
- Set the constraint on each of the output links from the transformer stage to check whether the operation is I, U, or D for each row of data.
- Define a stage variable called OPERATION, and set the derivation to the input IBMSNAP_OPERATION column by dragging the IBMSNAP_OPERATION column from the select_from_CCD input onto the Derivation column of the stage variable, as shown in Figure 34.
Figure 34. Configuration of the transformer stage
- Define the constraints for each of the output links from the transformer stage, as shown in Figure 35.
Figure 35. Transformer stage constraints definition
- Map the columns in the transformer stage for the PRODUCT_CCD table, as shown in Figure 36.
Figure 36. Column mapping in the transformer stage for the PRODUCT_CCD table
Note the following about the column mapping:
- Define all the columns from the PRODUCT table as the output, and map the same columns from the input Select_From_CCD link.
- The row data for update operations have both before and after-image values.
- The row data for insert operations have valid after-image values; the before-image columns are empty.
- Delete operations have valid before-image values.
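The mapping notes above can be summarized in code. The X prefix for before-image columns is the SQL replication default, and PRICE is a hypothetical column used only for illustration:

```python
# Illustrative CCD rows showing which image values each operation
# carries (XPRICE is a hypothetical before-image column, PRICE the
# after-image column). Update rows carry both images, inserts only
# the after image, deletes only the before image.
rows = {
    "update": {"IBMSNAP_OPERATION": "U", "XPRICE": 9.99, "PRICE": 12.99},
    "insert": {"IBMSNAP_OPERATION": "I", "XPRICE": None, "PRICE": 12.99},
    "delete": {"IBMSNAP_OPERATION": "D", "XPRICE": 9.99, "PRICE": None},
}

def images_present(row):
    """Return (has_before_image, has_after_image) for a change row."""
    return (row["XPRICE"] is not None, row["PRICE"] is not None)

print(images_present(rows["update"]))  # (True, True)
```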
Complete the following steps to configure the data set stages.
- Open the data set stage properties window for the data from an insert operation in the Product table, as shown in Figure 37. Note that the update policy is Append.
Figure 37. Data set properties for insert data sets
- Check that the update policy for all the data sets is Append, as shown in Figure 38.
Figure 38. Column definitions of the insert data set
Figure 39 shows the column definitions of the Update_DataSet Data Set stage. Note that the before and after-image values are captured for update operations.
Figure 39. Column definitions of the updates data set
Figure 40 shows the column definitions for the Deletes_DataSet1. Note that these are before-image values in the Product table.
Figure 40. Column definitions of the deletes data set
- Follow the same steps to complete the insert, update, and delete data sets for the Inventory table.
Complete the following steps to test the staged data configuration setup.
- Start SQL replication, as shown in Figure 41.
Figure 41. Start the SQL Capture DB2 Cmd Window
Note the following about starting SQL replication:
- The start capture console shows that two registrations are in an inactive state.
- The registrations become active only when the SQL apply server is started and there is a hand-shake between SQL capture and SQL apply.
- Start the DataStage Designer, as shown in Figure 42.
Figure 42. Start capture console
- Import the DataStage job StagedDataInt.dsx by browsing to the StagedDataInt directory from the download and clicking OK in the DataStage Import window, as shown in Figure 43.
Figure 43. Import Staged Integration DataStage Job
After you import the dsx file, you see three jobs in the StagedInt directory.
- Open the extractFromProduct_CCD parallel job.
- Make sure that the database name (if it is different from STAGEDB), username, and password are set correctly in the DB2 connector stage.
- Make sure that all three data sets point to existing files with the path set correctly.
- Save and compile the job by clicking the green icon in the designer.
- Open extractFromInventory_CCD parallel job, and repeat the same process.
- Open the sequence job StagedDataIntSQLRep.
- Make sure that all the job activities and exec command stages point to the correct jobs and scripts, respectively, as shown in Figure 44.
- Save and compile the job.
Figure 44. Compile staged integration job
- Open a DB2 command window, cd to the StageDataInt\setupSQLRep directory, and run the script updateSourceTables.sql. This script does an insert, update, and delete on the Product and the Inventory tables.
- Open a DB2 command window, connect to the target database STAGEDB, and run the following SQL statements:
db2 select count(*) from ADMIN.PRODUCT_CCD
db2 select count(*) from ADMIN.INVENTORY_CCD
Both of these statements should return a count of three records, because you introduced three changes (an insert, an update, and a delete) to each of the source tables.
- Run the DataStage job by clicking the Run icon on the DataStage Director menu, as shown in Figure 45. You can start the DataStage Director from the All Programs > IBM Information Server > IBM WebSphere DataStage and QualityStage Director shortcut.
Figure 45. Run staged integration job
- Check the job status. Figure 46 shows that the jobs finished successfully.
Figure 46. Check job status
- View the job logs, as shown in Figure 47.
Figure 47. View the job log
- View the data in the data sets to confirm that they were populated, as shown in Figure 48.
Figure 48. View data set
- View the data in the Updates_DataSet for the updates in the PRODUCT_CCD table (CCD target table for source table Product), as shown in Figure 49.
Figure 49. View data set for updates
- Verify that the END_OF_PERIOD column in the IBMSNAP_SUBS_EVENT table is updated with the current timestamp, as shown in Figure 50.
Figure 50. Verify that the continue SQL apply event is updated in the IBMSNAP_SUBS_EVENT
- The DataStage log in Figure 51 shows that the Execute Command Activity ContinueSQLApply job ran successfully.
Figure 51. Job log
Part 1 addressed the technologies of the InfoSphere Replication Server and DataStage products and the different ways of integrating the two to feed the warehouse. This Part 2 detailed two specific integration options: using MQ messages and using staging tables. The article also walked through the setup and configuration instructions, with screenshots, for the two integration options. With the information in this series, your enterprise can integrate InfoSphere Replication Server and InfoSphere Information Server DataStage to deliver a high-performance solution for feeding the data warehouse.