An efficient change data capture using IBM InfoSphere CDC Transaction Stage

This article describes a solution that is based on integration of the IBM® InfoSphere® DataStage® with IBM InfoSphere Change Data Capture that ensures you capture the changed operational data and transmit it in real-time to the data warehouse, keeping it in an updated state at every moment. This enables you to retrieve current information for intelligent decision making, providing substantial business value, guaranteed data delivery, cost efficiency, and improved speed.

Share:

Kiran Jogu (kirankumarj@in.ibm.com), Software QA Engineer, IBM

Author photo of Dr. Kiran JoguDr. Kiran Kumar Jogu works for IBM India Software Labs in the area of Software Testing since 2005. He is a certified Teradata Master and has a wealth of experience on various Information Server connectivity components like Teradata and CDC Transaction Stage. He has completed his Ph.D. in Computer Science and Engineering, and received the Doctorate from JNTU, AP, India. He received his M.Tech. in Computer Science and Engineering from the same university. He has published more than 10 papers in International Journals and Conferences in Software Engineering area. He is a member of IEEE and ACM.



25 October 2012

Also available in Chinese

Introduction

Traditionally, a data warehouse was updated on a yearly, monthly, or weekly basis by using an ETL process in batches during non-operational hours. In today's business environment, front-line business users need to make decisions based on information that is refreshed hourly, every minute, or even every second. These real-time data requirements made the batch window shrink, and requires an efficient solution to address the issue.

The integration of the IBM InfoSphere DataStage with IBM InfoSphere Change Data Capture is achieved by the InfoSphere CDC Transaction Stage, which is a DataStage component. The CDC Transaction Stage utilizes the services provided by the CDC to capture the changes made to the source databases, and updates the data warehouse with the captured change data in real time, without any data loss even in case of failures, and lets you apply the business logic transformations on the captured change data before it is delivered to the target database. This solution helps the front-line business users to make intelligent business decisions based on the most current updated information in the data warehouse.


IBM InfoSphere DataStage integration with IBM InfoSphere Change Data Capture

The integration between the IBM InfoSphere DataStage with IBM InfoSphere Change Data Capture is implemented by the InfoSphere CDC Transaction Stage. Figure 1 shows how data flows when InfoSphere CDC captures changes at a source database and uses a DataStage job (CDC Transaction Stage) to deliver the change data to a target database.

Figure 1. Data flow in a CDC Transaction Stage job
This fugire shows the data flow in a CDC Transaction Stage job

The detailed description of this data flow is as follows.

  • The InfoSphere CDC server for source database monitors and captures the change in the source database.
  • The InfoSphere CDC server for source database transfers the change data according to the replication definition.
  • The InfoSphere CDC for InfoSphere DataStage server sends data to the CDC Transaction Stage through a TCP/IP session that is created when replication begins. The CDC for DataStage server also sends a COMMIT message (along with bookmark information) periodically, to mark the transaction boundary in the captured log.
  • In the DataStage job, the data flows over links from the CDC Transaction Stage to the target database connector stage. The bookmark information is sent over a bookmark link. For each COMMIT message sent by the CDC for DataStage server, the CDC Transaction Stage creates end-of-wave (EOW) markers that are sent on all output links to the target database connector stage.
  • The target database connector stage connects to the target database and sends data over the session. When the target database connector stage receives an end-of-wave marker on all input links, it writes bookmark information to a bookmark table and then commits the transaction to the target database.
  • Periodically, the CDC for DataStage server requests bookmark information from a bookmark table on the target database. In response to the request, the CDC Transaction Stage fetches the bookmark information through ODBC connection and returns it to the CDC for DataStage server.
  • The CDC for DataStage server receives the bookmark information, which is used to determine the starting point in the transaction log where changes are read when replication begins, and to determine if the existing transaction log can be cleaned up.

The bookmark is committed synchronously with the data, so even if the job fails, the bookmark information and the written data are consistent. If the job fails, replication begins at the point that is indicated by the bookmark, and there is no loss of data.


Sample use case: Updating the target database in real time by using the CDC Transaction stage

This section demonstrates a sample use case which monitors the source database for the changes, and updates the target database in real-time with the change data by using the CDC Transaction Stage. This DataStage job includes a CDC Transaction Stage that specifies details about accessing the InfoSphere CDC and a database connector stage to update the target database with the data received from the CDC Transaction Stage. CDC TS in a DataStage job contains a minimum of two output links. One of these output links transfers the actual data, and the second output link transfers the bookmark information, as shown in Figure 2.

Figure 2. Sample use case
This figure shows the sample Use Case with output link transfers

The detailed description of the steps required to achieve the real-time change data update in the target database is as follows. The source database can be any database that is supported by the InfoSphere CDC for source database servers. In this use case, Oracle is considered as a source database and DB2 as a target database.

  • Connect to the Oracle database and create the source table SOURCE_TABLE using the following DDL:
    • create table SOURCE_TABLE (VC1 VARCHAR(5), CH2 CHAR(5), DT3 TIMESTAMP, NU4 NUMBER , NU5 NUMBER(5,3));
  • Connect to the Target DB2 database and create the target table TARGET_TABLE and bookmark table BOOKMARK_TABLE using the following DDLs:
    • create table TARGET_TABLE (DM_SORTKEY decimal(20), DM_OPERATION_TYPE char(1), DM_TIMESTAMP timestamp, DM_TXID DECIMAL(24), DM_USER varchar(30), VC1 varchar(5), CH2 CHAR(5), DT3 timestamp, NU4 double, NU5 decimal(5,3))
    • create table BOOKMARK_TABLE (DM_KEY smallint, DM_BOOKMARK varchar(1024))

Setting up replication

IBM InfoSphere CDC Management Console is used to set up the replication, including adding and configuring a new subscription.

Prerequisite: Install and configure the CDC for Oracle, CDC for DataStage servers, CDC Access server, and the CDC Management Console.

The setup procedure is as follows.

  1. Log on to the InfoSphere CDC Management Console (MC) using CDC Access server credentials.
  2. Add new data stores for the source and the target in the Access Manager tab of the CDC Management Console.
    1. Add a new data store called datastore_orcl to access the Oracle database.
    2. Add a new data store called datastore_ds to access the InfoSphere DataStage.

      Figure 3 shows the Datastore properties window for the datastore_ds datastore.
      Figure 3. Datastore properties
      This figure shows the datastore properties for datastore_ds.
    3. Assign a user to each data store in the Connection Management tab. Figure 4 shows the datastore_ds which has been assigned the user Admin.
      Figure 4. Connection Management tab
      This figure shows the datastore_ds assigned User in the Connection Management tab
  3. In the Configuration tab of the CDC Management Console, add a new subscription called SUB001. In the DataStores section, select datastore_orcl as source, and select datastore_ds as target. Figure 5 shows the values that are specified for the new subscription SUB001.
    Figure 5. New subscription
    This figure shows the values specified in the new subscription
  4. In the Subscriptions tab, right click on the subscription for which the source tables needs to be mapped (SUB001), and then click the Map Tables option. This will start the Map Tables wizard.

    The detailed table mapping process is as follows.
    1. Select InfoSphere DataStage, which is the default option for Direct Connect Method. Then click Next, as showing in Figure 6.
      Figure 6. Select mapping type
      This figure shows the default option and select Mapping Type
    2. Select the DataStage connection method as Direct Connect and then click Next, as shown in Figure 7.
      Figure 7. Select connection method
      This figure shows the selection of Direct Connect select connection method
    3. The Select Source Tables page shows all of the available tables in the source database. From this list, select one or more tables for which the change data need to be captured, as shown in Figure 8.
      Figure 8. Select source tables
      This figure shows the select source tables page with available tables.
    4. Select the required record format output, either Single Record or Multiple Records, as shown in Figure 9.
      Figure 9. Record Format
      This figure shows the InfoSphere DataStage Direct Connect page for selecting the Record Format
    5. Review the mapping details and if they are correct, click Finish, as shown in Figure 10.
      Figure 10. Review Mapping details
      This figure shows the Review Mapping page.
    6. The table mappings for the subscription are displayed in the Table Mappings tab, as shown in Figure 11.
      Figure 11. Table Mappings
      This figure shows the source and target as well as the Mapping Type and method.
  5. In the Subscriptions tab, right-click the subscription SUB001, and select InfoSphere DataStage > InfoSphere DataStage Properties. Specify the following properties.
    1. In the Direct Connect area, as shown in Figure 12, specify the following values for the InfoSphere DataStage job.
      • In the Project Name field, specify the name of the DataStage project dstage1.
      • In the Job Name field, specify the job name SUB001.
      • In the Connection Key field, specify the connection key SUB001_CONNECTIONKEY.
      Figure 12. DataStage Properties
      This figure shows the DataStage Properties page where you specify batch size threshhold, large object truncation size, and Direct Connect details.
    2. Optional: To configure the job to start automatically when the mirroring on the subscription starts, select the Auto-start InfoSphere DataStage Job check box. To enable the auto-start option, the PATH environment variable must contain the path to the dsjob executable file at the time of the instance creation.
    3. Click OK.

Generating a CDC Transaction Stage job template

After completing the setting up replication process described in the previous section, generate a InfoSphere DataStage (CDC Transaction Stage) job template using the following steps.

  1. In the Subscriptions pane of the Configuration tab in the Management Console, right-click Subscription, and click InfoSphere DataStage > Generate InfoSphere DataStage Job Definition.
  2. In the Generate InfoSphere Job Definition dialog box, select the location where you want to save the template SUB001.dsx, and click Save.
  3. Copy the .dsx file on the computer where InfoSphere DataStage and QualityStage Designer is installed.

Importing and configuring the job template

Import the job template generated into the DataStage project dstage1 using the DataStage Designer. Then provide the required information in the CDC Transaction Stage and Target DB2 Connector. The details of this process are as follows.

  1. Open the DataStage Designer client, click Import > DataStage components, and specify the path to the SUB001.dsx file, and then click OK.
  2. In the DataStage Repository pane, expand Jobs, and double-click SUB001 to open the job.
  3. Configure the stage level and link level properties of the CDC Transaction Stage as follows.
    1. Double-click the CDC Transaction stage in the job.
    2. In the navigator window, select the stage as shown in Figure 13, and specify the required values.
      Figure 13. CDC Transaction Stage level properties
      This figure shows the CDC Transaction Stage Stage level properties
      The description of these properties is as follows.
      • Bookmark DSN: The ODBC data source name for bookmark table.
      • ODBC user name: The user name to access the ODBC data source.
      • ODBC password: The password to access the ODBC data source.
      • Subscription: Name of CDC subscription. The CDC server uses only the first 8 characters to identify the subscription, so it's necessary to put only the first 8 characters of the subscription name registered on CDC GUI. Input strings longer than 8 characters are truncated to 8 characters.
      • Port number: The port number to communicate with CDC server.
      • Host name: The host name of the CDC server.
      • User name: The user name used to access CDC server.
      • Password: The password used to access CDC server.
      • Heartbeat interval: The interval time in seconds of heartbeat message that the CDC server sends when there is no change data during the time.
    3. In the navigator window, select the output links one by one as shown in Figure 14, and specify the required values.
      Figure 14. Link-level properties
      This figure shows the link level properties and required values.
      The description of these link-level properties is given as follows.
      • Bookmark link: This is true if the link is for bookmark.
      • Table name: This is the table name of change data or bookmark table name.
    4. Click OK, and then save the job.
  4. Configure the Target DB2 connector.
    1. Open the Properties page of the DB2 Connector by double-clicking the DB2 Connector stage.
    2. In the navigation window of the stage, select the Bookmark link and specify BOOKMARK_TABLE for the Table name property as shown in Figure 15.
      Figure 15. DB2 Connector Bookmark link properties
      This figure shows the DB2 Connector 'Bookmark link' properties.
    3. In the navigation window of the stage, select the SOURCE_TABLE input link and specify TARGET_TABLE for the Table name property as shown in Figure 16.
      Figure 16. DB2 Connector data link properties
      This figure shows the DB2 Connector data link properties, SOURCE_TABLE and TARGET_TABLE.
    4. In the navigation window, select the DB2 Connector icon and provide the details of the target database as shown in Figure 17.
      Figure 17. DB2 Connector stage level properties
      DB2 Connector Stage level properties
    5. Click the input link and specify TARGET_TABLE for the Table name property.
  5. Click OK, and then save the job.
  6. Compile the job using the DataStage Designer menu.

Start Mirroring on the subscription and execute the job

After importing the job and configuring the stages as described in the previous section, you can start the subscription in the InfoSphere CDC Management Console and execute the job. The details of this process are as follows.

  1. Log on to the CDC Management Console.
  2. Select the subscription SUB001.
  3. Right-click on the subscription and select Start Mirroring to open the Start Mirroring dialog box.
  4. Select Continuous mirroring method to replicate continuously until the subscription is ended.
  5. If the DataStage job is configured to start automatically, the DataStage job SUB001 automatically starts running. If the DataStage job is not configured to run automatically, run the job from the DataStage Designer.
  6. The subscription in the InfoSphere CDC Management Console will be in continuous mirroring state, and the DataStage will be running continuously.

Updating the target table with the change data

After completing the steps mentioned in the previous section, the DataStage job and subscription will be in continuously running state. At this point, the CDC for Oracle Server continuously monitors the source database table for the changes, and captures it if any change occurs to the source table. This captured change data is sent to the CDC Transaction Stage which, in-turn, sends the data to the Target DB2 Connector to update the target table with the change data. This process is described as follows.

  1. Connect to the source database (Oracle database) and issue the following SQL query to update the SOURCE_TABLE:
    • Insert into SOURCE_TABLE values ('111','AAA');
    • Insert into SOURCE_TABLE values ('222','BBB');
  2. Issue the following select statement in the source database to view the data in the SOURCE_TABLE table:
    SELECT * FROM SOURCE_TABLE;
    The select statement returns the data shown in Figure 18.
    Figure 18. Data in the source table
    This figure shows the data in the Source table and the select statement.
  3. After changes are committed to the source database, connect to the target database, and issue the following queries to view the data in the target table:
    SELECT * FROM TARGET_TABLE;
    The select statement returns the data shown in Figure 19.
    Figure 19. Data in the target table
    This figure shows the data in the Target table and the select statement that is returned.
  4. Issue additional queries to update, insert, and delete data. Issue select statements on the target tables to see how the change data is replicated to the target.

As demonstrated in the previous steps, the CDC Transaction Stage job runs continuously and updates the target table with the change data as soon as it receives it from the CDC Server.

Applying business logic on the change data

The CDC Transaction Stage allows you to apply the business logic related functions on the change data before updating this data in the target database. The DataStage job shown in Figure 20 describes how a CDC Transaction Stage can be used to apply the business logic on the change data.

Figure 20. Applying the business logic on the change data
This figure shows how a CDC Transaction stage can be used to apply the Business Logic on the Change Data

In this job, whenever the Order table is updated with a new order, the CDC Server captures that change in the table, and transfers that change data to the CDC Transaction Stage. The CDC Transaction Stage passes the change data to the lookup stage in the job. The lookup stage performs the lookup operation with the Inventory table and sends the resulting data to the Sales_Table. The DB2Connector (Sales_Table) updates the target sales table with the resulting data.

The CDC Transaction Stage allows you to apply the business logic on the change data by using various active stages (processing stages) available in the DataStage, as shown in the previous use case. This provides greater flexibility to apply various business logics on the change data according your business requirements.


Advantages of the integrated solution

The solution based on integration of IBM InfoSphere DataStage with IBM InfoSphere Change Data Capture offers the following advantages.

  • Improves operational efficiency and saves time and resources by eliminating redundant data transfer and saving network bandwidth.
  • Helps to make better decisions with a solution that provides in-flight data transformations to meet specific business requirements. It also securely delivers sensitive data and makes it accessible to authorized recipients only.
  • Provides guaranteed data delivery in case of failures.
  • Allows transforming the change data by using the pre-built stages that are provided by DataStage.
  • Integrates data by performing lookups.
  • Applies custom business logic to the change data before delivering the data to a target database.

Conclusions

In the current business environment, decisions by front-line users need to be based on the information that is refreshed hourly, every minute, or even every second. Traditional ETL systems can not achieve this objective. In this article, IBM's solution is based on integration of the IBM InfoSphere DataStage with IBM InfoSphere Change Data Capture, which ensures updating of the change data in real time with guaranteed data delivery in case of failures, and allows users to apply the business logic on the captured change data before it is updated in the target database. The solution based on integration described in this article provides substantial business value to the organizations, including cost efficiency and improved speed.


Acknowledgements

Thanks to Manish Bhide, Architect, Information Server and IBM Master Inventor, for reviewing this article several times and providing his valuable input to this article.

My Sincere thanks to SriLakhsmi Kotwal, QA Manager, Information Server, for reviewing this article and encouraging me in writing this quality article from day one.


References

Resources

Learn

Get products and technologies

  • Build your next development project with IBM trial software, available for download directly from developerWorks.
  • Evaluate IBM products in the way that suits you best: Download a product trial, try a product online, use a product in a cloud environment, or spend a few hours in the SOA Sandbox learning how to implement Service Oriented Architecture efficiently.

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Information management on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management
ArticleID=841538
ArticleTitle=An efficient change data capture using IBM InfoSphere CDC Transaction Stage
publish-date=10252012