Traditionally, a data warehouse was updated on a yearly, monthly, or weekly basis by using an ETL process in batches during non-operational hours. In today's business environment, front-line business users need to make decisions based on information that is refreshed hourly, every minute, or even every second. These real-time data requirements made the batch window shrink, and requires an efficient solution to address the issue.
The integration of the IBM InfoSphere DataStage with IBM InfoSphere Change Data Capture is achieved by the InfoSphere CDC Transaction Stage, which is a DataStage component. The CDC Transaction Stage utilizes the services provided by the CDC to capture the changes made to the source databases, and updates the data warehouse with the captured change data in real time, without any data loss even in case of failures, and lets you apply the business logic transformations on the captured change data before it is delivered to the target database. This solution helps the front-line business users to make intelligent business decisions based on the most current updated information in the data warehouse.
IBM InfoSphere DataStage integration with IBM InfoSphere Change Data Capture
The integration between the IBM InfoSphere DataStage with IBM InfoSphere Change Data Capture is implemented by the InfoSphere CDC Transaction Stage. Figure 1 shows how data flows when InfoSphere CDC captures changes at a source database and uses a DataStage job (CDC Transaction Stage) to deliver the change data to a target database.
Figure 1. Data flow in a CDC Transaction Stage job
The detailed description of this data flow is as follows.
- The InfoSphere CDC server for source database monitors and captures the change in the source database.
- The InfoSphere CDC server for source database transfers the change data according to the replication definition.
- The InfoSphere CDC for InfoSphere DataStage server sends data to the CDC Transaction Stage through a TCP/IP session that is created when replication begins. The CDC for DataStage server also sends a COMMIT message (along with bookmark information) periodically, to mark the transaction boundary in the captured log.
- In the DataStage job, the data flows over links from the CDC Transaction Stage to the target database connector stage. The bookmark information is sent over a bookmark link. For each COMMIT message sent by the CDC for DataStage server, the CDC Transaction Stage creates end-of-wave (EOW) markers that are sent on all output links to the target database connector stage.
- The target database connector stage connects to the target database and sends data over the session. When the target database connector stage receives an end-of-wave marker on all input links, it writes bookmark information to a bookmark table and then commits the transaction to the target database.
- Periodically, the CDC for DataStage server requests bookmark information from a bookmark table on the target database. In response to the request, the CDC Transaction Stage fetches the bookmark information through ODBC connection and returns it to the CDC for DataStage server.
- The CDC for DataStage server receives the bookmark information, which is used to determine the starting point in the transaction log where changes are read when replication begins, and to determine if the existing transaction log can be cleaned up.
The bookmark is committed synchronously with the data, so even if the job fails, the bookmark information and the written data are consistent. If the job fails, replication begins at the point that is indicated by the bookmark, and there is no loss of data.
Sample use case: Updating the target database in real time by using the CDC Transaction stage
This section demonstrates a sample use case which monitors the source database for the changes, and updates the target database in real-time with the change data by using the CDC Transaction Stage. This DataStage job includes a CDC Transaction Stage that specifies details about accessing the InfoSphere CDC and a database connector stage to update the target database with the data received from the CDC Transaction Stage. CDC TS in a DataStage job contains a minimum of two output links. One of these output links transfers the actual data, and the second output link transfers the bookmark information, as shown in Figure 2.
Figure 2. Sample use case
The detailed description of the steps required to achieve the real-time change data update in the target database is as follows. The source database can be any database that is supported by the InfoSphere CDC for source database servers. In this use case, Oracle is considered as a source database and DB2 as a target database.
- Connect to the Oracle database and create the source table
SOURCE_TABLE using the following DDL:
- create table SOURCE_TABLE (VC1 VARCHAR(5), CH2 CHAR(5), DT3 TIMESTAMP, NU4 NUMBER , NU5 NUMBER(5,3));
- Connect to the Target DB2 database and create the target table
TARGET_TABLE and bookmark table BOOKMARK_TABLE using the following
- create table TARGET_TABLE (DM_SORTKEY decimal(20), DM_OPERATION_TYPE char(1), DM_TIMESTAMP timestamp, DM_TXID DECIMAL(24), DM_USER varchar(30), VC1 varchar(5), CH2 CHAR(5), DT3 timestamp, NU4 double, NU5 decimal(5,3))
- create table BOOKMARK_TABLE (DM_KEY smallint, DM_BOOKMARK varchar(1024))
Setting up replication
IBM InfoSphere CDC Management Console is used to set up the replication, including adding and configuring a new subscription.
Prerequisite: Install and configure the CDC for Oracle, CDC for DataStage servers, CDC Access server, and the CDC Management Console.
The setup procedure is as follows.
- Log on to the InfoSphere CDC Management Console (MC) using CDC Access server credentials.
- Add new data stores for the source and the target in the
Access Manager tab of the CDC Management Console.
- Add a new data store called datastore_orcl to access the Oracle database.
- Add a new data store called datastore_ds to
access the InfoSphere DataStage.
Figure 3 shows the Datastore properties window for the datastore_ds datastore.
Figure 3. Datastore properties
- Assign a user to each data store in the Connection
Management tab. Figure 4 shows the
datastore_ds which has been assigned the user
Figure 4. Connection Management tab
- In the Configuration tab of the CDC Management
Console, add a new subscription called SUB001. In the
DataStores section, select datastore_orcl as source,
and select datastore_ds as target. Figure 5 shows the
values that are specified for the new subscription
Figure 5. New subscription
- In the Subscriptions tab, right click on the
subscription for which the source tables needs to be mapped
(SUB001), and then click the Map
Tables option. This will start the Map
The detailed table mapping process is as follows.
- Select InfoSphere DataStage, which is the
default option for Direct Connect Method. Then click
Next, as showing in Figure 6.
Figure 6. Select mapping type
- Select the DataStage connection method as Direct
Connect and then click Next, as
shown in Figure 7.
Figure 7. Select connection method
- The Select Source Tables page shows all of
the available tables in the source database. From this list,
select one or more tables for which the change data need to be
captured, as shown in Figure 8.
Figure 8. Select source tables
- Select the required record format output, either
Single Record or Multiple
Records, as shown in Figure 9.
Figure 9. Record Format
- Review the mapping details and if they are correct, click
Finish, as shown in Figure 10.
Figure 10. Review Mapping details
- The table mappings for the subscription are displayed in the
Table Mappings tab, as shown in Figure
Figure 11. Table Mappings
- Select InfoSphere DataStage, which is the default option for Direct Connect Method. Then click Next, as showing in Figure 6.
- In the Subscriptions tab, right-click the
subscription SUB001, and select InfoSphere
DataStage > InfoSphere DataStage Properties. Specify
the following properties.
- In the Direct Connect area, as shown in
Figure 12, specify the following values for the InfoSphere
- In the Project Name field, specify the name of the DataStage project dstage1.
- In the Job Name field, specify the job name SUB001.
- In the Connection Key field, specify the connection key SUB001_CONNECTIONKEY.
Figure 12. DataStage Properties
- Optional: To configure the job to start automatically when the mirroring on the subscription starts, select the Auto-start InfoSphere DataStage Job check box. To enable the auto-start option, the PATH environment variable must contain the path to the dsjob executable file at the time of the instance creation.
- Click OK.
- In the Direct Connect area, as shown in Figure 12, specify the following values for the InfoSphere DataStage job.
Generating a CDC Transaction Stage job template
After completing the setting up replication process described in the previous section, generate a InfoSphere DataStage (CDC Transaction Stage) job template using the following steps.
- In the Subscriptions pane of the Configuration tab in the Management Console, right-click Subscription, and click InfoSphere DataStage > Generate InfoSphere DataStage Job Definition.
- In the Generate InfoSphere Job Definition dialog box, select the location where you want to save the template SUB001.dsx, and click Save.
- Copy the .dsx file on the computer where InfoSphere DataStage and QualityStage Designer is installed.
Importing and configuring the job template
Import the job template generated into the DataStage project dstage1 using the DataStage Designer. Then provide the required information in the CDC Transaction Stage and Target DB2 Connector. The details of this process are as follows.
- Open the DataStage Designer client, click Import > DataStage components, and specify the path to the SUB001.dsx file, and then click OK.
- In the DataStage Repository pane, expand Jobs, and double-click SUB001 to open the job.
- Configure the stage level and link level properties of the CDC
Transaction Stage as follows.
- Double-click the CDC Transaction stage in the job.
- In the navigator window, select the stage as shown in Figure 13,
and specify the required values.
Figure 13. CDC Transaction Stage level properties
The description of these properties is as follows.
- Bookmark DSN: The ODBC data source name for bookmark table.
- ODBC user name: The user name to access the ODBC data source.
- ODBC password: The password to access the ODBC data source.
- Subscription: Name of CDC subscription. The CDC server uses only the first 8 characters to identify the subscription, so it's necessary to put only the first 8 characters of the subscription name registered on CDC GUI. Input strings longer than 8 characters are truncated to 8 characters.
- Port number: The port number to communicate with CDC server.
- Host name: The host name of the CDC server.
- User name: The user name used to access CDC server.
- Password: The password used to access CDC server.
- Heartbeat interval: The interval time in seconds of heartbeat message that the CDC server sends when there is no change data during the time.
- In the navigator window, select the output links one by one as
shown in Figure 14, and specify the required values.
Figure 14. Link-level properties
The description of these link-level properties is given as follows.
- Bookmark link: This is true if the link is for bookmark.
- Table name: This is the table name of change data or bookmark table name.
- Click OK, and then save the job.
- Configure the Target DB2 connector.
- Open the Properties page of the DB2 Connector by double-clicking the DB2 Connector stage.
- In the navigation window of the stage, select the
Bookmark link and specify
BOOKMARK_TABLE for the Table
name property as shown in Figure 15.
Figure 15. DB2 Connector Bookmark link properties
- In the navigation window of the stage, select the
SOURCE_TABLE input link and specify
TARGET_TABLE for the Table
name property as shown in Figure 16.
Figure 16. DB2 Connector data link properties
- In the navigation window, select the DB2
Connector icon and provide the details of the
target database as shown in Figure 17.
Figure 17. DB2 Connector stage level properties
- Click the input link and specify TARGET_TABLE for the Table name property.
- Click OK, and then save the job.
- Compile the job using the DataStage Designer menu.
Start Mirroring on the subscription and execute the job
After importing the job and configuring the stages as described in the previous section, you can start the subscription in the InfoSphere CDC Management Console and execute the job. The details of this process are as follows.
- Log on to the CDC Management Console.
- Select the subscription SUB001.
- Right-click on the subscription and select Start Mirroring to open the Start Mirroring dialog box.
- Select Continuous mirroring method to replicate continuously until the subscription is ended.
- If the DataStage job is configured to start automatically, the DataStage job SUB001 automatically starts running. If the DataStage job is not configured to run automatically, run the job from the DataStage Designer.
- The subscription in the InfoSphere CDC Management Console will be in continuous mirroring state, and the DataStage will be running continuously.
Updating the target table with the change data
After completing the steps mentioned in the previous section, the DataStage job and subscription will be in continuously running state. At this point, the CDC for Oracle Server continuously monitors the source database table for the changes, and captures it if any change occurs to the source table. This captured change data is sent to the CDC Transaction Stage which, in-turn, sends the data to the Target DB2 Connector to update the target table with the change data. This process is described as follows.
- Connect to the source database (Oracle database) and issue the
following SQL query to update the SOURCE_TABLE:
- Insert into SOURCE_TABLE values ('111','AAA');
- Insert into SOURCE_TABLE values ('222','BBB');
- Issue the following select statement in the source database to view
the data in the SOURCE_TABLE table:
SELECT * FROM SOURCE_TABLE;
The select statement returns the data shown in Figure 18.
Figure 18. Data in the source table
- After changes are committed to the source database, connect to the
target database, and issue the following queries to view the data in
the target table:
SELECT * FROM TARGET_TABLE;
The select statement returns the data shown in Figure 19.
Figure 19. Data in the target table
- Issue additional queries to update, insert, and delete data. Issue select statements on the target tables to see how the change data is replicated to the target.
As demonstrated in the previous steps, the CDC Transaction Stage job runs continuously and updates the target table with the change data as soon as it receives it from the CDC Server.
Applying business logic on the change data
The CDC Transaction Stage allows you to apply the business logic related functions on the change data before updating this data in the target database. The DataStage job shown in Figure 20 describes how a CDC Transaction Stage can be used to apply the business logic on the change data.
Figure 20. Applying the business logic on the change data
In this job, whenever the Order table is updated with a new order, the CDC Server captures that change in the table, and transfers that change data to the CDC Transaction Stage. The CDC Transaction Stage passes the change data to the lookup stage in the job. The lookup stage performs the lookup operation with the Inventory table and sends the resulting data to the Sales_Table. The DB2Connector (Sales_Table) updates the target sales table with the resulting data.
The CDC Transaction Stage allows you to apply the business logic on the change data by using various active stages (processing stages) available in the DataStage, as shown in the previous use case. This provides greater flexibility to apply various business logics on the change data according your business requirements.
Advantages of the integrated solution
The solution based on integration of IBM InfoSphere DataStage with IBM InfoSphere Change Data Capture offers the following advantages.
- Improves operational efficiency and saves time and resources by eliminating redundant data transfer and saving network bandwidth.
- Helps to make better decisions with a solution that provides in-flight data transformations to meet specific business requirements. It also securely delivers sensitive data and makes it accessible to authorized recipients only.
- Provides guaranteed data delivery in case of failures.
- Allows transforming the change data by using the pre-built stages that are provided by DataStage.
- Integrates data by performing lookups.
- Applies custom business logic to the change data before delivering the data to a target database.
In the current business environment, decisions by front-line users need to be based on the information that is refreshed hourly, every minute, or even every second. Traditional ETL systems can not achieve this objective. In this article, IBM's solution is based on integration of the IBM InfoSphere DataStage with IBM InfoSphere Change Data Capture, which ensures updating of the change data in real time with guaranteed data delivery in case of failures, and allows users to apply the business logic on the captured change data before it is updated in the target database. The solution based on integration described in this article provides substantial business value to the organizations, including cost efficiency and improved speed.
Thanks to Manish Bhide, Architect, Information Server and IBM Master Inventor, for reviewing this article several times and providing his valuable input to this article.
My Sincere thanks to SriLakhsmi Kotwal, QA Manager, Information Server, for reviewing this article and encouraging me in writing this quality article from day one.
- Get more information about IBM InfoSphere Information Server from the Information Center.
- Get more information about InfoSphere Change Data Capture.
- Visit the developerWorks Information Management zone to find more resources for DB2 developers and administrators.
- Stay current with developerWorks technical events and webcasts focused on a variety of IBM products and IT industry topics.
- Attend a free developerWorks Live! briefing to get up-to-speed quickly on IBM products and tools as well as IT industry trends.
- Follow developerWorks on Twitter.
- Watch developerWorks on-demand demos ranging from product installation and setup demos for beginners, to advanced functionality for experienced developers.
Get products and technologies
- Build your next development project with IBM trial software, available for download directly from developerWorks.
- Evaluate IBM products in the way that suits you best: Download a product trial, try a product online, use a product in a cloud environment, or spend a few hours in the SOA Sandbox learning how to implement Service Oriented Architecture efficiently.
- Participate in the discussion forum.
- Get involved in the My developerWorks community. Connect with other developerWorks users while exploring the developer-driven blogs, forums, groups, and wikis.