Configure InfoSphere CDC with InfoSphere DataStage for real-time ETL

Using Direct Connect

In this article, learn to configure the IBM® data replication product called IBM InfoSphere® Change Data Capture with the extract, transform, and load (ETL) tool InfoSphere DataStage®, a component of IBM InfoSphere Information Server. Full integration of these two products makes real-time ETL possible.

Srinivasan R. Mottupalli (srinimr@gmail.com), Architect

Author Photo: Srinivasan MottupalliSrinivasan R Mottupalli was a senior software engineer at IBM, India Software Lab, Bangalore where he was in the product development team of InfoSphere Change Data Capture. He was with IBM for more than a decade, and worked on various features of IBM products, including Informix Dynamic Server (IDS), Extended Parallel Server (XPS), DB2, and Informix 4GL (I4GL), as a design and development engineer.



01 March 2012

Overview

InfoSphere Change Data Capture (CDC) and InfoSphere DataStage have introduced an integration option called Direct Connect. Direct Connect enables seamless integration with both real-time replication and ETL tools to provide real-time ETL. This removes the need for a batch window that is needed for the ETL tool to build, for example, an enterprise data warehouse. The data warehouse or data mart can be kept real-time to make appropriate decisions based on more current data. Additionally, metadata integration between InfoSphere CDC and Information Server is provided for end-to-end data lineage utilizing the IBM InfoSphere Metadata Workbench.

For ease of understanding and simplicity, this article uses IBM DB2® for Linux®, UNIX®, and Windows® as the source and target databases. However, it is also possible to support heterogeneous databases with this integration solution.

Prerequisites

To understand this article, you should have basic knowledge about the following.

  • Windows environment
  • Using the Management Console
  • SQL data manipulation (DML) statements

To follow the step-by-step instruction, you will need the following to be installed.

  • IBM DB2 for Linux, UNIX, and Windows
  • IBM InfoSphere CDC Management Console Version 6.5
  • IBM InfoSphere CDC Access Server Version 6.5
  • IBM InfoSphere CDC for DataStage Version 6.5
  • IBM InfoSphere Information Server (for DataStage)
    • DataStage Designer Version 8.5
    • DataStage Director Version 8.5

This article assumes the following configurations.

  • Windows Machine A (9.126.91.125) installed items 4, 5, 6, 7
  • Virtual Windows Machine B (IP: 192.168.67.130) installed items 4, 8

High-level architecture

The left side of Figure 1 shows the replication tool InfoSphere CDC for source database and the InfoSphere CDC for DataStage. In this example, the source database is IBM DB2 for Linux, UNIX, and Windows. The Management Console GUI is used to create the CDC subscription to specify the replication details between the CDC source database, and the CDC for DataStage and Access Server provides the required credentials to access both source database and the target server.

The right side of Figure 1 shows the ETL tool, InfoSphere DataStage, running a job that can be created by the CDC tool, Management Console. DataStage has various connectivity options with the target database for applying the changes that are received by it, and also contains an ODBC connection for applying to a bookmark table. The target database that was assumed here is IBM DB2 for Linux, UNIX, and Windows.

Figure 1. Architecture diagram for Direct Connect integration option
Shows management console with CDC access server on left, DataStage Client on right

The Direct Connect feature enables the communication between the two products. This communication uses a new protocol over TCP/IP. As seen in the diagram, this feature connects both sides and makes the hand-shake for further communications to exchange data between InfoSphere CDC and InfoSphere DataStage.

In order to configure InfoSphere CDC for DataStage, you will need the following.

  1. Connectivity from InfoSphere CDC to DataStage (via TCP/IP).
  2. Appropriate job with necessary information for DataStage to compile and run.
  3. Appropriate tables with expected schema created at the source (captured by CDC) and target databases (applied by DataStage).
  4. InfoSphere CDC aware of ODBC credentials to supply to DataStage.

You shall independently configure InfoSphere CDC and use the running IBM Information Server (DataStage) to perform the replication.


Step by step configuration

The following steps will show you how to configure InfoSphere CDC with DataStage to perform replication for real-time ETL.

Configure the source InfoSphere CDC

  1. Click Start --> All Programs --> IBM InfoSphere Change Data Capture --> Replication Engine for IBM DB2 --> Configure IBM InfoSphere Change Data Capture (IBM DB2), as shown in Figure 2.
    Figure 2. Invoke the Configure tool for CDC for DB2
    screen cap: selecting Configure IBM InfoSphere Change Data Capture (IBM DB2)

    (View a larger version of Figure 2.)

  2. Provide the name of the instance and port number where the CDC source instance is listening, and the database name, user name and password that has privilege to access the source database, as shown in Figure 3.
    Figure 3. CDC new instance screen
    screen cap: enter instance and database details
  3. After creating the instance, click Start to start the instance and wait until the status says running, as shown in Figure 4.
    Figure 4. Start the instance
    screen cap: shows instance in running status

Configure the InfoSphere CDC for the DataStage

  1. Click Start --> All Programs --> IBM InfoSphere Change Data Capture --> Replication Engine for IBM InfoSphere Datastage --> Configure IBM InfoSphere Change Data Capture (IBM InfoSphere DataStage), as shown in figure 5.
    Figure 5. Invoke the Configure tool for CDC for DS
    screen cap: select Configure IBM InfoSphere Change Data Capture (IBM InfoSphere DataStage)

    (View a larger version of Figure 5.)

  2. Click the Add button, as shown in Figure 6.
    Figure 6. Add option in CDC configuration
    screen cap: selecting add
  3. The tool automatically detects if the DataStage is also installed on the same machine, and also detects the absolute path for the dsjob.exe that should be used to start the job automatically, as shown in Figure 7.
    Figure 7. Auto-start of Datastage job
    screen cap: shows path for dsjob.exe
    The screen prompts you as to whether or not you should choose this path for starting DataStage. This tool uses the PATH environment to detect the available dsjob.exe. If you click No, then it can take a new path where the valid binary dsjob.exe is expected to be present. You can review Figure 26 to see the option to automatically start the DataStage job for the specific CDC subscription.
  4. The new instance screen, as shown in Figure 8, is where you can enter the name of the instance and port number where the CDC for DS is listening. Also provide the user name that will be used for CDC authentication (such as "tsuser" used in this example) for the Transaction Stage.
    Figure 8. Target instance credentials
    screen cap: new instance details and authentication
  5. After creating the instance, click Start to start the instance and wait until the status says running, as shown in Figure 9.
    Figure 9. Start the target instance
    Shows running status

Creating data stores for source and target

  1. Click Start --> All Programs --> IBM InfoSphere Change Data Capture --> Management Console, as shown in Figure 10.
    Figure 10. Invoke management console
    screen cap: starting the management console

    (View a larger version of Figure 10.)

  2. As shown in Figure 11, click the Ping button to get the properties of the datastore. Create the datastore for CDC Source (DB2) with the host name (or IP address) and port number corresponding to the CDC source instance, along with a name to identify the data store. The term datasource is synonymous to the database.
    Figure 11. Datastore properties
    name, description, host name, port, and properties
  3. Click Connection Parameters and provide a user name and password that has privileges to access the source database, as shown in Figure 12.
    Figure 12. Provide credentials to access source
    shows login parameters
  4. Similarly, create a datastore for the CDC Target (Data Stage). In the case of the DataStage, the target data store is the component that will be used to propagate data to the InfoSphere DataStage. As shown in Figure 13, click the Ping button to get the properties of the target.
    Figure 13. New datastore screen
    shows identification information for the target CDC database
  5. Click Connection Parameters and type the user name and password that has privilege to access the target. Note that the user name and password shown in Figure 14 are the ones that were shown previously in Figure 8 to authenticate with CDC for DS by the Access Server. InfoSphere CDC normally can have a target database for which user name and password are provided. However, this user name and password are provided to make sure the right and intended user sends the data via the DataStage channel.
    Figure 14. Provide credentials to apply on the target
    screen for credentials
  6. After providing the credentials for the data stores, note from the Configuration tab that the Management Console has established the connection with both the source and target datastores, as shown in Figure 15.
    Figure 15. Observe both source and target are connected
    shows source and target datastores

Creating subscriptions

A subscription is a container that contains table mappings and where you specify the source and target data stores. From within the subscription, you can create the table mappings to specify the source and target tables.

  1. Click the Create new subscription button, as shown in Figure 16.
    Figure 16. Creating a new subscription from the Management Console
    Button is on the Subscriptions tab
  2. Provide the name of the subscription to uniquely identify the replication between source and target tables. Note that it takes datastores from which the tables need to be mapped for the replication, as shown in Figure 17.
    Figure 17. Create new subscription
    source and target datastores

Mapping the tables

From the specific subscription of mapping type, you can map tables from different combinations of source data stores and target datastores.

  1. Click the Map tables icon, as shown in Figure 18.
    Figure 18. Mapping tables in the Management Console
    shows map tables icon

    (View a larger version of Figure 18.)

  2. Select the mapping type. The InfoSphere DataStage option automatically shows up when the target is DataStage. Choose this to map the source tables to the DataStage job, as shown in Figure 19.
    Figure 19. Mapping type
    shows InfoSphere DataStage selected
  3. Choose the method of connection to DataStage. You can choose Flat File to get the text file output from the CDC for DataStage that complies with the input format that DataStage would expect. In this case, there is no need for the DataStage job to be running. In the example shown in Figure 20, Direct Connect is chosen to directly replicate the changes via the CDC Transaction Stage that can be fed into the InfoSphere DataStage directly. In this case, the DataStage job must be running in order for CDC to establish the connection and spool the data via TCP/IP.
    Figure 20. Method of connection
    Shows Direct Connect selected
  4. Choose the source table from which CDC should capture the changes, as shown in Figure 21.
    Figure 21. Select source table
    list of source tables
  5. As shown in Figure 22, choose if the format should be single record or multiple record. If you choose Single Record, the before and after image of the updated record are sent as the single record. This is not storage optimal when the majority of the operations are INSERT and DELETE.

    If you choose Multiple Records, the before and after image are sent as a separate records with the DM_OPERATION_TYPE indicator B to represent the before image and A to represent the after image. This is optimal for storage, however the demand on the network bandwidth is high because each row change would send 2 records.

    Figure 22. Single record or multiple record
    shows multiple records selected
  6. Finally, as shown in Figure 23, verify the consolidated report of the mapping of the table, then click Finish.
    Figure 23. Consolidated report
    shows review of mapping settings
  7. Verify the details in the Management Console. In the window shown in Figure 24, notice the following.
    • The source and target datastores and their connection status.
    • Subscriptions created are listed and their corresponding table mappings.
    • Mapping type being Direct Connect.
    Figure 24. Datastore details on Management Console
    shows subscription, datastore, and source and target tables

    (View a larger version of Figure 24.)

Creating the DataStage job

Use the Management Console to create the DataStage job that interacts with the subscription.

  1. Click InfoSphere DataStage --> InfoSphere DataStage Properties, as shown in Figure 25.
    Figure 25. Invoke the InfoSphere DataStage Properties option
    screen cap: selecting DataStage properties in the management console
  2. Specify the Direct Connect properties, as shown in Figure 26. Ensure that you specify the Connection Key, which is later used to authenticate via DataStage Designer.
    Figure 26. Direct Connect properties
    project name, job name, and connection key

    As shown previously in Figure 26, the Auto-start InfoSphere DataStage Job check box was disabled. When both InfoSphere CDC and InfoSphere Information Server for DataStage are installed on the same machine, this option is automatically enabled for the end-user to select this box. This might be disabled when an incorrect path is given.

  3. After entering the properties, select the Generate InfoSphere DataStage Job Definition option, as shown in Figure 27. This generates the job file that can be directly imported via DataStage Designer. This file also encapsulates all the inputs that were given via the properties.
    Figure 27. Direct Connect properties
    shows selecting generate InfoSphere datastage job definition
  4. As shown in Figure 28, the generated job definition can be saved in any folder for the Designer shown in Figure 37 to import later.
    Figure 28. Generated job definition
    shows filename to save

Start the mirroring of the subscription

  1. As shown in Figure 29, from the Management Console, click the Monitoring tab, then right-click the subscription and choose Start Mirroring.
    Figure 29. Starting the mirroring process
    start mirror selected

    (View a larger version of Figure 29.)

  2. This changes the state of the subscription to Starting, as shown in Figure 30. It remains in this state until the connection is established with the target DataStage job which will be shown in subsequent steps.
    Figure 30. Status of the subscription
    shows starting as the status

    (View a larger version of Figure 30.)

You have completed the configurations of the InfoSphere CDC. Now, assuming that the InfoSphere Information Server is running, you shall configure it using the DataStage Designer.

Configuring ODBC for target DB

In the architecture diagram shown previously in Figure 1, notice that DataStage establishes an ODBC connection to the target database. In order to provide the credentials to connect to the target database, you must configure ODBC.

  1. Click Start --> All Programs --> Administrative Tools --> Data Sources (ODBC), as shown in Figure 31.
    Figure 31. Selecting ODBC data sources
    shows selecting ODBC as the data source

    (View a larger version of Figure 31.)

  2. Choose the data sources (ODBC) where the target database is present. In the example shown in Figure 32, the name of the target database is TARGETDB. Click Add and provide the following information.
    • Data Source Name: TARGETDB
    • Description
    • IP Address
    • TCP Port
    • Target Database Name
  3. Also as shown in Figure 32, test the connection by selecting Test Connect. Enter a user name and password that has privilege to access the TARGETDB.
    Figure 32. Test connectivity
    entering user name and password
  4. If sufficient privileges are present, the connection is established, as shown in Figure 33.
    Figure 33. Connectivity successful
    shows connection established
  5. The System Data Source Name (DSN) is then added with the supplied name, as shown in Figure 34.
    Figure 34. Adding DSN
    shows TARGETDB added

Import, compile, and run the DataStage job

The DataStage job generated via the Management Console can be directly imported to the data stage client and then configured. This was provided by the InfoSphere CDC for the ease of integration.

  1. Click Start --> All Programs --> IBM InfoSphere Information Server --> IBM InfoSphere DataStage and QualityStage Designer, as shown in Figure 35.
    Figure 35. Invoke InfoSphere DataStage Designer Client
    QualityStage Designer selected

    (View a larger version of Figure 35.)

  2. Log in to InfoSphere DataStage Designer client. Machine B (with IP 192.168.67.130) is running the Information Server and hence uses the credentials to login via DataStage Designer client, as shown in Figure 36.
    Figure 36. Logging in to the DataStage Designer client
    enter user name and password
  3. Import the generated DataStage job definition via Designer Client by clicking the DataStage Components option, and then choosing the file generated by Management console, as shown in Figure 37.
    Figure 37. Choose Import
    Import datastage components
  4. Choose the DS job file, as shown in Figure 38.
    Figure 38. DataStage job file
    shows SUB1_CDCDB2_TO_CDCDS.dsx selected
  5. Import the DataStage job, as shown in Figure 39.
    Figure 39. Importing the job
    Import all radio button selected
  6. Import the DataStage job, as shown in Figure 40.
    Figure 40. Import Datastage job
    shows datastage job in the designer tool

    (View a larger version of Figure 40.)

  7. The screen shown in Figure 41 is displayed after importing (or by explicitly opening the file via File menu option). Right-click the CDC_Transaction stage and provide the input. Here you specify the Host name (or IP) and Port number where the InfoSphere CDC for DataStage is listening.
    Figure 41. CDC transaction properties
    whows hostname and port number
    It also takes the connection key that was previously shown in the properties of DataStage job in Figure 26.
  8. Right-click the Database Connector (DB2_Connector in this example as you used Target Database as DB2). Click the destination database and type the database name and the credentials to connect to this database. In the example shown in Figure 42, the target database is TARGETDB.
    Figure 42. DB2 Connector
    shows connection properties
  9. Select Bookmark from the Link box, as shown in Figure 43, and specify TARGETDB.BOOKMARKTABLE. This table keeps track of the CDC bookmark of a specific subscription that contains the current log position of the data that has been applied to the target database, and is the default name for the book mark table. Note that this subscription specific bookmark table is created in the target database with two columns, DM_KEY SMALLINT and DM_BOOKMARK VARCHAR(1024).
    Figure 43. Import Datastage job
    Bookmark link
  10. As shown in Figure 44, select the source table (SRCTAB in this example) from the Link box, which should now lead to the target table (TARGETDB.TGTTAB in this example).
    Figure 44. Import Datastage job
    SCRTAB properties
  11. Compile the DataStage job that was configured, as shown in Figure 45.
    Figure 45. Compiling the DataStage job
    shows job successfully compiled with no errors

    (View a larger version of Figure 45.)

  12. Click the Run icon to run the job now, as shown in Figure 46.
    Figure 46. Running the job
    shows job run options

    (View a larger version of Figure 46.)

Checking the status of DataStage jobs

Use the InfoSphere DataStage Director client to see the status of the DataStage job.

  1. Invoke the Director by selecting Tools --> Run Director from the DataStage and QualityStage Designer, as shown in Figure 47.
    Figure 47. Starting the Director
    shows run director selected

    (View a larger version of Figure 47.)

  2. When invoking Director through the Designer, it automatically uses the credentials used for the Designer. When directly invoking from the InfoSphere Information Server installation, log in to the Director client using the user name and password. Since the Information Server is assumed to be running on Machine B, the same IP, user name and password can be used to log in.
  3. From the View menu, select Log, as shown in Figure 48.
    Figure 48. Selecting to log
    Log selected
  4. The output generated by the DataStage job runner is displayed in Figure 49 to indicate the success or failure of the job.
    Figure 49. Log output
    shows warning and informational messages

    (View a larger version of Figure 49.)

    Make sure the job is successfully running and the subscription from the Management console changed its status from Starting to Mirror Continuous.

Testing the Replication

Test the mirroring between the source database (DB2) that gets replicated to the target database in a different machine via InfoSphere CDC and InfoSphere DataStage.

  1. Invoke the IBM DB2 Command Line Processor. Click Start --> All Programs --> IBM DB2 --> DB2COPY --> Command Line Tools, as shown in Figure 50.
    Figure 50. Invoking the IBM DB2 Command Line Processor
    shows selecting CLP

    (View a larger version of Figure 50.)

  2. Insert some rows into the source database table, as shown in Figure 51. These insertions (or any DML operations that changes data) generate changes in the database logs. The generated database log entries are captured by InfoSphere CDC and the changes will be propagated to DataStage.
    Figure 51. Inserting rows
    shows some SQL insert statements
  3. Verify the replication by selecting from the target database, as shown in Figure 52.
    Figure 52. Verify replication
    target database shows rows that were inserted at the source
    The remote database TARGETDB, which contains table TGTTAB, is momentarily populated with the same data that was replicated by the InfoSphere CDC to the InfoSphere DataStage via Direct Connect in the continuous mirroring mode.

Conclusion

This article has shown step-by-step instructions to configure the IBM integration products for replication and ETL. The heterogeneous replication product InfoSphere Change Data Capture can directly feed real-time data into the ETL product InfoSphere DataStage.

Acknowledgements

The author sincerely thanks his colleagues Vishwas Manjunath, Aniket Kadam, Vishwajit Iyer, Glen Sakuth, and Elaine Pang for reviewing and providing their valuable feedback that helped to refine this article.

Resources

Learn

Get products and technologies

  • Evaluate IBM products in the way that suits you best: Download a product trial, try a product online, use a product in a cloud environment, or spend a few hours in the SOA Sandbox learning how to implement Service Oriented Architecture efficiently.

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Information management on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management
ArticleID=796019
ArticleTitle=Configure InfoSphere CDC with InfoSphere DataStage for real-time ETL
publish-date=03012012