Use change data capture technology in InfoSphere Data Replication with InfoSphere BigInsights

Learn how to capture the changes made on source transactional databases such as IBM DB2® and Oracle, and replicate them to the Apache Hadoop Distributed File System in IBM InfoSphere® BigInsights™. This article uses the change data capture replication technology in InfoSphere Data Replication 10.2 for InfoSphere DataStage®, which includes InfoSphere BigInsights support.


Srinidhi Hiriyannaiah (srihiriy@in.ibm.com), Software Engineer, IBM

Srinidhi Hiriyannaiah works as a software engineer for IBM InfoSphere Data Replication Change Data Capture at India Software Labs. He is responsible for assuring quality for the product through automation. He holds a master's degree in software engineering from M.S. Ramaiah Institute of Technology, Visvesvaraya Technological University in Bangalore, India. His main areas of interest include data replication and information management, big data and its applications, and Apache Hadoop, Hive, and HBase.



Sunil Perla (sunilperla@in.ibm.com), Software Engineer, Manager, IBM

Sunil Perla has more than 15 years of software industry experience and has been with the IBM Information Management group at India Software Labs for more than nine years. He is responsible for IBM InfoSphere Data Replication/InfoSphere Change Data Capture product development. He holds a bachelor's degree in computer science from Sri Venkateswara University in India. He has keen interest in data integration, big data, and cloud technologies.



Dev Sarkar (devsarka@in.ibm.com), Software Engineer, Senior Manager, IBM

Dev Sarkar has more than 18 years of experience in the software industry and more than five years of experience with IBM InfoSphere Data Replication/InfoSphere Change Data Capture. He is responsible for product development for InfoSphere Data Replication/InfoSphere Change Data Capture.



25 March 2014


Overview

InfoSphere Data Replication change data capture (CDC) technology captures the changes made on source transactional databases such as DB2 and Oracle, and replicates them to target databases, message queues, and extract, transform, and load (ETL) solutions such as IBM InfoSphere DataStage.

InfoSphere BigInsights analyzes and visualizes Internet-scale data volumes and is based on the Apache Hadoop project. It includes the core Hadoop Distributed File System (HDFS) and several other projects in the Apache Hadoop ecosystem, such as Pig, Hive, HBase, and ZooKeeper. This article gives step-by-step instructions for configuring InfoSphere Data Replication CDC for InfoSphere DataStage with HDFS in InfoSphere BigInsights. In the sample scenario, a smart metering application, CDC replicates incremental changes to HDFS in InfoSphere BigInsights.


Software to install

The following components need to be installed:

  • IBM InfoSphere Data Replication 10.2, including the CDC for InfoSphere DataStage engine
  • IBM InfoSphere BigInsights
  • A supported source database, such as DB2 or Oracle, with its InfoSphere CDC source engine


High-level architecture

In the architecture illustrated below, changes made in the source databases flow through the InfoSphere CDC source and target engines and are replicated to flat files stored in HDFS in InfoSphere BigInsights.

Paths that start with hdfs:// indicate InfoSphere BigInsights HDFS paths. The flat files are written to the specified HDFS directory in InfoSphere BigInsights. You can customize the format of these files by using the sample custom data formatter (SampleDataFormatForHdfs.java) that ships with InfoSphere Data Replication. This Java™ file is packaged in samples.jar in the cdc_install_dir/lib directory.

Changes flow from source databases to HDFS system in InfoSphere BigInsights
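
If you plan to use the custom formatter, you can confirm that the sample ships with your installation by listing the contents of samples.jar from the command line. The install path in this sketch is a placeholder for your own cdc_install_dir:

    # List the sample custom data formatter packaged in samples.jar
    # (/opt/ibm/cdc is a placeholder; substitute your cdc_install_dir)
    jar tf /opt/ibm/cdc/lib/samples.jar | grep SampleDataFormatForHdfs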

Sample scenario: Smart metering system

Utility companies that use smart meters to record the usage of electricity, water, and gas have to deal with large volumes of data that changes frequently. Analysis of this meter data can provide insight into customer usage patterns and the cost to the utility. For example, the company can measure usage during peak hours, charge more for peak-time consumption, set rates based on usage, or create incentives for customers to reduce consumption during particular hours. In this situation, the utility company can use InfoSphere BigInsights to analyze patterns in the data, capture changes in the meter data as they occur, and replicate the changes to HDFS files in InfoSphere BigInsights.

The company can use a replication system, such as InfoSphere Data Replication CDC, as shown below, to capture the changes from the meter as they flow into transactional systems and replicate them to the HDFS files in InfoSphere BigInsights.

Changes flow from meter through CDC to InfoSphere BigInsights files on HDFS

Step-by-step configuration to enable replication to HDFS

To configure replication, complete the following tasks:

Ensure that the InfoSphere BigInsights environment is initialized

The required environment for the Hadoop cluster is automatically set up when InfoSphere BigInsights is installed. To confirm, make sure the CLASSPATH points to the Hadoop core JAR files and the HADOOP_CONF_DIR environment variable points to the directory that contains the Hadoop configuration files.

If these are not set, initialize the Hadoop environment by running the biginsights-env.sh script in the directory BigInsights_install_dir/conf.
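
For example, you might check the variables from a shell on the server where the CDC target engine runs and, if they are not set, source the script so that the settings apply to the current shell. The install path in this sketch is a placeholder for your own BigInsights_install_dir:

    # Check whether the Hadoop environment is already initialized
    echo "$HADOOP_CONF_DIR"
    echo "$CLASSPATH" | tr ':' '\n' | grep hadoop-core

    # If either check comes back empty, initialize the environment
    # (/opt/ibm/biginsights is a placeholder for BigInsights_install_dir)
    source /opt/ibm/biginsights/conf/biginsights-env.sh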

By default, the CLASSPATH environment variable points only to the Hadoop core JAR file hadoop-core-1.0.3.jar. To write to HDFS, add the following JAR files to the CLASSPATH environment variable (see the sketch after this list). These JAR files are available in the BigInsights_install_dir/IHC/lib directory:

  • commons-configuration-1.6.jar
  • commons-logging-1.1.1.jar
  • commons-lang-2.4.jar
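
A minimal sketch of adding these JAR files, assuming BigInsights is installed in /opt/ibm/biginsights (substitute your own BigInsights_install_dir):

    # Append the Apache Commons JAR files that CDC needs to write to HDFS
    BIGINSIGHTS_HOME=/opt/ibm/biginsights   # placeholder install directory
    IHC_LIB=$BIGINSIGHTS_HOME/IHC/lib
    export CLASSPATH=$CLASSPATH:$IHC_LIB/commons-configuration-1.6.jar
    export CLASSPATH=$CLASSPATH:$IHC_LIB/commons-logging-1.1.1.jar
    export CLASSPATH=$CLASSPATH:$IHC_LIB/commons-lang-2.4.jar

Set the CLASSPATH in the shell from which you start the CDC instance so that the instance inherits the setting.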

Start the HDFS and Hive components of the Hadoop cluster

InfoSphere BigInsights ships with several Hadoop components, such as MapReduce, HDFS, Hive, HCatalog, HBase, and Oozie. These services can be started through the InfoSphere BigInsights console or through the command line.

  • To open the InfoSphere BigInsights console in a web browser, go to http://your-server:8080/data/html/index.html#redirect-welcome and start the HDFS and Hive services by selecting them on the Cluster Status tab of the console.
  • To start the services through the command line, run the start-all.sh script in the BigInsights_install_dir/bin directory, as shown in the sketch after this list.
    Note: There are no scripts to selectively start the services through the command line.
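
For example, starting all of the services from the command line might look like the following; the install path is a placeholder for your own BigInsights_install_dir:

    # Start all InfoSphere BigInsights services, including HDFS and Hive
    # (/opt/ibm/biginsights is a placeholder for BigInsights_install_dir)
    /opt/ibm/biginsights/bin/start-all.sh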

Create an HDFS directory for flat files and set up an instance of InfoSphere Data Replication CDC for InfoSphere DataStage

  1. Navigate to the Files tab in the InfoSphere BigInsights console.
  2. Create the directory where you want the flat files to be created by clicking the file folder icon, as shown below. (A command-line alternative is sketched after this list.)
  3. Create an instance of InfoSphere Data Replication CDC for InfoSphere DataStage with HDFS.
  4. Create a subscription and map a table. In the Map Tables > Location > Directory field, specify the full path of the HDFS directory in the format hdfs://your-server:9000/path-to-directory, as shown below. Note: Include the port number as shown.
  5. Optional: Enable the custom formatter by compiling the sample formatter Java file (SampleDataFormatForHdfs.java).
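
As a command-line alternative to steps 1 and 2, you can create the target directory with the hadoop fs utility. The host name, port, and directory in this sketch are placeholders for your own environment:

    # Create the HDFS directory that will hold the replicated flat files
    # (your-server, port 9000, and /user/cdc/flatfiles are placeholders)
    hadoop fs -mkdir hdfs://your-server:9000/user/cdc/flatfiles

    # Confirm that the directory now exists
    hadoop fs -ls hdfs://your-server:9000/user/cdc/

    # Use the same full path, including the port number, in the
    # Map Tables > Location > Directory field of the subscription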

Start replication and check the flat files generated in HDFS

The setup is now complete. Start mirroring on the subscription that you created in the previous step, as shown below.

Monitoring > Subscriptions > Select project HDFS

On the Files tab in InfoSphere BigInsights, verify that the flat file is written to the specified HDFS directory.

Files > HDFS > List of files
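
You can also check the output from the command line. The path in this sketch is the same placeholder directory used earlier:

    # List the flat files that CDC wrote to the target HDFS directory
    hadoop fs -ls hdfs://your-server:9000/user/cdc/flatfiles

    # Inspect the contents of one of the generated flat files
    hadoop fs -cat hdfs://your-server:9000/user/cdc/flatfiles/<file-name>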

Troubleshooting problems

If the subscription fails with the error message An exception occurred in DataStage target, as shown below, either the required Hadoop JAR files are not specified in the CLASSPATH or the HDFS directory does not exist.

Ensure the directory exists and CLASSPATH includes JAR files

If the subscription fails with the error message An error occurred in User Exit table CUST.TS_FLATFILE, as shown below, the Hadoop environment has not been initialized correctly. Follow the steps in the earlier section to initialize the Hadoop environment manually.

Ensure that the Hadoop environment variables are initialized correctly
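
The following sketch combines quick checks for both errors; the paths are the same placeholders used earlier:

    # Verify that the Hadoop core and Apache Commons JAR files are on the CLASSPATH
    echo "$CLASSPATH" | tr ':' '\n' | grep -E 'hadoop-core|commons-'

    # Verify that the Hadoop configuration directory is set
    echo "$HADOOP_CONF_DIR"

    # Verify that the target HDFS directory exists (exit code 0 means it exists)
    hadoop fs -test -d hdfs://your-server:9000/user/cdc/flatfiles && echo "directory exists"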

Conclusion

This article described how to configure InfoSphere Data Replication CDC to capture the changes made on source transactional databases and replicate them to HDFS in InfoSphere BigInsights.
