Use change data capture technology in InfoSphere Data Replication with InfoSphere BigInsights


Overview

InfoSphere Data Replication change data capture (CDC) technology captures the changes made in source transactional databases such as DB2 and Oracle and replicates them to target databases, message queues, and extract, transform, and load (ETL) solutions such as IBM InfoSphere DataStage.

InfoSphere BigInsights analyzes and visualizes Internet-scale data volumes and is based on the Apache Hadoop project. It includes the core Hadoop Distributed File System (HDFS) and several other projects in the Apache Hadoop ecosystem, such as Pig, Hive, HBase, and ZooKeeper. This article gives step-by-step instructions for configuring InfoSphere Data Replication CDC for InfoSphere DataStage with HDFS in InfoSphere BigInsights. In the sample scenario, a smart metering application, CDC replicates incremental changes to HDFS in InfoSphere BigInsights.

Software to install

The following components need to be installed:

High-level architecture

In the architecture illustrated below, changes made in the source databases flow through the InfoSphere CDC source and target engines and are replicated to the HDFS flat files stored in InfoSphere BigInsights.

Paths that start with hdfs:// indicate InfoSphere BigInsights HDFS paths. The flat files are written to the specified HDFS directory in InfoSphere BigInsights. You can customize the format of these files by using the sample custom data formatter (SampleDataFormatForHdfs.java) that comes with InfoSphere Data Replication. This Java™ file is available in samples.jar in the cdc_install_dir/lib directory.
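
To inspect or adapt the sample formatter, you can list and extract it from samples.jar with the standard jar tool. The commands below are only a sketch: cdc_install_dir stands for your InfoSphere Data Replication CDC installation directory, and the package path of the sample inside the JAR may differ in your release.

    # Find the sample formatter inside samples.jar (package path may vary by release)
    cd cdc_install_dir/lib
    jar tf samples.jar | grep SampleDataFormatForHdfs

    # Extract the archive contents so the sample source can be reviewed or customized
    jar xf samples.jar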

Changes flow from source databases to the HDFS system in InfoSphere BigInsights

Sample scenario: Smart metering system

Utility companies that use smart meters to record the usage of electricity, water, and gas have to deal with large volumes of frequently changing data. Analysis of this meter data can provide insight into customer usage patterns and the costs involved for the utility. For example, the company can measure usage during peak times, charge more for use during peak hours, set charges based on usage, and create incentives for customers to reduce consumption during particular hours. In this situation, the utility company can use InfoSphere BigInsights to analyze patterns in the data, capture changes in the meter data as they occur, and replicate those changes to the HDFS files in InfoSphere BigInsights.

The company can use a replication system, such as InfoSphere Data Replication CDC, as shown below, to capture the changes from the meter as they flow into transactional systems and replicate them to the HDFS files in InfoSphere BigInsights.

Changes flow from meter through CDC to InfoSphere BigInsights files on HDFS

Step-by-step configuration to enable replication to HDFS

To configure replication, complete the steps in the following sections.

Ensure that the InfoSphere BigInsights environment is initialized

The required environment for the Hadoop cluster is automatically set up when InfoSphere BigInsights is installed. To confirm, make sure the CLASSPATH points to the Hadoop core JAR files and the HADOOP_CONF_DIR environment variable points to the directory that contains the Hadoop configuration files.

If these are not set, initialize the Hadoop environment by running the biginsights-env.sh script in the directory BigInsights_install_dir/conf.
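
The commands below are a minimal sketch of how you might confirm and, if necessary, initialize the environment from a shell. BigInsights_install_dir is the placeholder used throughout this article for your installation directory; sourcing the script (rather than running it in a subshell) is assumed so that the variables remain set in the current session.

    # Check whether the Hadoop environment variables are already set
    echo $HADOOP_CONF_DIR
    echo $CLASSPATH | tr ':' '\n' | grep hadoop-core

    # If they are empty, source the initialization script so the variables
    # are set in the current shell
    . BigInsights_install_dir/conf/biginsights-env.sh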

By default, the CLASSPATH environment variable points only to the Hadoop core JAR file hadoop-core-1.0.3.jar. To write to HDFS, specify the following JAR files in the CLASSPATH environment variable. These JAR files are available in the BigInsights_install_dir/IHC/lib directory:

  • commons-configuration-1.6.jar
  • commons-logging-1.1.1.jar
  • commons-lang-2.4.jar
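
As a sketch, the JAR files can be appended to the CLASSPATH as shown below; the version numbers match the list above but may differ in your installation.

    # Append the supporting JAR files needed for writing to HDFS
    BI_LIB=BigInsights_install_dir/IHC/lib
    export CLASSPATH=$CLASSPATH:$BI_LIB/commons-configuration-1.6.jar
    export CLASSPATH=$CLASSPATH:$BI_LIB/commons-logging-1.1.1.jar
    export CLASSPATH=$CLASSPATH:$BI_LIB/commons-lang-2.4.jar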

Start the HDFS and Hive components of the Hadoop cluster

InfoSphere BigInsights ships with several Hadoop ecosystem components, such as MapReduce, HDFS, Hive, HCatalog, HBase, and Oozie. These services can be started through the InfoSphere BigInsights console or through the command line.

  • To open the InfoSphere BigInsights console in a web browser, go to http://your-server:8080/data/html/index.html#redirect-welcome and start the HDFS and Hive services by selecting them on the Cluster Status tab of the console.
  • To start the services through the command line, run the start-all.sh script in the BigInsights_install_dir/bin directory, as shown in the sketch after this list.
    Note: There are no scripts to selectively start individual services through the command line.
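
For example, from a shell you might start the cluster and then confirm that HDFS is responding; this is only a sketch that uses the article's BigInsights_install_dir placeholder.

    # Start all InfoSphere BigInsights services (there is no per-service start script)
    BigInsights_install_dir/bin/start-all.sh

    # Confirm that HDFS is up by listing the root directory
    hadoop fs -ls /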

Create an HDFS directory for flat files and set up an instance of InfoSphere Data Replication CDC for InfoSphere DataStage

  1. Navigate to the Files tab in the InfoSphere BigInsights console.
    Click the Files tab in the InfoSphere BigInsights GUI
  2. Create the directory where you want the flat files to be created, as shown below (you can also create it from the command line, as sketched after this list).
    Click the file folder icon to create a new directory
  3. Create an instance of InfoSphere Data Replication CDC for InfoSphere DataStage with HDFS by following these steps.
  4. Create a subscription to map a table using these step-by-step instructions, and specify the full path of the HDFS directory in the format hdfs://your-server:9000/location-of-the-directory, as shown below. Note: Include the port number as shown.
    Specify the directory under Map tables > Location > Directory field
  5. Optional: Enable the custom formatter by compiling the sample formatter Java file using these instructions.
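
If you prefer the command line to the console, the target directory can also be created and verified with the hadoop fs utility. The path /user/cdcuser/meterdata below is only an example; substitute the directory that you plan to use in the subscription, and keep the hdfs://your-server:9000 prefix when you enter it in the Directory field.

    # Create the target directory for the flat files (example path)
    hadoop fs -mkdir hdfs://your-server:9000/user/cdcuser/meterdata

    # Verify that the directory was created
    hadoop fs -ls hdfs://your-server:9000/user/cdcuser/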

Start replication and check the flat files generated in HDFS

The setup is now complete. Using these instructions, start mirroring on the subscription that you created in the previous step, as shown below.

Monitoring > Subscriptions > Select project HDFS

On the Files tab in InfoSphere BigInsights, verify that the flat file is written to the specified HDFS directory.

Files > HDFS > List of files
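
You can also inspect the generated files from the command line; the directory and file name below are examples only.

    # List the flat files written by the subscription (example directory)
    hadoop fs -ls /user/cdcuser/meterdata

    # Display the first lines of one of the generated flat files (example file name)
    hadoop fs -cat /user/cdcuser/meterdata/flat-file-name.txt | head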

Troubleshooting problems

If the subscription fails with the error message An exception occurred in DataStage target, as shown below, either the required Hadoop JAR files are not specified in the CLASSPATH or the HDFS directory does not exist.

Ensure the directory exists and CLASSPATH includes JAR files
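
As a quick diagnostic, both conditions can be checked from the shell; the directory path below is an example.

    # Confirm that the supporting commons JAR files are on the CLASSPATH
    echo $CLASSPATH | tr ':' '\n' | grep commons-

    # Confirm that the target HDFS directory exists; -test -d returns 0 if it does
    hadoop fs -test -d /user/cdcuser/meterdata && echo "directory exists"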

If the subscription fails with the error message An error occurred in User Exit table CUST.TS_FLATFILE, as shown below, the Hadoop environment has not been initialized correctly. Follow the steps described earlier to manually initialize the Hadoop environment.

Ensure that the Hadoop environment variables are initialized correctly
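
A minimal environment check, assuming the standard Hadoop configuration file names, looks like this; if the variables are empty, initialize the environment again as described earlier.

    # HADOOP_CONF_DIR should point to the directory with the Hadoop configuration files
    echo $HADOOP_CONF_DIR
    ls $HADOOP_CONF_DIR/core-site.xml $HADOOP_CONF_DIR/hdfs-site.xml

    # Re-initialize the environment in the current shell if needed
    . BigInsights_install_dir/conf/biginsights-env.sh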

Conclusion

This article describes how to configure InfoSphere Data Replication CDC to capture the changes made in source transactional databases and replicate them to HDFS in InfoSphere BigInsights.



