InfoSphere Data Replication change data capture (CDC) technology captures changes made on source transactional databases, such as DB2 and Oracle, and replicates them to target databases, message queues, and extract, transform, and load (ETL) solutions such as IBM InfoSphere DataStage.
InfoSphere BigInsights analyzes and visualizes Internet-scale data volumes based on the Apache Hadoop project. It includes the core Hadoop Distributed File System (HDFS) and several other projects in the Apache Hadoop ecosystem, such as Pig, Hive, HBase, and ZooKeeper. This article gives step-by-step instructions for configuring InfoSphere Data Replication CDC for InfoSphere DataStage with HDFS in InfoSphere BigInsights. In the sample scenario, a smart metering application, CDC replicates incremental changes to HDFS in InfoSphere BigInsights.
Software to install
The following components need to be installed:
- InfoSphere Data Replication 10.2 CDC Management Console
- InfoSphere Data Replication 10.2 CDC Access Server
- InfoSphere Data Replication 10.2 CDC for InfoSphere DataStage (Interim Fix 2 or later), installed on the same machine as InfoSphere BigInsights
- InfoSphere BigInsights 2.0
In the architecture illustrated below, changes made in the source databases flow through the InfoSphere CDC source and target engines and are replicated to the HDFS flat files stored in InfoSphere BigInsights.
Paths that start with hdfs:// indicate InfoSphere BigInsights HDFS paths. The flat files are written to the specified HDFS directory in InfoSphere BigInsights. You can customize the format of these files by using the sample custom data formatter (SampleDataFormatForHdfs.java), which comes with InfoSphere Data Replication. This Java™ file is available in samples.jar in the cdc_install_dir/lib directory.
Sample scenario: Smart metering system
Utility companies that use smart meters to record the usage of electricity, water, and gas have to deal with large volumes of data that change frequently. Analysis of this meter data can provide insight into the usage patterns of customers and the cost involved for the utility. For example, the company can measure usage during peak times, charge more for use during peak hours, set the charge based on usage, and create incentives for customers to reduce consumption during particular hours. In this situation, the utility company can use InfoSphere BigInsights to analyze patterns of data, capture changes in the meter data as they occur, and replicate the changes to the HDFS files in InfoSphere BigInsights.
The company can use a replication system, such as InfoSphere Data Replication CDC, as shown below, to capture the changes from the meter as they flow into transactional systems and replicate them to the HDFS files in InfoSphere BigInsights.
Step-by-step configuration to enable replication to HDFS
To configure replication:
- Ensure that the InfoSphere BigInsights environment is initialized
- Create an HDFS directory for the flat files and set up an instance of InfoSphere Data Replication CDC for InfoSphere DataStage
- Start replication and check the flat files generated in HDFS
Ensure that the InfoSphere BigInsights environment is initialized
The required environment for the Hadoop cluster is automatically set up
when InfoSphere BigInsights is installed. To confirm, make sure the
CLASSPATH points to the Hadoop core JAR files and the
HADOOP_CONF_DIR environment variable points to the directory
that contains the Hadoop configuration files.
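The check described above can be sketched in a few lines of Python. This is only an illustration, not part of InfoSphere BigInsights or InfoSphere Data Replication: the function name check_hadoop_env is hypothetical, and the sketch assumes that a Hadoop core JAR can be recognized by a file name that starts with hadoop-core (as in hadoop-core-1.0.3.jar).

```python
import os
from pathlib import Path


def check_hadoop_env(env):
    """Return a list of problems found in the Hadoop-related environment variables."""
    problems = []

    # CLASSPATH should reference at least one Hadoop core JAR.
    # Assumption: the JAR file name starts with "hadoop-core".
    entries = [e for e in env.get("CLASSPATH", "").split(os.pathsep) if e]
    has_core_jar = any(
        Path(e).name.startswith("hadoop-core") and e.endswith(".jar")
        for e in entries
    )
    if not has_core_jar:
        problems.append("CLASSPATH does not reference a hadoop-core JAR file")

    # HADOOP_CONF_DIR should name an existing directory.
    conf_dir = env.get("HADOOP_CONF_DIR", "")
    if not conf_dir:
        problems.append("HADOOP_CONF_DIR is not set")
    elif not Path(conf_dir).is_dir():
        problems.append("HADOOP_CONF_DIR does not point to an existing directory")

    return problems


if __name__ == "__main__":
    # Check the environment of the current shell session.
    for problem in check_hadoop_env(os.environ):
        print("WARNING:", problem)
```

An empty result means that both variables look plausible; it does not prove that every JAR file needed for writing to HDFS is on the CLASSPATH.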
If these are not set, initialize the Hadoop environment by running the
biginsights-env.sh script in the directory
By default, the
CLASSPATH environment variable points only to
the Hadoop core JAR file hadoop-core-1.0.3.jar. To write to HDFS, specify
the following JAR files in the
variable. These JAR files are available in the
Start the HDFS and Hive components of the Hadoop cluster
InfoSphere BigInsights ships with several Hadoop components, such as Apache MapReduce, HDFS, Hive, Catalog, HBase, Oozie, and others. These services can be started through the InfoSphere BigInsights console or through the command line.
- To open the InfoSphere BigInsights console in a web browser, go to http://your-server:8080/data/html/index.html#redirect-welcome and start the HDFS and Hive services by selecting them from the Cluster Status tab of the console.
- To start the services through the command line, run the start-all.sh script in the BigInsights_install_dir/bin directory.
Note: There are no scripts to selectively start the services through the command line.
Create an HDFS directory for flat files and set up an instance of InfoSphere Data Replication CDC for InfoSphere DataStage
- Navigate to the Files tab in the InfoSphere BigInsights console.
- As shown below, create the directory in which the flat files are to be written.
- Create an instance of InfoSphere Data Replication CDC for InfoSphere DataStage with HDFS by following these steps.
- Create a subscription to map a table using these step-by-step instructions. Specify the full path of the HDFS directory in the format hdfs://your-server:9000/location of the directory, as shown below. Note: Include the port number as shown.
- Optional: Enable the custom formatter by compiling the sample formatter Java file using these instructions.
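Because the subscription step requires the full HDFS path, including the port number, it can help to check the path before entering it. The following Python sketch is an illustration only; the function name validate_hdfs_path and the example directory /user/cdc/flatfiles are hypothetical and are not part of either product.

```python
from urllib.parse import urlparse


def validate_hdfs_path(path):
    """Return a list of problems with an HDFS directory path (empty if none are found)."""
    problems = []
    parsed = urlparse(path)

    if parsed.scheme != "hdfs":
        problems.append("path must start with hdfs://")
    if not parsed.hostname:
        problems.append("server name is missing")

    # urlparse raises ValueError when the port is not a valid number.
    try:
        port = parsed.port
    except ValueError:
        port = None
    if port is None:
        problems.append("port number is missing or invalid")

    if not parsed.path or parsed.path == "/":
        problems.append("directory location is missing")

    return problems


if __name__ == "__main__":
    # Hypothetical example path in the format used in this article.
    print(validate_hdfs_path("hdfs://your-server:9000/user/cdc/flatfiles"))
```

The sketch checks only the shape of the path. It does not contact the cluster, so it cannot tell whether the directory actually exists in HDFS.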
Start replication and check the flat files generated in HDFS
The setup is now complete. Start mirroring on the subscription created in the previous step, as shown below, using these instructions.
On the Files tab in InfoSphere BigInsights, verify that the flat file is written to the specified HDFS directory.
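As an alternative to the console, the directory can also be listed with the standard Hadoop file system shell (hadoop fs -ls). The Python sketch below wraps that command. It assumes that the hadoop command is on the PATH of the user who runs it; the function names and the example directory are hypothetical.

```python
import subprocess


def build_ls_command(hdfs_dir):
    """Build the argument list for listing an HDFS directory with the Hadoop shell."""
    return ["hadoop", "fs", "-ls", hdfs_dir]


def list_flat_files(hdfs_dir):
    """Run 'hadoop fs -ls' and return its output lines.

    Raises subprocess.CalledProcessError if the command fails, for example
    when the directory does not exist.
    """
    result = subprocess.run(
        build_ls_command(hdfs_dir),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    # Hypothetical directory; replace it with the path used in the subscription.
    for line in list_flat_files("hdfs://your-server:9000/user/cdc/flatfiles"):
        print(line)
```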
If the subscription fails with the error message "An exception occurred in DataStage", as shown below, either the required Hadoop JAR files are not specified in the CLASSPATH or the HDFS directory does not exist.
If the subscription fails with the error message "An error occurred in User Exit", as shown below, the Hadoop environment has not been initialized correctly.
Follow the steps to manually initialize the
This article describes how to configure InfoSphere Data Replication CDC to capture the changes made on source transactional databases and replicate them to HDFS in InfoSphere BigInsights. Use the following resources to learn more about the products used in the sample scenario.
- Learn more about how IBM InfoSphere Change Data Capture integrates information across heterogeneous data stores in real time.
- Participate in the discussion forum.
- Get involved in the CDC (Change Data Capture) community and connect with other CDC users while exploring the developer-driven blogs, forums, groups, and wikis.
- Get involved in the My developerWorks community and connect with other developerWorks users while exploring the developer-driven blogs, forums, groups, and wikis.