Feed your data lake with change data capture for real-time integration and analytics

By | 3 minute read | November 22, 2019

Haruto Sakamoto, the Chief Information Officer at a Japanese multinational imaging company, had a few challenges to contend with. His business units had a presence in 180 countries, with geographically dispersed data warehouses and business intelligence applications in various locations. The data stored in those warehouses and applications was in different formats, and these unorganized sources were causing synchronization issues whenever business users ran time-sensitive queries and reports.

On the other side of the world, Chris Roberts, the data management architecture lead at a bank in North America, had a different problem. The bank wanted to improve its customer experience through real-time notifications for new offers, in order to compete with emerging fintech companies that had the potential to draw online traffic and customer interaction away from traditional banks. One of the main issues was that transactional data was located in many systems, including IBM Db2 and Oracle databases, whereas the notification applications were located on a data lake.

Though at first glance it would appear that the issues Haruto and Chris were grappling with were unique to their organizations, they actually struggled with a similar problem.

Both of them needed real-time access to data for reporting and trend analysis in order to make informed business decisions, increase revenue opportunities and provide improved customer experiences.

Traditionally, businesses like the ones Haruto and Chris work in have relied on these options to address such problems:

  • Classical data integration (extract, transform, load, or ETL) allows batch, near-real-time or event/service-driven bulk transformation of high volumes of complex data to be fed to data warehouses.
  • Replication using change data capture (CDC) technology provides bidirectional synchronization and simple transformations for event-driven or real-time integration and for disaster recovery; it also enables real-time customer notifications from the cloud.
  • Virtualization creates virtual views of data from multiple databases and, based on simple transformations, helps present a single view of data spread across geographical locations.

But what if Haruto and Chris had to decide on only one solution to address their challenges?

Enter data integration with real-time capture.

IBM InfoSphere DataStage, with fully built-in CDC technology for real-time capture and deployed as containers, can give Haruto and Chris the best of both the data integration and data replication worlds. DataStage allows complex transformations on large data sets, while CDC captures log-based changes as they occur. The combined solution applies complex transformations to those changes and delivers them to target databases on the cloud and to data lakes using Kafka-based message queues.

Here’s how it works: DataStage real-time capture receives updates from enabled sources. The capture engine puts the updates as messages onto an internal Kafka topic. The DataStage real-time connector then consumes these messages and functions as the source side of any job that receives the updates. You can use the DataStage real-time connector like any other object on the canvas of a job in InfoSphere DataStage, the industry leader in data integration.
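To make that flow concrete, here is a minimal sketch of the capture, topic and connector pattern in Python. It is an illustration only and does not use DataStage or Kafka APIs: the in-memory `InMemoryTopic` stands in for the internal Kafka topic, and `ChangeEvent`, `capture`, `run_job` and the `LOW_BALANCE_THRESHOLD` value are hypothetical names chosen for this example.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, Iterable, Iterator, Optional

# Hypothetical threshold used by the example transformation below.
LOW_BALANCE_THRESHOLD = 100


@dataclass(frozen=True)
class ChangeEvent:
    """One captured change (assumed shape, not a DataStage format)."""
    op: str               # "insert", "update" or "delete"
    key: int              # primary key of the changed row
    row: Optional[dict]   # new row image; None for deletes


class InMemoryTopic:
    """Stand-in for the internal Kafka topic: a simple FIFO of messages."""

    def __init__(self) -> None:
        self._messages: Deque[ChangeEvent] = deque()

    def publish(self, event: ChangeEvent) -> None:
        self._messages.append(event)

    def consume(self) -> Iterator[ChangeEvent]:
        # Yield messages in arrival order until the topic is drained.
        while self._messages:
            yield self._messages.popleft()


def capture(source_log: Iterable[ChangeEvent], topic: InMemoryTopic) -> None:
    """Capture side: put each change from the source log onto the topic."""
    for event in source_log:
        topic.publish(event)


def run_job(topic: InMemoryTopic, target: Dict[int, dict]) -> Dict[int, dict]:
    """Connector side: consume changes, transform them, apply to the target."""
    for event in topic.consume():
        if event.op == "delete":
            target.pop(event.key, None)
        else:
            # Example transformation: flag rows whose balance is below the threshold.
            enriched = dict(event.row)
            enriched["low_balance"] = event.row["balance"] < LOW_BALANCE_THRESHOLD
            target[event.key] = enriched
    return target


source_log = [
    ChangeEvent("insert", 1, {"balance": 500}),
    ChangeEvent("insert", 2, {"balance": 80}),
    ChangeEvent("update", 1, {"balance": 40}),
    ChangeEvent("delete", 2, None),
]
topic = InMemoryTopic()
capture(source_log, topic)
print(run_job(topic, {}))
```

In a real deployment the topic would be durable and the capture and job sides would run as separate processes; the sketch only shows how ordered change messages let the target converge on the source's latest state.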

The three key benefits Haruto and Chris would both receive with this solution are:

  1. A single tool with a common user experience (using the DataStage Flow Designer UI), with no need to manage two separate tools or make complex configurations.
  2. Faster time-to-value, because there is no need for CDC expertise or for managing agent technology.
  3. Support for cloud data sources, because remote capture does not depend on data source agent technology, which helps future-proof the solution.

With InfoSphere DataStage, Haruto and his team could deliver up-to-date, highly available data to end users, improve the availability of data warehouses for real-time reporting and ensure peak performance of production systems by eliminating batch windows. According to Haruto, “IBM InfoSphere change data capture offers us new functionality to meet our growing business demand for real-time reporting. It has helped us address business impact challenges of timeliness, completeness and correctness.”

Chris and his business were able to improve customer service by notifying customers of changes in real time, such as when large transactions occur or when balances slip below a predetermined level. Core transactional data is also available for other use cases, which are expected to be developed over time.

To find out more about how IBM can perform bulk data movement, read this blog post on DataStage multi-cloud capabilities. For cases where simple replication (without complex transformations) will suffice, learn about IBM Data Replication.