What is change data capture?

Authors

Alice Gomstyn

Staff Writer

IBM Think

Alexandra Jonker

Staff Editor

IBM Think

Change data capture, or CDC, is a technique for identifying and recording data changes in a database. CDC delivers these changes in real time to different target systems, enabling data to be synchronized across an organization immediately after a database change occurs.

Change data capture is a method of real-time data integration, a practice that combines and harmonizes data that may be siloed or inconsistent across the organization. Other methods include stream data integration, data virtualization and application integration.

The ability of CDC to keep systems up to date in real time (and with low latency) is instrumental to the success of real-time data analytics, cloud migrations and even AI models. It has a variety of use cases across sectors, from retail to finance to healthcare, assisting with fraud detection, supply chain management and regulatory compliance.

There are multiple approaches to change data capture, with log-based CDC, timestamp-based CDC and trigger-based CDC among the most common. Enterprises can implement change data capture through database-native tools, open source platforms and third-party solutions.

What are the benefits of change data capture?

In modern data management, change data capture has emerged as a critical data engineering mechanism. Today’s enterprise data environments are increasingly large and complex. They might contain data from Internet of Things (IoT) devices, distributed databases, applications and other diverse sources. Maintaining consistent, quality data across this growing data ecosystem is an ongoing challenge.

At the same time, businesses demand accurate, up-to-date information that can be leveraged for real-time decision-making. Change data capture is one of several methods that help organizations meet this demand.

Change data capture enables a low-latency data pipeline that delivers fresh data in a way that’s more efficient and less resource-intensive than other data integration methods. For instance, full data replication entails copying entire datasets on every run. In contrast, CDC sends only the data that has changed, thereby reducing the load on source systems, network traffic and demands for compute power.
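The difference in payload size can be illustrated with a minimal Python sketch. The table contents and the function names (`full_replication_payload`, `changed_rows_payload`) are hypothetical and exist only for this example. Comparing two snapshots is used here purely to count what changed; it is not how the CDC methods described later in this article detect changes.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, int]
Table = Dict[int, Row]  # primary key -> row

def full_replication_payload(table: Table) -> List[Row]:
    """Full replication: every row is sent on every sync."""
    return list(table.values())

def changed_rows_payload(before: Table, after: Table) -> List[Tuple[str, int, Optional[Row]]]:
    """Change-only delivery: one (operation, key, row) event per changed row."""
    events: List[Tuple[str, int, Optional[Row]]] = []
    for key, row in after.items():
        if key not in before:
            events.append(("insert", key, row))
        elif before[key] != row:
            events.append(("update", key, row))
    for key in before:
        if key not in after:
            events.append(("delete", key, None))
    return events

# A hypothetical 1,000-row inventory table with three changes applied.
before = {i: {"id": i, "qty": 10} for i in range(1, 1001)}
after = {key: dict(row) for key, row in before.items()}
after[42]["qty"] = 9                   # one update
after[1001] = {"id": 1001, "qty": 5}   # one insert
del after[7]                           # one delete

full = full_replication_payload(after)
delta = changed_rows_payload(before, after)
print(len(full), "rows sent by a full copy;", len(delta), "events sent as changes")
```

In this toy case the full copy moves 1,000 rows while the change stream carries three events, which is the kind of saving the paragraph above describes.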

CDC helps organizations access the latest, most accurate information quickly and efficiently, leading to multiple benefits, including:

Real-time decision-making

A real-time stream of data changes enables real-time data analytics and business intelligence. With these capabilities, businesses can support the demands of today’s time-sensitive, 24/7 business environment.

Successful cloud migration

During cloud migrations, CDC quickly delivers data changes that occur on premises to relevant cloud-based data tables, ensuring consistency between both environments. This capability also minimizes system downtime during the migration.

ETL process improvement

ETL (extract, transform, load) data pipelines are integral for data analytics and machine learning workstreams. But ETL execution, which relies on batch processing, tends to move slowly and tax system resources. Integrating CDC into ETL can optimize resource use and accelerate data movement.

Better artificial intelligence (AI) performance

Implementing change data capture can help ensure model source data is up to date, so that large language models (LLMs) can deliver accurate, timely outputs. For instance, in retrieval augmented generation (RAG) use cases, AI models connect with external knowledge bases for more relevant responses. CDC can keep those knowledge bases synchronized with the source systems that feed them.

How does change data capture work?

Change data capture identifies and records change events taking place in various data sources. These sources can include relational databases such as Oracle, PostgreSQL, MySQL, Microsoft’s Azure SQL and Microsoft’s SQL Server, as well as non-relational (NoSQL) databases such as Apache Cassandra and MongoDB.

After changes are identified, they are transferred from the source database in real time or near-real time to target systems. Data stores such as data lakes and data warehouses; real-time analytics and streaming data platforms such as Apache Kafka and Apache Spark; and ETL (extract, transform, load) and ELT (extract, load, transform) solutions are all examples of target systems.

Change data capture may be initiated by either the source systems (what’s known as a “push” approach) or the target systems (a “pull” approach). In the former, a source system “pushes” or sends changes to target systems. In the latter, a target system regularly polls source systems and “pulls” changes when they’re found.
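The contrast between a source that sends changes and a target that polls for them can be sketched in a few lines of Python. The classes `PushSource` and `PullSource` and the event shape are hypothetical, invented for illustration; they do not correspond to any particular CDC product.

```python
from typing import Callable, Dict, List

ChangeEvent = Dict[str, object]

class PushSource:
    """Source that sends each change to subscribed targets as it happens."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[ChangeEvent], None]] = []

    def subscribe(self, callback: Callable[[ChangeEvent], None]) -> None:
        self._subscribers.append(callback)

    def record_change(self, event: ChangeEvent) -> None:
        for callback in self._subscribers:
            callback(event)  # the source initiates delivery

class PullSource:
    """Source that only stores changes; targets must ask for them."""

    def __init__(self) -> None:
        self._changes: List[ChangeEvent] = []

    def record_change(self, event: ChangeEvent) -> None:
        self._changes.append(event)

    def changes_since(self, position: int) -> List[ChangeEvent]:
        return self._changes[position:]

# Push: the target receives the event without asking for it.
received: List[ChangeEvent] = []
push_source = PushSource()
push_source.subscribe(received.append)
push_source.record_change({"op": "insert", "id": 1})

# Pull: the target polls and remembers how far it has read.
pull_source = PullSource()
pull_source.record_change({"op": "insert", "id": 1})
pull_source.record_change({"op": "update", "id": 1})
position = 0
batch = pull_source.changes_since(position)
position += len(batch)
next_batch = pull_source.changes_since(position)  # nothing new yet
```

In the pull variant, freshness depends on how often the target polls, while in the push variant the source carries the work of delivery.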

Common methods for change data capture

There are several methods for executing change data capture. Common types of CDC include: 

  • Log-based CDC
  • Timestamp-based CDC
  • Trigger-based CDC

Log-based CDC

Database transaction logs are a standard feature of databases and are used to record all database transactions. (Transaction log files can be used to recover databases in the event of a system failure.)

In log-based CDC, a CDC application processes the database changes recorded in the log and shares the updates with other systems. Log-based CDC has become increasingly popular, in part because of its reliance on logs instead of queries that might degrade source system performance. However, variation in transaction log formats can complicate log-based CDC execution across different databases.
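As a rough illustration of the idea, the Python sketch below reads a simplified, hypothetical transaction log and applies its records to a replica. The JSON log format, the `lsn` (log sequence number) field and the function names are assumptions made for this example. Real transaction logs are database-specific and usually binary, which is the variation noted above.

```python
import json
from typing import Dict, Iterator, List

# Hypothetical, simplified log: one JSON record per committed change.
LOG_LINES: List[str] = [
    '{"lsn": 1, "op": "insert", "table": "orders", "row": {"id": 1, "status": "new"}}',
    '{"lsn": 2, "op": "update", "table": "orders", "row": {"id": 1, "status": "paid"}}',
    '{"lsn": 3, "op": "delete", "table": "orders", "row": {"id": 1}}',
]

def read_changes(log_lines: List[str], after_lsn: int) -> Iterator[dict]:
    """Yield log records whose sequence number is greater than after_lsn."""
    for line in log_lines:
        record = json.loads(line)
        if record["lsn"] > after_lsn:
            yield record

def apply_change(target: Dict[int, dict], record: dict) -> None:
    """Apply one change record to an in-memory replica keyed by row id."""
    row = record["row"]
    if record["op"] == "delete":
        target.pop(row["id"], None)
    else:  # inserts and updates both overwrite the row
        target[row["id"]] = row

def sync(log_lines: List[str], target: Dict[int, dict], last_lsn: int) -> int:
    """Apply all unseen records and return the new checkpoint."""
    for record in read_changes(log_lines, last_lsn):
        apply_change(target, record)
        last_lsn = record["lsn"]
    return last_lsn

replica: Dict[int, dict] = {}
checkpoint = sync(LOG_LINES[:2], replica, last_lsn=0)  # log has two records so far
state_after_first_sync = dict(replica)
checkpoint = sync(LOG_LINES, replica, last_lsn=checkpoint)  # only record 3 is new
```

The reader never queries the source tables. It works only from the log and a saved checkpoint, which is why this approach puts little load on the source system.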

Timestamp-based CDC

Timestamp-based change data capture, also known as query-based CDC, requires that database table schemas feature columns, such as timestamp columns, noting the date and time of record changes. A CDC tool can be used to identify changed records through the timestamp column in a source table and then deliver updates to target systems.

While timestamp-based CDC can be simple to implement, it can also put additional load on a system when polling for timestamp data occurs frequently. Timestamp-based CDC also fails to capture delete operations in which a row is physically removed, because the timestamp is deleted along with the rest of the row and no record remains to be queried.
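A minimal sketch of timestamp-based polling, using Python’s built-in `sqlite3` module, shows both the mechanism and the missed delete. The `customers` table, the `updated_at` column and the `poll_changes` function are hypothetical names chosen for this example, and integer timestamps are used so the result is easy to follow.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER NOT NULL)"
)

def poll_changes(connection: sqlite3.Connection, last_seen: int):
    """Return rows modified after last_seen, plus the new high-water mark."""
    rows = connection.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    new_mark = rows[-1][2] if rows else last_seen
    return rows, new_mark

# Two inserts, then the first poll.
conn.execute("INSERT INTO customers VALUES (1, 'Ada', 100)")
conn.execute("INSERT INTO customers VALUES (2, 'Grace', 101)")
first_batch, mark = poll_changes(conn, last_seen=0)

# One update and one delete, then the second poll.
conn.execute("UPDATE customers SET name = 'Ada L.', updated_at = 102 WHERE id = 1")
conn.execute("DELETE FROM customers WHERE id = 2")
second_batch, mark = poll_changes(conn, mark)

print(first_batch)
print(second_batch)  # the update appears; the delete of id 2 does not
```

The second poll returns the updated row but nothing for customer 2, because the deleted row no longer has a timestamp to match. A common workaround, not covered in this article, is to mark rows as deleted instead of removing them.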

Trigger-based CDC

In trigger-based change data capture, stored procedures or functions known as database triggers are executed once specific modifications (such as insertions, deletions and updates) occur in a database. The changed data is then stored in what’s often called a change table or shadow table.

Like timestamp-based CDC, trigger-based CDC can be simple to implement. However, it can also tax source systems because triggers are “fired” each time a transaction occurs in the source table.
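The pattern can be sketched with SQLite triggers through Python’s built-in `sqlite3` module. The `products` table, the `products_changes` change table and the trigger names are hypothetical. Trigger syntax differs between database systems, so this is an illustration of the approach, not a template for any specific database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE products (id INTEGER PRIMARY KEY, price REAL);

    -- Change (shadow) table that the triggers write to.
    CREATE TABLE products_changes (
        change_id  INTEGER PRIMARY KEY AUTOINCREMENT,
        op         TEXT NOT NULL,
        product_id INTEGER NOT NULL,
        price      REAL
    );

    CREATE TRIGGER products_after_insert AFTER INSERT ON products
    BEGIN
        INSERT INTO products_changes (op, product_id, price)
        VALUES ('insert', NEW.id, NEW.price);
    END;

    CREATE TRIGGER products_after_update AFTER UPDATE ON products
    BEGIN
        INSERT INTO products_changes (op, product_id, price)
        VALUES ('update', NEW.id, NEW.price);
    END;

    CREATE TRIGGER products_after_delete AFTER DELETE ON products
    BEGIN
        INSERT INTO products_changes (op, product_id, price)
        VALUES ('delete', OLD.id, NULL);
    END;
    """
)

# Ordinary writes to the source table; the triggers fire on each one.
conn.execute("INSERT INTO products VALUES (1, 9.99)")
conn.execute("UPDATE products SET price = 7.99 WHERE id = 1")
conn.execute("DELETE FROM products WHERE id = 1")

changes = conn.execute(
    "SELECT op, product_id, price FROM products_changes ORDER BY change_id"
).fetchall()
print(changes)
```

Unlike the timestamp approach, the delete is captured here. The cost is that every write to `products` now performs a second write to the change table, which is the extra load on the source system described above.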

Change data capture tools

Tools that perform change data capture may be native to specific environments and database systems, such as AWS Database Migration Service, or may be implemented more widely. Non-native change data capture software solutions include open source platforms such as Debezium and commercial platforms such as IBM StreamSets and Oracle GoldenGate.

As companies mull which solution to choose, they may consider factors such as pricing, connectors to source and target systems, and application programming interfaces (APIs) for system integration.

Use cases for change data capture

Businesses can deploy change data capture for a variety of uses, including:

Fraud detection

Continuously tracking changes in financial records through change data capture can enable the detection of fraudulent activity before it results in substantial losses.

Internet of Things (IoT) enablement

CDC can efficiently integrate the massive amounts of real-time data generated by IoT devices, enabling predictive maintenance and real-time monitoring.

Inventory and supply chain management

Access to real-time sales, inventory and supply chain information supported by change data capture can help companies avoid stock-outs and make lucrative pricing decisions.

Regulatory compliance

Change data capture can help highly regulated companies keep the accurate records necessary for reporting and for compliance with regulations and laws such as the EU’s GDPR and, in the US, the Sarbanes-Oxley (SOX) Act and HIPAA.
