Change data capture (CDC) is a technique for detecting and capturing changes made to data in a database and transmitting those changes to downstream systems. CDC enables near real-time or real-time data synchronization, replication and event-driven processing across systems after database changes occur.
Change data capture is a method of real-time data integration that combines and harmonizes data that may otherwise remain siloed or inconsistent across an organization. Other methods of data integration include stream data integration, data virtualization and application integration.
CDC’s ability to keep downstream processes and systems updated in real time or near real time, with low latency, is instrumental to the success of real-time data analytics, cloud migrations and artificial intelligence (AI) models. It supports a variety of use cases across industries, including fraud detection, supply chain management and regulatory compliance in sectors such as retail, finance and healthcare.
There are several approaches to change data capture, with log-based CDC, timestamp-based CDC and trigger-based CDC among the most common. Enterprises can implement change data capture through database-native tools, open source platforms and third-party solutions.
In modern data management, change data capture has emerged as a critical data engineering mechanism. Today’s enterprise data environments are increasingly large and complex. They might contain data from Internet of Things (IoT) devices, distributed databases, applications and other diverse sources. Maintaining consistent, quality data across this growing data ecosystem is an ongoing challenge.
At the same time, businesses demand accurate, up-to-date information that can be used for real-time decision-making. Change data capture is one of several methods that help organizations meet this demand.
Change data capture enables a low-latency data pipeline that delivers fresh data in a way that’s more efficient and less resource-intensive than other data integration methods. For instance, full data replication entails copying entire datasets. In contrast, CDC sends only the data that has changed, thereby reducing the load on source systems, network traffic and demands for compute power.
Change data capture helps organizations access the latest, most accurate information quickly and efficiently, leading to multiple benefits, including:
CDC helps organizations stream operational data into real-time data analytics platforms and dashboards for more accurate and up-to-date reporting, business insights and decision-making. With these capabilities, businesses can support the demands of today’s time-sensitive, 24/7 business environment.
Continuous synchronization between data sources and target systems supports data migrations between databases, cloud environments or applications with minimal downtime or disruption. For example, during cloud migrations, CDC quickly delivers data changes that occur on premises to relevant cloud-based data tables, ensuring consistency between both environments.
ETL (extract, transform, load) data pipelines are integral for data analytics and machine learning workstreams. But ETL execution, which relies on batch processing, tends to move slowly and tax system resources. Integrating CDC into ETL can optimize resource use and accelerate data movement.
Implementing change data capture can help ensure model source data is up to date, so that large language models (LLMs) can deliver accurate, timely outputs. For instance, in retrieval augmented generation (RAG) use cases, AI models connect with external knowledge bases for more relevant responses. Keeping those knowledge bases synchronized with source systems through CDC helps prevent responses based on stale data.
Change data capture identifies and records inserts, updates and deletes occurring in source data systems. These sources can include relational databases such as Oracle Database, PostgreSQL, MySQL, Microsoft SQL Server and Azure SQL Database, as well as non-relational (NoSQL) databases such as Apache Cassandra and MongoDB.
Modern CDC systems commonly use log-based CDC, in which tools read database transaction logs (files that record data changes in a database) to identify changes. Each change event within a transaction log is associated with an ordered log position, such as a log sequence number (LSN). These positions help CDC systems determine the exact order in which modifications occurred and where to resume reading after an interruption.
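To make the role of ordered log positions concrete, here is a minimal Python sketch of a log reader that applies changes in order and resumes from a checkpoint. The in-memory `TRANSACTION_LOG`, the `ChangeEvent` fields and both helper functions are hypothetical stand-ins for illustration; real databases expose their logs through engine-specific formats and interfaces.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    lsn: int              # ordered log position (hypothetical integer LSN)
    op: str               # "insert", "update" or "delete"
    key: int              # primary key of the affected row
    row: Optional[dict]   # new row image; None for deletes


# Hypothetical in-memory stand-in for a database transaction log.
TRANSACTION_LOG = [
    ChangeEvent(101, "insert", 1, {"name": "Ada"}),
    ChangeEvent(102, "insert", 2, {"name": "Grace"}),
    ChangeEvent(103, "update", 1, {"name": "Ada L."}),
    ChangeEvent(104, "delete", 2, None),
]


def read_changes(log, after_lsn):
    """Return the events recorded after the given LSN, in log order."""
    return sorted((e for e in log if e.lsn > after_lsn), key=lambda e: e.lsn)


def apply_changes(target, events):
    """Apply events to a dict-based target and return the last LSN applied."""
    last_lsn = None
    for event in events:
        if event.op == "delete":
            target.pop(event.key, None)
        else:  # insert or update
            target[event.key] = event.row
        last_lsn = event.lsn
    return last_lsn


target = {}

# First run: only the first two log entries exist so far.
checkpoint = apply_changes(target, read_changes(TRANSACTION_LOG[:2], after_lsn=0))

# Later run: resume after the saved checkpoint so nothing is applied twice.
checkpoint = apply_changes(target, read_changes(TRANSACTION_LOG, after_lsn=checkpoint))
```

Persisting the last applied position is what lets a reader of this kind restart without replaying or skipping changes.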
After changes are captured, they are streamed in real time or near real time to downstream systems such as data lakes, data warehouses, streaming data platforms such as Apache Kafka, stream processing engines such as Apache Spark, and ETL/ELT pipelines.
Change data capture can be initiated by either the source system (a push-based approach) or the target system (a pull-based approach). The core difference lies in which system is responsible for capturing and transmitting changes.
In a push-based CDC model, the source system detects changes and immediately “pushes” or sends them to target systems. This approach is commonly implemented using database transaction logs, event streams or message brokers such as Apache Kafka.
Since changes are sent as they occur, push-based CDC typically supports use cases that require real-time or near real-time data movement, such as streaming analytics, event-driven architectures and AI/ML systems.
In pull-based CDC, the target system regularly polls source systems and “pulls” changes when they’re found. Polling can occur on a fixed schedule, which makes pull-based CDC well-suited for batch-oriented workloads or systems that don’t require immediate updates.
While this approach is simpler and requires less complex infrastructure than push-based CDC, it can introduce higher latency and increase query loads on source databases, which can affect database performance. Many modern data platforms support both approaches, depending on data needs and operational requirements.
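The difference between the two models can be sketched in a few lines of Python. The `Source` and `PullingTarget` classes below are hypothetical and purely illustrative: in the push model the source invokes subscribers on every write, while in the pull model the target asks for changes after its last known offset.

```python
class Source:
    """Hypothetical source system that supports both delivery models."""

    def __init__(self):
        self.changes = []       # ordered change history
        self.subscribers = []   # callbacks used by the push model

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def write(self, change):
        self.changes.append(change)
        # Push model: the source sends each change as soon as it occurs.
        for callback in self.subscribers:
            callback(change)

    def changes_since(self, offset):
        # Pull model: the target requests everything after its last offset.
        return self.changes[offset:]


class PullingTarget:
    """Hypothetical target that polls the source for new changes."""

    def __init__(self, source):
        self.source = source
        self.offset = 0
        self.received = []

    def poll(self):
        batch = self.source.changes_since(self.offset)
        self.received.extend(batch)
        self.offset += len(batch)
        return len(batch)


source = Source()

pushed = []                        # push-based target: a simple list
source.subscribe(pushed.append)

puller = PullingTarget(source)     # pull-based target

source.write({"id": 1, "op": "insert"})
source.write({"id": 1, "op": "update"})

before_poll = len(puller.received)  # nothing arrives until the next poll
fetched = puller.poll()
```

The pushed list is current after every write, whereas the polling target lags until `poll()` runs, which is the latency trade-off described above.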
There are several methods for executing change data capture. Common types of CDC include:
Database transaction logs are a standard feature of databases and are used to record all database transactions. (Transaction log files can be used to recover databases in the event of a system failure.)
In log-based CDC, a CDC application processes the database changes (to both data and metadata) recorded in the log and shares the updates with other systems. Log-based CDC has become increasingly popular due to its efficiency: it relies on logs instead of queries, which can place substantial loads on source systems. However, variation in transaction log formats can complicate log-based CDC execution across different databases.
Timestamp-based change data capture, also known as query-based CDC, requires that source tables include a column, such as a timestamp column, that records the date and time of each record change. A CDC tool queries that column to identify changed records in a source table and then delivers the updates to target systems.
While timestamp-based CDC can be simple to implement, it can also put additional load on a source system when polling occurs frequently. Timestamp-based CDC also fails to capture delete operations, because a deleted row’s timestamp is removed along with the rest of the row.
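As a rough illustration of timestamp-based polling and its blind spot for deletes, the following sketch uses Python's built-in `sqlite3` module. The `customers` table, its `updated_at` column (stored here as a simple integer for determinism) and the `poll_changes` helper are assumptions made for this example, not part of any particular CDC tool.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ada", 100), (2, "Grace", 100), (3, "Edsger", 100)],
)


def poll_changes(conn, watermark):
    """Return rows modified after the watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at, id",
        (watermark,),
    ).fetchall()
    new_watermark = max((r[2] for r in rows), default=watermark)
    return rows, new_watermark


# The first poll sees every row because all timestamps exceed the watermark.
initial_rows, watermark = poll_changes(conn, 0)

# An update bumps the timestamp; a delete removes the row and its timestamp.
conn.execute("UPDATE customers SET name = 'Ada L.', updated_at = 200 WHERE id = 1")
conn.execute("DELETE FROM customers WHERE id = 2")

# The second poll returns the update but has no way to see the delete.
changed_rows, watermark = poll_changes(conn, watermark)
```

One common workaround, not covered in this article, is to mark rows as deleted with a flag instead of removing them, so the change still carries a timestamp.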
In trigger-based change data capture, stored procedures or functions known as database triggers are executed once specific modifications (such as insertions, deletions and updates) occur in a database. The changed data is then stored in what’s often called a change table or shadow table.
Like timestamp-based CDC, trigger-based CDC can be simple to implement. However, it can also tax source systems because triggers are “fired” each time a qualifying change occurs in the source table, which adds work to every write.
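A minimal sketch of the trigger-based approach, again using Python's built-in `sqlite3` module, might look like this. The `orders` table, the `orders_changes` change table and the trigger names are hypothetical, and trigger syntax differs between database engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);

    -- Change (shadow) table populated by the triggers below.
    CREATE TABLE orders_changes (
        change_id INTEGER PRIMARY KEY AUTOINCREMENT,
        op TEXT,
        order_id INTEGER,
        status TEXT
    );

    CREATE TRIGGER orders_ai AFTER INSERT ON orders BEGIN
        INSERT INTO orders_changes (op, order_id, status)
        VALUES ('insert', NEW.id, NEW.status);
    END;

    CREATE TRIGGER orders_au AFTER UPDATE ON orders BEGIN
        INSERT INTO orders_changes (op, order_id, status)
        VALUES ('update', NEW.id, NEW.status);
    END;

    CREATE TRIGGER orders_ad AFTER DELETE ON orders BEGIN
        INSERT INTO orders_changes (op, order_id, status)
        VALUES ('delete', OLD.id, OLD.status);
    END;
    """
)

# Ordinary writes against the source table fire the triggers automatically.
conn.execute("INSERT INTO orders VALUES (1, 'new')")
conn.execute("UPDATE orders SET status = 'shipped' WHERE id = 1")
conn.execute("DELETE FROM orders WHERE id = 1")

# A CDC process would read the change table in order and forward the rows.
changes = conn.execute(
    "SELECT op, order_id, status FROM orders_changes ORDER BY change_id"
).fetchall()
```

Unlike the timestamp approach, the delete is captured here, but every write to `orders` now performs a second write to the change table, which is the overhead noted above.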
To help paint a complete picture of CDC, let’s review some common CDC sources and destinations.
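CDC sources are the systems where data originates, such as relational databases (for example, Oracle Database, PostgreSQL, MySQL and Microsoft SQL Server) and NoSQL databases (for example, Apache Cassandra and MongoDB).
CDC destinations are the systems that data is streamed or replicated to, such as data lakes, data warehouses, streaming data platforms and stream processing engines.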
Connecting sources and destinations typically requires CDC tools, connectors and data integration platforms.
ETL (extract, transform, load) and change data capture are both widely used data integration approaches, but they are designed for different purposes.
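The key differences between ETL and CDC come down to timing and scope: ETL typically moves data in scheduled batches and often processes large datasets at once, while CDC continuously transmits only the data that has changed.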
Today’s organizations commonly use both ETL and CDC, often together. For example, CDC complements ETL pipelines by transmitting incremental updates after the initial data load. This allows datasets to stay updated in real time as changes occur in source systems, without having to wait for the next ETL job to run.
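The combination of an initial batch load followed by incremental updates could be sketched as follows. The `transform`, `etl_batch_load` and `apply_cdc_events` functions and the event shape are hypothetical, chosen only to show the same transform logic serving both the batch step and the CDC step.

```python
def transform(row):
    """Shared transform step: normalize the SKU and flag stock-outs."""
    return {"id": row["id"], "sku": row["sku"].upper(), "in_stock": row["qty"] > 0}


def etl_batch_load(source_rows):
    """Batch step: extract, transform and load the full source dataset."""
    return {row["id"]: transform(row) for row in source_rows}


def apply_cdc_events(target, events):
    """CDC step: transform and apply only the rows that changed."""
    for event in events:
        if event["op"] == "delete":
            target.pop(event["id"], None)
        else:  # insert or update
            target[event["id"]] = transform(event["row"])
    return len(events)


source_rows = [
    {"id": 1, "sku": "a-100", "qty": 5},
    {"id": 2, "sku": "b-200", "qty": 0},
]
warehouse = etl_batch_load(source_rows)   # initial full load

# Changes captured after the load; only these rows are processed again.
events = [
    {"op": "update", "id": 1, "row": {"id": 1, "sku": "a-100", "qty": 0}},
    {"op": "insert", "id": 3, "row": {"id": 3, "sku": "c-300", "qty": 9}},
]
processed = apply_cdc_events(warehouse, events)
```

Here the target reflects the two changes without rerunning the batch job over the unchanged row.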
CDC and slowly changing dimensions (SCDs) work together to keep target systems accurate and up to date.
While CDC captures and transmits changes from source systems, SCDs define how those changes are managed and stored within dimension tables in a data warehouse.
(In this context, dimension data typically refers to dimension tables in data warehouses that store descriptive attributes such as customer addresses or phone numbers.)
There are two common types of SCDs: Type 1 and Type 2.
SCD Type 1: Overwrites existing data in a dimension table with new data, without retaining history
SCD Type 2: Adds a new row to a dimension table, preserving full historical changes over time
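The two SCD types can be contrasted with a short Python sketch that applies the same captured change to two dimension tables, modeled here as lists of dictionaries. The `valid_from`, `valid_to` and `is_current` columns are a common convention assumed for this example, and the function names and sample data are hypothetical.

```python
def apply_scd_type1(dimension, change):
    """Type 1: overwrite the existing row in place; no history is kept."""
    for row in dimension:
        if row["customer_id"] == change["customer_id"]:
            row["address"] = change["address"]
            return
    dimension.append(
        {"customer_id": change["customer_id"], "address": change["address"]}
    )


def apply_scd_type2(dimension, change):
    """Type 2: close the current row, then append a new current row."""
    for row in dimension:
        if row["customer_id"] == change["customer_id"] and row["is_current"]:
            row["is_current"] = False
            row["valid_to"] = change["changed_at"]
    dimension.append({
        "customer_id": change["customer_id"],
        "address": change["address"],
        "valid_from": change["changed_at"],
        "valid_to": None,
        "is_current": True,
    })


# A change event as it might arrive from a CDC pipeline (hypothetical shape).
change = {"customer_id": 7, "address": "22 New Road", "changed_at": "2024-06-01"}

type1_dim = [{"customer_id": 7, "address": "1 Old Street"}]
apply_scd_type1(type1_dim, change)

type2_dim = [{
    "customer_id": 7,
    "address": "1 Old Street",
    "valid_from": "2023-01-01",
    "valid_to": None,
    "is_current": True,
}]
apply_scd_type2(type2_dim, change)
```

After the change, the Type 1 table holds a single row with the new address, while the Type 2 table holds both the closed historical row and the new current row.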
Change data capture (CDC) tools capture and stream database changes in real time, helping organizations support modern data integration, analytics and event-driven architectures.
CDC capabilities may be built into specific database or cloud services, such as AWS Database Migration Service (DMS), or may be implemented more widely through standalone tools. Common CDC solutions include open-source tools such as Debezium and commercial platforms such as IBM StreamSets and Oracle GoldenGate.
Many organizations use Apache Kafka as the foundation for CDC pipelines. Kafka-based CDC architectures can capture database changes, stream them through Kafka topics and deliver them to downstream applications, data warehouses, analytics platforms and AI systems.
When evaluating CDC tools, organizations often consider:
Businesses can deploy change data capture for a variety of uses, including:
Continuously tracking changes in financial records through change data capture can enable the detection of fraudulent activity before it results in substantial losses.
CDC can efficiently integrate the massive amounts of real-time data generated by IoT devices, enabling predictive maintenance and real-time monitoring.
Access to real-time sales, inventory and supply chain information supported by change data capture can help companies avoid stock-outs and make lucrative pricing decisions.
Change data capture can help highly regulated companies keep accurate records necessary for reporting and compliance with regulations and laws such as the EU’s GDPR and, in the US, the Sarbanes-Oxley (SOX) Act and HIPAA.