What is change data capture?

By Alice Gomstyn, Alexandra Jonker

Change data capture, defined

Change data capture (CDC) is a technique for detecting and capturing changes made to data in a database and transmitting those changes to downstream systems. CDC enables near real-time or real-time data synchronization, replication and event-driven processing across systems after database changes occur.

Change data capture is a method of real-time data integration. Data integration combines and harmonizes data that may otherwise remain siloed or inconsistent across an organization. Other methods of data integration include stream data integration, data virtualization and application integration.

CDC’s ability to keep downstream processes and systems updated in real time or near real time is instrumental to the success of real-time data analytics, cloud migrations and artificial intelligence (AI) models. It supports a variety of use cases across industries, including fraud detection, supply chain management and regulatory compliance in sectors such as retail, finance and healthcare.

There are several approaches to change data capture, with log-based CDC, timestamp-based CDC and trigger-based CDC among the most common. Enterprises can implement change data capture through database-native tools, open source platforms and third-party solutions.

What are the benefits of change data capture?

In modern data management, change data capture has emerged as a critical data engineering mechanism. Today’s enterprise data environments are increasingly large and complex. They might contain data from Internet of Things (IoT) devices, distributed databases, applications and other diverse sources. Maintaining consistent, quality data across this growing data ecosystem is an ongoing challenge.

At the same time, businesses demand accurate, up-to-date information that can be used for real-time decision making. Change data capture is one of several methods that help organizations meet this demand.

Change data capture enables a low-latency data pipeline that delivers fresh data in a way that’s more efficient and less resource-intensive than other data integration methods. For instance, full data replication entails copying entire datasets each time. In contrast, CDC sends only the data that has changed, thereby reducing the load on source systems, network traffic and demands for compute power.

CDC helps organizations access the latest, most accurate information quickly and efficiently, leading to multiple benefits, including:

Faster decision-making
Zero-downtime migrations
ETL process improvement
Improved AI performance

Faster decision-making

CDC helps organizations stream operational data into real-time data analytics platforms and dashboards for more accurate and up-to-date reporting, business insights and decision-making. With these capabilities, businesses can support the demands of today’s time-sensitive, 24/7 business environment.

Zero-downtime migrations

Continuous synchronization between data sources and target systems supports data migrations between databases, cloud environments or applications with minimal downtime or disruption. For example, during cloud migrations, CDC quickly delivers data changes that occur on premises to relevant cloud-based data tables, ensuring consistency between both environments.

ETL process improvement

ETL (extract, transform, load) data pipelines are integral for data analytics and machine learning workstreams. But ETL execution, which relies on batch processing, tends to move slowly and tax system resources. Integrating CDC into ETL can optimize resource use and accelerate data movement.

Improved AI performance

Implementing change data capture can help ensure model source data is up to date, so that large language models (LLMs) can deliver accurate, timely outputs. For instance, in retrieval-augmented generation (RAG) use cases, AI models connect with external knowledge bases for more relevant responses. CDC can keep those knowledge bases in sync with the source systems they draw from.

How does change data capture work?

Change data capture identifies and records inserts, updates and deletes occurring in source data systems. These sources can include relational databases such as Oracle Database, PostgreSQL, MySQL, Microsoft SQL Server and Azure SQL Database, as well as non-relational (NoSQL) databases such as Apache Cassandra and MongoDB.

Modern CDC systems commonly use log-based CDC, in which tools read database transaction logs (files that record data changes in a database) to identify changes. Each change event within a transaction log is associated with an ordered log position, such as a log sequence number (LSN). These positions help CDC systems determine the exact order in which modifications occurred and keep track of which changes have already been processed.
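
As a rough illustration of how ordered log positions can be used, the following Python sketch replays a small in-memory list standing in for a transaction log. The `LogEntry` type, the `read_changes` function, the integer LSNs and the sample rows are all hypothetical; real databases expose their logs through their own formats and interfaces.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass(frozen=True)
class LogEntry:
    lsn: int    # ordered log position (hypothetical integer LSN)
    op: str     # "insert", "update" or "delete"
    table: str
    row: dict


def read_changes(log: List[LogEntry], after_lsn: int) -> Iterator[LogEntry]:
    """Yield entries with an LSN greater than after_lsn, in log order."""
    for entry in sorted(log, key=lambda e: e.lsn):
        if entry.lsn > after_lsn:
            yield entry


# Toy transaction log; a real one is a database-specific file format.
transaction_log = [
    LogEntry(101, "insert", "orders", {"id": 1, "status": "new"}),
    LogEntry(102, "update", "orders", {"id": 1, "status": "paid"}),
    LogEntry(103, "delete", "orders", {"id": 1}),
]

checkpoint = 101  # assumed: LSN 101 was already delivered downstream
delivered = []
for entry in read_changes(transaction_log, checkpoint):
    delivered.append((entry.lsn, entry.op))
    checkpoint = entry.lsn  # advance only after the change is handled

print(delivered)   # [(102, 'update'), (103, 'delete')]
print(checkpoint)  # 103
```

Because the checkpoint only moves forward after a change is handled, a reader that restarts can ask for everything after its last recorded position without reprocessing earlier entries.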

After changes are captured, they are streamed in real time or near real time to downstream systems such as data lakes, data warehouses, streaming data platforms such as Apache Kafka, stream processing engines such as Apache Spark, and ETL/ELT pipelines.

CDC approaches: Push vs. pull

Change data capture can be initiated by either the source system (a push-based approach) or the target system (a pull-based approach). The core difference lies in which system is responsible for capturing and transmitting changes.

Push-based CDC

In a push-based CDC model, the source system detects changes and immediately “pushes” or sends them to target systems. This approach is commonly implemented using database transaction logs, event streams or message brokers such as Apache Kafka.

Since changes are sent as they occur, push-based CDC typically supports use cases that require real-time or near real-time data movement, such as streaming analytics, event-driven architectures and AI/ML systems.
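
The push model can be sketched in a few lines of Python. This is a toy, in-process illustration rather than a real CDC tool: the `PushSource` class, its `subscribe` method and the event dictionary shape are hypothetical names chosen for this example.

```python
from typing import Callable, Dict, List

ChangeEvent = Dict[str, object]
Handler = Callable[[ChangeEvent], None]


class PushSource:
    """Toy source system that pushes each change to registered targets."""

    def __init__(self) -> None:
        self._rows: Dict[int, dict] = {}
        self._handlers: List[Handler] = []

    def subscribe(self, handler: Handler) -> None:
        self._handlers.append(handler)

    def upsert(self, key: int, values: dict) -> None:
        op = "update" if key in self._rows else "insert"
        self._rows[key] = dict(values)
        self._push({"op": op, "key": key, "after": dict(values)})

    def delete(self, key: int) -> None:
        if self._rows.pop(key, None) is not None:
            self._push({"op": "delete", "key": key, "after": None})

    def _push(self, event: ChangeEvent) -> None:
        # The source initiates delivery as soon as the change happens.
        for handler in self._handlers:
            handler(event)


replica: Dict[int, dict] = {}
received: List[str] = []


def apply_to_replica(event: ChangeEvent) -> None:
    """Target-side handler that keeps a replica in step with the source."""
    received.append(event["op"])
    if event["op"] == "delete":
        replica.pop(event["key"], None)
    else:
        replica[event["key"]] = event["after"]


source = PushSource()
source.subscribe(apply_to_replica)
source.upsert(1, {"status": "new"})
source.upsert(1, {"status": "shipped"})
source.upsert(2, {"status": "new"})
source.delete(2)

print(received)  # ['insert', 'update', 'insert', 'delete']
print(replica)   # {1: {'status': 'shipped'}}
```

In this sketch the target never asks for data; it only reacts to events, which is why push-based designs tend to suit low-latency use cases.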

Pull-based CDC

In pull-based CDC, the target system regularly polls source systems and “pulls” changes when they’re found. Polling can occur on a fixed schedule, which makes pull-based CDC well-suited for batch-oriented workloads or systems that don’t require immediate updates.

While this approach is simpler and requires less complex infrastructure than push-based CDC, it can introduce higher latency and increase query loads on source databases, affecting database performance. Many modern data platforms support both approaches, so teams can choose based on data needs and operational requirements.
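
For comparison, a pull-based flow might look like the following minimal sketch. The `PollableSource` and `PollingTarget` classes and the `seq` counter are hypothetical; in practice the poll would usually be a query against the source database, run on a schedule.

```python
from typing import Dict, List


class PollableSource:
    """Toy source that keeps an append-only list of numbered changes."""

    def __init__(self) -> None:
        self._changes: List[dict] = []

    def write(self, key: str, value: int) -> None:
        seq = len(self._changes) + 1
        self._changes.append({"seq": seq, "key": key, "value": value})

    def changes_since(self, seq: int) -> List[dict]:
        return [c for c in self._changes if c["seq"] > seq]


class PollingTarget:
    """Toy target that pulls changes whenever poll_once() is called."""

    def __init__(self, source: PollableSource) -> None:
        self.source = source
        self.last_seq = 0
        self.replica: Dict[str, int] = {}

    def poll_once(self) -> int:
        batch = self.source.changes_since(self.last_seq)
        for change in batch:
            self.replica[change["key"]] = change["value"]
            self.last_seq = change["seq"]
        return len(batch)


source = PollableSource()
target = PollingTarget(source)

source.write("sku-1", 10)
source.write("sku-1", 7)
source.write("sku-2", 3)

stale = dict(target.replica)  # nothing has been pulled yet
applied = target.poll_once()  # the target initiates the transfer
idle = target.poll_once()     # a poll with no new changes still costs a query

print(stale)           # {}
print(applied, idle)   # 3 0
print(target.replica)  # {'sku-1': 7, 'sku-2': 3}
```

The replica stays stale until the next poll runs, which illustrates the latency trade-off described above, and every poll touches the source even when nothing has changed.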

Common methods for change data capture

There are several methods for executing change data capture. Common types of CDC include:

Log-based CDC
Timestamp-based CDC
Trigger-based CDC

Log-based CDC

Database transaction logs are a standard feature of databases and are used to record all database transactions. (Transaction log files can be used to recover databases in the event of a system failure.)

In log-based CDC, a CDC application processes the database changes (to both data and metadata) recorded in the log and shares the updates with other systems. Log-based CDC has become increasingly popular due to its efficiency: it reads changes from logs rather than running queries against source tables, and such queries can place substantial loads on source systems. However, variation in transaction log formats can complicate log-based CDC execution across different databases.

Timestamp-based CDC

Timestamp-based change data capture, also known as query-based CDC, requires that source tables include a column, such as a last-modified timestamp, that records the date and time of each record change. A CDC tool can then identify changed records by querying the timestamp column in a source table and deliver the updates to target systems.

While timestamp-based CDC can be simple to implement, it can also put additional load on a source system when polling for timestamp data occurs frequently. Timestamp-based CDC also fails to capture delete operations, because the timestamp is deleted along with the rest of the row and later queries find nothing to report.
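
The following sketch shows one way a timestamp-based poll could work, using Python's built-in `sqlite3` module and an in-memory database. The `customers` table, the `updated_at` column, the `poll_changes` helper and the integer timestamps are assumptions made for this example, not part of any particular CDC product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ada", 100), (2, "Grace", 100), (3, "Linus", 100)],
)


def poll_changes(conn, since):
    """Return rows modified after `since` and the new high-water mark."""
    rows = conn.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at, id",
        (since,),
    ).fetchall()
    new_mark = max((row[2] for row in rows), default=since)
    return rows, new_mark


high_water_mark = 100  # assumed: rows up to timestamp 100 were already copied

# One update and one hard delete happen on the source.
conn.execute("UPDATE customers SET name = 'Ada L.', updated_at = 110 WHERE id = 1")
conn.execute("DELETE FROM customers WHERE id = 2")

changes, high_water_mark = poll_changes(conn, high_water_mark)
print(changes)          # [(1, 'Ada L.', 110)]
print(high_water_mark)  # 110
```

The update is picked up, but the deleted row never appears in the result, so the target would keep a stale copy of customer 2 unless deletes are tracked some other way.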

Trigger-based CDC

In trigger-based change data capture, stored procedures or functions known as database triggers are executed once specific modifications (such as insertions, deletions and updates) occur in a database. The changed data is then stored in what’s often called a change table or shadow table.

Like timestamp-based CDC, trigger-based CDC can be simple to implement. However, it can also tax source systems because triggers are “fired” each time a qualifying change is made to the source table, adding extra work to every write.
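
As a minimal sketch of the trigger-based approach, the example below uses Python's built-in `sqlite3` module with an in-memory database. The `products` table, the `products_changes` change table and the trigger names are hypothetical, and trigger syntax differs between database systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE products (id INTEGER PRIMARY KEY, price REAL);

    -- Change (shadow) table that the triggers write to.
    CREATE TABLE products_changes (
        change_id  INTEGER PRIMARY KEY AUTOINCREMENT,
        op         TEXT NOT NULL,
        product_id INTEGER NOT NULL,
        price      REAL
    );

    CREATE TRIGGER products_after_insert AFTER INSERT ON products BEGIN
        INSERT INTO products_changes (op, product_id, price)
        VALUES ('insert', NEW.id, NEW.price);
    END;

    CREATE TRIGGER products_after_update AFTER UPDATE ON products BEGIN
        INSERT INTO products_changes (op, product_id, price)
        VALUES ('update', NEW.id, NEW.price);
    END;

    CREATE TRIGGER products_after_delete AFTER DELETE ON products BEGIN
        INSERT INTO products_changes (op, product_id, price)
        VALUES ('delete', OLD.id, NULL);
    END;
    """
)

# Ordinary writes to the source table; the triggers record each one.
conn.execute("INSERT INTO products VALUES (1, 9.99)")
conn.execute("UPDATE products SET price = 12.5 WHERE id = 1")
conn.execute("DELETE FROM products WHERE id = 1")

changes = conn.execute(
    "SELECT op, product_id, price FROM products_changes ORDER BY change_id"
).fetchall()
print(changes)
# [('insert', 1, 9.99), ('update', 1, 12.5), ('delete', 1, None)]
```

Unlike the timestamp approach, the delete is captured here. The cost is that each write to `products` now also performs a write to the change table.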

Common CDC sources and destinations

To help paint a complete picture of CDC, let’s review some common CDC sources and destinations.

CDC sources are the systems where data originates, such as:

Relational databases (PostgreSQL, MySQL, Microsoft SQL Server)
NoSQL databases (Apache Cassandra, MongoDB)
Enterprise SaaS applications (Salesforce, SAP)

CDC destinations are the systems that data is streamed or replicated to, such as:

Data streaming platforms (Apache Kafka, Amazon Kinesis, Google Cloud Pub/Sub)
Data warehouses and lakehouses (Snowflake, Amazon Redshift, Google BigQuery)
Cloud object storage (AWS S3, Azure Blob Storage, Google Cloud Storage)

Connecting sources and destinations typically requires CDC tools, connectors and data integration platforms.

ETL vs. CDC: Key differentiators

ETL (extract, transform, load) and change data capture are both widely used data integration approaches, but they are designed for different purposes.

Below are some of the key differences between ETL and CDC:

Data movement: ETL pipelines typically ingest entire datasets or large batches of data. CDC only captures and transmits changes.
Processing speed and latency: ETL is commonly batch-oriented at scheduled intervals. CDC is designed for low-latency data movement and continuous synchronization.
Primary use cases: ETL is often used for business intelligence, historical reporting and machine learning. CDC is commonly used for real-time analytics, fraud detection and event-driven architectures.
Data transformation: ETL pipelines cleanse and transform data before loading. CDC systems just identify and replicate changes without further processing.
System impact: Traditional ETL processes place heavier strain on source systems with repeated batch workloads. CDC minimizes overhead by only transmitting changes.

Today’s organizations commonly use both ETL and CDC, often together. For example, CDC complements ETL pipelines by transmitting incremental updates after the initial data load. This allows datasets to stay updated in real time as changes occur in source systems, without having to wait for the next ETL job to run.
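
To make the difference in data movement concrete, here is a small Python sketch of an initial full load followed by incremental changes. The `full_reload` and `apply_changes` functions, the event shape and the 1,000-row inventory are hypothetical, and the change events are written by hand where a CDC tool would normally produce them.

```python
from typing import Dict, List

Row = Dict[str, object]


def full_reload(source: Dict[int, Row]) -> Dict[int, Row]:
    """Batch-style load: copy every row, changed or not."""
    return {key: dict(row) for key, row in source.items()}


def apply_changes(target: Dict[int, Row], events: List[dict]) -> int:
    """CDC-style update: apply only the captured changes."""
    for event in events:
        if event["op"] == "delete":
            target.pop(event["key"], None)
        else:
            target[event["key"]] = dict(event["after"])
    return len(events)


# Initial load of a 1,000-row table.
source = {i: {"sku": f"SKU-{i}", "qty": 10} for i in range(1, 1001)}
target = full_reload(source)
rows_moved_initial = len(target)

# Two changes occur in the source after the initial load.
source[7]["qty"] = 4
del source[9]
events = [
    {"op": "update", "key": 7, "after": {"sku": "SKU-7", "qty": 4}},
    {"op": "delete", "key": 9, "after": None},
]

rows_moved_cdc = apply_changes(target, events)  # incremental path
rows_moved_batch = len(full_reload(source))     # what a full reload would move

print(rows_moved_initial, rows_moved_cdc, rows_moved_batch)  # 1000 2 999
print(target == source)  # True
```

Both paths end with the same target contents, but in this toy case the incremental path moves 2 records where a repeated full reload would move 999.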

SCD vs. CDC: How do they differ from one another?

CDC and slowly changing dimensions (SCDs) work together to keep target systems accurate and up to date.

While CDC captures and transmits changes from source systems, SCDs define how those changes are managed and stored within dimension tables in a data warehouse.

(In this context, dimension data typically refers to dimension tables in data warehouses that store descriptive attributes such as customer addresses or phone numbers.)

There are two common types of SCDs: Type 1 and Type 2.

SCD Type 1: Overwrites existing data in a dimension table with new data, without retaining history

SCD Type 2: Adds a new row to a dimension table, preserving full historical changes over time
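
The two SCD types can be sketched in Python as two different ways of applying the same address change. The `DimRow` type, the `apply_type1` and `apply_type2` functions, the integer timestamps and the sample addresses are hypothetical; real warehouses typically implement this logic in SQL or in a transformation tool.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class DimRow:
    customer_id: int
    address: str
    valid_from: int
    valid_to: Optional[int] = None  # None marks the current version


def apply_type1(dim: Dict[int, str], customer_id: int, address: str) -> None:
    """SCD Type 1: overwrite in place, keeping no history."""
    dim[customer_id] = address


def apply_type2(dim: List[DimRow], customer_id: int, address: str, changed_at: int) -> None:
    """SCD Type 2: close the current row and append a new version."""
    for row in dim:
        if row.customer_id == customer_id and row.valid_to is None:
            if row.address == address:
                return  # nothing actually changed
            row.valid_to = changed_at
    dim.append(DimRow(customer_id, address, valid_from=changed_at))


# Captured changes: (customer_id, new address, change time).
changes = [(1, "12 Oak St", 100), (1, "9 Elm Ave", 200)]

type1_dim: Dict[int, str] = {}
type2_dim: List[DimRow] = []
for customer_id, address, changed_at in changes:
    apply_type1(type1_dim, customer_id, address)
    apply_type2(type2_dim, customer_id, address, changed_at)

print(type1_dim)       # {1: '9 Elm Ave'}
print(len(type2_dim))  # 2
for row in type2_dim:
    print(row)
```

After both changes, the Type 1 table holds only the latest address, while the Type 2 table holds one closed row for the old address and one current row for the new one.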

Change data capture tools

Change data capture (CDC) tools capture and stream database changes in real time, helping organizations support modern data integration, analytics and event-driven architectures.

CDC capabilities may be built into specific database or cloud environments, as with AWS Database Migration Service (DMS), or may be provided by tools that work across platforms. Common CDC solutions include open-source tools such as Debezium and commercial platforms such as IBM StreamSets and Oracle GoldenGate.

Many organizations use Apache Kafka as the foundation for CDC pipelines. Kafka-based CDC architectures can capture database changes, stream them through Kafka topics and deliver them to downstream applications, data warehouses, analytics platforms and AI systems.

When evaluating CDC tools, organizations often consider:

Scalability
Pricing
Latency
Connector support
Kafka integration
Reliability
Deployment flexibility
API support

Use cases for change data capture

Businesses can deploy change data capture for a variety of uses, including:

Fraud detection

Continuously tracking changes in financial records through change data capture can enable the detection of fraudulent activity before it results in substantial losses.

Internet of Things (IoT) enablement

CDC can efficiently integrate the massive amounts of real-time data generated by IoT devices, enabling predictive maintenance and real-time monitoring.

Inventory and supply chain management

Access to real-time sales, inventory and supply chain information supported by change data capture can help companies avoid stock-outs and make lucrative pricing decisions.

Regulatory compliance

Change data capture can help highly regulated companies keep accurate records necessary for reporting and compliance with regulations and laws such as the EU’s GDPR and, in the US, the Sarbanes-Oxley (SOX) Act and HIPAA.

Authors

Alice Gomstyn

Staff Writer

IBM Think

Alexandra Jonker

Staff Editor

IBM Think
