SQL Server CDC Client
The SQL Server CDC Client origin processes data in Microsoft SQL Server change data capture (CDC) tables. The origin fetches changes in time windows and uses multiple threads to enable parallel processing of data. For information about supported versions, see Supported Systems and Versions in the Data Collector documentation.
Use the SQL Server CDC Client origin to generate records from CDC tables. To read data from Microsoft SQL Server change tracking tables, use the SQL Server Change Tracking origin. For more information about the differences between CDC and change tracking data, see the Microsoft SQL Server documentation. To read data from SQL Server temporal tables, use the JDBC Multitable Consumer origin or the JDBC Query Consumer origin. For more information about temporal tables, see the Microsoft documentation.
By default, the SQL Server CDC Client origin generates a record for each record in the CDC table. You can configure the origin to combine the two update records that SQL Server generates for each update. This changes the structure of the generated records and produces additional record header attributes.
The SQL Server CDC Client origin includes the CRUD operation type in a record header attribute so generated records can be easily processed by CRUD-enabled destinations. For an overview of Data Collector changed data processing and a list of CRUD-enabled destinations, see Processing Changed Data.
You might use this origin to perform database replication. You can use a separate pipeline with the JDBC Query Consumer or JDBC Multitable Consumer origin to read existing data. Then start a pipeline with the SQL Server CDC Client origin to process subsequent changes.
When you configure the origin, you specify the SQL Server capture instance names - the origin processes the related CDC tables. You can define groups of tables in the same database and any initial offsets to use. When you omit initial offsets, the origin processes all available data in the CDC tables.
You can enable late table processing to allow the origin to process tables that appear after the pipeline starts. You can also configure the origin to check for schema changes in processed tables and to generate an event after discovering a change.
To determine how the origin connects to the database, you specify connection information, a query interval, number of retries, and any custom JDBC configuration properties that you need. You can configure advanced connection properties. You can also use a connection to configure the origin.
The origin can generate events for an event stream. For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.
When a pipeline stops, the SQL Server CDC Client origin notes where it stops reading. When the pipeline starts again, the origin continues processing from where it stopped by default. You can reset the origin to process all requested data.
JDBC Driver
When connecting to Microsoft SQL Server, you do not need to install a JDBC driver. Data Collector includes the JDBC driver required for SQL Server.
Supported Operations
The SQL Server CDC Client origin supports the SQL Server insert and delete operations. For updates, the CDC record captured after the update operation is treated as an update, and the CDC record captured before the update operation is treated as an unsupported operation.
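As an illustration of these semantics, the following minimal sketch maps the SQL Server CDC __$operation codes (1 = delete, 2 = insert, 3 = update captured before the update, 4 = update captured after the update) to generic CRUD labels. It mirrors the behavior described above but is not the origin's source code.

```java
// Minimal sketch, not StreamSets source code: map SQL Server CDC __$operation codes
// to generic CRUD labels, following the semantics described in this section.
public final class CdcOperationMapper {

    static String toCrudType(int cdcOperationCode) {
        switch (cdcOperationCode) {
            case 2: return "INSERT";
            case 1: return "DELETE";
            case 4: return "UPDATE";        // update captured after the update operation
            case 3: return "UNSUPPORTED";   // update captured before the update operation
            default: throw new IllegalArgumentException("Unknown __$operation code: " + cdcOperationCode);
        }
    }

    public static void main(String[] args) {
        for (int code = 1; code <= 4; code++) {
            System.out.println("__$operation=" + code + " -> " + toCrudType(code));
        }
    }
}
```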
Time Windows
The SQL Server CDC Client origin fetches changes from tables in time windows. By default, the origin fetches changes in one time window, beginning with the last committed offset and ending with the latest data. When you have large volumes of data, you can improve performance by configuring multiple time windows.
To configure multiple time windows, set the Maximum Transaction Length property to the desired size of the time windows. The first time window starts at the initial offset or last saved offset and ends the specified number of seconds later. The next window starts where the previous ended and ends the specified number of seconds later, and so on. Within each time window, the origin creates batches based on the batch settings.
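For example, the following sketch splits the span between the last saved offset and the latest available data into consecutive windows of Maximum Transaction Length seconds. It is an assumed illustration of the windowing arithmetic, not the origin's implementation.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Assumed illustration of the time-window arithmetic, not the origin's implementation.
public final class TimeWindowPlanner {

    record Window(Instant start, Instant end) {}

    static List<Window> plan(Instant lastOffset, Instant latest, long maxTransactionLengthSeconds) {
        List<Window> windows = new ArrayList<>();
        Instant start = lastOffset;
        while (start.isBefore(latest)) {
            Instant end = start.plus(Duration.ofSeconds(maxTransactionLengthSeconds));
            if (end.isAfter(latest)) {
                end = latest;              // the last window ends at the latest available data
            }
            windows.add(new Window(start, end));
            start = end;                   // the next window starts where the previous one ended
        }
        return windows;
    }

    public static void main(String[] args) {
        Instant offset = Instant.parse("2024-01-01T00:00:00Z");
        Instant latest = offset.plus(Duration.ofMinutes(10));
        // With Maximum Transaction Length set to 180 seconds, a ten-minute span becomes four windows.
        plan(offset, latest, 180).forEach(System.out::println);
    }
}
```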
When you enable event generation, the origin produces a no-more-data event at the end of each time window, even when subsequent time windows remain for processing.
Previewing data shows no values when there are no changes during the first time window.
Multithreaded Processing
The SQL Server CDC Client origin performs parallel processing and enables the creation of a multithreaded pipeline.
When you start the pipeline, the SQL Server CDC Client origin retrieves the list of CDC tables associated with the source tables defined in the table configurations. The origin then uses multiple concurrent threads based on the Number of Threads property. Each thread reads data from a single table.
As the pipeline runs, each thread connects to the origin system, creates a batch of data, and passes the batch to an available pipeline runner. A pipeline runner is a sourceless pipeline instance - an instance of the pipeline that includes all of the processors, executors, and destinations in the pipeline and handles all pipeline processing after the origin.
Each pipeline runner processes one batch at a time, just like a pipeline that runs on a single thread. When the flow of data slows, the pipeline runners wait idly until they are needed, generating an empty batch at regular intervals. You can configure the Runner Idle Time pipeline property to specify the interval or to opt out of empty batch generation.
Multithreaded pipelines preserve the order of records within each batch, just like a single-threaded pipeline. But since batches are processed by different pipeline runners, the order that batches are written to destinations is not ensured.
For more information about multithreaded pipelines, see Multithreaded Pipeline Overview.
Example
Say you are reading from 10 tables. You set the Number of Threads property to 5 and the Maximum Pool Size property to 6. When you start the pipeline, the origin retrieves the list of tables. The origin then creates five threads to read from the first five tables, and by default Data Collector creates a matching number of pipeline runners. Upon receiving data, a thread passes a batch to each of the pipeline runners for processing.
At any given moment, the five pipeline runners can each process a batch, so this multithreaded pipeline processes up to five batches at a time. When incoming data slows, the pipeline runners sit idle, available for use as soon as the data flow increases.
Batch Strategy
You can specify the batch strategy to use when processing data:
- Process all available rows from the table
- Each thread processes all available rows from a table. A thread runs a SQL query and processes all of the results for a table. Then, the thread switches to the next available table.
- Switch tables
- When the origin performs multithreaded table processing for all tables, each thread processes one batch of data, then switches to an available table and repeats the process. When a pipeline starts, each thread runs a SQL query, generates a result set, and processes a batch of records from the result set. The database driver caches the remaining records for the same thread to access again. Then, the thread switches to the next available table.
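The following runnable sketch is purely illustrative of the switch-tables behavior, not StreamSets code: five worker threads share a queue of ten hypothetical CDC tables, and each thread repeatedly claims an available table, reads one batch, and returns the table to the queue so another thread can pick it up.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Illustrative only: threads claim an available table, process one batch, then release the table.
public final class SwitchTablesSketch {

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> tables = new LinkedBlockingQueue<>(
            List.of("dbo_t1_CT", "dbo_t2_CT", "dbo_t3_CT", "dbo_t4_CT", "dbo_t5_CT",
                    "dbo_t6_CT", "dbo_t7_CT", "dbo_t8_CT", "dbo_t9_CT", "dbo_t10_CT"));
        ExecutorService threads = Executors.newFixedThreadPool(5);   // Number of Threads = 5

        for (int i = 0; i < 5; i++) {
            threads.submit(() -> {
                try {
                    for (int batch = 0; batch < 4; batch++) {        // bounded loop for the demo
                        String table = tables.take();                // claim an available table
                        System.out.println(Thread.currentThread().getName() + " reads a batch from " + table);
                        tables.put(table);                           // release the table for other threads
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        threads.shutdown();
        threads.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```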
Table Configuration
When you configure SQL Server CDC Client, you can define multiple CDC tables using a single set of table configuration properties. You can also define multiple table configurations to process multiple groups of CDC tables.
- Capture Instance Name
- Determines the CDC tables to process. The naming convention for Microsoft SQL Server CDC tables is <capture instance name>_CT. When specifying this property, use the capture instance name, not the names of the CDC tables to be processed. For example, specify the dbo.customer source table, not the associated CDC table, dbo_customer_CT.
- Table exclusion pattern
- Optionally specify a regex pattern for the table names that you want to exclude from the query.
For example, say you want to process all CDC tables in the schema except for those that start with "dept". You can use the default % for the table name pattern, and enter dept* for the table exclusion pattern.
For more information about using regular expressions with Data Collector, see Regular Expressions Overview.
- Initial offset
- Optionally specify an initial offset to determine where the origin starts processing. When not set, the origin processes all available CDC data.
Initial Table Order Strategy
You can define the initial order that the origin uses to read the tables.
- None
- Reads the tables in the order that they are listed in the database.
- Alphabetical
- Reads the tables in alphabetical order.
The origin uses the table order strategy only for the initial reading of the tables. When threads switch back to previously read tables, they read from the next available table, regardless of the defined order.
Allow Late Table Processing
You can configure the SQL Server CDC Client to process data in CDC tables that appear after the pipeline starts.
When you allow late table processing, the SQL Server CDC Client origin uses a background thread to check for late CDC tables. The origin checks at regular user-defined intervals.
To allow late table processing, configure the following properties:
- On the JDBC tab, select the Allow Late Tables property.
- To define the time to wait before checking for new tables, configure the New Table Discovery Interval property.
- On the Advanced tab, set the Maximum Pool Size and Minimum Idle Connections properties to one thread more than the Number of Threads property.
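As a rough sketch of what such a background check might look like, the following code polls for new capture instances at a fixed interval. The listCaptureInstances() helper is hypothetical, standing in for a query against the SQL Server CDC metadata (for example, cdc.change_tables); this is not the origin's implementation.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Rough sketch, not the origin's implementation: poll for late capture instances on a schedule.
public final class LateTableWatcher {

    private final Set<String> knownInstances = new HashSet<>();

    void checkForLateTables() {
        for (String instance : listCaptureInstances()) {
            if (knownInstances.add(instance)) {
                System.out.println("Discovered late capture instance: " + instance);
                // a real implementation would hand the new table to the reader threads here
            }
        }
    }

    static Set<String> listCaptureInstances() {
        // Hypothetical stand-in for a query against the SQL Server CDC metadata.
        return Set.of("dbo_customer", "dbo_orders");
    }

    public static void main(String[] args) {
        LateTableWatcher watcher = new LateTableWatcher();
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // New Table Discovery Interval: check every 30 seconds in this sketch.
        scheduler.scheduleAtFixedRate(watcher::checkForLateTables, 0, 30, TimeUnit.SECONDS);
    }
}
```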
Checking for Schema Changes
You can configure the SQL Server CDC Client origin to check for schema changes in the tables being processed. When checking for schema changes, the origin includes a schema check statement in the SQL query and does the following:
- Compares current table schemas with the original table schemas at regular intervals, based on the Query Interval property.
- If it determines that the schema of a table has changed, it generates a schema-change event that states the table or capture instance name with the changed schema. The origin generates a schema-change event each time that it finds a schema change: one for each table with a schema change.
Note: Since the origin continues to check for schema changes at regular intervals until the pipeline stops, a single schema change can generate a large volume of events.
- Can write the exact column name or data type change to the Data Collector log. To enable writing to the log, the log level must be set to Trace. For information about changing the log level, see Modifying the Log Level in the Data Collector documentation.
To configure the origin to check for schema changes and generate events:
- On the General tab, select the Produce Events property.
- On the JDBC tab, select the Enable Schema Changes Event property.
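The comparison itself can be pictured with the following minimal sketch, which checks a table's original column definitions against the ones currently reported by the database. The column maps and table name are hypothetical; this is not the origin's implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch, not the origin's implementation: detect a schema change by comparing
// the original column definitions with the current ones.
public final class SchemaChangeCheck {

    static boolean schemaChanged(Map<String, String> original, Map<String, String> current) {
        return !original.equals(current);   // any added, dropped, or retyped column counts as a change
    }

    public static void main(String[] args) {
        Map<String, String> original = new LinkedHashMap<>();
        original.put("id", "int");
        original.put("name", "varchar(50)");

        Map<String, String> current = new LinkedHashMap<>(original);
        current.put("name", "varchar(100)");    // data type change

        if (schemaChanged(original, current)) {
            System.out.println("schema-change event for capture instance dbo_customer");
        }
    }
}
```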
Generated Record
The SQL Server CDC Client origin generates records from CDC tables, placing the CDC information, such as the CDC operation type and start LSN values, in record header attributes and generating field attributes for some converted fields.
You can use the Record Format property to determine how the origin handles the two records that SQL Server generates for each update:
- Basic - Generates two records for updates, one with the old data and one with the changed data.
- Basic discarding ‘Before Update’ records - Generates one record for updates, containing the changed data.
- Rich - Generates one record for updates with data written to the Data field, OldData field, or both.
The origin generates a record header attribute named record_format, which indicates the format of the generated record: 1 indicates basic format, 2 indicates basic discarding “before update” records, and 3 indicates rich.
Basic Record Format
By default, the Record Format property is set to Basic. The origin creates one record for each record in the CDC table. SQL Server generates two records in the CDC table for each update, one with the old data and one with the changed data. Therefore, each update operation results in two records. The origin writes the data to fields that match the data fields in the CDC table.
For example, suppose a CDC table with three fields (id, name, and dt) has four records: a record for an insert, two records for an update, and a record for a delete. With the Record Format property set to Basic, the origin generates four records, one for each record in the CDC table, as shown below.
Basic Discarding Before Update Record Format
When you set the Record Format to Basic discarding Before Update records, the origin creates one record for each record in the CDC table. For updates, the record contains the changed data.
For example, suppose a CDC table with three fields (id, name, and dt) has four records: a record for an insert, two records for an update, and a record for a delete. With the Record Format property set to Basic discarding Before Update records, the origin generates three records, one for the insert record, one for the delete record, and a single update record showing the changed data, as shown below.

Rich Record Format
You might set the Record Format property to Rich to produce one record for each update, similar to how other origins generate records. With the Rich record format:
- For an insert operation, the origin writes the inserted data in the Data map field and does not create an OldData field.
- For an update operation, the origin writes the data before the update in an OldData map field and the data after the update in a Data map field.
- For a delete operation, the origin writes the data before the deletion in the OldData map field and does not create a Data field.
For example, suppose a CDC table with three fields (id, name, and dt) has four records: a record for an insert, two records for an update, and a record for a delete. With the Record Format property set to Rich, the origin generates three records, as shown below.
Note that the update record includes the previous data in an OldData map field, and the updated data in a Data map field. If all you need is the updated data, you can use the Field Remover processor to drop the OldData field from the record. And you could use a Field Flattener processor to flatten the fields in the Data field.
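For reference, the following runnable sketch mirrors the rich-format field layout for the example table (id, name, dt). The values are hypothetical, and the maps only stand in for the generated record structure; they are not Data Collector API objects.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative layout of rich-format records for the example table (id, name, dt).
public final class RichFormatExample {

    public static void main(String[] args) {
        // Insert: Data only, no OldData field.
        Map<String, Object> insertRecord = Map.of(
            "Data", Map.of("id", 1, "name", "Anna", "dt", "2024-01-01 10:00:00"));

        // Update: OldData holds the row before the update, Data holds the row after it.
        Map<String, Object> updateRecord = new LinkedHashMap<>();
        updateRecord.put("OldData", Map.of("id", 1, "name", "Anna", "dt", "2024-01-01 10:00:00"));
        updateRecord.put("Data", Map.of("id", 1, "name", "Anne", "dt", "2024-01-02 09:30:00"));

        // Delete: OldData only, no Data field.
        Map<String, Object> deleteRecord = Map.of(
            "OldData", Map.of("id", 1, "name", "Anne", "dt", "2024-01-02 09:30:00"));

        System.out.println(insertRecord);
        System.out.println(updateRecord);
        System.out.println(deleteRecord);
    }
}
```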
Record Header Attributes
The SQL Server CDC Client origin generates JDBC record header attributes that provide the SQL Server CDC data for each record, such as the start or end log sequence numbers (LSNs).
The origin also includes the sdc.operation.type attribute, the record_format attribute, and information from the SQL Server CDC tables. The SQL Server CDC Client header attributes are prefixed with jdbc. The SQL Server CDC column names are included in the header attribute names, as follows: jdbc.<CDC column name>.
You can use the record:attribute or record:attributeOrDefault functions to access the information in the attributes. For more information about working with record header attributes, see Working with Header Attributes.
Header Attribute Name | Description |
---|---|
sdc.operation.type | Provides one of the following values to represent the operation type: 1 for insert, 2 for delete, 3 for update, 5 for unsupported operations, such as updates captured before the update operation. |
record_format | Provides one of the following values to represent the generated record format: 1 for basic, 2 for basic discarding “before update” records, 3 for rich. |
jdbc.cdc.source_schema_name | Provides the source schema for the CDC data. |
jdbc.cdc.source_name | Provides the source table for the CDC data. |
jdbc.tables | Provides a comma-separated list of source tables for the fields in the record. Note: Not all JDBC drivers provide this information. |
jdbc.<column name>.jdbcType | Provides the numeric value of the original SQL data type for each field in the record. See the Java documentation for a list of the data types that correspond to numeric values. Because the record read from the SQL Server CDC table includes CDC columns, the generated record also includes corresponding attributes for those columns. For example, since the original data includes a __$start_lsn column, the resulting record has a jdbc.__$start_lsn.jdbcType attribute. |
jdbc.<column name>.jdbc.precision | Provides the original precision for all numeric and decimal fields. |
jdbc.<column name>.jdbc.scale | Provides the original scale for all numeric and decimal fields. |
jdbc.primaryKeySpecification | Provides a JSON-formatted string that lists the columns that form the primary key in the table and the metadata for those columns. A table without a primary key contains the attribute with an empty value. |
jdbc.primaryKey.before.<primary key column name> | Provides the old value for the specified primary key column. Generated when the record represents an update operation on a table with a primary key. |
jdbc.primaryKey.after.<primary key column name> | Provides the new value for the specified primary key column. Generated when the record represents an update operation on a table with a primary key. |
jdbc.__$command_id | Data from the SQL Server CDC __$command_id column. |
jdbc.__$end_lsn | Data from the SQL Server CDC __$end_lsn column. |
jdbc.__$operation | The CRUD operation type using SQL Server codes, as defined in the SQL Server CDC __$operation column. |
jdbc.__$seqval | Data from the SQL Server CDC __$seqval column. |
jdbc.__$start_lsn | Data from the SQL Server CDC __$start_lsn column. |
jdbc.__$update_mask | Data from the SQL Server CDC __$update_mask column. |
For details about the CDC attributes, see the SQL Server documentation.
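The following runnable sketch uses a plain map as a stand-in for a record's header attributes (it is not the Data Collector API) to show the attribute names you would pass to the record:attribute or record:attributeOrDefault functions in a pipeline expression. All values shown are hypothetical.

```java
import java.util.Map;

// Illustrative only: the map stands in for a record's header attributes.
public final class HeaderAttributeExample {

    public static void main(String[] args) {
        Map<String, String> header = Map.of(
            "sdc.operation.type", "3",                        // hypothetical: an update record
            "record_format", "3",                             // rich record format
            "jdbc.cdc.source_schema_name", "dbo",
            "jdbc.cdc.source_name", "customer",
            "jdbc.__$start_lsn", "00000027:00000ac0:0003");   // hypothetical LSN value

        // Roughly equivalent to ${record:attributeOrDefault('jdbc.cdc.source_name', 'unknown')}.
        String table = header.getOrDefault("jdbc.cdc.source_name", "unknown");
        System.out.println("Change captured on " + table
            + " starting at LSN " + header.get("jdbc.__$start_lsn"));
    }
}
```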
CRUD Operation Header Attributes
The SQL Server CDC Client origin specifies the CRUD operation type in the following record header attributes:
- sdc.operation.type
- The SQL Server CDC Client origin writes the operation type to the sdc.operation.type record header attribute.
- jdbc.__$operation
- The SQL Server CDC Client origin places the values from the SQL Server __$operation column in the jdbc.__$operation record header attribute. As a result, the jdbc.__$operation record header attribute contains the CRUD operation type as defined using SQL Server CDC codes.
Field Attributes
The SQL Server CDC Client origin generates field attributes for columns converted to the Decimal or Datetime data types in Data Collector. The attributes provide additional information about each field.
Field attributes preserve information that the following data type conversions do not store:
- SQL Server Decimal and Numeric data types are converted to the Data Collector Decimal data type, which does not store scale and precision.
- SQL Server Datetime, Datetime2, and Smalldatetime data types are converted to the Data Collector Datetime data type, which does not store nanoseconds.
The origin generates the following field attributes to preserve this information:
Data Collector Data Type | Generated Field Attribute | Description |
---|---|---|
Decimal | precision | Provides the original precision for every decimal or numeric column. |
Decimal | scale | Provides the original scale for every decimal or numeric column. |
Datetime | nanoSeconds | Provides the original nanoseconds for every datetime, datetime2, or smalldatetime column. |
You can use the record:fieldAttribute or record:fieldAttributeOrDefault functions to access the information in the attributes. For more information about working with field attributes, see Field Attributes.
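Similarly, the following sketch uses a plain map as a stand-in for a Decimal field's attributes (not the Data Collector API) to show the names you would pass to record:fieldAttribute, for example ${record:fieldAttribute('/price', 'precision')}. The field path and values are hypothetical.

```java
import java.math.BigDecimal;
import java.util.Map;

// Illustrative only: the map stands in for a Decimal field's attributes.
public final class FieldAttributeExample {

    public static void main(String[] args) {
        BigDecimal price = new BigDecimal("1234.56");
        Map<String, String> priceAttributes = Map.of(
            "precision", "10",    // original precision of the SQL Server decimal column
            "scale", "2");        // original scale of the SQL Server decimal column

        System.out.println("price=" + price
            + " precision=" + priceAttributes.get("precision")
            + " scale=" + priceAttributes.get("scale"));
    }
}
```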
Event Generation
The SQL Server CDC Client origin can generate events that you can use in an event stream. When you enable event generation, the origin generates an event when it completes processing the data returned by the specified queries for all tables.
If you enable schema change event generation, the origin also generates an event each time it finds a schema change.
You can use the events in any logical way. For example:
- With the Pipeline Finisher executor to stop the pipeline and transition the pipeline to a Finished state when the origin completes processing available data.
When you restart a pipeline stopped by the Pipeline Finisher executor, the origin continues processing from the last-saved offset unless you reset the origin.
For an example, see Stopping a Pipeline After Processing All Available Data.
- With a destination to store event information.
For an example, see Preserving an Audit Trail of Events.
For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.
Event Record
Event records generated by the SQL Server CDC Client origin have the following event-related record header attributes:
Record Header Attribute | Description |
---|---|
sdc.event.type | Event type. Uses one of the following types: no-more-data, schema-change. |
sdc.event.version | Integer that indicates the version of the event record type. |
sdc.event.creation_timestamp | Epoch timestamp when the stage created the event. |
- no-more-data
- The origin generates a no-more-data event record when it completes processing all available data in a time window and the number of seconds configured for the event generation delay elapses without any new data to process. The origin generates the event record even when subsequent time windows remain for processing.
The no-more-data event record generated by the origin has the sdc.event.type record header attribute set to no-more-data and does not include any additional fields.
- schema-change
- The origin generates a schema-change event record only when you enable the origin to check for schema changes and the origin discovers a schema change.
The schema-change event record generated by the origin has the sdc.event.type record header attribute set to schema-change and includes the following fields:
Event Record Field | Description |
---|---|
capture-instance-name | The name of the capture instance or CDC table associated with the table with the schema change. |
source-table-schema-name | The name of the schema that contains the data table. |
source-table-name | The name of the data table that has a schema change. |