Validating data consistency using CDC Replication

There may be circumstances in which you want to discover any differences in the data on your source and target database systems. You can configure conflict detection in your table mappings to identify differences that exist in the rows changed on the source system but conflict detection will not identify differences in rows that have not been recently changed on the source system. The Differential Refresh facility in CDC Replication can assist you with this task without the need to reload the target data from the source.

You can use the Differential Refresh facility to both detect and correct any differences that exist in the table but this can only be done while the subscription is not mirroring and so will impact the latency of the data in the target system. The procedure described here instead involves the use of a second subscription and does not impact the latency of replication.

You will create a second subscription that contains the same table mappings as your primary subscription and run Differential Refresh operations using this secondary subscription. On this second subscription you will configured Differential Refresh to only log changes. The two subscriptions will not be synchronized with respect to each other and the primary subscription will be mirroring changes to the target while the Differential Refresh is running. Some of the differences detected by the Differential Refresh will be the result of this lack of synchronization. Any real differences between the source and target data will appear as persistent differences, rows that are found to be different across two or more subsequent Differential Refresh runs. Differences will be minimal when latency is minimized and the volume of changes on the source system is at a minimum.

Compare the sets of differences between runs of the secondary subscription.

If the set of differences between runs changes, the source and target can be considered synchronized.

If there are differences that are present in more than one run, those differences may indicate the rows that are not synchronized between the source and target. This may be validated by manually comparing the values of the row on the source and target to determine whether the row really is different between the source and target or whether the row on the source is changing frequently.

If a particular row is being constantly updated by the source system then it may show up in the results for each run of the secondary subscription even though its value is being correctly maintained by replication.

Persistent differences in rows which are being rarely changed on the source likely indicate that the source and target tables are no longer synchronized. The target system can be re-synchronized by manually adjusting the target rows that are different or by performing a refresh operation for the affected tables using the primary subscription. The scope of the refresh could potentially be reduced by employing a subset refresh if there is an identifiable common characteristic in the differing rows that would contain all the differing rows.

Note: Differential refresh is not available for CDC Replication Engine for InfoSphere® DataStage®, CDC Replication Engine for Netezza® Technology, CDC Replication Engine for FlexRep, CDC Replication Engine for Kafka, or CDC Replication Engine for Db2® for i datastores.