IBM Support

IBM Data Replication Change Data Capture (CDC) Best Practices

Fix Readme


Abstract

This is a collection of IDR CDC Best Practices and rules of thumb.

Content

Along with many of the best practices posts, there are items denoted by "Rule of Thumb".  These are general guidelines that will help in your planning.  The Rules of Thumb should not be treated as hard limits, but rather as useful guidance.  If your needs fall significantly outside the guidance, that certainly does not mean it cannot be done.  Rather, it would be best to engage an InfoSphere CDC subject matter expert, and you may also want to consider IBM Services for assistance.

Best Practice - CDC / CDD to DataStage Integration

There are many deployment models available for InfoSphere Data Replication's CDC technology, of which DataStage integration is a popular one.  The deployment option selected will significantly affect the complexity, performance, and reliability of the implementation.  Where possible, the best solution is always to use CDC direct replication (i.e. do not add DataStage to the mix).

CDC integration with DataStage is the right solution for replication when:

  • You need to target a database that CDC does not directly support and for which CDC FlexRep is not appropriate
  • Complex transformations are required that cannot be handled natively by CDC, such as complex table look-ups
  • You are integrating with MDM

Cons of replicating from CDC to DataStage to an eventual target database:

  • Performance going through DataStage (no matter which integration option is chosen) will be significantly slower than applying via a CDC target directly to the database
    • The exception to this rule is when targeting Teradata: if you use DataStage flat-file integration, throughput will be higher than with CDC writing directly to Teradata
  • Adding DataStage into the replication stream introduces additional points of failure
  • Having a resilient CDC installation is more complex if DataStage is also involved
  • When integrating with DataStage, there are two independent GUIs for configuration, and two places required to monitor the replication stream
  • Significant development effort is required to build DataStage jobs for each additional table added to replication
  • Incorrect DataStage job design can negatively affect transactional integrity and cause data corruption
  • The maximum number of tables per CDC subscription is lower if targeting DataStage
  • The CDC External Refresh does not work when targeting DataStage.  A separate process would have to be put in place to remove duplicate records produced during the "in-doubt" period of a refresh (the captured changes that occurred while the source data was being refreshed).

Link to Wiki containing best practices for integration with DataStage

IBM Data Replication Community Wiki - DataStage


Best Practice - Deployment Configurations for LUW

There are multiple deployment models available for InfoSphere CDC.   The deployment model chosen for the source system will significantly affect the complexity of implementation.

Here are the CDC source deployment options from the least complex to the most complex:

1. InfoSphere CDC scraper runs on the source database server

2. InfoSphere CDC scraper runs on a remote tier reading logs from a shared disk (SAN)

  • This configuration is available for Oracle and Sybase.  Db2 has a similar capability, but utilizes a remote client instead of reading from a SAN.

3. InfoSphere CDC scraper runs on a remote tier using log shipping

  • This configuration is only available for Oracle.
Rule of Thumb

You should always use the least complex deployment option that will meet the business needs.  The vast majority of CDC users install InfoSphere CDC on the source database server.


Best Practice - Things you may not be aware of

  • Using 'Standard' replication achieves much higher throughput than using 'Consolidation' or 'Summarization'
    • Standard replication can apply optimizations such as arraying and commit grouping that cannot be performed with the other replication methods
    • Note that some optimizations will also be disabled if you use Adaptive Apply or Conflict Detection & Resolution
  • Be aware of what happens when you park tables/subscriptions
    • An inactive (not currently replicating) subscription that contains tables with a replication method of Mirror will continue to accumulate change data in the staging store, from the current point back to the point where mirroring was stopped. For this reason, you should delete subscriptions or remove tables that are no longer required, or change the replication method of all tables in the subscription to Refresh, to prevent the accumulation of change data in the staging store on your source system.
    • The same is true of a parked (idle) table.  You need to ensure that its replication method is set to Refresh


Best Practice - Target Considerations

The following items need to be considered and taken into account when you are planning a replication architecture.

  • Target table triggers
    • Often, if the target is a mirror image of the source, you may have triggers on target tables that, if fired, will have an effect on other tables that InfoSphere CDC is replicating into (CDC would have mirrored the source trigger's effect and would apply duplicate actions).  To alleviate this, you should disable the trigger on the target table.
  • Referential integrity constraints with the DELETE CASCADE flag on target tables
    • Similar to triggers, having cascade deletes set on the target will cause replication to try to delete a record (based on the delete that CDC replicated from the source log) that the database may have already deleted (or vice versa).  The following strategy can be deployed to deal with cascaded deletes:
      • Disable the RI constraints on the target prior to starting replication
      • Please note that re-enabling these constraints may take some time during cut-over if you need to fail over to the target
        • Strategy: test how long re-enabling the RI constraints takes. If re-enabling all RI constraints takes too long and would impact your RTO (Recovery Time Objective), investigate whether it is possible to leave the RI constraints enabled and just change the CASCADE DELETE flag at cut-over time.
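As an illustration, the trigger and cascade-delete preparation described above might look like the following sketch (Oracle syntax; the connection, trigger, table, and constraint names are all hypothetical):

```shell
# Hypothetical sketch (Oracle syntax; all object names are invented for illustration).
# Run against the target database before starting replication.
sqlplus -s admin@TGTDB <<'SQL'
-- Disable a target trigger whose effects CDC already replicates from the source
ALTER TRIGGER orders_audit_trg DISABLE;

-- Replace an ON DELETE CASCADE constraint with a plain foreign key so that
-- CDC's replicated deletes do not collide with database-initiated cascades
ALTER TABLE order_lines DROP CONSTRAINT fk_lines_orders;
ALTER TABLE order_lines ADD CONSTRAINT fk_lines_orders
  FOREIGN KEY (order_id) REFERENCES orders (order_id);
SQL
```

At cut-over, the reverse operations would be applied; as noted above, time how long re-enabling constraints takes before relying on this approach.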


Best Practice - Logging Requirements

All log-based replication products require additional logging on the database, which will result in additional storage needs.  The following are some of the base logging requirements for InfoSphere CDC:
  • For Db2 for z/OS and Db2 LUW, each replicated table is altered to enable DATA CAPTURE CHANGES
  • For Db2 for i, journaling is enabled with both before and after images
  • For Oracle on UNIX/Linux, minimal database-level supplemental logging plus table-level supplemental logging is required
    • If using rule-based subscriptions, only PRIMARY KEY and UNIQUE INDEX supplemental logging is required
  • For SQL Server, the recovery model must be FULL or BULK-LOGGED
  • For Sybase, backup logs must be enabled, and truncate log on checkpoint must be disabled
  • For Informix, logging must be enabled, and the Informix syscdcv1.sql script must be run
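As a hedged sketch, some of the base logging settings above can be enabled with statements like these (database, schema, and table names are hypothetical; run with appropriate privileges):

```shell
# Hypothetical examples; adjust names and credentials for your environment.

# Db2 LUW: mark each replicated table for change capture
db2 "ALTER TABLE app.orders DATA CAPTURE CHANGES"

# Oracle: minimal database-level plus table-level supplemental logging
sqlplus -s / as sysdba <<'SQL'
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;
ALTER TABLE app.orders ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;
SQL

# SQL Server: the recovery model must be FULL or BULK_LOGGED
sqlcmd -Q "ALTER DATABASE SrcDb SET RECOVERY FULL"
```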


Best Practice - Memory Requirements on LUW

It is very important to allocate and configure a suitable amount of physical memory for an InfoSphere CDC instance.  Note that it needs to be physical memory that is available to CDC; for instance, on some systems you can use top to verify that sufficient resident memory is available.  Be aware that insufficient physical memory will cause significant performance degradation, due to reliance on virtual memory, disk I/O, and higher CPU from time spent cleaning up memory.

The default amount of memory, 1 GB, has been carefully chosen to work for most cases.  More memory does not necessarily mean better performance.  If you allocate significantly more memory to your CDC instance than CDC requires, it could actually cause performance to degrade, as large garbage collections could occur.  Thus, you want to start with a reasonable amount of memory and then adjust iteratively as required.  One approach is to install using the defaults and use the performance monitor to track how much memory the instance is using.  If it is running out of memory frequently, or running at over 80% average for a sustained period of time (more than 30 minutes), increase memory incrementally until usage averages around 70% of the available memory.

In cases of high volume, a large number of subscriptions, large transactions (greater than 1 GB), or LOBs, allocating sufficient memory can reduce the need to stage to disk (which you want to avoid whenever possible).
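One quick way to perform the physical-memory check described above on Linux can be sketched as follows (the dmts64 process name is assumed to be the CDC LUW instance process on a 64-bit system):

```shell
# Sketch (Linux): check whether the CDC instance is resident in physical memory
# or relying on swap. Assumes the instance process is named dmts64.
pid=$(pgrep -f dmts64 | head -n 1)
ps -o rss=,vsz= -p "$pid"    # RSS (resident) vs. VSZ (virtual), in KB
vmstat 5 3                   # non-zero si/so columns indicate swapping
```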


Best Practice - Required CPU on Source

InfoSphere CDC is efficient in its use of CPU on the source.

  • The amount of CPU used on the source will normally be minimal.  However, in cases where there is a backlog of data or a period of very high activity, such as during batch jobs, InfoSphere CDC will use as much of the available resources as required to keep up or catch up with the data generation
  • One way to limit the amount of CPU resource that InfoSphere CDC uses is to change the priority of the job.  In general, you will want InfoSphere CDC to have the same priority as the database
    • On z/OS use IBM Workload manager (WLM) for the started task
    • On Linux/Unix use nice
Rule of Thumb:
  • InfoSphere CDC will normally operate with low CPU but justifiably may use much larger amounts during heavy batch loads
    • Note, the CPU used per unit of work does not go up in these periods, and can actually be less, but more data is being processed in a shorter period of time, so higher CPU will be used for a period of time
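On Linux/UNIX, matching CDC's priority to the database's can be sketched as follows (the process names are assumptions: db2sysc for the Db2 engine, dmts64 for the CDC instance):

```shell
# Hedged sketch (Linux): give the CDC instance the same nice value as Db2.
db_nice=$(ps -o ni= -p "$(pgrep -f db2sysc | head -n 1)")
sudo renice -n "$db_nice" -p "$(pgrep -f dmts64 | head -n 1)"
```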

Best Practice - Long Running & Large Transactions

Both long-running and large transactions can potentially impact the restart of InfoSphere CDC, since the earliest open log position is tracked and used when InfoSphere CDC restarts replication.

  • If the earliest open log position is not contained in the staging space, then InfoSphere CDC will need to start further back in the log, and if a transaction has been open for days, there is a risk that the log will no longer be available
  • For InfoSphere CDC z, if you have an invalid long-running transaction (Unit of Recovery), you can utilize the ENDUR command to remove it from replication scope.  Note that this command must be used with great caution, as you could incur replication data loss if you actually required the data in the transaction whose disposal you forced.

For very large transactions, you need to ensure that the transaction staging space is large enough to contain the number of concurrent transactions being processed.

  • InfoSphere CDC LUW – Must set mirror_global_disk_quota_gb large enough to hold the transactions
  • InfoSphere CDC z - Uses a staging store to hold URs in memory above the bar until a commit is received.  It must be large enough to contain all concurrent open URs.  The size of this store is controlled by two parameters:
    • STG64LIMIT: the total amount of above-the-bar memory that can be used by all users in the address space
    • MAXSUBSCRSTAGESIZE: the amount of memory that can be used by the staging space for a single subscription
      • It has an additional argument which specifies the number of completed commit groups allowed in the staging store, defaulting to 10.  Once this number is reached, InfoSphere CDC will stop reading log data until a commit group has been sent to the target and removed from the store
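For CDC LUW, the quota above is a system parameter; a hedged example of setting it (the instance name cdcinst is hypothetical, and the right value depends on your transaction profile):

```shell
# Hypothetical example (CDC LUW): allow up to 100 GB of staged transaction data.
dmset -I cdcinst mirror_global_disk_quota_gb=100

# List the instance's current system parameter values to verify
dmset -I cdcinst
```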

Best Practice - Shared Scrape

Shared Scrape (sometimes referred to as Single Scrape)

When multiple subscriptions are running in a single instance, it is usually advantageous to utilize a shared scrape mechanism.  If you don't use a shared scrape, and you have 'n' subscriptions, CDC would read the log 'n' times.  If you utilize shared scrape, CDC will only read the log once which will utilize fewer system resources.

  • On by default for InfoSphere CDC LUW
  • You must configure the log cache for InfoSphere CDC z
  • Not available on InfoSphere CDC i or CDC Informix

You need to size the shared scrape cache appropriately for optimal performance:

  • If the cache is too small, the following will occur:
    • LUW – a private scraper will be launched, which will consume additional resources
      • Set the staging_store_disk_quota_gb system parameter appropriately to avoid this
    • z – with the log cache, each subscription attempts to read its data from the cache; it will read directly from the IFI if the data is no longer available from the cache
      • Configure the log cache using CACHELEVEL1SIZE, CACHEBLOCKSIZE, and CACHELEVEL1RESERVED
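A hedged LUW example of the quota setting mentioned above (the instance name and value are illustrative):

```shell
# Hypothetical example (CDC LUW): enlarge the on-disk staging store so slower
# subscriptions keep reading from the shared scrape rather than launching a
# private scraper.
dmset -I cdcinst staging_store_disk_quota_gb=75
```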


Best Practice - Number of Tables in a Subscription

Number of Tables in a Subscription Rule of Thumb
  • This is certainly not a hard limit, but in general it is best to keep the number of tables in a subscription under 1000

Considerations for the number of tables include:

  • With too many tables (over 1000) in a subscription, loading and managing the tables in the Management Console GUI will be slow
    • This may not be a consideration if you are controlling your replication via scripting/automation
  • If the number of tables exceeds 1000, promotion in Management Console will take a significant amount of time, and additional memory will need to be allocated
  • From an engine perspective:
    • With CDC LUW if you want to go beyond 1000 tables you need to increase the memory allocated to the InfoSphere CDC Instance
      • If the target is flat file or HDFS, the upper limit on the number of tables in a subscription is 800.  Additionally, you would need to allocate some additional memory if you have more than a couple of hundred tables.
    • CDC i can accommodate well over 2000 tables in a subscription
    • CDC z can accommodate well over 1000 tables in a subscription
      • Note, the number can be significantly higher, but there are implications to the number of subscriptions you have due to limits on below the bar memory

Best Practices - Number of CDC Subscriptions Required

Number of CDC Subscriptions Required

A subscription is a logical container that describes the replication configuration for tables from a source to a target datastore.  Once the subscription is created, you create table mappings within the subscription for the group of tables you wish to replicate.

An important part of planning an InfoSphere CDC implementation is to choose the appropriate number of subscriptions to meet your requirements.

More information can be found in the CDC performance documents :

IBM Data Replication Community Wiki - Performance

Rule of Thumb:
  • Starting with the minimum number of subscriptions, and increasing only for valid reasons, is the optimal approach
    • This ensures efficient use of resources and requires a lower level of maintenance

It may require an iterative process to arrive at a good balance.

  • The number of subscriptions will impact the resource utilization of the server (more CPU and RAM are needed) and performance of InfoSphere CDC
  • Note that tables with referential integrity or ones where the data must be synchronized at all times must reside in the same subscription since different subscriptions may be at different points in the log
  • The following are valid reasons to increase the number of subscriptions:
    • Requirement to replicate one source table to multiple targets
    • You need to increase the number of applies once it has been determined that it is the apply that is affecting the performance and you want further parallelism
    • Management of replication for groups of tables, in cases where some tables only require mirroring with a scheduled end time, while others require continuous mirroring or are active at different times of the day
    • You have too many tables in a single subscription which is affecting start-up performance
    • You have multiple independent business applications that you need to mirror, but want to be able to deal with maintenance independently

Best Practice - Number of Subscriptions per CDC Instance

Number of Subscriptions per CDC Instance

For best resource utilization and easiest management, you want to keep the number of CDC instances and subscriptions to a minimum.

Rule of Thumb:
  • InfoSphere CDC LUW can generally accommodate up to 50 subscriptions per instance (either source or target)
  • InfoSphere CDC z can generally accommodate up to 20 combined source and target subscriptions per instance and a hard maximum of 50 subscriptions per instance
    • Note: For CDC z if you have three or more source subscriptions in an instance, for optimal resource utilization, you need to ensure that the log cache is configured
  • InfoSphere CDC i can generally accommodate up to 25 source subscriptions per instance, and 25 subscriptions in a target instance
    • Note that InfoSphere CDC i does not have the single scrape feature, so each additional subscription will require proportionally extra CPU resource if reading from a single journal.  Thus, if you have multiple subscriptions you will achieve better efficiency if separate journals can be used for each subscription

Best Practice - Setting Up Notifications

Setting Up Notifications (Sometimes referred to as Alerts and Alarms)

There are various means of checking and understanding replication status, performance, and so on.  One important aspect is being notified in the event of a replication issue, whether an error or latency.  Notifications can be sent for any event message that InfoSphere CDC produces.

  • Appropriate notifications settings will alert InfoSphere CDC administrators of issues with the environment in a timely manner so they can be addressed
  • Notifications can be set up for various categories on the source and target, and at the datastore or individual subscription level
  • Messages can also be filtered based on severity: status, informational, operational, error, and fatal
  • Latency notifications can be set up to monitor performance issues at a subscription level.  A message can be sent to the event log when a warning threshold is passed, and another message if an error threshold is passed
  • InfoSphere CDC for z/OS also allows users to select specific messages to be directed to the console – see CONSOLEMSGS keyword

Notification can be directed to platform specific destinations or a custom user exit program

For Linux, UNIX and Windows replication engines:
  • E-mail (via SMTP; specify the e-mail address and password)
  • UNIX system log
  • Custom Java user exit program

z/OS

  • CHCPRINT spool file
  • SYSLOG
  • User Exit


IBM i

  • Message Queue
  • User Exit

Rule of Thumb:
  • The general practice is to have notifications set up for all Fatal and Error messages (events), as well as to have a notification for a latency threshold

Best Practice - Number of Instances Required

Number of Instances Required
  • You can have multiple instances of InfoSphere CDC running on the same server, each would have its own copy of storage, metadata, etc.
  • A separate InfoSphere CDC instance is required for each database that you want to replicate from, except in the case where a single instance is being used as both a source and a target

  • Additional instances may be required for the following reasons:
    • If you hit the maximum number of subscriptions for a single instance
    • If you have extremely large log volume and you need to split the source into multiple instances.  For further information on this situation, please refer to the performance tuning presentation

Best practice - Log Retention Policies

Log retention policies
  • For InfoSphere CDC LUW, utilize the dmshowlogdependency command to develop your retention procedures.  This command will tell you when InfoSphere CDC is finished with a log
  • For InfoSphere CDC i, utilize the CHGJRNDM command to manage journal receivers
  • For InfoSphere CDC z, there is no equivalent command.  This is generally not a requirement, as most z/OS shops keep logs for around 10 days.  If required, you can utilize the earliest open position indicated in the event log when InfoSphere CDC z starts replication
  • You need to consider and accommodate cases when replication will be down for a period of time
Rule of Thumb:
  • Successful implementations typically have 5+ days of logs retained
  • If you do not have sufficient log retention, you need to be prepared to do table refreshes if something unexpected happens in your environment
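A log-pruning procedure on LUW can be sketched as follows (the instance name is hypothetical; check the dmshowlogdependency reference page for the exact output format and options on your platform):

```shell
# Hedged sketch (CDC LUW): before deleting archived logs, ask CDC which log
# it still depends on, and prune only files older than that position.
dmshowlogdependency -I cdcinst
```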

References:

Search results for dmshowlogdependency

http://www.ibm.com/support/knowledgecenter/search/dmshowlogdependency?scope=SSTRGZ_11.4.0

Oracle dmshowlogdependency

http://www.ibm.com/support/knowledgecenter/SSTRGZ_11.4.0/com.ibm.cdcdoc.cdcfororacle.doc/refs/dmshowlogdependency.html

SQL Server dmshowlogdependency

http://www.ibm.com/support/knowledgecenter/SSTRGZ_11.4.0/com.ibm.cdcdoc.cdcformssql.doc/concepts/understandinghowcdcinteractswithyourdatabase.html

Db2 dmshowlogdependency

http://www.ibm.com/support/knowledgecenter/SSTRGZ_11.4.0/com.ibm.cdcdoc.cdcfordb2luw.doc/refs/dmshowlogdependency.html



Product Synonym

CDC;CDD;IIDR;IDR;IBM InfoSphere Data Replication;IBM Data Replication

Document Information

Modified date:
27 July 2020

UID

ibm11102497