This post answers one of the FAQs found in License Tips for IBM Data Replication. You may have seen a recent announcement on ibm.com stating that IBM would no longer be marketing its older data replication products in 2013. That includes InfoSphere CDC. Why?
And what happens to the CDC technology? Over the years, IBM provided its data replication technologies through a number of different products. For example, IBM used to offer two major data replication products at the same time - InfoSphere CDC and InfoSphere Replication Server. That was a little confusing, even to some IBM people. To simplify the situation, IBM consolidated all of its replication technologies into a single product called IBM InfoSphere Data Replication (IIDR). Once IIDR was available, the older products no longer needed to be sold to new customers. That's why the end of marketing was announced. However, the replication technologies - CDC, Q Replication, and SQL Replication - are still alive and well. You can continue to use them as you always have. Of course, you may have two related questions:
- Are the older products still being supported?
- How do you move from your old InfoSphere CDC product to IIDR?
If you have any questions, feel free to post them in the comments section of this blog.
|
There are multiple deployment models available for InfoSphere CDC. The deployment model chosen for the source system will significantly affect the complexity of implementation.
Here are the CDC source deployment options from the least complex to the most complex:
1. InfoSphere CDC scraper runs on the source database server
2. InfoSphere CDC scraper runs on a remote tier, reading logs from a shared disk (SAN)
- This configuration is available for Oracle and Sybase. DB2 LUW has a similar capability, but it uses a remote client instead of reading from a SAN.
3. InfoSphere CDC scraper runs on a remote tier using log shipping
- This configuration is only available for Oracle.
Rule of Thumb
You should always use the least complex deployment option that will meet the business needs. The vast majority of CDC users install InfoSphere CDC on the source database server.
|
Number of Tables in a Subscription
Rule of Thumb
- This is certainly not a hard limit, but in general it is best to keep the number of tables in a subscription under 1000 (a partitioning sketch follows this list)
Considerations for the number of tables include:
- With too many tables (over 1000) in a subscription, loading and managing the tables in the Management Console GUI will be slow
- This may not be a consideration if you are controlling your replication via scripting/automation
- If the number of tables exceeds 1000, promotion in the Management Console will take a significant amount of time, and additional memory will need to be allocated
- From an engine perspective:
- With CDC LUW, if you want to go beyond 1000 tables, you need to increase the memory allocated to the InfoSphere CDC instance
- If the target is flat file or HDFS, the upper limit on the number of tables in a subscription is 800. Additionally, you will need to allocate extra memory if you have more than a couple of hundred tables.
- CDC i can accommodate well over 2000 tables in a subscription
- CDC z can accommodate well over 1000 tables in a subscription
- Note: the number can be significantly higher, but there are implications for the number of subscriptions you have due to limits on below-the-bar memory
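To make these limits concrete, here is a minimal sketch (with hypothetical table names) of partitioning a large table list into subscription-sized groups using the rule-of-thumb caps above. The caps are guidance from this post, not hard engine maximums:

```python
# Sketch: partition a table list into subscription-sized groups, using the
# rule-of-thumb limits discussed above. Table names are hypothetical.
from math import ceil

# Rule-of-thumb caps from this post: ~1000 tables for a typical LUW
# subscription, 800 when the target is flat file or HDFS.
LIMITS = {"default": 1000, "flatfile_or_hdfs": 800}

def plan_subscriptions(tables, target="default"):
    """Split 'tables' into the fewest groups that respect the cap."""
    if not tables:
        return []
    cap = LIMITS[target]
    groups = ceil(len(tables) / cap)
    # Spread tables evenly rather than filling each group to the cap.
    size = ceil(len(tables) / groups)
    return [tables[i:i + size] for i in range(0, len(tables), size)]

tables = [f"SCHEMA.T{i:04d}" for i in range(2500)]  # hypothetical tables
for n, group in enumerate(plan_subscriptions(tables, "flatfile_or_hdfs"), 1):
    print(f"subscription {n}: {len(group)} tables")  # 4 groups of 625
```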
|
Log retention policies
- For InfoSphere CDC LUW, use the dmshowlogdependency command to develop your retention procedures. This command tells you when InfoSphere CDC is finished with a log (see the sketch after this list)
- For InfoSphere CDC i, use the CHGJRNDM command to manage journal receivers
- For InfoSphere CDC z, there is no equivalent command. This is generally not a requirement, as most z shops keep logs around for 10 days. If required, you can use the earliest open position indicated in the event log when InfoSphere CDC z starts replication
- You need to consider and accommodate cases where replication will be down for a period of time
Rule of Thumb:
- Successful implementations typically have 5+ days of logs retained
- If you do not have sufficient log retention, you need to be prepared to do table refreshes if something unexpected happens in your environment
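To illustrate the LUW approach, here is a minimal sketch that wraps dmshowlogdependency to find the logs CDC still needs. The instance name is hypothetical, and the flag and output format are assumptions; verify them against the Knowledge Center references below before relying on this:

```python
# Sketch: drive a log-retention check from dmshowlogdependency (CDC LUW).
# The exact invocation flags and output format vary by platform and version,
# so treat the parsing below as illustrative only.
import subprocess

INSTANCE = "cdcinst1"  # hypothetical instance name

def logs_still_needed(instance):
    """Return the raw list of log files CDC still depends on."""
    result = subprocess.run(
        ["dmshowlogdependency", "-I", instance],  # flags assumed; verify locally
        capture_output=True, text=True, check=True,
    )
    # Assume one log file path per output line; adjust to your actual output.
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]

needed = set(logs_still_needed(INSTANCE))
print(f"CDC still depends on {len(needed)} log file(s); do not prune these.")
```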
References:
Search results for dmshowlogdependency:
http://www.ibm.com/support/knowledgecenter/search/dmshowlogdependency?scope=SSTRGZ_11.3.3
Oracle dmshowlogdependency:
http://www.ibm.com/support/knowledgecenter/SSTRGZ_11.3.3/com.ibm.cdcdoc.cdcfororacle.doc/refs/dmshowlogdependency.html
SQL Server dmshowlogdependency:
http://www.ibm.com/support/knowledgecenter/SSTRGZ_11.3.3/com.ibm.cdcdoc.cdcformssql.doc/concepts/understandinghowcdcinteractswithyourdatabase.html
DB2 (LUW) dmshowlogdependency:
http://www.ibm.com/support/knowledgecenter/SSTRGZ_11.3.3/com.ibm.cdcdoc.cdcfordb2luw.doc/refs/dmshowlogdependency.html
|
The IBM Redbook titled "Smarter Business: Dynamic Information with IBM InfoSphere Data Replication CDC" is now available at: http://www.redbooks.ibm.com/Redbooks.nsf/RedbookAbstracts/sg247941.html?Open
This Redbook covers a wide range of topics, from InfoSphere CDC use cases, solution topologies, features and functionality, and performance to environmental considerations and automation. It is a great source of information if you are wondering how best to set up InfoSphere CDC, how to fit it into a resilient environment, and so on.
|
Blog Post by: Davendra Paltoo, Offering Manager, Data Replication
Follow him on Twitter: https://twitter.com/Davendr18397388
One thing that is common across organizations today is that each one wants to be customer-centric, and in order to be so, they need to be insights-driven. Forrester predicts that insights-driven firms are growing at an average of more than 30% annually and are on track to earn $1.8 trillion by 2021.
These insights-driven organizations are built on data. In order to have a 360-degree view of a customer, organizations need data that is spread across disparate databases and data warehouses. Depending on business needs, some data may reside on premises in an IBM DB2 database or a Teradata database, or in the cloud in a Microsoft Azure SQL database.
Irrespective of where your data resides, you need a data replication solution for real time replication requirements in support of your data integration and analytics projects.
To help meet the needs of organizations that use Teradata and Microsoft Azure SQL databases, IBM Data Replication Change Data Capture technology now supports targeting Teradata in the latest 11.4 product line. Previously, the CDC apply for Teradata was available only in earlier supported versions.
In addition, CDC is now validated to support targeting Azure SQL databases, which closely resemble Microsoft SQL Server databases, via the CDC for Microsoft SQL Server data replication target/apply.
CDC Replication supports Azure SQL Database as a remote target only. The CDC Replication target can either be installed on premises or in an Azure VM. For optimum performance, the CDC Replication target should be installed on a VM in the same region as the Azure SQL Database.
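As a rough way to validate the same-region recommendation, the following sketch times a trivial query round trip from a candidate CDC target host. The server name and credentials are placeholders, and it assumes the pyodbc package with a Microsoft ODBC driver installed:

```python
# Sketch: compare round-trip latency to an Azure SQL database from candidate
# hosts for the CDC Replication target. Server and credentials are placeholders.
import time
import pyodbc

CONN_STR = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;"   # hypothetical server
    "DATABASE=mydb;UID=cdcuser;PWD=secret"    # hypothetical credentials
)

def median_roundtrip_ms(samples=20):
    conn = pyodbc.connect(CONN_STR)
    cursor = conn.cursor()
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        cursor.execute("SELECT 1")  # trivial query; measures network round trip
        cursor.fetchone()
        timings.append((time.perf_counter() - start) * 1000)
    conn.close()
    return sorted(timings)[samples // 2]

# Apply throughput is sensitive to this round trip: run the probe from each
# candidate VM and prefer the host in the same region as the database.
print(f"median round trip: {median_roundtrip_ms():.1f} ms")
```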
For more information on our Azure SQL database targeting capability, please see our Knowledge Center:
https://www.ibm.com/support/knowledgecenter/SSTRGZ_11.4.0/com.ibm.cdcdoc.cdcformssql.doc/concepts/configuringazuresql.htm
For more information on our Teradata apply in 11.4, visit our Knowledge Center:
https://www.ibm.com/support/knowledgecenter/SSTRGZ_11.4.0/com.ibm.cdcdoc.cdcforteradata.doc/concepts/whatsnew.html
|
Blog Post by: Davendra Paltoo, Offering Manager, Data Replication
Follow Davendra on Twitter at: https://twitter.com/Davendr18397388
Real-time analytics can provide real-time insights. When businesses have data at the right time, they can be more efficient and make the right tactical and strategic decisions. Apache Kafka® is used for building data hubs or landing zones, building real-time data pipelines, and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies.
The Apache Kafka (https://kafka.apache.org/) platform now has a vibrant associated eco-system.
When messages land in a Kafka cluster, a variety of available connectors or consumers can in turn retrieve those messages and deliver them to target destinations such as HDFS and Amazon S3. Users can save time and costs by using one of these available consumers, such as those listed on https://docs.confluent.io/current/connect/connectors.html or https://community.hortonworks.com/topics/Kafka.html
IIDR (CDC), with the initial version of its 11.4.0 release, provided users the ability to replicate from any supported CDC Replication source to a Kafka cluster by using the IIDR (CDC) target Replication Engine for Kafka. This engine writes Kafka messages that contain the replicated data to Kafka topics. The replicated data in the Kafka messages is by default written in the Avro binary format. Consumers that want to read these messages from Kafka clusters needed to utilize an Avro binary deserializer.
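For example, a consumer written with the confluent-kafka Python package can use its Avro-aware consumer to handle that deserialization via a schema registry. This is a generic sketch, not IBM-provided code; the broker address, registry URL, and topic name are hypothetical and depend on your deployment:

```python
# Sketch: read CDC's default Avro-formatted messages from a Kafka topic.
# AvroConsumer handles the Avro binary deserialization against a schema
# registry. Broker, registry URL, and topic name are hypothetical.
from confluent_kafka.avro import AvroConsumer

consumer = AvroConsumer({
    "bootstrap.servers": "broker:9092",              # hypothetical broker
    "schema.registry.url": "http://registry:8081",   # hypothetical registry
    "group.id": "cdc-readers",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["cdc.sourcedb.mytable"])         # hypothetical topic

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        # msg.value() is the deserialized row image as a Python dict.
        print(msg.key(), msg.value())
finally:
    consumer.close()
```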
Users, on the other hand, wanted more flexibility in the IIDR (CDC) Kafka apply, so that it could produce the various permutations of message formats expected by the wide variety of off-the-shelf connectors, custom connectors, and consumer applications.
To help customers solve this challenge, IIDR (CDC) has recently introduced support for "Kafka custom operation processors" (KCOPs) to improve the flexibility of message delivery into Kafka. Customers can make use of a number of integrated predefined output formats, or adapt these user exits to define their own custom formats, control what data is included in the message payload, and more. Apart from the flexibility of defining message formats and payloads, users are now also able to specify the Kafka topic names for their message destinations, and much more.
Moreover, as more common customer needs are assessed, IBM will add more predefined output formats.
For more information on the KCOP and samples available, please see the IIDR (CDC) knowledge center:
https://www.ibm.com/support/knowledgecenter/en/SSTRGZ_11.4.0/com.ibm.cdcdoc.cdckafka.doc/concepts/kafkakcop.html
For demo videos on how to make use of the sample KCOPs or to contribute to the IBM Data replication community, please see the replication developer works page: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W8d78486eafb9_4a06_a482_7e7962f5ac59/page/Replication%20How%20To%20Videos
For more information on how IBM Data Replication can provide near real-time incremental delivery of transactional data to your Hadoop and Kafka based Data Lakes , visit : https://www.ibm.com/analytics/data-replication
|
Blog Author: Davendra Paltoo, Offering Manager, Data Replication
With growing volumes, variety, and velocity of data, the challenge of protecting data continues. Every organization today is striving to protect its customer data and other data, as the cost of data breaches is high. The 2017 Ponemon Cost of Data Breach Study reports that the global average cost of a data breach is $3.62 million. The average cost for each lost or stolen record containing sensitive and confidential information decreased from $158 in 2016 to $141 in 2017. Despite the decline in the individual cost per record, companies reported having larger breaches in 2017. The average size of the data breaches reported in this research increased 1.8 percent to more than 24,000 records per incident.
Security professionals are shifting their focus from device-specific controls to a data-centric approach that focuses on securing the apps and data and controlling access. Business, security, and privacy leaders understand that industry standard security practices have to be adopted to protect an organization’s data.
One of the reasons data security is compromised is that industry-standard authentication mechanisms are not applied.
As part of movement to more centralized governance models for ease of administration and better security, organizations commonly want to centrally manage user credentials, security policies and access rights as part of managing access to their applications and data.
As a result, many organizations manage their user credentials, security policies and access rights in a central repository by implementing a Lightweight Directory Access Protocol (LDAP) compliant Directory Service such as IBM’s Tivoli, Microsoft’s Active Directory, and Apache’s Directory Services.
In addition, organizations also prefer business software to leverage these directory services rather than use decentralized, individually managed user credentials, security policies or access rights that could potentially be created for each piece of software deployed.
To help cater to the aforementioned security needs of today’s digital businesses, IBM Data Replication’s Change Data Capture (“CDC”) technology has introduced support for integration with LDAP directory services. Traditionally, the CDC Access Server authenticates users, stores user credentials and data access information, and acts as the centralized communicator between all replication agents and Management Console clients.
Now, starting with the IIDR 11.4.0.0-10291 Management Console and Access Server delivery, users can choose to have an LDAP server manage their CDC user credentials, user authentication, and data store access information to help users conform to LDAP based centralized security architecture in their enterprise.
For more information about the new IIDR (CDC) LDAP enablement, and for details on how to configure LDAP with IIDR (CDC), please refer to the links below.
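As background, the sketch below shows what a simple LDAP bind looks like with the Python ldap3 package. It is a generic illustration of directory-based authentication, not CDC's internal implementation; the host and DNs are hypothetical:

```python
# Sketch: an LDAP simple bind, illustrating the kind of authentication the
# Access Server can now delegate to a directory service. Generic ldap3
# example; host and distinguished names are hypothetical.
from ldap3 import Server, Connection

server = Server("ldap://directory.example.com:389")   # hypothetical host
user_dn = "uid=cdcadmin,ou=people,dc=example,dc=com"  # hypothetical DN

conn = Connection(server, user=user_dn, password="secret")
if conn.bind():
    print("authenticated: credentials are managed centrally in the directory")
else:
    print("bind failed:", conn.result)
conn.unbind()
```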
|
There are many deployment models available for InfoSphere Data Replication's CDC technology, of which DataStage integration is a popular one. The deployment option selected will significantly affect the complexity, performance, and reliability of the implementation. If possible, the best solution is always to use CDC direct replication (that is, do not add DataStage to the mix).
CDC integration with DataStage is the right solution for replication when:
- You need to target a database that CDC doesn't directly support and is not appropriate for CDC FlexRep
- Complex transformations are required that could not be handled natively with CDC, such as complex table look-ups
- When integrating with MDM
Cons of replicating from CDC to DataStage to an eventual target database:
- Performance going through DataStage (no matter which integration option is chosen) will be significantly slower than applying via a CDC target directly to the database
- The exception to this rule is when targeting Teradata: if you use DataStage flat-file integration, the throughput will be higher than CDC direct to Teradata
- Adding DataStage into the replication stream introduces additional points of failure
- Having a resilient CDC installation is more complex if DataStage is also involved
- When integrating with DataStage, there are two independent GUIs for configuration, and two places required to monitor the replication stream
- There is significant development effort involved in building DataStage jobs for each additional table added to replication
- Incorrect DataStage job design can negatively affect transactional integrity and cause data corruption
- The maximum number of tables per CDC subscription is lower if targeting DataStage
- The CDC External Refresh does not work when targeting DataStage. A separate process has to be put in place to de-duplicate records produced during the "in-doubt" period of a refresh, that is, the captured changes that occurred while the source data was being refreshed (a de-duplication sketch follows this list).
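Here is a minimal sketch of the kind of de-duplication step that last bullet describes, keeping only the latest image per primary key. It assumes each captured record carries its key columns and a monotonically increasing ordering value (such as a log position); the field names are hypothetical:

```python
# Sketch: de-duplicate change records captured during the "in-doubt" period
# of a refresh, keeping only the latest image per primary key. Assumes each
# record carries its key columns and an increasing sequence value
# (field names are hypothetical).
def dedupe_in_doubt(records, key_cols=("ID",), seq_col="LOG_POSITION"):
    latest = {}
    for rec in records:
        key = tuple(rec[c] for c in key_cols)
        if key not in latest or rec[seq_col] > latest[key][seq_col]:
            latest[key] = rec
    return list(latest.values())

rows = [
    {"ID": 1, "VAL": "a", "LOG_POSITION": 100},
    {"ID": 1, "VAL": "b", "LOG_POSITION": 140},  # later image wins
    {"ID": 2, "VAL": "c", "LOG_POSITION": 120},
]
print(dedupe_in_doubt(rows))  # ID 1 keeps VAL "b"
```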
|
With a mere 4 weeks until IBM's 2013 Information on Demand, the data replication team thought it might be helpful to have a complete listing of all data replication sessions at IOD. From client presentations and our product roadmap to sneak peeks at new IBM Data Replication functionality, our sessions run the gamut!
Simply take a gander at the sessions below then go to the IOD agenda builder, click on Create Sign In, and then enter your confirmation number and the email address that you used to register for the conference. Create your agenda today!
|
In April 2013, IBM announced Version 10.5 of DB2 for Linux, UNIX, and Windows. The same letter announced that DB2 AESE and DB2 AWSE would provide limited use of IBM InfoSphere Data Replication's (IIDR's) Change Data Capture (CDC) technology at no additional cost. However, the "limited use" statement sometimes leaves people with a question or two. The goal of this post is to answer those questions.
First, what CDC function are you entitled to use in the DB2 Advanced Editions? The license is always the final word but, in simple terms, you can use the bundled CDC only to build disaster recovery solutions where a primary DB2 instance* has up to two backup instances. For example, the following replication topology is allowed by the DB2 Advanced Edition licenses:
[Diagram: a primary DB2 instance replicating one-way to up to two backup instances]
Furthermore, the disaster recovery use case limits your entitled use of CDC function in the following ways:
- You can only use unidirectional (one-way) replication.
- You can set up replication from the primary DB2 to the backup(s), but you cannot set up replication from the backup(s) to the primary. This fits with the definition of a pure disaster recovery solution, since it provides for fail-over but not switchback. If you need CDC for both fail-over and switchback, you need to license the full IIDR product.
- You cannot transform the data as it's replicated. Again, this fits with the definition of disaster recovery, and you can license the full IIDR to be entitled to transformations as you replicate.
The next question is - when do you need to buy CDC? If you want to do anything more than what's described in this post, you'll need to buy IIDR for your DB2 Advanced Editions. The two most common replication configurations that require this are ones where you do either of the following:
- Replicate between DB2 LUW and either DB2 for z/OS or Oracle.
- Set up an HA or Active-Active solution with IIDR's CDC technology.
If you need to understand more about these examples, we'll have pictures and add a few more examples in a future post that talks about when you need to buy CDC.
Of course, the last question is - can I still build DB2 DR, HA, and Active-Active solutions using the Q Replication built into the DB2 Advanced Editions? Yes, absolutely. The addition of CDC to DB2 does not change this.
----------------
* Multiple DB2 instances can be created from a single DB2 install. Each instance can use the bundled CDC to replicate to up to the entitled number of backup instances.
|
Shared Scrape (sometimes referred to as Single Scrape)
When multiple subscriptions are running in a single instance, it is usually advantageous to use a shared scrape mechanism. Without shared scrape, if you have 'n' subscriptions, CDC reads the log 'n' times. With shared scrape, CDC reads the log only once, which uses fewer system resources.
- On by default for InfoSphere CDC LUW
- You must configure the log cache for InfoSphere CDC z
- Not available on InfoSphere CDC i or CDC Informix
You need to size the shared scrape cache appropriately for optimal performance:
- If the cache is too small, the following will occur:
- LUW: a private scraper will be launched, which will consume additional resources
- Set the staging_store_disk_quota_gb system parameter appropriately to avoid this
- z: with the log cache, each subscription attempts to read its data from the cache; it will read directly from the IFI if the data is no longer available from the cache
- Use the CACHELEVEL1SIZE, CACHEBLOCKSIZE, and CACHELEVEL1RESERVED parameters to size the log cache
|
Number of CDC Subscriptions Required
A Subscription is a logical container that describes the replication configuration for tables from a source to a target datastore. Once the subscription is created, you create table mappings within the subscription for the group of tables you wish to replicate.
An important part of planning an InfoSphere CDC implementation is choosing the appropriate number of subscriptions to meet your requirements.
More information can be found in the CDC performance documents:
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W8d78486eafb9_4a06_a482_7e7962f5ac59/page/IIDR%20Wiki?section=CDC%20Performance
For a comprehensive list of best practices, please see the parent community main page:
https://www.ibm.com/developerworks/community/groups/service/html/communityoverview?communityUuid=a9b542e4-7c66-4cf3-8f7b-8a37a4fdef0c
Rule of Thumb:
- Starting with the minimum number of subscriptions, and increasing only for valid reasons, is the optimal approach
- This ensures efficient use of resources and requires a lower level of maintenance
It may take an iterative process before you reach a good balance
- The number of subscriptions will impact the resource utilization of the server (more CPU and RAM are needed) and the performance of InfoSphere CDC
- Note that tables with referential integrity, or tables whose data must be synchronized at all times, must reside in the same subscription, since different subscriptions may be at different points in the log (see the grouping sketch after this list)
- The following are valid reasons to increase the number of subscriptions:
- Requirement to replicate one source table to multiple targets
- You need to increase the number of applies, once it has been determined that the apply is what is limiting performance and you want further parallelism
- Management of replication for groups of tables, in cases where some tables require only mirroring with a scheduled end time, while others require continuous mirroring or are active at different times of the day
- You have too many tables in a single subscription, which is affecting start-up performance
- You have multiple independent business applications that you need to mirror, but you want to be able to deal with maintenance independently
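As a sketch of the referential-integrity rule above, the following groups tables connected by foreign-key relationships so that each group can be placed in a single subscription. Table names and foreign-key edges are hypothetical:

```python
# Sketch: group tables so that any tables linked by referential integrity
# land in the same subscription (per the rule above). A simple union-find
# over foreign-key edges; table names are hypothetical.
def group_by_ri(tables, fk_edges):
    parent = {t: t for t in tables}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    # Union each (child table, referenced table) pair into one component.
    for child, referenced in fk_edges:
        parent[find(child)] = find(referenced)

    groups = {}
    for t in tables:
        groups.setdefault(find(t), []).append(t)
    return list(groups.values())

tables = ["ORDERS", "ORDER_LINES", "CUSTOMERS", "AUDIT_LOG"]
fk_edges = [("ORDER_LINES", "ORDERS"), ("ORDERS", "CUSTOMERS")]
# -> [['ORDERS', 'ORDER_LINES', 'CUSTOMERS'], ['AUDIT_LOG']]
print(group_by_ri(tables, fk_edges))
```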
|
Number of Subscriptions per CDC Instance
For best resource utilization and easiest management, you want to keep the number of CDC instances and subscriptions to a minimum.
Rule of Thumb:
- InfoSphere CDC LUW can generally accommodate up to 50 subscriptions per instance (either source or target)
- InfoSphere CDC z can generally accommodate up to 20 combined source and target subscriptions per instance, with a hard maximum of 50 subscriptions per instance
- Note: for CDC z, if you have three or more source subscriptions in an instance, ensure that the log cache is configured for optimal resource utilization
- InfoSphere CDC i can generally accommodate up to 25 source subscriptions per instance and 25 subscriptions in a target instance
- Note that InfoSphere CDC i does not have the single scrape feature, so each additional subscription requires proportionally more CPU if reading from a single journal. Thus, if you have multiple subscriptions, you will achieve better efficiency if a separate journal can be used for each subscription
|