
IBM Data Replication's CDC News

A New Look for the CDC Forum
Hi, everyone! Over the next month or so, we'll be updating the look of the CDC Forum and adding content. For example, you may have noticed we added a data replication icon and a wiki widget. More is in store. If you have requests, feel free to post a comment to this blog post. No guarantees, but we'll do what we can :)
A New IBM Redbook About CDC for DB2 z/OS
For those of you who don't subscribe to the IBM Redbook newsletter, you may have missed last week's announcement of a new Redbook titled Implementing IBM InfoSphere Change Data Capture for DB2 z/OS V6.5. It is an excellent extension of the brief CDC z/OS sections found in last year's Redbook titled Co-locating Transactional and Data Warehouse Workloads on System z. This new Redbook provides information about installing, configuring, running, and tuning CDC for DB2 z/OS. Read it if you need it, and don't forget to rate it (no one will cry if you give it 5 stars :)
CDC and IMS Replication Compatibility
In 2011, IBM released three new data replication products:
One question that comes up is whether the two IMS replication products are compatible with either the new Data Replication product or the existing InfoSphere CDC products. The answer is yes - the IMS products are compatible with both new and existing products that contain the CDC technology. More specifically, they can provide IMS changed data to any data replication solution that you can build with IBM's CDC technology. For example, you can create unidirectional (one-way) subscriptions that feed IMS changed data to any database that can be targeted by CDC.
You could also feed IMS changed data into other business software such as ETL tools (for example, IBM's DataStage) and ESBs. In other words, the new IMS data replication products extend the reach of IBM's CDC technology by adding IMS as a source for log-based capture of changed data. If you have technical questions, see the Classic CDC section of the Information Center.
IBM's Table Compare Utility and CDC
Now that IBM has packaged its major data replication technologies into a single product, InfoSphere Data Replication, a lot of people are asking what they can take advantage of that they couldn't with the older products (InfoSphere CDC and InfoSphere Replication Server). Other than the obvious point of having access to multiple technologies, you can now use IBM's table compare utility, asntdiff, with CDC. asntdiff is a general-purpose utility that compares the data returned by two queries. IBM provides it through several products - Replication Server, the IBM Data Server Client, and all editions of DB2 and InfoSphere Warehouse.*

Long-time CDC users may ask what's happening to CDC's differential refresh and why they would want to use asntdiff instead. First, understand that differential refresh is alive and well and it's not going anywhere :) asntdiff is just an option available to you. To understand when you might want to use asntdiff, it helps to understand the basics of how it works.
So, the first reason to consider asntdiff is when differential refresh's restrictions can be overcome by writing queries that return the result sets you need. For example, asntdiff may be an alternative if one of the following differential refresh restrictions applies to your replication configuration:
Next, asntdiff is independent of data replication and can be started from a command line. Among other things, this means:
One major point to be aware of with asntdiff is how it works with heterogeneous data - for example, when you want to compare data being replicated from Oracle to DB2. asntdiff was originally written for DB2 databases. As a result, it requires IBM data federation technology to query databases such as Oracle. The good news is that InfoSphere Data Replication provides data federation for use with data replication configurations. If you're not familiar with asntdiff and want to give it a try, see the ChannelDB2.com blog post titled Compare the Rows of Two Tables. If you have questions, feel free to post them in the CDC message board here on developerWorks.
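To give you a feel for the workflow, here is a minimal sketch of a comparison run from a shell. The database name, the queries, and especially the parameter-file keywords are illustrative assumptions - check the asntdiff documentation for the exact syntax your version expects.

    # Sketch: compare two result sets with asntdiff's query mode (-f).
    # The DBNAME/SOURCE_SELECT/TARGET_SELECT keywords are assumptions to
    # verify against the asntdiff documentation for your version.
    cat > diff.parms <<'EOF'
    DBNAME=SAMPLE
    SOURCE_SELECT=SELECT ID, NAME FROM SRC.CUSTOMER ORDER BY ID
    TARGET_SELECT=SELECT ID, NAME FROM TGT.CUSTOMER ORDER BY ID
    EOF
    asntdiff -f diff.parms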
* Yes, technically, you could already use asntdiff with CDC on UNIX or Windows since it comes in so many IBM products on those platforms. However, if you wanted to use it on z/OS, you could only get it through Replication Server. It's now in InfoSphere Data Replication as well.
A new IBM Redbook about InfoSphere CDC is available
The IBM Redbook titled "Smarter Business: Dynamic Information with IBM InfoSphere Data Replication CDC" is now available at: http://www.redbooks.ibm.com/Redbooks.nsf/RedbookAbstracts/sg247941.html?Open This Redbook covers a wide range of topics: InfoSphere CDC use cases, solution topologies, features and functionality, performance, environmental considerations, and automation. It is a great source of information if you are wondering how best to set up InfoSphere CDC, how to fit it into a resilient environment, and so on.
What Happened to InfoSphere CDC?
This post is the answer to one of the FAQs found in Licensing Tips for IBM Data Replication.
InfoSphere Change Data Capture (CDC) Best Practices
I have had many requests to share best practices for using IBM InfoSphere Change Data Capture (from this point forward referred to as CDC). I will try to add new tips and techniques on a regular basis. Along with many of the best practices posts, I will include items denoted by "Rule of Thumb". These are general guidelines that will help in your planning, and I will endeavor to provide reasons or context for the guidance. The Rules of Thumb should not be treated as hard limits, but rather as useful guidance. If your needs fall significantly outside the guidance, it certainly does not mean that it cannot be done. Rather, it would be best to engage with an InfoSphere CDC subject matter expert, and you may want to consider IBM Services for assistance.
For a comprehensive list of best practices, please see the parent community main page:
Planning Deployments
Deploying
Steady State Operations
Best Practice - Log Retention Policies
Rule of Thumb:
References:
Search results for dmshowlogdependency: http://www.ibm.com/support/knowledgecenter/search/dmshowlogdependency?scope=SSTRGZ_11.3.3
Oracle dmshowlogdependency
SQL Server dmshowlogdependency
DB2 (LUW) dmshowlogdependency
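To make this concrete, the dmshowlogdependency command referenced above reports which database logs a CDC instance still needs, so a retention job can avoid pruning logs that replication hasn't finished reading. A minimal sketch - the -I flag follows the common CDC command pattern but is an assumption to verify against the links above for your engine and version:

    # Sketch: ask a CDC instance which database logs replication still
    # depends on before a retention job deletes them. The -I flag is an
    # assumption; verify the exact syntax for your engine and version.
    dmshowlogdependency -I CDC_INSTANCE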
Best Practice - Number of Instances Required
Best Practice - Setting Up Notifications (Sometimes Referred to as Alerts and Alarms)
There are various means of checking and understanding replication status, performance, and so on. One important aspect is being notified in the event of a replication issue, be it an error or latency. Notifications can be sent for any event message that InfoSphere CDC produces.
Notifications can be directed to platform-specific destinations (on z/OS, for example) or to a custom user exit program.
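Notifications complement, rather than replace, ad hoc checks. As a sketch (the flags are assumptions; verify them in your version's command reference), you can also list recent event messages for an instance from the command line:

    # Sketch: list recent CDC event messages from the shell.
    # The -I and -a flags are assumptions to verify in the command
    # reference for your engine and version.
    dmshowevents -I CDC_INSTANCE -a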
Rule of Thumb:
Best Practice - Number of Subscriptions per CDC Instance
For best resource utilization and easiest management, keep the number of CDC instances and subscriptions to a minimum.
Rule of Thumb:
Best Practices - Number of CDC Subscriptions Required
A subscription is a logical container that describes the replication configuration for tables from a source to a target datastore. Once the subscription is created, you create table mappings within it for the group of tables you wish to replicate. More information can be found in the CDC performance documents. For a comprehensive list of best practices, please see the parent community main page:
Rule of Thumb:
It may take an iterative process to arrive at a good balance.
Best Practice - Number of Tables in a Subscription
Rule of Thumb:
Considerations for the number of tables include:
Best Practice - Shared Scrape (Sometimes Referred to as Single Scrape)
When multiple subscriptions are running in a single instance, it is usually advantageous to utilize a shared scrape mechanism. If you don't use shared scrape and you have 'n' subscriptions, CDC reads the log 'n' times. With shared scrape, CDC reads the log only once, which uses fewer system resources.
You need to size the shared scrape cache appropriately for optimal performance.
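As an illustration, on engines that stage shared-scrape data on disk, the size is controlled by an instance-level system parameter set with dmset. The parameter name below comes from the Oracle engine and is an assumption to verify for your engine and version:

    # Sketch: set the staging-store quota used by shared scrape to 100 GB.
    # staging_store_disk_quota_gb is an Oracle-engine parameter name; treat
    # it as an assumption and check your version's system parameter list.
    dmset -I CDC_INSTANCE staging_store_disk_quota_gb=100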
Is CDC Free with DB2 LUW?
This post is the answer to one of the FAQs found in Licensing Tips for IBM Data Replication.
In April 2013, IBM announced Version 10.5 of DB2 for Linux, UNIX, and Windows. The same letter announced that DB2 AESE and DB2 AWSE would provide limited use of IBM InfoSphere Data Replication's (IIDR's) Change Data Capture (CDC) technology at no additional cost. However, the "limited use" statement sometimes leaves people with a question or two. The goal of this post is to answer those questions. First, what CDC function are you entitled to use in the DB2 Advanced Editions? The license is always the final word but, in simple terms, you can only use the bundled CDC to build disaster recovery solutions where a primary DB2 instance* has up to two backup instances. For example, the following replication topology is allowed by the DB2 Advanced Edition licenses:
Furthermore, the disaster recovery use case limits your entitled use of CDC function in the following ways:
The question is - when do you need to buy CDC now? If you want to do anything more than what's described in this post, you'll need to buy IIDR for your DB2 Advanced Editions. The two most common replication configurations that require this are ones where you do either of the following:
If you need to understand more about these examples, a future post about when you need to buy CDC will include pictures and a few more examples.
Of course, the last question is - can I still build DB2 DR, HA, and Active-Active solutions using the Q Replication built into the DB2 Advanced Editions? Yes, absolutely. The addition of CDC to DB2 does not change this.
---------------- * Multiple DB2 instances can be created from a single DB2 install. Each instance can use the bundled CDC to replicate up to the entitled number of backup instances.
IBM Data Replication at IOD 2013 - The Essential List
With a mere 4 weeks until IBM's 2013 Information on Demand, the data replication team thought it might be helpful to have a complete listing of all data replication sessions at IOD. From client presentations and our product roadmap to sneak peeks at new IBM Data Replication functionality, our sessions run the gamut! Simply take a gander at the sessions below, then go to the IOD agenda builder, click on Create Sign In, and enter your confirmation number and the email address that you used to register for the conference. Create your agenda today!
Tuesday, Nov 5
Best Practice - Deployment Configurations for LUW
There are multiple deployment models available for InfoSphere CDC. The deployment model chosen for the source system will significantly affect the complexity of the implementation. Here are the CDC source deployment options, from least complex to most complex:
1. InfoSphere CDC scraper runs on the source database server
2. InfoSphere CDC scraper runs on a remote tier, reading logs from a shared disk (SAN)
3. InfoSphere CDC scraper runs on a remote tier using log shipping
Rule of Thumb: You should always use the least complex deployment option that will meet the business needs. The vast majority of CDC users install InfoSphere CDC on the source database server.
Best Practice - CDC / CDD to DataStage Integration
There are many deployment models available for InfoSphere Data Replication's CDC technology, of which DataStage integration is a popular one. The deployment option selected will significantly affect the complexity, performance, and reliability of the implementation. If possible, the best solution is always to use CDC direct replication (i.e., do not add DataStage to the mix).
CDC integration with DataStage is the right solution for replication when:
Cons of replicating from CDC to DataStage to an eventual target database:
Link to Wiki containing best practices for integration with DataStage
Bring industry standard authentication mechanisms to your environment and protect your data
Blog Author: Davendra Paltoo, Offering Manager, Data Replication
With growing volumes, variety, and velocity of data, the challenge of protecting data continues. Every organization today is striving to protect its customer data and other data because the costs of data breaches are high. The 2017 Ponemon Cost of Data Breach Study reports that the global average cost of a data breach is $3.62 million. The average cost for each lost or stolen record containing sensitive and confidential information decreased from $158 in 2016 to $141 in 2017. Despite the decline in the individual cost per record, companies reported larger breaches in 2017: the average size of the data breaches in this research increased 1.8 percent to more than 24,000 records per incident.
Security professionals are shifting their focus from device-specific controls to a data-centric approach that secures the apps and data and controls access. Business, security, and privacy leaders understand that industry standard security practices have to be adopted to protect an organization's data. One common way the security of data is compromised is when industry standard authentication mechanisms are not applied.
As part of the movement toward more centralized governance models for ease of administration and better security, organizations commonly want to centrally manage user credentials, security policies, and access rights as part of managing access to their applications and data. As a result, many organizations keep this information in a central repository by implementing a Lightweight Directory Access Protocol (LDAP) compliant directory service such as IBM's Tivoli, Microsoft's Active Directory, or Apache's Directory Services. In addition, organizations prefer business software to leverage these directory services rather than use decentralized, individually managed user credentials, security policies, or access rights created for each piece of software deployed.
To help cater to the aforementioned security needs of today's digital businesses, IBM Data Replication's Change Data Capture ("CDC") technology has introduced support for integration with LDAP directory services. Traditionally, the CDC Access Server authenticates users, stores user credentials and data access information, and acts as the centralized communicator between all replication agents and Management Console clients. Now, starting with the IIDR 11.4.0.0-10291 Management Console and Access Server delivery, users can choose to have an LDAP server manage their CDC user credentials, user authentication, and datastore access information, helping them conform to an LDAP-based centralized security architecture in their enterprise.
For more information about the new IIDR (CDC) LDAP enablement and for details on how to configure LDAP with IIDR (CDC), please refer to the links below.
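Before pointing Access Server at a directory, it can save time to confirm that the LDAP server answers and that the bind DN you plan to use can actually see your replication users. A generic sketch with ldapsearch - the host, DNs, and filter are placeholders for your directory, not CDC-specific values:

    # Sketch: verify LDAP connectivity and that a CDC user entry is
    # visible to the bind DN. All names below are placeholders.
    ldapsearch -H ldap://ldap.example.com:389 \
      -D "cn=admin,dc=example,dc=com" -W \
      -b "dc=example,dc=com" "(uid=cdcadmin)"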
Improve the flexibility of message delivery into Kafka with Kafka Custom Operation Processors
Blog Post by: Davendra Paltoo, Offering Manager, Data Replication. Follow Davendra on Twitter at: https://twitter.com/Davendr18397388

Real time analytics can provide real time insights. When businesses have data at the right time, they can be more efficient and make the right tactical and strategic decisions. Apache Kafka® (https://kafka.apache.org/) is used for building data hubs or landing zones, building real-time data pipelines, and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, runs in production in thousands of companies, and now has a vibrant associated ecosystem. When messages are landed in a Kafka cluster, a variety of available connectors or consumers can in turn retrieve those messages and deliver them to target destinations such as HDFS and Amazon S3. Users can save time and costs by using one of these available consumers, such as those listed on https://docs.confluent.io/current/connect/connectors.html or https://community.hortonworks.com/topics/Kafka.html

IIDR (CDC), with the initial version of its 11.4.0 release, provided users the ability to replicate from any supported CDC Replication source to a Kafka cluster by using the IIDR (CDC) target Replication Engine for Kafka. This engine writes Kafka messages that contain the replicated data to Kafka topics. By default, the replicated data in the Kafka messages is written in the Avro binary format, so consumers that want to read these messages needed to use an Avro binary deserializer. Users, on the other hand, wanted more flexibility in the IIDR (CDC) Kafka apply so that it could produce the various permutations of message formats expected by the wide variety of off-the-shelf and custom connectors and consumer applications.

To help customers solve this challenge, IIDR (CDC) has recently introduced support for Kafka custom operation processors (KCOPs) to improve the flexibility of message delivery into Kafka. Customers can make use of a number of integrated predefined output formats, or adapt these user exits to define their own custom formats, what data is included in the message payload, and more. Apart from the flexibility of defining message formats and payloads, users are now also able to specify the Kafka topic names for their message destinations, and much more. Moreover, as more common customer needs are assessed, IBM will add more predefined output formats.

For more information on the KCOPs and samples available, please see the IIDR (CDC) knowledge center. For demo videos on how to make use of the sample KCOPs or to contribute to the IBM Data Replication community, please see the replication developerWorks page: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W8d78486eafb9_4a06_a482_7e7962f5ac59/page/Replication%20How%20To%20Videos For more information on how IBM Data Replication can provide near real-time incremental delivery of transactional data to your Hadoop and Kafka based data lakes, visit: https://www.ibm.com/analytics/data-replication
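For example, once a subscription is using one of the predefined output formats (say, a JSON-producing sample KCOP), a quick smoke test is to read the topic back with Kafka's own console consumer. The broker address and topic name below are placeholders for your environment:

    # Sketch: read replicated records back from a Kafka topic to confirm
    # the KCOP is producing the expected message format. The broker and
    # topic names are placeholders.
    kafka-console-consumer.sh --bootstrap-server broker1:9092 \
      --topic cdc.sourcedb.customer --from-beginning --max-messages 10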
Make your organization insight-driven with the right replication solution
Blog Post by: Davendra Paltoo, Offering Manager, Data Replication. Follow him on Twitter: https://twitter.com/Davendr18397388
One thing that is common across organizations today is that each one wants to be customer-centric, and in order to be so, they need to be insights-driven. Insights-driven firms are growing at an average of more than 30% annually and are on track to earn $1.8 trillion by 2021, Forrester predicts. These insights-driven organizations are built on data. In order to have a 360-degree view of a customer, organizations need data that is spread across disparate databases and data warehouses. Some data may reside on premises in an IBM DB2 or Teradata database, or in the cloud in a Microsoft Azure SQL database, depending on business needs.
Irrespective of where your data resides, you need a data replication solution for real-time replication requirements in support of your data integration and analytics projects. To help meet the needs of organizations that use Teradata and Microsoft Azure SQL databases, IBM Data Replication's Change Data Capture technology now supports targeting Teradata in the latest 11.4 product line; previously, the CDC apply for Teradata was only available in earlier supported versions. In addition, CDC is now validated to support targeting Azure SQL databases, which closely resemble Microsoft SQL Server databases, via the CDC for Microsoft SQL Server data replication target/apply.
CDC Replication supports Azure SQL Database as a remote target only. The CDC Replication target can either be installed on premises or in an Azure VM. For optimum performance, the CDC Replication target should be installed on a VM in the same region as the Azure SQL Database. For more information on our Azure SQL Database targeting capability, please see our knowledge center. For more information on our Teradata apply in 11.4, visit our knowledge center.
How to retrieve data with transactional semantics from Kafka using the IBM Data Replication CDC Target Engine?
Blog post by: Davendra Paltoo, Offering Manager, Data Replication. Follow him on Twitter: https://twitter.com/Davendr18397388
In addition to Apache Kafka's more widely known capabilities as a distributed streaming platform, its scalability and low cost as a storage system make it suitable as a central point in the enterprise architecture where data is landed and then consumed by various applications.
However, developers of consumers struggle to find an easy way to:
This is a concern for many Kafka users because, in some critical scenarios, it is extremely valuable to have Kafka behave with database-like transactional semantics. For example:
Why settle for duplicate data and promises of eventual consistency when you can leverage the performance and low cost of Kafka AND have database-like transactional semantics, without compromising performance while delivering changes into Kafka?
IBM Data Replication’s “CDC” technology, with the initial version of its 11.4.0 release, provided users the ability to replicate from any supported CDC Replication source to a Kafka cluster by using the CDC target Replication Engine for Kafka.
Users are free to adapt the samples to suit their needs or to write their own consumer applications.
Deliver real time feeds of operational data into IBM Integrated Analytics System (IIAS) and Db2 Warehouse with IBM Data Replication
Blog Post by: Davendra Paltoo, Offering Manager, IBM Data Replication

In November 2017, IDC's Data Integration and Integrity (DII) Software Research Group conducted a survey targeted at end users of DII software, including questions similar to those asked in 2015 to help identify trends. With the need to bring real time, up-to-date data into the enterprise for analytics efforts, the survey found that keeping data synchronized among applications is, and will remain, the most prevalent use case for data integration, with 51% of respondents indicating application data sync to be the top use case. IDC also observes that data intelligence will grow from one of the least prevalent use cases today to one of the most prevalent by 2020, in support of data governance, profiling, discovery, and knowledge.

As organizations continue to face the challenge of bringing real time data to analytics applications, IBM provides the IBM® Integrated Analytics System (IIAS), which consists of a high-performance hardware platform and optimized database query engine software that work together to support various data analysis and business reporting features for today's big data needs. IBM also provides Db2 Warehouse for data warehousing with in-database analytics capabilities. Users often need to integrate data from various data sources into their IIAS appliance or Db2 Warehouse deployment, if such technologies exist in their enterprise.

In addition, for an increasing variety of analytics use cases, only the freshest data is sufficient. Whether it is a customer interacting with a self-service portal or an executive looking for up-to-the-minute financial performance, no organization can afford to serve up stale data. Yet this can happen if organizations depend on periodic bulk movement of data around the enterprise. IBM Data Replication provides up-to-the-second replicas of changing data where and when needed. Our users are replicating operational data to everything from traditional data warehouses, to a data appliance such as the Pure Data Analytics (PDA) appliance or IIAS, to a Big Data cluster driven by Apache Kafka and Hadoop, or even to a cloud-based OLAP environment such as Db2 Warehouse. IBM Data Replication (Change Data Capture technology) can deliver changes using log-based captures that minimize the impact on source databases, from ALL supported CDC sources into IIAS and Db2 Warehouse directly (i.e., in one hop).

In the recent release, IBM has introduced a new Mirror Bulk Apply option that uses Db2 external tables as the apply mechanism for faster ingest into column organized tables, whether in Db2 Warehouse deployed in the IIAS appliance or in "standalone" Db2 Warehouse databases. This improves on the previously available mechanisms for applying changes to such column organized tables. External table bulk apply is the algorithm that CDC employs to apply changes to the IBM Pure Data Appliance, or "Netezza"; this support is now being extended to Db2 Warehouse. Column organized tables are useful in databases intended for analytics since they aid query performance. The new CDC apply performance capability will give end users the confidence that even the data from the most high volume transactional systems can be replicated with acceptable latency into IIAS and Db2 Warehouse's column organized tables.
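To make the mechanism concrete, an external table lets Db2 bulk-load a flat file in a single set-based statement, which is what makes this apply path fast. CDC generates and drives the equivalent operations internally when Mirror Bulk Apply is enabled; the hand-written illustration below uses placeholder table and file names:

    # Sketch: the kind of set-based external-table load that makes bulk
    # apply fast. CDC performs equivalent operations internally when
    # Mirror Bulk Apply is enabled; all names here are placeholders.
    db2 "INSERT INTO SALES_COLUMNAR
         SELECT * FROM EXTERNAL '/tmp/sales_delta.csv' USING (DELIMITER ',')"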
For more information on the Mirror Bulk Apply capability, please see our knowledge center. For more information on IBM Data Replication, please read the IBM Data Replication solution brief.
Open your doors to Open Source Database Management Systems with IBM Data Replication support for log based captures from PostgreSQL
Blog By: Davendra Paltoo, Offering Manager, IBM Data Replication

Every business today revolves around data. Conversations with our customers frequently confirm the following:
Organizations are looking for cost-effective solutions to the above challenges. IBM Data Replication ("IDR") can help by providing up-to-the-second replicas of changing data where and when needed, keeping data synchronized with low latency. Our users are replicating operational data from most of the world's popular relational databases, like Oracle and Db2 z/OS, to everything from traditional data warehouses, to a data appliance such as Pure Data Analytics or IBM Integrated Analytics System, to a Big Data cluster driven by Apache Kafka and Hadoop, or even to a cloud-based On Line Analytical Processing (OLAP) environment such as Db2 Warehouse on Cloud.

Understanding the adoption patterns around OSDBMS, IBM Data Replication has for some time provided users the ability to feed data INTO PostgreSQL while sourcing from a wide variety of source DBMSs by employing low impact, database log based captures. With DB-Engines currently indicating that PostgreSQL is the fourth most popular database in the world, leveraging PostgreSQL around the enterprise just got easier: IBM Data Replication now supports PostgreSQL as a SOURCE in a recent deliverable. The IDR CDC PostgreSQL capture interoperates, out of the box, with the extensive array of target platforms supported by CDC. This includes most major DBMSs, Kafka, Hadoop, files, messaging systems, and more.

PostgreSQL as a source was released in an IDR 11.4 fix pack, with support for both PostgreSQL Enterprise and community editions, provided they meet the published system requirements. Note that users with sufficient license entitlement to either IBM Data Replication or IBM InfoSphere Data Replication can deploy the new PostgreSQL capture with no additional purchase. If in doubt, please check with your IBM account representative. Click here for a table of contents linking to more details about IBM's CDC technology for capturing PostgreSQL database changes and replicating them across the data center or around the world.

[1] https://opensource.com/life/15/12/why-open-source
[2] https://www.exist.com/blog/the-future-is-open-edb-postgres-and-your-enterprise-data/
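If you are preparing a PostgreSQL source, note that log-based capture from PostgreSQL generally rides on the server's logical decoding facility. The sketch below shows the generic, standard-PostgreSQL plumbing such capture relies on; treat the specific settings as assumptions and follow the IDR system requirements for the product's actual prerequisites:

    # Sketch: generic PostgreSQL logical-decoding prerequisites (standard
    # PostgreSQL, not CDC-specific; the IDR system requirements are the
    # authority on the product's actual prerequisites).
    # In postgresql.conf (server restart required):
    #   wal_level = logical
    #   max_replication_slots = 4    # leave at least one slot free
    #   max_wal_senders = 4
    # Verify logical decoding works with a throwaway test slot:
    psql -U postgres -d sourcedb \
      -c "SELECT * FROM pg_create_logical_replication_slot('cdc_test', 'test_decoding');" \
      -c "SELECT slot_name, plugin, active FROM pg_replication_slots;" \
      -c "SELECT pg_drop_replication_slot('cdc_test');"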