Shared Scrape (sometimes referred to as Single Scrape)
When multiple subscriptions are running in a single instance, it is usually advantageous to use a shared scrape mechanism. Without shared scrape, 'n' subscriptions cause CDC to read the log 'n' times; with shared scrape, CDC reads the log only once, which uses fewer system resources.
- On by default for InfoSphere CDC LUW
- You must configure the log cache for InfoSphere CDC z
- Not available on InfoSphere CDC i or CDC Informix
You need to size the shared scrape cache appropriately for optimal performance:
- If the cache is too small, the following will occur:
- LUW – a private scraper will be launched, which consumes additional resources
- Set the staging_store_disk_quota_gb system parameter appropriately to avoid this
- z/OS – with the log cache, each subscription attempts to read its data from the cache; if the data is no longer available in the cache, it reads directly from the IFI
- Size the log cache by configuring CACHELEVEL1SIZE, CACHEBLOCKSIZE, and CACHELEVEL1RESERVED
The following items need to be considered when you are planning a replication architecture.
- Target table triggers
  - Often, if the target is a mirror image of the source, the target tables may have triggers that, if fired, will have an effect on other tables that InfoSphere CDC is replicating into (CDC will already have mirrored the source trigger's effect, so the actions would be duplicated). To alleviate this, you should disable the triggers on the target tables.
- Referential integrity constraints with the DELETE CASCADE flag on target tables
  - Similar to triggers, having cascading deletes set on the target will cause replication to try to delete a record (based on the delete that CDC replicated from the source log) that the database may have already deleted, or vice versa. The following strategy can be deployed to deal with cascaded deletes:
    - Disable the RI constraints on the target prior to starting replication
    - Please note that re-enabling these constraints may take some time during cut-over if you need to fail over to the target
    - Strategy: test how long re-enabling the RI constraints takes. If re-enabling all RI constraints takes too long and would impact your RTO (Recovery Time Objective), investigate whether it is possible to leave the RI constraints enabled and just change the CASCADE DELETE flag at cut-over time.
- Using 'Standard' replication achieves much higher throughput than using 'Consolidation' or 'Summarization'
  - Standard replication can perform optimizations such as arraying and commit grouping that cannot be performed when using the other replication methods
  - Note that some optimizations will also be disabled if you use Adaptive Apply or Conflict Detection & Resolution
- Be aware of the implications when you are parking tables or subscriptions
  - An inactive (not currently replicating) subscription that contains tables with a replication method of Mirror will continue to accumulate change data in the staging store, from the current point back to the point where mirroring was stopped. For this reason, you should delete subscriptions or remove tables that are no longer required, or change the replication method of all tables in the subscription to Refresh, to prevent the accumulation of change data in the staging store on your source system.
  - The same is true for a parked (idle) table: ensure that its replication method is set to Refresh.
Number of CDC Subscriptions Required
A Subscription is a logical container that describes the replication configuration for tables from a source to a target datastore. Once the subscription is created, you create table mappings within the subscription for the group of tables you wish to replicate.
An important part of planning an InfoSphere CDC implementation is choosing the appropriate number of subscriptions to meet your requirements.
More information can be found in the CDC performance documents:
For a comprehensive list of best practices, please see the parent community main page:
Rule of Thumb:
- Starting with the minimum number of subscriptions and increasing only for valid reasons is the optimal approach
- This ensures efficient use of resources and requires less maintenance
It may require an iterative process before you have a good balance
- The number of subscriptions will impact the resource utilization of the server (more CPU and RAM are needed) and performance of InfoSphere CDC
- Note that tables with referential integrity relationships, or tables whose data must be synchronized at all times, must reside in the same subscription, since different subscriptions may be at different points in the log
- The following are valid reasons to increase the number of subscriptions:
- Requirement to replicate one source table to multiple targets
- You need to increase the number of apply processes, once it has been determined that the apply is what is limiting performance and you want further parallelism
- Management of replication for groups of tables, in cases where some tables require mirroring only with a scheduled end time, while others require continuous mirroring or are active at different times of the day
- You have too many tables in a single subscription, which is affecting start-up performance
- You have multiple independent business applications that you need to mirror, but want to be able to deal with maintenance independently
Blog Author: Davendra Paltoo, Offering Manager, Data Replication
With growing volumes, variety, and velocity of data, the challenge of protecting data continues. Every organization today is striving to protect its customer data and other data because the costs of data breaches are high. The 2017 Ponemon Cost of Data Breach Study reports that the global average cost of a data breach is $3.62 million. The average cost for each lost or stolen record containing sensitive and confidential information decreased from $158 in 2016 to $141 in 2017. Despite the decline in the individual cost per record, companies report having larger breaches in 2017. The average size of the data breaches reported in this research increased 1.8 percent to more than 24,000 records per incident.
Security professionals are shifting their focus from device-specific controls to a data-centric approach that focuses on securing the apps and data and controlling access. Business, security, and privacy leaders understand that industry standard security practices have to be adopted to protect an organization’s data.
One of the reasons data security is compromised is that industry standard authentication mechanisms are not applied.
As part of the movement toward more centralized governance models for ease of administration and better security, organizations commonly want to centrally manage user credentials, security policies, and access rights as part of managing access to their applications and data.
As a result, many organizations manage their user credentials, security policies and access rights in a central repository by implementing a Lightweight Directory Access Protocol (LDAP) compliant Directory Service such as IBM’s Tivoli, Microsoft’s Active Directory, and Apache’s Directory Services.
In addition, organizations also prefer business software to leverage these directory services rather than use decentralized, individually managed user credentials, security policies or access rights that could potentially be created for each piece of software deployed.
To help cater to the aforementioned security needs of today’s digital businesses, IBM Data Replication’s Change Data Capture (“CDC”) technology has introduced support for integration with LDAP directory services. Traditionally, the CDC Access Server authenticates users, stores user credentials and data access information, and acts as the centralized communicator between all replication agents and Management Console clients.
Now, starting with the IIDR 126.96.36.199-10291 Management Console and Access Server delivery, users can choose to have an LDAP server manage their CDC user credentials, user authentication, and data store access information to help users conform to LDAP based centralized security architecture in their enterprise.
For more information about the new IIDR (CDC) LDAP enablement and for details on how to configure LDAP with IIDR (CDC) please refer to the below links.
In 2011, IBM released three new data replication products:
One question that comes up is whether the two IMS replication products are compatible with either the new Data Replication product or the existing InfoSphere CDC products. The answer is yes - the IMS products are compatible with both new and existing products that contain the CDC technology. More specifically, they can provide IMS changed data to any data replication solution that you can build with IBM's CDC technology. For example, you can create unidirectional (one-way) subscriptions that feed IMS changed data to any database that can be targeted by CDC:
Two notes about this configuration:
- IBM recommends you use the CDC technology in IIDR if you do not own InfoSphere CDC.
- The target DB2 can be DB2 for z/OS, DB2 LUW, or DB2 for System i.
You could also feed IMS changed data into other business software, such as ETL tools like IBM's DataStage, as well as ESBs.
In other words, the new IMS data replication products extend the reach of IBM's CDC technology by adding IMS as a source for log-based capture of changed data. If you have technical questions, see the Classic CDC section of the Information Center
The following rules determine which versions of Management Console (MC), Access Server, and CDC agents (engines) will interoperate.
These rules apply to any CDC 6.x or higher release.
1) The MC and AS must be at the exact same release level
2) The CDC source and target agents (engines) can be at different release levels
3) The MC version must be >= the most recent CDC source or target agent (engine)
When I first joined IBM in 2007 it seemed somewhat anachronistic that the Toronto Software Lab was managed by the leader of the Sensors and Actuators group. Now it seems prescient. As we consider the Internet of Things and see that all the physical objects around us have a useful place in the world of information, we see that our information assets can be viewed from a more traditionally physical perspective as well.
Databases are one of the most important assets that we have in an organization, certainly equal to our physical assets. As we consider the value in all the physical sensor information available, telling us who entered and exited every building, showing us through RFID tags what components flowed through an assembly line and so on, we should recognize the value of sensors on our databases as well.
I’ll focus on a particular type of sensor, one that provides a stream of the data changes occurring in the database. IBM’s Cloudant database provides a REST API that delivers a sensor stream of changes. IBM InfoSphere Data Replication can provide a sensor stream of changes from your distributed and mainframe-based relational databases, as well as from non-relational databases such as IMS and VSAM.
The original role for data replication technology was to enable low impact and low latency data movement. Data replication technology captures the changes occurring on the source database quickly and with minimal impact on that database and without requiring any changes in the database application. InfoSphere Data Replication captures changes from the database recovery logs. These traits make it ideal as a sensor.
Data replication has always had a role as an audit tool. Government regulations require certain industries to maintain an audit trail for their key data. Traditionally, data mining was rarely done on these audit trails (let's call them database sensor logs). The database sensor logs were kept primarily to meet the regulatory requirements.
Over time some industries have begun performing analytics on these sensor logs. Banks are using machine learning techniques to identify potential fraud events. Cell phone companies have been using streaming analytics to identify upsell opportunities. This use of analytics will grow as the Internet of Things continues to drive better analytics tools and create more data scientists experienced at working with sensor data.
I am often talking with clients as they begin to create an exploratory zone. They all understand the importance of having a copy of their database data in this exploratory zone and are interested in data replication technology as a way of maintaining a current copy of that database data. For exploratory zones that are being built around Hadoop it is easy to explain the advantages of using a database sensor log to provide that data as it suits the natural processing model of HDFS and Hive. Data replication can provide the sensor log as a series of files stored in HDFS and the data scientist can create Hive views over those files that can allow them to see either the entire audit trail or collapse that audit trail to just show the latest contents. Access to an audit trail is essentially a free side effect of the most practical method to provide data scientists with a current copy of the data and suits the general philosophy that one should not discard data on the way into your exploratory zone.
Most of our clients are just beginning the process of discovering the valuable questions that can be answered using this sensor log. An interesting difference between a database sensor log and a conventional physical sensor log is that the physical sensor log is often the primary source for both the current state of the physical object and the history of that state. You may learn both the current temperature of the engine block and the changes in that temperature over time. Many of the ideas discussed around the Internet of Things, such as the connected car, are primarily leveraging the information about the current state. This sort of analytics around the current state is already in place for databases. If you want to look at the Internet of Things to seed your thinking about what you may be able to get from database sensor logs you need to focus on those that are dependent on the history, not just the current state.
The use of personal fitness trackers to identify when a person with mobility issues may have fallen is an example that requires history. It seems quite similar to the fraud detection example that is already being done with database sensor logs. Some aspects of the connected car do depend on history, tracking the changes over time between two different sensors, say RPM and oil pressure, to ensure they maintain the expected relationship as they change. This might be comparable to comparing the database sensor log with the click stream from your application to confirm how many clicks it is taking to make specific types of updates to your system of record.
I think we are just scratching the surface here. I’m interested to see what other answers we will find. I encourage you to add a database sensor log to the assets you make available to your data scientists.
If you're looking for an excellent way to replicate changed data from a wide range of databases into a Netezza appliance, you can do so through InfoSphere Data Replication. The latest release provides an Apply program that is both native to Netezza and optimized for Netezza targets. This Apply is built from Data Replication's CDC technology and is also compatible with the CDC technology found in InfoSphere Change Data Capture and InfoSphere Classic Change Data Capture for z/OS. This means you can replicate data to Netezza from source databases ranging from Oracle, DB2, and others on UNIX or Windows to DB2* and IMS on the mainframe. Ordering information can be found in the Data Replication announcement letter on ibm.com.
* Data Replication's CDC Apply program cannot be used to feed changed data to the IBM DB2 Analytics Accelerator (IDAA).
Blog post by: Davendra Paltoo, Offering Manager, Data Replication
Follow him on twitter: https://twitter.com/Davendr18397388
In addition to Apache Kafka's more widely known capabilities as a distributed streaming platform, its scalability and low cost as a storage system make it suitable as a central point in the enterprise architecture where data is landed and then consumed by various applications.
When messages are landed in a Kafka cluster, there are a variety of available connectors or consumers that can in turn retrieve messages and deliver such messages to target destinations such as HDFS and Amazon S3. Or, Kafka users can write their own consumers.
However, developers of consumers struggle to find an easy way to:
- Only retrieve and consume transactions that have been completely delivered to Kafka (possibly to many different Kafka topics) with records in the original order as they occurred on the source database.
- Avoid processing of duplicate messages that have been delivered to Kafka.
- Avoid deadlocks when reading committed transactions from Kafka topics.
This is a concern for many Kafka users because, in some critical scenarios, it is extremely valuable to have Kafka behave with database-like transactional semantics. For example:
- The Kafka consumer needs to use Kafka data to populate parent child referential integrity tables in the same order as they were populated on the source.
- Processing of duplicate messages (which can sometimes occur during some failure and recovery scenarios when messages are being delivered to Kafka by Kafka producers or writers) cannot be tolerated by downstream applications. For example, knowing that no duplicates will be delivered, key business events can be triggered exactly once in response to messages delivered into Kafka.
- The Kafka consumer needs to guarantee retrieval and delivery of consistent transactions with the ability to recover from failures.
Why settle for duplicate data and promises of eventual consistency when you can leverage the performance and low cost of Kafka AND have database-like transactional semantics, without compromising on performance when delivering changes into Kafka?
IBM Data Replication’s “CDC” technology, with the initial version of its 11.4.0 release, provided users the ability to replicate from any supported CDC Replication source to a Kafka cluster by using the CDC target Replication Engine for Kafka.
In a recent delivery update, CDC now provides a Java class library that can be included in a Kafka consumer application that is intended to consume data delivered by CDC into Kafka. This library provides:
- Data in the original source log stream order, with identifiers available to denote transaction boundaries.
- A mechanism for ensuring exactly-once delivery, so that if there is an interruption in the Kafka environment and data has to be re-sent to Kafka by a producer or writer, a consumer can be developed to consume and process the data only once.
- A "bookmark" that can be used to restart the consuming application from where it last left off processing.
Also available in the recent CDC delivery are sample consuming applications that show how to:
- Poll records that were read by the Kafka transactionally consistent consumer for a specified subscription and write them to the standard output in the order of the source operation.
- Poll records that were read by the Kafka transactionally consistent consumer and publish them in text format to a JMS topic.
Users are free to adapt the samples to suit their needs or to write their own consumer applications.
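As a rough illustration of the kind of consumer application involved, here is a minimal Java sketch that resumes from a saved "bookmark" and skips anything it has already processed, so records are handled only once and in offset order even if a producer had to re-send data. It uses only the standard Apache Kafka consumer API; the topic name, group id, partition, and bookmark storage are illustrative assumptions, and it does not show the CDC-supplied transactionally consistent consumer library itself.

```java
// Minimal sketch, not the CDC library: a consumer that resumes from a saved
// "bookmark" offset and ignores duplicates caused by producer re-sends.
// Topic name, group id, partition, and bookmark storage are assumptions.
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class BookmarkConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");                 // assumption
        props.put("group.id", "cdc-demo-consumer");                       // assumption
        props.put("enable.auto.commit", "false");                         // position is managed manually
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        long bookmark = loadBookmark();                                   // last offset already processed
        TopicPartition partition = new TopicPartition("cdc.sourcedb.orders", 0); // assumption

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(partition));
            consumer.seek(partition, bookmark + 1);                       // resume just past the bookmark

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    if (record.offset() <= bookmark) {
                        continue;                                         // duplicate from a re-send; skip
                    }
                    process(record);                                      // apply in offset (source log) order
                    bookmark = record.offset();
                    saveBookmark(bookmark);                               // ideally persisted atomically with the apply
                }
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("offset=%d key=%s value=%s%n",
                record.offset(), record.key(), record.value());
    }

    // In a real application the bookmark would live in a durable store.
    private static long loadBookmark() { return -1L; }

    private static void saveBookmark(long offset) { /* no-op in this sketch */ }
}
```

The CDC library and samples described above handle the harder parts, such as transaction boundaries that span multiple topics and partitions, which this single-partition sketch does not attempt.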
For more information on the Kafka transactionally consistent consumer, please see our knowledge center at:
For demo videos on how to make use of the IBM Data Replication Kafka Apply or to contribute to the IBM Data replication community, please see the replication developer works page: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W8d78486eafb9_4a06_a482_7e7962f5ac59/page/Replication%20How%20To%20Videos
For more information on how IBM Data Replication can provide near real-time incremental delivery of transactional data to your Hadoop and Kafka based Data Lakes or Data Hubs, download this solution brief.
I've added three new videos to my channel. They walk through configuring, operating and monitoring data replication using the CDC Management Console. This is basically the same thing you'd get if you came by the InfoSphere demo room at Information On Demand (now Insight) and agreed to let me show you a quick demo of CDC.
Here's the link to my channel "James talks about Data Replication":
With a mere 4 weeks until IBM's 2013 Information on Demand, the data replication team thought it might be helpful to have a complete listing of all data replication sessions at IOD. From client presentations and our product roadmap to sneak peeks at new IBM Data Replication functionality, our sessions run the gamut!
Simply take a gander at the sessions below then go to the IOD agenda builder, click on Create Sign In, and then enter your confirmation number and the email address that you used to register for the conference. Create your agenda today!
GA date: Feb 24, 2017
For availability and ordering information go to the Shopz website.
IBM InfoSphere Data Replication for DB2 for z/OS, V11.4.0 improves support for zero-data-loss continuous availability and delivers performance enhancements.
More details are in the announcement letter at https://www-01.ibm.com/common/ssi/cgi-bin/ssialias?subtype=ca&infotype=an&appname=iSource&supplier=897&letternum=ENUS217-016.
IBM Data Replication V 11.4 Product Announcement on IBM.com:
Stay tuned for details, documentation on the new and exciting product capabilities!
Now that IBM has packaged its major data replication technologies into a single product, InfoSphere Data Replication, a lot of people are asking what they can take advantage of that they couldn't with the older products (InfoSphere CDC and InfoSphere Replication Server). Other than the obvious point of having access to multiple technologies, you can now use IBM's table compare utility, asntdiff, with CDC. asntdiff is a general-purpose utility that compares the data from two queries. IBM provides it through several products - Replication Server, the IBM Data Server Client, and all editions of DB2 and InfoSphere Warehouse.*
Long-time CDC users may ask what's happening to CDC's differential refresh and why they would want to use asntdiff instead of differential refresh. First understand that differential refresh is alive and well and it's not going anywhere :) asntdiff is just an option available to you.
To understand when you might want to use asntdiff, understand the basics of how it works.
- asntdiff accepts two queries as input and compares the result sets.
- You can use almost any query you can write against source and target tables.
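To make that concrete, here is a minimal JDBC sketch of the idea of comparing the result sets of two keyed queries. This is not asntdiff itself, only an illustration of the concept; the connection URLs, credentials, and table and column names are made-up examples (and the appropriate JDBC drivers would need to be on the classpath). Note how the source query can compute a derived value so that the two result sets line up.

```java
// Conceptual sketch only -- this is NOT asntdiff, just an illustration of
// comparing the result sets of two queries keyed on a primary key column.
// Connection URLs, credentials, and table/column names are illustrative.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;

public class CompareTwoQueriesSketch {
    public static void main(String[] args) throws Exception {
        // The source query computes a derived value; the target stores it directly.
        String sourceQuery = "SELECT ID, PRICE * QUANTITY AS AMOUNT FROM APP.ORDER_LINES";
        String targetQuery = "SELECT ID, AMOUNT FROM APP.ORDER_LINES_COPY";

        Map<Integer, Double> source = fetch("jdbc:db2://sourcehost:50000/SRCDB", sourceQuery);
        Map<Integer, Double> target = fetch("jdbc:db2://targethost:50000/TGTDB", targetQuery);

        // Rows present on the source but missing or different on the target.
        for (Map.Entry<Integer, Double> row : source.entrySet()) {
            Double targetValue = target.get(row.getKey());
            if (targetValue == null) {
                System.out.println("Missing on target: ID=" + row.getKey());
            } else if (!targetValue.equals(row.getValue())) {
                System.out.println("Differs: ID=" + row.getKey());
            }
        }
        // Rows present only on the target.
        for (Integer id : target.keySet()) {
            if (!source.containsKey(id)) {
                System.out.println("Extra on target: ID=" + id);
            }
        }
    }

    private static Map<Integer, Double> fetch(String url, String query) throws Exception {
        Map<Integer, Double> rows = new HashMap<>();
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(query)) {
            while (rs.next()) {
                rows.put(rs.getInt("ID"), rs.getDouble("AMOUNT"));
            }
        }
        return rows;
    }
}
```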
So, the first reason to consider asntdiff is when differential refresh's restrictions can be overcome by writing queries that produce the result sets you need. For example, asntdiff may be an alternative if one of the following differential refresh restrictions applies to your replication configuration:
- Differential refresh is only available for tables that use Standard replication.
- Derived columns in the source table are not supported.
- Target columns are ignored if they are mapped to derived expressions, constants, or journal control fields.
- Key columns of the target table must be mapped directly to columns in the source table.
Next, asntdiff is independent of data replication and can be started from a command line. Among other things, this means:
- It can be made part of a z/OS batch job and scheduled.
- It can be used while a CDC subscription is running
One major point to be aware of with asntdiff is how it works with heterogeneous data - for example, when you want to compare data being replicated from Oracle to DB2. asntdiff was originally written for DB2 databases. As a result, it requires IBM data federation technology to query databases such as Oracle. The good news is that InfoSphere Data Replication provides data federation for use with data replication configurations.
If you're not familiar with asntdiff and want to give it a try, see the ChannelDB2.com blog post titled Compare the Rows of Two Tables
. If you have questions, feel free to post them in the CDC message board here on developerWorks.
* Yes, technically, you could already use asntdiff with CDC on UNIX or Windows since it comes in so many IBM products on UNIX and Windows. However, if you wanted to use it on z/OS, you could only get it through Replication Server. It's now in InfoSphere Data Replication as well.