Blog post by: Davendra Paltoo, Offering Manager, Data Replication
Follow him on Twitter: https://twitter.com/Davendr18397388
In addition to Apache Kafka’s widely known role as a distributed streaming platform, its scalability and low cost as a storage system make it well suited to serve as a central point in an enterprise architecture where data is landed once and then consumed by many applications.
Once messages land in a Kafka cluster, a variety of available connectors and consumers can retrieve them and deliver them to target destinations such as HDFS or Amazon S3. Alternatively, Kafka users can write their own consumers.
However, developers of consumers struggle to find an easy way to:
- Only retrieve and consume transactions that have been completely delivered to Kafka (possibly across many different Kafka topics), with records in the original order in which they occurred on the source database.
- Avoid processing of duplicate messages that have been delivered to Kafka.
- Avoid deadlocks when reading committed transactions from Kafka topics.
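The first of these problems can be made concrete with a small sketch. The class below is a hypothetical illustration, not part of any IBM or Kafka API: it assumes each replicated record carries a transaction identifier, a source sequence number, and a commit-marker flag, and it buffers records per transaction until the commit marker arrives, so downstream code only ever sees complete transactions in original source order.

```java
import java.util.*;

// Hypothetical sketch: records from one source transaction may be spread
// across several Kafka topics and arrive interleaved. A consumer that needs
// whole, ordered transactions can buffer records by transaction id and
// release them only once that transaction's commit marker is seen.
public class TxnReassembler {
    // A minimal stand-in for a replicated change record. The txId, seq and
    // commitMarker fields are assumptions for illustration; real CDC records
    // carry their own transaction identifiers.
    public record ChangeRecord(String txId, int seq, String payload, boolean commitMarker) {}

    // Transactions that are still in flight (commit marker not yet seen).
    private final Map<String, List<ChangeRecord>> open = new LinkedHashMap<>();

    // Feed one record; returns the complete transaction (in source order)
    // once its commit marker arrives, otherwise an empty list.
    public List<ChangeRecord> accept(ChangeRecord r) {
        List<ChangeRecord> buf = open.computeIfAbsent(r.txId(), k -> new ArrayList<>());
        if (r.commitMarker()) {
            open.remove(r.txId());
            buf.sort(Comparator.comparingInt(ChangeRecord::seq)); // restore source order
            return buf;
        }
        buf.add(r);
        return List.of();
    }
}
```

A consumer built this way never hands a half-delivered transaction to downstream code, which is exactly the property plain per-partition polling cannot give you on its own.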
These gaps matter to many Kafka users because, in some critical scenarios, it is extremely valuable for Kafka to behave with database-like transactional semantics. For example:
- The Kafka consumer needs to use Kafka data to populate tables with parent-child referential integrity constraints in the same order as they were populated on the source.
- Downstream applications cannot tolerate duplicate messages, which can occur during some failure and recovery scenarios when messages are being delivered to Kafka by producers. For example, knowing that no duplicates will be delivered, key business events can be triggered exactly once in response to messages arriving in Kafka.
- The Kafka consumer needs to guarantee retrieval and delivery of consistent transactions with the ability to recover from failures.
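The duplicate-message requirement in particular comes down to idempotent processing. The sketch below is a hypothetical illustration (not the IBM library's mechanism as documented here): it assumes each record carries a monotonically increasing source sequence number, and it skips any record at or below the highest sequence already processed, so a producer re-send after a failure does not trigger the same business event twice.

```java
// Hypothetical sketch: if a producer re-sends records after a failure,
// duplicates can appear in a topic. A consumer can make its processing
// idempotent by remembering the highest source sequence number it has
// already handled and skipping anything at or below it.
public class DedupFilter {
    // Assumed monotonically increasing sequence assigned at the source.
    private long lastProcessedSeq = -1;

    // Returns true if the record is new and should be processed,
    // false if it is a duplicate that was already handled.
    public boolean shouldProcess(long seq) {
        if (seq <= lastProcessedSeq) {
            return false; // duplicate from a re-send; skip it
        }
        lastProcessedSeq = seq;
        return true;
    }
}
```

In a real deployment the last-processed position would be persisted durably alongside the processing results, so the filter survives a consumer restart.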
Why settle for duplicate data and promises of eventual consistency when you can have the performance and low cost of Kafka AND database-like transactional semantics, without compromising throughput when delivering changes into Kafka?
IBM Data Replication’s “CDC” technology, with the initial version of its 11.4.0 release, provided users the ability to replicate from any supported CDC Replication source to a Kafka cluster by using the CDC target Replication Engine for Kafka.
In a recent delivery update, CDC now provides a Java class library that can be included in a Kafka consumer application intended to consume data delivered by CDC into Kafka. This library provides:
- Data in the original source log stream order, with identifiers available to denote transaction boundaries.
- A mechanism for ensuring exactly-once delivery: if an interruption in the Kafka environment forces a producer to re-send data to Kafka, a consumer can be developed to consume and process that data only once.
- A “bookmark” that can be used to restart the consuming application from where it last left off processing.
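The bookmark idea can be sketched in a few lines. This is a hypothetical analog, not the CDC library's actual API: it assumes an opaque per-subscription position persisted after each processed record (here an in-memory map stands in for durable storage such as a file, a database row, or a Kafka topic), so a restarted consumer resumes exactly where it left off instead of reprocessing from the beginning.

```java
import java.util.*;

// Hypothetical sketch of the "bookmark" concept: after each record is
// processed, the consumer persists a position marker for its subscription;
// on restart it resumes from the saved bookmark.
public class BookmarkedConsumer {
    // subscription name -> last processed position (stand-in for durable storage)
    private final Map<String, Long> bookmarkStore;

    public BookmarkedConsumer(Map<String, Long> store) {
        this.bookmarkStore = store;
    }

    // Process every record after the saved bookmark, advancing it as we go;
    // returns the records actually processed on this run.
    public List<String> resume(String subscription, List<String> records) {
        long start = bookmarkStore.getOrDefault(subscription, -1L);
        List<String> processed = new ArrayList<>();
        for (long pos = start + 1; pos < records.size(); pos++) {
            processed.add(records.get((int) pos));
            bookmarkStore.put(subscription, pos); // "persist" after each record
        }
        return processed;
    }
}
```

Because the bookmark is written only after a record is handled, a crash between two records means at most that one record's position is replayed, which pairs naturally with the deduplication guarantee above.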
Also available in the recent CDC delivery are sample consumer applications that show how to:
- Poll records that were read by the Kafka transactionally consistent consumer for a specified subscription and write them to the standard output in the order of the source operation.
- Poll records that were read by the Kafka transactionally consistent consumer and publish them in text format to a JMS topic.
Users are free to adapt the samples to suit their needs or to write their own consumer applications.
For more information on the Kafka transactionally consistent consumer, please see our Knowledge Center at:
For demo videos on how to make use of the IBM Data Replication Kafka Apply, or to contribute to the IBM Data Replication community, please see the replication developerWorks page: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W8d78486eafb9_4a06_a482_7e7962f5ac59/page/Replication%20How%20To%20Videos
For more information on how IBM Data Replication can provide near real-time incremental delivery of transactional data to your Hadoop- and Kafka-based data lakes or data hubs, download this solution brief.