About targeting Kafka

You can replicate from any supported CDC Replication source to a Kafka cluster by using the CDC Replication Engine for Kafka. This engine writes Kafka messages that contain the replicated data to Kafka topics.

By default, replicated data in the Kafka message is written in the Confluent Avro binary format. The following schema registries are compatible:

  • The Confluent Schema Registry Service
  • The Hortonworks Schema Registry Service (HDF v3.1.0 or later)

You can configure subscriptions to run without connecting to a schema registry by using a Kafka custom operation processor (KCOP) for the CDC Replication Engine for Kafka.

The Kafka cluster and schema registry application that you use with CDC Replication are external to the CDC Replication installation and must be set up independently. CDC Replication does not bundle the schema registry or the Kafka cluster; it is instead designed to work with an existing Kafka installation.

Regardless of which schema registry you use, Kafka itself can come from any preferred distribution. Only the schema registry service and the matching Confluent Avro binary deserializer need to be taken from the Confluent or Hortonworks packages. The open source Confluent Schema Registry Service and the Hortonworks Schema Registry Service both serve as repositories for Avro schemas.

The schema registry is a small service that performs the following tasks:

  • Listens for incoming messages that describe the schema for a topic and returns a number that serves as a reference to that schema for that topic.
  • Returns the appropriate schema, identified by that number, when consumers request topic data from Kafka.

This process enables automatic handling of schema changes because a specific version of a topic's schema is associated with each record. The first record that is detected with a changed schema causes the registry to create a new number and associate the new schema with that record and with future records for the topic. Existing records are not affected because the prior version of the schema still exists under its own reference number.
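
As an illustration of the two tasks described above, the following sketch registers a trivial Avro schema with a Confluent-compatible schema registry over its REST API and then retrieves it again by the returned number. The registry host, port, subject name, and schema are assumptions for illustration only; in a CDC Replication deployment the serializer and deserializer make these calls for you.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class SchemaRegistrySketch {
        public static void main(String[] args) throws Exception {
            String registry = "http://registry.example.com:8081"; // assumed host and port
            HttpClient client = HttpClient.newHttpClient();

            // Task 1: submit a (deliberately tiny) Avro schema for a topic's values.
            // The registry answers with a numeric ID that producers embed in each message.
            String body = "{\"schema\": \"{\\\"type\\\": \\\"string\\\"}\"}";
            HttpRequest register = HttpRequest.newBuilder(
                    URI.create(registry + "/subjects/my-topic-value/versions"))
                .header("Content-Type", "application/vnd.schemaregistry.v1+json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
            System.out.println(client.send(register, HttpResponse.BodyHandlers.ofString()).body());
            // for example: {"id":1}

            // Task 2: look the schema up again by that ID, as a consumer's deserializer does.
            HttpRequest fetch = HttpRequest.newBuilder(
                    URI.create(registry + "/schemas/ids/1")).build();
            System.out.println(client.send(fetch, HttpResponse.BodyHandlers.ofString()).body());
            // for example: {"schema":"\"string\""}
        }
    }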

The CDC Replication Engine for Kafka was built to support the open source schema registry API that is provided in the Confluent platform. Starting with HDF v3.1.0, Hortonworks provides Confluent compatibility with its Schema Registry Service, meaning that Hortonworks Schema Registry can register schemas for topics that are produced using the Confluent Avro serializer.

This schema registry service must be compatible with the version of Kafka that is being run. For example, the Confluent schema registry that is bundled in Confluent Platform 3.0 is documented to support Kafka 0.10.x. See the Confluent and Hortonworks documentation for more information.

Kafka data consumer components that are built or used with the Kafka cluster must use the schema registry deserializer that is included with the corresponding schema registry service. The host name and port number of the schema registry are passed as parameters to the deserializer through the Kafka consumer properties. The deserializer accesses the schema registry transparently; while it is running, it automatically matches the appropriate version of a topic's schema to specific records for that topic.
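
For example, a Java consumer that reads the default Confluent Avro binary output might be configured as in the following sketch. The broker address, registry URL, group ID, and topic name are assumptions; the essential points are the Avro deserializer class and the schema.registry.url consumer property.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class AvroTopicConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker.example.com:9092");      // assumed broker
            props.put("group.id", "cdc-readers");                           // assumed group
            // The Confluent Avro deserializer resolves each record's schema ID
            // against the registry named here; the lookup is transparent to this code.
            props.put("key.deserializer", "io.confluent.kafka.serializers.KafkaAvroDeserializer");
            props.put("value.deserializer", "io.confluent.kafka.serializers.KafkaAvroDeserializer");
            props.put("schema.registry.url", "http://registry.example.com:8081");

            try (KafkaConsumer<GenericRecord, GenericRecord> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("cdc.sourcedb.mytable")); // assumed topic
                while (true) {
                    ConsumerRecords<GenericRecord, GenericRecord> records =
                            consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<GenericRecord, GenericRecord> record : records) {
                        System.out.println(record.key() + " -> " + record.value());
                    }
                }
            }
        }
    }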

The CDC message consists of a data portion and a key portion. The Kafka key is based on user-selected key columns that are defined in the replication source data mapping. The data portion of the message includes the value of all columns from the data source that were selected for replication. Note that any key column values appear again in this part of the message. Derived columns on the source are also valid columns for replication if they were selected in the CDC Management Console.
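
Continuing the consumer sketch above, the key portion and the data portion arrive as separate Avro records. The column names below are hypothetical; in practice they are the columns that you mapped for replication.

    import org.apache.avro.generic.GenericRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecord;

    public class CdcRecordReader {
        // Could be called from the poll loop in the previous sketch.
        static void printCustomer(ConsumerRecord<GenericRecord, GenericRecord> record) {
            GenericRecord key = record.key();      // only the user-selected key columns
            GenericRecord value = record.value();  // every replicated column; key columns repeat here
            System.out.println("key   CUSTOMER_ID   = " + key.get("CUSTOMER_ID"));     // hypothetical key column
            System.out.println("value CUSTOMER_ID   = " + value.get("CUSTOMER_ID"));   // same value in the data portion
            System.out.println("value CUSTOMER_NAME = " + value.get("CUSTOMER_NAME")); // hypothetical non-key column
        }
    }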

CDC Replication has two methods for writing to the Kafka topics: a Java™ API and a REST server protocol. When the Java API is used, Kafka consumers read the Kafka topics that are populated by CDC Replication by using a deserializer that is compatible with the CDC Avro binary format. Known compatible deserializers are available with the Hortonworks and Confluent schema registry packages. You can configure subscriptions to use integrated Kafka custom operation processors (KCOPs), such as KcopJsonFormatIntegrated or KcopLiveAuditSingleRowIntegrated, which write records that can be consumed with alternate deserializers. Finally, Kafka records can be consumed by using the HTTP protocol to connect to the Kafka REST server. At this time, the only known Kafka REST server is provided by Confluent.
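
As one illustration of the alternate-deserializer case, a subscription that uses a JSON-producing KCOP such as KcopJsonFormatIntegrated can be read without a schema registry at all. The following minimal sketch assumes the records are written as UTF-8 JSON text; the broker address, group ID, and topic name are placeholders.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class JsonKcopConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker.example.com:9092"); // assumed broker
            props.put("group.id", "cdc-json-readers");                 // assumed group
            // JSON produced by the KCOP is plain text, so no schema registry is needed here.
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("cdc.sourcedb.mytable")); // assumed topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.println(record.value()); // one JSON document per replicated row
                    }
                }
            }
        }
    }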

Note: The Java API option for replication to Kafka targets provides the highest performance and greatest range of values for some data types. The REST API option is appropriate for situations in which communication between the CDC Replication Engine for Kafka target and the actual Kafka server must be routed over HTTP.

CDC Replication Engine for Kafka maintains the bookmark so that only records that are explicitly confirmed as written by Kafka are considered committed; the engine therefore does not skip operations. This behavior is maintained even across multiple replication sessions, where a replication session is a subscription in Mirror/Active state. As a result, even after an abnormal termination (for example, if the Kafka cluster went down), the CDC Replication Engine for Kafka, upon starting a subscription as Mirror/Active, ensures that any unconfirmed replicated source operations are resent, written to Kafka, and confirmed before the engine advances its bookmark to record the point up to which data has been replicated. This is the data delivery guarantee.

All data that is written to a given topic/partition pairing is written in order for a subscription in Mirror/Active state. For example, if multiple records from different source tables in a subscription are sent to the same topic/partition, they appear in that topic/partition in the order that they appeared on the source database.

The order in which operations occur within a transaction on the source cannot necessarily be restored by simply consuming from Kafka topics. When a producer writes records to multiple partitions on a topic, or to multiple topics, Kafka guarantees the order within a partition, but does not guarantee the order across partitions/topics. As described in the abnormal termination scenario, CDC Replication Engine for Kafka might resend data that was already sent to a Kafka topic, causing duplicates as a result of ensuring that messages were correctly delivered.

To address this issue, the CDC Replication Engine for Kafka in InfoSphere® Data Replication Version 11.4.0.0-505x and later provides a Kafka transactionally consistent consumer library that delivers Kafka records free of duplicates and allows your applications to recreate the order of operations in a source transaction across multiple Kafka topics and partitions.