Kafka
The Kafka origin reads data from one or more topics in an Apache Kafka cluster. All messages in a batch must use the same schema. The origin supports Apache Kafka 0.10 and later. When using a Cloudera distribution of Apache Kafka, use CDH Kafka 3.0 or later.
The Kafka origin can read messages from a list of Kafka topics or from topics that match a pattern defined in a Java-based regular expression. When reading topics in the first batch, the origin can start from the first message, the last message, or a particular position in a partition. In subsequent batches, the origin starts from the last-saved offset.
When configuring the Kafka origin, you specify the Kafka brokers that the origin can initially connect to, the topics the origin reads, and where to start reading each topic. You can configure the origin to connect securely to Kafka. You specify the maximum number of messages to read from any partition in each batch. You can configure the origin to include Kafka message keys in records. You can also specify additional Kafka configuration properties to pass to Kafka.
You can also use a connection to configure the origin.
You select the data format of the data and configure related properties. When processing delimited or JSON data, you can define a custom schema for reading the data and configure related properties.
You can configure the origin to load data only once and cache the data for reuse throughout the pipeline run. Or, you can configure the origin to cache each batch of data so the data can be passed to multiple downstream batches efficiently. You can also configure the origin to skip tracking offsets.
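Transformer builds the read for you from these stage properties. For orientation only, here is a minimal sketch of a comparable batch read using Spark's built-in Kafka source (option names are Spark's, not the origin's property names; broker addresses and topic names are placeholders):

```python
# Requires the spark-sql-kafka-0-10 package on the Spark classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-origin-sketch").getOrCreate()

df = (spark.read.format("kafka")
      # Brokers to connect to initially; the client discovers the rest.
      .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
      # Topics to read.
      .option("subscribe", "orders_exp,orders_reg")
      # Where to start reading each topic in the first batch.
      .option("startingOffsets", "earliest")
      .load())

# Each row carries the message key and value plus its topic, partition, and offset.
df.select("key", "value", "topic", "partition", "offset").show()
```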
Partitioning
Spark runs a Transformer pipeline just as it runs any other application, splitting the data into partitions and performing operations on the partitions in parallel. When the pipeline starts processing a new batch, Spark determines how to split pipeline data into initial partitions based on the origins in the pipeline.
For a Kafka origin, Spark determines the partitioning based on the number of partitions in the Kafka topics being read.
For example, if a Kafka origin is configured to read from 10 topics that each have 5 partitions, Spark creates a total of 50 partitions to read from Kafka.
Spark uses these partitions while the pipeline processes the batch unless a processor causes Spark to shuffle the data. To change the partitioning in the pipeline, use the Repartition processor.
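You can see this behavior by checking the partition count of the DataFrame produced by a Spark Kafka read like the sketch above; in the default configuration, Spark's Kafka source creates one Spark partition per Kafka topic-partition:

```python
# One Spark partition per Kafka topic-partition by default,
# so 10 topics x 5 partitions each = 50 initial partitions.
print(df.rdd.getNumPartitions())  # 50
```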
Topic Specification
The Kafka origin reads data in messages from one or more topics that you specify.
- Topic list
- Add a list of topics from your Kafka cluster. For example, suppose you want the origin to read two topics named orders_exp and orders_reg. When configuring the origin, clear the Use Topic Pattern property and in the Topic List property, add the following two topics:
- orders_exp
- orders_reg
- Topic pattern
- Specify a Java-based regular expression that identifies topics from your Kafka cluster.
For example, suppose your cluster has four topics named cust_east, cust_west, orders_exp, and orders_reg. To read the two topics cust_east and cust_west, you can use a topic pattern. Select the Use Topic Pattern property and in the Topic Pattern property, enter the Java expression cust.*.
With this configuration, if you later add the topic cust_north to your cluster, the origin will automatically read the new topic.
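Note that Kafka matches a topic pattern against the entire topic name, not a substring. As an illustrative sketch in Python (topic names from the example above; Python's re module stands in for Java's regex engine, which behaves the same for this pattern):

```python
import re

topics = ["cust_east", "cust_west", "orders_exp", "orders_reg", "cust_north"]

# Kafka requires the pattern to match the whole topic name,
# so test candidate expressions with full-match semantics.
pattern = re.compile(r"cust.*")
print([t for t in topics if pattern.fullmatch(t)])
# ['cust_east', 'cust_west', 'cust_north']
```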
Offsets
In a Kafka topic, an offset identifies a message in a partition. When configuring the Kafka origin, you define the starting offset to specify the first message to read in each partition of a topic.
- Earliest
- The origin reads all available messages, starting with the first message in each partition of each topic.
- Latest
- The origin reads the last message in each partition of each topic and any subsequent messages added to those topics after the pipeline starts.
- Specific offsets
- The origin reads messages starting from a specified offset for each partition in each topic. If an offset is not specified for a partition in a topic, the origin returns an error.
When reading the last message in a batch, the origin saves the offset from that message. In the subsequent batch, the origin starts reading from the next message.
For example, suppose the orders_exp and orders_reg topics each have two partitions, 0 and 1. To have the origin read each partition starting with the third message, which has an offset of 2, configure the origin as follows:
- orders_exp, partition 0: offset 2
- orders_exp, partition 1: offset 2
- orders_reg, partition 0: offset 2
- orders_reg, partition 1: offset 2
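As a sketch of how this kind of per-partition starting point is expressed with Spark's Kafka source, the startingOffsets option accepts a JSON document mapping each topic to its partition offsets (the broker address is a placeholder):

```python
import json

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Start every partition of both topics at offset 2, the third message.
starting_offsets = {
    "orders_exp": {"0": 2, "1": 2},
    "orders_reg": {"0": 2, "1": 2},
}

df = (spark.read.format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "orders_exp,orders_reg")
      .option("startingOffsets", json.dumps(starting_offsets))
      .load())
```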
Kafka Security
You can configure the origin to connect securely to Kafka through SSL/TLS, SASL, or both. For more information about the methods and details on how to configure each method, see Security in Kafka Stages.
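For reference, the underlying Kafka client accepts the standard Kafka security properties. Here is a sketch passing them through Spark's Kafka source, which forwards any option prefixed with kafka. to the client (paths and credentials are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read.format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9093")
      .option("subscribe", "orders_exp")
      # SASL authentication over an SSL/TLS-encrypted connection.
      .option("kafka.security.protocol", "SASL_SSL")
      .option("kafka.sasl.mechanism", "PLAIN")
      # Truststore used to verify the broker certificates.
      .option("kafka.ssl.truststore.location", "/path/to/truststore.jks")
      .option("kafka.ssl.truststore.password", "trustpass")
      .option("kafka.sasl.jaas.config",
              'org.apache.kafka.common.security.plain.PlainLoginModule '
              'required username="user" password="secret";')
      .load())
```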
Data Formats
The Kafka origin generates records based on the specified data format.
- Avro
- The origin generates a record for every message. You can use one of the following methods to specify the location of the Avro schema definition:
- In Pipeline Configuration - Use the schema defined in the stage properties.
- Confluent Schema Registry - Retrieve the schema from Confluent Schema Registry. Confluent Schema Registry is a distributed storage layer for Avro schemas. You specify the URL to Confluent Schema Registry and whether to look up the schema by the schema ID or subject.
- Delimited
- The origin generates a record for every message. You can specify a custom delimiter, quote, and escape character used in the data.
- JSON
- The origin generates a record for every message; a parsing sketch follows this list.
- Text
- The origin generates a record for every message.
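To make the record generation concrete, here is a sketch of parsing JSON message values with Spark SQL functions after a Kafka read; the record schema here is hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

df = (spark.read.format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "orders_exp")
      .load())

# Hypothetical schema for the JSON messages.
schema = StructType([
    StructField("order_id", StringType()),
    StructField("quantity", IntegerType()),
])

# Kafka delivers values as bytes: cast to string, then parse one
# record per message.
records = (df
           .select(from_json(col("value").cast("string"), schema).alias("r"))
           .select("r.*"))
records.show()
```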
Configuring a Kafka Origin
Configure a Kafka origin to read data from topics in an Apache Kafka cluster.