Apache Kafka cluster and components

Apache Kafka is a high-throughput distributed messaging system that you can use to facilitate scalable data collection.

Apache Kafka is bundled with Log Analysis in the <HOME>/IBM/LogAnalysis/kafka directory.

An installation of Apache Kafka consists of a number of brokers, each of which runs on its own server, coordinated by an instance of Apache ZooKeeper. You can start with a single broker and add more brokers as you scale your data collection architecture.
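For example, assuming the bundled distribution keeps the standard Apache Kafka script layout, you can start the bundled Apache ZooKeeper instance and a single broker as follows:

cd <kafka_install_dir>/kafka_version_number
# Start the bundled Apache ZooKeeper instance first
bin/zookeeper-server-start.sh config/zookeeper.properties &
# Then start a single Apache Kafka broker
bin/kafka-server-start.sh config/server.properties &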

In the scalable data collection architecture, the Receiver cluster writes data to Apache Kafka topics and partitions, based on the data sources. The Sender cluster reads data from Apache Kafka, processes it, and sends the data to Log Analysis.

Apache ZooKeeper

Apache Kafka uses Apache ZooKeeper to maintain and coordinate the Apache Kafka brokers.

A version of Apache ZooKeeper is bundled with Apache Kafka.

Topics, partitions, and consumer groups

The basic objects in Apache Kafka are topics, partitions, and consumer groups.

Topics are divided into partitions. Partitions are distributed across all the Apache Kafka brokers.

Create one partition for every two physical processors on the server where the broker is installed. For example, if the server has eight physical processors, create four partitions in the Apache Kafka broker. You specify the number of partitions in your Apache Kafka configuration. For more information, see Configuring Apache Kafka brokers.
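For example, the guideline above yields the following server.properties entry for a broker on a server with eight physical processors; num.partitions is the standard Apache Kafka setting for the default number of partitions per topic:

# One partition for every two physical processors
num.partitions=4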

You do not need to manually create topics or consumer groups. You only need to specify the correct values in the configuration for the LFA, Sender, and Receiver clusters. The appropriate topics and partitions are created for you.
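Automatic topic creation relies on a standard Apache Kafka broker setting that is enabled by default; you do not need to set it unless it was disabled in your server.properties file:

# Create topics automatically on first use (Apache Kafka default)
auto.create.topics.enable=true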

In the Receiver configuration, you configure Logstash to receive data from the LFAs and send it to the Apache Kafka brokers. The configuration maps the logical data source attributes that are specified in the LFA configuration to the topic_id and message_key parameters in Apache Kafka. This configuration ensures that data from each physical data source is mapped to a partition in Apache Kafka. For more information, see Configuring the Receiver cluster for single line logs.
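The following is a minimal sketch of such a Receiver output section, not taken from the product configuration: it assumes a Logstash release whose kafka output plugin accepts the topic_id and message_key options (newer plugin versions rename some options), and the broker address and the datasource and resourceID field names are placeholders for values from your own configuration:

output {
  kafka {
    # Broker address; older plugin versions use broker_list instead
    bootstrap_servers => "example.com:17991"
    # The logical data source name becomes the Kafka topic
    topic_id => "%{datasource}"
    # Keying by physical data source maps it to a consistent partition
    message_key => "%{resourceID}"
  }
}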

In the Sender configuration, you configure Logstash to read data from a specific topic as part of a consumer group. This configuration is based on the group_id and topic_id values that you specify. The topic_id is the same as the name of the logical data source. For more information, see Configuring the Sender cluster for single line logs.
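A matching Sender input section might look like the following sketch, again assuming an older kafka input plugin that accepts the zk_connect, group_id, and topic_id options; sender_group and the topic name are placeholder values:

input {
  kafka {
    # Bundled Apache ZooKeeper instance that coordinates the brokers
    zk_connect => "example.com:12345"
    # Consumers that share a group_id divide the topic's partitions
    group_id => "sender_group"
    # Same name as the logical data source
    topic_id => "logical_datasource_name"
  }
}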

Apache Kafka brokers

The configuration parameters for each Apache Kafka server are specified in the <kafka_install_dir>/kafka_version_number/config/server.properties file, where kafka_version_number is the Kafka version number. For Kafka version numbers for Log Analysis Version 1.3.8 and its fix packs, see Other supported software.

You need to specify the broker ID, the port, the directory where the log files are stored, and the Apache ZooKeeper host name and port in this file. For example:
# Unique ID for this broker within the cluster
broker.id=1
# Port that the broker listens on
port=17991
# Directory where the broker stores its log data
log.dirs=/tmp/kafka-logs-server_0
# Apache ZooKeeper host name and port
zookeeper.connect=example.com:12345

You can find a sample configuration file in the <HOME>/IBM/LogAnalysis/kafka/test-configs directory.

If you want to implement high availability in a production environment, the Apache Kafka cluster must consist of multiple brokers that run on separate servers. You can also configure replication and the data retention period across these servers. However, when you add new brokers to the cluster, the existing topics are not automatically redistributed across the new brokers. For more information about how to rebalance topics across brokers, see https://kafka.apache.org/081/ops.html.
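For example, the following standard Apache Kafka broker settings in server.properties enable three-way replication for automatically created topics and a seven-day retention period; the values shown are illustrative, not product defaults:

# Replicate each automatically created topic across three brokers
default.replication.factor=3
# Retain log data for 168 hours (seven days)
log.retention.hours=168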