Monitoring Kafka
You can monitor your Kafka environment with the Instana Kafka sensor to gain end-to-end visibility, identify performance bottlenecks, and optimize performance. After you install the Instana host agent, the agent automatically deploys the Kafka sensor, which collects real-time metrics. The Instana tracers instrument messaging calls to Kafka from monitored processes and capture traces across your messaging flow. You can view both the metrics and traces in the Instana UI.
For more information about tracing Kafka, see Supported client-side tracing. Currently, infrastructure correlation between message flow traces and Kafka infrastructure is not supported.
Support information
To make sure that the Kafka sensor is compatible with your current setup, verify the following support information sections:
Supported versions and support policy
All Kafka metrics that Instana collects are available for every version of Apache Kafka, Cloudera Kafka, and Confluent Kafka, except the Consumer group lag and the Consumer/Producer Byte Rate/Throttling metrics. IBM® Event Streams, which is built on open source Apache Kafka, is supported from IBM Event Streams 11.0.4 (IBM Event Streams Operator 3.0.5) and later versions.
The following table shows the latest supported version and support policy:
| Technology | Support policy | Latest technology version | Latest supported version |
|---|---|---|---|
| Apache Kafka | 45 days | 4.1.0 | 4.1.0 |
| Cloudera Kafka | 45 days | 4.1.x | 4.1.x |
| Confluent Kafka | 45 days | 8.1.0 | 8.1.0 |
| IBM Event Streams | On demand | 12.0 | 11.6 |
For more information about the support policy, see Support strategy for sensors.
Additional support information
Consumer group lag metrics are available for the following versions:
- Apache Kafka (with Zookeeper) versions from 0.11.x.x to 3.9.0
- Apache Kafka (with KRaft) versions from 3.9.0
- Cloudera Kafka versions from 3.x.x to 4.1.x
- Confluent Kafka versions from 3.3.x
- IBM Event Streams versions from 11.0.4 (IBM Event Streams Operator 3.0.5)
Consumer/Producer Byte Rate/Throttling metrics are available for Java Kafka clients only and for the following versions:
- Apache Kafka versions from 1.1.x
- Cloudera Kafka versions from 4.0.x to 4.1.x
- Confluent Kafka versions from 4.1.x
- IBM Event Streams versions from 11.0.4 (IBM Event Streams Operator 3.0.5)
Supported client-side tracing
Configuration
The Instana agent automatically detects the running Kafka broker. Therefore, no configuration is required.
Instana collects the first 400 topics that are sorted by topic name.
To filter topics, configure the sensor in the agent configuration file <agent_install_dir>/etc/instana/configuration.yaml as shown in the following example:
com.instana.plugin.kafka:
  ...
  poll_rate: 1 # value is in seconds. Default value is 1 second.
  topicsRegex: '<OPTIONAL_REGEX_HERE>'
  brokerPropertiesFilePath: '/path/to/server.properties'
  collectLagData: '' # true or false. The default value is true
- poll_rate: Specifies the polling frequency in seconds, with a default value of 1.
- topicsRegex: Optional regular expression to select up to 400 topics by name. If the value is empty or does not exist, Instana collects the first 400 topics that are sorted by name.
- brokerPropertiesFilePath: The path to the broker server.properties file that the agent uses to obtain information about the broker network and security protocol settings.
- collectLagData: Flag that enables or disables lag data collection (enabled by default).
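The topic selection rule described for topicsRegex can be sketched as follows. This is a minimal illustration, not the agent's implementation: the function name `select_topics` is hypothetical, Python's `re` module stands in for whatever regular expression engine the agent uses, and it is an assumption that the expression must match the whole topic name.

```python
import re

MAX_TOPICS = 400  # limit stated in this documentation


def select_topics(topic_names, topics_regex=None, limit=MAX_TOPICS):
    """Return the topics that would be collected under the described rules."""
    if topics_regex:
        # Assumption: the expression must match the full topic name.
        pattern = re.compile(topics_regex)
        candidates = [name for name in topic_names if pattern.fullmatch(name)]
    else:
        # Empty or missing expression: every topic is a candidate.
        candidates = list(topic_names)
    # Sort by name, then keep at most `limit` entries.
    return sorted(candidates)[:limit]


topics = [f"orders-{i:03d}" for i in range(500)] + ["payments", "audit"]
print(len(select_topics(topics)))                  # 400
print(select_topics(topics, r"(payments|audit)"))  # ['audit', 'payments']
```

Without an expression, topics that sort late by name (here `payments`) fall outside the first 400, which is the typical reason to set topicsRegex.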
If the path to the broker properties is not specified, the agent tries to find server.properties in the following places:
- Kafka broker process arguments
- KAFKA_SERVER_PROPERTIES environment variable
- Using the predefined paths: /path_to_kafka_home/config/server.properties or /path_to_kafka_home/etc/kafka/server.properties for Confluent Kafka.
The agent uses /opt/kafka/config/server.properties as the default path when server.properties is not found in any of the previously mentioned ways.
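The lookup order for server.properties can be summarized in a short sketch. The function `resolve_server_properties` and its parameters are hypothetical and only mirror the order described above. How the agent actually inspects the broker process arguments is not documented here, so treating any argument that ends with `server.properties` as the path is an assumption.

```python
import os

DEFAULT_PATH = "/opt/kafka/config/server.properties"


def resolve_server_properties(configured_path=None, process_args=(), env=None,
                              kafka_home=None, exists=os.path.isfile):
    """Return the server.properties path by following the described order."""
    env = os.environ if env is None else env
    # 1. Explicit brokerPropertiesFilePath from configuration.yaml.
    if configured_path:
        return configured_path
    # 2. Kafka broker process arguments (assumption: an argument that ends
    #    with "server.properties" is the file path).
    for arg in process_args:
        if arg.endswith("server.properties"):
            return arg
    # 3. KAFKA_SERVER_PROPERTIES environment variable.
    if env.get("KAFKA_SERVER_PROPERTIES"):
        return env["KAFKA_SERVER_PROPERTIES"]
    # 4. Predefined paths under the Kafka home directory.
    if kafka_home:
        for relative in ("config/server.properties",
                         "etc/kafka/server.properties"):
            candidate = os.path.join(kafka_home, relative)
            if exists(candidate):
                return candidate
    # 5. Fallback default path.
    return DEFAULT_PATH


print(resolve_server_properties(env={}, exists=lambda path: False))
print(resolve_server_properties(
    process_args=["kafka.Kafka", "/srv/kafka/server.properties"], env={}))
```

The `exists` parameter is injectable only so that the sketch can be exercised without touching the file system.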
Customizing the polling interval
You can configure how often Instana polls Kafka to collect data and metrics by using the poll_rate parameter in the agent configuration.yaml file as shown in the following example:
Configuring poll rate
com.instana.plugin.kafka:
  poll_rate: 1 # value is in seconds. Default value is 1 second.
SSL/TLS support
If your Kafka broker instance requires SSL client connections, you need to configure the Instana agent via <agent_install_dir>/etc/instana/configuration.yaml to enable collecting Consumer lag metrics as shown in the following example:
com.instana.plugin.kafka:
  ...
  sslTrustStore: '/path/to/truststore.jks'
  sslTrustStorePassword: 'kafkaTsPassword'
  sslKeyStore: '/path/to/sslKeyStoreFile.jks'
  sslKeyStorePassword: 'kafkaKsPassword'
SASL support
If your Kafka broker instance requires SASL/PLAIN authentication, configure the Instana agent through <agent_install_dir>/etc/instana/configuration.yaml to enable collection of consumer lag metrics as shown in the following example:
com.instana.plugin.kafka:
  ...
  saslUsername: 'kafkaUser'
  saslPassword: 'kafkaPassword'
For SSL/TLS connections, make sure that the truststore and keystore files are in the Java KeyStore (JKS) format. Use the keytool utility to create them.
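A quick way to check whether a file is in JKS format is to inspect its first four bytes, because JKS files start with the magic number 0xFEEDFEED. The following sketch is not part of Instana. The helper name `looks_like_jks` is hypothetical, and the check confirms only the container format, not the validity of the certificates or passwords.

```python
import struct
import tempfile

JKS_MAGIC = 0xFEEDFEED  # first four bytes of a JKS keystore file


def looks_like_jks(path):
    """Return True if the file starts with the JKS magic number."""
    with open(path, "rb") as handle:
        header = handle.read(4)
    return len(header) == 4 and struct.unpack(">I", header)[0] == JKS_MAGIC


# Demonstration with synthetic files (these are not real keystores).
with tempfile.NamedTemporaryFile(delete=False, suffix=".jks") as fake_jks:
    fake_jks.write(struct.pack(">II", JKS_MAGIC, 2))  # magic number + version
with tempfile.NamedTemporaryFile(delete=False, suffix=".p12") as other:
    other.write(b"\x30\x82\x0a\x00")  # DER sequence header, as in PKCS#12

print(looks_like_jks(fake_jks.name))  # True
print(looks_like_jks(other.name))     # False
```

In practice, you would pass the paths that are set in sslTrustStore and sslKeyStore.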
JMX authentication support
If JMX authentication is enabled for your Kafka broker, configure the JMX credentials for the Instana agent in <agent_install_dir>/etc/instana/configuration.yaml as shown in the following example:
com.instana.plugin.kafka:
  jmxUsername: ''
  jmxPassword: ''
  jmxPort: '' # default jmx port is 1099
Configuring Kafka trace correlation headers
You can configure the format for Kafka trace correlation headers that are used by Instana tracers. For more information, see Configuring Kafka trace correlation headers.
Kafka node - metrics collection
Kafka node metrics collection gathers and analyzes data about the performance and health of individual nodes within a Kafka cluster.
Configuration data
You need the following details to configure Kafka node:
- Version
- Zookeeper Connects
- Process ID
- Node ID
- Topics/Partitions
Performance metrics
The following table contains the performance metrics details:
| Metric | Description | Granularity |
|---|---|---|
| Total Produce Time | Total time in milliseconds to serve the specified request. Collected from kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce. | 1 second |
| Total Fetch Consumer Time | Total time in milliseconds to serve the specified request. Collected from kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer. | 1 second |
| Total Fetch Follower Time | Total time in milliseconds to serve the specified request. Collected from kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchFollower. | 1 second |
Broker traffic
The following table contains the broker traffic details:
| Metric | Description | Granularity |
|---|---|---|
| In | Aggregate incoming byte rate. Collected from kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec. | 1 second |
| Out | Aggregate outgoing byte rate. Collected from kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec. | 1 second |
| Rejected | Aggregate rejected byte rate. Collected from kafka.server:type=BrokerTopicMetrics,name=BytesRejectedPerSec. | 1 second |
Broker messages in
The following table contains the broker messages details:
| Metric | Description | Granularity |
|---|---|---|
| Count | Aggregate incoming message rate. Collected from kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec. | 1 second |
Produce requests
The following table contains the produce requests details:
| Metric | Description | Granularity |
|---|---|---|
| Count | Request rate. Collected from kafka.network:type=RequestMetrics,name=RequestsPerSec,request=Produce. | 1 second |
| Mean Latency | Average latency, calculated as the quotient of Count (mentioned earlier) and the total time in milliseconds to serve the specified request, which is collected from kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce. | 1 second |
Fetch consumer requests
The following table contains the fetch consumer requests details:
| Metric | Description | Granularity |
|---|---|---|
| Count | Request rate. Collected from kafka.network:type=RequestMetrics,name=RequestsPerSec,request=FetchConsumer. | 1 second |
| Mean Latency | Average latency, calculated as the quotient of Count (mentioned earlier) and the total time in milliseconds to serve the specified request, which is collected from kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer. | 1 second |
Fetch follower requests
The following table contains the fetch follower requests details:
| Metric | Description | Granularity |
|---|---|---|
| Count | Request rate. Collected from kafka.network:type=RequestMetrics,name=RequestsPerSec,request=FetchFollower. | 1 second |
| Mean Latency | Average latency, calculated as the quotient of Count (mentioned earlier) and the total time in milliseconds to serve the specified request, which is collected from kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchFollower. | 1 second |
Average idle time
The following table contains the average idle time details:
| Metric | Description | Granularity |
|---|---|---|
| Network Processor | The average fraction of time the network processor threads are idle. Values are between 0% (all resources are used) and 100% (all resources are available). Collected from kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent. | 1 second |
| Request Handler | The average fraction of time the request handler threads are idle. Values are between 0% (all resources are used) and 100% (all resources are available). Collected from kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent. | 1 second |
Broker failures
The following table contains the broker failures details:
| Metric | Description | Granularity |
|---|---|---|
| Fetch | Rate of fetch requests that failed. Collected from kafka.server:type=BrokerTopicMetrics,name=FailedFetchRequestsPerSec. | 1 second |
| Produce | Rate of produce requests that failed. Collected from kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec. | 1 second |
Broker state metrics
The following table contains the broker state metrics details:
| Metric | Description | Granularity |
|---|---|---|
| Under-replicated Partitions | The number of under-replicated partitions (ISR < all replicas). Collected from kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions. | 1 second |
| Offline Partitions | The number of partitions that do not have an active leader and are therefore not writable or readable. Collected from kafka.controller:type=KafkaController,name=OfflinePartitionsCount. | 1 second |
| Leader Elections | Leader election rate and latency. Collected from kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs. | 1 second |
| Unclean Leader Elections | Unclean leader election rate. Collected from kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec. | 1 second |
| ISR Shrinks | If a broker goes down, the ISR for some of the partitions shrinks. When that broker is up again, the ISR is expanded after the replicas are fully caught up. Other than that, the expected value for both the ISR shrink rate and expansion rate is 0. Collected from kafka.server:type=ReplicaManager,name=IsrShrinksPerSec. | 1 second |
| ISR Expansions | When a broker is brought up after a failure, it starts catching up by reading from the leader. After it is caught up, it is added back to the ISR. Collected from kafka.server:type=ReplicaManager,name=IsrExpandsPerSec. | 1 second |
| Active controller count | The number of active controllers in the cluster. Collected from kafka.controller:type=KafkaController,name=ActiveControllerCount. | 1 second |
Partitions
The following table contains the partition details:
| Metric | Description | Granularity |
|---|---|---|
| Count | Total number of partitions on this broker. This number should be mostly even across all brokers. Collected from kafka.server:type=ReplicaManager,name=PartitionCount. | 1 second |
Log flushing
The following table contains the log flushing details:
| Metric | Description | Granularity |
|---|---|---|
| Mean | Log flush rate. Collected from kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs. | 1 second |
| Flushes | Log flush count. Collected from kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs. | 1 second |
Topics
The following table contains the Kafka node topic details:
| Metric | Description | Granularity |
|---|---|---|
| Name | The name of the topic. | 1 second |
| Partitions | The number of partitions of the topic. | 1 second |
| Bytes In | Aggregate incoming byte rate for the topic. Collected from kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec. | 1 second |
| Bytes Out | Aggregate outgoing byte rate for the topic. Collected from kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec. | 1 second |
| Bytes Rejected | Aggregate rejected byte rate for the topic. Collected from kafka.server:type=BrokerTopicMetrics,name=BytesRejectedPerSec. | 1 second |
| Messages In | Aggregate incoming message rate for the topic. Collected from kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec. | 1 second |
| In-Sync Replicas | In-sync replicas count. Collected from kafka.cluster:type=Partition,name=InSyncReplicasCount. | 1 second |
Kafka cluster - metrics collection
Kafka cluster metrics collection gathers and analyzes data about the performance and health of the entire Apache Kafka cluster, rather than individual nodes.
Configuration data
You need the following details to configure Kafka cluster:
- Cluster Name
- Zookeeper
- Nodes (Name, Version)
- Topics/Partitions
Performance metrics
The following table contains the performance metrics details:
| Metric | Description | Granularity |
|---|---|---|
| All Brokers Messages In | The sum of the broker messages in metric from all nodes. | 1 second |
| Rejected Traffic | The sum of the broker traffic rejected metric from all nodes. | 1 second |
| Total Fetch Consumer Time | The sum of the total fetch consumer time metric from all nodes. | 1 second |
| Total Fetch Follower Time | The sum of the total fetch follower time metric from all nodes. | 1 second |
Average request latency versus throughput
The following table contains the average request latency versus throughput details:
| Metric | Description | Granularity |
|---|---|---|
| Produce Throughput | The sum of the produce requests count metric from all nodes. | 1 second |
| Fetch Consumer Throughput | The sum of the fetch consumer requests count metric from all nodes. | 1 second |
| Fetch Follower Throughput | The sum of the fetch follower requests count metric from all nodes. | 1 second |
| Total Produce Time | The sum of the total produce time from all nodes. | 1 second |
| Total Fetch Consumer Time | The sum of the total fetch consumer time from all nodes. | 1 second |
| Total Fetch Follower Time | The sum of the total fetch follower time from all nodes. | 1 second |
All brokers traffic
The following table contains the all brokers traffic details:
| Metric | Description | Granularity |
|---|---|---|
| In | The sum of the broker traffic in from all nodes. | 1 second |
| Out | The sum of the broker traffic out from all nodes. | 1 second |
| Rejected | The sum of the broker traffic rejected from all nodes. | 1 second |
All brokers failures
The following table contains the all brokers failures details:
| Metric | Description | Granularity |
|---|---|---|
| Fetch | The sum of the broker failures fetch from all nodes. | 1 second |
| Produce | The sum of the broker failures produce from all nodes. | 1 second |
All brokers state metrics
The following table contains the all brokers state metrics details:
| Metric | Description | Granularity |
|---|---|---|
| Under-replicated Partitions | The sum of the broker state metrics under-replicated partitions from all nodes. | 1 second |
| Offline Partitions | The sum of the broker state metrics offline partitions from all nodes. | 1 second |
| Leader Elections | The sum of the broker state metrics leader elections from all nodes. | 1 second |
| Unclean Leader Elections | The sum of the broker state metrics unclean leader elections from all nodes. | 1 second |
| ISR Shrinks | The sum of the broker state metrics ISR shrinks from all nodes. | 1 second |
| ISR Expansions | The sum of the broker state metrics ISR expansions from all nodes. | 1 second |
| Active controller count | The sum of the broker state metrics active controller count from all nodes. | 1 second |
Average idle time percentage
The following table contains the average idle time percentage details:
| Metric | Description | Granularity |
|---|---|---|
| Network Processor | The total average of the average idle time network processor from all nodes. | 1 second |
| Request Handler | The total average of the average idle time request handler from all nodes. | 1 second |
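The cluster-level tables in this section describe two aggregation rules: most metrics are the sum of the corresponding node metric across all nodes, while the idle time percentages are averaged across nodes. The following sketch illustrates the difference. The sample values and key names are hypothetical and are not Instana metric identifiers.

```python
from statistics import mean

# Hypothetical per-node samples; the keys are illustrative only.
nodes = [
    {"messages_in": 120.0, "bytes_rejected": 0.0, "request_handler_idle_percent": 90.0},
    {"messages_in": 80.0, "bytes_rejected": 5.0, "request_handler_idle_percent": 70.0},
    {"messages_in": 100.0, "bytes_rejected": 0.0, "request_handler_idle_percent": 80.0},
]

SUMMED = ("messages_in", "bytes_rejected")    # summed across nodes
AVERAGED = ("request_handler_idle_percent",)  # averaged across nodes


def aggregate_cluster(node_samples):
    """Combine node metrics into cluster metrics by sum or by average."""
    cluster = {key: sum(node[key] for node in node_samples) for key in SUMMED}
    cluster.update(
        {key: mean(node[key] for node in node_samples) for key in AVERAGED}
    )
    return cluster


cluster = aggregate_cluster(nodes)
print(cluster["messages_in"])                   # 300.0
print(cluster["request_handler_idle_percent"])  # 80.0
```

Summing an idle percentage would produce values above 100% on a multi-node cluster, which is why that metric is averaged instead.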
Log flushing
The following table contains the log flushing details:
| Metric | Description | Granularity |
|---|---|---|
| Mean | The sum of the log flushing mean from all nodes. | 1 second |
| Flushes | The sum of the log flushing flushes from all nodes. | 1 second |
Cluster nodes
The following table contains the cluster nodes details:
| Metric | Description | Granularity |
|---|---|---|
| Controller | Whether the node is the controller (Yes or No). | 1 second |
| Messages In | Chart with the count of the broker messages in. | 1 second |
| Bytes In | Chart with the count of the broker bytes in. | 1 second |
| Bytes Out | Chart with the count of the broker bytes out. | 1 second |
| Average Response Time | Chart of the broker average response time. | 1 second |
| Health | The node health indicator. | 1 second |
Cluster topics
The following table contains the cluster topics details:
| Metric | Description | Granularity |
|---|---|---|
| Partitions | The total number of partitions. | 10 minutes |
| Bytes In | Chart with the count of the topic bytes in. | 1 second |
| Bytes Out | Chart with the count of the topic bytes out. | 1 second |
| Bytes Rejected | Chart with the count of the topic bytes rejected. | 1 second |
| Messages In | Chart with the count of the topic messages in. | 1 second |
Consumer group lag
The following table contains the consumer group lag details:
| Metric | Description | Granularity |
|---|---|---|
| Lag | Consumer group lag per topic. | 60 seconds |
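Consumer group lag is commonly defined per partition as the log end offset minus the offset that the group last committed, and then summed per topic. The following sketch shows that arithmetic with made-up offsets. It is not a description of how the Instana sensor obtains the values. The function name `lag_per_topic` is hypothetical, and treating a missing committed offset as 0 is an assumption.

```python
from collections import defaultdict


def lag_per_topic(end_offsets, committed_offsets):
    """Sum per-partition lag (log end offset minus committed offset) by topic."""
    lag = defaultdict(int)
    for (topic, partition), end in end_offsets.items():
        # Assumption: a partition without a committed offset counts from 0.
        committed = committed_offsets.get((topic, partition), 0)
        # Clamp at 0 so that a stale end offset never yields negative lag.
        lag[topic] += max(end - committed, 0)
    return dict(lag)


# Keys are (topic, partition) pairs; values are offsets.
end_offsets = {("orders", 0): 1500, ("orders", 1): 1480, ("audit", 0): 90}
committed = {("orders", 0): 1450, ("orders", 1): 1480, ("audit", 0): 60}
print(lag_per_topic(end_offsets, committed))  # {'orders': 50, 'audit': 30}
```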
Consumers
The following table contains the consumers details:
| Metric | Description | Granularity |
|---|---|---|
| Byte Rate | The total number of bytes consumed per second. | 1 second |
| Throttling | The total average throttle time. | 1 second |
| Latency | The total average fetch latency. | 1 second |
Producers
The following table contains the producers details:
| Metric | Description | Granularity |
|---|---|---|
| Byte Rate | The total number of outgoing bytes sent per second. | 1 second |
| Throttling | The total average throttle time. | 1 second |
| Latency | The total average request latency. | 1 second |
Health Signatures
For each sensor, a knowledge base of health signatures is evaluated continuously against the incoming metrics and is used to raise issues or incidents depending on user impact.
Built-in events trigger issues or incidents based on failing health signatures on entities, and custom events trigger issues or incidents based on the thresholds of an individual metric of any given entity.
For more information about built-in events for Kafka node and cluster, see Built-in events reference.
Troubleshooting
SSL not configured
Monitoring issue type: kafka_ssl_not_configured
To resolve the SSL configuration issue and configure Kafka SSL truststore location and password, see SSL/TLS Support.
SSL client authentication not configured
Monitoring issue type: kafka_ssl_client_not_configured
To resolve the SSL client authentication related issue and configure Kafka SSL client authentication (keystore location and password), see SSL/TLS Support.
JMX authentication not configured
Monitoring issue type: kafka_invalid_jmx_credentials
To resolve the JMX authentication related issue and configure JMX authentication credentials, see JMX Authentication support.