Monitoring Kafka
The Kafka sensor is automatically deployed and installed after you install the Instana agent.
- Supported information
- Configuration
- SSL/TLS support
- JMX authentication support
- Kafka Node - Metrics collection
- Kafka Cluster - Metrics collection
- Troubleshooting
Supported information
Supported versions
All Kafka metrics that Instana collects are available for every version of Apache Kafka, Cloudera Kafka, and Confluent Kafka, except for the Consumer group lag and the Consumer/Producer Byte Rate/Throttling metrics. IBM® Event Streams, which is built on open-source Apache Kafka, is supported from IBM Event Streams 11.0.4 (IBM Event Streams Operator 3.0.5) and later versions.
Consumer group lag metrics are available for the following versions:
- Apache Kafka versions from 0.11.x.x to 3.x.x
- Cloudera Kafka versions from 3.x.x to 4.1.x
- Confluent Kafka versions from 3.3.x to 7.x.x
- IBM Event Streams 11.0.4 (IBM Event Streams Operator 3.0.5) and later versions
Consumer/Producer Byte Rate/Throttling metrics are available for Java Kafka clients only, with the following versions:
- Apache Kafka versions from 1.1.x to 3.x.x
- Cloudera Kafka versions from 4.0.x to 4.1.x
- Confluent Kafka versions from 4.1.x to 7.x.x
- IBM Event Streams 11.0.4 (IBM Event Streams Operator 3.0.5) and later versions
Supported client-side tracing
For this technology, Instana supports client-side tracing for the following languages and runtimes:
Configuration
The Instana agent automatically detects the running Kafka agent, so no configuration is required.
Instana collects the first 400 topics sorted by topic name.
If you need to filter topics, you can configure the filter in the agent configuration file <agent_install_dir>/etc/instana/configuration.yaml:
com.instana.plugin.kafka:
  ...
  topicsRegex: '<OPTIONAL_REGEX_HERE>'
  brokerPropertiesFilePath: '/path/to/server.properties'
  collectLagData: '' # true or false. The default value is true
- topicsRegex: Optional regular expression to select up to 400 topics by name. If the value is empty or does not exist, Instana collects the first 400 topics sorted by name.
- brokerPropertiesFilePath: The path to the broker server.properties file, which the agent uses to get information about the broker network and security protocol settings.
- collectLagData: Flag that explicitly enables or disables lag data collection (enabled by default).
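The topic selection rule can be illustrated with a short Python sketch. This is not the sensor's implementation: the function name select_topics is hypothetical, and the sketch assumes that the regular expression must match the whole topic name and that sorting is plain lexicographic. The sensor may use different matching or ordering semantics.

```python
import re
from typing import Iterable, List, Optional

MAX_TOPICS = 400  # limit stated in this document


def select_topics(topic_names: Iterable[str],
                  topics_regex: Optional[str] = None,
                  limit: int = MAX_TOPICS) -> List[str]:
    """Return up to `limit` topic names, sorted by name.

    Assumption: the regex must match the whole topic name (re.fullmatch).
    """
    names = sorted(set(topic_names))
    if topics_regex:  # an empty or missing regex means "no filter"
        pattern = re.compile(topics_regex)
        names = [name for name in names if pattern.fullmatch(name)]
    return names[:limit]


if __name__ == "__main__":
    topics = ["orders.eu", "billing", "orders.us"]
    print(select_topics(topics, r"orders\..*"))
```

With more than 400 matching topics, only the first 400 by name are kept, so a narrower regular expression is the way to make sure that a specific topic is included.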
If the path to the broker properties is not specified, the agent tries to find server.properties in the following places:
- Kafka broker process arguments
- KAFKA_SERVER_PROPERTIES environment variable
- Predefined paths: /path_to_kafka_home/config/server.properties, or /path_to_kafka_home/etc/kafka/server.properties for Confluent Kafka
The agent uses /opt/kafka/config/server.properties as the default path if server.properties cannot be found in any of these ways.
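The lookup order described above can be sketched in Python as follows. The function resolve_server_properties and its parameters are hypothetical. How the agent actually inspects the broker process arguments is not documented here, so the sketch assumes that it looks for an argument ending in server.properties.

```python
import os
import posixpath
from typing import Callable, Mapping, Optional, Sequence

DEFAULT_PATH = "/opt/kafka/config/server.properties"  # default named in this document


def resolve_server_properties(
    configured_path: Optional[str],
    process_args: Sequence[str],
    env: Mapping[str, str],
    kafka_home: Optional[str],
    exists: Callable[[str], bool] = os.path.isfile,
) -> str:
    # 1. Explicit brokerPropertiesFilePath from configuration.yaml.
    if configured_path:
        return configured_path
    # 2. Kafka broker process arguments (assumption: match on the file name).
    for arg in process_args:
        if arg.endswith("server.properties"):
            return arg
    # 3. KAFKA_SERVER_PROPERTIES environment variable.
    from_env = env.get("KAFKA_SERVER_PROPERTIES")
    if from_env:
        return from_env
    # 4. Predefined paths under the Kafka home (second one is for Confluent Kafka).
    if kafka_home:
        for relative in ("config/server.properties", "etc/kafka/server.properties"):
            candidate = posixpath.join(kafka_home, relative)
            if exists(candidate):
                return candidate
    # 5. Fall back to the default path.
    return DEFAULT_PATH
```

Setting brokerPropertiesFilePath explicitly skips the discovery steps, which is useful when the broker uses a nonstandard layout.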
SSL/TLS support
If your Kafka broker instance requires SSL client connections, you need to configure the Instana agent in <agent_install_dir>/etc/instana/configuration.yaml to enable the collection of consumer group lag metrics:
com.instana.plugin.kafka:
  ...
  sslTrustStore: '/path/to/truststore.jks'
  sslTrustStorePassword: 'kafkaTsPassword'
  sslKeyStore: '/path/to/sslKeyStoreFile.jks'
  sslKeyStorePassword: 'kafkaKsPassword'
The truststore and keystore must be in the Java KeyStore (JKS) format. You can create them with the keytool utility.
This configuration enables the Instana agent to connect to the Kafka broker over SSL and collect consumer group lag metrics.
JMX authentication support
If JMX authentication is enabled for your Kafka broker, you need to configure the JMX credentials for the Instana agent in <agent_install_dir>/etc/instana/configuration.yaml:
com.instana.plugin.kafka:
  jmxUsername: ''
  jmxPassword: ''
  jmxPort: '' # The default JMX port is 1099
If JMX is not secured, you can skip this step. Instana starts monitoring by connecting to the default JMX port, 1099.
Kafka Node - Metrics collection
Configuration data
- Version
- Zookeeper Connect
- Process ID
- Node ID
- Topics/Partitions
Performance metrics
Metric | Description | Granularity |
---|---|---|
Total Produce Time | Total time in milliseconds to serve the specified request. Collected from kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce. | 1 second |
Total Fetch Consumer Time | Total time in milliseconds to serve the specified request. Collected from kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer. | 1 second |
Total Fetch Follower Time | Total time in milliseconds to serve the specified request. Collected from kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchFollower. | 1 second |
Broker Traffic
Metric | Description | Granularity |
---|---|---|
In | Aggregate incoming byte rate. Collected from kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec. | 1 second |
Out | Aggregate outgoing byte rate. Collected from kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec. | 1 second |
Rejected | Aggregate rejected byte rate. Collected from kafka.server:type=BrokerTopicMetrics,name=BytesRejectedPerSec. | 1 second |
Broker Messages In
Metric | Description | Granularity |
---|---|---|
Count | Aggregate incoming message rate. Collected from kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec. | 1 second |
Produce Requests
Metric | Description | Granularity |
---|---|---|
Count | Request rate. Collected from kafka.network:type=RequestMetrics,name=RequestsPerSec,request=Produce. | 1 second |
Mean Latency | Average latency, calculated as the total time in milliseconds to serve the specified request (collected from kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce) divided by Count (above). | 1 second |
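As an illustration of the Mean Latency calculation, the following Python sketch assumes that the quotient is the total time divided by the request count, so that the result is a time per request. The function name mean_latency_ms is hypothetical, and the zero-request guard is an assumption and not documented sensor behavior.

```python
def mean_latency_ms(total_time_ms: float, request_count: float) -> float:
    """Average time per request in milliseconds.

    Returns 0.0 when no requests were served, to avoid dividing by zero
    (assumption: the sensor's handling of this case is not documented).
    """
    if request_count <= 0:
        return 0.0
    return total_time_ms / request_count


if __name__ == "__main__":
    # 300 produce requests that took 1200 ms in total.
    print(mean_latency_ms(1200.0, 300))
```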
Fetch Consumer Requests
Metric | Description | Granularity |
---|---|---|
Count | Request rate. Collected from kafka.network:type=RequestMetrics,name=RequestsPerSec,request=FetchConsumer. | 1 second |
Mean Latency | Average latency, calculated as the total time in milliseconds to serve the specified request (collected from kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer) divided by Count (above). | 1 second |
Fetch Follower Requests
Metric | Description | Granularity |
---|---|---|
Count | Request rate. Collected from kafka.network:type=RequestMetrics,name=RequestsPerSec,request=FetchFollower. | 1 second |
Mean Latency | Average latency, calculated as the total time in milliseconds to serve the specified request (collected from kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchFollower) divided by Count (above). | 1 second |
Average Idle Time
Metric | Description | Granularity |
---|---|---|
Network Processor | Average fraction of time the network processor threads are idle. Values range from 0% (all resources are used) to 100% (all resources are available). Collected from kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent. | 1 second |
Request Handler | Average fraction of time the request handler threads are idle. Values range from 0% (all resources are used) to 100% (all resources are available). Collected from kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent. | 1 second |
Broker Failures
Metric | Description | Granularity |
---|---|---|
Fetch | Rate of fetch requests that failed. Collected from kafka.server:type=BrokerTopicMetrics,name=FailedFetchRequestsPerSec. | 1 second |
Produce | Rate of produce requests that failed. Collected from kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec. | 1 second |
Broker State Metrics
Metric | Description | Granularity |
---|---|---|
Under-replicated Partitions | Number of under-replicated partitions (ISR < all replicas). Collected from kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions. | 1 second |
Offline Partitions | Number of partitions that do not have an active leader and are therefore neither writable nor readable. Collected from kafka.controller:type=KafkaController,name=OfflinePartitionsCount. | 1 second |
Leader Elections | Leader election rate and latency. Collected from kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs. | 1 second |
Unclean Leader Elections | Unclean leader election rate. Collected from kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec. | 1 second |
ISR Shrinks | If a broker goes down, the ISR for some of the partitions shrinks. When that broker is up again, the ISR expands once the replicas are fully caught up. Otherwise, the expected value for both the ISR shrink rate and the expansion rate is 0. Collected from kafka.server:type=ReplicaManager,name=IsrShrinksPerSec. | 1 second |
ISR Expansions | When a broker is brought up after a failure, it starts catching up by reading from the leader. Once it is caught up, it is added back to the ISR. Collected from kafka.server:type=ReplicaManager,name=IsrExpandsPerSec. | 1 second |
Active controller count | Number of active controllers in the cluster. Collected from kafka.controller:type=KafkaController,name=ActiveControllerCount. | 1 second |
Partitions
Metric | Description | Granularity |
---|---|---|
Count | Number of partitions on this broker. This number should be roughly even across all brokers. Collected from kafka.server:type=ReplicaManager,name=PartitionCount. | 1 second |
Log Flushing
Metric | Description | Granularity |
---|---|---|
Mean | Log flush rate. Collected from kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs. | 1 second |
Flushes | Log flush count. Collected from kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs. | 1 second |
Topics
Metric | Description | Granularity |
---|---|---|
Name | Topic name. Collected from kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec. | 1 second |
Partitions | Number of partitions of the topic. Collected from kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec. | 1 second |
Bytes In | Aggregate incoming byte rate for the topic. Collected from kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec. | 1 second |
Bytes Out | Aggregate outgoing byte rate for the topic. Collected from kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec. | 1 second |
Bytes Rejected | Aggregate rejected byte rate for the topic. Collected from kafka.server:type=BrokerTopicMetrics,name=BytesRejectedPerSec. | 1 second |
Messages In | Aggregate incoming message rate for the topic. Collected from kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec. | 1 second |
In-Sync Replicas | In-sync replicas count. Collected from kafka.cluster:type=Partition,name=InSyncReplicasCount. | 1 second |
Kafka Cluster - Metrics collection
Configuration data
- Cluster Name
- Zookeeper
- Nodes (Name, Version)
- Topics/Partitions
Performance metrics
Metric | Description | Granularity |
---|---|---|
All Brokers Messages In | Sum of the Broker Messages In metric from all nodes. | 1 second |
Rejected Traffic | Sum of the Broker Traffic Rejected metric from all nodes. | 1 second |
Total Fetch Consumer Time | Sum of the Total Fetch Consumer Time metric from all nodes. | 1 second |
Total Fetch Follower Time | Sum of the Total Fetch Follower Time metric from all nodes. | 1 second |
Average Request Latency versus Throughput
Metric | Description | Granularity |
---|---|---|
Produce Throughput | Sum of the Produce Requests Count metric from all nodes. | 1 second |
Fetch Consumer Throughput | Sum of the Fetch Consumer Requests Count metric from all nodes. | 1 second |
Fetch Follower Throughput | Sum of the Fetch Follower Requests Count metric from all nodes. | 1 second |
Total Produce Time | Sum of the Total Produce Time from all nodes. | 1 second |
Total Fetch Consumer Time | Sum of the Total Fetch Consumer Time from all nodes. | 1 second |
Total Fetch Follower Time | Sum of the Total Fetch Follower Time from all nodes. | 1 second |
All Brokers Traffic
Metric | Description | Granularity |
---|---|---|
In | Sum of the Broker Traffic In from all nodes. | 1 second |
Out | Sum of the Broker Traffic Out from all nodes. | 1 second |
Rejected | Sum of the Broker Traffic Rejected from all nodes. | 1 second |
All Brokers Failures
Metric | Description | Granularity |
---|---|---|
Fetch | Sum of the Broker Failures Fetch from all nodes. | 1 second |
Produce | Sum of the Broker Failures Produce from all nodes. | 1 second |
All Brokers State Metrics
Metric | Description | Granularity |
---|---|---|
Under-replicated Partitions | Sum of the Broker State Metrics Under-replicated Partitions from all nodes. | 1 second |
Offline Partitions | Sum of the Broker State Metrics Offline Partitions from all nodes. | 1 second |
Leader Elections | Sum of the Broker State Metrics Leader Elections from all nodes. | 1 second |
Unclean Leader Elections | Sum of the Broker State Metrics Unclean Leader Elections from all nodes. | 1 second |
ISR Shrinks | Sum of the Broker State Metrics ISR Shrinks from all nodes. | 1 second |
ISR Expansions | Sum of the Broker State Metrics ISR Expansions from all nodes. | 1 second |
Active controller count | Sum of the Broker State Metrics Active controller count from all nodes. | 1 second |
Average Idle Time Percentage
Metric | Description | Granularity |
---|---|---|
Network Processor | Average of the Average Idle Time Network Processor from all nodes. | 1 second |
Request Handler | Average of the Average Idle Time Request Handler from all nodes. | 1 second |
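The cluster-level tables combine node metrics in two ways: most values are summed across nodes, while the idle-time percentages are averaged. A minimal Python sketch of that difference follows. The dictionary keys (bytes_in, failed_fetch, network_idle_pct) and the function name aggregate_cluster are hypothetical and only illustrate the arithmetic.

```python
from statistics import mean
from typing import Dict, List


def aggregate_cluster(nodes: List[Dict[str, float]]) -> Dict[str, float]:
    """Combine per-node values into cluster values.

    Rates and counts are summed; idle percentages are averaged, because
    summing percentages across nodes would exceed 100%.
    Raises statistics.StatisticsError if `nodes` is empty.
    """
    return {
        "bytes_in": sum(node["bytes_in"] for node in nodes),
        "failed_fetch": sum(node["failed_fetch"] for node in nodes),
        "network_idle_pct": mean(node["network_idle_pct"] for node in nodes),
    }


if __name__ == "__main__":
    cluster = aggregate_cluster([
        {"bytes_in": 100.0, "failed_fetch": 1.0, "network_idle_pct": 80.0},
        {"bytes_in": 50.0, "failed_fetch": 0.0, "network_idle_pct": 60.0},
    ])
    print(cluster)
```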
Log Flushing
Metric | Description | Granularity |
---|---|---|
Mean | Sum of the Log Flushing Mean from all nodes. | 1 second |
Flushes | Sum of the Log Flushing Flushes from all nodes. | 1 second |
Cluster Nodes
Metric | Description | Granularity |
---|---|---|
Controller | Whether the node is the controller (Yes/No). | 1 second |
Messages In | Chart with the count of the Broker Messages In. | 1 second |
Bytes In | Chart with the count of the Broker Bytes In. | 1 second |
Bytes Out | Chart with the count of the Broker Bytes Out. | 1 second |
Average Response Time | Chart with the count of the Broker Average Response Time. | 1 second |
Health | The node health indicator. | 1 second |
Cluster Topics
Metric | Description | Granularity |
---|---|---|
Partitions | Number of partitions. | 10 minutes |
Bytes In | Chart with the count of the Topic Bytes In. | 1 second |
Bytes Out | Chart with the count of the Topic Bytes Out. | 1 second |
Bytes Rejected | Chart with the count of the Topic Bytes Rejected. | 1 second |
Messages In | Chart with the count of the Topic Messages In. | 1 second |
Consumer group lag
Metric | Description | Granularity |
---|---|---|
Lag | Consumer group lag per topic. | 60 seconds |
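Consumer group lag is commonly defined as the difference between the latest offset of a partition (log end offset) and the offset that the consumer group has committed for that partition. The following Python sketch sums that difference per topic. It illustrates the general definition and is not the sensor's implementation. The function name lag_per_topic is hypothetical, and treating a missing commit as offset 0 is an assumption.

```python
from collections import defaultdict
from typing import Dict, Tuple

TopicPartition = Tuple[str, int]  # (topic name, partition number)


def lag_per_topic(end_offsets: Dict[TopicPartition, int],
                  committed: Dict[TopicPartition, int]) -> Dict[str, int]:
    """Sum (log end offset - committed offset) over the partitions of each topic."""
    lag: Dict[str, int] = defaultdict(int)
    for topic_partition, end_offset in end_offsets.items():
        topic, _partition = topic_partition
        # Assumption: a partition without a committed offset counts from 0.
        consumed = committed.get(topic_partition, 0)
        # Clamp at 0 so that a stale end offset never yields negative lag.
        lag[topic] += max(end_offset - consumed, 0)
    return dict(lag)


if __name__ == "__main__":
    end = {("orders", 0): 100, ("orders", 1): 50, ("billing", 0): 10}
    done = {("orders", 0): 90, ("orders", 1): 50}
    print(lag_per_topic(end, done))
```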
Consumers
Metric | Description | Granularity |
---|---|---|
Byte Rate | The number of bytes consumed per second. | 1 second |
Throttling | Average throttle time. | 1 second |
Latency | Average fetch latency. | 1 second |
Producers
Metric | Description | Granularity |
---|---|---|
Byte Rate | The number of outgoing bytes sent per second. | 1 second |
Throttling | Average throttle time. | 1 second |
Latency | Average request latency. | 1 second |
To enable the Instana agent to query the Kafka broker for lag-related data, add the PLAINTEXT security protocol for localhost socket connections in the Kafka broker configuration file.
Health Signatures
For each sensor, there is a curated knowledge base of health signatures that are evaluated continuously against the incoming metrics and are used to raise issues or incidents depending on user impact.
Built-in events trigger issues or incidents based on failing health signatures on entities, and custom events trigger issues or incidents based on the thresholds of an individual metric of any given entity.
For information about built-in events for Kafka Node and Cluster, see the Built-in events reference.
Troubleshooting
SSL not configured
Monitoring issue type: kafka_ssl_not_configured
To resolve this issue, follow the steps in SSL/TLS support to configure the Kafka SSL truststore location and password.
SSL client authentication not configured
Monitoring issue type: kafka_ssl_client_not_configured
To resolve this issue, follow the steps in SSL/TLS support to configure Kafka SSL client authentication (keystore location and password).
JMX authentication not configured
Monitoring issue type: kafka_invalid_jmx_credentials
To resolve this issue, follow the steps in JMX authentication support to configure the JMX authentication credentials.