Getting started with Kafka client metrics

Apache Kafka stands as a widely recognized open source event store and stream processing platform. It has evolved into the de facto standard for data streaming, as over 80% of Fortune 500 companies use it. All major cloud providers provide managed data streaming services to meet this growing demand.

One key advantage of opting for managed Kafka services is the delegation of responsibility for broker and operational metrics, allowing users to focus solely on metrics specific to applications. In this article, Product Manager Uche Nwankwo provides guidance on a set of producer and consumer metrics that customers should monitor for optimal performance.

With Kafka, monitoring typically involves various metrics that are related to topics, partitions, brokers and consumer groups. Standard Kafka metrics include information on throughput, latency, replication and disk usage. Refer to the Kafka documentation and relevant monitoring tools to understand the specific metrics available for your version of Kafka and how to interpret them effectively.

Why is it important to monitor Kafka clients?

Monitoring your IBM® Event Streams for IBM Cloud® instance is crucial to ensure optimal functionality and overall health of your data pipeline. Monitoring your Kafka clients helps to identify early signs of application failure, such as high resource usage and lagging consumers and bottlenecks. Identifying these warning signs early enables proactive response to potential issues that minimize downtime and prevent any disruption to business operations.

Kafka clients (producers and consumers) have their own set of metrics to monitor their performance and health. In addition, the Event Streams service supports a rich set of metrics produced by the server. For more information, see Monitoring Event Streams metrics by using IBM Cloud Monitoring.

Client metrics to monitor

Producer metrics

Metric	Description
Record-error-rate	This metric measures the average per-second number of records sent that resulted in errors. A high (or an increase in) record-error-rate might indicate a loss in data or data not being processed as expected. All these effects might compromise the integrity of the data you are processing and storing in Kafka. Monitoring this metric helps to ensure that data being sent by producers is accurately and reliably recorded in your Kafka topics.
Request-latency-avg	This is the average latency for each produce request in ms. An increase in latency impacts performance and might signal an issue. Measuring the request-latency-avg metric can help to identify bottlenecks within your instance. For many applications, low latency is crucial to ensure a high-quality user experience and a spike in request-latency-avg might indicate that you are reaching the limits of your provisioned instance. You can fix the issue by changing your producer settings, for example, by batching or scaling your plan to optimize performance.
Byte-rate	The average number of bytes sent per second for a topic is a measure of your throughput. If you stream data regularly, a drop in throughput can indicate an anomaly in your Kafka instance. The Event Streams Enterprise plan starts from 150MB-per-second split one-to-one between ingress and egress, and it is important to know how much of that you are consuming for effective capacity planning. Do not go above two-thirds of the maximum throughput, to account for the possible impact of operational actions, such as internal updates or failure modes (for example, the loss of an availability zone).

Scroll to view full table

Table 1. Producer metrics

Consumer metrics

Metric	Description
Fetch-rate fetch-size-avg	The number of fetch requests per second (fetch-rate) and the average number of bytes fetched per request (fetch-size-avg) are key indicators for how well your Kafka consumers are performing. A high fetch-rate might signal inefficiency, especially over a small number of messages, as it means insufficient (possibly no) data is being received each time. The fetch-rate and fetch-size-avg are affected by three settings: fetch.min.bytes, fetch.max.bytes and fetch.max.wait.ms. Tune these settings to achieve the desired overall latency, while minimizing the number of fetch requests and potentially the load on the broker CPU. Monitoring and optimizing both metrics ensures that you are processing data efficiently for current and future workloads.
Commit-latency-avg	This metric measures the average time between a committed record being sent and the commit response being received. Similar to the request-latency-avg as a producer metric, a stable commit-latency-avg means that your offset commits happen in a timely manner. A high-commit latency might indicate problems within the consumer that prevent it from committing offsets quickly, which directly impacts the reliability of data processing. It might lead to duplicate processing of messages if a consumer must restart and reprocess messages from a previously uncommitted offset. A high-commit latency also means spending more time in administrative operations than actual message processing. This issue might lead to backlogs of messages waiting to be processed, especially in high-volume environments.
Bytes-consumed-rate	This is a consumer-fetch metric that measures the average number of bytes consumed per second. Similar to the byte-rate as a producer metric, this should be a stable and expected metric. A sudden change in the expected trend of the bytes-consumed-rate might represent an issue with your applications. A low rate might be a signal of efficiency in data fetches or over-provisioned resources. A higher rate might overwhelm the consumers’ processing capability and thus require scaling, creating more consumers to balance out the load or changing consumer configurations, such as fetch sizes.
Rebalance-rate-per-hour	The number of group rebalances participated per hour. Rebalancing occurs every time there is a new consumer or when a consumer leaves the group and causes a delay in processing. This happens because partitions are reassigned making Kafka consumers less efficient if there are a lot of rebalances per hour. A higher rebalance rate per hour can be caused by misconfigurations leading to unstable consumer behavior. This rebalancing act can cause an increase in latency and might result in applications crashing. Ensure that your consumer groups are stable by tracking a low and stable rebalance-rate-per-hour.

Scroll to view full table

Table 2. Consumer metrics

The metrics should cover a wide variety of applications and use cases. Event Streams on IBM Cloud provide a rich set of metrics that are documented here and will provide further useful insights depending on the domain of your application. Take the next step. Learn more about Event Streams for IBM Cloud.

What’s next?

You’ve now got the knowledge on essential Kafka clients to monitor. You’re invited to put these points into practice and try out the fully managed Kafka offering on IBM Cloud. For any challenges in set up, see the Getting Started Guide and FAQs.

Author

Uche Nwankwo

Product Manager, Event Streams on IBM Cloud

Learn more about Kafka and its use cases

Provision an instance of Event Streams on IBM Cloud