March 14, 2024 By Uche Nwankwo 4 min read

Apache Kafka stands as a widely recognized open source event store and stream processing platform. It has evolved into the de facto standard for data streaming, as over 80% of Fortune 500 companies use it. All major cloud providers provide managed data streaming services to meet this growing demand.

One key advantage of opting for managed Kafka services is the delegation of responsibility for broker and operational metrics, allowing users to focus solely on metrics specific to applications. In this article, Product Manager Uche Nwankwo provides guidance on a set of producer and consumer metrics that customers should monitor for optimal performance.

With Kafka, monitoring typically involves various metrics that are related to topics, partitions, brokers and consumer groups. Standard Kafka metrics include information on throughput, latency, replication and disk usage. Refer to the Kafka documentation and relevant monitoring tools to understand the specific metrics available for your version of Kafka and how to interpret them effectively.

Why is it important to monitor Kafka clients?

Monitoring your IBM® Event Streams for IBM Cloud® instance is crucial to ensure optimal functionality and overall health of your data pipeline. Monitoring your Kafka clients helps to identify early signs of application failure, such as high resource usage and lagging consumers and bottlenecks. Identifying these warning signs early enables proactive response to potential issues that minimize downtime and prevent any disruption to business operations.

Kafka clients (producers and consumers) have their own set of metrics to monitor their performance and health. In addition, the Event Streams service supports a rich set of metrics produced by the server. For more information, see Monitoring Event Streams metrics by using IBM Cloud Monitoring.

Client metrics to monitor

Producer metrics

MetricDescription
Record-error-rateThis metric measures the average per-second number of records sent that resulted in errors. A high (or an increase in) record-error-rate might indicate a loss in data or data not being processed as expected. All these effects might compromise the integrity of the data you are processing and storing in Kafka. Monitoring this metric helps to ensure that data being sent by producers is accurately and reliably recorded in your Kafka topics.
Request-latency-avgThis is the average latency for each produce request in ms. An increase in latency impacts performance and might signal an issue. Measuring the request-latency-avg metric can help to identify bottlenecks within your instance. For many applications, low latency is crucial to ensure a high-quality user experience and a spike in request-latency-avg might indicate that you are reaching the limits of your provisioned instance. You can fix the issue by changing your producer settings, for example, by batching or scaling your plan to optimize performance.
Byte-rateThe average number of bytes sent per second for a topic is a measure of your throughput. If you stream data regularly, a drop in throughput can indicate an anomaly in your Kafka instance. The Event Streams Enterprise plan starts from 150MB-per-second split one-to-one between ingress and egress, and it is important to know how much of that you are consuming for effective capacity planning. Do not go above two-thirds of the maximum throughput, to account for the possible impact of operational actions, such as internal updates or failure modes (for example, the loss of an availability zone).
Scroll to view full table
Table 1. Producer metrics

Consumer metrics

MetricDescription
Fetch-rate
fetch-size-avg
The number of fetch requests per second (fetch-rate) and the average number of bytes fetched per request (fetch-size-avg) are key indicators for how well your Kafka consumers are performing. A high fetch-rate might signal inefficiency, especially over a small number of messages, as it means insufficient (possibly no) data is being received each time. The fetch-rate and fetch-size-avg are affected by three settings: fetch.min.bytes, fetch.max.bytes and fetch.max.wait.ms. Tune these settings to achieve the desired overall latency, while minimizing the number of fetch requests and potentially the load on the broker CPU. Monitoring and optimizing both metrics ensures that you are processing data efficiently for current and future workloads.
Commit-latency-avgThis metric measures the average time between a committed record being sent and the commit response being received. Similar to the request-latency-avg as a producer metric, a stable commit-latency-avg means that your offset commits happen in a timely manner. A high-commit latency might indicate problems within the consumer that prevent it from committing offsets quickly, which directly impacts the reliability of data processing. It might lead to duplicate processing of messages if a consumer must restart and reprocess messages from a previously uncommitted offset. A high-commit latency also means spending more time in administrative operations than actual message processing. This issue might lead to backlogs of messages waiting to be processed, especially in high-volume environments.
Bytes-consumed-rateThis is a consumer-fetch metric that measures the average number of bytes consumed per second. Similar to the byte-rate as a producer metric, this should be a stable and expected metric. A sudden change in the expected trend of the bytes-consumed-rate might represent an issue with your applications. A low rate might be a signal of efficiency in data fetches or over-provisioned resources. A higher rate might overwhelm the consumers’ processing capability and thus require scaling, creating more consumers to balance out the load or changing consumer configurations, such as fetch sizes.
Rebalance-rate-per-hourThe number of group rebalances participated per hour. Rebalancing occurs every time there is a new consumer or when a consumer leaves the group and causes a delay in processing. This happens because partitions are reassigned making Kafka consumers less efficient if there are a lot of rebalances per hour. A higher rebalance rate per hour can be caused by misconfigurations leading to unstable consumer behavior. This rebalancing act can cause an increase in latency and might result in applications crashing. Ensure that your consumer groups are stable by tracking a low and stable rebalance-rate-per-hour.
Scroll to view full table
Table 2. Consumer metrics

The metrics should cover a wide variety of applications and use cases. Event Streams on IBM Cloud provide a rich set of metrics that are documented here and will provide further useful insights depending on the domain of your application. Take the next step. Learn more about Event Streams for IBM Cloud. 

What’s next?

You’ve now got the knowledge on essential Kafka clients to monitor. You’re invited to put these points into practice and try out the fully managed Kafka offering on IBM Cloud. For any challenges in set up, see the Getting Started Guide and FAQs.

Learn more about Kafka and its use cases Provision an instance of Event Streams on IBM Cloud
Was this article helpful?
YesNo

More from Cloud

IBM Tech Now: April 8, 2024

< 1 min read - ​Welcome IBM Tech Now, our video web series featuring the latest and greatest news and announcements in the world of technology. Make sure you subscribe to our YouTube channel to be notified every time a new IBM Tech Now video is published. IBM Tech Now: Episode 96 On this episode, we're covering the following topics: IBM Cloud Logs A collaboration with IBM watsonx.ai and Anaconda IBM offerings in the G2 Spring Reports Stay plugged in You can check out the…

The advantages and disadvantages of private cloud 

6 min read - The popularity of private cloud is growing, primarily driven by the need for greater data security. Across industries like education, retail and government, organizations are choosing private cloud settings to conduct business use cases involving workloads with sensitive information and to comply with data privacy and compliance needs. In a report from Technavio (link resides outside ibm.com), the private cloud services market size is estimated to grow at a CAGR of 26.71% between 2023 and 2028, and it is forecast to increase by…

Optimize observability with IBM Cloud Logs to help improve infrastructure and app performance

5 min read - There is a dilemma facing infrastructure and app performance—as workloads generate an expanding amount of observability data, it puts increased pressure on collection tool abilities to process it all. The resulting data stress becomes expensive to manage and makes it harder to obtain actionable insights from the data itself, making it harder to have fast, effective, and cost-efficient performance management. A recent IDC study found that 57% of large enterprises are either collecting too much or too little observability data.…

IBM Newsletters

Get our newsletters and topic updates that deliver the latest thought leadership and insights on emerging trends.
Subscribe now More newsletters