Apache Kafka stands as a widely recognized open source event store and stream processing platform. It has evolved into the de facto standard for data streaming, as over 80% of Fortune 500 companies use it. All major cloud providers provide managed data streaming services to meet this growing demand.
One key advantage of opting for managed Kafka services is the delegation of responsibility for broker and operational metrics, allowing users to focus solely on metrics specific to applications. In this article, Product Manager Uche Nwankwo provides guidance on a set of producer and consumer metrics that customers should monitor for optimal performance.
With Kafka, monitoring typically involves various metrics that are related to topics, partitions, brokers and consumer groups. Standard Kafka metrics include information on throughput, latency, replication and disk usage. Refer to the Kafka documentation and relevant monitoring tools to understand the specific metrics available for your version of Kafka and how to interpret them effectively.
Monitoring your IBM® Event Streams for IBM Cloud® instance is crucial to ensure optimal functionality and overall health of your data pipeline. Monitoring your Kafka clients helps to identify early signs of application failure, such as high resource usage and lagging consumers and bottlenecks. Identifying these warning signs early enables proactive response to potential issues that minimize downtime and prevent any disruption to business operations.
Kafka clients (producers and consumers) have their own set of metrics to monitor their performance and health. In addition, the Event Streams service supports a rich set of metrics produced by the server. For more information, see Monitoring Event Streams metrics by using IBM Cloud Monitoring.
Metric | Description |
Record-error-rate | This metric measures the average per-second number of records sent that resulted in errors. A high (or an increase in) record-error-rate might indicate a loss in data or data not being processed as expected. All these effects might compromise the integrity of the data you are processing and storing in Kafka. Monitoring this metric helps to ensure that data being sent by producers is accurately and reliably recorded in your Kafka topics. |
Request-latency-avg | This is the average latency for each produce request in ms. An increase in latency impacts performance and might signal an issue. Measuring the request-latency-avg metric can help to identify bottlenecks within your instance. For many applications, low latency is crucial to ensure a high-quality user experience and a spike in request-latency-avg might indicate that you are reaching the limits of your provisioned instance. You can fix the issue by changing your producer settings, for example, by batching or scaling your plan to optimize performance. |
Byte-rate | The average number of bytes sent per second for a topic is a measure of your throughput. If you stream data regularly, a drop in throughput can indicate an anomaly in your Kafka instance. The Event Streams Enterprise plan starts from 150MB-per-second split one-to-one between ingress and egress, and it is important to know how much of that you are consuming for effective capacity planning. Do not go above two-thirds of the maximum throughput, to account for the possible impact of operational actions, such as internal updates or failure modes (for example, the loss of an availability zone). |
Metric | Description |
Fetch-rate fetch-size-avg | The number of fetch requests per second (fetch-rate) and the average number of bytes fetched per request (fetch-size-avg) are key indicators for how well your Kafka consumers are performing. A high fetch-rate might signal inefficiency, especially over a small number of messages, as it means insufficient (possibly no) data is being received each time. The fetch-rate and fetch-size-avg are affected by three settings: fetch.min.bytes, fetch.max.bytes and fetch.max.wait.ms. Tune these settings to achieve the desired overall latency, while minimizing the number of fetch requests and potentially the load on the broker CPU. Monitoring and optimizing both metrics ensures that you are processing data efficiently for current and future workloads. |
Commit-latency-avg | This metric measures the average time between a committed record being sent and the commit response being received. Similar to the request-latency-avg as a producer metric, a stable commit-latency-avg means that your offset commits happen in a timely manner. A high-commit latency might indicate problems within the consumer that prevent it from committing offsets quickly, which directly impacts the reliability of data processing. It might lead to duplicate processing of messages if a consumer must restart and reprocess messages from a previously uncommitted offset. A high-commit latency also means spending more time in administrative operations than actual message processing. This issue might lead to backlogs of messages waiting to be processed, especially in high-volume environments. |
Bytes-consumed-rate | This is a consumer-fetch metric that measures the average number of bytes consumed per second. Similar to the byte-rate as a producer metric, this should be a stable and expected metric. A sudden change in the expected trend of the bytes-consumed-rate might represent an issue with your applications. A low rate might be a signal of efficiency in data fetches or over-provisioned resources. A higher rate might overwhelm the consumers’ processing capability and thus require scaling, creating more consumers to balance out the load or changing consumer configurations, such as fetch sizes. |
Rebalance-rate-per-hour | The number of group rebalances participated per hour. Rebalancing occurs every time there is a new consumer or when a consumer leaves the group and causes a delay in processing. This happens because partitions are reassigned making Kafka consumers less efficient if there are a lot of rebalances per hour. A higher rebalance rate per hour can be caused by misconfigurations leading to unstable consumer behavior. This rebalancing act can cause an increase in latency and might result in applications crashing. Ensure that your consumer groups are stable by tracking a low and stable rebalance-rate-per-hour. |
The metrics should cover a wide variety of applications and use cases. Event Streams on IBM Cloud provide a rich set of metrics that are documented here and will provide further useful insights depending on the domain of your application. Take the next step. Learn more about Event Streams for IBM Cloud.
You’ve now got the knowledge on essential Kafka clients to monitor. You’re invited to put these points into practice and try out the fully managed Kafka offering on IBM Cloud. For any challenges in set up, see the Getting Started Guide and FAQs.
Experience IBM API Connect with a free trial or connect with our experts to discuss your needs. Whether you're ready to optimize your API management or want to learn more, we're here to support your digital transformation.
Discover the full potential of your integration processes with AI-powered solutions. Schedule a meeting with our experts or explore our product documentation to get started.
Supercharge your business with IBM MQ's secure, high-performance messaging solutions. Start your free trial or connect with our experts to explore how IBM MQ can transform your operations.
Experience faster, more secure file transfers—any size, any distance. Try IBM Aspera today and streamline your data workflows with high-speed efficiency.
Transform your business by effortlessly connecting apps and data. Start your free trial today and see how IBM App Connect can streamline your integration journey.
Explore how IBM DataPower Gateway enhances security, control, and performance for your cloud and on-premises applications. Book a meeting now to get started with a free container evaluation!
A complete solution for modernizing integrations across hybrid environments, allowing your team to accelerate application deployment while cutting down costs and complexity.
Streamline your digital transformation with IBM’s hybrid cloud solutions, built to optimize scalability, modernization, and seamless integration across your IT infrastructure.
IBM Cloud Infrastructure Center is an OpenStack-compatible software platform for managing the infrastructure of private clouds on IBM zSystems and IBM LinuxONE.