Share this post:
Build messaging solutions: Be proactive in planning and building
This post is a part of the multi-part blog series: Build Messaging Solutions with Apache Kafka or Event Streams for IBM Cloud
One key aspect of a robust architecture is that it is built to smoothly handle system failures, outages, and configuration changes without violating the data loss and consistency requirements of the use case. To proactively build such solutions needs an understanding of the possible exceptions and risky scenarios and preparedness to manage them efficiently. Alongside the design process in the previous blog post in this series are some areas to explore proactively and incorporate in the architecture of your Kafka solution. These are:
- Handling failures and configuration changes
- Handling data processing lag
- Handling duplicate message processing
Handling failures and configuration changes
In the design section of the previous post, it was mentioned that the Consumers must have logic for handling failures and dynamic configuration changes.
So what are these different scenarios that need to be prepared for? Well, there are several such scenarios and some may or may not be applicable to your use case and design.
This is a quick checklist of such failures and changes that the design and logic should be able to handle:
- One or more Broker in the cluster fails or restarted: This can happen due to various reasons, such as network or connectivity issues. You need to make sure you pass multiple brokers’ names for the connection.
- Failed message, write acknowledgments, commit acknowledgments, and processing of duplicates: Check out the use of acknowledgments for Producer writes and Consumer sync/async commits in order to control guarantees and reduce the risk of data loss.
- Consumer in a Consumer Group fails or Consumer is removed or started, restarted, or added for scalability: Above failures or changes can cause Consumer-Partition reassignment (rebalancing) that should be handled by the code if it impacts any logic. Leverage ConsumerRebalaceListener() in your code to capture the rebalancing event and take appropriate measures (e.g., commits) in the code before the new assignment kicks in.
- Partition leader change due to failures: There is a possibility of momentary time out and duplicate messages as the new leader node is selected from the ISRs (in-sync replicas) and uncommitted messages may be resent again. The code should handle such a wait and duplicate message scenario.
- Partition added or removed for scalability: This will result in changes to the key-Partition and Consumer-Partition reassignment. To handle these changes from code, you can leverage polling PartitionsFor() and ConsumerRebalaceListener() mentioned above.
- New Topics added/removed: If Consumers are subscribed to Topics based on regex patterns, caution should be taken when naming and creating new Topics as Consumers may auto-subscribe to the new Topic.
- Failures with partially processed and long-held buffers: If the messages are buffered for processing before doing commits in a Consumer logic, a build-up of buffer for long periods without commits poses two risks:
- Depending on the throughput and sizes, they might hit memory limits on the system, causing failures.
- If the buffered time goes beyond the retention time of offsets or messages, for any failures or restarts after the expired retention time, these messages will no longer be available and will be lost. You need to make sure such buffers are handled before reaching any size or time-based expiration.
Handling data processing lag
Another area of risk that was also mentioned before is data processing lag or backlog. A backlog can be the result of peak or busy periods where message production rate is higher than the consumption. It causes a backlog of unread and uncommitted messages. Another common reason are outages caused by planned maintenance, unplanned maintenance, or other failures on the infrastructure that result in processing being slow or completely stopped.
So, how do you deal with the backlog? Let’s use the analogy of dealing with backed-up traffic.
More lanes: One option to mitigate this risk is scaling out by adding more Consumers and more Topic partitions. Although this is the simplest approach, to make it go smoothly, the Consumers should be tolerant to the configuration change consequences mentioned before, like the changing “key-partition” or “consumer-partition stickiness.” The Consumers should be able to handle new Consumers and Partition changes gracefully without breaking consistency and other requirements.
Divert/bypass: In situations where the backlog is substantial and Consumers can’t process and won’t be able to process backlog, you can have messages diverted. In other words, the Producers are sent to a different Topic temporarily while the actual problem Topic is being worked on, and at the same time, the Consumers read and process from this temporary Topic.
Another option is to have some Consumer tooling in place to fetch and move/commit the messages (all or only the problem ones) from the backlogged Topic over to a separate temporary Topic or another persistent repository. These problem messages can then be processed separately later, but it will help in clearing (or unblocking) the backlog on the current Topics.
Stop traffic: In extreme situations, you might need a graceful process to stop accepting new messages that you won’t be able to process. You need to plan how that will be done without impacting other configurations and the processing of data again when things are back to normal. Note that not all Producers might be in your control.
Drop traffic: Although vehicles don’t become useless if they don’t reach their destination in time, messages can. If no longer needed, you can commit the messages to the offset position, where they can be dropped and to clear the backlog.
Of course, many of the options above should meet the use-case requirements, but having such tooling developed and tested beforehand comes in handy when problems arise.
Handling duplicate message processing
The “at-least once” delivery paradigm prevents loss of messages in Kafka, but the caveat is that you can expect that Consumers may sometimes receive messages more than one time (as duplicates). System behavior—component failures, broker or Consumer restarts, configuration changes, reassignment, rebalancing, lost acknowledgments, lost Consumer offset positions, etc.—can cause messages to received again by the Consumers. Manual administrative activities like add/remove Consumers or Partitions and processing/reprocessing Topic data separately to clear backlogs with or without knowing committed Consumer offset positions can also result in messages being received by the Consumers more than once. How do you prepare for handling duplicates?
Depending on your requirements, there are several strategies to handle duplicates. The most common is to maintain your own processing state of messages (e.g., “read,” “processing,” “completed,” etc.) This can be done in memory for faster access along with a separate persistent database for durability.
Ideally, the logic in Consumers should be proactively designed to have low overhead and handle duplicate messages normally and not as an exception. That helps not only with automatically and safely handling component failures or restarts, but it also goes a long way in simplifying many administrative activities that can then be done as-needed (e.g., scaling with more Consumers, adding removing Partitions, etc.) without having to be concerned too much about dealing with duplicates and reconciliation. The complexity, of course, depends on the level of tolerance of the use case and can be very simple to achieve in many cases.
Reprocessing a message the same way as the initial processing may break the integrity of data (e.g., ordering system where a duplicate message should not end up creating a new order), but in many other use cases, it might be perfectly fine to reprocess the same way.
Improvements and preparedness are continuous iterative activities. To conclude, I’ll mention again the fundamental building blocks that were laid out in the main blog as a target for a fault-tolerant production implementation:
- Zero sustained processing lag (ideally) or a lag that can be overcome without any outages (as a worst-case)
- Resiliency for dynamically handling some key configuration changes that may happen in the Event Streams/Kafka environment, either automatically or by manual initiation
- Resiliency for handling failures and fixes of the various hardware and software components that are part of the solution
- Proactive tooling to handle the known potential risks in a methodical and planned manner
Hope this blog series has been useful to highlight the areas for considering when you design and build with Event Streams for IBM Cloud/Kafka. Links to the rest of the articles in the series are include below. Feedback is welcome.
Learn more about IBM Event Streams for IBM Cloud
Full blog series