Building messaging solutions with Apache Kafka or IBM Event Streams for IBM Cloud
This multi-part blog series walks you through some of the key architectural considerations and steps for building messaging solutions with Apache Kafka or IBM Event Streams for IBM Cloud. This series will be helpful for developers, architects, and technology consultants who have a general understanding of Apache Kafka and are now looking to go deeper into evaluating and building messaging solutions.
Once you start evaluating or working with Apache Kafka, you soon realize that behind its simplicity are numerous configuration “knobs and levers” for controlling how messages are handled. These configuration options provide a lot of flexibility to make Kafka fit many use cases, but at the same time, a poor understanding and implementation of these options can increase the risk of data loss and data inconsistency. It is important that the handling of the various functional aspects of the application is well understood and tested, and it is equally important that the various operational and administrative interdependencies related to the application are proactively incorporated into the architecture. In short, the sooner you understand the behavior, consequences, and interplay of the configurations, the sooner you can mitigate risks, reduce operational overhead, and build a resilient, fault-tolerant solution for production.
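To give a flavor of those “knobs and levers,” here is a sketch of a few producer settings that directly affect data loss and duplication. The property names are standard Apache Kafka producer configs; the values shown are illustrative examples, not recommendations for your workload:

```python
# Illustrative Kafka producer settings that trade some throughput for durability.
# Property names are standard Apache Kafka producer configs; the values are
# examples only -- tune them against your own loss/latency tolerances.
durable_producer_config = {
    "acks": "all",                 # wait for all in-sync replicas to acknowledge
    "retries": 2147483647,         # retry transient send failures indefinitely
    "enable.idempotence": True,    # prevent duplicates introduced by those retries
    "max.in.flight.requests.per.connection": 5,  # ordering still safe with idempotence
    "delivery.timeout.ms": 120000, # upper bound on time to report success or failure
}
```

Each of these interacts with broker- and topic-level settings (such as `min.insync.replicas`), which is exactly why the interplay matters as much as any single option.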
Goals for this blog series
With the above in mind, the goals for this blog series are as follows:
Provide the fundamental building blocks: Architectural considerations and best practices for a production-grade, robust solution with Kafka.
Provide an approach: Iterative steps on how to start building a solution—gathering information, evaluating and designing, and tips for proactively managing and mitigating risks.
Learning and target audience
Like any technology, the concepts related to Kafka are wide and deep. There is plenty of good material already on the internet that can provide an overview of Kafka’s internal architecture and concepts. If you are new to Kafka, the best way to get started and familiarize yourself with the core concepts is the official Kafka documentation.
Once you have a general understanding of Kafka and are looking toward getting deeper into evaluating and building solutions, this blog can serve as a good starting reference template. Using examples and an iterative approach, the blog walks through the requirements-gathering and design process. This can help developers, architects, and consultants get a structured start for designing a solution for their use case. My comments may not be exhaustive or applicable to all use cases and situations, but I have attempted to mention related options and scenarios where possible. Any feedback is welcome.
A (very) brief introduction to Apache Kafka and Event Streams for IBM Cloud
At a high level, the main components in any Kafka solution include Producers and Consumers that interact with Kafka. Producers are part of your own code leveraging Kafka libraries. They create and write messages to Kafka Topics.
Message records get written to a Topic Partition in the order they are received and automatically get a position number or offset assigned. Consumers are part of your client code, leveraging Kafka libraries to read and process messages from the Topic.
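To make the offset model concrete, here is a minimal sketch (plain Python, no Kafka client) of how a partition assigns offsets in arrival order and how a consumer resumes from a position:

```python
class TopicPartition:
    """Toy model of a Kafka topic partition: an append-only log of records."""

    def __init__(self):
        self._log = []

    def append(self, message):
        """Append a message and return its assigned offset (its position)."""
        self._log.append(message)
        return len(self._log) - 1  # offsets start at 0 and increase by 1

    def read_from(self, offset):
        """A consumer fetches all records at or after the given offset."""
        return self._log[offset:]


partition = TopicPartition()
for msg in ("order-created", "order-paid", "order-shipped"):
    partition.append(msg)

# A consumer that has already processed offset 0 resumes from offset 1:
print(partition.read_from(1))  # ['order-paid', 'order-shipped']
```

The real broker adds replication, retention, and batching on top, but the append-only log with monotonically increasing offsets is the core abstraction everything else builds on.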
While Apache Kafka is a great platform, it is also a distributed platform. That means you need to routinely manage the infrastructure of all the distributed components—servers, replicas, and their configurations—in addition to the application logic in your Kafka Producer/Consumer code components.
Fortunately, Event Streams—available on IBM Cloud—takes many of these complexities away from the end user by providing Apache Kafka as a Service. With less to worry about in managing the Apache Kafka cluster and its operations, Event Streams for IBM Cloud enables developers and architects to focus directly on the value from Kafka and design resilient, fault-tolerant solutions on IBM Cloud. It’s quick and easy to set up your own Kafka service and get some hands-on experience. Check out the getting started page for Event Streams for IBM Cloud.
Let’s get to the main goals of this blog post.
The fundamental building blocks
The key to a robust application architecture with Kafka/Event Streams for IBM Cloud is to efficiently handle the flow of messages across the components while serving the desired processing functions at acceptable tolerance levels of data loss and consistency.
To achieve the above, the architecture must be built on these fundamental building blocks:
Zero sustained processing lag* (ideally) or a lag that can be overcome without any outages (as a worst-case)
Resiliency for dynamically handling some key configuration changes that may happen in the Event Streams/Kafka environment, either automatically or by manual initiation
Resiliency for handling failures and fixes of the various hardware and software components that are part of the solution
Proactive tooling to handle the known potential risks in a methodical and planned manner
*Note: The processing lag mentioned above means the delays resulting from the difference in the rate of producing vs. final consumption of messages.
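In Kafka terms, that lag is the gap between the latest produced offset and the consumer group’s committed offset, summed across partitions. A minimal sketch of the arithmetic, using hypothetical offset numbers:

```python
def consumer_lag(end_offsets, committed_offsets):
    """Total lag = sum over partitions of (log end offset - committed offset)."""
    return sum(
        end_offsets[p] - committed_offsets.get(p, 0)
        for p in end_offsets
    )

# Hypothetical snapshot for a 3-partition topic:
end = {0: 1500, 1: 1498, 2: 1502}        # latest offsets written by producers
committed = {0: 1500, 1: 1490, 2: 1400}  # offsets the consumer group has committed
print(consumer_lag(end, committed))      # 110 -- if this grows steadily, consumers are falling behind
```

A lag snapshot by itself is harmless; it is *sustained growth* in this number that signals the consumers cannot keep up with the producers.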
While this might appear to be a lot to build on, remember our target is a resilient production solution. Almost always, such solutions are built and improved iteratively. With some guidance and structure, it’s not too difficult to get started and achieve the above target.
The main questions are:
What do the above fundamental blocks mean from a design and implementation perspective?
How do you start designing and building a solution that incorporates them in the architecture?
The remainder of the blog attempts to address the above questions through an iterative design approach.
The iterative approach
A set of iterative steps to help you in the evaluation and design process are outlined below. Each step is then described in detail in separate blogs that will follow in this series.
Quantify use case and requirements for evaluation.
Evaluate by attempting a preliminary design, checking constraints and capabilities.
Be proactive in planning and building for fault tolerance.
The iterative process
1. Quantify use case and requirements for evaluation
The first step is all about knowing the requirements of the use case around some key areas and quantifying these requirements as much as possible. Several categories are outlined below for which details and numbers should be gathered. If you don’t have everything, it’s okay—that’s why this is iterative.
Part of the goal is for you to start looking at the use case requirements through these categories as they have a direct bearing on the evaluation—sizing, configuration, and architecture. These categories are as follows:
Content (types and sizes of messages)
Throughput (expected rate of message generation, ingestion, consumption)
Processing and latencies (type of processing, async batch/micro-batch, streaming)
Data durability and consistency guarantees (tolerance for data loss, delays)
Data retention (persistence duration, retention of data)
Other (security, development, etc.)
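Quantifying these categories quickly turns into back-of-the-envelope sizing. As one hypothetical example (the numbers and the simple formula below are illustrative, ignoring compression and overhead), the storage a topic needs over its retention window follows directly from message rate, average size, retention period, and replication factor:

```python
def topic_storage_gb(msgs_per_sec, avg_msg_bytes, retention_days, replication_factor):
    """Rough uncompressed storage a topic needs over its retention window."""
    bytes_total = (
        msgs_per_sec * avg_msg_bytes          # bytes produced per second
        * retention_days * 86400              # seconds retained
        * replication_factor                  # copies kept across brokers
    )
    return bytes_total / 1e9

# Hypothetical use case: 1,000 msgs/s of 2 KB each, kept 7 days, replication factor 3
print(round(topic_storage_gb(1000, 2048, 7, 3), 1))  # ~3715.9 GB
```

Even a rough number like this shapes the later evaluation: it tells you whether the use case fits comfortably within a plan’s capacity limits or needs partitioning and retention trade-offs.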
2. Evaluate by attempting a preliminary design, checking constraints and capabilities
Once you have gathered some initial requirements, you need to start building a preliminary design and evaluating the viability of your solution. Working through the questions below will help you get started and oriented with the important considerations about the constraints and capabilities that should be reviewed in the design process.
Where do I begin the design process?
How and where will I write messages, fetch them, and process them with scale?
How do I handle component failures, outages, and configuration changes?
What are the main capacity, sizing, or configuration limits I should be aware of?
How do I check that the environment stays healthy and keeps working as designed?
3. Be proactive in planning and building for fault tolerance
One key aspect of a robust architecture is that it is built to smoothly handle system failures, outages, and configuration changes without violating the data loss and consistency requirements. Proactively building such solutions requires an understanding of the possible exceptions and risky scenarios, and a preparedness to manage them. Below are some areas to explore proactively and incorporate into the architecture:
Handling failures and configuration changes
Handling data processing lag
Handling duplicate message processing
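On that last point: Kafka’s default delivery guarantee is at-least-once, so consumers should expect occasional redelivery (for example, after a rebalance that happens before offsets were committed). One common mitigation is idempotent processing keyed on a unique message ID. A minimal sketch, with an in-memory set standing in for what would be a durable store in production:

```python
class IdempotentProcessor:
    """Processes each message at most once by remembering already-seen IDs.

    In production the seen-ID store would be durable and shared (e.g., a
    database table keyed by message ID), not an in-memory set.
    """

    def __init__(self):
        self._seen = set()
        self.processed = []

    def handle(self, message_id, payload):
        if message_id in self._seen:
            return False  # duplicate delivery: skip the side effect
        self._seen.add(message_id)
        self.processed.append(payload)  # stand-in for the real side effect
        return True


proc = IdempotentProcessor()
proc.handle("m-1", "debit $10")
proc.handle("m-2", "credit $10")
proc.handle("m-1", "debit $10")   # redelivered after a rebalance -- ignored
print(proc.processed)             # ['debit $10', 'credit $10']
```

The design choice here is to make the *processing* idempotent rather than trying to make delivery exactly-once end to end, which keeps the consumer safe under rebalances and retries alike.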
In the next part of this blog series, we’ll take a deeper and more thorough look at Step 1: Quantify.