Build messaging solutions: The iterative approach
This blog is a part of the multi-part blog series: Build Messaging Solutions with Apache Kafka or Event Streams for IBM Cloud
As part of the iterative approach described in the main introduction blog of this series, the first step is to building messaging solutions is to identify the use case requirements and quantify these requirements as much as possible in terms of Apache Kafka and Event Streams. Several categories are outlined below that provide structure to what needs to be gathered.
The goal is to start looking at your use case requirements through these categories because they have a direct bearing on the evaluation—sizing, configuration, and architecture. If you don’t have everything, it’s okay—that’s why this is an iterative process.
Quantify: The type of information needed to start a preliminary design
Content (types and sizes of messages)
Throughput (expected rate of message generation, ingestion, and consumption)
Processing and latencies (type of processing, async batch/micro-batch, streaming)
Data durability and consistency guarantees (tolerance for data loss, delays)
Data retention (persistence duration, retention of data)
Other—development, security, etc.
To make it easier to follow, we will use an example to demonstrate the process of requirements gathering, sizing calculations, and design. Let’s assume an IoT use case of smart health-bands worn by athletes to monitor their health data. We will build a solution that receives messages from the health-band devices, processes them as per the needed requirements, and persists the information.
Content (types and sizes of messages)
Let’s assume that an active smart health-band sends “health-beats”—messages that include geolocation, basic health, and diagnostics information (heart rate, metabolism level, body temperature, etc.). Based on the information received, any special activity event or emergency needs to be detected by the application and flagged for notification. The messages received should be persisted for use by other systems.The content will be best organized and sent as a JSON document that has attributes and values. The main identifiers that can be leveraged in the messages are a unique health-band serial number, model number, manufacturer batch number, etc.
--Understand the Content--
Health-beat messages: Basic health and diagnostics information
Context or use of the messages:
1. Get geolocation and health status of users (90% of the messages)
2. Trigger actions/notifications if a certain pattern of health or activity is detected from the messages (10% of the messages)
3. Persist the messages for use by other systems (e.g., analytics, billing, etc.)
Available identifiers: device serial number, model number
Apache Kafka throughput (expected rate of message generation, ingestio, and consumption)
With IoT features, let’s again assume that after accounting for some yearly sales and growth, the manufacturer expects 100,000 smart health-band units sold, connected, and sending messages. We use that as the number for our design and capacity calculations.
Next, we gather information about the size of the data sent and how frequently each health-band transmits. We assume health-beats are sent every 10 minutes, with an average message size 3KB. We can use these to do some throughput calculations about the rate at which the messages will be produced and received by Apache Kafka.
--Calculate the Throughput--
Total devices: 100,000 smart health-bands
Health-beat messages sent every 10 minutes by each health-band: 1/(10*60) messages per second
Messages/sec: 100000*(1/(10*60)) = ~167 health-beat messages/sec
KBytes/sec: 3KB*167 = 501 KB/sec
Based on its content, we need to understand how the data will be handled and processed for the required data durability, retention, and consistency guarantees.
For the next three categories mentioned below, the requirements are tied together closely. We will review them together.
Processing and Latencies (type of processing, async batch/micro-batch, streaming)
Data Durability and Consistency Guarantees (tolerance for data loss, delays)
Data Retention (persistence duration, retention of data)
For our IoT example, let us assume these processing requirements: the health-beat data can be processed asynchronously but must be done almost as soon as it is received. It may be batched for processing for, at most, up to an hour (up to 6 messages as messages are sent every 10 minutes per health-band). It is not acceptable to lose any messages/data. The messages/data should also be available for up to one year for use by other analytics systems.
For a preliminary design, we mainly need to quantify how long each message processing will take to meet the functional requirements (and also include in the latencies for achieving the required guarantees). Getting deeper, we have the two functional requirements from the messages as 1) health check and 2) triggering actions for any health anomalies. The data guarantees needed are are zero data loss, consistency, and persistence. To achieve all this, the processing would require a combination of logic for conditional checks, data pattern checks, buffering, persisting and committing small batches, and (in about 10% of the messages) triggering other APIs for health diagnostics and then confirming the success of the action.
We have these approximate processing latency numbers per message for our IoT example:
--Understand the Processing and Latencies--
Latency per message for functional and data guarantees:
- Logic for conditional checks = 5ms
- Pattern checks = 10ms
- Buffering, persisting, and committing small batches = 10ms
(100ms total for 10 buffered messages 100ms/10 = 10ms)
- Triggering other APIs and confirming success of the action= 20ms
(200ms for 10% of the massages; 200ms*10/100=20ms)
Average processing latency per message: 5+10+10+20 = 45ms
As you can see, the above quantification can get complex; hopefully, it gives you an idea of what is needed. While it is good to get into details, as the preliminary design attempt, the focus can be only major latency contributors. Typically these are:
Processing any buffered messages.
Consistency related checks or updates by reading/writing from another system.
Time to write/persist the messages.
The above numbers will come in handy in the sizing and evaluation that can be done iteratively (as in the next blog), but you will almost always need something to start with. The numbers should be quantified (at least approximately) through calculations or actual testing and noted as key variables for later iterations.
An often-related part of data guarantee is for disaster recovery (DR). In addition to the high-availability guarantees from a single Kafka cluster (usually same data center), if cross-data center data redundancy is required, mirror-maker can be used for “mirroring” messages across Kafka clusters in another data center to achieve DR type redundancy in your design.
Other—development, security, etc.
The Producers and Consumers are basically client code interacting with Kafka. These can be part of existing code or separate processes developed by leveraging many of the available open source Apache Kafka client libraries. For Security, Apache Kafka supports TLS/SSL for on-the-wire encryption and SASL Authentication. It is important, however, to verify the specific versions and mechanisms supported by the client libraries that you use.
Once you have gathered and quantified some of the initial requirements from the above categories, you can attempt a preliminary design to evaluate the viability of your solution. This is described in the next blog—Part 2: Evaluate.
Full blog series