Using Availability Zones to Enhance Event Streams Resilience

By: Adrian Preston

Deploying Kafka across availability zones with Event Streams

IBM Event Streams is a managed platform built on Apache Kafka. It is available on-premises, to run on your own servers, or as a hosted service in IBM Cloud. This post applies to the Enterprise plan for the hosted version of Event Streams, and it describes how we deploy Kafka across availability zones to maximize both its resilience to failures and the durability of your message data. We also cover how applications can use Kafka to achieve the right balance of availability and durability for your business needs.

Availability zones

A bit of slang used when talking about high availability is blast radius. It’s a colorful way to ask: “If component X goes wrong, how many other components will be affected?” We’re always looking for ways to reduce the blast radius of a failure, and availability zones are one of the ways we can do this.

An availability zone is a set of isolated infrastructure (e.g., power, cooling, and network connectivity) that is designed to limit the blast radius of an infrastructure problem like a power outage.

IBM Cloud supports the concept of availability zones via multi-zone regions. These are geographic regions (e.g., US-South, US-East, and Central Europe) with the redundant infrastructure needed to support multiple availability zones.

The Event Streams Enterprise plan is designed to take full advantage of multi-zone regions to maximize the availability of the service while also ensuring the durability with which message data is stored. When you create an instance of the Event Streams service in one of these regions, we deploy Kafka brokers (and other key management microservices) into each of the three availability zones. This configuration means that message data sent to Kafka has a replica stored in each of the three availability zones. Once message data is replicated into three availability zones, it becomes extremely resilient to infrastructure failures, because three isolated sets of infrastructure need to fail before there is any risk of data loss.
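
To make this concrete, here is a minimal sketch, in Java with the standard Kafka admin client, of creating a topic whose three replicas the cluster can then spread across the three availability zones. The bootstrap address and topic name are placeholders rather than real Event Streams values.

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder address; use the bootstrap endpoints from your service credentials.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-0.example.com:9093");

        try (AdminClient admin = AdminClient.create(props)) {
            // One replica per availability zone: replication factor 3.
            NewTopic topic = new NewTopic("orders", 3, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```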

Durability of message data

Apache Kafka offers a range of configuration options that control the durability of message data. Often, there is a trade-off between how readily Kafka accepts your message data and the risk that some of that data might be lost. When you use the Kafka API, one of the key options for controlling this trade-off is the ‘acks’ property of the Kafka producer. This property controls when the client considers Kafka to have successfully stored a message and acknowledged (hence ‘acks’) its receipt. The ‘acks’ property can be set to one of the following values (a configuration sketch follows the list):

  • acks=0 

    • The client behaves as if each message is acknowledged as soon as it is sent.

    • There is no assurance that messages sent like this even arrive at Kafka.

    • With this configuration, a Kafka application will always be able to produce messages.

    • The trade-off is that this setting provides the least durable storage of message data, as a brief interruption to the network between the client and Kafka is enough to cause data to be lost.

  • acks=1 

    • Kafka acknowledges receipt of each message at the point it is stored on one broker (the partition leader), before it has been replicated to any other brokers.

    • The client can produce messages if as few as one broker is available to store the message data. The risk is that if that broker, or the infrastructure hosting it, fails before the data is replicated elsewhere, the sent message data can be lost.

    • This is the default setting.

  • acks=all

    • Copies of the message need to be stored on a minimum number of brokers (with the Event Streams Enterprise plan, the minimum number is two) before receipt of the message is acknowledged to the client.

    • This offers the best level of durability because there is a very low risk that the message data might be lost.

    • The trade-off is that the attempt to produce the message will fail if there are fewer than two brokers available where copies of the message data can be stored.

    • This is the recommended setting.
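
As a reference point, here is a minimal producer sketch, in Java, showing where the ‘acks’ property is set. The bootstrap address and topic name are placeholders, and the connection details a real Event Streams instance requires are omitted for brevity.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class AcksExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-0.example.com:9093"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // The durability trade-off described above:
        //   "0"   - fire and forget, no acknowledgement
        //   "1"   - acknowledged once one broker (the leader) has stored the message
        //   "all" - acknowledged once the minimum number of in-sync replicas have stored it
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "key", "value"));
        }
    }
}
```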

Why did we set the minimum number of brokers to two when using acks=all?

Occasionally, we need to restart a broker to apply maintenance. A restarted broker can take a short time to re-synchronize with the messages sent while it was offline, during which it is unavailable to store new messages. So, by setting the minimum number of brokers to two, messages can still be produced with acks=all while we upgrade the cluster.
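
The Kafka setting behind this minimum is the topic's min.insync.replicas configuration. If you want to check it for yourself, the sketch below, which assumes a hypothetical topic named "orders" and again uses a placeholder bootstrap address, reads that value with the Kafka admin client; on the Enterprise plan it should report 2.

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

public class ShowMinInsyncReplicas {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-0.example.com:9093"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // Hypothetical topic name; substitute one of your own topics.
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "orders");
            Config config = admin.describeConfigs(Collections.singleton(topic))
                                 .all().get()
                                 .get(topic);

            // The minimum number of in-sync replicas that must store a message
            // before an acks=all send is acknowledged.
            System.out.println(config.get("min.insync.replicas").value());
        }
    }
}
```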

We recommend that you set ‘acks=all’ in your producer configuration because this offers the best level of durability.

To illustrate this, let’s consider the kinds of failures that messages produced using ‘acks=all’ are shielded from (a sketch of how a producer handles the failure case follows the list):

  • From the producer perspective: If we lose an entire availability zone (and with it, the broker in that zone), the producer can still send messages for durable storage to the two brokers in the remaining availability zones.

  • From the message data perspective: Once a message has been stored, it remains durably stored as long as at least one Kafka broker holds a copy of the message data. This means that two entire availability zones can fail without impacting the durability of message data stored by Event Streams.
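
To see what this trade-off looks like in application code, here is a sketch, again with placeholder addresses and topic names, of an acks=all producer whose send callback distinguishes the case where too few in-sync replicas were available. The Java client normally retries that condition, but it can surface as a NotEnoughReplicasException if it persists.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.NotEnoughReplicasException;
import org.apache.kafka.common.serialization.StringSerializer;

public class DurableSendExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-0.example.com:9093"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "key", "value"), (metadata, exception) -> {
                if (exception == null) {
                    // At least two brokers, in different availability zones, have stored the message.
                    System.out.println("Stored durably at offset " + metadata.offset());
                } else if (exception instanceof NotEnoughReplicasException) {
                    // Fewer than the minimum number of in-sync replicas were available.
                    System.err.println("Not enough in-sync replicas: " + exception.getMessage());
                } else {
                    System.err.println("Send failed: " + exception.getMessage());
                }
            });
        }
    }
}
```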

Conclusions

Now you know how Event Streams makes use of availability zones, both to maximize the availability of the service and to offer very durable storage of your message data.

It’s important to note that the ‘acks’ property of the Kafka producer is something you’ll need to set when you connect your application to Event Streams.
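
As a sketch of where that setting fits, the snippet below assembles producer properties that combine acks=all with the kind of SASL/TLS connection settings a hosted Kafka service typically requires. The security values shown, including the username/password scheme, are assumptions for illustration; take the exact bootstrap endpoints, credentials, and security settings from your Event Streams service credentials and documentation.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.ProducerConfig;

public class EventStreamsProducerConfig {
    // Builds the settings to combine: the durability choice (acks) plus the
    // connection details taken from your Event Streams service credentials.
    public static Properties build(String bootstrapServers, String apiKey) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // the recommended durability setting

        // Assumed security settings for illustration; verify the exact values
        // against the Event Streams documentation for your service instance.
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.mechanism", "PLAIN");
        props.put("sasl.jaas.config",
                "org.apache.kafka.common.security.plain.PlainLoginModule required "
                + "username=\"token\" password=\"" + apiKey + "\";");
        return props;
    }
}
```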

