Stretching a Red Hat OpenShift cluster across availability zones

The basic setup for increasing the resilience of a Red Hat OpenShift cluster is to stretch its nodes across multiple hardware units (or availability zones).

Because of the virtualization capabilities of IBM Z, high availability can be achieved even within a single physical machine by using logical partitions (LPARs), which provide highly secure hardware isolation. Because LPAR virtualization is EAL5+ certified, each LPAR can be treated as a logical hardware unit of its own. Deploying a single Red Hat OpenShift cluster with IBM Fusion Data Foundation across multiple LPARs gives you a highly available distributed setup that is spread across largely independent logical units. Because it builds on the redundancy built into IBM Z hardware, a deployment across multiple LPARs and hardware units (CECs) is the simplest setup for high availability.

Figure 1. Single cluster setup for HA within a single data center

When disaster recovery is needed in addition to high availability, a second site needs to be established and the nodes of the cluster are spread across the sites.

There is flexibility in how the nodes of a single cluster can be spread across LPARs, hardware units, and data centers. For IBM Fusion Data Foundation, no separate data replication is necessary. Instead, the data on all storage nodes is mirrored so that it is kept in sync at all times. The overarching control plane of the Red Hat OpenShift cluster schedules the pods, their applications, and their persistent volumes (PVs) across all nodes.

Nevertheless, OpenShift Container Platform imposes one strict requirement: the network latency must stay below 10 ms round-trip time at all times. This is necessary to maintain a quorum of 'etcd' instances and to avoid unnecessary 'etcd' leader changes. The 'etcd' instances keep the database on the control plane nodes in sync. The 'etcd' database is the key-value store for OpenShift Container Platform, which persists the state of all resource objects.
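The quorum and latency conditions described above can be sketched in a few lines of Python. This is a minimal illustration, not a supported tool: it assumes the usual majority rule of a Raft-based store such as 'etcd', and the function names and the sample measurements are hypothetical.

```python
def etcd_quorum(members: int) -> int:
    """Smallest majority of a member set (majority rule of a Raft-based store)."""
    if members < 1:
        raise ValueError("members must be at least 1")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """Number of members that can fail while a quorum still exists."""
    return members - etcd_quorum(members)


def latency_within_limit(rtt_samples_ms, limit_ms: float = 10.0) -> bool:
    """True only if every measured round trip is strictly below the limit.

    An empty sample list counts as a failure, because nothing was verified.
    """
    samples = list(rtt_samples_ms)
    return bool(samples) and max(samples) < limit_ms


# Hypothetical round-trip measurements between two control plane nodes (ms).
samples = [1.8, 2.4, 9.7]
print(etcd_quorum(3), tolerated_failures(3), latency_within_limit(samples))
```

With 3 control plane nodes, the majority is 2, so one 'etcd' member can be lost without losing the quorum. The latency check uses the maximum sample rather than the average, because the requirement applies at all times.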

If the latency stays within the defined limit of 10 ms for a round trip, a stretched cluster can even span multiple data centers to cover aspects of disaster recovery. The flexibility of distributing the nodes can, for example, be used to spread a single cluster of OpenShift Container Platform and IBM Fusion Data Foundation across typically 3 data centers, each representing an availability zone. If one availability zone suffers an outage, the other two availability zones continue to operate without any downtime or recovery time. No recovery is needed and no data is lost. This makes it possible to achieve a Recovery Time Objective (RTO) of zero, as well as a Recovery Point Objective (RPO) of zero.

Note:

This solution for high availability can be applied if you can guarantee low network latency between the data centers or availability zones. All sites or availability zones must be connected by low-latency networks (latency < 5 ms for a full round trip).
Despite the resilience of this highly available setup, it still does not cover larger disaster scenarios in which entire data centers fail. For disaster recovery, a regional or metro DR topology is required, as described in the following sections.

For normal operations, IBM Fusion Data Foundation and OpenShift Container Platform require a minimum of 3 storage nodes with attached storage, but the cluster can still run in a healthy state as long as 2 storage nodes and 2 control plane nodes are operational.
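The thresholds in the preceding paragraph can be written as a small classification function. This is a sketch only: the function name and the three state labels are hypothetical and are not terms used by OpenShift Container Platform or IBM Fusion Data Foundation.

```python
def cluster_state(storage_nodes_up: int, control_nodes_up: int) -> str:
    """Classify a cluster by the number of operational nodes.

    The thresholds follow the text: 3 storage nodes for normal operations,
    and at least 2 storage nodes plus 2 control plane nodes to stay healthy.
    The returned labels are illustrative names chosen for this sketch.
    """
    if storage_nodes_up >= 3 and control_nodes_up >= 3:
        return "normal"
    if storage_nodes_up >= 2 and control_nodes_up >= 2:
        return "healthy-with-reduced-redundancy"
    return "impaired"


for storage_up, control_up in [(3, 3), (2, 3), (2, 2), (1, 3), (3, 1)]:
    print(storage_up, control_up, cluster_state(storage_up, control_up))
```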

In failure scenarios, storage nodes can become unavailable to the Red Hat OpenShift cluster. This can be caused, for example, by:

Unavailability of the LPAR or physical hardware in which a control node is installed.
Failure of a node (for example, a storage, compute, or control node).
Network failures interrupting the interaction between the nodes within the cluster.
Failure of the physical storage or logical OSDs connected to the storage nodes.

The recommended best practices for a 3-LPAR cluster with 3 control nodes and 3 compute nodes include the following constraints:

To run a fully functional Red Hat OpenShift cluster that includes IBM Fusion Data Foundation, the following requirements must be met:
- Control plane: All 3 control plane nodes must always be active.
- Compute nodes: Ensure that you have enough compute nodes to handle failures caused by excessive load or lost infrastructure. If you deploy IBM Fusion Data Foundation, a minimum of 3 compute nodes is required.
- Storage nodes: A minimum of 3 storage nodes must always be active.
A single Red Hat OpenShift cluster must ensure consistency even across sites:
- For Red Hat OpenShift Container Platform, this includes the 'etcd' database, which stores all cluster metadata. The Red Hat OpenShift control plane keeps all copies of the 'etcd' database on all control nodes within a cluster in sync.
- To ensure consistency of 'etcd' within a cluster, the network latency must be less than 10 ms round-trip time.
Storage must be replicated across a multi-site deployment:
- IBM Fusion Data Foundation ensures that the persistent volumes on all storage nodes are kept in sync (provided by Ceph technology).
- For all other storage that is not managed by IBM Fusion Data Foundation, the environment must ensure replication by itself.
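One way to check the spreading described above is to verify that every node role is present in each availability zone (or LPAR). The following Python sketch works on a hypothetical inventory. The node names, zone names, and the dictionary layout are assumptions for illustration. In a real cluster, the zone would typically come from a node label such as `topology.kubernetes.io/zone`.

```python
from collections import defaultdict

# Hypothetical inventory: one control node and one compute node per LPAR.
NODES = [
    {"name": "control-0", "role": "control", "zone": "lpar-a"},
    {"name": "control-1", "role": "control", "zone": "lpar-b"},
    {"name": "control-2", "role": "control", "zone": "lpar-c"},
    {"name": "compute-0", "role": "compute", "zone": "lpar-a"},
    {"name": "compute-1", "role": "compute", "zone": "lpar-b"},
    {"name": "compute-2", "role": "compute", "zone": "lpar-c"},
]


def zones_per_role(nodes):
    """Map each role to the set of zones in which it has at least one node."""
    spread = defaultdict(set)
    for node in nodes:
        spread[node["role"]].add(node["zone"])
    return dict(spread)


def under_spread_roles(nodes, required_zones: int = 3):
    """Return the roles that are present in fewer zones than required."""
    spread = zones_per_role(nodes)
    return sorted(role for role, zones in spread.items()
                  if len(zones) < required_zones)


# An empty list means that every role is spread across all 3 zones.
print(under_spread_roles(NODES))
```

If two compute nodes shared one LPAR, the function would report the role `compute`, which signals that the loss of that LPAR would take out more than one compute node at once.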

If a failure occurs, a single Red Hat OpenShift cluster can deal with failing nodes in a resilient way:

If 1 OSD on one of the compute nodes is down, the cluster should continue to operate normally, and application access (read and write) should not be affected.
If 2 out of 3 OSDs are down, access to data (both read and write) is blocked to preserve data integrity. Once the cluster has rebalanced the data to other OSDs and 2 copies of each object exist again, read/write access is restored.
If 1 compute node is down, the cluster should continue to operate normally, and application access (read and write) should not be affected.
If 2 out of 3 compute nodes are down, the cluster freezes scaling for all pods on the remaining node, but the pods on that node continue to run in a healthy state. Nevertheless, the applications have no access to persistent volumes until the failing compute nodes are restarted or replaced and the IBM Fusion Data Foundation storage nodes become available again. Once the failing compute nodes are restored (either by restart or replacement), the system is fully functional again. Ceph ensures that the storage volumes are in sync at all times.
If there is a network failure on 1 compute node, the cluster should continue to operate normally, and application access (read and write) should not be affected.
If there is a network failure on 2 compute nodes, the cluster freezes scaling for all pods on the remaining node until the failing compute nodes are restarted or replaced. Again, the applications have no access to persistent volumes until the IBM Fusion Data Foundation storage nodes become available again.
If 1 LPAR (control and compute) is down, the cluster should continue to operate normally, and application access (read and write) should not be affected.
If a node is replaced because 1 compute node failed, the cluster should continue to operate normally, application access (read and write) should not be affected, and data recovery should happen automatically on the new node.
In addition to the availability of the compute and storage nodes, OpenShift Container Platform requires 3 control plane nodes to be active. If 2 control nodes are down, the cluster is in an error state and allows only read operations. Once the failing control plane nodes are restored (either by restart or reinstallation), the system is fully functional again.
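The access rules in the scenarios above can be summarized in a short Python sketch. It assumes 3 copies of each object with a minimum of 2 copies for I/O, and 3 control plane nodes, as described in the text. The function names and return labels are hypothetical, and the "unavailable" case for zero control nodes is an assumption that the text does not cover.

```python
def data_access(copies_available: int, min_copies: int = 2) -> str:
    """Data access for an object, based on how many copies are reachable.

    With fewer than min_copies reachable, I/O is blocked to protect integrity.
    """
    return "read/write" if copies_available >= min_copies else "none"


def api_access(control_nodes_up: int, total_control_nodes: int = 3) -> str:
    """Cluster access mode, based on the number of active control nodes."""
    majority = total_control_nodes // 2 + 1
    if control_nodes_up >= majority:
        return "read/write"
    if control_nodes_up >= 1:
        return "read-only"      # error state: only read operations
    return "unavailable"        # assumption, not stated in the text


# Walk through the OSD scenario: 3 copies, then 2 OSDs fail, then rebalancing
# restores a second copy on another OSD.
for copies in (3, 2, 1, 2):
    print(copies, data_access(copies))

# Walk through the control plane scenario: 3, 2, and 1 active control nodes.
for up in (3, 2, 1):
    print(up, api_access(up))
```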

Table 1. Failure scenarios for IBM Fusion Data Foundation
Resource	3 instances	2 instances	1 instance	ODF Data recovery
OSDs (on one storage node)	normal operation	normal operation	no access to application data	self-recovery
Storage / compute node	normal operation	normal operation	outage	self-recovery
Network interface	normal operation	normal operation	outage	self-recovery
LPAR	normal operation	normal operation	outage	self-recovery