Regional disaster recovery

The regional disaster recovery topology creates an active/passive setup to cover regional data center failures. Such a deployment becomes necessary when the data centers are so far apart geographically that the latency requirements for a single cluster cannot be met.

An active/passive setup is based on two independent and self-contained clusters of OpenShift Container Platform and IBM Fusion Data Foundation, each running in a separate data center. One cluster is active. The other one is passive and serves as a stand-by if there is an outage.

The persistent volumes of IBM Fusion Data Foundation are replicated asynchronously from the active site to the passive site. Red Hat OpenShift Data Foundation (Ceph) implements the replication at the persistent volume level. The replication interval can be set as short as one minute.
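
As a rough illustration of where the replication interval is configured: in Regional-DR the interval is typically set on a DRPolicy resource on the hub cluster. The following sketch builds such a manifest as a Python dictionary and prints it as JSON. The policy name, the cluster names, and the 5-minute interval are placeholders. The field names (`drClusters`, `schedulingInterval`) and the interval format are assumptions about the DRPolicy API that should be verified against the Red Hat documentation for the installed version.

```python
import json
import re


def build_dr_policy(name, clusters, interval):
    """Return a DRPolicy manifest as a dict (sketch, field names assumed)."""
    # Regional DR pairs one active and one passive managed cluster.
    if len(clusters) != 2:
        raise ValueError("Regional DR pairs exactly two managed clusters")
    # Assumed interval format: a positive number followed by m, h, or d.
    if not re.fullmatch(r"[1-9]\d*[mhd]", interval):
        raise ValueError("interval must look like '5m', '1h', or '1d'")
    return {
        "apiVersion": "ramendr.openshift.io/v1alpha1",
        "kind": "DRPolicy",
        "metadata": {"name": name},
        "spec": {
            "drClusters": list(clusters),
            "schedulingInterval": interval,
        },
    }


if __name__ == "__main__":
    # Hypothetical names; replace with the names of the managed clusters.
    policy = build_dr_policy(
        "regional-dr-5m", ["cluster-east", "cluster-west"], "5m"
    )
    print(json.dumps(policy, indent=2))
```

The printed JSON could be reviewed and then applied on the hub cluster. A shorter interval lowers the potential data loss but increases replication traffic between the sites.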

This setup targets data centers that are located far apart from each other (regional data centers). The network latency between them is too high to allow reliable synchronous replication or a stretched cluster.

To orchestrate the two independent clusters, an extra component is required: Red Hat Advanced Cluster Management for Kubernetes (RHACM). RHACM is installed on the hub cluster, and its policies define to which of the managed clusters the traffic is routed.

If there is an outage at the active site, an administrator can trigger a failover to the stand-by site. Because all namespaces of the active site are synchronized to the stand-by site, the applications can continue to run there. Because the failover procedure is automated, the Recovery Time Objective (RTO) is minimized and the transition during an outage is smooth. The second site can be used as a "hot" stand-by, with all applications up and running and ready to pick up the load if there is a disaster. Alternatively, the second site can decommission low-priority applications and start them only when the stand-by site is activated ("cold" stand-by).

On the stand-by site, the replicated data set is used to continue operations. Because the replication between the persistent volumes on the active and stand-by sites can be configured with a high frequency, a low Recovery Point Objective (RPO) can be achieved. Typically, the RPO is in the range of minutes, which implies that a small amount of data loss must be taken into account.
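
To make the relationship between replication interval and RPO concrete, the following sketch estimates a worst-case data loss window. It assumes a simple model that is not taken from the product documentation: a write made just after a snapshot is taken remains unprotected until the next snapshot has been taken and fully transferred, so the worst case is roughly one replication interval plus the transfer time. The function name and the example values are hypothetical.

```python
from datetime import timedelta


def worst_case_rpo(interval: timedelta, transfer_time: timedelta) -> timedelta:
    """Estimate the worst-case data loss window (simplified model).

    Assumption: data written right after a snapshot is only safe on the
    stand-by site once the following snapshot has been transferred.
    """
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    if transfer_time < timedelta(0):
        raise ValueError("transfer_time must not be negative")
    return interval + transfer_time


if __name__ == "__main__":
    # Example values: 5-minute interval, 90 seconds to ship a snapshot.
    rpo = worst_case_rpo(timedelta(minutes=5), timedelta(seconds=90))
    print(f"Worst-case RPO: {rpo.total_seconds() / 60:.1f} minutes")
    # prints "Worst-case RPO: 6.5 minutes"
```

In practice the transfer time depends on the change rate of the volumes and on the bandwidth between the sites, so measured values should replace the example numbers.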

For details and instructions on how to set up this topology, refer to the Regional-DR solution for OpenShift Data Foundation (Red Hat Documentation).

Note:

This setup takes advantage of Red Hat OpenShift API for Data Protection (OADP), which is available as a community operator for IBM Z.
Regional DR is currently supported only for block storage (CephRBD). Regional DR can be configured only in a fresh deployment of Red Hat OpenShift Data Foundation. This setup is not supported for IBM zCX.

There are several basic use cases for operating a regional DR environment. The following list describes the basic administrative procedures for regional DR and the Red Hat OpenShift clusters that are managed by Advanced Cluster Management.

1. Normal operations: Two managed clusters with OpenShift Container Platform and IBM Fusion Data Foundation are deployed, plus an additional hub cluster with Advanced Cluster Management. Both managed clusters run normally. One cluster is active and the other one is passive.
2. Application resilience within a single managed cluster (high availability): Because each managed cluster runs Red Hat OpenShift, all resilience mechanisms within the cluster are available. These include high-availability capabilities such as restarting failed storage nodes, compute nodes, and control plane nodes. As long as two storage nodes and two control plane nodes are running, the cluster is fully operational and there is no need to fail over to another data center.
3. Switch active managed cluster: Advanced Cluster Management allows administrators to switch the active cluster among its managed clusters back and forth as needed. This is a manual task that the administrator performs. Applications can be relocated to the active cluster as well. The data that is used by the failover cluster (the former passive cluster, which now becomes active) has already been replicated by Ceph. The switch is carried out by applying a DR placement control.
4. Create a backup of the artifacts: At any time, or based on a defined schedule, the artifacts of an application namespace within a managed cluster can be backed up and stored as an archive file in a safe place. Typically, all artifacts (metadata, images, PV data) are stored in object storage that is offered by a cloud provider, and they can be used later if a restore of the applications in the cluster becomes necessary. For more details on backing up applications in a Red Hat OpenShift cluster, see chapter 5.4.
5. Complete failure of a managed cluster (disaster recovery): When the active managed cluster fails, operations are switched over to the stand-by cluster, which has been passive so far. This task needs to be triggered by an administrator (identical to Scenario 3). The failed cluster can then be recovered from scratch and joined back to the same DR environment, which is managed by Advanced Cluster Management. Later, the active role can be switched back to the restored cluster. The assumption is that the hub cluster with Advanced Cluster Management continues to run uninterrupted during this procedure.
6. Complete failure of both managed clusters (disaster recovery): If both managed clusters are destroyed (either one after the other or both at the same time), they need to be recovered from a backup (as created in Scenario 4) and added back to the hub cluster that is managed by Advanced Cluster Management. The assumption is that the hub cluster with Advanced Cluster Management continues to run uninterrupted during this procedure.
7. Loss of network connectivity between the managed clusters: When one managed cluster loses connectivity to the hub cluster with Advanced Cluster Management, that cluster is considered inoperable and DR operations can be started (see the previous scenarios).
8. Complete failure of the hub cluster running RHACM: If the hub cluster with RHACM fails, you need to fall back to a redundant installation of a hub cluster, which can take over operations. The RHACM deployment that is used for the failover of RHACM itself needs to be prepared upfront. The current configuration of the failover deployment of RHACM can be restored from an OADP backup, for example from S3 cloud storage, which also needs to be prepared upfront. For details on resilience and DR for RHACM, see Business continuity (Red Hat Documentation).
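
The cluster switch described in the list above (switching the active managed cluster, or failing over after a complete failure of the active cluster) is triggered through a DR placement control. As a minimal sketch, the following Python code builds such a DRPlacementControl manifest for either a failover or a planned relocation. All resource names, the label selector, and the cluster names are hypothetical. The field names (`drPolicyRef`, `placementRef`, `pvcSelector`, `action`, `failoverCluster`, `preferredCluster`) are assumptions about the DRPlacementControl API and should be checked against the Red Hat documentation before use.

```python
import json

# Assumed action values of the DRPlacementControl resource.
VALID_ACTIONS = {"Failover", "Relocate"}


def build_drpc(name, namespace, policy, placement, action,
               target_cluster, pvc_labels):
    """Return a DRPlacementControl manifest as a dict (sketch)."""
    if action not in VALID_ACTIONS:
        raise ValueError(f"action must be one of {sorted(VALID_ACTIONS)}")
    spec = {
        "drPolicyRef": {"name": policy},
        "placementRef": {"kind": "Placement", "name": placement},
        # Selects the persistent volume claims that are protected.
        "pvcSelector": {"matchLabels": dict(pvc_labels)},
        "action": action,
    }
    # Assumption: a failover names the cluster that takes over, while a
    # relocation names the cluster the application should move to.
    if action == "Failover":
        spec["failoverCluster"] = target_cluster
    else:
        spec["preferredCluster"] = target_cluster
    return {
        "apiVersion": "ramendr.openshift.io/v1alpha1",
        "kind": "DRPlacementControl",
        "metadata": {"name": name, "namespace": namespace},
        "spec": spec,
    }


if __name__ == "__main__":
    # Hypothetical example: fail over an application to the stand-by site.
    drpc = build_drpc(
        name="demo-app-drpc",
        namespace="demo-app",
        policy="regional-dr-5m",
        placement="demo-app-placement",
        action="Failover",
        target_cluster="cluster-west",
        pvc_labels={"app": "demo"},
    )
    print(json.dumps(drpc, indent=2))
```

Generating the manifest from code rather than editing it by hand keeps the failover and the later switch back to the restored cluster symmetric: only the action and the target cluster change.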