Considerations for business continuity and resilience
In addition to the setup of a full production environment, it is essential to consider resiliency.
- To protect against logical failures, backups are necessary. Logical failures include bugs, malicious software, and administration mistakes. A backup provides the option to restore the application from a point-in-time copy of its data. For IBM Fusion Data Foundation, the Red Hat OpenShift API for Data Protection (OADP) is used for backup. OADP is an operator in its own right. Its implementation is based on the open source project Velero and uses CSI Snapshots or Restic under the hood. Backup images are saved to object storage in the cloud. It is important to emphasize that in the context of OADP, backups are snapshots of the application’s persistent volume data and of all the Kubernetes metadata belonging to the application’s namespace. A backup is therefore always scoped to a namespace. The deployment of the OpenShift Container Platform cluster itself and its corresponding configuration, such as the etcd database, are not part of the OADP backup functions.
- Resiliency covers not only data protection, but also the high availability of the entire system, including failover and disaster recovery in the case of major disruptions. IBM Z platform best practices can be applied here.
- A single Red Hat OpenShift cluster can be stretched across data centers that are in close proximity. Round-trip network latency must be below 10 ms. This setup addresses outages due to local system or hardware failures, because the individual nodes of the Red Hat OpenShift cluster and the IBM Fusion Data Foundation storage nodes are isolated from each other (in separate LPARs, hardware units, or even data centers). A minimum of two control, compute, and storage nodes must be maintained at all times to ensure continuous operation.
- IBM Fusion Data Foundation can also be deployed in an active/passive setup, where two data centers are located further apart to cover broader regional disasters. Each data center has its own Red Hat OpenShift cluster and its own IBM Fusion Data Foundation deployment. One cluster is active, while the other remains on standby for use in case of a disaster. The two separate clusters are orchestrated by Red Hat Advanced Cluster Management for Kubernetes (RHACM). The persistent volumes in the standby system are kept up to date by Ceph’s asynchronous replication. This approach is referred to as Regional Disaster Recovery (RDR).
- In addition to RDR, a Metro Disaster Recovery (MDR) approach allows synchronous replication and an active/active setup, if the data centers are located in close proximity. Both RDR and MDR are described in more detail in the next section.
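
To illustrate the namespace scoping of OADP backups described above, the following is a minimal sketch that builds a Velero Backup custom resource as a Python dictionary and prints it as JSON. The function name, the application namespace `my-app`, the backup name, the storage location name `default`, the retention period, and the `openshift-adp` operator namespace are assumptions chosen for illustration; they must be adapted to the actual environment.

```python
import json


def build_backup_manifest(app_namespace, backup_name,
                          storage_location="default",
                          ttl="720h0m0s",
                          oadp_namespace="openshift-adp"):
    """Return a Velero Backup custom resource as a plain dict.

    All default values are illustrative assumptions, not values
    taken from a real cluster.
    """
    return {
        "apiVersion": "velero.io/v1",
        "kind": "Backup",
        # The Backup object itself lives in the namespace of the
        # OADP operator (assumed here to be "openshift-adp").
        "metadata": {"name": backup_name, "namespace": oadp_namespace},
        "spec": {
            # The backup is scoped to the application's namespace.
            "includedNamespaces": [app_namespace],
            # Name of the backup storage location (object storage).
            "storageLocation": storage_location,
            # Request snapshots of the persistent volumes.
            "snapshotVolumes": True,
            # Retention period of the backup.
            "ttl": ttl,
        },
    }


if __name__ == "__main__":
    manifest = build_backup_manifest("my-app", "my-app-backup")
    print(json.dumps(manifest, indent=2))
```

Because JSON is accepted wherever a manifest is expected, the printed output could be saved to a file and applied with `oc apply -f`. Cluster-level state, such as the etcd database, is not captured by a resource like this one.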