Recovering to a replacement cluster with Regional-DR
When the primary cluster fails, you have the option to repair it, wait for it to recover, or replace it entirely if it is beyond repair. This solution guides you through replacing a failed primary cluster with a new cluster and enabling failback (relocate) to the new cluster.
Before you begin
- Ensure that the Regional-DR environment has been configured with applications installed using Red Hat Advanced Cluster Management (RHACM).
- Ensure that the applications are assigned a Data policy which protects them against cluster failure.
About this task
In these instructions, it is assumed that a RHACM managed cluster must be replaced after the applications have been installed and protected. For the purposes of this section, the failed RHACM managed cluster is the replacement cluster, the cluster that is not replaced is the surviving cluster, and the new cluster is the recovery cluster.
Procedure
- On the Hub cluster, navigate to Applications and failover all the protected applications on the failed replacement cluster to the surviving cluster.
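The failover itself is triggered from the console. If you also want to follow its progress from the hub CLI, the DRPlacementControl resources can be listed; this is a sketch, and the status fields phase and progression are assumptions based on how the Ramen operator surfaces the PROGRESSION column shown in the console.
# List every DRPC on the hub with its namespace, name, and assumed phase/progression status fields.
oc get drpc -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.phase}{"\t"}{.status.progression}{"\n"}{end}'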
- Validate that all protected applications are now running on the surviving cluster before moving to the next step.
Note: The PROGRESSION state for each application's DRPlacementControl shows as Cleaning Up. This is expected if the replacement cluster is offline or down.
- From the Hub cluster, delete the DRCluster for the replacement cluster.
oc delete drcluster <drcluster_name> --wait=false
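The command returns immediately because of --wait=false. If you want to confirm that the deletion request was accepted, checking the standard Kubernetes deletion timestamp on the resource is enough; a minimal sketch:
# A non-empty deletionTimestamp means the DRCluster is marked for deletion and is waiting on its finalizers.
oc get drcluster <drcluster_name> -o jsonpath='{.metadata.deletionTimestamp}'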
Note: Use --wait=false since the DRCluster will not be deleted until a later step.
- Disable disaster recovery on the Hub cluster for each protected application on the surviving cluster.
Perform all the sub-steps on the hub cluster.
- For each application, edit the Placement and ensure that the surviving cluster is selected.
Note: For Subscription-based applications, the associated Placement can be found in the same namespace on the hub cluster, similar to the managed clusters. For ApplicationSet-based applications, the associated Placement can be found in the openshift-gitops namespace on the hub cluster.
oc edit placement <placement_name> -n <namespace>
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
  annotations:
    cluster.open-cluster-management.io/experimental-scheduling-disable: "true"
  [...]
spec:
  clusterSets:
  - submariner
  predicates:
  - requiredClusterSelector:
      claimSelector: {}
      labelSelector:
        matchExpressions:
        - key: name
          operator: In
          values:
          - cluster1 <-- Modify to be surviving cluster name
[...]
- Verify that the s3Profile is removed for the replacement cluster by running the following command on the surviving cluster for each protected application's VolumeReplicationGroup.
oc get vrg -n <application_namespace> -o jsonpath='{.items[0].spec.s3Profiles}' | jq
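To run this check across every namespace that contains a VolumeReplicationGroup, a small convenience loop can be used on the surviving cluster. This is a sketch using only standard oc and jq, not a documented step.
# Print the s3Profiles of the first VRG in each namespace that has one.
for ns in $(oc get vrg -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\n"}{end}' | sort -u); do
  echo "== namespace: $ns"
  oc get vrg -n "$ns" -o jsonpath='{.items[0].spec.s3Profiles}' | jq
done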
- Delete all DRPlacementControl (DRPC) resources from the Hub cluster after the protected application Placement resources are all configured to use the surviving cluster and the replacement cluster s3Profiles are removed from the protected applications.
Before deleting the DRPC, edit the DRPC of each application and add the annotation drplacementcontrol.ramendr.openshift.io/do-not-delete-pvc: "true" (a non-interactive alternative is sketched after this step).
oc edit drpc <drpc_name> -n <namespace>
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRPlacementControl
metadata:
  annotations:
    ## Add this annotation
    drplacementcontrol.ramendr.openshift.io/do-not-delete-pvc: "true"
Verify that the annotation has been copied to the associated VolumeReplicationGroup (VRG) on the surviving cluster for each protected application.
oc get vrg -n <namespace> -o jsonpath='{.items[*].metadata.annotations}' | jq
Delete the DRPC.
oc delete drpc <drpc_name> -n <namespace>
Note: For Subscription-based applications, the associated DRPlacementControl can be found in the same namespace as the managed clusters on the hub cluster. For ApplicationSet-based applications, the associated DRPlacementControl can be found in the openshift-gitops namespace on the hub cluster.
Verify that all DRPlacementControl resources are deleted before proceeding to the next step. This command is a query across all namespaces. There should be no resources found.
oc get drpc -A
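As a non-interactive alternative to editing each DRPC, the do-not-delete-pvc annotation can be applied to every DRPC on the hub in one pass. This is a convenience sketch using standard oc commands, not a documented step; review the list of DRPCs before running it.
# Annotate every DRPC on the hub so the application PVCs are preserved when the DRPC is deleted.
for ref in $(oc get drpc -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}'); do
  ns=${ref%%/*}
  name=${ref#*/}
  oc annotate drpc "$name" -n "$ns" drplacementcontrol.ramendr.openshift.io/do-not-delete-pvc="true" --overwrite
done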
- Edit each application's Placement and remove the annotation cluster.open-cluster-management.io/experimental-scheduling-disable: "true".
oc edit placement <placement_name> -n <namespace>
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
  annotations:
    ## Remove this annotation
    cluster.open-cluster-management.io/experimental-scheduling-disable: "true"
[...]
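The same removal can be done without opening an editor by using oc annotate, which deletes an annotation when the key is suffixed with a dash; a minimal sketch using the same placeholders as above:
# Remove the experimental scheduling-disable annotation from the Placement non-interactively.
oc annotate placement <placement_name> -n <namespace> cluster.open-cluster-management.io/experimental-scheduling-disable-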
- For each application, check the Placement again and confirm that the surviving cluster is still selected.
- Repeat the process detailed in the previous step and its sub-steps for every protected application on the surviving cluster. Disabling DR for the protected applications is now complete.
- On the Hub cluster, run the following script to remove all disaster recovery
configurations from the surviving cluster and the hub cluster.
#!/bin/bash
secrets=$(oc get secrets -n openshift-operators | grep Opaque | cut -d" " -f1)
echo $secrets
for secret in $secrets
do
    oc patch -n openshift-operators secret/$secret -p '{"metadata":{"finalizers":null}}' --type=merge
done
mirrorpeers=$(oc get mirrorpeer -o name)
echo $mirrorpeers
for mp in $mirrorpeers
do
    oc patch $mp -p '{"metadata":{"finalizers":null}}' --type=merge
    oc delete $mp
done
drpolicies=$(oc get drpolicy -o name)
echo $drpolicies
for drp in $drpolicies
do
    oc patch $drp -p '{"metadata":{"finalizers":null}}' --type=merge
    oc delete $drp
done
drclusters=$(oc get drcluster -o name)
echo $drclusters
for drp in $drclusters
do
    oc patch $drp -p '{"metadata":{"finalizers":null}}' --type=merge
    oc delete $drp
done
oc delete project openshift-operators
managedclusters=$(oc get managedclusters -o name | cut -d"/" -f2)
echo $managedclusters
for mc in $managedclusters
do
    secrets=$(oc get secrets -n $mc | grep multicluster.odf.openshift.io/secret-type | cut -d" " -f1)
    echo $secrets
    for secret in $secrets
    do
        set -x
        oc patch -n $mc secret/$secret -p '{"metadata":{"finalizers":null}}' --type=merge
        oc delete -n $mc secret/$secret
    done
done
oc delete clusterrolebinding spoke-clusterrole-bindings
Note: This script uses the command oc delete project openshift-operators to remove the Disaster Recovery (DR) operators in this namespace on the hub cluster. If there are other non-DR operators in this namespace, you must install them again from OperatorHub.
- After the namespace openshift-operators is automatically created again, add the monitoring label back for collecting the disaster recovery metrics.
oc label namespace openshift-operators openshift.io/cluster-monitoring='true'
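To confirm that the label was applied, the namespace labels can be listed; a quick sketch using standard oc:
# The output should include openshift.io/cluster-monitoring=true.
oc get namespace openshift-operators --show-labels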
- On the surviving cluster, ensure that the object bucket created during the DR
installation is deleted. Delete the object bucket if it was not removed by the script. The name of the object bucket used for DR starts with odrbucket.
oc get obc -n openshift-storage
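If a stale claim is still listed, it can be removed with oc delete. This is a sketch; <odrbucket_obc_name> is a placeholder for the claim name shown by the previous command, and whether the backing bucket is removed with the claim depends on the storage class reclaim policy.
# Delete the leftover DR object bucket claim on the surviving cluster.
oc delete obc <odrbucket_obc_name> -n openshift-storage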
- Uninstall Submariner for only the replacement cluster (failed cluster)
using the RHACM console.
- Navigate to Infrastructure > Clusters > Clusters view > Submariner add-ons view and uninstall Submariner for only the replacement cluster.
Note: The Submariner uninstall process for the replacement cluster (failed cluster) remains GREEN and does not complete until the replacement cluster has been detached from the RHACM console.
- Navigate back to the Clusters view and detach the replacement cluster.
- After the Submariner add-ons for the replacement cluster (failed cluster) are removed from the RHACM console view, delete all stale Submariner resources left on the hub cluster. The following example removes the stale Submariner resources from the Hub cluster. In this example, the perf5 cluster is the failed cluster and perf8 is the surviving cluster. Remember to change the resource names as per your environment.
Note: Do not remove resources for the surviving cluster.
oc get endpoints.submariner.io,clusters.submariner.io -n submariner-broker
NAME                                                               AGE
endpoint.submariner.io/perf5-submariner-cable-perf5-10-70-56-163   4h2m
endpoint.submariner.io/perf8-submariner-cable-perf8-10-70-56-90    4h7m

NAME                          AGE
cluster.submariner.io/perf5   4h2m
cluster.submariner.io/perf8   4h7m
Delete the replacement cluster stale resources (for example, from perf5).
oc delete endpoint.submariner.io/perf5-submariner-cable-perf5-10-70-56-163
endpoint.submariner.io "perf5-submariner-cable-perf5-10-70-56-163" deleted
oc delete cluster.submariner.io/perf5
cluster.submariner.io "perf5" deleted
- Create a new OpenShift cluster (recovery cluster) and import into Infrastructure > Clusters view.
- Add the new recovery cluster to the Clusterset used by Submariner.
- Install Submariner add-ons only for the new recovery cluster.
Note: If GlobalNet is used for the surviving cluster, make sure to enable GlobalNet for the recovery cluster as well.
- Navigate to Infrastructure > Clusters > Clusters view > Submariner add-ons view and install Submariner for only the recovery cluster.
- Install the Fusion Data Foundation operator on the recovery cluster. The Fusion Data Foundation version should be 4.16 or later and should be the same version as on the surviving cluster. While creating the storage cluster, in the Data Protection step, you must select the Prepare cluster for disaster recovery (Regional-DR only) checkbox.
Note: If GlobalNet was enabled when installing Submariner, make sure to follow the optional instructions in the documentation to modify the Fusion Data Foundation storage cluster on the recovery cluster.
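One quick way to compare the Fusion Data Foundation version installed on the recovery cluster with the surviving cluster is to list the operator ClusterServiceVersions on each cluster; a minimal sketch, assuming the storage operators are installed in the openshift-storage namespace:
# Run on both clusters and compare the storage operator CSV versions in the output.
oc get csv -n openshift-storage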
- On the Hub cluster, install the ODF Multicluster Orchestrator operator from OperatorHub. For instructions, see the chapter on Installing Fusion Data Foundation Multicluster Orchestrator operator.
- Using the RHACM console, navigate to Data
Services > Disaster
recovery > Policies tab.
- Select Create DRPolicy and name your policy.
- Select the recovery cluster and the surviving cluster.
- Create the policy. For instructions, see the chapter on Creating Disaster Recovery Policy on Hub cluster.
Proceed to the next step only after the status of the DRPolicy changes to Validated.
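The validation status can also be checked from the hub CLI; this is a sketch, assuming the DRPolicy reports a condition of type Validated in its status, as the console column suggests.
# Print the status conditions of every DRPolicy; look for type "Validated" with status "True".
oc get drpolicy -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.conditions}{"\n"}{end}'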
- Verify that the cephblockpool IDs remain unchanged.
- Run the following command on the recovery cluster.
oc get cm -n openshift-storage rook-ceph-csi-mapping-config -o yaml
Example output:
apiVersion: v1
data:
  csi-mapping-config-json: '[{"ClusterIDMapping":{"openshift-storage":"openshift-storage"},"RBDPoolIDMapping":[{"1":"1"}]}]'
kind: ConfigMap
[...]
- Run the following command on the surviving cluster.
oc get cm -n openshift-storage rook-ceph-csi-mapping-config -o yaml
Example output:
apiVersion: v1
data:
  csi-mapping-config-json: '[{"ClusterIDMapping":{"openshift-storage":"openshift-storage"},"RBDPoolIDMapping":[{"3":"1"}]}]'
kind: ConfigMap
[...]
- Check the RBDPoolIDMapping in the YAML for both clusters. If the RBDPoolIDMapping does not match, edit the rook-ceph-csi-mapping-config config map of the recovery cluster to add the additional or missing RBDPoolIDMapping directly, as shown in the following example.
csi-mapping-config-json: '[{"ClusterIDMapping":{"openshift-storage":"openshift-storage"},"RBDPoolIDMapping":[{"1":"1"}]},{"ClusterIDMapping":{"openshift-storage":"openshift-storage"},"RBDPoolIDMapping":[{"1":"3"}]}]'
Note: After editing the config map, restart the rook-ceph-operator pod in the namespace openshift-storage on the surviving cluster by deleting the pod.
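The restart can be done by deleting the operator pod by label; a minimal sketch, assuming the pod carries the usual app=rook-ceph-operator label:
# Delete the rook-ceph-operator pod on the surviving cluster; its deployment recreates it and the new pod picks up the edited config map.
oc delete pod -n openshift-storage -l app=rook-ceph-operator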
- Apply the DRPolicy to the applications on the surviving cluster that were originally protected before the replacement cluster failed.
- Relocate the newly protected applications on the surviving cluster back to the new recovery cluster. Using the RHACM console, navigate to the Applications menu to perform the relocation.