Recovering to a replacement cluster with Regional-DR

When the primary cluster fails, you have the option to either repair it, wait for it to recover, or replace it entirely if the cluster is irredeemable. This solution guides you through replacing a failed primary cluster with a new cluster and enables failback (relocate) to this new cluster.

Before you begin

  • Ensure that the Regional-DR environment has been configured with applications installed using Red Hat Advanced Cluster Management (RHACM).
  • Ensure that the applications are assigned a Data policy which protects them against cluster failure.

About this task

These instructions assume that a RHACM managed cluster must be replaced after the applications have been installed and protected. For the purposes of this section, the RHACM managed cluster that is replaced is the replacement cluster, the cluster that is not replaced is the surviving cluster, and the new cluster is the recovery cluster.

Note: Replacement cluster recovery for Discovered applications is currently not supported. Only Managed applications are supported.

Procedure

  1. On the Hub cluster, navigate to Applications and fail over all the protected applications on the failed replacement cluster to the surviving cluster.
  2. Validate that all protected applications are now running on the surviving cluster before moving to the next step.
    Note: The PROGRESSION state for each application's DRPlacementControl will show as Cleaning Up. This is expected if the replacement cluster is offline or down.
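    To review the failover progress from the CLI, you can also list the DRPlacementControl resources on the Hub cluster; the exact columns shown depend on the installed version.
    oc get drpc -A -o wide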
  3. From the Hub cluster, delete the DRCluster for the replacement cluster.
    oc delete drcluster <drcluster_name> --wait=false
    Note: Use --wait=false since the DRCluster will not be deleted until a later step.
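    If you want to confirm that the DRCluster is marked for deletion, you can check its deletion timestamp; it remains set until the finalizers are cleared in a later step.
    oc get drcluster <drcluster_name> -o jsonpath='{.metadata.deletionTimestamp}{"\n"}'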
  4. Disable disaster recovery on the Hub cluster for each protected application on the surviving cluster.

    Perform all the sub-steps on the hub cluster.

    1. For each application, edit the Placement and ensure that the surviving cluster is selected.
      Note: For Subscription-based applications, the associated Placement is found on the hub cluster in the same namespace as the application on the managed clusters. For ApplicationSet-based applications, the associated Placement is found in the openshift-gitops namespace on the hub cluster.
      oc edit placement <placement_name> -n <namespace>
      apiVersion: cluster.open-cluster-management.io/v1beta1
      kind: Placement
      metadata:
        annotations:
          cluster.open-cluster-management.io/experimental-scheduling-disable: "true"
        [...]
      spec:
        clusterSets:
        - submariner
        predicates:
        - requiredClusterSelector:
            claimSelector: {}
            labelSelector:
              matchExpressions:
              - key: name
                operator: In
                values:
                - cluster1  <-- Modify to be surviving cluster name
      [...]
    2. Verify that the s3Profile is removed for the replacement cluster by running the following command on the surviving cluster for each protected application’s VolumeReplicationGroup.
      oc get vrg -n <application_namespace> -o jsonpath='{.items[0].spec.s3Profiles}' | jq
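      The command prints the s3 profiles still referenced by the VRG. The profile name below is hypothetical and depends on your environment; only the surviving cluster's profile(s) should remain, and no profile for the replacement cluster should be listed.
      [
        "s3profile-<surviving_cluster>-ocs-storagecluster"
      ]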
    3. After the protected application Placement resources are all configured to use the surviving cluster and the replacement cluster s3Profile(s) are removed from the protected applications, delete all DRPlacementControl (DRPC) resources from the Hub cluster.

      Before deleting the DRPC, edit the DRPC of each application and add the annotation drplacementcontrol.ramendr.openshift.io/do-not-delete-pvc: "true".

      oc edit drpc <drpc_name> -n <namespace>
      apiVersion: ramendr.openshift.io/v1alpha1
      kind: DRPlacementControl
      metadata:
        annotations:
          ## Add this annotation
          drplacementcontrol.ramendr.openshift.io/do-not-delete-pvc: "true"
      Verify that the annotation has been copied to the associated VolumeReplicationGroup (VRG) on the surviving cluster for each protected application.
      oc get vrg -n <namespace> -o jsonpath='{.items[*].metadata.annotations}' | jq
      Delete the DRPC.
      oc delete drpc <drpc_name> -n <namespace>
      Note: For Subscription-based applications, the associated DRPlacementControl is found on the hub cluster in the same namespace as the application on the managed clusters. For ApplicationSet-based applications, the associated DRPlacementControl is found in the openshift-gitops namespace on the hub cluster.
      Verify that all DRPlacementControl resources are deleted before proceeding to the next step. This command is a query across all namespaces. There should be no resources found.
      oc get drpc -A
    4. Edit each application's Placement and remove the annotation cluster.open-cluster-management.io/experimental-scheduling-disable: "true".
      oc edit placement <placement_name> -n <namespace>
      apiVersion: cluster.open-cluster-management.io/v1beta1
      kind: Placement
      metadata:
        annotations:
          ## Remove this annotation
          cluster.open-cluster-management.io/experimental-scheduling-disable: "true"
      [...]
  5. Repeat the previous step and its sub-steps for every protected application on the surviving cluster. Disabling DR for the protected applications is now complete.
  6. On the Hub cluster, run the following script to remove all disaster recovery configurations from the surviving cluster and the hub cluster.
    #!/bin/bash
    
    # Remove finalizers from the DR-related Opaque secrets in openshift-operators
    secrets=$(oc get secrets -n openshift-operators | grep Opaque | cut -d" " -f1)
    echo $secrets
    for secret in $secrets
    do
        oc patch -n openshift-operators secret/$secret -p '{"metadata":{"finalizers":null}}' --type=merge
    done
    
    # Remove finalizers from and delete all MirrorPeer resources
    mirrorpeers=$(oc get mirrorpeer -o name)
    echo $mirrorpeers
    for mp in $mirrorpeers
    do
        oc patch $mp -p '{"metadata":{"finalizers":null}}' --type=merge
        oc delete $mp
    done
    
    # Remove finalizers from and delete all DRPolicy resources
    drpolicies=$(oc get drpolicy -o name)
    echo $drpolicies
    for drp in $drpolicies
    do
        oc patch $drp -p '{"metadata":{"finalizers":null}}' --type=merge
        oc delete $drp
    done
    
    # Remove finalizers from and delete all DRCluster resources
    drclusters=$(oc get drcluster -o name)
    echo $drclusters
    for drp in $drclusters
    do
        oc patch $drp -p '{"metadata":{"finalizers":null}}' --type=merge
        oc delete $drp
    done
    
    # Delete the namespace that contains the DR operators
    oc delete project openshift-operators
    
    # Delete the exchanged DR secrets in each managed cluster namespace
    managedclusters=$(oc get managedclusters -o name | cut -d"/" -f2)
    echo $managedclusters
    for mc in $managedclusters
    do
        secrets=$(oc get secrets -n $mc | grep multicluster.odf.openshift.io/secret-type | cut -d" " -f1)
        echo $secrets
        for secret in $secrets
        do
            set -x
            oc patch -n $mc secret/$secret -p '{"metadata":{"finalizers":null}}' --type=merge
            oc delete -n $mc secret/$secret
        done
    done
    
    # Remove the cluster role binding created for the spoke clusters
    oc delete clusterrolebinding spoke-clusterrole-bindings
    Note: This script uses the command oc delete project openshift-operators to remove the Disaster Recovery (DR) operators from this namespace on the hub cluster. If there were other non-DR operators in this namespace, you must install them again from OperatorHub.
  7. After the namespace openshift-operators is automatically created again, add the monitoring label back for collecting the disaster recovery metrics.
    oc label namespace openshift-operators openshift.io/cluster-monitoring='true'
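    To confirm that the label was applied, you can list the labels on the namespace.
    oc get namespace openshift-operators --show-labels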
  8. On the surviving cluster, ensure that the object bucket created during the DR installation is deleted. If the script did not remove it, delete the object bucket. The name of the object bucket used for DR starts with odrbucket.
    oc get obc -n openshift-storage
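    If a stale bucket claim whose name starts with odrbucket is still listed, delete it; substitute the claim name reported by the previous command.
    oc delete obc <obc_name> -n openshift-storage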
  9. Uninstall Submariner for only the replacement cluster (failed cluster) using the RHACM console.
    1. Navigate to Infrastructure > Clusters > Clusters view > Submariner add-ons view and uninstall Submariner for only the replacement cluster.
      Note: The Submariner uninstall process for the replacement cluster (failed cluster) stays GREEN and does not complete until the replacement cluster has been detached from the RHACM console.
    2. Navigate back to the Clusters view and detach the replacement cluster.
    3. After the Submariner add-ons for the replacement cluster (failed cluster) are removed from the RHACM console view, delete all stale Submariner resources left on the hub cluster.
      The following example shows how to remove the stale Submariner resources from the Hub cluster. In this example, perf5 is the failed cluster and perf8 is the surviving cluster. Remember to change the resource names to match your environment.
      Note: Do not remove resources for the surviving cluster.
      oc get endpoints.submariner.io,clusters.submariner.io -n submariner-broker
      NAME                                                               AGE
      endpoint.submariner.io/perf5-submariner-cable-perf5-10-70-56-163   4h2m
      endpoint.submariner.io/perf8-submariner-cable-perf8-10-70-56-90    4h7m
      
      NAME                          AGE
      cluster.submariner.io/perf5   4h2m
      cluster.submariner.io/perf8   4h7m
      Delete the replacement cluster stale resources (for example, from perf5).
      oc delete endpoint.submariner.io/perf5-submariner-cable-perf5-10-70-56-163
      endpoint.submariner.io "perf5-submariner-cable-perf5-10-70-56-163" deleted
      oc delete cluster.submariner.io/perf5
      cluster.submariner.io "perf5" deleted
    4. Create a new OpenShift cluster (recovery cluster) and import it into the Infrastructure > Clusters view.
    5. Add the new recovery cluster to the Clusterset used by Submariner.
    6. Install Submariner add-ons only for the new recovery cluster.
      Note: If GlobalNet is used for the surviving cluster, make sure to enable GlobalNet for the recovery cluster as well.
  10. Install the Fusion Data Foundation operator on the recovery cluster. The Fusion Data Foundation version must be 4.16 or later and must be the same version as on the surviving cluster. While creating the storage cluster, in the Data Protection step, you must select the Prepare cluster for disaster recovery (Regional-DR only) checkbox.
    Note: Make sure to follow the optional instructions in the documentation to modify the Fusion Data Foundation storage cluster on the recovery cluster if GlobalNet has been enabled when installing Submariner.
  11. On the Hub cluster, install the ODF Multicluster Orchestrator operator from OperatorHub. For instructions, see the chapter on Installing Fusion Data Foundation Multicluster Orchestrator operator.
  12. Using the RHACM console, navigate to Data Services > Disaster recovery > Policies tab.
    1. Select Create DRPolicy and name your policy.
    2. Select the recovery cluster and the surviving cluster.
    3. Create the policy. For instructions, see the chapter on Creating Disaster Recovery Policy on Hub cluster.

      Proceed to the next step only after the status of DRPolicy changes to Validated.
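      If you prefer to check from the CLI, the validation result is also reflected in the DRPolicy status conditions; this is a sketch and the exact condition output can vary by version. A reason of Succeeded indicates that the policy has been validated.
      oc get drpolicy <drpolicy_name> -o jsonpath='{.status.conditions[].reason}{"\n"}'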

  13. Verify that cephblockpool IDs remain unchanged.
    1. Run the following command on the recovery cluster.
      oc get cm -n openshift-storage rook-ceph-csi-mapping-config -o yaml

      The following is sample output.

      apiVersion: v1
      data:
        csi-mapping-config-json: '[{"ClusterIDMapping":{"openshift-storage":"openshift-storage"},"RBDPoolIDMapping":[{"1":"1"}]}]'
      kind: ConfigMap
      [...]
    2. Run the following command on the surviving cluster.
      oc get cm -n openshift-storage rook-ceph-csi-mapping-config -o yaml

      The following is sample output.

      apiVersion: v1
      data:
        csi-mapping-config-json: '[{"ClusterIDMapping":{"openshift-storage":"openshift-storage"},"RBDPoolIDMapping":[{"3":"1"}]}]'
      kind: ConfigMap
      [...]
    3. Check the RBDPoolIDMapping in the YAML for both clusters. If the RBDPoolIDMapping does not match, edit the rook-ceph-csi-mapping-config config map of the recovery cluster to add the additional or missing RBDPoolIDMapping directly, as shown in the following example.
      csi-mapping-config-json: '[{"ClusterIDMapping":{"openshift-storage":"openshift-storage"},"RBDPoolIDMapping":[{"1":"1"}]},{"ClusterIDMapping":{"openshift-storage":"openshift-storage"},"RBDPoolIDMapping":[{"1":"3"}]}]'
      Note: After editing the config map, restart the rook-ceph-operator pod in the openshift-storage namespace on the surviving cluster by deleting the pod.
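      One way to restart the operator is to delete its pod and let the deployment recreate it; this sketch assumes the standard app=rook-ceph-operator label on the pod.
      oc delete pod -l app=rook-ceph-operator -n openshift-storage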
  14. Apply the DRPolicy to the applications on the surviving cluster that were originally protected before the replacement cluster failed.
  15. Relocate the newly protected applications on the surviving cluster back to the new recovery cluster. Using the RHACM console, navigate to the Applications menu to perform the relocation.
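    If you prefer to trigger the relocation from the CLI, the same action can be expressed on the application's DRPlacementControl; this is a sketch, and <drpc_name>, <namespace>, and <recovery_cluster_name> are placeholders for your environment.
    oc edit drpc <drpc_name> -n <namespace>
    apiVersion: ramendr.openshift.io/v1alpha1
    kind: DRPlacementControl
    [...]
    spec:
      ## Set the action to Relocate and point preferredCluster at the recovery cluster
      action: Relocate
      preferredCluster: <recovery_cluster_name>
    [...]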