Recovering to a replacement cluster with Metro-DR

When the primary cluster fails, you can either repair it, wait for it to recover, or replace it entirely if it cannot be recovered. This solution guides you through replacing a failed primary cluster with a new cluster and enables failback (relocate) to this new cluster.

Before you begin

  • Ensure that the Metro-DR environment has been configured with applications installed using Red Hat Advanced Cluster Management (RHACM).
  • Ensure that the applications are assigned a Data policy that protects them against cluster failure.

About this task

These instructions assume that an RHACM managed cluster must be replaced after the applications have been installed and protected. For the purposes of this section, the failed managed cluster being replaced is the replacement cluster, the cluster that is not replaced is the surviving cluster, and the new cluster is the recovery cluster.

Procedure

  1. Perform the following steps on the Hub cluster:
    1. Fence the replacement cluster by using the CLI terminal to edit the DRCluster resource, where <drcluster_name> is the replacement cluster name.
      oc edit drcluster <drcluster_name>
      apiVersion: ramendr.openshift.io/v1alpha1
      kind: DRCluster
      metadata:
      [...]
      spec:
        ## Add or modify this line
        clusterFence: Fenced
        cidrs:
        [...]
      [...]
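      Before failing over, you can optionally confirm that the fencing operation has been applied. This check assumes the DRCluster reports its fencing state in status.phase; field names can vary between versions.
      oc get drcluster <drcluster_name> -o jsonpath='{.status.phase}{"\n"}'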
    2. Using the RHACM console, navigate to Applications and fail over all protected applications from the failed cluster to the surviving cluster.
    3. Verify that all protected applications are now running on the surviving cluster.
      Note: The PROGRESSION state for each application DRPlacementControl will show as Cleaning Up. This is expected if the replacement cluster is offline or down.
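      You can also check the failover from the CLI on the Hub cluster. This is an optional check; the exact columns in the wide output can vary by version, but each DRPlacementControl should report the surviving cluster as the failover cluster and a current state of FailedOver.
      oc get drpc -A -o wide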
  2. Unfence the replacement cluster.

    Using the CLI terminal, edit the DRCluster resource, where <drcluster_name> is the replacement cluster name.

    oc edit drcluster <drcluster_name>
    apiVersion: ramendr.openshift.io/v1alpha1
    kind: DRCluster
    metadata:
    [...]
    spec:
      ## Modify this line
      clusterFence: Unfenced
      cidrs:
      [...]
    [...]
  3. Delete the DRCluster for the replacement cluster.
    oc delete drcluster <drcluster_name> --wait=false
    Note: Use --wait=false since the DRCluster will not be deleted until a later step.
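    To confirm that the deletion request was accepted, you can check that a deletion timestamp is now set on the resource; the DRCluster is not fully removed until its finalizers are cleaned up in a later step. This check is optional.
    oc get drcluster <drcluster_name> -o jsonpath='{.metadata.deletionTimestamp}{"\n"}'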
  4. Disable disaster recovery on the Hub cluster for each protected application on the surviving cluster.
    1. For each application, edit the Placement and ensure that the surviving cluster is selected.
      Note: For Subscription-based applications, the associated Placement can be found in the same namespace on the hub cluster as on the managed clusters. For ApplicationSet-based applications, the associated Placement can be found in the openshift-gitops namespace on the hub cluster.
      oc edit placement <placement_name> -n <namespace>
      apiVersion: cluster.open-cluster-management.io/v1beta1
      kind: Placement
      metadata:
        annotations:
          cluster.open-cluster-management.io/experimental-scheduling-disable: "true"
      [...]
      spec:
        clusterSets:
        - submariner
        predicates:
        - requiredClusterSelector:
            claimSelector: {}
            labelSelector:
              matchExpressions:
              - key: name
                operator: In
                values:
                - cluster1  <-- Modify to be surviving cluster name
      [...]
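      Optionally, confirm that the placement now resolves to the surviving cluster by inspecting the associated PlacementDecision. The label selector below assumes the cluster.open-cluster-management.io/placement label that RHACM sets on PlacementDecision resources.
      oc get placementdecisions -n <namespace> -l cluster.open-cluster-management.io/placement=<placement_name> -o yaml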
    2. Verify that the s3Profile is removed for the replacement cluster by running the following command on the surviving cluster for each protected application’s VolumeReplicationGroup.
      oc get vrg -n <application_namespace> -o jsonpath='{.items[0].spec.s3Profiles}' | jq
    3. After all protected application Placement resources are configured to use the surviving cluster and the replacement cluster s3Profile(s) have been removed from the protected applications, delete all DRPlacementControl resources from the Hub cluster.
      oc delete drpc <drpc_name> -n <namespace>
      Note: For Subscription-based applications, the associated DRPlacementControl can be found in the same namespace on the hub cluster as on the managed clusters. For ApplicationSet-based applications, the associated DRPlacementControl can be found in the openshift-gitops namespace on the hub cluster.
    4. Verify that all DRPlacementControl resources are deleted before proceeding to the next step. This command is a query across all namespaces. There should be no resources found.
      oc get drpc -A
    5. Finally, edit each application's Placement and remove the annotation cluster.open-cluster-management.io/experimental-scheduling-disable: "true".
      oc edit placement <placement_name> -n <namespace>
      apiVersion: cluster.open-cluster-management.io/v1beta1
      kind: Placement
      metadata:
        annotations:
          ## Remove this annotation
          cluster.open-cluster-management.io/experimental-scheduling-disable: "true"
      [...]
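      To confirm the change, you can print the remaining annotations on the Placement; the experimental-scheduling-disable annotation should no longer appear.
      oc get placement <placement_name> -n <namespace> -o jsonpath='{.metadata.annotations}'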
  5. Repeat the previous step and its sub-steps for every protected application on the surviving cluster. Disabling DR for the protected applications is now complete.
  6. On the Hub cluster, run the following script to remove all disaster recovery configurations from the surviving cluster and the hub cluster.
    #!/bin/bash
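    # Remove finalizers from the Opaque secrets in the openshift-operators namespace so that they can be cleaned up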
    secrets=$(oc get secrets -n openshift-operators | grep Opaque | cut -d" " -f1)
    echo $secrets
    for secret in $secrets
    do
        oc patch -n openshift-operators secret/$secret -p '{"metadata":{"finalizers":null}}' --type=merge
    done
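    # Remove finalizers from, and delete, all MirrorPeer resources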
    mirrorpeers=$(oc get mirrorpeer -o name)
    echo $mirrorpeers
    for mp in $mirrorpeers
    do
        oc patch $mp -p '{"metadata":{"finalizers":null}}' --type=merge
        oc delete $mp
    done
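    # Remove finalizers from, and delete, all DRPolicy resources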
    drpolicies=$(oc get drpolicy -o name)
    echo $drpolicies
    for drp in $drpolicies
    do
        oc patch $drp -p '{"metadata":{"finalizers":null}}' --type=merge
        oc delete $drp
    done
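    # Remove finalizers from, and delete, all DRCluster resources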
    drclusters=$(oc get drcluster -o name)
    echo $drclusters
    for drp in $drclusters
    do
        oc patch $drp -p '{"metadata":{"finalizers":null}}' --type=merge
        oc delete $drp
    done
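    # Delete the openshift-operators project, which removes the DR operators installed in it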
    oc delete project openshift-operators
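    # Delete the DR secrets of type multicluster.odf.openshift.io/secret-type from each managed cluster namespace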
    managedclusters=$(oc get managedclusters -o name | cut -d"/" -f2)
    echo $managedclusters
    for mc in $managedclusters
    do
        secrets=$(oc get secrets -n $mc | grep multicluster.odf.openshift.io/secret-type | cut -d" " -f1)
        echo $secrets
        for secret in $secrets
        do
            set -x
            oc patch -n $mc secret/$secret -p '{"metadata":{"finalizers":null}}' --type=merge
            oc delete -n $mc secret/$secret
        done
    done
    
    oc delete clusterrolebinding spoke-clusterrole-bindings
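    You can, for example, save the script to a file and run it from a terminal that is logged in to the Hub cluster with cluster-admin privileges. The file name below is only an example.
    bash remove-dr-config.sh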
    Note: This script uses the command oc delete project openshift-operators to remove the Disaster Recovery (DR) operators in this namespace on the hub cluster. If there are other non-DR operators in this namespace, you must install them again from OperatorHub.
  7. After the namespace openshift-operators is automatically created again, add the monitoring label back for collecting the disaster recovery metrics.
    oc label namespace openshift-operators openshift.io/cluster-monitoring='true'
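    You can confirm that the label was applied by listing the labels on the namespace.
    oc get namespace openshift-operators --show-labels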
  8. On the surviving cluster, ensure that the object bucket created during the DR installation is deleted. If the script did not remove it, delete the object bucket manually. The name of the object bucket used for DR starts with odrbucket.
    oc get obc -n openshift-storage
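    If an odrbucket claim is still listed, delete it manually. The claim name below is a placeholder for the name returned by the previous command.
    oc delete obc <obc_name> -n openshift-storage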
  9. On the RHACM console, navigate to Infrastructure > Clusters view.
    1. Detach the replacement cluster.
    2. Create a new OpenShift cluster (recovery cluster) and import the new cluster into the RHACM console. For instructions, see Creating a cluster and Importing a target managed cluster to the hub cluster.
  10. Install the Fusion Data Foundation operator on the recovery cluster and connect it to the same external IBM Storage Ceph as the surviving cluster. For detailed instructions, refer to Deploying Data Foundation in external mode.
    Note: Ensure that the Fusion Data Foundation version is 4.15 (or greater) and is the same version as the one installed on the surviving cluster.

  11. On the hub cluster, install the ODF Multicluster Orchestrator operator from OperatorHub. For instructions, see the chapter on Installing Fusion Data Foundation on managed clusters.
  12. Using the RHACM console, navigate to Data Services > Data policies.
    1. Select Create DRPolicy and name your policy.
    2. Select the recovery cluster and the surviving cluster.
    3. Create the policy. For instructions, see the chapter on Creating Disaster Recovery Policy on Hub cluster.

      Proceed to the next step only after the status of DRPolicy changes to Validated.
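      You can also watch the policy status from the CLI on the Hub cluster. This sketch simply prints the status conditions of the DRPolicy and, like the earlier verification step, assumes that jq is available.
      oc get drpolicy <drpolicy_name> -o jsonpath='{.status.conditions}' | jq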

  13. Apply the DRPolicy to the applications on the surviving cluster that were originally protected before the replacement cluster failed.
  14. Relocate the newly protected applications on the surviving cluster back to the new recovery (primary) cluster. Using the RHACM console, navigate to the Applications menu to perform the relocation.
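    After the relocation completes, you can spot-check from the recovery cluster that the application workloads are running again. The namespace below is a placeholder for each protected application's namespace.
    oc get pods -n <application_namespace>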