Recovering to a replacement cluster with Metro-DR
When the primary cluster fails, you can either repair it, wait for it to recover, or replace it entirely if it is irredeemable. This solution guides you through replacing a failed primary cluster with a new cluster and enabling failback (relocate) to this new cluster.
Before you begin
- Ensure that the Metro-DR environment has been configured with applications installed using Red Hat Advanced Cluster Management (RHACM).
- Ensure that the applications are assigned a Data policy which protects them against cluster failure.
About this task
In these instructions, we assume that a RHACM managed cluster must be replaced after the applications have been installed and protected. For the purposes of this section, the managed cluster being replaced is the replacement cluster, the cluster that is not replaced is the surviving cluster, and the new cluster is the recovery cluster.
Procedure
- Perform the following steps on the Hub cluster:
- Fence the replacement cluster by using the CLI terminal to edit the DRCluster resource, where <drcluster_name> is the replacement cluster name.
oc edit drcluster <drcluster_name>
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRCluster
metadata:
[...]
spec:
  ## Add or modify this line
  clusterFence: Fenced
  cidrs: [...]
[...]
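Note: Optionally, you can confirm the fencing state from the CLI before failing over. This quick check assumes that the DRCluster resource reports its fencing state in status.phase, as current Ramen-based Metro-DR deployments do.
oc get drcluster <drcluster_name> -o jsonpath='{.status.phase}{"\n"}'
The expected output is Fenced.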
- Using the RHACM console, navigate to Applications and fail over all protected applications from the failed cluster to the surviving cluster.
- Verify and ensure that all protected applications are now running on the surviving cluster before moving to the next step.
Note: The PROGRESSION state for each application DRPlacementControl will show as Cleaning Up. This is expected if the replacement cluster is offline or down.
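Note: If you prefer to check this from the CLI, the wide output of the DRPlacementControl resources includes the PROGRESSION and CURRENTSTATE columns in recent releases; this is an optional convenience check, not a required step.
oc get drpc -A -o wide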
- Unfence the replacement cluster. Using the CLI terminal, edit the DRCluster resource, where <drcluster_name> is the replacement cluster name.
oc edit drcluster <drcluster_name>
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRCluster
metadata:
[...]
spec:
  ## Modify this line
  clusterFence: Unfenced
  cidrs: [...]
[...]
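Note: As with fencing, you can optionally confirm the change from the CLI, again assuming the fencing state is reported in status.phase.
oc get drcluster <drcluster_name> -o jsonpath='{.status.phase}{"\n"}'
The expected output is Unfenced.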
- Delete the DRCluster for the replacement cluster.
oc delete drcluster <drcluster_name> --wait=false
Note: Use --wait=false since the DRCluster will not be deleted until a later step.
- Disable disaster recovery on the Hub cluster for each protected application on the surviving cluster.
- For each application, edit the Placement and ensure that the surviving cluster is selected.
Note: For Subscription-based applications, the associated Placement can be found in the same namespace as the application on the hub cluster. For ApplicationSets-based applications, the associated Placement can be found in the openshift-gitops namespace on the hub cluster.
oc edit placement <placement_name> -n <namespace>
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
  annotations:
    cluster.open-cluster-management.io/experimental-scheduling-disable: "true"
  [...]
spec:
  clusterSets:
  - submariner
  predicates:
  - requiredClusterSelector:
      claimSelector: {}
      labelSelector:
        matchExpressions:
        - key: name
          operator: In
          values:
          - cluster1 <-- Modify to be surviving cluster name
[...]
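Note: To confirm that the surviving cluster is the one actually selected, you can inspect the PlacementDecision owned by the Placement. This is an optional check; the label selector below assumes the standard open-cluster-management convention of labeling decisions with the owning Placement name.
oc get placementdecision -n <namespace> -l cluster.open-cluster-management.io/placement=<placement_name> -o yaml
The status.decisions list should contain only the surviving cluster.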
- Verify that the s3Profile is removed for the replacement cluster by running the following command on the surviving cluster for each protected application's VolumeReplicationGroup.
oc get vrg -n <application_namespace> -o jsonpath='{.items[0].spec.s3Profiles}' | jq
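Note: As an illustration only, assuming the default s3profile-<cluster_name>-<storagecluster_name> naming used by the Multicluster Orchestrator, the output should list only the surviving cluster's profile, similar to the following:
[
  "s3profile-<surviving_cluster_name>-<storagecluster_name>"
]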
- After the protected application Placement resources are all configured to use the surviving cluster and the replacement cluster s3Profile(s) are removed from the protected applications, all DRPlacementControl resources must be deleted from the Hub cluster.
oc delete drpc <drpc_name> -n <namespace>
Note: For Subscription-based applications, the associated DRPlacementControl can be found in the same namespace as the application on the hub cluster. For ApplicationSets-based applications, the associated DRPlacementControl can be found in the openshift-gitops namespace on the hub cluster.
- Verify that all DRPlacementControl resources are deleted before proceeding to the next step. This command is a query across all namespaces. There should be no resources found.
oc get drpc -A
- The last step is to edit each application's Placement and remove the annotation cluster.open-cluster-management.io/experimental-scheduling-disable: "true".
oc edit placement <placement_name> -n <namespace>
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
  annotations:
    ## Remove this annotation
    cluster.open-cluster-management.io/experimental-scheduling-disable: "true"
[...]
- Repeat the process detailed in the last step and its sub-steps for every protected application on the surviving cluster. Disabling DR for the protected applications is now complete.
- On the Hub cluster, run the following script to remove all disaster recovery
configurations from the surviving cluster and the hub cluster.
#!/bin/bash

# Remove finalizers from the DR-related Opaque secrets in openshift-operators
secrets=$(oc get secrets -n openshift-operators | grep Opaque | cut -d" " -f1)
echo $secrets
for secret in $secrets
do
    oc patch -n openshift-operators secret/$secret -p '{"metadata":{"finalizers":null}}' --type=merge
done

# Remove finalizers from and delete all MirrorPeer resources
mirrorpeers=$(oc get mirrorpeer -o name)
echo $mirrorpeers
for mp in $mirrorpeers
do
    oc patch $mp -p '{"metadata":{"finalizers":null}}' --type=merge
    oc delete $mp
done

# Remove finalizers from and delete all DRPolicy resources
drpolicies=$(oc get drpolicy -o name)
echo $drpolicies
for drp in $drpolicies
do
    oc patch $drp -p '{"metadata":{"finalizers":null}}' --type=merge
    oc delete $drp
done

# Remove finalizers from and delete all DRCluster resources
drclusters=$(oc get drcluster -o name)
echo $drclusters
for drp in $drclusters
do
    oc patch $drp -p '{"metadata":{"finalizers":null}}' --type=merge
    oc delete $drp
done

# Delete the namespace that contains the DR operators
oc delete project openshift-operators

# Remove the multicluster secrets from each managed cluster namespace
managedclusters=$(oc get managedclusters -o name | cut -d"/" -f2)
echo $managedclusters
for mc in $managedclusters
do
    secrets=$(oc get secrets -n $mc | grep multicluster.odf.openshift.io/secret-type | cut -d" " -f1)
    echo $secrets
    for secret in $secrets
    do
        set -x
        oc patch -n $mc secret/$secret -p '{"metadata":{"finalizers":null}}' --type=merge
        oc delete -n $mc secret/$secret
    done
done

# Remove the spoke cluster role bindings created for DR
oc delete clusterrolebinding spoke-clusterrole-bindings
Note: This script uses the command oc delete project openshift-operators to remove the Disaster Recovery (DR) operators in this namespace on the hub cluster. If there are other non-DR operators in this namespace, you must install them again from OperatorHub.
- After the namespace openshift-operators is automatically created again, add the monitoring label back for collecting the disaster recovery metrics.
oc label namespace openshift-operators openshift.io/cluster-monitoring='true'
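Note: Optionally, you can verify that the label was applied; this check is not part of the original procedure.
oc get namespace openshift-operators --show-labels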
- On the surviving cluster, ensure that the object bucket created during the DR installation is deleted. Delete the object bucket if it was not removed by the script. The name of the object bucket used for DR starts with odrbucket.
oc get obc -n openshift-storage
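Note: If an odrbucket claim is still listed, it can be removed with a command along these lines, where <obc_name> is the leftover claim reported by the previous command.
oc delete obc <obc_name> -n openshift-storage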
- On the RHACM console, navigate to the Infrastructure > Clusters view.
- Detach the replacement cluster.
- Create a new OpenShift cluster (recovery cluster) and import the new cluster into the RHACM console. For instructions, see Creating a cluster and Importing a target managed cluster to the hub cluster in the Fusion Data Foundation documentation.
- Install the Fusion Data Foundation operator on the recovery cluster and connect it to the same external IBM Storage Ceph as the surviving cluster. For detailed instructions, refer to Deploying Data Foundation in external mode.
Note: Ensure that the Fusion Data Foundation version is 4.15 (or greater) and is the same version that is on the surviving cluster.
- On the hub cluster, install the ODF Multicluster Orchestrator operator from OperatorHub. For instructions, see chapter on Installing Fusion Data Foundation on managed clusters.
- Using the RHACM console, navigate to Data Services > Data policies.
- Select Create DRPolicy and name your policy.
- Select the recovery cluster and the surviving cluster.
- Create the policy. For instructions see chapter on Creating Disaster Recovery Policy on Hub cluster.
Proceed to the next step only after the status of the DRPolicy changes to Validated.
- Apply the DRPolicy to the applications on the surviving cluster that were originally protected before the replacement cluster failed.
- Relocate the newly protected applications on the surviving cluster back to the new recovery (primary) cluster. Using the RHACM console, navigate to the Applications menu to perform the relocation.
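Note: You can watch the relocation progress from the Hub cluster CLI if you want an additional check; the CURRENTSTATE column for each application's DRPlacementControl should eventually report Relocated. This is an optional convenience, not a required step.
oc get drpc -A -o wide -w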