Metro-DR for IBM Storage Fusion HCI System

Steps to configure Metro-DR and enable disaster recovery on applications.

The following are the high-level steps to set up Metro-DR, enable applications, and fail over and fail back applications from one site to another:

Log in to the primary site 1.
Copy the connection snippet.
Log in to the secondary site 2.
Use the connection snippet and connect site 1 and site 2.
Connect the tiebreaker.
Go to the Applications page and enable disaster recovery or enroll applications from the Replicated applications page.
To fail over applications from one site to another, in the Replicated applications page of site 1, select the applications and fail them over to site 2.
To fail back applications from the mirrored site to the original site, log in to the original site, select the applications that are marked with the failover symbol, and click Failover.

Setting up Metro-DR

Before you begin

Before you set up Metro-DR, ensure that all the mentioned prerequisites are met:

For general Metro-DR prerequisites, see General Metro-DR prerequisites.
For tiebreaker prerequisites, see Preparing the tiebreaker.
In disaster recovery, two clusters must be connected to fail over and fail back applications. If a cluster recovers after its client certificate has expired, the connection must be cleaned up and set up again so that the recovered cluster can rejoin. For the procedure to rejoin, see Reconnecting OpenShift Container Platform cluster.

For more information about Metro-DR, see Metro-DR (Disaster Recovery).
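
The certificate condition in the prerequisites above can be sketched as a simple comparison. This is only an illustration: the function name `needs_reconnect` and the timestamps are hypothetical, and how you obtain the certificate expiry time depends on your environment.

```python
from datetime import datetime, timezone


def needs_reconnect(recovered_at: datetime, cert_expires_at: datetime) -> bool:
    """Return True when the cluster recovered after its client certificate expired.

    Hypothetical helper: in that case the connection must be cleaned up and
    set up again before the recovered cluster can rejoin.
    """
    return recovered_at > cert_expires_at


# Illustrative timestamps only.
expiry = datetime(2024, 6, 1, tzinfo=timezone.utc)
late_recovery = datetime(2024, 6, 3, tzinfo=timezone.utc)
early_recovery = datetime(2024, 5, 30, tzinfo=timezone.utc)

print(needs_reconnect(late_recovery, expiry))   # True
print(needs_reconnect(early_recovery, expiry))  # False
```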

Procedure

Log in to the IBM Storage Fusion HCI System user interface of the primary site.
Go to Disaster Recovery > Overview.
Click Generate connection snippet to copy the connection snippet of the primary site.
In the Connection Snippet successfully generated window, click Copy.
Log in to the IBM Storage Fusion HCI System user interface of the secondary site.
Go to Disaster Recovery > Overview.
Click Connect to a cluster, and in the Add to cluster window, click the Metro tile.
Click Next.
In the Replication type window, paste the connection snippet of the primary site in Enter connection snippet.
Click Add.
The success of the connection is displayed in a confirmation message.
Refresh the Topology page to see a pictorial representation of the connection between the primary site and the secondary site.
Example failure statuses for Metro-DR are connection state failed, admin network connectivity failure, daemon network connectivity failure, stretch cluster addition failure, MinIO setup failure, and RamenDR failure.
Click Connect in the pictorial representation or click Connect tiebreaker.
Enter the following tiebreaker details in the Connect tiebreaker window:

IP address

Enter the IP address of the tiebreaker.

User ID

Enter the user ID to log in to the tiebreaker.

Password

Enter the password for the tiebreaker user.
Click Save.
The Tiebreaker connecting notification message is displayed. After the tiebreaker connects successfully, the Connect tiebreaker option changes to Connect cluster. You cannot add more than two clusters.
To edit the details of the tiebreaker, do the following steps:
1. Go to the Topology page.
2. In the pictorial representation, click the tiebreaker.
3. Click Edit.
4. In the Edit tiebreaker credentials window, update the details of the tiebreaker and click Save.

Note: The sites (clusters) and tiebreaker statuses are as follows:

Healthy: The status of the cluster or tiebreaker is healthy.

Degraded: A mismatch exists between the two sites.

Failed: The connectivity is lost between the sites or with the tiebreaker.

Running: All clusters and tiebreaker are running.
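
The Healthy, Degraded, and Failed statuses in the note above can be read as a small decision rule. The following sketch is illustrative only: the `Status` enum and `derive_status` function are hypothetical names, and the precedence of Failed over Degraded is an assumption. The Running status is left out because the note gives no condition that separates it from Healthy.

```python
from enum import Enum


class Status(Enum):
    HEALTHY = "Healthy"
    DEGRADED = "Degraded"
    FAILED = "Failed"


def derive_status(connectivity_ok: bool, sites_match: bool) -> Status:
    """Map two observations to a status.

    Assumption: lost connectivity takes precedence over a site mismatch.
    """
    if not connectivity_ok:
        return Status.FAILED
    if not sites_match:
        return Status.DEGRADED
    return Status.HEALTHY


print(derive_status(connectivity_ok=True, sites_match=True).value)    # Healthy
print(derive_status(connectivity_ok=True, sites_match=False).value)   # Degraded
print(derive_status(connectivity_ok=False, sites_match=False).value)  # Failed
```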

After you complete the setup, select the applications that you want to protect from the Applications page.

Enabling applications for Metro-DR

Prerequisites: The Metro-DR setup must be complete.

Procedure

As a first step, register your applications for this disaster recovery.

Go to the Applications page.
Click the ellipsis overflow menu of your application record and click Manage disaster recovery.
In the Manage disaster recovery window, select the Metro tile and click Save.
Disaster recovery is enabled for the selected application. After you register your applications, the Disaster recovery page shows a table that lists the applications that are protected by this disaster recovery.
To validate, go to the Disaster recovery > Replicated applications page.
Check whether the enrolled application is available in the Replicated applications page.

Alternatively, you can enroll applications from the Replicated applications page:

Click Enroll applications in the Replicated applications page.
In the Enroll applications for disaster recovery window, select the applications for disaster recovery.
Click Enroll.

Removing disaster recovery

If you no longer need disaster recovery for an application, do the following steps to disable it:

Prerequisites

Scale down the applications before you remove the enrollment. Otherwise, the removal can lead to an inconsistent Ramen CR and a replication error.

Procedure

Go to Disaster recovery > Replicated applications. Alternatively, you can do this action from the Applications page.
Search for the application, click the ellipsis overflow menu of the record, and click Manage disaster recovery.
In the Manage disaster recovery window, select the No disaster recovery tile.
Click Save.
Scale up the applications.
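
The scale-down and scale-up steps around the removal can be sketched as follows. This is a minimal illustration, assuming the application runs as a single OpenShift Deployment and that you scale it with the `oc` CLI. The namespace `my-app-ns`, the deployment name `my-app`, and the replica count of 1 are hypothetical. The code only builds and prints the commands; it does not run them.

```python
import shlex
from typing import List


def scale_command(namespace: str, deployment: str, replicas: int) -> List[str]:
    """Build an `oc scale` command for one Deployment (assumed workload type)."""
    return [
        "oc", "scale", f"deployment/{deployment}",
        f"--replicas={replicas}",
        "-n", namespace,
    ]


# Hypothetical names; replace them with your own application details.
scale_down = scale_command("my-app-ns", "my-app", 0)  # before removing enrollment
scale_up = scale_command("my-app-ns", "my-app", 1)    # after clicking Save

print(shlex.join(scale_down))  # oc scale deployment/my-app --replicas=0 -n my-app-ns
print(shlex.join(scale_up))    # oc scale deployment/my-app --replicas=1 -n my-app-ns
```

To run a command for real, you could pass the list to `subprocess.run(scale_down, check=True)` on a workstation that is logged in to the cluster.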

Failover applications from site 1 to site 2

From the Disaster recovery page, you can fail over applications from one site to another.

The planned failover must be done from the site to which you want the applications to fail over.

Before you begin

Consider the following points before you fail over applications:

If backup policies are applied to the applications on site 1, they continue to run until those policies are removed from the failed-over application.
The applications must be enabled for disaster recovery.
Scale down the application before you initiate a planned failover.
Before an unplanned failover of applications to the surviving site, do Metro-DR data fencing. For the procedure, see Metro-DR data fencing.
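
The points above can be summarized as a pre-failover checklist. The following sketch is only an illustration: the `AppState` class, its fields, and the `failover_blockers` function are hypothetical and are not part of any IBM Storage Fusion API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AppState:
    name: str
    dr_enabled: bool   # application is enrolled for disaster recovery
    scaled_down: bool  # application was scaled down


def failover_blockers(app: AppState, planned: bool, fencing_done: bool = False) -> List[str]:
    """Return the steps that are still outstanding before a failover."""
    blockers = []
    if not app.dr_enabled:
        blockers.append("enable disaster recovery for the application")
    if planned and not app.scaled_down:
        blockers.append("scale down the application")
    if not planned and not fencing_done:
        blockers.append("do Metro-DR data fencing")
    return blockers


app = AppState(name="my-app", dr_enabled=True, scaled_down=False)
print(failover_blockers(app, planned=True))   # ['scale down the application']
print(failover_blockers(app, planned=False))  # ['do Metro-DR data fencing']
```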

Procedure

In the Replicated applications page, click Actions > Failover.
The Failover remote applications window is displayed. It also indicates the number of applications available for failover.

This Failover action does a failover of all remote applications.
In the Failover remote applications window, select the applications and click Failover. It creates the Persistent Volumes (PVs) that belong to the applications in the local site. In addition, it creates the namespace and the application CR on the cluster from which you initiated the failover. After the relocation to the remote cluster is complete, a success message is displayed on the screen.
In the Replicated applications page, check the status of the Primary cluster.
Initially, the status of the Primary cluster for the application is Failing over. After the failover completes successfully, the primary cluster of the given application changes to the secondary site or partner cluster. An icon is then displayed next to the name of the Primary cluster, and a Failover complete notification is displayed. IBM Storage Fusion HCI System prepares the cluster for deployment of the relocated applications.
Scale up the applications.

Failback applications from the mirrored site to the original site

To fail back, go to the Replicated applications page and do the following steps:

Log in to the original site of the application.
Select one or more applications that are marked with the failover symbol and click Failover.
Click Actions > Failover.
In the Failover remote applications window, select the applications and click Failover.
A success notification is displayed on the screen.

Upgrade and upsize considerations in Metro-DR

When any of the following activities is in progress on a site, do not attempt any other activity from the list on either this site or the other site:
- Upgrade
- Scale out
- Scale up
  For example, when scale up is in progress, do not trigger an upgrade, and vice versa.
- Node maintenance, replacement, or repair
- Disk maintenance, replacement, or repair
- Unhealthy state of the OpenShift or storage cluster
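
The rule for the list above can be sketched as a guard that is evaluated before you start an activity. This is an illustration only: the activity labels, the `GUARDED_ACTIVITIES` set, and the `can_start` function are hypothetical names chosen to mirror the list.

```python
GUARDED_ACTIVITIES = frozenset({
    "upgrade",
    "scale out",
    "scale up",
    "node maintenance",
    "disk maintenance",
    "unhealthy cluster",
})


def can_start(activity, local_in_progress, remote_in_progress):
    """Return True when a listed activity may start.

    A listed activity is blocked while any listed activity is in progress
    on either site. Activities outside the list are not covered by the rule.
    """
    if activity not in GUARDED_ACTIVITIES:
        return True
    busy = (set(local_in_progress) | set(remote_in_progress)) & GUARDED_ACTIVITIES
    return not busy


print(can_start("upgrade", ["scale up"], []))          # False
print(can_start("upgrade", [], ["disk maintenance"]))  # False
print(can_start("upgrade", [], []))                    # True
```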
When one of the sites is at a lower version level than the other, do not attempt any operations on either of the sites in the Metro-DR cluster. For example, site 1 is at IBM Storage Fusion HCI System 2.6.1 (lower level) and site 2 is at IBM Storage Fusion HCI System 2.7.1.
In the following situations, do an upgrade or scale up to obtain parity between the two sites:
Note: Bring both sites to the same level as early as possible because the tolerance limit is only a couple of days.
- Ensure that the IBM Storage Fusion version is at the same level for both site 1 and site 2.
- Ensure that the OpenShift Container Platform version is at the same level for both site 1 and site 2.
- Ensure that the Submariner version is at the same level for both site 1 and site 2.
- Add nodes when a mismatch exists in the number of nodes between the sites.
- Add disks when a mismatch exists in the number of disks between the sites.

If these mismatches exist, the tolerance limit is 2 days, so bring both sites to the same level as early as possible.
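
The parity conditions above can be sketched as a field-by-field comparison of the two sites. This is a minimal illustration: the `SiteInfo` class, `find_mismatches`, and `within_tolerance` are hypothetical names. The Fusion versions come from the example above, and all other values are made up.

```python
from dataclasses import dataclass, fields
from datetime import datetime, timedelta

TOLERANCE = timedelta(days=2)  # tolerance limit for a mismatch between sites


@dataclass(frozen=True)
class SiteInfo:
    fusion_version: str
    openshift_version: str
    submariner_version: str
    node_count: int
    disk_count: int


def find_mismatches(site1: SiteInfo, site2: SiteInfo) -> dict:
    """Return {field name: (site 1 value, site 2 value)} for every differing field."""
    result = {}
    for f in fields(SiteInfo):
        a, b = getattr(site1, f.name), getattr(site2, f.name)
        if a != b:
            result[f.name] = (a, b)
    return result


def within_tolerance(first_seen: datetime, now: datetime) -> bool:
    """Return True while a mismatch is still inside the 2-day tolerance limit."""
    return now - first_seen <= TOLERANCE


# OpenShift, Submariner, node, and disk values are illustrative only.
site1 = SiteInfo("2.6.1", "4.12", "0.15", 6, 12)
site2 = SiteInfo("2.7.1", "4.12", "0.15", 7, 12)

print(find_mismatches(site1, site2))
# {'fusion_version': ('2.6.1', '2.7.1'), 'node_count': (6, 7)}
print(within_tolerance(datetime(2024, 1, 1), datetime(2024, 1, 4)))  # False
```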