Recovery steps for IBM Spectrum Scale active-passive deployments

Complete these steps if you need to recover from issues during IBM Spectrum Scale active-passive deployments. An active-passive deployment is a disaster recovery (DR) configuration. In this configuration, GPFS is not responsible for replicating data from the primary configuration to the passive configuration. Instead, the active-passive deployment uses volume replication to replicate the data.

Unplanned failover

Note: Because the active-active deployment uses GPFS replication while the active-passive deployment uses volume replication, a primary configuration can be part of either an active-active deployment or an active-passive deployment, but not both at the same time.

The following cases describe situations that can occur during active-passive deployment. If such a situation occurs, follow the steps to fail over to the available configuration.

In an unplanned failover situation, the primary configuration is already down when the takeover process starts. For example, an unexpected failure brings down the primary configuration before you can run the passive takeover process.

Unplanned failover example 1: The primary configuration is down but the primary volumes are up

Complete the following steps:

Change volume replication.
Run the Passive Takeover operation on the passive instance. Now the initial passive configuration becomes a primary configuration.
Retrieve the client private key from the passive configuration and redeploy all GPFS shared services that were pointing to the previous primary configuration.
Run the Connect to Server operation on each client to move to the passive replica file system. After the operation completes, the clients have access to the file system again.
Re-create the active-passive configuration by deploying a new passive instance on the first system, where the primary instance was originally located. Attach the volumes that became passive when you changed the replication.
Attach the new passive instance to the new primary instance. If you get a message that the passive instance is already attached, go to Remove Member > Passive to clean up the link to the initial primary instance, then try again.

After you complete these steps, you have an active-passive deployment in which the second system is active and the first system is passive.

Unplanned failover example 2: Both the primary configuration and the primary volumes are down

Complete the following steps:

Break replication from the passive volumes, if the replication is still running.
Run the Passive Takeover operation on the passive instance. Now the initial passive configuration becomes a primary configuration.
Retrieve the client private key from the passive configuration and redeploy all GPFS shared services that were pointing to the previous primary configuration.
Run the Connect to Server operation on each client to move to the passive replica file system. After the operation completes, the clients have access to the file system again.
Create new volumes and replicate the new primary volumes to them. Then deploy a new passive instance on the first system, where the primary instance was originally located. Use the newly created volumes for this instance.
Attach the new passive instance to the new primary instance. If you get a message that the passive instance is already attached, go to Remove Member > Passive to clean up the link to the initial primary instance, then try again.

After you complete these steps, you have an active-passive deployment in which the second system is active and the first system is passive.
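
The two unplanned failover examples differ only in their first step and in how the passive side is rebuilt. The following Python sketch makes that difference explicit by returning the ordered steps for each case. It is only a checklist helper: the function name unplanned_failover_steps and the primary_volumes_up flag are hypothetical, the step text is condensed from the procedures above, and nothing in it calls a real IBM Spectrum Scale or GPFS interface.

```python
# Hypothetical checklist helper. The step text is condensed from the
# procedures above; no real IBM Spectrum Scale or GPFS interface is called.

# Steps 2 to 4 are identical in both unplanned failover examples.
COMMON_STEPS = [
    "Run the Passive Takeover operation on the passive instance.",
    "Retrieve the client private key from the passive configuration and "
    "redeploy the GPFS shared services that pointed to the previous primary.",
    "Run the Connect to Server operation on each client.",
]

# The final step is also identical in both examples.
ATTACH_STEP = "Attach the new passive instance to the new primary instance."


def unplanned_failover_steps(primary_volumes_up: bool) -> list[str]:
    """Return the ordered recovery steps for an unplanned failover."""
    if primary_volumes_up:
        # Example 1: the primary configuration is down, the volumes are up.
        first = "Change volume replication."
        rebuild = ("Deploy a new passive instance on the first system and "
                   "attach the volumes that are now passive.")
    else:
        # Example 2: the primary configuration and the volumes are both down.
        first = ("Break replication from the passive volumes, "
                 "if the replication is still running.")
        rebuild = ("Create new volumes, replicate the new primary volumes, "
                   "and deploy a new passive instance on the first system "
                   "that uses the newly created volumes.")
    return [first, *COMMON_STEPS, rebuild, ATTACH_STEP]


if __name__ == "__main__":
    steps = unplanned_failover_steps(primary_volumes_up=True)
    for number, step in enumerate(steps, start=1):
        print(f"{number}. {step}")
```

Calling the function with primary_volumes_up=False returns the steps for example 2. Comparing the two lists shows that only the first and fifth entries change.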

Planned failover

In a planned failover situation, the primary nodes and primary volumes are active during the takeover process. For example, the primary system needs to be upgraded.

Note: As soon as the primary instance is down, all clients that are connected to the primary instance lose access to the shared file system. This situation happens because the primary GPFS nodes are down and the GPFS nodes from the passive configuration, which run in a separate cluster, are not yet ready to take over the clients. The passive nodes are activated and the passive file system becomes active after the steps for failing over to the passive configuration are completed.

Complete the following steps:

Run the Prepare Primary for Takeover operation on the primary instance.
Change volume replication.
Run the Passive Takeover operation on the passive instance. Now the initial passive configuration becomes a primary configuration.
Retrieve the client private key from the passive configuration and redeploy all GPFS shared services that were pointing to the previous primary configuration.
Run the Connect to Server operation on each client to move to the passive replica file system. After the operation completes, the clients have access to the file system again.

After these steps are completed, all the clients have access to the file system again.
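
Because the planned failover steps must run in the listed order, it can help to track progress explicitly. The following Python sketch is one way to do that, under stated assumptions: the PlannedFailover class and its members are hypothetical names, the step text is condensed from the procedure above, and the code does not call any real IBM Spectrum Scale or GPFS interface. It only records which steps an operator has finished.

```python
# Hypothetical progress tracker. The operation names are copied from the
# steps above; no real IBM Spectrum Scale or GPFS interface is called.

PLANNED_FAILOVER_STEPS = (
    "Run the Prepare Primary for Takeover operation on the primary instance",
    "Change volume replication",
    "Run the Passive Takeover operation on the passive instance",
    "Retrieve the client private key and redeploy the GPFS shared services",
    "Run the Connect to Server operation on each client",
)


class PlannedFailover:
    """Tracks which planned failover steps are done, in strict order."""

    def __init__(self) -> None:
        self.completed = 0  # number of steps finished so far

    def next_step(self):
        """Return the next pending step, or None when all steps are done."""
        if self.completed < len(PLANNED_FAILOVER_STEPS):
            return PLANNED_FAILOVER_STEPS[self.completed]
        return None

    def complete(self, step: str) -> None:
        """Mark a step as done and reject any step that is out of order."""
        expected = self.next_step()
        if step != expected:
            raise ValueError(f"expected {expected!r}, got {step!r}")
        self.completed += 1

    @property
    def clients_reconnected(self) -> bool:
        """True only after the last step, Connect to Server, is complete."""
        return self.completed == len(PLANNED_FAILOVER_STEPS)


if __name__ == "__main__":
    run = PlannedFailover()
    while (step := run.next_step()) is not None:
        print(f"Step {run.completed + 1}: {step}")
        run.complete(step)
    print("Clients reconnected:", run.clients_reconnected)
```

Rejecting out-of-order steps mirrors the procedure itself: for example, the Passive Takeover operation appears only after the Prepare Primary for Takeover operation and the replication change.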