Failover to the recovery site and subsequent failback for an active-passive configuration

For an active-passive, storage-replication-based cluster, complete these steps to fail production over to the recovery site.

Procedure when the Clustered Configuration Repository (CCR) scheme is in use

  1. Stop the GPFS daemon on the surviving nodes at site B, where the file gpfs.siteB lists all of the nodes at site B:
    mmshutdown -N gpfs.siteB
  2. Issue the appropriate storage subsystem commands to make the secondary replication devices available, changing their status from secondary devices to suspended primary devices.
  3. If site C, the tiebreaker, failed along with site A, the existing node quorum designations must be relaxed to allow the surviving site to fulfill quorum requirements. To relax node quorum, temporarily change the designation of each of the failed quorum nodes to nonquorum nodes using the --force option:
    mmchnode --nonquorum -N nodeA001,nodeA002,nodeA003,nodeC --force
  4. Ensure that the source volumes are not accessible from the recovery site, using one of these methods:
    • Disconnect the cable.
    • Define the nsddevices user exit file to exclude the source volumes.
  5. Restart the GPFS daemon on all surviving nodes:
    mmstartup -N gpfs.siteB
Note: Make no further changes to the quorum designations at site B until the failed sites are back online and the failback procedure below has been completed. In particular, do not shut down the current set of nodes at the surviving site B and restart operations on the failed sites A and C; doing so results in a non-working cluster.
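For step 4, the nsddevices user exit (/var/mmfs/etc/nsddevices) lets an administrator control which devices GPFS considers during device discovery. The following is a minimal sketch of the filtering logic such a script could use to hide the source volumes; the device names (dm-10, dm-11) and the "dmm" device type are assumptions for illustration, not taken from this document.

```shell
#!/bin/bash
# Sketch of the filtering logic for a /var/mmfs/etc/nsddevices user exit.
# Device names below are hypothetical examples.

# list_nsd_devices <device-dir> "<space-separated excluded names>"
# Prints "name deviceType" for every dm-* device not in the exclude list.
list_nsd_devices() {
    devdir=$1
    exclude=$2
    for dev in "$devdir"/dm-*; do
        [ -e "$dev" ] || continue          # no matches: skip the literal glob
        name=${dev##*/}
        case " $exclude " in
            *" $name "*) continue ;;       # hide a source volume
        esac
        echo "$name dmm"                   # dmm = device-mapper multipath
    done
    return 0   # in the real user exit, exit 0 tells GPFS to use only this list
}

# Hypothetical call: list every multipath device except the two source LUNs.
list_nsd_devices /dev "dm-10 dm-11"
```

In the real user exit, exiting with status 0 after printing the list tells GPFS to bypass its built-in discovery and use only the devices listed, so the source volumes are never opened from the recovery site.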

Failback procedure

After the production site is physically operational again, complete the failback procedure to transfer file system activity back to the production GPFS cluster. The failback operation consists of these steps:

  1. For each of the paired volumes, resynchronize the pairs in the reverse direction, with the recovery LUN lunRx acting as the source for the production LUN lunPx. An incremental resynchronization identifies the mismatched disk tracks and copies their content from the recovery LUN to the production LUN. Once the data has been copied and replication is running in the reverse direction, this configuration can be maintained until a suitable time is chosen to switch back to site P.
  2. If the state of the system configuration has changed, update the GPFS configuration data in the production cluster to propagate the changes made while in failover mode. From a node at the recovery site, issue:
    mmfsctl all syncFSconfig -n gpfs.sitePnodes
  3. Stop GPFS on all nodes in the recovery cluster and reverse the disk roles so the original primary disks become the primaries again:
    1. From a node in the recovery cluster, stop the GPFS daemon on all nodes in the recovery cluster:
      mmshutdown -a
    2. Perform the appropriate actions to switch the replication direction so that diskA is now the source and diskB is the target.
    3. From a node in the production cluster, start GPFS:
      mmstartup -a
    4. From a node in the production cluster, mount the file system on all nodes in the production cluster.
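The failback commands above can be sketched as a single sequence. This is a dry-run sketch, not the definitive procedure: with RUN=echo each cluster command is printed rather than executed, the replication-direction switch in step 3.2 is storage-specific and left as a comment, steps 3.3 and 3.4 must actually be issued from a production-cluster node, and the use of `mmmount all -a` for step 3.4 is an assumption.

```shell
#!/bin/bash
# Dry-run sketch of the failback sequence. Set RUN="" to execute for real;
# with RUN=echo each cluster command is printed instead of run.
RUN="echo"

$RUN mmfsctl all syncFSconfig -n gpfs.sitePnodes   # step 2: propagate config changes
$RUN mmshutdown -a                                 # step 3.1: stop the recovery cluster
# step 3.2: storage-specific commands reverse the replication direction here
$RUN mmstartup -a                                  # step 3.3: start GPFS (production node)
$RUN mmmount all -a                                # step 3.4: mount on all nodes (assumed)
```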