Failover to the recovery site and subsequent failback for an active/active configuration

For an active/active storage replication based cluster, complete these steps to restore access to the file system through site B after site A has experienced a disastrous failure.

Procedure when the server-based configuration scheme is in use

  1. Stop the GPFS daemon on the surviving nodes at site B, where the file gpfs.siteB lists all of the nodes at site B:
    mmdsh -N gpfs.siteB /usr/lpp/mmfs/bin/mmshutdown
  2. Perform the appropriate commands to make the secondary replication devices available and change their status from being secondary devices to suspended primary devices.
  3. If you need to relax node quorum or make configuration changes, migrate the primary cluster configuration server to a node at site B:
    mmchcluster -p nodeB001
  4. If site C, the tiebreaker, failed along with site A, existing node quorum designations must be relaxed in order to allow the surviving site to fulfill quorum requirements. To relax node quorum, temporarily change the designation of each of the failed quorum nodes to non-quorum nodes:
    mmchnode --nonquorum -N nodeA001,nodeA002,nodeA003,nodeC
  5. Ensure the source volumes are not accessible to the recovery site:
    • Disconnect the cable.
    • Define the nsddevices user exit file to exclude the source volumes.
  6. Restart the GPFS daemon on all surviving nodes:
    mmstartup -N gpfs.siteB
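Step 5 refers to the nsddevices user exit without showing it. A minimal sketch of /var/mmfs/etc/nsddevices follows, assuming Linux generic device names; the source volume names sdc and sdd are hypothetical placeholders for your site's actual source volumes. GPFS sources this file during disk discovery, so it ends with return rather than exit:

```shell
# /var/mmfs/etc/nsddevices -- user exit consulted during GPFS disk discovery.
# Sketch only: list every candidate device EXCEPT the source volumes, so
# GPFS at the recovery site cannot access them.
for dev in /dev/sd*; do
  case $dev in
    /dev/sdc|/dev/sdd) ;;                 # source volumes: excluded (hypothetical names)
    *) echo "${dev#/dev/} generic" ;;     # output format is "deviceName deviceType"
  esac
done
# A zero return tells GPFS to use only the devices listed above and to
# skip its built-in device discovery.
return 0
```

During failback (step 4 of the failback procedure), edit the same file to stop excluding the source volumes.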

Procedure when the Clustered Configuration Repository (CCR) scheme is in use

For an active/active PPRC-based cluster, follow these steps to restore access to the file system through site B after site A has experienced a disastrous failure:
  1. Stop the GPFS daemon on the surviving nodes at site B, where the file gpfs.siteB lists all of the nodes at site B:
    mmshutdown -N gpfs.siteB
  2. Perform the appropriate commands to make the secondary replication devices available and change their status from being secondary devices to suspended primary devices.
  3. If site C, the tiebreaker, failed along with site A, existing node quorum designations must be relaxed in order to allow the surviving site to fulfill quorum requirements. To relax node quorum, temporarily change the designation of each of the failed quorum nodes to non-quorum nodes using the --force option:
    mmchnode --nonquorum -N nodeA001,nodeA002,nodeA003,nodeC --force
  4. Ensure that the source volumes are not accessible to the recovery site:
    • Disconnect the cable.
    • Define the nsddevices user exit file to exclude the source volumes.
  5. Restart the GPFS daemon on all surviving nodes:
    mmstartup -N gpfs.siteB
Note:
  • Make no further changes to the quorum designations at site B until the failed sites are back online and the following failback procedure has been completed.
  • Do not shut down the current set of nodes at the surviving site B and then restart operations on the failed sites A and C; doing so results in a non-working cluster.
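After the restart in step 5, it is worth confirming that the surviving nodes have actually formed quorum before resuming operations. A minimal check, using standard GPFS commands run from any node at site B:

```shell
# Show the daemon state and quorum summary for all nodes; the surviving
# site B quorum nodes should report "active" and quorum should be achieved.
/usr/lpp/mmfs/bin/mmgetstate -a -L

# Confirm the temporary quorum designations made during failover.
/usr/lpp/mmfs/bin/mmlscluster
```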

Failback procedure

After the operation of site A has been restored, complete the failback procedure to restore access to the file system from that location. The procedure is the same for both configuration schemes (server-based and Clustered Configuration Repository (CCR)). The failback operation is a two-step process:

  1. For each of the paired volumes, resynchronize the pairs in the reverse direction, with the recovery LUN diskB acting as the source for the production LUN diskA. An incremental resynchronization is performed, which identifies the mismatching disk tracks, whose content is then copied from the recovery LUN to the production LUN. Once the data has been copied and the replication is running in the reverse direction, this configuration can be maintained until a time is chosen to switch back to site A.
  2. Shut down GPFS and reverse the disk roles (the original primary disks become the primaries again), bringing the replication pairs to their initial state.
    1. Stop the GPFS daemon on all nodes.
    2. Perform the appropriate actions to switch the replication direction so that diskA is now the source and diskB is the target.
    3. If during failover you migrated the primary cluster configuration server to a node in site B:
      1. Migrate the primary cluster configuration server back to site A:
        mmchcluster -p nodeA001
      2. Restore the initial quorum assignments:
        mmchnode --quorum -N nodeA001,nodeA002,nodeA003,nodeC
      3. Ensure that all nodes have the latest copy of the mmsdrfs file:
        mmchcluster -p LATEST
    4. Ensure the source volumes are accessible to the recovery site:
      • Reconnect the cable.
      • Edit the nsddevices user exit file to include the source volumes.
    5. Start the GPFS daemon on all nodes:
      mmstartup -a
    6. Mount the file system on all the nodes at sites A and B.
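Step 6 can be performed cluster-wide with a single command rather than node by node. A sketch, assuming a hypothetical file system device name fs0:

```shell
# Mount the file system on all nodes at sites A and B in one step.
/usr/lpp/mmfs/bin/mmmount fs0 -a

# Verify that every node has mounted it.
/usr/lpp/mmfs/bin/mmlsmount fs0 -L
```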