Failover to the recovery site and subsequent failback for an active-passive configuration
For an active-passive storage replication based cluster, complete these steps to fail over production to the recovery site.
Procedure when the Clustered Configuration Repository (CCR) scheme is in use
- Stop the GPFS daemon on the surviving nodes at site B, where the file gpfs.siteB lists all of the nodes at site B:
mmshutdown -N gpfs.siteB
- Issue the appropriate storage-subsystem commands to make the secondary replication devices available and change their status from secondary devices to suspended primary devices.
- If site C, the tiebreaker, failed along with site A, the existing node quorum designations must be relaxed to allow the surviving site to fulfill quorum requirements. To relax node quorum, temporarily change the designation of each failed quorum node to nonquorum using the --force option:
mmchnode --nonquorum -N nodeA001,nodeA002,nodeA003,nodeC --force
- Ensure that the source volumes are not accessible to the recovery site:
- Disconnect the cable
- Define the nsddevices user exit file to exclude the source volumes
- Restart the GPFS daemon on all surviving nodes:
mmstartup -N gpfs.siteB
Note: Make no further changes to the quorum designations at site B until the failed sites are back online and the following failback procedure has been completed. Do not shut down the current set of nodes on the surviving site B and then restart operations from the failed sites A and C; doing so results in a non-working cluster.
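The "Define the nsddevices user exit file" step above can be sketched as follows. This is a hypothetical example, not the shipped sample: the device names (hdiskpowerA*, hdiskpowerB*) and the "generic" driver type are placeholders, and the exact output format GPFS expects should be checked against the nsddevices sample installed with the product before deploying to /var/mmfs/etc/nsddevices.

```shell
#!/bin/sh
# Sketch of an nsddevices user exit that hides the failed site's source
# volumes from NSD device discovery at the recovery site. Each output
# line names a candidate device (relative to /dev) and a driver type.

list_nsd_candidates() {
    for dev in "$@"; do
        case "$dev" in
            hdiskpowerA*) ;;                     # exclude site A source volumes
            *)            echo "$dev generic" ;; # expose everything else
        esac
    done
}

# A real exit would enumerate the devices under /dev; sample names here.
list_nsd_candidates hdiskpowerA1 hdiskpowerA2 hdiskpowerB1 hdiskpowerB2
```

A real user exit would end by exiting 0 to indicate that the list it printed is complete; consult the product's sample exit for the exact convention on your release.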
Failback procedure
After the physical operation of the production site has been restored, complete the failback procedure to transfer file system activity back to the production GPFS cluster. The failback operation consists of the following steps:
- For each of the paired volumes, resynchronize the pairs in the reverse direction, with the recovery LUN lunRx acting as the source for the production LUN lunPx. An incremental resynchronization is performed, which identifies the mismatching disk tracks and copies their content from the recovery LUN to the production LUN. Once the data has been copied and the replication is running in the reverse direction, this configuration can be maintained until a time is chosen to switch back to site P.
- If the state of the system configuration has changed, update the GPFS configuration data in the production cluster to propagate the changes made while in failover mode. From a node at the recovery site, issue:
mmfsctl all syncFSconfig -n gpfs.sitePnodes
- Stop GPFS on all nodes in the recovery cluster and reverse the disk roles so that the original primary disks become the primaries again:
- From a node in the recovery cluster, stop the GPFS daemon on all nodes in the recovery cluster:
mmshutdown -a
- Perform the appropriate actions to switch the replication direction so that diskA is now the source and diskB is the target.
- From a node in the production cluster, start GPFS:
mmstartup -a
- From a node in the production cluster, mount the file system on all nodes in the production cluster.
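Because the failback steps above run on two different clusters, they cannot be executed as a single script. As a consolidated reference, the command sequence is sketched below, assuming the node file gpfs.sitePnodes from the text; the replication reversal itself is storage-vendor specific, and mmmount all -a is an assumed final step to remount every file system cluster-wide.

```shell
# On a node at the recovery site: push config changes made during failover
# back to the production cluster.
mmfsctl all syncFSconfig -n gpfs.sitePnodes

# On a node in the recovery cluster: quiesce GPFS before reversing roles.
mmshutdown -a

# Storage-subsystem CLI (vendor specific): reverse replication so that
# diskA is the source and diskB the target.

# On a node in the production cluster: restart GPFS and remount.
mmstartup -a
mmmount all -a    # assumption: mount all file systems on all nodes
```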