Failover to the surviving site
Following a disaster, the failover procedure to implement depends on whether the tiebreaker site is also affected.
Failover without the loss of tiebreaker site C
The proposed three-site configuration is resilient to a complete failure of any single hardware site. Should all disk volumes in one of the failure groups become unavailable, GPFS performs a transparent failover to the remaining set of disks and continues serving the data to the surviving subset of nodes with no administrative intervention.
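To confirm that the transparent failover completed, you can list any disks that are not in the up/ready state; a sketch, assuming the file system is named fs0 as elsewhere in this procedure:
mmlsdisk fs0 -e
The disks in the failed failure group are reported as down, while the file system remains mounted and available on the surviving nodes.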
Failover with the loss of tiebreaker site C with server-based configuration in use
If both site A and site C fail:
- Shut the GPFS daemon down on the surviving nodes at site B, where the file gpfs.siteB lists all of the nodes at site B:
mmshutdown -N gpfs.siteB
- If it is necessary to make changes to the configuration, migrate the primary cluster configuration server to a node at site B:
mmchcluster -p nodeB002
- Relax node quorum by temporarily changing the designation of each of the failed quorum nodes to non-quorum nodes:
mmchnode --nonquorum -N nodeA001,nodeA002,nodeA003,nodeC
- Relax file system descriptor quorum by informing the GPFS daemon to migrate the file system descriptor off of the failed disks:
mmfsctl fs0 exclude -d "gpfs1nsd;gpfs2nsd;gpfs5nsd"
- Restart the GPFS daemon on the surviving nodes:
mmstartup -N gpfs.siteB
- Mount the file system on the surviving nodes at site B.
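The final mount step can be performed with the mmmount command; a sketch, using the fs0 file system name and the gpfs.siteB node file from the steps above:
mmmount fs0 -N gpfs.siteB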
Failover with the loss of tiebreaker site C with Clustered Configuration Repository (CCR) in use
If both site A and site C fail:
- Shut the GPFS daemon down on the surviving nodes at site B, where the file gpfs.siteB lists all of the nodes at site B:
mmdsh -N gpfs.siteB /usr/lpp/mmfs/bin/mmshutdown
- Change (downgrade) the quorum designations of the failed quorum nodes, using the --force option because half or more of the quorum nodes are no longer available:
mmchnode --nonquorum -N nodeA001,nodeA002,nodeA003,nodeC --force
- Relax file system descriptor quorum by informing the GPFS daemon to migrate the file system descriptor off of the failed disks:
mmfsctl fs0 exclude -d "gpfs1nsd;gpfs2nsd;gpfs5nsd"
- Restart the GPFS daemon on the surviving nodes:
mmstartup -N gpfs.siteB
- Mount the file system on the surviving nodes at site B.
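As in the server-based case, the final restart-and-mount steps can be verified and completed with mmgetstate and mmmount; a sketch, assuming the fs0 file system name and the gpfs.siteB node file used above:
mmgetstate -N gpfs.siteB
mmmount fs0 -N gpfs.siteB
mmgetstate should report each surviving node as active before the file system is mounted.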
Make no further changes to the quorum designations at site B until the failed sites are back online and the following failback procedure has been completed.
Do not shut down the current set of nodes at the surviving site B and then restart operations on the failed sites A and C; doing so results in a non-working cluster.