Failover to the surviving site

Which failover procedure to follow after a disaster depends on whether the tiebreaker site is affected.

Failover without the loss of tiebreaker site C

The proposed three-site configuration is resilient to a complete failure of any single hardware site. Should all disk volumes in one of the failure groups become unavailable, GPFS performs a transparent failover to the remaining set of disks and continues serving the data to the surviving subset of nodes with no administrative intervention.

Failover with the loss of tiebreaker site C with server-based configuration in use

If both site A and site C fail:

  1. Shut the GPFS daemon down on the surviving nodes at site B, where the file gpfs.siteB lists all of the nodes at site B:
    mmshutdown -N gpfs.siteB
  2. If it is necessary to make changes to the configuration, migrate the primary cluster configuration server to a node at site B:
    mmchcluster -p nodeB002
  3. Relax node quorum by temporarily changing the designation of each of the failed quorum nodes to non-quorum nodes:
    mmchnode --nonquorum -N nodeA001,nodeA002,nodeA003,nodeC
  4. Relax file system descriptor quorum by informing the GPFS daemon to migrate the file system descriptor off of the failed disks:
    mmfsctl fs0 exclude -d "gpfs1nsd;gpfs2nsd;gpfs5nsd"
  5. Restart the GPFS daemon on the surviving nodes:
    mmstartup -N gpfs.siteB 
  6. Mount the file system on the surviving nodes at site B.
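The server-based steps above can be sketched as a small dry-run script. This is an illustration, not a supported tool. The `run` helper and the `DRY_RUN` and `MIGRATE_PRIMARY` variables are hypothetical names introduced here. The use of `mmmount fs0 -N gpfs.siteB` for step 6 is an assumption, because the procedure does not name a mount command. The script also assumes that the GPFS commands are on the PATH.

```shell
#!/bin/sh
# Dry-run sketch of the server-based failover to site B.
# DRY_RUN=1 (the default) only prints each command; set DRY_RUN=0 to execute.
DRY_RUN="${DRY_RUN:-1}"
# Step 2 is only needed if the configuration must be changed (hypothetical switch).
MIGRATE_PRIMARY="${MIGRATE_PRIMARY:-1}"

run() {
    # Print the command, then execute it only when dry-run is disabled.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

failover_server_based() {
    # Each step runs only if the previous one succeeded.
    run mmshutdown -N gpfs.siteB &&
    { [ "$MIGRATE_PRIMARY" != "1" ] || run mmchcluster -p nodeB002; } &&
    run mmchnode --nonquorum -N nodeA001,nodeA002,nodeA003,nodeC &&
    run mmfsctl fs0 exclude -d "gpfs1nsd;gpfs2nsd;gpfs5nsd" &&
    run mmstartup -N gpfs.siteB &&
    # Assumption: mmmount is used for the final mount step.
    run mmmount fs0 -N gpfs.siteB
}

failover_server_based
```

With the default settings the script only lists the six commands in order, so the sequence can be reviewed before anything is changed. The printed lines omit shell quoting, so copy commands from the procedure rather than from the dry-run output.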

Failover with the loss of tiebreaker site C with Clustered Configuration Repository (CCR) in use

If both site A and site C fail:

  1. Shut the GPFS daemon down on the surviving nodes at site B, where the file gpfs.siteB lists all of the nodes at site B:
    mmdsh -N gpfs.siteB /usr/lpp/mmfs/bin/mmshutdown
  2. From site B, change (downgrade) the quorum assignments of the failed quorum nodes. Because half or more of the quorum nodes are no longer available, use the --force option:
    mmchnode --nonquorum -N nodeA001,nodeA002,nodeA003,nodeC --force
  3. Relax file system descriptor quorum by informing the GPFS daemon to migrate the file system descriptor off of the failed disks:
    mmfsctl fs0 exclude -d "gpfs1nsd;gpfs2nsd;gpfs5nsd"
  4. Restart the GPFS daemon on the surviving nodes:
    mmstartup -N gpfs.siteB
  5. Mount the file system on the surviving nodes at site B.
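As a rough sketch, the CCR steps above can be assembled from a few variables so that the node and disk lists are stated once. The `step` helper and the variable names (`SITE_B_NODES`, `FAILED_QUORUM`, `FAILED_NSDS`, `FS`, `DRY_RUN`) are hypothetical. Using `mmmount` for step 5 is an assumption, because the procedure does not name a mount command. The script assumes that the GPFS commands are on the PATH.

```shell
#!/bin/sh
# Dry-run sketch of the CCR failover to site B; prints commands unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
SITE_B_NODES="gpfs.siteB"                         # node file listing the site B nodes
FAILED_QUORUM="nodeA001,nodeA002,nodeA003,nodeC"  # failed quorum nodes at sites A and C
FAILED_NSDS="gpfs1nsd;gpfs2nsd;gpfs5nsd"          # disks at the failed sites
FS="fs0"                                          # file system device

step() {
    # $1 is the step number; the remaining arguments form the command.
    n="$1"; shift
    echo "step $n: $*"
    # Execute only when dry-run is disabled.
    [ "$DRY_RUN" != "0" ] || "$@"
}

ccr_failover() {
    # Each step runs only if the previous one succeeded.
    step 1 mmdsh -N "$SITE_B_NODES" /usr/lpp/mmfs/bin/mmshutdown &&
    step 2 mmchnode --nonquorum -N "$FAILED_QUORUM" --force &&
    step 3 mmfsctl "$FS" exclude -d "$FAILED_NSDS" &&
    step 4 mmstartup -N "$SITE_B_NODES" &&
    # Assumption: mmmount is used for the final mount step.
    step 5 mmmount "$FS" -N "$SITE_B_NODES"
}

ccr_failover
```

Unlike the server-based procedure, this sequence has no mmchcluster step and passes --force to mmchnode.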

Make no further changes to the quorum designations at site B until the failed sites are back online and the following failback procedure has been completed.

Do not shut down the current set of nodes at the surviving site B and then restart operations at the failed sites A and C. Doing so results in a non-working cluster.