Failure recovery processing
GPFS failure recovery processing occurs automatically. Therefore, while it is not necessary, some familiarity with its internal functions is useful when failures are observed.
Only one state change, such as the loss or initialization of a node, can be processed at a time, and subsequent changes are queued. Therefore, failure processing must be completed in its entirety before the failed node can rejoin the group. All failures are processed first, which means that GPFS handles all pending failures before it completes any recovery.
GPFS recovers from a node failure by using join or leave processing messages that are sent explicitly by the cluster manager node. The cluster manager node observes that a node has failed when it no longer receives heartbeat messages from that node. The join or leave processing messages are broadcast to the entire group of nodes that run GPFS, and each node updates its status for the failing or joining node. If the cluster manager node itself fails, the cluster elects a new cluster manager, and the newly elected cluster manager node then processes the failure message for the failed cluster manager.
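For example, the state of the GPFS daemon on each node, and the node that is currently acting as the cluster manager, can be displayed from any node in the cluster. The exact output format depends on the release:

   mmgetstate -a
   mmlsmgr -c

The mmgetstate -a command reports the daemon state, such as active, arbitrating, or down, for every node in the cluster, and in recent GPFS releases mmlsmgr -c reports which node currently holds the cluster manager role.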
When GPFS is notified that a node has failed or that the GPFS daemon on a node has failed, it starts recovery for each of the file systems that were mounted on the failed node. If necessary, new file system managers are selected for any file systems that no longer have one.
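To see which node is currently serving as the file system manager, issue the mmlsmgr command with no operands to list the manager for every file system, or name a specific file system. The device name fs1 in the second command is an example only:

   mmlsmgr
   mmlsmgr fs1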
The file system manager for each file system ensures that the failed node can no longer access the disks that comprise the file system. If the file system manager is newly appointed as a result of this failure, it rebuilds the token state by querying the other nodes in the group. After this rebuilding process is completed, the actual recovery of the log of the failed node begins. This recovery rebuilds the metadata that was being modified at the time of the failure to a consistent state. Sometimes blocks are left allocated that are not part of any file; these blocks are effectively lost until mmfsck is run, either online or offline. After log recovery is complete, the locks that were held by the failed node are released for this file system. When this activity is completed for all file systems, failure processing is done. The last step of this process allows the failed node to rejoin the cluster.
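If lost blocks are suspected after a recovery, the file system can be checked with the mmfsck command. A minimal example, assuming a file system named fs1 (the name is illustrative only):

   mmfsck fs1 -n
   mmfsck fs1 -y

The -n option reports inconsistencies without making any changes; the -y option repairs them. A full offline check requires that the file system be unmounted on all nodes; as noted above, an online check can also be run while the file system is mounted.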