Failback with permanent loss
If the outage is permanent, follow these steps to remove and replace the failed resources, and then resume the operation of GPFS across the cluster:
- Remove the failed resources from the GPFS configuration.
- Replace the failed resources, then add the new resources into the configuration.
- Resume the operation of GPFS across the entire cluster.
Assume that sites A and C have suffered permanent losses. To remove all references to the failed nodes and disks from the GPFS configuration and replace them:
Procedure when Clustered Configuration Repository (CCR) is in use
- To remove the failed resources from the GPFS configuration:
- Delete the failed disks from the GPFS configuration:
mmdeldisk fs0 "gpfs1nsd;gpfs2nsd;gpfs5nsd"
mmdelnsd "gpfs1nsd;gpfs2nsd;gpfs5nsd"
- Delete the failed nodes from the GPFS configuration:
mmdelnode -N nodeA001,nodeA002,nodeA003,nodeA004,nodeC
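At this point you can optionally confirm that the failed nodes and disks are no longer referenced by listing the remaining cluster members and NSDs (a verification step not in the original procedure; both are standard GPFS commands):
mmlscluster
mmlsnsd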
- If there are new resources to add to the configuration:
- Add the new nodes at sites A and C to the cluster, where the file gpfs.sitesAC lists the new nodes:
mmaddnode -N gpfs.sitesAC
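For illustration, gpfs.sitesAC might simply list one new node per line, as sketched below; the quorum designations are restored separately in the next step, and the exact contents depend on your environment:
nodeA001
nodeA002
nodeA003
nodeA004
nodeC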
- Restore the original quorum designations for the new nodes at sites A and C:
mmchnode --quorum -N nodeA001,nodeA002,nodeA003,nodeC
- Start GPFS on the new nodes:
mmstartup -N gpfs.sitesAC
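Once started, the state of the GPFS daemons can be checked with mmgetstate (shown here as an optional verification, not part of the original procedure):
mmgetstate -a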
- Prepare the new disks for use in the cluster and create the NSDs, using the original disk descriptors for sites A and C contained in the file clusterDisksAC:
%nsd: device=/dev/diskA1 servers=nodeA002,nodeA003 usage=dataAndMetadata failureGroup=1
%nsd: device=/dev/diskA2 servers=nodeA003,nodeA002 usage=dataAndMetadata failureGroup=1
%nsd: device=/dev/diskC1 servers=nodeC usage=descOnly failureGroup=3

mmcrnsd -F clusterDisksAC
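To confirm that the NSDs were created and are not yet assigned to a file system, you can list the free NSDs (an optional check; the -F flag of mmlsnsd displays disks that do not belong to a file system):
mmlsnsd -F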
- Add the new NSDs to the file system, specifying the -r option to rebalance the data across all disks:
mmadddisk fs0 -F clusterDisksAC -r
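Note that rebalancing is an I/O-intensive operation on a large file system. If you prefer, you can omit -r from mmadddisk and run the rebalance later with mmrestripefs (a standard GPFS command; fs0 is the file system from this example):
mmrestripefs fs0 -b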