Failback with permanent loss

If an outage is permanent, recover by removing and replacing the failed resources and then resuming the operation of GPFS across the cluster:

  1. Remove the failed resources from the GPFS configuration.
  2. Replace the failed resources, then add the new resources into the configuration.
  3. Resume the operation of GPFS across the entire cluster.

Assume that sites A and C have suffered permanent losses. To remove all references to the failed nodes and disks from the GPFS configuration and replace them, follow this procedure:

Procedure when Clustered Configuration Repository (CCR) is in use

  1. To remove the failed resources from the GPFS configuration:
    1. Delete the failed disks from the GPFS configuration:
      mmdeldisk fs0 "gpfs1nsd;gpfs2nsd;gpfs5nsd"
      mmdelnsd "gpfs1nsd;gpfs2nsd;gpfs5nsd"
    2. Delete the failed nodes from the GPFS configuration:
      mmdelnode -N nodeA001,nodeA002,nodeA003,nodeA004,nodeC
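    Optionally, confirm that the failed nodes and NSDs no longer appear in the configuration (a suggested check using standard GPFS listing commands, not part of the original procedure):
      mmlscluster
      mmlsnsd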
      
  2. If there are new resources to add to the configuration:
    1. Add the new nodes at sites A and C to the cluster where the file gpfs.sitesAC lists the new nodes:
      mmaddnode -N gpfs.sitesAC
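    For example, assuming the replacement nodes reuse the original host names (an assumption for illustration, consistent with the quorum step that follows), gpfs.sitesAC would list one node per line:
      nodeA001
      nodeA002
      nodeA003
      nodeA004
      nodeC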
    2. Restore the original quorum designations on the replacement nodes at sites A and C:
      mmchnode --quorum -N nodeA001,nodeA002,nodeA003,nodeC
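    The restored designations can be verified with mmlscluster, which shows quorum nodes in its Designation column (an optional check):
      mmlscluster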
    3. Start GPFS on the new nodes:
      mmstartup -N gpfs.sitesAC
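    To confirm that the GPFS daemon is active on the new nodes, mmgetstate accepts the same node file (an optional check):
      mmgetstate -N gpfs.sitesAC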
    4. Prepare the new disks for use in the cluster and create the NSDs, using the original disk descriptors for sites A and C contained in the file clusterDisksAC:
      %nsd: device=/dev/diskA1
        servers=nodeA002,nodeA003
        usage=dataAndMetadata
        failureGroup=1

      %nsd: device=/dev/diskA2
        servers=nodeA003,nodeA002
        usage=dataAndMetadata
        failureGroup=1

      %nsd: device=/dev/diskC1
        servers=nodeC
        usage=descOnly
        failureGroup=3

    Then issue the mmcrnsd command to create the NSDs:
      mmcrnsd -F clusterDisksAC
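    To verify that the NSDs were created with the expected server assignments, mmlsnsd can be run (an optional check):
      mmlsnsd -m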
    5. Add the new NSDs to the file system, specifying the -r option to rebalance the data across all disks:
      mmadddisk fs0 -F clusterDisksAC -r
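    After the rebalance completes, mmdf reports per-disk capacity and utilization, and mmlsdisk shows the status and availability of each disk in the file system (optional checks):
      mmdf fs0
      mmlsdisk fs0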