Auto recovery

Storage pools over internal disks, whether FPO-enabled or FPO-disabled, are subject to frequent node and disk failures because of the commodity hardware used in IBM Spectrum Scale™ clusters.

The IBM Spectrum Scale auto recovery feature is designed to handle random but routine node and disk failures without manual intervention. However, auto recovery cannot cover every catastrophic outage that involves a large number of nodes and disks at once. In such cases, the administrator must assess the situation and use judgment to determine the appropriate cluster recovery action.

Note:

Following are some important recommendations:

  • Disks that auto recovery suspends must be manually resumed by the administrator before the file system can use them again. Once the owning node is determined to be healthy again, run the mmchdisk resume command.
  • If an extended outage (days or weeks) is expected, it is recommended to remove the affected node and all of its associated disks from the cluster so that the outage does not affect subsequent recovery actions.
  • If the failed disk is a metadata disk, GPFS™ tries to suspend it during auto recovery by using the mmchdisk <file-system> suspend command. If the number of remaining failure groups of metadata or data disks is less than the replication value (-m for metadata, -r for data), the suspend operation fails, and auto recovery therefore takes no further action.
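
As an illustration of the manual resume step, the following is a minimal sketch rather than an official procedure. The file system name fs1, the NSD names nsd1 and nsd2, and the DRY_RUN wrapper are hypothetical and exist only for this example. The sketch assumes that mmlsdisk <file-system> -e lists the disks that are not in the ready and up state, and that mmchdisk accepts a semicolon-separated disk list with -d. By default it only prints the commands so they can be reviewed before anything is changed.

```shell
#!/bin/sh
# Sketch: resume disks that auto recovery suspended, after the owning
# node has been confirmed healthy. Names below are hypothetical.

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print commands; 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# $1 = file system name, $2 = semicolon-separated list of NSD names
resume_suspended_disks() {
    run mmlsdisk "$1" -e              # show disks that are not ready/up
    run mmchdisk "$1" resume -d "$2"  # resume the listed disks
    run mmlsdisk "$1" -e              # check the disk state afterwards
}
```

A call such as resume_suspended_disks fs1 "nsd1;nsd2" prints the three commands in order. Setting DRY_RUN=0 executes them on a node where the GPFS commands are available in the PATH.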