Auto recovery
Storage pools built on internal disks, whether FPO-enabled or FPO-disabled, are subject to frequent node and disk failures because of the commodity hardware used in IBM Storage Scale clusters.
The IBM Storage Scale auto recovery feature is designed to handle random but routine node and disk failures without manual intervention. However, auto recovery cannot cover every catastrophic outage that involves a large number of nodes and disks at once. In such cases, the administrator must assess the situation and use judgment to determine the appropriate cluster recovery action.
Note:
The following are some important recommendations:
- In IBM Storage Scale 5.0.2 and later, the suspended disks are resumed automatically when the node rejoins the cluster. In versions earlier than 5.0.2, the system administrator must issue a command like the following one to resume the disks:
mmchdisk <FileSystem> resume -a
- If an extended outage (days or weeks) is expected for a node, remove that node and all of its associated disks from the cluster so that the outage does not affect subsequent recovery actions.
- If the failed disk is a metadata disk, GPFS tries to suspend it during auto recovery by using the mmchdisk <file-system> command. If the number of remaining failure groups of metadata or data disks is less than the replication value (-m or -r, respectively), the mmchdisk <file-system> suspend operation fails, and auto recovery takes no further action.
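
The failure-group condition in the last recommendation can be expressed as a simple comparison. The following Python sketch only models the rule as stated above; it is not how GPFS implements the check, and the function and parameter names are hypothetical. It assumes that you already know how many metadata and data failure groups remain available and what the -m and -r values of the file system are.

```python
def suspend_can_succeed(remaining_failure_groups: int, replicas: int) -> bool:
    """Model of the stated rule: the suspend fails when fewer failure
    groups remain than the replication value requires."""
    return remaining_failure_groups >= replicas


def auto_recovery_can_proceed(meta_groups: int, data_groups: int,
                              m: int, r: int) -> bool:
    """Hypothetical helper: True only if both the metadata (-m) and the
    data (-r) replication values can still be satisfied."""
    return (suspend_can_succeed(meta_groups, m)
            and suspend_can_succeed(data_groups, r))


# Example: metadata replication is 3, but only 2 metadata failure groups
# remain after the disk failure, so the suspend is expected to fail.
print(auto_recovery_can_proceed(meta_groups=2, data_groups=3, m=3, r=3))
```

If the sketch returns False for your configuration, the rule above suggests that auto recovery would stop after the failed suspend, and administrator action would be needed.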