Failure and recovery

There are two main types of failure to consider in FPO environments:
  1. Node failures and outages: These include reboots, kernel crashes, and hangs, and can last a long time. When a node is inaccessible, all of its associated disks also become inaccessible.
  2. Disk failures: These include failed disks and hard I/O errors. They are generally triggered by hardware faults and affect specific disks.
IBM Storage Scale auto recovery is enabled by setting the restripeOnDiskFailure configuration option to yes. When this option is enabled, auto recovery uses the IBM Storage Scale event callback mechanism to trigger the necessary recovery actions. Specifically, the following system callbacks are installed when restripeOnDiskFailure=yes:
  • event = diskFailure action: /usr/lpp/mmfs/bin/mmcommon recoverFailedDisk %fsName %diskName
  • event = nodeJoin action: /usr/lpp/mmfs/bin/mmcommon restartDownDisks %myNode %clusterManager %eventNode
  • event = nodeLeave action: /usr/lpp/mmfs/bin/mmcommon stopFailedDisk %myNode %clusterManager %eventNode
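The event-to-action mapping above can be pictured as a small lookup table in which each %variable is replaced by a value at the time the event fires. The following Python sketch is illustrative only and is not how IBM Storage Scale implements callbacks internally: the expand_action function and the sample values fs1 and node1_sdb are hypothetical, and only the three command templates are taken from the list above.

```python
# Illustrative model of the three system callbacks listed above.
# This is NOT IBM code; it only shows how %variables in an action
# template are filled in when the matching event occurs.
CALLBACKS = {
    "diskFailure": "/usr/lpp/mmfs/bin/mmcommon recoverFailedDisk %fsName %diskName",
    "nodeJoin": "/usr/lpp/mmfs/bin/mmcommon restartDownDisks %myNode %clusterManager %eventNode",
    "nodeLeave": "/usr/lpp/mmfs/bin/mmcommon stopFailedDisk %myNode %clusterManager %eventNode",
}


def expand_action(event, variables):
    """Return the action command for `event` with each %name replaced by its value.

    `expand_action` is a hypothetical helper written for this example.
    """
    template = CALLBACKS[event]  # KeyError if the event has no callback
    words = []
    for word in template.split():
        if word.startswith("%"):
            # Strip the leading "%" and look up the value for this event.
            words.append(variables[word[1:]])
        else:
            words.append(word)
    return " ".join(words)


if __name__ == "__main__":
    # Hypothetical file system and disk names, used only for illustration.
    print(expand_action("diskFailure", {"fsName": "fs1", "diskName": "node1_sdb"}))
```

Reading the table this way makes the division of labor clear: a diskFailure event acts on one disk in one file system, while the nodeJoin and nodeLeave events act on all disks that belong to the node named by %eventNode.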
Important: Disable auto recovery during any planned maintenance, such as upgrading an operating system, upgrading hardware or firmware, or performing any of the following IBM Storage Scale tasks: installing a later version, deleting a node, or deleting or replacing an NSD server. Otherwise, auto recovery, which handles unexpected disk or node exceptions, can interfere with or break the maintenance process.
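One way to make sure auto recovery is switched off for the duration of a maintenance window is to wrap the maintenance steps in a small helper. The following Python sketch is a minimal example under stated assumptions, not an IBM-provided tool: the auto_recovery_disabled helper is hypothetical, it assumes that mmchconfig is installed in /usr/lpp/mmfs/bin, and it assumes that the -i option (apply the change immediately and permanently) is accepted for restripeOnDiskFailure in your release. Check the mmchconfig reference before relying on it.

```python
from contextlib import contextmanager
import subprocess

# Assumed install location, matching the mmcommon path shown earlier.
MMCHCONFIG = "/usr/lpp/mmfs/bin/mmchconfig"


def run_command(argv):
    """Run a command and raise CalledProcessError if it exits non-zero."""
    subprocess.run(argv, check=True)


@contextmanager
def auto_recovery_disabled(run=run_command):
    """Disable auto recovery around a block of maintenance steps.

    Hypothetical helper: sets restripeOnDiskFailure=no on entry and
    restores restripeOnDiskFailure=yes on exit, even if the block raises.
    `run` is injectable so the sequence can be dry-run without a cluster.
    """
    run([MMCHCONFIG, "restripeOnDiskFailure=no", "-i"])
    try:
        yield
    finally:
        run([MMCHCONFIG, "restripeOnDiskFailure=yes", "-i"])


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    with auto_recovery_disabled(run=lambda argv: print(" ".join(argv))):
        print("... planned maintenance steps go here ...")
```

The sketch re-enables auto recovery unconditionally when the block exits. If a maintenance step fails partway through, you might prefer to leave auto recovery disabled and re-enable it manually after confirming that all nodes and disks are back in the expected state.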