Failure and recovery
There are two main types of failure to consider in FPO environments.
- Node failures and outages: These include reboots, kernel crashes, and hangs, and they can last a long time. When a node is inaccessible, all the associated disks also become inaccessible.
- Disk failures: These include failed disks and hard I/O errors. They are generally triggered by hardware failures and affect specific disks.
IBM Storage Scale recovery actions are enabled by setting the restripeOnDiskFailure configuration option to yes. When this option is enabled, auto recovery uses the IBM Storage Scale event callback mechanism to trigger the necessary recovery actions. Specifically, the following system callbacks are installed when restripeOnDiskFailure=yes:
- event = diskFailure action: /usr/lpp/mmfs/bin/mmcommon recoverFailedDisk %fsName %diskName
- event = nodeJoin action: /usr/lpp/mmfs/bin/mmcommon restartDownDisks %myNode %clusterManager %eventNode
- event = nodeLeave action: /usr/lpp/mmfs/bin/mmcommon stopFailedDisk %myNode %clusterManager %eventNode
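To make the callback list above concrete, the following Python sketch shows how an action template with %-style parameters could be expanded into a command line once an event fires. It is a toy model for reading the list only, not how IBM Storage Scale dispatches callbacks internally. The function name expand_callback and the sample values (fs1, nsd3) are hypothetical.

```python
# Action templates copied from the callback list above.
# The expansion logic is an illustrative assumption, not product code.
CALLBACKS = {
    "diskFailure": "/usr/lpp/mmfs/bin/mmcommon recoverFailedDisk %fsName %diskName",
    "nodeJoin": "/usr/lpp/mmfs/bin/mmcommon restartDownDisks %myNode %clusterManager %eventNode",
    "nodeLeave": "/usr/lpp/mmfs/bin/mmcommon stopFailedDisk %myNode %clusterManager %eventNode",
}


def expand_callback(event, values):
    """Return the argument list for `event`, with each %name replaced.

    `values` maps parameter names (without the leading %) to strings.
    A missing parameter raises KeyError instead of being passed through.
    """
    template = CALLBACKS[event]
    args = []
    for word in template.split():
        if word.startswith("%"):
            args.append(values[word[1:]])
        else:
            args.append(word)
    return args


if __name__ == "__main__":
    # Hypothetical file system and disk names, for illustration only.
    print(expand_callback("diskFailure", {"fsName": "fs1", "diskName": "nsd3"}))
```

In this model a diskFailure event carries the file system and disk name, while nodeJoin and nodeLeave carry the local node, the cluster manager, and the node that triggered the event.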
Important: Disable auto recovery during any planned maintenance, such as upgrading an operating system, upgrading hardware or firmware, or doing any of the following tasks for IBM Storage Scale: installing a later version, deleting a node, or deleting or replacing an NSD server. Otherwise auto recovery, which handles unexpected disk or node exceptions, can interfere with or break the maintenance process.
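One way to keep the disable and re-enable steps paired is to wrap the maintenance work in a small helper. The Python sketch below assumes that auto recovery is toggled with mmchconfig restripeOnDiskFailure=yes|no and that the -i flag (apply the change immediately and persistently) is appropriate for your release; verify both against the mmchconfig reference for your version. The helper names set_auto_recovery and auto_recovery_disabled are hypothetical, and the sketch assumes auto recovery was enabled before the maintenance window started.

```python
import subprocess
from contextlib import contextmanager

# Assumed install path, matching the mmcommon path in the callback list.
MMCHCONFIG = "/usr/lpp/mmfs/bin/mmchconfig"


def set_auto_recovery(enabled, run=subprocess.run):
    """Set restripeOnDiskFailure to yes or no and return the command used.

    `run` is injectable so the helper can be exercised without a cluster.
    """
    value = "yes" if enabled else "no"
    cmd = [MMCHCONFIG, f"restripeOnDiskFailure={value}", "-i"]
    run(cmd, check=True)  # with subprocess.run, raises if mmchconfig fails
    return cmd


@contextmanager
def auto_recovery_disabled(run=subprocess.run):
    """Disable auto recovery for the duration of a `with` block."""
    set_auto_recovery(False, run=run)
    try:
        yield
    finally:
        # Re-enable even if the maintenance step raised an exception.
        set_auto_recovery(True, run=run)
```

A usage sketch would be `with auto_recovery_disabled(): upgrade_firmware()`, where upgrade_firmware stands for your own maintenance step. Re-enabling in the finally clause is a design choice: if a failed maintenance step leaves disks or nodes in an intermediate state, you may prefer to keep auto recovery off and re-enable it manually after checking the cluster.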