diskFailure Event

This event is triggered when a disk I/O operation fails. Upon I/O failure, IBM Storage Scale changes the disk state from ready/up to ready/down. The I/O failure can be caused by a node failure, which makes all disks attached to that node unavailable, or by a failure of the disk itself.

The disk state is ready/down.
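For reference, the disk status and availability can be inspected with the mmlsdisk command. The following is only a sketch; gpfs0 is a placeholder file system name:

  # List all disks of the file system, including the status and availability columns
  mmlsdisk gpfs0
  # A disk affected by this event shows status "ready" and availability "down"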

Recovery process

  1. Perform basic checks, for example, that the storage pool is an FPO pool and that the replication factor is greater than 1.
  2. Check maxDownDisksForRecovery (default 16) and maxFailedNodesForRecovery (default 3). Abort the recovery if either limit is exceeded. These limits can be changed through the corresponding configuration parameters (see the tuning sketch after this list).
  3. If the number of failed failure groups (FGs) is less than 2, wait until dataDiskWaitTimeForRecovery (default 3600/2400) expires; otherwise, wait only minDiskWaitTimeForRecovery (default 1800 sec) to expedite recovery because of the increased risk.
  4. If fewer than three FGs are available for metadata, take no action, because recovery cannot be performed during the metadata outage.
  5. After the recovery wait period has passed, recheck the node and disk availability status to confirm that recovery actions are still needed.
  6. Suspend all the failed and unavailable disks by running the tschdisk suspend command.
  7. Restripe the data. If a previous restripe process is still running, stop it and start a new one.
  8. On successful completion, the disks are in the suspended/down state, or in suspended/up if the node recovered during the restripe.
    Note: If the file system version is 5.0.2 or later, the disks that were suspended by auto recovery are resumed automatically when the node with the suspended or to-be-emptied disks rejoins the cluster. If the file system version is earlier than 5.0.2, the cluster administrator has to manually run mmchdisk fs-name resume -a to resume the disks (see the resume sketch after this list).
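The thresholds and wait times used in steps 2 and 3 (maxDownDisksForRecovery, maxFailedNodesForRecovery, dataDiskWaitTimeForRecovery, minDiskWaitTimeForRecovery) are cluster configuration parameters and can be adjusted with the mmchconfig command. The following sketch is illustrative only; the values shown are assumptions, not recommendations:

  # Display recovery-related parameters that have been changed from their defaults
  # (parameters still at their default values might not be listed by mmlsconfig)
  mmlsconfig | grep -iE "ForRecovery"

  # Example values only: raise the down-disk limit and shorten the data-disk wait time
  mmchconfig maxDownDisksForRecovery=32,dataDiskWaitTimeForRecovery=1800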
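For file system versions earlier than 5.0.2, the manual resume described in the note might look like the following sketch, where gpfs0 is a placeholder file system name:

  # Resume all suspended disks in the file system
  mmchdisk gpfs0 resume -a

  # Confirm that the suspended state has been cleared
  mmlsdisk gpfs0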