Starting the disk failure recovery

This topic lists the steps to start the disk failure recovery.

  1. To check the disks that are not in the Ready state, run the mmlsdisk <fsName> -e command.
  2. Resume the disks that have the Suspended status.
    If there are multiple suspended disks, create a file that lists all the suspended disks, one NSD name per line, before you resume the disks. Resume the suspended disks by issuing the following command:
    mmchdisk <fsName> resume -d <suspendDiskList>

    Check the disk status again by running the mmlsdisk <fsName> -e command and confirm that all disks are now in the Ready state. If some disks are still in the Suspended state, there might be a hardware media problem or a connection problem. Save the names of these disks in the brokenDiskList file.

  3. Save the names of the disks that do not have the Up availability in the downDiskList file, one disk name per line. Start these disks by running the following command:
    mmchdisk <fsName> start -d <downDiskList>
    Check the disk status by running the mmlsdisk <fsName> -e command and confirm that all disks have the Up availability. Disks that still do not have the Up availability might have a hardware media problem or a connection problem. Save the names of these disks in the tobeSuspendDiskList file. Suspend these disks by running the following command:
    mmchdisk <fsName> suspend -d <tobeSuspendDiskList>
  4. If a disk is in the Suspended status after you restart it, there might be a hardware media problem or a connection problem. To keep your data safe, migrate it off the suspended disks by running the following command:
    mmrestripefs <fsName> -r

    After the file system restripe completes, all data on the suspended disks has been migrated to other disks. At this point, all the data in the file system has the desired level of protection.

  5. Check the disk connections and the disk media for disks that are in the Suspended state and repeat step 2 through step 4. If a failure occurs again, delete these disks from the file system by running the mmdeldisk command.
    For example:
    mmdeldisk <fsName> "broken-disk1;broken-disk2"

    If a file has a replica on down disks and you make an update that affects that replica, the inode is marked with the dataupdatemiss or metaupdatemiss flag: metaupdatemiss if the replica on the down disks is metadata, dataupdatemiss if it is data. To check these flags, run mmlsattr -d -L /path/to/file. These two flags can be cleared only by running mmchdisk Device start. If some down disks cannot be brought back with mmchdisk Device start, the flags remain even if you run mmdeldisk or mmrestripefs -r. To remove the flags in that case, stop one NSD disk and then immediately bring it back by running mmchdisk start. This clears all the dataupdatemiss and metaupdatemiss flags.

    If you are unable to delete a broken disk, contact IBM® support.
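
The disk lists used in steps 2 and 3 can be built with a few small shell helpers. The following is a minimal sketch, not part of the product: the function names disks_with_status, disks_not_up, and join_disks are hypothetical, and the parsing assumes that mmlsdisk prints the disk name in column 1, the driver type (nsd) in column 2, the status in column 7, and the availability in column 8. Verify that layout against the output on your own cluster before you rely on it. The join_disks helper produces the semicolon-separated form that is shown in the mmdeldisk example.

```shell
#!/usr/bin/env bash
# Sketch only: helpers for building the disk lists used in the recovery steps.
# ASSUMPTION: mmlsdisk output columns are
#   1=disk name, 2=driver type (nsd), 7=status, 8=availability.

# Read mmlsdisk output on stdin and print the names of the disks whose
# status column equals $1 (for example "suspended"), one name per line.
disks_with_status() {
    awk -v want="$1" '$2 == "nsd" && $7 == want { print $1 }'
}

# Read mmlsdisk output on stdin and print the names of the disks whose
# availability column is anything other than "up", one name per line.
disks_not_up() {
    awk '$2 == "nsd" && $8 != "up" { print $1 }'
}

# Join the disk names in file $1 (one per line) into a single
# semicolon-separated string such as "disk1;disk2".
join_disks() {
    paste -sd';' "$1"
}

# Example usage (not run here; <fsName> is a placeholder):
#   mmlsdisk <fsName> -e | disks_with_status suspended > suspendDiskList
#   mmlsdisk <fsName> -e | disks_not_up > downDiskList
#   join_disks suspendDiskList
```

Filtering on the driver type column skips the header and separator lines without depending on a fixed number of header rows.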