Disk media failure

Procedures for recovering lost data after a disk media failure.

Whether you protect your data against media failures with additional hardware or with replication, first determine whether the disk itself has completely failed. If the disk has completely failed, and the fault is not merely in the path to the disk, follow the procedures defined by your disk vendor. Otherwise:
  1. Check the state of the disks in the file system:
    mmlsdisk fs1 -e
    GPFS marks a disk down if there have been problems accessing it.
  2. To prevent any I/O from going to the down disk, issue these commands immediately:
    mmchdisk fs1 suspend -d gpfs1nsd
    mmchdisk fs1 stop -d gpfs1nsd
    Note: If any GPFS file systems have pending I/O to the down disk, the I/O will time out if the system administrator does not stop it.
    To see whether any threads have been waiting a long time for I/O to complete, issue this command on all nodes:
    mmfsadm dump waiters 10 | grep "I/O completion"
  3. This step is irreversible! Do not run the mmdeldisk command shown below unless data and metadata have been replicated. The command scans file system metadata for disk addresses belonging to the disk in question and replaces them with a special broken disk address value. The scan might take a while.
    CAUTION:
    Be extremely careful when using the -p option of mmdeldisk, because by design it destroys references to data blocks, making the affected blocks unavailable. This is a last-resort tool for salvaging the remaining data when data loss might have already occurred, which means that it cannot take any precautions. If you are not absolutely certain about the state of the file system and the impact of running this command, do not attempt to run it without first contacting the IBM® Support Center.
    mmdeldisk fs1 gpfs1nsd -p
  4. Invoke the mmfileid command with the :BROKEN operand to list the files that have broken disk addresses:
    mmfileid fs1 -d :BROKEN

    For more information, see the description of the mmfileid command.

  5. After the disk is properly repaired and available for use, you can add it back to the file system.
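
Steps 1 and 2 of the procedure above can be sketched as a pair of small shell helpers. This is only an illustration, not an IBM-supplied tool. The function names down_disks and isolate_cmds are hypothetical, and the parser assumes that mmlsdisk prints whitespace-separated columns with the disk name first, the availability second to last, and the storage pool last. Check that assumption against the output on your own cluster before relying on it. The sketch only prints the mmchdisk commands rather than running them, and it deliberately leaves the irreversible mmdeldisk step to the administrator.

```shell
#!/bin/sh
# Sketch only: helpers for steps 1 and 2. Nothing here runs mmdeldisk.

# down_disks (hypothetical name): read mmlsdisk-style output on stdin and
# print the name of every disk whose availability is "down".
# ASSUMPTION: columns are whitespace-separated, the disk name is the first
# column, availability is the second-to-last column, and the storage pool
# is the last column. Header and separator lines never match "down".
down_disks() {
    awk 'NF >= 2 && $(NF - 1) == "down" { print $1 }'
}

# isolate_cmds (hypothetical name): print, but do not execute, the two
# step 2 commands for one file system and one disk.
isolate_cmds() {
    fs=$1
    disk=$2
    printf 'mmchdisk %s suspend -d %s\n' "$fs" "$disk"
    printf 'mmchdisk %s stop -d %s\n' "$fs" "$disk"
}

# Possible wiring, left commented out because it needs a live GPFS cluster:
# mmlsdisk fs1 -e | down_disks | while read -r d; do isolate_cmds fs1 "$d"; done
```

Printing the commands instead of executing them lets the administrator review exactly which disks would be suspended and stopped before any I/O is affected.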