Multiple nodes failure without SGPanic

This topic lists the steps to handle the failure of multiple nodes without SGPanic.

  1. Recover the failed nodes.
  2. If all nodes are recovered quickly, run the mmlsdisk <fs-name> -e command to view the down disk list.
  3. Run the mmlsnsd -X command to check whether any disks are not detected by the operating system of their nodes.
    For example:
    # mmlsnsd -X
     
    Disk name    NSD volume ID      Device    Devtype  Node name                Remarks
    ---------------------------------------------------------------------------------------------------
    mucxs131d01  AC170E46561E7A8F   /dev/sdb  generic  mucxs131.muc.infineon.com server node
    mucxs131d02  AC170E46561E7A90   /dev/sdc  generic  mucxs131.muc.infineon.com server node
    mucxs531d07  AC170E4B5612838E   /dev/sdh  generic  mucxs531.muc.infineon.com server node
    mucxs531d08  AC170E4B56128391   -         -        mucxs531.muc.infineon.com (not found) server node

    In the output above, the (not found) remark means that the operating system does not recognize the physical disk for the NSD mucxs531d08. If a disk is not detected, check the corresponding node to see whether the disk is physically broken. If the undetected disks cannot be recovered quickly, remove them from the down disk list.

  4. Run the mmchdisk <fs-name> start -d <down disks from step 3> command.
    If the command succeeds, go to step 5. If it fails, open a PMR for the issue.
  5. If the undetected disks cannot be recovered, run the mmrestripefs <fs-name> -r command to restore the replication of data that has replicas on these undetected disks.
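
The check in step 3 can be scripted when the NSD list is long. The following is a minimal sketch, not part of the product: the function name undetected_nsds is hypothetical, and it assumes that undetected disks are marked with the text (not found) as in the example output in step 3. It runs against two sample rows copied from that output, so it can be tried without a cluster.

```shell
#!/bin/sh
# Hypothetical helper: read mmlsnsd -X style output on stdin and print
# the NSD name (first column) of every row marked "(not found)".
undetected_nsds() {
    awk '/\(not found\)/ { print $1 }'
}

# Sample rows copied from the example output in step 3.
sample='mucxs131d01  AC170E46561E7A8F   /dev/sdb  generic  mucxs131.muc.infineon.com server node
mucxs531d08  AC170E4B56128391   -         -        mucxs531.muc.infineon.com (not found) server node'

printf '%s\n' "$sample" | undetected_nsds
# prints: mucxs531d08
```

On a live cluster the same filter would presumably be fed from the real command, for example mmlsnsd -X | undetected_nsds. The names it prints are the disks to leave out of the down disk list that is passed to mmchdisk in step 4.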