Multiple nodes failure without SGPanic
This topic lists the steps to handle the failure of multiple nodes without SGPanic.
- Recover the failed nodes.
- If all nodes are recovered quickly, run the mmlsdisk <fs-name> -e command to view the down disk list.
- Run the mmlsnsd -X command to check whether any disks are not detected by the operating system of their server nodes. For example,
# mmlsnsd -X

 Disk name    NSD volume ID     Device    Devtype  Node name                  Remarks
 ---------------------------------------------------------------------------------------------------
 mucxs131d01  AC170E46561E7A8F  /dev/sdb  generic  mucxs131.muc.infineon.com  server node
 mucxs131d02  AC170E46561E7A90  /dev/sdc  generic  mucxs131.muc.infineon.com  server node
 mucxs531d07  AC170E4B5612838E  /dev/sdh  generic  mucxs531.muc.infineon.com  server node
 mucxs531d08  AC170E4B56128391  -         -        mucxs531.muc.infineon.com  (not found) server node
In the output above, the "(not found)" remark means that the physical disk for the NSD mucxs531d08 is not recognized by the operating system. If a disk is not detected, check the corresponding node to see whether the disk is physically broken. If the undetected disks cannot be recovered quickly, remove them from the down disk list.
- Run the mmchdisk <fs-name> start -d <down disks from step 3> command. If it succeeds, go to step 5; if not, open a PMR for the issue.
- If the undetected disks cannot be recovered, run the mmrestripefs <fs-name> -r command to repair the replication of data that has some of its replicas on these undetected disks.
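
The filtering in steps 3 and 4 can be sketched as a small helper script. This is only an illustration, not part of the product: the function names, the file system name gpfs1, and the example down disk list are hypothetical, and the parser assumes that every data row of the mmlsnsd -X output starts with a disk name followed by a 16-digit hexadecimal NSD volume ID. The script only builds the mmchdisk command string; it does not run any command.

```python
import string

# Output of "mmlsnsd -X" as shown in the example above.
SAMPLE_OUTPUT = """\
 Disk name    NSD volume ID     Device    Devtype  Node name                  Remarks
 ---------------------------------------------------------------------------------------------------
 mucxs131d01  AC170E46561E7A8F  /dev/sdb  generic  mucxs131.muc.infineon.com  server node
 mucxs131d02  AC170E46561E7A90  /dev/sdc  generic  mucxs131.muc.infineon.com  server node
 mucxs531d07  AC170E4B5612838E  /dev/sdh  generic  mucxs531.muc.infineon.com  server node
 mucxs531d08  AC170E4B56128391  -         -        mucxs531.muc.infineon.com  (not found) server node
"""


def is_volume_id(token):
    """Assumption: an NSD volume ID is exactly 16 hexadecimal digits."""
    return len(token) == 16 and all(c in string.hexdigits for c in token)


def undetected_disks(mmlsnsd_output):
    """Return the disk names whose row carries the "(not found)" remark."""
    disks = []
    for line in mmlsnsd_output.splitlines():
        fields = line.split()
        # Skip the header, the separator line, and blank lines.
        if len(fields) < 2 or not is_volume_id(fields[1]):
            continue
        if "(not found)" in line:
            disks.append(fields[0])
    return disks


def disks_to_start(down_disks, undetected):
    """Remove the undetected disks from the down disk list (step 3)."""
    missing = set(undetected)
    return [disk for disk in down_disks if disk not in missing]


def mmchdisk_command(fs_name, disks):
    """Build the step 4 command; multiple disk names are joined with ';'."""
    disk_list = ";".join(disks)
    return 'mmchdisk {} start -d "{}"'.format(fs_name, disk_list)


if __name__ == "__main__":
    # Hypothetical down disk list, as it might be read from "mmlsdisk <fs-name> -e".
    down = ["mucxs531d07", "mucxs531d08"]
    startable = disks_to_start(down, undetected_disks(SAMPLE_OUTPUT))
    print(mmchdisk_command("gpfs1", startable))
```

Printing the command instead of running it leaves the decision to start the disks with the administrator, who can first confirm that each excluded disk really cannot be recovered quickly.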