Recover the storage cluster from an unplanned multi-node restart or failure
The IBM Storage Scale file system service might hang if the storage nodes are restarted.
The storage nodes might get restarted unexpectedly, which can cause disks to go missing in the storage cluster. If the number of missing disks exceeds the fault tolerance of the erasure code that is used in the storage cluster, the file system might stop providing service because it cannot finish the log recovery process for the lost nodes. It might also block the restarted nodes from rejoining the cluster.
Follow these steps to recover IBM Storage Scale Erasure Code Edition (ECE) from the out-of-fault-tolerance condition.
Note: An out-of-fault-tolerance condition in the IBM Storage Scale Erasure Code Edition (ECE) environment can have different causes.
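Before you continue, you can inspect the recovery group to see how much fault tolerance remains. The following lines are a minimal sketch that reuses the commands from this topic; they assume a single recovery group named rg1, and the exact output layout depends on the installed release.

# Show the recovery group layout and the remaining fault tolerance
mmlsrecoverygroup rg1 -L

# Show the individual pdisk states to see which disks are missing
mmlsrecoverygroup rg1 -L --pdisk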
- Identify the problem.
- Run the mmfsadm dump waiters command on a node where the application is running. The output displays long waiters for I/O threads.
- Run the mmgetstate command. It shows the restarted nodes in the arbitrating state and the other nodes in the active state.
- Run the tslsrecgroup rg_1 --server --v2 command several times. The output shows that the log groups jump between nodes.
- Check the mmfs.log file on all storage nodes. The following message shows up in a loop on different nodes:
[E] Beginning to resign log group LG002 in recovery group rg1 ....
This message means that the user log group keeps resigning. See the example command sequence after this list.
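For reference, the identification checks can be run as the following command sequence from a node with administrative access. This is a sketch, not an exact transcript: it assumes the recovery group name used in this example and the default GPFS log location /var/adm/ras/mmfs.log.latest; names in your cluster might differ.

# Look for long I/O waiters on a node where the application runs
mmfsadm dump waiters

# Check the node states; restarted nodes show as arbitrating
mmgetstate -a

# Repeat a few times to see whether the log groups keep moving between servers
tslsrecgroup rg_1 --server --v2

# On each storage node, look for repeated log group resign messages
grep "Beginning to resign log group" /var/adm/ras/mmfs.log.latest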
- Start the recovery process.
- Run the mmlscluster command to find all the quorum nodes.
- Run the mmshutdown command on all quorum nodes. This action causes all nodes to unmount the file system.
- Run the mmstartup command on all quorum nodes.
- Run the mmlsrecoverygroup rg1 -L --pdisk command to check whether the recovery group is active again and that no pdisk is missing.
- Mount the file system again on all the client nodes and restart the application. See the example command sequence after this list.
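The recovery steps can be summarized as the following command sequence. This is a sketch under example assumptions: the quorum node names (node1, node2, node3), the client node names, and the file system name (fs1) are placeholders, and rg1 is the recovery group name from this topic.

# Find the quorum nodes in the cluster
mmlscluster

# Shut down GPFS on all quorum nodes; the file system is unmounted on all nodes
mmshutdown -N node1,node2,node3

# Start GPFS on the quorum nodes again
mmstartup -N node1,node2,node3

# Verify that the recovery group is active and that no pdisk is missing
mmlsrecoverygroup rg1 -L --pdisk

# Mount the file system on the client nodes and restart the application
mmmount fs1 -N client1,client2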