Handling node crashes

If a failed node cannot be recovered, auto recovery migrates all data from the disks in that node to other disks in the cluster. If the system does not recover, delete the disks in the node and then the node itself.

  1. Log in to another cluster node and run the mmlsdisk <fsName> -M command to get a list of the disks attached to the failed node. Save the disk list in the diskList file, with one disk name per line.
  2. Run the mmdeldisk <fsName> -F <diskList> command to delete the disks attached to the failed node.
  3. Run the mmdelnsd -F <diskList> command to delete the NSDs attached to the failed node. Then run the mmdelnode command to remove the node or, if you are replacing the node with new hardware, reuse the same name and IP address and continue.

    To replace the failed node with a new node, start the replacement node with the hostname and the IP address of the failed node. Install the IBM Storage Scale packages and configure SSH authorization with the other nodes in the cluster. Run the following command to restore the IBM Storage Scale configuration on the replacement node:

    mmsdrrestore -p <cluster manager> -R <remoteFileCopyCommand> -N <replacement node>

    Use the mmlsmgr command to identify the cluster manager node. For <remoteFileCopyCommand>, use the remote file copy command that is configured for the cluster.

  4. Start IBM Storage Scale on the replacement node by running the mmstartup -N <replacement node> command. Confirm that the IBM Storage Scale state is active by running the mmgetstate -N <replacement node> command.
  5. Prepare a stanza file, create the NSDs by running the mmcrnsd command, and add these disks to the file system by running the mmadddisk command.
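
The steps above can be sketched as a small shell helper. This is an illustration under stated assumptions, not a supported tool. The function names (run, disks_on_node, remove_failed_node, restore_replacement_node, readd_disks) are hypothetical. The filter assumes that the first column of the mmlsdisk -M output is the disk name and the second column is the node that performs I/O, so check this against the output on your release. The sketch also assumes the mmdelnode -N <node> form for removing a single node. By default it only prints the commands so that the sequence can be reviewed before anything is run.

```shell
#!/bin/sh
# Sketch only. Prints each command unless DRY_RUN=0 is set.
# All function names below are hypothetical helpers, not product commands.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 1: read `mmlsdisk <fsName> -M` output on stdin and print the disks
# served by the given node, one name per line.
# ASSUMPTION: column 1 = disk name, column 2 = node that performs I/O.
disks_on_node() {
    awk -v node="$1" '$2 == node { print $1 }'
}

# Steps 2 and 3 (removal path): delete the disks, the NSDs, then the node.
remove_failed_node() {
    fs_name="$1"; failed_node="$2"; disk_list="$3"
    run mmdeldisk "$fs_name" -F "$disk_list"
    run mmdelnsd -F "$disk_list"
    run mmdelnode -N "$failed_node"
}

# Steps 3 and 4 (replacement path): restore the configuration, start the
# node, and report its state. Run this after the disks and NSDs are deleted.
restore_replacement_node() {
    cluster_mgr="$1"; rcp_cmd="$2"; new_node="$3"
    run mmsdrrestore -p "$cluster_mgr" -R "$rcp_cmd" -N "$new_node"
    run mmstartup -N "$new_node"
    run mmgetstate -N "$new_node"
}

# Step 5: create the NSDs from a stanza file and add them to the file system.
readd_disks() {
    fs_name="$1"; stanza_file="$2"
    run mmcrnsd -F "$stanza_file"
    run mmadddisk "$fs_name" -F "$stanza_file"
}

# Example use (fs1 and node3 are placeholder names):
#   mmlsdisk fs1 -M | disks_on_node node3 > diskList
#   remove_failed_node fs1 node3 diskList
```

Keeping dry-run as the default is a deliberate choice here: mmdeldisk and mmdelnsd are destructive, so it helps to review the printed commands and the contents of diskList before setting DRY_RUN=0.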