Recovering from a single quorum or non-quorum node failure

A quorum node failure can happen for various reasons. For example, a node failure might occur when the local hard disk on the node fails and must be replaced. The old content of the /var/mmfs directory is lost after you replace the disk and reinstall the operating system and other software, including the IBM Storage Scale software stack.

Note: The information given in this topic can also be used for recovering from a non-quorum node failure.
The recovery procedure for this case works only if the cluster still has enough quorum nodes available, which you can check by running the mmgetstate -a command on one of the remaining quorum nodes. It is assumed that the node to be recovered is configured with the same IP address as before and that the contents of its /etc/hosts file are consistent with the other remaining quorum nodes, as shown in the following example (a sketch of matching /etc/hosts entries follows the cluster listing):
# mmlscluster

GPFS cluster information
========================
  GPFS cluster name:         gpfs-cluster-2.localnet.com
  GPFS cluster id:           13445038716777501310
  GPFS UID domain:           gpfs-cluster-2.localnet.com
  Remote shell command:      /usr/bin/ssh
  Remote file copy command:  /usr/bin/scp
  Repository type:           CCR

 Node  Daemon node name      IP address   Admin node name       Designation
----------------------------------------------------------------------------
   1   node-21.localnet.com  10.0.100.21  node-21.localnet.com  quorum-manager
   2   node-22.localnet.com  10.0.100.22  node-22.localnet.com  quorum-manager
   3   node-23.localnet.com  10.0.100.23  node-23.localnet.com  quorum-manager
   4   node-24.localnet.com  10.0.100.24  node-24.localnet.com
   5   node-25.localnet.com  10.0.100.25  node-25.localnet.com

To simulate this case, the entire content of the /var/mmfs directory is deleted on node node-23, as sketched below.
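The following is a minimal sketch of the simulation step, assuming a test node on which losing the local IBM Storage Scale configuration is acceptable (do not run this on a production node):
# rm -rf /var/mmfs/*
After the deletion, the mmgetstate command on the node to be recovered returns the following output: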
# mmgetstate
mmgetstate: This node does not belong to a GPFS cluster.
mmgetstate: Command failed. Examine previous error messages to determine cause.
The cluster still has quorum:
# mmgetstate -a

 Node number  Node name  GPFS state
-------------------------------------
           1  node-21    active
           2  node-22    active
           3  node-23    unknown
           4  node-24    active
           5  node-25    active
Run the mmccr check command on the node to be recovered as shown in the following example:
# mmccr check -Ye
mmccr::HEADER:version:reserved:reserved:NodeId:CheckMnemonic:ErrorCode:ErrorMsg:ListOfFailedEntities:ListOfSucceedEntities:Severity:
mmccr::0:1:::-1:CCR_CLIENT_INIT:811:CCR directory or subdirectory missing:/var/mmfs/ccr:Security:FATAL:
mmccr::0:1:::-1:FC_CCR_AUTH_KEYS:813:File does not exist:/var/mmfs/ssl/authorized_ccr_keys::FATAL:
mmccr::0:1:::-1:FC_CCR_PAXOS_CACHED:811:CCR directory or subdirectory missing:/var/mmfs/ccr/cached::WARNING:
mmccr::0:1:::-1:PC_QUORUM_NODES:812:ccr.nodes file missing or empty:/var/mmfs/ccr/ccr.nodes::FATAL:
mmccr::0:1:::-1:FC_COMMITTED_DIR:812:ccr.nodes file missing or empty:/var/mmfs/ccr/ccr.nodes::FATAL:
In this case, you can recover the node by using the mmsdrrestore command with the -p option. The -p option must specify a healthy quorum node from which the necessary files can be transferred. The mmsdrrestore command must be run on the node to be recovered, as shown in the following example:
# mmsdrrestore -p node-21
genkeyData1                                                                                                                                                                                                100% 3529     1.8MB/s   00:00
genkeyData2                                                                                                                                                                                                100% 3529     2.8MB/s   00:00
Wed Jul  7 14:42:16 CEST 2021: mmsdrrestore: Processing node node-23.localnet.com
mmsdrrestore: Node node-23.localnet.com successfully restored.
Immediately after the mmsdrrestore command completes, the mmgetstate command still reports that GPFS is down on the recovered node. You can now start GPFS on this node, as sketched below.
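The following is a minimal sketch of starting the daemon locally, assuming GPFS is started on the recovered node with the mmstartup command:
# mmstartup
The mmgetstate command then shows GPFS as active, as shown in the following example: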
# mmgetstate

 Node number  Node name  GPFS state
-------------------------------------
           3  node-23    active
The output of the mmccr check command on the recovered node shows a healthy status, as shown in the following example:
# mmccr check -Ye
mmccr::HEADER:version:reserved:reserved:NodeId:CheckMnemonic:ErrorCode:ErrorMsg:ListOfFailedEntities:ListOfSucceedEntities:Severity:
mmccr::0:1:::3:CCR_CLIENT_INIT:0:::/var/mmfs/ccr,/var/mmfs/ccr/committed,/var/mmfs/ccr/ccr.nodes,Security:OK:
mmccr::0:1:::3:FC_CCR_AUTH_KEYS:0:::/var/mmfs/ssl/authorized_ccr_keys:OK:
mmccr::0:1:::3:FC_CCR_PAXOS_CACHED:0:::/var/mmfs/ccr/cached,/var/mmfs/ccr/cached/ccr.paxos:OK:
mmccr::0:1:::3:FC_CCR_PAXOS_12:0:::/var/mmfs/ccr/ccr.paxos.1,/var/mmfs/ccr/ccr.paxos.2:OK:
mmccr::0:1:::3:PC_LOCAL_SERVER:0:::node-23.localnet.com:OK:
mmccr::0:1:::3:PC_IP_ADDR_LOOKUP:0:::node-23.localnet.com,0.000:OK:
mmccr::0:1:::3:PC_QUORUM_NODES:0:::10.0.100.21,10.0.100.22,10.0.100.23:OK:
mmccr::0:1:::3:FC_COMMITTED_DIR:0::0:7:OK:
mmccr::0:1:::3:TC_TIEBREAKER_DISKS:0::::OK:
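As a final check, you can run the mmgetstate -a command again from any remaining quorum node; all nodes, including node-23, should now report an active GPFS state:
# mmgetstate -a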