Recovering from a single quorum or non-quorum node failure
A quorum node failure can happen for various reasons. For example, a node failure might occur when the local hard disk on the node fails and must be replaced. The old content of the /var/mmfs directory is lost after you replace the disk and reinstall the operating system and other software, including the IBM Storage Scale software stack.
Note: The information given in this topic can also be used for recovering from a non-quorum node
failure.
The recovery procedure for this case works only if the cluster still has enough quorum nodes available, which can be checked by running the mmgetstate -a command on one of the remaining quorum nodes. It is assumed that the node to be recovered is configured with the same IP address as before and that the contents of the /etc/hosts file are consistent with the other remaining quorum nodes, as shown in the following example:
# mmlscluster
GPFS cluster information
========================
GPFS cluster name: gpfs-cluster-2.localnet.com
GPFS cluster id: 13445038716777501310
GPFS UID domain: gpfs-cluster-2.localnet.com
Remote shell command: /usr/bin/ssh
Remote file copy command: /usr/bin/scp
Repository type: CCR
Node Daemon node name IP address Admin node name Designation
----------------------------------------------------------------------------
1 node-21.localnet.com 10.0.100.21 node-21.localnet.com quorum-manager
2 node-22.localnet.com 10.0.100.22 node-22.localnet.com quorum-manager
3 node-23.localnet.com 10.0.100.23 node-23.localnet.com quorum-manager
4 node-24.localnet.com 10.0.100.24 node-24.localnet.com
5 node-25.localnet.com 10.0.100.25 node-25.localnet.com
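One way to confirm the second assumption is to compare the /etc/hosts file of the node to be recovered with the copy on a remaining quorum node. The following is a minimal sketch that assumes node-21 is a healthy quorum node and that passwordless root scp from the node to be recovered to node-21 is already set up:
# scp node-21:/etc/hosts /tmp/hosts.node-21
# diff /etc/hosts /tmp/hosts.node-21
If diff reports no differences for the cluster node entries, the assumption is satisfied.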
To simulate this case, the entire content of the /var/mmfs directory is deleted on node node-23. The mmgetstate command on the node to be recovered then returns the following output:
# mmgetstate
mmgetstate: This node does not belong to a GPFS cluster.
mmgetstate: Command failed. Examine previous error messages to determine cause.
The cluster still has quorum:
# mmgetstate -a
Node number Node name GPFS state
-------------------------------------
1 node-21 active
2 node-22 active
3 node-23 unknown
4 node-24 active
5 node-25 active
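If you prefer the quorum state to be reported explicitly rather than inferred from the per-node states, mmgetstate can print more detail. The following is a minimal sketch; the -L option adds a quorum column and the -s option appends summary information, including the number of quorum nodes that are defined and active:
# mmgetstate -aLs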
Run the mmccr check command on the node to be recovered as shown in the
following example:
# mmccr check -Ye
mmccr::HEADER:version:reserved:reserved:NodeId:CheckMnemonic:ErrorCode:ErrorMsg:ListOfFailedEntities:ListOfSucceedEntities:Severity:
mmccr::0:1:::-1:CCR_CLIENT_INIT:811:CCR directory or subdirectory missing:/var/mmfs/ccr:Security:FATAL:
mmccr::0:1:::-1:FC_CCR_AUTH_KEYS:813:File does not exist:/var/mmfs/ssl/authorized_ccr_keys::FATAL:
mmccr::0:1:::-1:FC_CCR_PAXOS_CACHED:811:CCR directory or subdirectory missing:/var/mmfs/ccr/cached::WARNING:
mmccr::0:1:::-1:PC_QUORUM_NODES:812:ccr.nodes file missing or empty:/var/mmfs/ccr/ccr.nodes::FATAL:
mmccr::0:1:::-1:FC_COMMITTED_DIR:812:ccr.nodes file missing or empty:/var/mmfs/ccr/ccr.nodes::FATAL:
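Because the -Ye output is colon-delimited, it can be filtered with standard tools. For example, the following sketch counts the FATAL findings; on a healthy node, where all lines end with :OK:, it returns 0:
# mmccr check -Ye | grep -c ':FATAL:'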
In this case, you can recover this node by using the mmsdrrestore command with the -p option. The -p option must specify a healthy quorum node from which the necessary files can be transferred. The mmsdrrestore command must be run on the node to be recovered, as shown in the following example:
# mmsdrrestore -p node-21
genkeyData1 100% 3529 1.8MB/s 00:00
genkeyData2 100% 3529 2.8MB/s 00:00
Wed Jul 7 14:42:16 CEST 2021: mmsdrrestore: Processing node node-23.localnet.com
mmsdrrestore: Node node-23.localnet.com successfully restored.
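Before you start GPFS, you can optionally confirm that the files that mmccr check reported as missing are back in place, for example:
# ls -l /var/mmfs/ccr/ccr.nodes /var/mmfs/ssl/authorized_ccr_keys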
Immediately after the mmsdrrestore command completes, the mmgetstate command still reports that GPFS is down. However, you can now start GPFS on the recovered node. The mmgetstate command then shows GPFS as active, as shown in the following example:
# mmgetstate
Node number Node name GPFS state
-------------------------------------
3 node-23 active
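The start step itself is not shown in the preceding output. A minimal sketch of that step, assuming GPFS is started locally on the recovered node with the mmstartup command:
# mmstartup
After mmstartup completes, mmgetstate reports the active state that is shown in the previous example.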
The output of the mmccr check command on the recovered node shows a healthy status, as shown in the following example:
# mmccr check -Ye
mmccr::HEADER:version:reserved:reserved:NodeId:CheckMnemonic:ErrorCode:ErrorMsg:ListOfFailedEntities:ListOfSucceedEntities:Severity:
mmccr::0:1:::3:CCR_CLIENT_INIT:0:::/var/mmfs/ccr,/var/mmfs/ccr/committed,/var/mmfs/ccr/ccr.nodes,Security:OK:
mmccr::0:1:::3:FC_CCR_AUTH_KEYS:0:::/var/mmfs/ssl/authorized_ccr_keys:OK:
mmccr::0:1:::3:FC_CCR_PAXOS_CACHED:0:::/var/mmfs/ccr/cached,/var/mmfs/ccr/cached/ccr.paxos:OK:
mmccr::0:1:::3:FC_CCR_PAXOS_12:0:::/var/mmfs/ccr/ccr.paxos.1,/var/mmfs/ccr/ccr.paxos.2:OK:
mmccr::0:1:::3:PC_LOCAL_SERVER:0:::node-23.localnet.com:OK:
mmccr::0:1:::3:PC_IP_ADDR_LOOKUP:0:::node-23.localnet.com,0.000:OK:
mmccr::0:1:::3:PC_QUORUM_NODES:0:::10.0.100.21,10.0.100.22,10.0.100.23:OK:
mmccr::0:1:::3:FC_COMMITTED_DIR:0::0:7:OK:
mmccr::0:1:::3:TC_TIEBREAKER_DISKS:0::::OK:
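For reference, the recovery flow that is described in this topic condenses to the following sequence. This is a sketch that assumes node-21 is the healthy quorum node used as the restore source and node-23 is the node being recovered.
On one of the remaining quorum nodes, confirm that the cluster still has quorum:
# mmgetstate -a
On the node to be recovered, restore the configuration and start GPFS:
# mmsdrrestore -p node-21
# mmstartup
On the recovered node, verify the result:
# mmgetstate
# mmccr check -Ye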