Manual repair procedure for a broken multi-node cluster
For a multi-node cluster with more than one quorum node, the recovery procedure is similar to that for a single-node cluster with one quorum node. However, the number of possible CCR states differs, because the quorum nodes can hold different CCR states.
The step that evaluates the CCR's most recent Paxos state file from the available CCR states also differs. Finally, the patched CCR state and the consolidated committed files must be copied to every quorum node to bring the cluster back into a working state.
[root@node-11 ~]# mmlscluster
GPFS cluster information
========================
GPFS cluster name: gpfs-cluster-1.localnet.com
GPFS cluster id: 317908494312547875
GPFS UID domain: localnet.com
Remote shell command: /usr/bin/ssh
Remote file copy command: /usr/bin/scp
Repository type: CCR
GPFS cluster configuration servers:
-----------------------------------
Primary server: node-11.localnet.com (not in use)
Secondary server: (none)
Node Daemon node name IP address Admin node name Designation
----------------------------------------------------------------------------
1 node-11.localnet.com 10.0.100.11 node-11.localnet.com quorum
2 node-12.localnet.com 10.0.100.12 node-12.localnet.com quorum
3 node-13.localnet.com 10.0.100.13 node-13.localnet.com quorum
4 node-14.localnet.com 10.0.100.14 node-14.localnet.com
5 node-15.localnet.com 10.0.100.15 node-15.localnet.com
- Run the mmccr check command on all three quorum nodes to find the missing or corrupted files.
The command gives output similar to the following:
[root@node-11 ~]# mmdsh -N quorumnodes "mmccr check -Y -e" | grep "mmdsh\|FC_COMMITTED_DIR"
node-11.localnet.com: mmccr::0:1:::1:FC_COMMITTED_DIR:5:Files in committed directory missing or corrupted:1:7:WARNING:
mmdsh: node-11.localnet.com remote shell process had return code 149.
node-12.localnet.com: mmccr::0:1:::2:FC_COMMITTED_DIR:5:Files in committed directory missing or corrupted:1:7:WARNING:
mmdsh: node-12.localnet.com remote shell process had return code 149.
node-13.localnet.com: mmccr::0:1:::3:FC_COMMITTED_DIR:5:Files in committed directory missing or corrupted:1:7:WARNING:
mmdsh: node-13.localnet.com remote shell process had return code 149.
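The colon-delimited -Y output is machine readable and can be picked apart with standard tools. A minimal sketch, using one sample line from the output above; the field positions are inferred from that example, not from a published field specification:

```shell
#!/bin/sh
# Parse one sample "mmccr check -Y -e" line and pull out the node,
# the check name, and the severity with awk.
# Assumption: splitting on ":", the check name is field 9 and the
# severity is field 14 - inferred from the example output only.
line='node-11.localnet.com: mmccr::0:1:::1:FC_COMMITTED_DIR:5:Files in committed directory missing or corrupted:1:7:WARNING:'
summary=$(printf '%s\n' "$line" | awk -F: '{ printf "%s %s %s", $1, $9, $14 }')
echo "$summary"
```

Filtering like this avoids eyeballing the long raw lines when many quorum nodes report at once.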
Most mm commands executed on the quorum nodes fail at this point. Run the mmgetstate command to view the error pattern of the failed commands.
The command gives output similar to the following:
[root@node-11 ~]# mmgetstate -a
get file failed: Maximum number of retries reached (err 801)
gpfsClusterInit: Unexpected error from ccr fget mmsdrfs. Return code: 149
mmgetstate: Command failed. Examine previous error messages to determine cause.
- Shut down GPFS on all nodes in the cluster to prevent any issues that can be caused by an active mmfsd daemon. The mmshutdown -a command also fails while the CCR is unavailable. In such cases, run mmshutdown through the mmdsh command to bypass the unavailable CCR.
The command gives output similar to the following:
[root@node-11 ~]# mmdsh -N all mmshutdown
node-13.localnet.com: Mon Feb 26 17:01:18 CET 2018:mmshutdown: Starting force unmount of GPFS file systems
node-12.localnet.com: Mon Feb 26 17:01:17 CET 2018:mmshutdown: Starting force unmount of GPFS file systems
node-15.localnet.com: Mon Feb 26 17:01:17 CET 2018:mmshutdown: Starting force unmount of GPFS file systems
node-14.localnet.com: Mon Feb 26 17:01:17 CET 2018:mmshutdown: Starting force unmount of GPFS file systems
node-11.localnet.com: Mon Feb 26 17:01:18 CET 2018:mmshutdown: Starting force unmount of GPFS file systems
node-15.localnet.com: Mon Feb 26 17:01:22 CET 2018:mmshutdown: Shutting down GPFS daemons
node-13.localnet.com: Mon Feb 26 17:01:23 CET 2018:mmshutdown: Shutting down GPFS daemons
node-12.localnet.com: Mon Feb 26 17:01:22 CET 2018:mmshutdown: Shutting down GPFS daemons
node-14.localnet.com: Mon Feb 26 17:01:22 CET 2018:mmshutdown: Shutting down GPFS daemons
node-11.localnet.com: Mon Feb 26 17:01:23 CET 2018:mmshutdown: Shutting down GPFS daemons
node-15.localnet.com: Mon Feb 26 17:02:11 CET 2018:mmshutdown: Finished
node-13.localnet.com: Mon Feb 26 17:02:12 CET 2018:mmshutdown: Finished
node-11.localnet.com: Mon Feb 26 17:02:12 CET 2018:mmshutdown: Finished
node-14.localnet.com: Mon Feb 26 17:02:11 CET 2018:mmshutdown: Finished
node-12.localnet.com: Mon Feb 26 17:02:12 CET 2018:mmshutdown: Finished
Note: Use the mmdsh command to stop the mmsdrserv daemon and the startup scripts on all quorum nodes in the cluster:
[root@node-11 ~]# mmdsh -N quorumnodes "mmcommon killCcrMonitor"
Use the following command to check whether all GPFS daemons and monitor scripts have been stopped on all quorum nodes.
The command gives output similar to the following:
[root@node-11 ~]# mmdsh -N quorumnodes "ps -C mmfsd,mmccrmonitor,mmsdrserv"
node-11.localnet.com: PID TTY TIME CMD
mmdsh: node-11.localnet.com remote shell process had return code 1.
node-12.localnet.com: PID TTY TIME CMD
mmdsh: node-12.localnet.com remote shell process had return code 1.
node-13.localnet.com: PID TTY TIME CMD
mmdsh: node-13.localnet.com remote shell process had return code 1.
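The empty process listings above can also be checked programmatically: ps -C exits non-zero when none of the named processes exist, so a small helper can assert that a daemon is really gone. A sketch (ps -C is procps/Linux-specific):

```shell
#!/bin/sh
# Return success only when no process with the given command name is
# running on this host (ps -C exits non-zero if nothing matches).
daemon_stopped() {
    ! ps -C "$1" >/dev/null 2>&1
}
```

On the cluster this would be run through mmdsh on every quorum node, with mmfsd, mmccrmonitor, and mmsdrserv as the names to check.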
- Back up the entire CCR state of the three quorum nodes using the tar command:
[root@node-11 ~]# tar -cvf CCR_archive_node-11_20180226170307.tar /var/mmfs/ccr
The command gives output similar to the following:
/var/mmfs/ccr/
/var/mmfs/ccr/ccr.noauth
/var/mmfs/ccr/ccr.paxos.1
/var/mmfs/ccr/committed/
/var/mmfs/ccr/committed/mmsysmon.json.3.1.cee097c7.010002
/var/mmfs/ccr/committed/clusterEvents.8.12.963fe8ed.010232.bad.26086.4229216064.2018-02-26_16:59:25.250+0100
/var/mmfs/ccr/committed/ccr.nodes.1.1.e7e9c9f0.010000
/var/mmfs/ccr/committed/clusterEvents.8.11.963fe8ed.010231
/var/mmfs/ccr/committed/clusterEvents.8.12.963fe8ed.010232.bad.24040.4169168640.2018-02-26_16:57:39.226+0100
/var/mmfs/ccr/committed/genKeyData.5.1.a043b58e.010004
/var/mmfs/ccr/committed/mmLockFileDB.4.1.ffffffff.010003
/var/mmfs/ccr/committed/ccr.disks.2.1.ffffffff.010001
/var/mmfs/ccr/committed/mmsdrfs.7.10.e29fc7cd.010226
/var/mmfs/ccr/committed/clusterEvents.8.12.963fe8ed.010232.bad.22517.4083746624.2018-02-26_16:57:02.857+0100
/var/mmfs/ccr/committed/genKeyDataNew.6.1.a043b58e.010005
/var/mmfs/ccr/committed/genKeyDataNew.6.2.94f88a51.01010f
/var/mmfs/ccr/committed/clusterEvents.8.12.963fe8ed.010232.bad.27281.1088599808.2018-02-26_16:59:59.681+0100
/var/mmfs/ccr/committed/mmsdrfs.7.11.bf35437.01022a
/var/mmfs/ccr/ccr.disks
/var/mmfs/ccr/cached/
/var/mmfs/ccr/cached/ccr.paxos
/var/mmfs/ccr/ccr.nodes
/var/mmfs/ccr/ccr.paxos.2
Note: This example only shows the output for the first quorum node. The command must be executed on all quorum nodes.
- Create temporary directories to store the collected CCR state files.
Create two subdirectories inside the CCRtemp directory to collect the committed files from all quorum nodes. The committed subdirectory is the final directory: it keeps the intact files that are used in the final step to copy back the patched Paxos state. The committedTemp subdirectory is the intermediate directory: it keeps only the files from the quorum node that is currently being processed.
[root@node-11 ~]# mkdir -p /root/CCRtemp/committed /root/CCRtemp/committedTemp
[root@node-11 ~]# cd /root/CCRtemp/
- Copy the /var/mmfs/ccr/ccr.paxos.1 and /var/mmfs/ccr/ccr.paxos.2 files from every quorum node in the cluster to the current temporary directory, /root/CCRtemp, using the following commands:
[root@node-11 CCRtemp]# scp root@node-11:/var/mmfs/ccr/ccr.paxos.1 ./ccr.paxos.1.node-11
ccr.paxos.1 100% 4096 4.0KB/s 00:00
[root@node-11 CCRtemp]# scp root@node-11:/var/mmfs/ccr/ccr.paxos.2 ./ccr.paxos.2.node-11
ccr.paxos.2 100% 4096 4.0KB/s 00:00
Note: You can see the directory structure by using the following command:
[root@node-11 CCRtemp]# ls -al
The command gives output similar to the following:
total 40
drwxr-xr-x 4 root root 4096 Feb 26 17:10 .
dr-xr-x---. 4 root root 4096 Feb 26 17:07 ..
-rw------- 1 root root 4096 Feb 26 17:09 ccr.paxos.1.node-11
-rw------- 1 root root 4096 Feb 26 17:10 ccr.paxos.1.node-12
-rw------- 1 root root 4096 Feb 26 17:10 ccr.paxos.1.node-13
-rw------- 1 root root 4096 Feb 26 17:09 ccr.paxos.2.node-11
-rw------- 1 root root 4096 Feb 26 17:09 ccr.paxos.2.node-12
-rw------- 1 root root 4096 Feb 26 17:10 ccr.paxos.2.node-13
drwxr-xr-x 2 root root 4096 Feb 26 17:07 committed
drwxr-xr-x 2 root root 4096 Feb 26 17:07 committedTemp
- Switch to the committedTemp subdirectory, and copy the committed files from the first quorum node into it using the following command:
[root@node-11 committedTemp]# scp root@node-11:/var/mmfs/ccr/committed/* .
The command gives output similar to the following:
ccr.disks.2.1.ffffffff.010001 100% 0 0.0KB/s 00:00
ccr.nodes.1.1.e7e9c9f0.010000 100% 114 0.1KB/s 00:00
clusterEvents.8.11.963fe8ed.010231 100% 323 0.3KB/s 00:00
clusterEvents.8.12.963fe8ed.010232.bad.22517.4083746624.2018-02-26_16:57:02.857+0100 100% 0 0.0KB/s 00:00
clusterEvents.8.12.963fe8ed.010232.bad.24040.4169168640.2018-02-26_16:57:39.226+0100 100% 0 0.0KB/s 00:00
clusterEvents.8.12.963fe8ed.010232.bad.26086.4229216064.2018-02-26_16:59:25.250+0100 100% 0 0.0KB/s 00:00
clusterEvents.8.12.963fe8ed.010232.bad.27281.1088599808.2018-02-26_16:59:59.681+0100 100% 0 0.0KB/s 00:00
genKeyData.5.1.a043b58e.010004 100% 3531 3.5KB/s 00:00
genKeyDataNew.6.1.a043b58e.010005 100% 3531 3.5KB/s 00:00
genKeyDataNew.6.2.94f88a51.01010f 100% 3531 3.5KB/s 00:00
mmLockFileDB.4.1.ffffffff.010003 100% 0 0.0KB/s 00:00
mmsdrfs.7.10.e29fc7cd.010226 100% 4793 4.7KB/s 00:00
mmsdrfs.7.11.bf35437.01022a 100% 5395 5.3KB/s 00:00
mmsysmon.json.3.1.cee097c7.010002 100% 38 0.0KB/s 00:00
Note: You can see the directory structure by using the following command:
[root@node-11 committedTemp]# ls -al
The command gives output similar to the following:
total 48
drwxr-xr-x 2 root root 4096 Feb 26 17:12 .
drwxr-xr-x 4 root root 4096 Feb 26 17:10 ..
-rw-r--r-- 1 root root 0 Feb 26 17:12 ccr.disks.2.1.ffffffff.010001
-rw-r--r-- 1 root root 114 Feb 26 17:12 ccr.nodes.1.1.e7e9c9f0.010000
-rw-r--r-- 1 root root 323 Feb 26 17:12 clusterEvents.8.11.963fe8ed.010231
-rw-r--r-- 1 root root 0 Feb 26 17:12 clusterEvents.8.12.963fe8ed.010232.bad.22517.4083746624.2018-02-26_16:57:02.857+0100
-rw-r--r-- 1 root root 0 Feb 26 17:12 clusterEvents.8.12.963fe8ed.010232.bad.24040.4169168640.2018-02-26_16:57:39.226+0100
-rw-r--r-- 1 root root 0 Feb 26 17:12 clusterEvents.8.12.963fe8ed.010232.bad.26086.4229216064.2018-02-26_16:59:25.250+0100
-rw-r--r-- 1 root root 0 Feb 26 17:12 clusterEvents.8.12.963fe8ed.010232.bad.27281.1088599808.2018-02-26_16:59:59.681+0100
-rw------- 1 root root 3531 Feb 26 17:12 genKeyData.5.1.a043b58e.010004
-rw------- 1 root root 3531 Feb 26 17:12 genKeyDataNew.6.1.a043b58e.010005
-rw------- 1 root root 3531 Feb 26 17:12 genKeyDataNew.6.2.94f88a51.01010f
-rw-r--r-- 1 root root 0 Feb 26 17:12 mmLockFileDB.4.1.ffffffff.010003
-rw-r--r-- 1 root root 4793 Feb 26 17:12 mmsdrfs.7.10.e29fc7cd.010226
-rw-r--r-- 1 root root 5395 Feb 26 17:12 mmsdrfs.7.11.bf35437.01022a
-rw-r--r-- 1 root root 38 Feb 26 17:12 mmsysmon.json.3.1.cee097c7.010002
- Verify the CRC of the files copied from the first quorum node using the following command:
[root@node-11 committedTemp]# cksum * | awk '{ printf "%x %s\n", $1, $3 }'
The command gives output similar to the following:
ffffffff ccr.disks.2.1.ffffffff.010001
e7e9c9f0 ccr.nodes.1.1.e7e9c9f0.010000
963fe8ed clusterEvents.8.11.963fe8ed.010231
ffffffff clusterEvents.8.12.963fe8ed.010232.bad.22517.4083746624.2018-02-26_16:57:02.857+0100
ffffffff clusterEvents.8.12.963fe8ed.010232.bad.24040.4169168640.2018-02-26_16:57:39.226+0100
ffffffff clusterEvents.8.12.963fe8ed.010232.bad.26086.4229216064.2018-02-26_16:59:25.250+0100
ffffffff clusterEvents.8.12.963fe8ed.010232.bad.27281.1088599808.2018-02-26_16:59:59.681+0100
a043b58e genKeyData.5.1.a043b58e.010004
a043b58e genKeyDataNew.6.1.a043b58e.010005
94f88a51 genKeyDataNew.6.2.94f88a51.01010f
ffffffff mmLockFileDB.4.1.ffffffff.010003
e29fc7cd mmsdrfs.7.10.e29fc7cd.010226
bf35437 mmsdrfs.7.11.bf35437.01022a
cee097c7 mmsysmon.json.3.1.cee097c7.010002
Faulty files are identified by a mismatch between the computed CRC and the hexadecimal CRC embedded in the file name.
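This comparison can be scripted: each committed file name embeds its expected CRC as a hexadecimal field, so comparing it with the output of cksum flags the damaged copies automatically. A sketch, assuming the CRC is the second-to-last dot-separated field of the name (as in the listings above; the renamed *.bad.* copies do not follow that pattern and simply show up as mismatches):

```shell
#!/bin/sh
# List committed files whose content CRC (from cksum, printed in hex)
# differs from the CRC field embedded in the file name.
# Naming assumption: name.id.version.crc.uid - the CRC is the
# second-to-last dot-separated field of the file name.
check_committed() {
    dir=$1
    for f in "$dir"/*; do
        name=$(basename "$f")
        want=$(printf '%s\n' "$name" | awk -F. '{ print $(NF-1) }')
        have=$(cksum "$f" | awk '{ printf "%x", $1 }')
        [ "$want" = "$have" ] || echo "MISMATCH $name (name: $want, content: $have)"
    done
}
```

For example, check_committed /root/CCRtemp/committedTemp prints one MISMATCH line per faulty file. Note that a truncated, empty file always computes to ffffffff, the cksum of empty input, which is why the bad clusterEvents copies above all show that value.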
- Delete the files with a mismatching CRC using the following command:
[root@node-11 committedTemp]# rm clusterEvents.8.12.963fe8ed.010232.bad.2*
The command gives output similar to the following:
rm: remove regular empty file ‘clusterEvents.8.12.963fe8ed.010232.bad.22517.4083746624.2018-02-26_16:57:02.857+0100’? y
rm: remove regular empty file ‘clusterEvents.8.12.963fe8ed.010232.bad.24040.4169168640.2018-02-26_16:57:39.226+0100’? y
rm: remove regular empty file ‘clusterEvents.8.12.963fe8ed.010232.bad.26086.4229216064.2018-02-26_16:59:25.250+0100’? y
rm: remove regular empty file ‘clusterEvents.8.12.963fe8ed.010232.bad.27281.1088599808.2018-02-26_16:59:59.681+0100’? y
- Move the remaining files into the committed subdirectory using the following command:
[root@node-11 committedTemp]# mv -i * ../committed
- Copy the committed files from the next quorum node into the committedTemp directory using the following command:
[root@node-11 committedTemp]# scp root@node-12:/var/mmfs/ccr/committed/* .
The command gives output similar to the following:
ccr.disks.2.1.ffffffff.010001 100% 0 0.0KB/s 00:00
ccr.nodes.1.1.e7e9c9f0.010000 100% 114 0.1KB/s 00:00
clusterEvents.8.11.963fe8ed.010231 100% 323 0.3KB/s 00:00
clusterEvents.8.12.963fe8ed.010232.bad.18737.3245463360.2018-02-26_16:57:07.695+0100 100% 0 0.0KB/s 00:00
clusterEvents.8.12.963fe8ed.010232.bad.19994.3932075776.2018-02-26_16:57:45.020+0100 100% 0 0.0KB/s 00:00
clusterEvents.8.12.963fe8ed.010232.bad.21275.3060160320.2018-02-26_16:59:33.687+0100 100% 0 0.0KB/s 00:00
clusterEvents.8.12.963fe8ed.010232.bad.22112.354830080.2018-02-26_16:59:59.467+0100 100% 0 0.0KB/s 00:00
genKeyData.5.1.a043b58e.010004 100% 3531 3.5KB/s 00:00
genKeyDataNew.6.1.a043b58e.010005 100% 3531 3.5KB/s 00:00
genKeyDataNew.6.2.94f88a51.01010f 100% 3531 3.5KB/s 00:00
mmLockFileDB.4.1.ffffffff.010003 100% 0 0.0KB/s 00:00
mmsdrfs.7.10.e29fc7cd.010226 100% 4793 4.7KB/s 00:00
mmsdrfs.7.11.bf35437.01022a 100% 5395 5.3KB/s 00:00
mmsysmon.json.3.1.cee097c7.010002 100% 38 0.0KB/s 00:00
- Verify the CRC of the files using the following command:
[root@node-11 committedTemp]# cksum * | awk '{ printf "%x %s\n", $1, $3 }'
The command gives output similar to the following:
ffffffff ccr.disks.2.1.ffffffff.010001
e7e9c9f0 ccr.nodes.1.1.e7e9c9f0.010000
963fe8ed clusterEvents.8.11.963fe8ed.010231
ffffffff clusterEvents.8.12.963fe8ed.010232.bad.18737.3245463360.2018-02-26_16:57:07.695+0100
ffffffff clusterEvents.8.12.963fe8ed.010232.bad.19994.3932075776.2018-02-26_16:57:45.020+0100
ffffffff clusterEvents.8.12.963fe8ed.010232.bad.21275.3060160320.2018-02-26_16:59:33.687+0100
ffffffff clusterEvents.8.12.963fe8ed.010232.bad.22112.354830080.2018-02-26_16:59:59.467+0100
a043b58e genKeyData.5.1.a043b58e.010004
a043b58e genKeyDataNew.6.1.a043b58e.010005
94f88a51 genKeyDataNew.6.2.94f88a51.01010f
ffffffff mmLockFileDB.4.1.ffffffff.010003
e29fc7cd mmsdrfs.7.10.e29fc7cd.010226
bf35437 mmsdrfs.7.11.bf35437.01022a
cee097c7 mmsysmon.json.3.1.cee097c7.010002
- Delete the files with a mismatching CRC using the following command:
[root@node-11 committedTemp]# rm clusterEvents.8.12.963fe8ed.010232.bad.*
The command gives output similar to the following:
rm: remove regular empty file ‘clusterEvents.8.12.963fe8ed.010232.bad.18737.3245463360.2018-02-26_16:57:07.695+0100’? y
rm: remove regular empty file ‘clusterEvents.8.12.963fe8ed.010232.bad.19994.3932075776.2018-02-26_16:57:45.020+0100’? y
rm: remove regular empty file ‘clusterEvents.8.12.963fe8ed.010232.bad.21275.3060160320.2018-02-26_16:59:33.687+0100’? y
rm: remove regular empty file ‘clusterEvents.8.12.963fe8ed.010232.bad.22112.354830080.2018-02-26_16:59:59.467+0100’? y
- Copy the remaining files into the committed subdirectory using the cp command with the -i option.
Note: Answer n to each prompt for a file that already exists in the committed subdirectory. This ensures that only the files that do not already exist are copied to the committed subdirectory.
The command gives output similar to the following:
[root@node-11 committedTemp]# cp -i * ../committed
cp: overwrite ‘../committed/ccr.disks.2.1.ffffffff.010001’? n
cp: overwrite ‘../committed/ccr.nodes.1.1.e7e9c9f0.010000’? n
cp: overwrite ‘../committed/clusterEvents.8.11.963fe8ed.010231’? n
cp: overwrite ‘../committed/genKeyData.5.1.a043b58e.010004’? n
cp: overwrite ‘../committed/genKeyDataNew.6.1.a043b58e.010005’? n
cp: overwrite ‘../committed/genKeyDataNew.6.2.94f88a51.01010f’? n
cp: overwrite ‘../committed/mmLockFileDB.4.1.ffffffff.010003’? n
cp: overwrite ‘../committed/mmsdrfs.7.10.e29fc7cd.010226’? n
cp: overwrite ‘../committed/mmsdrfs.7.11.bf35437.01022a’? n
cp: overwrite ‘../committed/mmsysmon.json.3.1.cee097c7.010002’? n
Remove the files in the committedTemp subdirectory using the following command:
[root@node-11 committedTemp]# rm *
The command gives output similar to the following:
rm: remove regular empty file ‘ccr.disks.2.1.ffffffff.010001’? y
rm: remove regular file ‘ccr.nodes.1.1.e7e9c9f0.010000’? y
rm: remove regular file ‘clusterEvents.8.11.963fe8ed.010231’? y
rm: remove regular file ‘genKeyData.5.1.a043b58e.010004’? y
rm: remove regular file ‘genKeyDataNew.6.1.a043b58e.010005’? y
rm: remove regular file ‘genKeyDataNew.6.2.94f88a51.01010f’? y
rm: remove regular empty file ‘mmLockFileDB.4.1.ffffffff.010003’? y
rm: remove regular file ‘mmsdrfs.7.10.e29fc7cd.010226’? y
rm: remove regular file ‘mmsdrfs.7.11.bf35437.01022a’? y
rm: remove regular file ‘mmsysmon.json.3.1.cee097c7.010002’? y
Note: This step prepares the committedTemp subdirectory for the files from the next quorum node, if any.
- Repeat steps 10 to 14 for all remaining available quorum nodes. The /root/CCRtemp/committed directory now contains all the intact files from all quorum nodes, and it can be used to patch the CCR Paxos state.
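The merge half of the per-node cycle above can be condensed into a helper. A sketch with hypothetical directory arguments: it skips copies carrying the ".bad." corruption marker and never overwrites an existing destination file, which is exactly what answering n to every cp -i prompt achieves (files that fail the CRC check must still be deleted first, since only the copies CCR has already renamed carry the marker):

```shell
#!/bin/sh
# Merge one node's fetched committed files into the final directory:
# skip anything with the ".bad." corruption marker in its name, and
# never overwrite a file that is already in the destination (the
# non-interactive equivalent of answering "n" to every cp -i prompt).
merge_committed() {
    src=$1
    dest=$2
    for f in "$src"/*; do
        name=$(basename "$f")
        case $name in
            *.bad.*) continue ;;   # known-corrupt copy, ignore
        esac
        [ -e "$dest/$name" ] || cp "$f" "$dest/$name"
    done
}
```

For example, merge_committed /root/CCRtemp/committedTemp /root/CCRtemp/committed would be run once per quorum node after that node's files have been fetched and CRC-checked.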
- Change back to the parent directory, and determine the most recent Paxos state from the Paxos state files in this directory by using the mmccr readpaxos command:
The command gives output similar to the following:
[root@node-11 CCRtemp]# mmccr readpaxos ccr.paxos.1.node-11 | grep seq
dblk: seq 53, mbal (0.0), bal (0.0), inp ((n0,e0),0):(none):-1:None, leaderChallengeVersion 0
[root@node-11 CCRtemp]# mmccr readpaxos ccr.paxos.2.node-11 | grep seq
dblk: seq 52, mbal (1.1), bal (1.1), inp ((n0,e0),0):lu:3:[1,23333], leaderChallengeVersion 0
[root@node-11 CCRtemp]# mmccr readpaxos ccr.paxos.1.node-12 | grep seq
dblk: seq 53, mbal (0.0), bal (0.0), inp ((n0,e0),0):(none):-1:None, leaderChallengeVersion 0
[root@node-11 CCRtemp]# mmccr readpaxos ccr.paxos.2.node-12 | grep seq
dblk: seq 52, mbal (1.1), bal (1.1), inp ((n0,e0),0):lu:3:[1,23333], leaderChallengeVersion 0
[root@node-11 CCRtemp]# mmccr readpaxos ccr.paxos.1.node-13 | grep seq
dblk: seq 53, mbal (0.0), bal (0.0), inp ((n0,e0),0):(none):-1:None, leaderChallengeVersion 0
[root@node-11 CCRtemp]# mmccr readpaxos ccr.paxos.2.node-13 | grep seq
dblk: seq 52, mbal (1.1), bal (1.1), inp ((n0,e0),0):lu:3:[1,23333], leaderChallengeVersion 0
The CCR keeps two Paxos state files in its /var/mmfs/ccr directory, ccr.paxos.1 and ccr.paxos.2, and writes to them alternately. Maintaining two copies ensures that one copy stays intact if a write to the other file fails and corrupts it. The Paxos state file with the highest sequence number is the most recent one; use that file in the patching step that follows.
In the example above, the ccr.paxos.1.node-11 file is the most recent one, with sequence number 53. In a multi-node cluster, the quorum nodes do not necessarily have the same set of sequence numbers, depending on how many updates the CCR has seen on each node.
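Selecting the most recent state file can also be scripted. The sketch below parses readpaxos-style "seq" lines; the here-document carries sample values from the example above, standing in for real `mmccr readpaxos FILE | grep seq` calls:

```shell
#!/bin/sh
# Pick the Paxos state file with the highest "seq" number from lines of
# the form "file dblk: seq N, ...". The here-document holds sample data
# from the example; in practice each line would be produced by running
# "mmccr readpaxos $file | grep seq" for every collected state file.
best=""
best_seq=-1
while read -r file rest; do
    s=$(printf '%s\n' "$rest" | sed -n 's/.*seq \([0-9][0-9]*\),.*/\1/p')
    if [ "$s" -gt "$best_seq" ]; then
        best_seq=$s
        best=$file
    fi
done <<'EOF'
ccr.paxos.1.node-11 dblk: seq 53, mbal (0.0), bal (0.0)
ccr.paxos.2.node-11 dblk: seq 52, mbal (1.1), bal (1.1)
ccr.paxos.1.node-12 dblk: seq 53, mbal (0.0), bal (0.0)
ccr.paxos.2.node-12 dblk: seq 52, mbal (1.1), bal (1.1)
EOF
echo "most recent: $best (seq $best_seq)"
```

On a tie (several files share the highest sequence number, as here), any of them can serve as the patching input; the sketch keeps the first one seen.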
The ccr.paxos.1.node-11 file acts as the input for the patching step. Invoke the mmccr patchpaxos command in the current CCR temp directory with three parameters: the path to the most recent CCR Paxos state file, the path to the directory of intact CCR files collected in the previous steps, and the path of the patched Paxos state file that the command creates:
The command gives output similar to the following:
[root@node-11 CCRtemp]# mmccr patchpaxos ./ccr.paxos.1.node-11 ./committed/ ./myPatched_ccr.paxos.1
Committed state found in ./ccr.paxos.1.node-11:
config: minNodes: 1 version 0
nodes: [(N1,S0,V0,L1), (N2,S1,V0,L1), (N3,S2,V0,L1)]
disks: []
leader: id 1 version 3
updates: horizon -1 {(n1,e0): 5, (n1,e1): 33, (n1,e2): 50}
values: 1, max deleted version 9
mmRunningCommand = version 3 ""
files: 8, max deleted version 0
1 = version 1 uid ((n1,e0),0) crc E7E9C9F0
2 = version 1 uid ((n1,e0),1) crc FFFFFFFF
3 = version 1 uid ((n1,e0),2) crc CEE097C7
4 = version 1 uid ((n1,e0),3) crc FFFFFFFF
5 = version 1 uid ((n1,e0),4) crc A043B58E
6 = version 2 uid ((n1,e1),15) crc 94F88A51
7 = version 11 uid ((n1,e2),42) crc 0BF35437
8 = version 12 uid ((n1,e2),50) crc 963FE8ED
Comparing to content of './committed/':
match file: name: 'ccr.nodes' suffix: '1.1.e7e9c9f0.010000' id: 1 version: 1 crc: e7e9c9f0 uid: ((n1,e0),0) and file list entry: 1.1.e7e9c9f0.010000
match file: name: 'ccr.disks' suffix: '2.1.ffffffff.010001' id: 2 version: 1 crc: ffffffff uid: ((n1,e0),1) and file list entry: 2.1.ffffffff.010001
match file: name: 'mmsysmon.json' suffix: '3.1.cee097c7.010002' id: 3 version: 1 crc: cee097c7 uid: ((n1,e0),2) and file list entry: 3.1.cee097c7.010002
match file: name: 'mmLockFileDB' suffix: '4.1.ffffffff.010003' id: 4 version: 1 crc: ffffffff uid: ((n1,e0),3) and file list entry: 4.1.ffffffff.010003
match file: name: 'genKeyData' suffix: '5.1.a043b58e.010004' id: 5 version: 1 crc: a043b58e uid: ((n1,e0),4) and file list entry: 5.1.a043b58e.010004
match file: name: 'genKeyDataNew' suffix: '6.2.94f88a51.01010f' id: 6 version: 2 crc: 94f88a51 uid: ((n1,e1),15) and file list entry: 6.2.94f88a51.01010f
match file: name: 'mmsdrfs' suffix: '7.11.bf35437.01022a' id: 7 version: 11 crc: bf35437 uid: ((n1,e2),42) and file list entry: 7.11.bf35437.01022a
older: name: 'clusterEvents' suffix: '8.11.963fe8ed.010231' id: 8 version: 11 crc: 963fe8ed uid: ((n1,e2),49)
*** reverting committed file list version 12 uid ((n1,e2),50)
Found 7 matching, 0 deleted, 0 added, 0 updated, 1 reverted, 0 reset
Verifying update history
Writing 1 changes to ./myPatched_ccr.paxos.1
config: minNodes: 1 version 0
nodes: [(N1,S0,V0,L1), (N2,S1,V0,L1), (N3,S2,V0,L1)]
disks: []
leader: id 1 version 3
updates: horizon -1 {(n1,e0): 5, (n1,e1): 33, (n1,e2): 50}
values: 1, max deleted version 9
mmRunningCommand = version 3 ""
files: 8, max deleted version 0
1 = version 1 uid ((n1,e0),0) crc E7E9C9F0
2 = version 1 uid ((n1,e0),1) crc FFFFFFFF
3 = version 1 uid ((n1,e0),2) crc CEE097C7
4 = version 1 uid ((n1,e0),3) crc FFFFFFFF
5 = version 1 uid ((n1,e0),4) crc A043B58E
6 = version 2 uid ((n1,e1),15) crc 94F88A51
7 = version 11 uid ((n1,e2),42) crc 0BF35437
8 = version 11 uid ((n1,e2),49) crc 963FE8ED
- Copy the patched CCR Paxos state file and the files in the committed directory back to the appropriate directories on every quorum node using the following commands:
The commands give output similar to the following:
[root@node-11 CCRtemp]# scp myPatched_ccr.paxos.1 root@node-11:/var/mmfs/ccr/ccr.paxos.1
myPatched_ccr.paxos.1 100% 160 0.2KB/s 00:00
[root@node-11 CCRtemp]# scp myPatched_ccr.paxos.1 root@node-11:/var/mmfs/ccr/ccr.paxos.2
myPatched_ccr.paxos.1 100% 160 0.2KB/s 00:00
[root@node-11 CCRtemp]# scp myPatched_ccr.paxos.1 root@node-12:/var/mmfs/ccr/ccr.paxos.1
myPatched_ccr.paxos.1 100% 160 0.2KB/s 00:00
[root@node-11 CCRtemp]# scp myPatched_ccr.paxos.1 root@node-12:/var/mmfs/ccr/ccr.paxos.2
myPatched_ccr.paxos.1 100% 160 0.2KB/s 00:00
[root@node-11 CCRtemp]# scp myPatched_ccr.paxos.1 root@node-13:/var/mmfs/ccr/ccr.paxos.1
myPatched_ccr.paxos.1 100% 160 0.2KB/s 00:00
[root@node-11 CCRtemp]# scp myPatched_ccr.paxos.1 root@node-13:/var/mmfs/ccr/ccr.paxos.2
myPatched_ccr.paxos.1 100% 160 0.2KB/s 00:00
[root@node-11 CCRtemp]# scp ./committed/* root@node-11:/var/mmfs/ccr/committed/
ccr.disks.2.1.ffffffff.010001 100% 0 0.0KB/s 00:00
ccr.nodes.1.1.e7e9c9f0.010000 100% 114 0.1KB/s 00:00
clusterEvents.8.11.963fe8ed.010231 100% 323 0.3KB/s 00:00
genKeyData.5.1.a043b58e.010004 100% 3531 3.5KB/s 00:00
genKeyDataNew.6.1.a043b58e.010005 100% 3531 3.5KB/s 00:00
genKeyDataNew.6.2.94f88a51.01010f 100% 3531 3.5KB/s 00:00
mmLockFileDB.4.1.ffffffff.010003 100% 0 0.0KB/s 00:00
mmsdrfs.7.10.e29fc7cd.010226 100% 4793 4.7KB/s 00:00
mmsdrfs.7.11.bf35437.01022a 100% 5395 5.3KB/s 00:00
mmsysmon.json.3.1.cee097c7.010002 100% 38 0.0KB/s 00:00
[root@node-11 CCRtemp]# scp ./committed/* root@node-12:/var/mmfs/ccr/committed/
ccr.disks.2.1.ffffffff.010001 100% 0 0.0KB/s 00:00
ccr.nodes.1.1.e7e9c9f0.010000 100% 114 0.1KB/s 00:00
clusterEvents.8.11.963fe8ed.010231 100% 323 0.3KB/s 00:00
genKeyData.5.1.a043b58e.010004 100% 3531 3.5KB/s 00:00
genKeyDataNew.6.1.a043b58e.010005 100% 3531 3.5KB/s 00:00
genKeyDataNew.6.2.94f88a51.01010f 100% 3531 3.5KB/s 00:00
mmLockFileDB.4.1.ffffffff.010003 100% 0 0.0KB/s 00:00
mmsdrfs.7.10.e29fc7cd.010226 100% 4793 4.7KB/s 00:00
mmsdrfs.7.11.bf35437.01022a 100% 5395 5.3KB/s 00:00
mmsysmon.json.3.1.cee097c7.010002 100% 38 0.0KB/s 00:00
[root@node-11 CCRtemp]# scp ./committed/* root@node-13:/var/mmfs/ccr/committed/
ccr.disks.2.1.ffffffff.010001 100% 0 0.0KB/s 00:00
ccr.nodes.1.1.e7e9c9f0.010000 100% 114 0.1KB/s 00:00
clusterEvents.8.11.963fe8ed.010231 100% 323 0.3KB/s 00:00
genKeyData.5.1.a043b58e.010004 100% 3531 3.5KB/s 00:00
genKeyDataNew.6.1.a043b58e.010005 100% 3531 3.5KB/s 00:00
genKeyDataNew.6.2.94f88a51.01010f 100% 3531 3.5KB/s 00:00
mmLockFileDB.4.1.ffffffff.010003 100% 0 0.0KB/s 00:00
mmsdrfs.7.10.e29fc7cd.010226 100% 4793 4.7KB/s 00:00
mmsdrfs.7.11.bf35437.01022a 100% 5395 5.3KB/s 00:00
mmsysmon.json.3.1.cee097c7.010002 100% 38 0.0KB/s 00:00
- Start the mmsdrserv daemon and the monitor scripts that were stopped previously:
[root@node-11 ~]# mmdsh -N quorumnodes "mmcommon startCcrMonitor"
- Verify that the mmsdrserv daemon and its monitor script have restarted using the following command:
The command gives output similar to the following:
[root@node-11 ~]# mmdsh -N quorumnodes "ps -C mmfsd,mmccrmonitor,mmsdrserv"
node-11.localnet.com: PID TTY TIME CMD
node-11.localnet.com: 3518 ? 00:00:00 mmccrmonitor
node-11.localnet.com: 3734 ? 00:00:00 mmsdrserv
node-11.localnet.com: 3816 ? 00:00:00 mmccrmonitor
node-12.localnet.com: PID TTY TIME CMD
node-12.localnet.com: 30356 ? 00:00:00 mmccrmonitor
node-12.localnet.com: 30572 ? 00:00:00 mmsdrserv
node-12.localnet.com: 30648 ? 00:00:00 mmccrmonitor
node-13.localnet.com: PID TTY TIME CMD
node-13.localnet.com: 738 ? 00:00:00 mmccrmonitor
node-13.localnet.com: 958 ? 00:00:00 mmsdrserv
node-13.localnet.com: 1040 ? 00:00:00 mmccrmonitor
The mmccr check command now succeeds, giving output similar to the following:
[root@node-11 ~]# mmdsh -N quorumnodes "mmccr check -Y -e" | grep "mmdsh\|FC_COMMITTED_DIR"
node-12.localnet.com: mmccr::0:1:::2:FC_COMMITTED_DIR:0::0:8:OK:
node-11.localnet.com: mmccr::0:1:::1:FC_COMMITTED_DIR:0::0:8:OK:
node-13.localnet.com: mmccr::0:1:::3:FC_COMMITTED_DIR:0::0:8:OK:
The mm commands work again; however, the cluster is still down:
[root@node-11 ~]# mmgetstate -a
Node number Node name GPFS state
-------------------------------------------
1 node-11 down
2 node-12 down
3 node-13 down
4 node-14 down
5 node-15 down
[root@node-11 ~]# mmlscluster
GPFS cluster information
========================
GPFS cluster name: gpfs-cluster-1.localnet.com
GPFS cluster id: 317908494312547875
GPFS UID domain: localnet.com
Remote shell command: /usr/bin/ssh
Remote file copy command: /usr/bin/scp
Repository type: CCR
GPFS cluster configuration servers:
-----------------------------------
Primary server: node-11.localnet.com (not in use)
Secondary server: (none)
Node Daemon node name IP address Admin node name Designation
----------------------------------------------------------------------------
1 node-11.localnet.com 10.0.100.11 node-11.localnet.com quorum
2 node-12.localnet.com 10.0.100.12 node-12.localnet.com quorum
3 node-13.localnet.com 10.0.100.13 node-13.localnet.com quorum
4 node-14.localnet.com 10.0.100.14 node-14.localnet.com
5 node-15.localnet.com 10.0.100.15 node-15.localnet.com
- Start GPFS on all nodes and bring the cluster up again using the following command:
The command gives output similar to the following:
[root@node-11 ~]# mmstartup -a
Mon Feb 26 18:04:05 CET 2018: mmstartup: Starting GPFS ...
[root@node-11 ~]# mmgetstate -a
Node number Node name GPFS state
-------------------------------------------
1 node-11 active
2 node-12 active
3 node-13 active
4 node-14 active
5 node-15 active
The master copy of the GPFS configuration file (mmsdrfs) might have been corrupted. The CCR patch command rolls a corrupted file back to the latest available intact version, which means that for the mmsdrfs file, the configuration changes made between the last intact version and the corrupted one are lost. If GPFS shows further errors in the administration log during startup, it might be necessary to reboot the quorum nodes to clean up the cached memory and all drivers.