Manual repair procedure for a broken multi-node cluster
For a multi-node cluster with more than one quorum node, the recovery procedure is similar to that for a single-node cluster with one quorum node. However, the number of possible CCR states differs, because the quorum nodes can hold different CCR states.
The step that evaluates the CCR's most recent Paxos state file from the available CCR states also differs. Finally, the patched CCR state and the consolidated committed files must be copied to every quorum node to bring the cluster back into a working state.
[root@node-11 ~]# mmlscluster
GPFS cluster information
========================
GPFS cluster name: gpfs-cluster-1.localnet.com
GPFS cluster id: 317908494312547875
GPFS UID domain: localnet.com
Remote shell command: /usr/bin/ssh
Remote file copy command: /usr/bin/scp
Repository type: CCR
GPFS cluster configuration servers:
-----------------------------------
Primary server: node-11.localnet.com (not in use)
Secondary server: (none)
Node Daemon node name IP address Admin node name Designation
----------------------------------------------------------------------------
1 node-11.localnet.com 10.0.100.11 node-11.localnet.com quorum
2 node-12.localnet.com 10.0.100.12 node-12.localnet.com quorum
3 node-13.localnet.com 10.0.100.13 node-13.localnet.com quorum
4 node-14.localnet.com 10.0.100.14 node-14.localnet.com
5 node-15.localnet.com 10.0.100.15 node-15.localnet.com
- Run the mmccr check command on all three quorum nodes to find the missing or corrupted files.
The command gives output similar to the following:
[root@node-11 ~]# mmdsh -N quorumnodes "mmccr check -Y -e" | grep "mmdsh\|FC_COMMITTED_DIR"
node-11.localnet.com: mmccr::0:1:::1:FC_COMMITTED_DIR:5:Files in committed directory missing or corrupted:1:7:WARNING:
mmdsh: node-11.localnet.com remote shell process had return code 149.
node-12.localnet.com: mmccr::0:1:::2:FC_COMMITTED_DIR:5:Files in committed directory missing or corrupted:1:7:WARNING:
mmdsh: node-12.localnet.com remote shell process had return code 149.
node-13.localnet.com: mmccr::0:1:::3:FC_COMMITTED_DIR:5:Files in committed directory missing or corrupted:1:7:WARNING:
mmdsh: node-13.localnet.com remote shell process had return code 149.
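The colon-delimited -Y output is machine readable and can be picked apart with standard tools. A minimal sketch, using one sample line from the output above; the field positions are inferred from that example, not from a published field specification:

```shell
#!/bin/sh
# Parse one sample "mmccr check -Y -e" line and pull out the node,
# the check name, and the severity with awk.
# Assumption: splitting on ":", the check name is field 9 and the
# severity is field 14 - inferred from the example output only.
line='node-11.localnet.com: mmccr::0:1:::1:FC_COMMITTED_DIR:5:Files in committed directory missing or corrupted:1:7:WARNING:'
summary=$(printf '%s\n' "$line" | awk -F: '{ printf "%s %s %s", $1, $9, $14 }')
echo "$summary"
```

Filtering like this avoids eyeballing the long raw lines when many quorum nodes report at once.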
Most mm commands executed on the quorum nodes fail at this point. Run the mmgetstate command to view the error pattern of the failed commands.
The command gives output similar to the following:
[root@node-11 ~]# mmgetstate -a
get file failed: Maximum number of retries reached (err 801)
gpfsClusterInit: Unexpected error from ccr fget mmsdrfs. Return code: 149
mmgetstate: Command failed. Examine previous error messages to determine cause.
- Shut down GPFS on all nodes in the cluster to prevent any issues that can be caused by an active mmfsd daemon. The mmshutdown -a command also fails while the CCR is unavailable. In such cases, run mmshutdown through the mmdsh command to bypass the unavailable CCR.
The command gives output similar to the following:
[root@node-11 ~]# mmdsh -N all mmshutdown
node-13.localnet.com: Mon Feb 26 17:01:18 CET 2018:mmshutdown: Starting force unmount of GPFS file systems
node-12.localnet.com: Mon Feb 26 17:01:17 CET 2018:mmshutdown: Starting force unmount of GPFS file systems
node-15.localnet.com: Mon Feb 26 17:01:17 CET 2018:mmshutdown: Starting force unmount of GPFS file systems
node-14.localnet.com: Mon Feb 26 17:01:17 CET 2018:mmshutdown: Starting force unmount of GPFS file systems
node-11.localnet.com: Mon Feb 26 17:01:18 CET 2018:mmshutdown: Starting force unmount of GPFS file systems
node-15.localnet.com: Mon Feb 26 17:01:22 CET 2018:mmshutdown: Shutting down GPFS daemons
node-13.localnet.com: Mon Feb 26 17:01:23 CET 2018:mmshutdown: Shutting down GPFS daemons
node-12.localnet.com: Mon Feb 26 17:01:22 CET 2018:mmshutdown: Shutting down GPFS daemons
node-14.localnet.com: Mon Feb 26 17:01:22 CET 2018:mmshutdown: Shutting down GPFS daemons
node-11.localnet.com: Mon Feb 26 17:01:23 CET 2018:mmshutdown: Shutting down GPFS daemons
node-15.localnet.com: Mon Feb 26 17:02:11 CET 2018:mmshutdown: Finished
node-13.localnet.com: Mon Feb 26 17:02:12 CET 2018:mmshutdown: Finished
node-11.localnet.com: Mon Feb 26 17:02:12 CET 2018:mmshutdown: Finished
node-14.localnet.com: Mon Feb 26 17:02:11 CET 2018:mmshutdown: Finished
node-12.localnet.com: Mon Feb 26 17:02:12 CET 2018:mmshutdown: Finished
Note: Use the mmdsh command to stop the mmsdrserv daemon and the startup scripts on all quorum nodes in the cluster:
[root@node-11 ~]# mmdsh -N quorumnodes "mmcommon killCcrMonitor"
Use the following command to check whether all GPFS daemons and monitor scripts have been stopped on all quorum nodes.
The command gives output similar to the following:
[root@node-11 ~]# mmdsh -N quorumnodes "ps -C mmfsd,mmccrmonitor,mmsdrserv"
node-11.localnet.com: PID TTY TIME CMD
mmdsh: node-11.localnet.com remote shell process had return code 1.
node-12.localnet.com: PID TTY TIME CMD
mmdsh: node-12.localnet.com remote shell process had return code 1.
node-13.localnet.com: PID TTY TIME CMD
mmdsh: node-13.localnet.com remote shell process had return code 1.
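The empty process listings above can also be checked programmatically: ps -C exits non-zero when none of the named processes exist, so a small helper can assert that a daemon is really gone. A sketch (ps -C is procps/Linux-specific):

```shell
#!/bin/sh
# Return success only when no process with the given command name is
# running on this host (ps -C exits non-zero if nothing matches).
daemon_stopped() {
    ! ps -C "$1" >/dev/null 2>&1
}
```

On the cluster this would be run through mmdsh on every quorum node, with mmfsd, mmccrmonitor, and mmsdrserv as the names to check.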
- Back up the entire CCR state of the three quorum nodes using the tar command:
[root@node-11 ~]# tar -cvf CCR_archive_node-11_20180226170307.tar /var/mmfs/ccr
The command gives output similar to the following:
/var/mmfs/ccr/
/var/mmfs/ccr/ccr.noauth
/var/mmfs/ccr/ccr.paxos.1
/var/mmfs/ccr/committed/
/var/mmfs/ccr/committed/mmsysmon.json.3.1.cee097c7.010002
/var/mmfs/ccr/committed/clusterEvents.8.12.963fe8ed.010232.bad.26086.4229216064.2018-02-26_16:59:25.250+0100
/var/mmfs/ccr/committed/ccr.nodes.1.1.e7e9c9f0.010000
/var/mmfs/ccr/committed/clusterEvents.8.11.963fe8ed.010231
/var/mmfs/ccr/committed/clusterEvents.8.12.963fe8ed.010232.bad.24040.4169168640.2018-02-26_16:57:39.226+0100
/var/mmfs/ccr/committed/genKeyData.5.1.a043b58e.010004
/var/mmfs/ccr/committed/mmLockFileDB.4.1.ffffffff.010003
/var/mmfs/ccr/committed/ccr.disks.2.1.ffffffff.010001
/var/mmfs/ccr/committed/mmsdrfs.7.10.e29fc7cd.010226
/var/mmfs/ccr/committed/clusterEvents.8.12.963fe8ed.010232.bad.22517.4083746624.2018-02-26_16:57:02.857+0100
/var/mmfs/ccr/committed/genKeyDataNew.6.1.a043b58e.010005
/var/mmfs/ccr/committed/genKeyDataNew.6.2.94f88a51.01010f
/var/mmfs/ccr/committed/clusterEvents.8.12.963fe8ed.010232.bad.27281.1088599808.2018-02-26_16:59:59.681+0100
/var/mmfs/ccr/committed/mmsdrfs.7.11.bf35437.01022a
/var/mmfs/ccr/ccr.disks
/var/mmfs/ccr/cached/
/var/mmfs/ccr/cached/ccr.paxos
/var/mmfs/ccr/ccr.nodes
/var/mmfs/ccr/ccr.paxos.2
Note: This example only shows the output for the first quorum node. The command must be executed on all quorum nodes.
- Create temporary directories to store the collected CCR state files.
Create two subdirectories inside the CCRtemp directory to collect the committed files from all quorum nodes. The committed subdirectory is the final directory: it keeps the intact files that are used in the final step to copy back the patched Paxos state. The committedTemp subdirectory is the intermediate directory: it keeps only the files from the quorum node that is currently being processed.
[root@node-11 ~]# mkdir -p /root/CCRtemp/committed /root/CCRtemp/committedTemp
[root@node-11 ~]# cd /root/CCRtemp/
- Copy the /var/mmfs/ccr/ccr.paxos.1 and /var/mmfs/ccr/ccr.paxos.2 files from every quorum node in the cluster to the current temporary directory, /root/CCRtemp, using the following commands:
[root@node-11 CCRtemp]# scp root@node-11:/var/mmfs/ccr/ccr.paxos.1 ./ccr.paxos.1.node-11
ccr.paxos.1 100% 4096 4.0KB/s 00:00
[root@node-11 CCRtemp]# scp root@node-11:/var/mmfs/ccr/ccr.paxos.2 ./ccr.paxos.2.node-11
ccr.paxos.2 100% 4096 4.0KB/s 00:00
Note: You can see the directory structure by using the following command:
[root@node-11 CCRtemp]# ls -al
The command gives output similar to the following:
total 40
drwxr-xr-x 4 root root 4096 Feb 26 17:10 .
dr-xr-x---. 4 root root 4096 Feb 26 17:07 ..
-rw------- 1 root root 4096 Feb 26 17:09 ccr.paxos.1.node-11
-rw------- 1 root root 4096 Feb 26 17:10 ccr.paxos.1.node-12
-rw------- 1 root root 4096 Feb 26 17:10 ccr.paxos.1.node-13
-rw------- 1 root root 4096 Feb 26 17:09 ccr.paxos.2.node-11
-rw------- 1 root root 4096 Feb 26 17:09 ccr.paxos.2.node-12
-rw------- 1 root root 4096 Feb 26 17:10 ccr.paxos.2.node-13
drwxr-xr-x 2 root root 4096 Feb 26 17:07 committed
drwxr-xr-x 2 root root 4096 Feb 26 17:07 committedTemp
- Switch to the committedTemp subdirectory, and copy the committed files from the first quorum node into it using the following command:
[root@node-11 committedTemp]# scp root@node-11:/var/mmfs/ccr/committed/* .
The command gives output similar to the following:
ccr.disks.2.1.ffffffff.010001 100% 0 0.0KB/s 00:00
ccr.nodes.1.1.e7e9c9f0.010000 100% 114 0.1KB/s 00:00
clusterEvents.8.11.963fe8ed.010231 100% 323 0.3KB/s 00:00
clusterEvents.8.12.963fe8ed.010232.bad.22517.4083746624.2018-02-26_16:57:02.857+0100 100% 0 0.0KB/s 00:00
clusterEvents.8.12.963fe8ed.010232.bad.24040.4169168640.2018-02-26_16:57:39.226+0100 100% 0 0.0KB/s 00:00
clusterEvents.8.12.963fe8ed.010232.bad.26086.4229216064.2018-02-26_16:59:25.250+0100 100% 0 0.0KB/s 00:00
clusterEvents.8.12.963fe8ed.010232.bad.27281.1088599808.2018-02-26_16:59:59.681+0100 100% 0 0.0KB/s 00:00
genKeyData.5.1.a043b58e.010004 100% 3531 3.5KB/s 00:00
genKeyDataNew.6.1.a043b58e.010005 100% 3531 3.5KB/s 00:00
genKeyDataNew.6.2.94f88a51.01010f 100% 3531 3.5KB/s 00:00
mmLockFileDB.4.1.ffffffff.010003 100% 0 0.0KB/s 00:00
mmsdrfs.7.10.e29fc7cd.010226 100% 4793 4.7KB/s 00:00
mmsdrfs.7.11.bf35437.01022a 100% 5395 5.3KB/s 00:00
mmsysmon.json.3.1.cee097c7.010002 100% 38 0.0KB/s 00:00
Note: You can see the directory structure by using the following command:
[root@node-11 committedTemp]# ls -al
The command gives output similar to the following:
total 48
drwxr-xr-x 2 root root 4096 Feb 26 17:12 .
drwxr-xr-x 4 root root 4096 Feb 26 17:10 ..
-rw-r--r-- 1 root root 0 Feb 26 17:12 ccr.disks.2.1.ffffffff.010001
-rw-r--r-- 1 root root 114 Feb 26 17:12 ccr.nodes.1.1.e7e9c9f0.010000
-rw-r--r-- 1 root root 323 Feb 26 17:12 clusterEvents.8.11.963fe8ed.010231
-rw-r--r-- 1 root root 0 Feb 26 17:12 clusterEvents.8.12.963fe8ed.010232.bad.22517.4083746624.2018-02-26_16:57:02.857+0100
-rw-r--r-- 1 root root 0 Feb 26 17:12 clusterEvents.8.12.963fe8ed.010232.bad.24040.4169168640.2018-02-26_16:57:39.226+0100
-rw-r--r-- 1 root root 0 Feb 26 17:12 clusterEvents.8.12.963fe8ed.010232.bad.26086.4229216064.2018-02-26_16:59:25.250+0100
-rw-r--r-- 1 root root 0 Feb 26 17:12 clusterEvents.8.12.963fe8ed.010232.bad.27281.1088599808.2018-02-26_16:59:59.681+0100
-rw------- 1 root root 3531 Feb 26 17:12 genKeyData.5.1.a043b58e.010004
-rw------- 1 root root 3531 Feb 26 17:12 genKeyDataNew.6.1.a043b58e.010005
-rw------- 1 root root 3531 Feb 26 17:12 genKeyDataNew.6.2.94f88a51.01010f
-rw-r--r-- 1 root root 0 Feb 26 17:12 mmLockFileDB.4.1.ffffffff.010003
-rw-r--r-- 1 root root 4793 Feb 26 17:12 mmsdrfs.7.10.e29fc7cd.010226
-rw-r--r-- 1 root root 5395 Feb 26 17:12 mmsdrfs.7.11.bf35437.01022a
-rw-r--r-- 1 root root 38 Feb 26 17:12 mmsysmon.json.3.1.cee097c7.010002
- Verify the CRC of the files copied from the first quorum node using the following command:
[root@node-11 committedTemp]# cksum * | awk '{ printf "%x %s\n", $1, $3 }'
The command gives output similar to the following:
ffffffff ccr.disks.2.1.ffffffff.010001
e7e9c9f0 ccr.nodes.1.1.e7e9c9f0.010000
963fe8ed clusterEvents.8.11.963fe8ed.010231
ffffffff clusterEvents.8.12.963fe8ed.010232.bad.22517.4083746624.2018-02-26_16:57:02.857+0100
ffffffff clusterEvents.8.12.963fe8ed.010232.bad.24040.4169168640.2018-02-26_16:57:39.226+0100
ffffffff clusterEvents.8.12.963fe8ed.010232.bad.26086.4229216064.2018-02-26_16:59:25.250+0100
ffffffff clusterEvents.8.12.963fe8ed.010232.bad.27281.1088599808.2018-02-26_16:59:59.681+0100
a043b58e genKeyData.5.1.a043b58e.010004
a043b58e genKeyDataNew.6.1.a043b58e.010005
94f88a51 genKeyDataNew.6.2.94f88a51.01010f
ffffffff mmLockFileDB.4.1.ffffffff.010003
e29fc7cd mmsdrfs.7.10.e29fc7cd.010226
bf35437 mmsdrfs.7.11.bf35437.01022a
cee097c7 mmsysmon.json.3.1.cee097c7.010002
Faulty files are identified by a mismatch between the computed CRC and the hexadecimal CRC embedded in the file name.
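This comparison can be scripted: each committed file name embeds its expected CRC as a hexadecimal field, so comparing it with the output of cksum flags the damaged copies automatically. A sketch, assuming the CRC is the second-to-last dot-separated field of the name (as in the listings above; the renamed *.bad.* copies do not follow that pattern and simply show up as mismatches):

```shell
#!/bin/sh
# List committed files whose content CRC (from cksum, printed in hex)
# differs from the CRC field embedded in the file name.
# Naming assumption: name.id.version.crc.uid - the CRC is the
# second-to-last dot-separated field of the file name.
check_committed() {
    dir=$1
    for f in "$dir"/*; do
        name=$(basename "$f")
        want=$(printf '%s\n' "$name" | awk -F. '{ print $(NF-1) }')
        have=$(cksum "$f" | awk '{ printf "%x", $1 }')
        [ "$want" = "$have" ] || echo "MISMATCH $name (name: $want, content: $have)"
    done
}
```

For example, check_committed /root/CCRtemp/committedTemp prints one MISMATCH line per faulty file. Note that a truncated, empty file always computes to ffffffff, the cksum of empty input, which is why the bad clusterEvents copies above all show that value.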
- Delete the files with a mismatching CRC using the following command:
[root@node-11 committedTemp]# rm clusterEvents.8.12.963fe8ed.010232.bad.2*
The command gives output similar to the following:
rm: remove regular empty file ‘clusterEvents.8.12.963fe8ed.010232.bad.22517.4083746624.2018-02-26_16:57:02.857+0100’? y
rm: remove regular empty file ‘clusterEvents.8.12.963fe8ed.010232.bad.24040.4169168640.2018-02-26_16:57:39.226+0100’? y
rm: remove regular empty file ‘clusterEvents.8.12.963fe8ed.010232.bad.26086.4229216064.2018-02-26_16:59:25.250+0100’? y
rm: remove regular empty file ‘clusterEvents.8.12.963fe8ed.010232.bad.27281.1088599808.2018-02-26_16:59:59.681+0100’? y
- Move the remaining files into the committed subdirectory using the following command:
[root@node-11 committedTemp]# mv -i * ../committed
- Copy the committed files from the next quorum node into the committedTemp directory using the following command:
[root@node-11 committedTemp]# scp root@node-12:/var/mmfs/ccr/committed/* .
The command gives output similar to the following:
ccr.disks.2.1.ffffffff.010001 100% 0 0.0KB/s 00:00
ccr.nodes.1.1.e7e9c9f0.010000 100% 114 0.1KB/s 00:00
clusterEvents.8.11.963fe8ed.010231 100% 323 0.3KB/s 00:00
clusterEvents.8.12.963fe8ed.010232.bad.18737.3245463360.2018-02-26_16:57:07.695+0100 100% 0 0.0KB/s 00:00
clusterEvents.8.12.963fe8ed.010232.bad.19994.3932075776.2018-02-26_16:57:45.020+0100 100% 0 0.0KB/s 00:00
clusterEvents.8.12.963fe8ed.010232.bad.21275.3060160320.2018-02-26_16:59:33.687+0100 100% 0 0.0KB/s 00:00
clusterEvents.8.12.963fe8ed.010232.bad.22112.354830080.2018-02-26_16:59:59.467+0100 100% 0 0.0KB/s 00:00
genKeyData.5.1.a043b58e.010004 100% 3531 3.5KB/s 00:00
genKeyDataNew.6.1.a043b58e.010005 100% 3531 3.5KB/s 00:00
genKeyDataNew.6.2.94f88a51.01010f 100% 3531 3.5KB/s 00:00
mmLockFileDB.4.1.ffffffff.010003 100% 0 0.0KB/s 00:00
mmsdrfs.7.10.e29fc7cd.010226 100% 4793 4.7KB/s 00:00
mmsdrfs.7.11.bf35437.01022a 100% 5395 5.3KB/s 00:00
mmsysmon.json.3.1.cee097c7.010002 100% 38 0.0KB/s 00:00
- Verify the CRC of the files using the following command:
[root@node-11 committedTemp]# cksum * | awk '{ printf "%x %s\n", $1, $3 }'
The command gives output similar to the following:
ffffffff ccr.disks.2.1.ffffffff.010001
e7e9c9f0 ccr.nodes.1.1.e7e9c9f0.010000
963fe8ed clusterEvents.8.11.963fe8ed.010231
ffffffff clusterEvents.8.12.963fe8ed.010232.bad.18737.3245463360.2018-02-26_16:57:07.695+0100
ffffffff clusterEvents.8.12.963fe8ed.010232.bad.19994.3932075776.2018-02-26_16:57:45.020+0100
ffffffff clusterEvents.8.12.963fe8ed.010232.bad.21275.3060160320.2018-02-26_16:59:33.687+0100
ffffffff clusterEvents.8.12.963fe8ed.010232.bad.22112.354830080.2018-02-26_16:59:59.467+0100
a043b58e genKeyData.5.1.a043b58e.010004
a043b58e genKeyDataNew.6.1.a043b58e.010005
94f88a51 genKeyDataNew.6.2.94f88a51.01010f
ffffffff mmLockFileDB.4.1.ffffffff.010003
e29fc7cd mmsdrfs.7.10.e29fc7cd.010226
bf35437 mmsdrfs.7.11.bf35437.01022a
cee097c7 mmsysmon.json.3.1.cee097c7.010002
- Delete the files with a mismatching CRC using the following command:
[root@node-11 committedTemp]# rm clusterEvents.8.12.963fe8ed.010232.bad.*
The command gives output similar to the following:
rm: remove regular empty file ‘clusterEvents.8.12.963fe8ed.010232.bad.18737.3245463360.2018-02-26_16:57:07.695+0100’? y
rm: remove regular empty file ‘clusterEvents.8.12.963fe8ed.010232.bad.19994.3932075776.2018-02-26_16:57:45.020+0100’? y
rm: remove regular empty file ‘clusterEvents.8.12.963fe8ed.010232.bad.21275.3060160320.2018-02-26_16:59:33.687+0100’? y
rm: remove regular empty file ‘clusterEvents.8.12.963fe8ed.010232.bad.22112.354830080.2018-02-26_16:59:59.467+0100’? y
- Copy the remaining files into the committed subdirectory using the cp command with the -i option.
Note: Answer n to each prompt for a file that already exists in the committed subdirectory. This ensures that only the files that do not already exist are copied to the committed subdirectory.
The command gives output similar to the following:
[root@node-11 committedTemp]# cp -i * ../committed
cp: overwrite ‘../committed/ccr.disks.2.1.ffffffff.010001’? n
cp: overwrite ‘../committed/ccr.nodes.1.1.e7e9c9f0.010000’? n
cp: overwrite ‘../committed/clusterEvents.8.11.963fe8ed.010231’? n
cp: overwrite ‘../committed/genKeyData.5.1.a043b58e.010004’? n
cp: overwrite ‘../committed/genKeyDataNew.6.1.a043b58e.010005’? n
cp: overwrite ‘../committed/genKeyDataNew.6.2.94f88a51.01010f’? n
cp: overwrite ‘../committed/mmLockFileDB.4.1.ffffffff.010003’? n
cp: overwrite ‘../committed/mmsdrfs.7.10.e29fc7cd.010226’? n
cp: overwrite ‘../committed/mmsdrfs.7.11.bf35437.01022a’? n
cp: overwrite ‘../committed/mmsysmon.json.3.1.cee097c7.010002’? n
Remove the files in the committedTemp subdirectory using the following command:
[root@node-11 committedTemp]# rm *
The command gives output similar to the following:
rm: remove regular empty file ‘ccr.disks.2.1.ffffffff.010001’? y
rm: remove regular file ‘ccr.nodes.1.1.e7e9c9f0.010000’? y
rm: remove regular file ‘clusterEvents.8.11.963fe8ed.010231’? y
rm: remove regular file ‘genKeyData.5.1.a043b58e.010004’? y
rm: remove regular file ‘genKeyDataNew.6.1.a043b58e.010005’? y
rm: remove regular file ‘genKeyDataNew.6.2.94f88a51.01010f’? y
rm: remove regular empty file ‘mmLockFileDB.4.1.ffffffff.010003’? y
rm: remove regular file ‘mmsdrfs.7.10.e29fc7cd.010226’? y
rm: remove regular file ‘mmsdrfs.7.11.bf35437.01022a’? y
rm: remove regular file ‘mmsysmon.json.3.1.cee097c7.010002’? y
Note: This step prepares the committedTemp subdirectory for the files from the next quorum node, if any.
- Repeat steps 10 to 14 for all remaining available quorum nodes. The /root/CCRtemp/committed directory now contains all the intact files from all quorum nodes, and it can be used to patch the CCR Paxos state.
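The merge half of the per-node cycle above can be condensed into a helper. A sketch with hypothetical directory arguments: it skips copies carrying the ".bad." corruption marker and never overwrites an existing destination file, which is exactly what answering n to every cp -i prompt achieves (files that fail the CRC check must still be deleted first, since only the copies CCR has already renamed carry the marker):

```shell
#!/bin/sh
# Merge one node's fetched committed files into the final directory:
# skip anything with the ".bad." corruption marker in its name, and
# never overwrite a file that is already in the destination (the
# non-interactive equivalent of answering "n" to every cp -i prompt).
merge_committed() {
    src=$1
    dest=$2
    for f in "$src"/*; do
        name=$(basename "$f")
        case $name in
            *.bad.*) continue ;;   # known-corrupt copy, ignore
        esac
        [ -e "$dest/$name" ] || cp "$f" "$dest/$name"
    done
}
```

For example, merge_committed /root/CCRtemp/committedTemp /root/CCRtemp/committed would be run once per quorum node after that node's files have been fetched and CRC-checked.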
- Change back to the parent directory, and determine the most recent Paxos state from the Paxos state files in this directory by using the mmccr readpaxos command:
The command gives output similar to the following:
[root@node-11 CCRtemp]# mmccr readpaxos ccr.paxos.1.node-11 | grep seq
dblk: seq 53, mbal (0.0), bal (0.0), inp ((n0,e0),0):(none):-1:None, leaderChallengeVersion 0
[root@node-11 CCRtemp]# mmccr readpaxos ccr.paxos.2.node-11 | grep seq
dblk: seq 52, mbal (1.1), bal (1.1), inp ((n0,e0),0):lu:3:[1,23333], leaderChallengeVersion 0
[root@node-11 CCRtemp]# mmccr readpaxos ccr.paxos.1.node-12 | grep seq
dblk: seq 53, mbal (0.0), bal (0.0), inp ((n0,e0),0):(none):-1:None, leaderChallengeVersion 0
[root@node-11 CCRtemp]# mmccr readpaxos ccr.paxos.2.node-12 | grep seq
dblk: seq 52, mbal (1.1), bal (1.1), inp ((n0,e0),0):lu:3:[1,23333], leaderChallengeVersion 0
[root@node-11 CCRtemp]# mmccr readpaxos ccr.paxos.1.node-13 | grep seq
dblk: seq 53, mbal (0.0), bal (0.0), inp ((n0,e0),0):(none):-1:None, leaderChallengeVersion 0
[root@node-11 CCRtemp]# mmccr readpaxos ccr.paxos.2.node-13 | grep seq
dblk: seq 52, mbal (1.1), bal (1.1), inp ((n0,e0),0):lu:3:[1,23333], leaderChallengeVersion 0
The CCR keeps two Paxos state files in its /var/mmfs/ccr directory, ccr.paxos.1 and ccr.paxos.2, and writes to them alternately. Maintaining two copies ensures that one copy stays intact if a write to the other file fails and corrupts it. The Paxos state file with the highest sequence number is the most recent one; use that file in the patching step that follows.
In the example above, the ccr.paxos.1.node-11 file is the most recent one, with sequence number 53. In a multi-node cluster, the quorum nodes do not necessarily have the same set of sequence numbers, depending on how many updates the CCR has seen on each node.
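Selecting the most recent state file can also be scripted. The sketch below parses readpaxos-style "seq" lines; the here-document carries sample values from the example above, standing in for real `mmccr readpaxos FILE | grep seq` calls:

```shell
#!/bin/sh
# Pick the Paxos state file with the highest "seq" number from lines of
# the form "file dblk: seq N, ...". The here-document holds sample data
# from the example; in practice each line would be produced by running
# "mmccr readpaxos $file | grep seq" for every collected state file.
best=""
best_seq=-1
while read -r file rest; do
    s=$(printf '%s\n' "$rest" | sed -n 's/.*seq \([0-9][0-9]*\),.*/\1/p')
    if [ "$s" -gt "$best_seq" ]; then
        best_seq=$s
        best=$file
    fi
done <<'EOF'
ccr.paxos.1.node-11 dblk: seq 53, mbal (0.0), bal (0.0)
ccr.paxos.2.node-11 dblk: seq 52, mbal (1.1), bal (1.1)
ccr.paxos.1.node-12 dblk: seq 53, mbal (0.0), bal (0.0)
ccr.paxos.2.node-12 dblk: seq 52, mbal (1.1), bal (1.1)
EOF
echo "most recent: $best (seq $best_seq)"
```

On a tie (several files share the highest sequence number, as here), any of them can serve as the patching input; the sketch keeps the first one seen.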
The ccr.paxos.1.node-11 file acts as the input for the patching step. Invoke the mmccr patchpaxos command in the current CCR temp directory with three parameters: the path to the most recent CCR Paxos state file, the path to the directory of intact CCR files collected in the previous steps, and the path of the patched Paxos state file that the command creates:
The command gives output similar to the following:
[root@node-11 CCRtemp]# mmccr patchpaxos ./ccr.paxos.1.node-11 ./committed/ ./myPatched_ccr.paxos.1
Committed state found in ./ccr.paxos.1.node-11:
config: minNodes: 1 version 0
nodes: [(N1,S0,V0,L1), (N2,S1,V0,L1), (N3,S2,V0,L1)]
disks: []
leader: id 1 version 3
updates: horizon -1 {(n1,e0): 5, (n1,e1): 33, (n1,e2): 50}
values: 1, max deleted version 9
mmRunningCommand = version 3 ""
files: 8, max deleted version 0
1 = version 1 uid ((n1,e0),0) crc E7E9C9F0
2 = version 1 uid ((n1,e0),1) crc FFFFFFFF
3 = version 1 uid ((n1,e0),2) crc CEE097C7
4 = version 1 uid ((n1,e0),3) crc FFFFFFFF
5 = version 1 uid ((n1,e0),4) crc A043B58E
6 = version 2 uid ((n1,e1),15) crc 94F88A51
7 = version 11 uid ((n1,e2),42) crc 0BF35437
8 = version 12 uid ((n1,e2),50) crc 963FE8ED
Comparing to content of './committed/':
match file: name: 'ccr.nodes' suffix: '1.1.e7e9c9f0.010000' id: 1 version: 1 crc: e7e9c9f0 uid: ((n1,e0),0) and file list entry: 1.1.e7e9c9f0.010000
match file: name: 'ccr.disks' suffix: '2.1.ffffffff.010001' id: 2 version: 1 crc: ffffffff uid: ((n1,e0),1) and file list entry: 2.1.ffffffff.010001
match file: name: 'mmsysmon.json' suffix: '3.1.cee097c7.010002' id: 3 version: 1 crc: cee097c7 uid: ((n1,e0),2) and file list entry: 3.1.cee097c7.010002
match file: name: 'mmLockFileDB' suffix: '4.1.ffffffff.010003' id: 4 version: 1 crc: ffffffff uid: ((n1,e0),3) and file list entry: 4.1.ffffffff.010003
match file: name: 'genKeyData' suffix: '5.1.a043b58e.010004' id: 5 version: 1 crc: a043b58e uid: ((n1,e0),4) and file list entry: 5.1.a043b58e.010004
match file: name: 'genKeyDataNew' suffix: '6.2.94f88a51.01010f' id: 6 version: 2 crc: 94f88a51 uid: ((n1,e1),15) and file list entry: 6.2.94f88a51.01010f
match file: name: 'mmsdrfs' suffix: '7.11.bf35437.01022a' id: 7 version: 11 crc: bf35437 uid: ((n1,e2),42) and file list entry: 7.11.bf35437.01022a
older: name: 'clusterEvents' suffix: '8.11.963fe8ed.010231' id: 8 version: 11 crc: 963fe8ed uid: ((n1,e2),49)
*** reverting committed file list version 12 uid ((n1,e2),50)
Found 7 matching, 0 deleted, 0 added, 0 updated, 1 reverted, 0 reset
Verifying update history
Writing 1 changes to ./myPatched_ccr.paxos.1
config: minNodes: 1 version 0
nodes: [(N1,S0,V0,L1), (N2,S1,V0,L1), (N3,S2,V0,L1)]
disks: []
leader: id 1 version 3
updates: horizon -1 {(n1,e0): 5, (n1,e1): 33, (n1,e2): 50}
values: 1, max deleted version 9
mmRunningCommand = version 3 ""
files: 8, max deleted version 0
1 = version 1 uid ((n1,e0),0) crc E7E9C9F0
2 = version 1 uid ((n1,e0),1) crc FFFFFFFF
3 = version 1 uid ((n1,e0),2) crc CEE097C7
4 = version 1 uid ((n1,e0),3) crc FFFFFFFF
5 = version 1 uid ((n1,e0),4) crc A043B58E
6 = version 2 uid ((n1,e1),15) crc 94F88A51
7 = version 11 uid ((n1,e2),42) crc 0BF35437
8 = version 11 uid ((n1,e2),49) crc 963FE8ED
- Copy the patched CCR Paxos state file and the files in the committed directory back to the appropriate directories on every quorum node using the following commands:
The commands give output similar to the following:
[root@node-11 CCRtemp]# scp myPatched_ccr.paxos.1 root@node-11:/var/mmfs/ccr/ccr.paxos.1
myPatched_ccr.paxos.1 100% 160 0.2KB/s 00:00
[root@node-11 CCRtemp]# scp myPatched_ccr.paxos.1 root@node-11:/var/mmfs/ccr/ccr.paxos.2
myPatched_ccr.paxos.1 100% 160 0.2KB/s 00:00
[root@node-11 CCRtemp]# scp myPatched_ccr.paxos.1 root@node-12:/var/mmfs/ccr/ccr.paxos.1
myPatched_ccr.paxos.1 100% 160 0.2KB/s 00:00
[root@node-11 CCRtemp]# scp myPatched_ccr.paxos.1 root@node-12:/var/mmfs/ccr/ccr.paxos.2
myPatched_ccr.paxos.1 100% 160 0.2KB/s 00:00
[root@node-11 CCRtemp]# scp myPatched_ccr.paxos.1 root@node-13:/var/mmfs/ccr/ccr.paxos.1
myPatched_ccr.paxos.1 100% 160 0.2KB/s 00:00
[root@node-11 CCRtemp]# scp myPatched_ccr.paxos.1 root@node-13:/var/mmfs/ccr/ccr.paxos.2
myPatched_ccr.paxos.1 100% 160 0.2KB/s 00:00
[root@node-11 CCRtemp]# scp ./committed/* root@node-11:/var/mmfs/ccr/committed/
ccr.disks.2.1.ffffffff.010001 100% 0 0.0KB/s 00:00
ccr.nodes.1.1.e7e9c9f0.010000 100% 114 0.1KB/s 00:00
clusterEvents.8.11.963fe8ed.010231 100% 323 0.3KB/s 00:00
genKeyData.5.1.a043b58e.010004 100% 3531 3.5KB/s 00:00
genKeyDataNew.6.1.a043b58e.010005 100% 3531 3.5KB/s 00:00
genKeyDataNew.6.2.94f88a51.01010f 100% 3531 3.5KB/s 00:00
mmLockFileDB.4.1.ffffffff.010003 100% 0 0.0KB/s 00:00
mmsdrfs.7.10.e29fc7cd.010226 100% 4793 4.7KB/s 00:00
mmsdrfs.7.11.bf35437.01022a 100% 5395 5.3KB/s 00:00
mmsysmon.json.3.1.cee097c7.010002 100% 38 0.0KB/s 00:00
[root@node-11 CCRtemp]# scp ./committed/* root@node-12:/var/mmfs/ccr/committed/
ccr.disks.2.1.ffffffff.010001 100% 0 0.0KB/s 00:00
ccr.nodes.1.1.e7e9c9f0.010000 100% 114 0.1KB/s 00:00
clusterEvents.8.11.963fe8ed.010231 100% 323 0.3KB/s 00:00
genKeyData.5.1.a043b58e.010004 100% 3531 3.5KB/s 00:00
genKeyDataNew.6.1.a043b58e.010005 100% 3531 3.5KB/s 00:00
genKeyDataNew.6.2.94f88a51.01010f 100% 3531 3.5KB/s 00:00
mmLockFileDB.4.1.ffffffff.010003 100% 0 0.0KB/s 00:00
mmsdrfs.7.10.e29fc7cd.010226 100% 4793 4.7KB/s 00:00
mmsdrfs.7.11.bf35437.01022a 100% 5395 5.3KB/s 00:00
mmsysmon.json.3.1.cee097c7.010002 100% 38 0.0KB/s 00:00
[root@node-11 CCRtemp]# scp ./committed/* root@node-13:/var/mmfs/ccr/committed/
ccr.disks.2.1.ffffffff.010001 100% 0 0.0KB/s 00:00
ccr.nodes.1.1.e7e9c9f0.010000 100% 114 0.1KB/s 00:00
clusterEvents.8.11.963fe8ed.010231 100% 323 0.3KB/s 00:00
genKeyData.5.1.a043b58e.010004 100% 3531 3.5KB/s 00:00
genKeyDataNew.6.1.a043b58e.010005 100% 3531 3.5KB/s 00:00
genKeyDataNew.6.2.94f88a51.01010f 100% 3531 3.5KB/s 00:00
mmLockFileDB.4.1.ffffffff.010003 100% 0 0.0KB/s 00:00
mmsdrfs.7.10.e29fc7cd.010226 100% 4793 4.7KB/s 00:00
mmsdrfs.7.11.bf35437.01022a 100% 5395 5.3KB/s 00:00
mmsysmon.json.3.1.cee097c7.010002 100% 38 0.0KB/s 00:00
- Start the mmsdrserv daemon and the monitor scripts that were stopped previously:
[root@node-11 ~]# mmdsh -N quorumnodes "mmcommon startCcrMonitor"
- Verify that the mmsdrserv daemon and its monitor script have restarted using the following command:
The command gives output similar to the following:
[root@node-11 ~]# mmdsh -N quorumnodes "ps -C mmfsd,mmccrmonitor,mmsdrserv"
node-11.localnet.com: PID TTY TIME CMD
node-11.localnet.com: 3518 ? 00:00:00 mmccrmonitor
node-11.localnet.com: 3734 ? 00:00:00 mmsdrserv
node-11.localnet.com: 3816 ? 00:00:00 mmccrmonitor
node-12.localnet.com: PID TTY TIME CMD
node-12.localnet.com: 30356 ? 00:00:00 mmccrmonitor
node-12.localnet.com: 30572 ? 00:00:00 mmsdrserv
node-12.localnet.com: 30648 ? 00:00:00 mmccrmonitor
node-13.localnet.com: PID TTY TIME CMD
node-13.localnet.com: 738 ? 00:00:00 mmccrmonitor
node-13.localnet.com: 958 ? 00:00:00 mmsdrserv
node-13.localnet.com: 1040 ? 00:00:00 mmccrmonitor
The mmccr check command now succeeds, giving output similar to the following:
[root@node-11 ~]# mmdsh -N quorumnodes "mmccr check -Y -e" | grep "mmdsh\|FC_COMMITTED_DIR"
node-12.localnet.com: mmccr::0:1:::2:FC_COMMITTED_DIR:0::0:8:OK:
node-11.localnet.com: mmccr::0:1:::1:FC_COMMITTED_DIR:0::0:8:OK:
node-13.localnet.com: mmccr::0:1:::3:FC_COMMITTED_DIR:0::0:8:OK:
The mm commands work again; however, the cluster is still down:
[root@node-11 ~]# mmgetstate -a
Node number Node name GPFS state
-------------------------------------------
1 node-11 down
2 node-12 down
3 node-13 down
4 node-14 down
5 node-15 down
[root@node-11 ~]# mmlscluster
GPFS cluster information
========================
GPFS cluster name: gpfs-cluster-1.localnet.com
GPFS cluster id: 317908494312547875
GPFS UID domain: localnet.com
Remote shell command: /usr/bin/ssh
Remote file copy command: /usr/bin/scp
Repository type: CCR
GPFS cluster configuration servers:
-----------------------------------
Primary server: node-11.localnet.com (not in use)
Secondary server: (none)
Node Daemon node name IP address Admin node name Designation
----------------------------------------------------------------------------
1 node-11.localnet.com 10.0.100.11 node-11.localnet.com quorum
2 node-12.localnet.com 10.0.100.12 node-12.localnet.com quorum
3 node-13.localnet.com 10.0.100.13 node-13.localnet.com quorum
4 node-14.localnet.com 10.0.100.14 node-14.localnet.com
5 node-15.localnet.com 10.0.100.15 node-15.localnet.com
- Start GPFS on all nodes and bring the cluster up again using the following command:
The command gives output similar to the following:
[root@node-11 ~]# mmstartup -a
Mon Feb 26 18:04:05 CET 2018: mmstartup: Starting GPFS ...
[root@node-11 ~]# mmgetstate -a
Node number Node name GPFS state
-------------------------------------------
1 node-11 active
2 node-12 active
3 node-13 active
4 node-14 active
5 node-15 active
The master copy of the GPFS configuration file (mmsdrfs) might have been corrupted. The CCR patch command rolls a corrupted file back to the latest available intact version, which means that for the mmsdrfs file, the configuration changes made between the last intact version and the corrupted one are lost. If GPFS shows further errors in the administration log during startup, it might be necessary to reboot the quorum nodes to clean up the cached memory and all drivers.