
Steps to restore an I/O node
If an I/O node fails due to a hardware or OS problem, and the OS is no longer accessible, you must restore the node using the existing configuration settings stored in xCAT, which typically resides on the EMS node.
- Disable the GPFS auto load using the mmchconfig command. Note: This prevents GPFS from restarting automatically upon reboot.
[ems]# mmlsconfig autoload
autoload yes
[ems]# mmchconfig autoload=no
[ems]# mmlsconfig autoload
autoload no
- List the recovery groups using the mmlsrecoverygroup command to verify that the replacement node is not currently an active recovery group server.
[ems1]# mmlsrecoverygroup

 recovery group       vdisks      vdisks  servers
 ------------------  -----------  ------  -------
 rg_gssio1                     3      18  gssio1,gssio2
 rg_gssio2                     3      18  gssio2,gssio1

List the current active recovery group server for each recovery group.

[ems1]# mmlsrecoverygroup rg_gssio1 -L | grep "active recovery" -A2

 active recovery group server                     servers
 -----------------------------------------------  -------
 gssio1                                           gssio1,gssio2

[ems1]# mmlsrecoverygroup rg_gssio2 -L | grep "active recovery" -A2

 active recovery group server                     servers
 -----------------------------------------------  -------
 gssio2                                           gssio2,gssio1

Note: If you are restoring gssio1, the active recovery group server for rg_gssio1 should be gssio2. If it is not set to gssio2, run the mmchrecoverygroup command to change it.

[ems1]# mmchrecoverygroup rg_gssio1 --servers <NEW PRIMARY NODE>,<OLD PRIMARY NODE>
[root@gssio1 ~]# mmchrecoverygroup rg_gssio1 --servers gssio2,gssio1
[ems1]# mmlsrecoverygroup rg_gssio1 -L | grep "active recovery" -A2

 active recovery group server                     servers
 -----------------------------------------------  -------
 gssio2                                           gssio1,gssio2

[ems1]# mmlsrecoverygroup rg_gssio2 -L | grep "active recovery" -A2

 active recovery group server                     servers
 -----------------------------------------------  -------
 gssio2                                           gssio2,gssio1
- Create a backup of the replacement node's network files.
[ems]# rm -rf /tmp/replacement_node_network_backup
[ems]# mkdir /tmp/replacement_node_network_backup
[ems]# scp <REPLACEMENT NODE>:/etc/sysconfig/network-scripts/ifcfg-* /tmp/replacement_node_network_backup/
[ems]# scp gssio2:/etc/sysconfig/network-scripts/ifcfg-* /tmp/replacement_node_network_backup/

Note: This step is optional and can be performed only while the replacement node is still accessible.
- Check for the RHEL images available for installation on the EMS node.
The RHEL image is needed in order to re-image the node that is being restored. The OS image should be located on the EMS node under the following directory:
[ems]# ls /tftpboot/xcat/osimage/
rhels7.3-ppc64-install-gss
- Configure the replacement node's boot state to install for the specified OS image.
[ems]# nodeset <REPLACEMENT NODE> osimage=<OS_ISO_image>
[root@ems1 ~]# nodeset gssio2 osimage=rhels7.3-ppc64-install-gss
gssio2: install rhels7.3-ppc64-gss
- Ensure that the remote console is properly configured on the EMS node.
[ems]# makeconservercf <REPLACEMENT NODE>
[root@ems1 ~]# makeconservercf gssio2
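Optionally, as an extra check that is not part of the original procedure, you can open the remote console with the xCAT rcons command to confirm that it is reachable before rebooting the node:

[ems]# rcons <REPLACEMENT NODE>
[root@ems1 ~]# rcons gssio2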
- Reboot the replacement node to initiate the installation process.
[ems]# rnetboot <REPLACEMENT NODE> -V
[root@ems1 ~]# rnetboot gssio2 -V
lpar_netboot Status: List only ent adapters
lpar_netboot Status: -v (verbose debug) flag detected
lpar_netboot Status: -i (force immediate shutdown) flag detected
lpar_netboot Status: -d (debug) flag detected
node:gssio2
Node is gssio2
...
# Network boot proceeding - matched BOOTP, exiting.
# Finished.
sending commands ~. to expect
gssio2: Success

Monitor the progress of the installation, and wait for the xcatpost, yum, and related scripts to finish.

[ems]# watch "nodestat <REPLACEMENT NODE>; echo; tail /var/log/consoles/<REPLACEMENT NODE>"
[root@ems1 ~]# watch "nodestat gssio2; echo; tail /var/log/consoles/gssio2"
gssio2: noping
...
gssio2: install rhels7.3-ppc64-gss
...
gssio2: sshd

[ems]# watch -n .5 "ssh <REPLACEMENT NODE> 'ps -eaf | grep -v grep' | egrep 'xcatpost|yum|rpm|vpd'"
[root@ems1 ~]# watch -n .5 "ssh gssio2 'ps -eaf | grep -v grep' | egrep 'xcatpost|yum|rpm|vpd'"

Note: Depending on what needs to be updated, the node might reboot one or more times. Wait until there is no process output before taking the next step.
- Verify that the upgrade files have been copied to the I/O node sync directory, /install/gss/sync/ppc64/.
[ems]# ssh <REPLACEMENT NODE> "ls /install/gss/sync/ppc64/"
[root@ems1]# ssh gssio2 "ls /install/gss/sync/ppc64/"
gssio2: mofed

Wait for the directory to sync. After the mofed directory is created, you can take the next step.
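If you prefer to poll for the mofed directory instead of rerunning the command by hand, a simple watch loop around the same ls command works; this is a convenience suggestion only, not part of the documented procedure:

[ems]# watch "ssh <REPLACEMENT NODE> 'ls /install/gss/sync/ppc64/'"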
- Copy the host files from the healthy node to the replacement node.
[ems]# scp /etc/hosts <REPLACEMENT NODE>:/etc/
[root@ems1 mofed]# scp /etc/hosts gssio2:/etc/
- Configure the network on the replacement node.
If you backed up the network files previously, you can copy them back to the node and then restart the node; an illustrative example follows this step. Before replacing the files, verify that the device names are consistent with the backed-up versions.
You can also apply the Red Hat updates not included in the xCAT image, if necessary.
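The following is a minimal sketch of restoring the backed-up interface files from the EMS node, assuming the /tmp/replacement_node_network_backup directory created earlier and unchanged device names; adjust the file names for your environment.

[ems]# scp /tmp/replacement_node_network_backup/ifcfg-* <REPLACEMENT NODE>:/etc/sysconfig/network-scripts/
[ems]# ssh <REPLACEMENT NODE> "systemctl reboot"

If Red Hat updates are also required and the node has access to a suitable repository, a generic update might look like:

[ems]# ssh <REPLACEMENT NODE> "yum -y update"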
- Rebuild the GPFS kernel extensions on the replacement node.
If the kernel patches were applied, it may be necessary to rebuild the GPFS portability layer by running the mmbuildgpl command.
[ems]# ssh <REPLACEMENT NODE> "/usr/lpp/mmfs/bin/mmbuildgpl"
[root@ems1 ~]# ssh gssio2 "/usr/lpp/mmfs/bin/mmbuildgpl"
--------------------------------------------------------
mmbuildgpl: Building GPL module begins at Wed Nov 8 17:18:21 EST 2017.
--------------------------------------------------------
Verifying Kernel Header...
  kernel version = 31000514 (3.10.0-514.28.1.el7.ppc64, 3.10.0-514.28.1)
  module include dir = /lib/modules/3.10.0-514.28.1.el7.ppc64/build/include
  module build dir = /lib/modules/3.10.0-514.28.1.el7.ppc64/build
  kernel source dir = /usr/src/linux-3.10.0-514.28.1.el7.ppc64/include
  Found valid kernel header file under /usr/src/kernels/3.10.0-514.28.1.el7.ppc64/include
Verifying Compiler...
  make is present at /bin/make
  cpp is present at /bin/cpp
  gcc is present at /bin/gcc
  g++ is present at /bin/g++
  ld is present at /bin/ld
Verifying Additional System Headers...
  Verifying kernel-headers is installed ...
    Command: /bin/rpm -q kernel-headers
    The required package kernel-headers is installed
make World ...
make InstallImages ...
--------------------------------------------------------
mmbuildgpl: Building GPL module completed successfully at Wed Nov 8 17:18:39 EST 2017.
- Restore the GPFS configuration from an existing healthy node in the cluster.
[ems]# ssh <REPLACEMENT NODE> "/usr/lpp/mmfs/bin/mmsdrrestore -p <GOOD NODE>"
[root@ems ~]# ssh gssio2 "/usr/lpp/mmfs/bin/mmsdrrestore -p ems1"
mmsdrrestore: Processing node gssio1
mmsdrrestore: Node gssio1 successfully restored.

Note: This command is executed on the replacement node, and the -p option specifies an existing healthy node.
- Start GPFS on the recovered node, and re-enable the GPFS auto load.
- Before starting GPFS, verify that the replacement node is still in the DOWN state.
[ems]# mmgetstate -aL

 Node number  Node name  Quorum  Nodes up  Total nodes  GPFS state   Remarks
------------------------------------------------------------------------------------
      1       gssio1        2        2          5       active       quorum node
      2       gssio2        0        0          5       down         quorum node
      3       ems1          2        2          5       active       quorum node
      4       gsscomp1      2        2          5       active
      5       gsscomp2      2        2          5       active
- Start GPFS on the replacement node.
[ems]# mmstartup -N <REPLACEMENT NODE>
mmstartup: Starting GPFS ...
- Verify that the replacement node is active.
[ems]# mmgetstate -aL

 Node number  Node name  Quorum  Nodes up  Total nodes  GPFS state   Remarks
------------------------------------------------------------------------------------
      1       gssio1        2        3          5       active       quorum node
      2       gssio2        2        3          5       active       quorum node
      3       ems1          2        3          5       active       quorum node
      4       gsscomp1      2        3          5       active
      5       gsscomp2      2        3          5       active
- Ensure that all the file systems are mounted on the replacement node.
[ems]# mmmount all -N <REPLACEMENT NODE>
[ems]# mmlsmount all -L
- Re-enable the GPFS auto load.
[ems]# mmlsconfig autoload
autoload no
[ems]# mmchconfig autoload=yes
mmchconfig: Command successfully completed
[ems]# mmlsconfig autoload
autoload yes
- Verify that the recovered node is now the active recovery group server for its recovery group.
[ems1]# mmlsrecoverygroup

 recovery group       vdisks      vdisks  servers
 ------------------  -----------  ------  -------
 rg_gssio1                     3      18  gssio1,gssio2
 rg_gssio2                     3      18  gssio2,gssio1

View the active node for each recovery group.

[ems1]# mmlsrecoverygroup rg_gssio1 -L | grep "active recovery" -A2

 active recovery group server                     servers
 -----------------------------------------------  -------
 gssio1                                           gssio1,gssio2

[ems1]# mmlsrecoverygroup rg_gssio2 -L | grep "active recovery" -A2

 active recovery group server                     servers
 -----------------------------------------------  -------
 gssio2                                           gssio2,gssio1

The recovered node, gssio1, should have taken over its recovery group automatically. If it did not, manually set it as the active recovery group server for its recovery group.

[ems1]# mmchrecoverygroup rg_gssio1 --servers <NEW PRIMARY NODE>,<OLD PRIMARY NODE>
[root@gssio1 ~]# mmchrecoverygroup rg_gssio1 --servers gssio2,gssio1
[ems1]# mmlsrecoverygroup rg_gssio1 -L | grep "active recovery" -A2

 active recovery group server                     servers
 -----------------------------------------------  -------
 gssio2                                           gssio1,gssio2

[ems1]# mmlsrecoverygroup rg_gssio2 -L | grep "active recovery" -A2

 active recovery group server                     servers
 -----------------------------------------------  -------
 gssio2                                           gssio2,gssio1
- Verify that the NVRAM partitions exist, and ensure the following:
- There should be 11 partitions.
- Partitions 6 through 11 should be 2 GB each.
- Partitions 6 through 9 should be marked with an xfs file system.
- Partitions 10 and 11 should not have a file system associated with them.
- After re-imaging, partitions 10 and 11 on the re-imaged node will have an xfs file system, as shown in the following output:
If the partitions do not exist, you need to create them. For more information, see Re-creating the NVR partitions. An illustrative sketch follows the output below.

[ems]# ssh gssio1 "lsblk | egrep 'NAME|sda[0-9]'"
NAME     MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
├─sda1     8:1    0     8M  0 part
├─sda2     8:2    0   500M  0 part /boot
├─sda3     8:3    0 246.1G  0 part /
├─sda4     8:4    0     1K  0 part
├─sda5     8:5    0   3.9G  0 part [SWAP]
├─sda6     8:6    0     2G  0 part
├─sda7     8:7    0     2G  0 part
├─sda8     8:8    0     2G  0 part
├─sda9     8:9    0     2G  0 part
├─sda10    8:10   0     2G  0 part
└─sda11    8:11   0     2G  0 part

[ems1]# ssh gssio1 "parted /dev/sda -l | egrep 'boot, prep' -B 1 -A 10"
Number  Start   End     Size    Type      File system     Flags
 1      1049kB  9437kB  8389kB  primary                   boot, prep
 2      9437kB  534MB   524MB   primary   xfs
 3      534MB   265GB   264GB   primary   xfs
 4      265GB   284GB   18.9GB  extended
 5      265GB   269GB   4194MB  logical   linux-swap(v1)
 6      269GB   271GB   2097MB  logical   xfs
 7      271GB   273GB   2097MB  logical   xfs
 8      273GB   275GB   2097MB  logical   xfs
 9      275GB   277GB   2097MB  logical   xfs
10      277GB   279GB   2097MB  logical
11      279GB   282GB   2097MB  logical

[ems1]# ssh gssio2 "lsblk | egrep 'NAME|sda[0-9]'"
NAME     MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
├─sda1     8:1    0     8M  0 part
├─sda2     8:2    0   500M  0 part /boot
├─sda3     8:3    0 246.1G  0 part /
├─sda4     8:4    0     1K  0 part
├─sda5     8:5    0   3.9G  0 part [SWAP]
├─sda6     8:6    0     2G  0 part
├─sda7     8:7    0     2G  0 part
├─sda8     8:8    0     2G  0 part
├─sda9     8:9    0     2G  0 part
├─sda10    8:10   0     2G  0 part
└─sda11    8:11   0     2G  0 part

[ems1]# ssh gssio2 "parted /dev/sda -l | egrep 'boot, prep' -B 1 -A 10"
Number  Start   End     Size    Type      File system     Flags
 1      1049kB  9437kB  8389kB  primary                   boot, prep
 2      9437kB  534MB   524MB   primary   xfs
 3      534MB   265GB   264GB   primary   xfs
 4      265GB   284GB   18.9GB  extended
 5      265GB   269GB   4194MB  logical   linux-swap(v1)
 6      269GB   271GB   2097MB  logical   xfs
 7      271GB   273GB   2097MB  logical   xfs
 8      273GB   275GB   2097MB  logical   xfs
 9      275GB   277GB   2097MB  logical   xfs
10      277GB   279GB   2097MB  logical   xfs
11      279GB   282GB   2097MB  logical   xfs
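If a logical partition is missing, the following is a minimal, illustrative sketch only; the start and end offsets are copied from the example parted output above and the partition numbers are assumptions for your system, so follow Re-creating the NVR partitions for the supported procedure. Remember that only partitions 6 through 9 receive an xfs file system; partitions 10 and 11 must be left unformatted.

[ems]# ssh <REPLACEMENT NODE> "parted -s /dev/sda mkpart logical 279GB 282GB"
[ems]# ssh <REPLACEMENT NODE> "mkfs.xfs /dev/sda6"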
- View the current NVR device status.
[ems1]# mmlsrecoverygroup rg_gssio1 -L --pdisk | egrep "n[0-9]s[0-9]"
 n1s01    1, 1    NVR    1816 MiB    ok
 n2s01    0, 0    NVR    1816 MiB    missing
[ems1]# mmlsrecoverygroup rg_gssio2 -L --pdisk | egrep "n[0-9]s[0-9]"
 n1s02    1, 1    NVR    1816 MiB    ok
 n2s02    0, 0    NVR    1816 MiB    missing

Note: The missing NVR devices must be re-created or replaced. For more information, see Re-creating NVRAM pdisks.
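For orientation only: replacing a missing NVR pdisk generally involves removing the stale entry with mmdelpdisk and adding it back from a stanza file with mmaddpdisk. The stanza values shown here (pdisk name, device path, declustered array name) are illustrative assumptions, so use the exact steps in Re-creating NVRAM pdisks.

[ems1]# cat nvr.stanza
%pdisk: pdiskName=n2s01 device=//gssio2/dev/sda10 da=NVR
[ems1]# mmdelpdisk rg_gssio1 --pdisk n2s01
[ems1]# mmaddpdisk rg_gssio1 -F nvr.stanza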
