Cluster Down Debug Procedure
Cluster down? Here is the checklist.
Author: Li Cao, Weinan Liu
The root cause of a cluster down issue is often not Platform HPC related. In most cases it can be isolated and then fixed by following a few simple steps. The checklist applies to Platform HPC 3.0, 3.2, and 4.1.1.
1. Initial Checklist
The initial checklist includes the following items:
- disk status
- network connectivity
- cache cleanup
- workload management check
1.1 Disk status check
This is done using the df command.
[root@phpc4111b ~]# df
Filesystem 1K-blocks Used Available Use% Mounted on
51606140 12774264 36210436 27% /
tmpfs 961340 0 961340 0% /dev/shm
/dev/sda1 495844 37724 432520 9% /boot
26391624 176104 24874900 1% /home.PCMHA
* If /var is full, refer to section 2, PostgreSQL Database Checking.
1.2 Network connectivity check:
Making sure there are no network connectivity problems involves checking both the management and the computational networks. Connectivity on the management network can be checked with the ping command (for example, by pinging the compute nodes from the management node). The computational (high-performance) network is usually an InfiniBand-based network and requires a few steps for a complete check.
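A minimal sketch of the management-network check, assuming the compute nodes follow the compute-XX-YY naming used in the examples below (adjust the host list for your cluster):
# ping each compute node once from the management node and report the ones that do not answer
for h in compute-00-00 compute-00-01 compute-00-02; do
    ping -c 1 -W 2 "$h" > /dev/null 2>&1 || echo "no reply from $h"
done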
a. Check subnet manager is running
A subnet manager that is not running is the most common IB issue when subnet management is not handled at the switch level. On these clusters the subnet manager runs on one of the compute nodes, because the master node does not have an InfiniBand HCA. In the example below, the subnet manager is running on compute-00-00.
[root@compute-00-00 ~]# service opensmd status
opensm (pid 11311) is running...
If the subnet manager has crashed for some reason, restart it, as shown below.
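A sketch of the restart, assuming the stock opensmd init script shipped with OFED; run it on the node hosting the subnet manager:
[root@compute-00-00 ~]# service opensmd restart
[root@compute-00-00 ~]# service opensmd status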
b. Check ib status
- Show the host adapter status with the ibstat command.
- Look at the port State. It can be Down, Active, or Init (Init usually indicates a problem with the subnet manager). The Physical state should be LinkUp.
[root@compute-00-00 ~]# ibstat
CA type: MT25408
Number of ports: 2
Firmware version: 2.3.0
Hardware version: a0
Node GUID: 0x0002c9030002847c
System image GUID: 0x0002c9030002847f
Port 1:
    Physical state: Polling
    Base lid: 0
    SM lid: 0
    Capability mask: 0x02510868
    Port GUID: 0x0002c9030002847d
Port 2:
    Physical state: LinkUp
    Base lid: 2
    SM lid: 2
    Capability mask: 0x0251086a
    Port GUID: 0x0002c9030002847e
If the port is in the Init state, the subnet manager is probably down; refer to step a (Check subnet manager is running) above to bring it back.
If only one port is cabled, it is normal for the empty port to show Physical state: Disabled or Polling and State: Down.
c. Check the InfiniBand topology
- ibnodes - show InfiniBand nodes in the topology
[root@c661f6ufm1 ~]# ibnodes
Ca : 0x5cf3fc000004ecca ports 2 "c661f6dx04 HCA-1"
Ca : 0x5cf3fc000004ecda ports 2 "c661f6dx02 HCA-1"
Ca : 0x5cf3fc000004ed1a ports 2 "c661f6dx03 HCA-1"
Ca : 0x5cf3fc000005e567 ports 2 "c661f6dx08 HCA-1"
Ca : 0x5cf3fc000005e4eb ports 2 "c661f6dx05 HCA-1"
Ca : 0x5cf3fc000005e3f7 ports 2 "c661f6dx06 HCA-1"
Ca : 0x5cf3fc000005e42f ports 2 "c661f6dx07 HCA-1"
Ca : 0x0002c90300ee1bf0 ports 2 "c661f6gpfs01 HCA-1"
Ca : 0x0002c90300ee1ba0 ports 2 "c661f6gpfs02 HCA-1"
Ca : 0x0002c90300fdf330 ports 2 "c661f6ufm1 HCA-1"
Switch : 0x0002c903008ef000 ports 36 "MF0;c661f6ibsw1:SX6036/U1" enhanced port 0 lid 7 lmc 0
A host may have only one of its two ports enabled; verify that the nodes listed by ibnodes match what ibstat reports on each host.
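To spot a port problem quickly across several hosts, a small loop such as the sketch below can help. It assumes passwordless ssh from the management node and uses the compute-XX-YY host names from the examples above; adjust the list for your cluster:
# print the port state and physical state of every IB port on each compute node
for h in compute-00-00 compute-00-01; do
    echo "== $h =="
    ssh "$h" "ibstat | grep -E 'State:|Physical state:'"
done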
1.3 Cache cleanup
a. Browser cache cleanup
Clearing your Web browser cache forces the browser to load the latest versions of Web pages and programs you visit.
Deleting the web cache in Internet Explorer (IE) varies with the IE and Windows version; the options to remove cached web pages are found under Tools (Internet Options or Safety), then Browsing History.
In Firefox:
- Click Tools from the Firefox menu bar
- Under the Advanced options, click the Network tab
- Clear the cache under Cached Web Content
b. perf cleanup
The perfadmin tool is used to stop and then start all the PERF services. This restarts any data loaders that failed to start.
[root@pcm4111 ~]# perfadmin stop all
Service <jobdt> stopped successfully.
Service <plc> stopped successfully.
Service <plc_group2> stopped successfully.
Service <purger> stopped successfully.
Service <vdatam> stopped successfully.
[root@pcm4111 ~]# perfadmin start all
Service <jobdt> has started already.
Service <plc> has started already.
Service <plc_group2> has started already.
Service <purger> has started already.
Service <vdatam> has started already.
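To confirm that the loaders came back, perfadmin can also list the service status (assuming the perfadmin version in use supports the list subcommand; the service names match those shown above):
[root@pcm4111 ~]# perfadmin list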
1.4 Workload management check
a. Restart the WLM daemons
[root@pcm4111 ~]# lsfstartup
Starting up all LIMs ...
Do you really want to start up LIM on all hosts ? [y/n]y
Start up LIM on <pcm4111> ...... done
Waiting for Master LIM to start up ... Master LIM is ok
Starting up all RESes ...
Do you really want to start up RES on all hosts ? [y/n]y
Start up RES on <pcm4111> ...... done
Starting all slave daemons on LSBATCH hosts ...
Do you really want to start up slave batch daemon on all hosts ? [y/n] y
Start up slave batch daemon on <pcm4111> ...... done
Done starting up LSF daemons on the local LSF cluster ...
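Once the daemons are back, a few standard LSF commands can confirm the cluster is healthy:
[root@pcm4111 ~]# lsid
[root@pcm4111 ~]# bhosts
[root@pcm4111 ~]# bqueues
lsid shows the cluster name and the current master; in the bhosts output the hosts should be ok or closed (not unavail or unreach), and in the bqueues output the queues should be Open:Active.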
b. If the GUI cannot submit a job, try submitting from the command line with bsub:
[root@compute001 ~]# bsub
bsub> sleep 1000
bsub> Job <206> is submitted to default queue <medium_priority>.
c. If the slave hosts cannot connect to the master and the bhosts command shows the node state as "UNKNOWN"
In some cases the "dos-control l4port" configuration option on a Dell switch prevents the LIMs on the compute nodes from communicating with each other. Disabling "dos-control l4port" can solve the problem; a sketch is shown below.
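A hedged sketch of disabling the option on a Dell PowerConnect-class switch; the exact CLI syntax depends on the switch model and firmware, so treat this only as an illustration and consult the switch documentation:
console# configure
console(config)# no dos-control l4port
console(config)# exit
console# copy running-config startup-config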
2. PostgreSQL Database Checking
If the /var directory is full, the PostgreSQL database goes down and cannot be started again. If the output from the steps above points to a full /var or a database that will not start, the problem is most likely this case; follow the steps below to clear it.
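Before touching the database, it helps to confirm that /var really is full and to see which directories are consuming the space, for example:
# df -h /var
# du -sm /var/* 2>/dev/null | sort -rn | head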
2.1 Back up the current hpc database.
Run the backup from a bash shell:
# pg_dump -U hpcadmin hpcdb > /nsf/hpcdbackup
2.2 Drop the old data from the database
# delete from ci_server_snap where time_stamp < '%';
(replace % with the appropriate cut-off timestamp; see the psql example below)
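For example, the delete can be run through psql with the hpcadmin/hpcdb names used in the backup step; the cut-off date here is only an illustration, pick one appropriate for your site:
# psql -U hpcadmin -d hpcdb -c "delete from ci_server_snap where time_stamp < '2014-01-01 00:00:00';"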
2.3 Create new partition
2.4 Free the disk space.
(In this step, make sure there is some free space under the /var directory; the cleanup needs buffer space in order to free the disk. See the vacuum sketch below.)
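Deleting rows alone does not immediately return space to the file system; PostgreSQL normally needs a vacuum before the space is released, and a full vacuum rewrites the table, which is why some free space is required first. A sketch, assuming the hpcadmin/hpcdb names from the backup step:
# vacuumdb -U hpcadmin -d hpcdb --full --table ci_server_snap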
3. HA Checking - Cluster with HA Enabled
3.1 HA checklist
a. Go through all known issues in release notes.
b. Check the Storage system.
c. Check the network.
Ensure all NIC links are up and all ports are up.
d. Check available disk space on the master and master candidate hosts.
e. Run the following commands to display the HA status:
For hpc 3.x:
# egosh service list
For hpc 4.x:
# pcmhatool info
# pcmhatool status
3.2 HA recovery
If the checklist above did not resolve the problem, try the steps below to return the cluster to its original state after an automatic failover.
a. Fix the issue which triggered the failover
b. Change the fail mode to manual:
For hpc 3.x:
# kusu-failmode -m manual
Failover mode is currently set to: Manual
For hpc 4.x:
# pcmhatool failmode -m manual
Failover mode is currently set to: Manual
c. Restart service kusuha-heartbeat on the primary installer and then the failover node (for 3.x only, skip this for 4.x)
d. Switch the "Installer" role back to the original installer by running the failover tool on the current primary installer.
For hpc 3.x:
# kusu-failto
For hpc 4.x:
# pcmhatool failto
e. Change the fail mode back to auto:
For hpc 3.x:
# kusu-failmode -m auto
Failover mode is currently set to: Auto
For hpc 4.x:
# pcmhatool failmode -m auto
Failover mode: Auto
4. Storage Checking - NFS/GPFS:
4.1 Mount directory checklist
Check whether the shared file systems below are mounted (a quick check is sketched after the list):
Cluster without HA enabled:
Cluster with HA enabled (phpc version 3.*):
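A quick way to verify the shared mounts from any node; the grep pattern and the NFS server name are placeholders to adapt to your environment:
# mount | grep -E 'nfs|gpfs'
# showmount -e <nfs_server_or_installer_host>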
4.2 GPFS health check command:
a. Use "mmgetstate -a" to check the fs status to see if all nodes are active. The command display the state of the GPFS daemon all the nodes. And users can double check with the result of command "mmstartup -a all" to start up the node stating “down”. If failed, this might be GPFS issue, please contact the GPFS support to resolve it.
[root@gpfstest5 ~]# mmgetstate -a
Node number Node name GPFS state
3 gpfstest5 active
4 gpfstest9 active
5 gpfstest6 active
6 gpfstest8 active
More info on mmgetstate command : http://www-01.ibm.com/support/knowledgecenter/SSFKCN_3.5.0/com.ibm.cluster.gpfs.v3r5.gpfs100.doc/bl1adm_mmgetstate.htm
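For example, if gpfstest9 were reported as down, the daemon could be restarted on just that node and the state rechecked (replace the node name with the one reported as down):
# mmstartup -N gpfstest9
# mmgetstate -a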
b. Use "mmlsmount <fsname> -L" to see whether all the nodes that should mount the file system are listed. This command shows basic information about one GPFS file system. If nodes are missing, run "mmmount all -a" to mount all GPFS file systems on all nodes and then check again. If that fails, this might be a GPFS issue; please contact GPFS support to resolve it.
# mmlsmount citifs -L
File system citifs is mounted on 4 nodes:
More info on mmlsmount command : http://www-01.ibm.com/support/knowledgecenter/SSFKCN_3.5.0/com.ibm.cluster.gpfs.v3r5.gpfs500.doc/bl1pdg_mmlsm.htm
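For example, to remount the citifs file system from the output above on all nodes and then recheck:
# mmmount citifs -a
# mmlsmount citifs -L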
5. Restart MN
If none of the above works, try rebooting the management node (MN).
[root@phpc4111b ~]# reboot
GPFS Frequently Asked Questions and Answers: http://www-01.ibm.com/support/knowledgecenter/SSFKCN/com.ibm.cluster.gpfs.doc/gpfs_faqs/gpfsclustersfaq.html