Troubleshoot an LSF cluster's diminished status
These troubleshooting tips help you resolve a variety of common problems with LSF cluster health.
If the LSF cluster is shown in diminished state, refer to the following list for useful tips on resolving specific issues that are related to this state:
- Ping the LSF management host from the RTM host, and ping the RTM host from the management host. Both hosts must be able to ping each other.
- Add the IP address or host name to either /etc/hosts or LSF_TOP/hosts and restart the cluster.
- Add the RTM host name as a client host in the lsf.cluster.clname file. Then, run badmin reconfig, lsadmin reconfig, and restart the management host LIM only.
- Check whether EGO is enabled in LSF. When you add a cluster to RTM, make sure that the setting for "EGO enabled" matches the LSF cluster setting.
- Check whether the minor collection period is set greater than the major collection period in the Job Collection Settings for the cluster.
- Check the LIM log. If the LSF LIM rejects API requests from the RTM host, it logs a message at the default log level. In that case, add the RTM host as an LSF client.
- Check for firewall issues:
telnet <lsfmanagementhost> <lim port>
On the RTM host, source the LSF environment and run commands such as lsid, lsload, and bhosts to check whether LSF is installed on a shared file system that the RTM host cannot access.
- Check that RTM is able to get data with LSF APIs from the management host. Run the following command:
./gridhosts -C <clusterid> -d
- Check that the appkey in the database matches the one in RTM_TOP/rtm/etc/.appkey.
Example:
# cat /opt/rtm/etc/.appkey
a064a7beac71a0596181b6939980eff620ffd6b4
# mysql cacti -e "select * from settings" | grep -i key
app_key a064a7beac71a0596181b6939980eff620ffd6b4
- Verify that the LSF version matches the RTM cluster configuration for that cluster.
- Check whether the grid_processes ran for an unreasonably long time:
mysql cacti -e "select * from grid_processes"
ps -aef | grep -i grid
Because of a known design issue, if grid binaries for one cluster hang, other clusters are shown in diminished status.
Set reasonable LSF API timeouts and reasonable timeouts in the RTM configuration. If the problem persists even after you update the timeout periods, complete the following steps:
- Disconnect from the cluster.
- Stop the hanging grid binaries so that data collection for other clusters continues. Try to identify why the grid binaries for the cluster hang.
- Query the current poller interval:
$ mysql -u root -p -e "select * from cacti.settings where name='poller_interval'"
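The appkey comparison described earlier in this topic might be scripted along the following lines. This is a minimal sketch, not part of RTM: the function names db_appkey and compare_appkeys are hypothetical helpers, and the /opt/rtm/etc/.appkey path and the whitespace-separated "name value" row layout are assumptions taken from the example output in this topic. Adjust them for your installation.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for comparing the appkey on disk with the one in the database.

# Read settings rows on stdin (assumed layout: "name<whitespace>value")
# and print the value of the app_key row, if any.
db_appkey() {
    awk '$1 == "app_key" { print $2 }'
}

# Print "match" when both keys are non-empty and identical, otherwise "mismatch".
compare_appkeys() {
    if [ -n "$1" ] && [ "$1" = "$2" ]; then
        echo "match"
    else
        echo "mismatch"
    fi
}

# Example usage on the RTM host (assumed path and database name, as in the example above):
#   file_key=$(tr -d '[:space:]' < /opt/rtm/etc/.appkey)
#   db_key=$(mysql cacti -e "select * from settings" | db_appkey)
#   compare_appkeys "$file_key" "$db_key"
```

Treating an empty key as a mismatch is a deliberate choice in this sketch, so that a missing file or a failed query is not reported as a successful comparison.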