Troubleshoot an LSF cluster's diminished status

These troubleshooting tips help you to resolve a variety of common problem scenarios with LSF cluster health and status.

If the LSF cluster is shown in a diminished state, refer to the following list for useful tips on resolving specific issues that are related to this state:

  • Ping the management host from the RTM host, and the RTM host from the management host. Both hosts must be able to ping each other.
    • Add the IP address or host name to either /etc/hosts or LSF_TOP/hosts and restart the cluster.
    • Add the RTM host name as a client host in the lsf.cluster.<clustername> file. Then, run lsadmin reconfig and badmin reconfig, and restart the management host LIM only.
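
      For example, with a hypothetical RTM host named rtmhost (substitute your own host name), the change might look like the following sketch; the exact Host section columns depend on your lsf.cluster file:

      # In lsf.cluster.<clustername>, add rtmhost to the Host section with
      # the server field set to 0, which marks it as a client-only host:
      #   rtmhost   !   !   0   -   ()
      lsadmin reconfig                      # reconfigure the LIMs
      badmin reconfig                       # reconfigure mbatchd
      lsadmin limrestart <managementhost>   # restart the management host LIM only
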
  • Check whether EGO is enabled in LSF. When you add a cluster to RTM, make sure that the setting for "EGO enabled" matches the LSF cluster setting.
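
    To check the LSF side, inspect the LSF_ENABLE_EGO parameter in lsf.conf; this assumes that LSF_ENVDIR is set in your environment:

    # LSF_ENABLE_EGO=Y means EGO is enabled; N means it is disabled
    grep LSF_ENABLE_EGO $LSF_ENVDIR/lsf.conf
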
  • Check whether the minor collection period is set to a value greater than the major collection period in the Job Collection Settings for the cluster.
  • Check the LIM log. If the LSF LIM is not accepting requests from the RTM host, it logs a message at the default log level when it rejects the API requests. If you see such messages, add the RTM host as an LSF client.
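
    For example, to look for rejection messages, assuming the default log location (the exact message text varies by LSF version):

    # LIM log files follow the lim.log.<hostname> naming convention
    grep -i <rtmhost> $LSF_LOGDIR/lim.log.<managementhost>
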
  • Check for firewall issues:

    telnet <lsfmanagementhost> <lim port>

    Source the LSF environment on the RTM host and run commands such as lsid, lsload, and bhosts to check whether LSF is installed on a shared file system that is not accessible from the RTM host.
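
    For example, assuming LSF_TOP is /opt/lsf (substitute your own installation path):

    . /opt/lsf/conf/profile.lsf   # source the LSF environment
    lsid      # prints the cluster name and management host if LSF is reachable
    lsload    # host load information from the LIM
    bhosts    # batch host status from mbatchd

    If these commands fail or hang, the LSF installation is probably on a shared file system that the RTM host cannot access.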

  • Check that RTM is able to get data with LSF APIs from the management host. Run the following command:

    ./gridhosts -C <clusterid> -d

  • Check that the appkey in the database matches the one in RTM_TOP/rtm/etc/.appkey.

    Example:

    # cat /opt/rtm/etc/.appkey
    a064a7beac71a0596181b6939980eff620ffd6b4
    # mysql cacti -e "select * from settings"  | grep -i key
    app_key a064a7beac71a0596181b6939980eff620ffd6b4
    
  • Verify that the LSF version that is set in the RTM cluster configuration matches the actual version of that LSF cluster.
  • Check whether any grid data collection processes ran for an unreasonably long time:

    mysql cacti -e "select * from grid_processes"
    ps -ef | grep -i grid
    

    Because of a known design issue, if grid binaries for one cluster hang, other clusters are shown in diminished status.

    Use reasonable LSF API timeouts and RTM configuration timeouts. If the problem persists even after you update the timeout periods, complete the following steps:

    • Disconnect from the cluster.

    • Stop the hanging grid binaries so that data collection for other clusters continues. Try to identify why the grid binaries for the cluster hang.
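
      For example, a minimal sketch for finding and stopping long-running grid binaries, assuming that they can be identified by name in the process list (the PID is illustrative):

      # Show the elapsed time of each grid data collection process
      ps -eo pid,etime,cmd | grep -i '[g]rid'

      # Stop a hanging process by PID (replace 12345 with the actual PID)
      kill 12345
      # If the process does not exit, force it to stop
      kill -9 12345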

You might see that the LIM is shut down and the cluster status is Down, but the Load/Batch status of the management host is OK.
Note: RTM does not display data in real time, and the LIM status is not refreshed in real time. The displayed data depends on how often data is polled (configurable) and how often data is aggregated (a 5-minute cron job and daily aggregation). The configuration option that controls the polling interval is named poller_interval, with a default value of 300 seconds.
Example:

$ mysql -u root -p -e "select * from cacti.settings where name='poller_interval'"
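
To confirm how often data is actually collected, you can also check the cron entry that drives the Cacti poller; the paths that are shown here are common defaults and might differ on your system:

# A typical Cacti installation runs poller.php from cron every 5 minutes
grep -r poller.php /etc/cron.d/ /etc/crontab 2>/dev/null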