Monitoring cluster performance

The Cluster Dashboard provides the latest information about the performance of the LSF clusters. If current values show degraded performance compared with historical values, you can take corrective action before issues become critical. The dashboard displays metrics collected by LSF performance monitoring (perfmon).

Important: Performance monitoring has data only after you start performance metric collection, either by specifying SCHED_METRIC_ENABLE=y in the LSF configuration file lsb.params or by running the command badmin perfmon start.
Note: To enable perfmon metrics collection in RTM, go to Console > Clusters > Clusters. Select an LSF cluster from the list and, on the Cluster Settings page, click the Poller tab. On the Poller tab, select Enable LSF Perfmon Collection and select a Badmin Perform Interval.

You need to enable performance monitoring only once. If you enabled it from RTM, you do not have to enable it in LSF.
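For the LSF-side option, the setting could look like the following fragment. This is a minimal sketch, assuming the usual Begin Parameters / End Parameters layout of lsb.params; the rest of the file is omitted, and changes to lsb.params typically take effect only after the cluster is reconfigured (for example with badmin reconfig).

```
# lsb.params (fragment; all other parameters omitted)
Begin Parameters
SCHED_METRIC_ENABLE=y
End Parameters
```

The alternative, badmin perfmon start, starts collection from the command line without editing the file.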

The Cluster Dashboard may show the following information:
  • Cluster Name: The LSF cluster name.

  • Cluster Status: The status of the cluster.

  • Master Status: The status of the management host in the cluster.

  • PAU: The type of the host currently controlling the cluster. Valid values are as follows:

    • P: Primary management host

    • A: Failover host

    • U: Unknown host type

  • Collect Status: The data collection status for the cluster.

  • CPU %: The cluster’s overall CPU utilization rate, as a percentage.

  • Slot %: The entire cluster’s slot utilization, as a percentage.

  • Efic %: The entire cluster’s CPU efficiency for running jobs. Efficiency is calculated with this formula: cpu_time / (run_time × #_of_cpus).

  • Total CPUs: The total number of CPUs in the cluster.

  • Host Slots: The total number of slots available to run jobs in the cluster.

  • Pend Jobs: The total number of pending jobs in the cluster.

  • Run Jobs: The total number of running jobs in the cluster.

  • Susp Jobs: The total number of suspended jobs in the cluster (including system suspended and user suspended jobs).

  • Hourly Started: The total number of jobs started during the last hour.

  • Hourly Done: The total number of jobs completed successfully during the last hour.

  • Hourly Exit: The total number of jobs that exited during the last hour (unsuccessful completion, including cancelled jobs).
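
The efficiency formula given for the Efic column, cpu_time / (run_time × #_of_cpus), can be illustrated with a small calculation. The following is a minimal sketch for a single job, not RTM's implementation: the function name, the conversion to a percentage, and the handling of a zero denominator are assumptions made for illustration, and how RTM aggregates the value across all running jobs is not described here.

```python
def cpu_efficiency(cpu_time: float, run_time: float, num_cpus: int) -> float:
    """Return CPU efficiency as a percentage: cpu_time / (run_time * num_cpus)."""
    denominator = run_time * num_cpus
    if denominator <= 0:
        # Assumption: report 0 when no run time has accrued yet.
        return 0.0
    return 100.0 * cpu_time / denominator


# Hypothetical job: ran for 3600 s on 4 CPUs and consumed 10800 CPU-seconds.
print(f"{cpu_efficiency(10800, 3600, 4):.1f}%")  # prints 75.0%
```

A value well below 100% suggests that jobs hold more CPUs than they actually use.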