Monitoring cluster performance
The Cluster Dashboard provides the latest information about the performance of the LSF clusters. If you notice that current values have degraded compared to historical values, you can take corrective action before issues become critical. The dashboard displays these values using the LSF performance monitoring metrics (perfmon).
You need to enable performance monitoring only once. If you enabled it from RTM, you do not have to enable it again in LSF.
The dashboard displays the following fields:
- Cluster Name: The LSF cluster name.
- Cluster Status: The status of the cluster.
- Master Status: The status of the management host in the cluster.
- PAU: The type of the host currently controlling the cluster. Valid values are as follows:
  - P: Primary management host
  - A: Failover host
  - U: Unknown host type
- Collect Status: The data collection status for the cluster.
- CPU %: The cluster's overall CPU utilization rate, as a percentage.
- Slot %: The entire cluster's slot utilization, as a percentage.
- Efic %: The entire cluster's CPU efficiency for running jobs, as a percentage. Efficiency is calculated with this formula: cpu_time / (run_time × #_of_cpus).
- Total CPUs: The total number of CPUs in the cluster.
- Host Slots: The total number of slots available to run jobs in the cluster.
- Pend Jobs: The total number of pending jobs in the cluster.
- Run Jobs: The total number of running jobs in the cluster.
- Susp Jobs: The total number of suspended jobs in the cluster (including system-suspended and user-suspended jobs).
- Hourly Started: The total number of jobs that started during the last hour.
- Hourly Done: The total number of jobs that completed during the last hour.
- Hourly Exit: The total number of jobs that were cancelled or otherwise completed unsuccessfully during the last hour.
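
The efficiency formula given for Efic (cpu_time / (run_time × #_of_cpus)) can be illustrated with a minimal Python sketch. The function name, the argument names, the use of seconds as the time unit, and the choice to return 0 when run time or CPU count is not positive are all assumptions made for illustration; this is not how RTM itself computes or aggregates the value.

```python
def cpu_efficiency(cpu_time: float, run_time: float, num_cpus: int) -> float:
    """Return CPU efficiency as a percentage.

    Applies cpu_time / (run_time * num_cpus), scaled to a percentage.
    cpu_time and run_time must use the same unit (seconds assumed here).
    """
    # Assumption: report 0 rather than divide by zero when a job
    # has no elapsed run time or no CPUs recorded.
    if run_time <= 0 or num_cpus <= 0:
        return 0.0
    return 100.0 * cpu_time / (run_time * num_cpus)


# Hypothetical job: 5400 CPU-seconds consumed over 3600 seconds
# of run time on 2 CPUs.
print(cpu_efficiency(5400, 3600, 2))  # 75.0
```

In this hypothetical case the job had 7200 CPU-seconds available (3600 seconds on 2 CPUs) and used 5400 of them, so its efficiency is 75%. A value well below 100% suggests that the job reserved more CPUs than it kept busy.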