Monitoring cluster performance

The Cluster Dashboard provides the latest information about the performance of your LSF clusters. If current values show degraded performance compared with historical values, you can take corrective action before issues become critical. The dashboard is populated from the LSF performance monitoring (perfmon) metrics.

Before you begin

For performance monitoring to have data, you must start performance metric collection, either by setting SCHED_METRIC_ENABLE=y in the lsb.params LSF configuration file or by running the badmin perfmon start command.
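
As a minimal sketch, the lsb.params route might look like the fragment below. It assumes the usual Begin Parameters / End Parameters layout of that file; the other contents of your own lsb.params will differ, so treat this as illustrative rather than as a complete file.

```
# lsb.params (fragment, illustrative)
Begin Parameters
# Enable scheduler performance metric collection
SCHED_METRIC_ENABLE = y
End Parameters
```

A change to lsb.params typically takes effect only after LSF re-reads its configuration (for example, with badmin reconfig). If you start collection with badmin perfmon start instead, no file edit is needed. In either case, running badmin perfmon view is a quick way to confirm that metrics are being collected.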

Procedure

Enable perfmon metrics collection in RTM:

Tip: You only have to enable performance monitoring once. If you have enabled it from RTM, you do not have to enable it in LSF.
1. Go to Console > Clusters > Clusters and select an LSF cluster from the list.
2. From the Cluster Settings page, click the Poller tab, and select Enable LSF Perfmon Collection.
3. Select Badmin Perform Interval.
The Cluster Dashboard may show the following information:
- Cluster Name: The LSF cluster name.
- Cluster Status: The status of the cluster.
- Master Status: The status of the management host in the cluster.
- PAU: The type of the host currently controlling the cluster. Valid values are as follows:
  - P: Primary management host
  - A: Failover host
  - U: Unknown host type
- Collect Status: The data collection status for the cluster.
- CPU %: The cluster’s overall CPU utilization rate, as a percentage.
- Slot %: The entire cluster’s slot utilization, as a percentage.
- Efic %: The entire cluster’s CPU efficiency for running jobs. Efficiency is calculated with this formula: cpu_time/(run_time × number_of_cpus).
- Total CPUs: The total number of CPUs in the cluster.
- Host Slots: The total number of slots available to run jobs in the cluster.
- Pend Jobs: The total number of pending jobs in the cluster.
- Run Jobs: The total number of running jobs in the cluster.
- Susp Jobs: The total number of suspended jobs in the cluster (including system suspended and user suspended jobs).
- Hourly Started: The total number of jobs that were started during the last hour.
- Hourly Done: The total number of jobs that were completed during the last hour.
- Hourly Exit: The total number of jobs that were cancelled during the last hour (unsuccessful completion).
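
The efficiency formula given for Efic % can be illustrated with a small worked calculation. The sketch below is not part of RTM or LSF: cpu_efficiency is a hypothetical helper, and the input values are made-up numbers chosen only to show the arithmetic. It expresses cpu_time/(run_time × number_of_cpus) as a percentage.

```shell
#!/bin/sh
# Hypothetical helper (not an RTM or LSF command): CPU efficiency as a percentage.
# Arguments: cpu_time in seconds, run_time in seconds, number_of_cpus.
cpu_efficiency() {
    awk -v cpu="$1" -v run="$2" -v n="$3" 'BEGIN {
        # Guard against division by zero for jobs that have not run yet.
        if (run <= 0 || n <= 0) { print "n/a"; exit }
        printf "%.1f\n", 100 * cpu / (run * n)
    }'
}

# Example with made-up values: 5400 s of CPU time over a 3600 s run on 2 CPUs.
cpu_efficiency 5400 3600 2   # prints 75.0
```

A value well below 100 suggests that the running jobs are not keeping their allocated CPUs busy, for example because they are waiting on I/O.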
Create LSF perfmon metrics graphs in RTM:
1. Go to Console > Templates > Device.
2. Add a new device template. Provide a name for it (for example, LSF_Perfmon), set the Class value to Unassigned, and click Create.
3. Associate LSF perfmon graph templates with the new device template.
  Click Add Graph Template, select the graph templates you want, and click Add next to each graph template.
  The following graph templates are available. You can add several, but you must add at least one:
  - GRID - LSF Host Info Requests
  - GRID - LSF Host Match Criteria
  - GRID - LSF Job Buckets
  - GRID - LSF Job Info Requests
  - GRID - LSF Job Scheduling Interval
  - GRID - LSF Job Submit Requests
  - GRID - LSF Jobs Completed
  - GRID - LSF Jobs Dispatched
  - GRID - LSF Jobs Submitted
  - GRID - LSF MBatchD Requests
  - GRID - LSF MBD File Descriptor Usage
  - GRID - LSF Performance
  - GRID - LSF Queue Info Requests
  Click Save to add the selected graph templates.
4. Go to Console > Management > Devices.
5. Add a new device. Set these fields:
  - Description: For example, LSF_Perfmon device.
  - Hostname: For example, ABChost.
  - Device Template: The device template that you created in step 2. For example, LSF_Perfmon.
  - LSF Cluster Association: For example, myLSFcluster.
  Click Create.
  
  It takes some time for the graphs to be built and to start collecting data.
6. Verify that the graphs are created and that they have data. From the Graphs tab, select the LSF_Perfmon device as the device, and then select the List View (using the expanding arrow). You can also show the filter section (using the double expanding arrow).