Job Info section

The Job Info section is in the Grid menu bar.

Host Job Statistics Viewer

Go to the Host Job Statistics Viewer by clicking By Host under the Job Info section of the Grid menu bar. Information about hosts in a cluster is shown.

Host Name. The name of the host. Click a host name to show running jobs for this host (on the Job Info > Details page).
Cluster. The LSF cluster to which this host belongs.
Type. The type of host, as defined in the LSF configuration.
Model. The model of the host, as defined in the LSF configuration.
Load/Batch. The current Load and Batch status of the host.

If you choose the Out of Service option in the Status field, jobs with RES-Down status are shown in this column. If you choose In Service, jobs with RES-Down status are not displayed here. You must set how the Down Service Status is treated (through Console > Configuration > Grid Settings > THold > Grid Settings) for the corresponding option to be available in the Status filter.
CPU Fact. The processor factor of the host, as defined in the LSF configuration.
CPU Pct. The current processor utilization on the host.
RunQ 1m. The exponentially averaged effective processor run queue length for this host over the last minute.
Mem Usage. The memory that is used by all jobs that are running on this host, as a percentage of total memory.
Page Usage. The page space that is used by all jobs that are running on this host, as a percentage of total page size.
Page Rate. The memory page scan rate of the host.
Max Slots. The maximum number of job slots that can be allocated to this host.
Num Slots. The number of job slots that are used by jobs that are dispatched to this host.
Run Slots. The number of job slots that are used by jobs that are running on this host.
SSUSP Slots. The number of job slots that are used by system-suspended jobs on the host.
USUSP Slots. The number of job slots that are used by user-suspended jobs on the host.
Reserve Slots. The number of job slots on the host that are reserved by pending jobs.

If graphs are created for this host, a graph icon is displayed to the left of the host name. Click the icon to view graphs for the host.

Host Group Job Statistics

Go to the Host Group Job Statistics Viewer by clicking By Host Group under the Job Info section of the Grid menu bar. In many respects, the information that is shown is similar to that obtained through the LSF bhosts command with condensed host groups.

The Status filter is populated with all unique Load and Batch statuses that are currently experienced by hosts in any cluster.

Job information by LSF host group is shown:

Host Group. The name of the LSF host group. Click a host group name to show running jobs for this group (on the Job Info > Details page).
Cluster. The LSF cluster to which this host group belongs.
Load/Batch. The current Load and Batch status for the host group. If no Status filter is set, this field shows N/A. Otherwise, it shows the current value that is selected for the Status filter.

If you choose the Out of Service option in the Status field, jobs with RES-Down status are displayed. If you choose In Service, jobs with RES-Down status are not displayed. You must set how the Down Service Status is treated (through Console > Configuration > Grid Settings > THold > Grid Settings) for the corresponding option to be available in the Status filter.
Total Hosts. The total number of hosts in this host group.
Avg CPU %. The average processor utilization for hosts in this host group.
Avg r1m. The average, across the hosts in this host group, of the exponentially averaged effective processor run queue length over the last minute.
Avg Effic. The average efficiency of the host group.
Total CPU. The overall processor utilization rate of the host group.
Max Memory. The maximum memory that is used by the host group.
Max Swap. The maximum swap usage of the host group.
Max Slots. The maximum number of job slots available for this host group.
Num Slots. The number of job slots that are used by jobs that are dispatched to this host group.
Run Slots. The number of job slots that are used by jobs that are running on this host group.
SSUSP Slots. The number of job slots that are used by system-suspended jobs on the host group.
USUSP Slots. The number of job slots that are used by user-suspended jobs on the host group.
Reserve Slots. The number of job slots within the host group that are reserved by pending jobs.

You can view Job Statistics graphs by clicking the corresponding View Host Group Graphs action icon of the Host Group. These graphs present images and statistics to help you control jobs flexibly.

GRID - Host Group - CPU Utilization. Presents statistics on processor utilization for server hosts whose status is not Unavail, Unreach, Unlicensed, or Closed-Admin.
GRID - Host Group - Available Memory. Presents an overview of available memory at the host group level, for more flexibility in controlling jobs.
GRID - Host Group - Host Details. Presents an overview of memory usage at the host group level.
GRID - Host Group - Memory Stats. Presents statistics on average or maximum memory for jobs whose status is RUNNING, USUSP, or SSUSP.

Project Viewer

Go to the Project Viewer by clicking By Project under the Job Info section of the Grid menu bar. Resources are shown in a cluster by project.

The information that is shown on the page is as follows:

Project Name. The name of the project. Click a project name to show running jobs for this project (on the Job Info > Details page).
Cluster Name. The LSF cluster to which this project belongs.
Total Slots. The total number of job slots that are used for this project.
Pending Slots. The number of job slots that are used by pending jobs for this project.
Running Slots. The number of job slots that are used by running jobs for this project.
Avg Effic. The average efficiency of this project.
Max Mem. The maximum memory that is used by this project.
Avg Mem. The average amount of memory that is used by this project.
Max Swap. The maximum swap space that is used by this project.
Avg Swap. The average swap space that is used by this project.
Total CPU. The overall processor utilization of this project.

License Project Viewer

Go to the License Project Viewer by clicking By License Project under the Job Info section of the Grid menu bar. Resources are shown in a cluster by license project.

The information that is shown on the page is as follows:

License Project. The name of the license project. Click a license project name to show running jobs for this license project (on the Job Info > Details page).
Cluster Name. The LSF cluster to which this license project belongs.
Total Slots. The total number of job slots that are used for this license project.
Pending Slots. The number of job slots that are used by pending jobs for this license project.
Running Slots. The number of job slots that are used by running jobs for this license project.
Avg Effic. The average efficiency of this license project.
Max Mem. The maximum memory that is used by this license project.
Avg Mem. The average amount of memory that is used by this license project.
Max Swap. The maximum swap space that is used by this license project.
Avg Swap. The average swap space that is used by this license project.
Total CPU. The overall processor utilization of this license project.

Queue Viewer

Go to the Queue Viewer by clicking By Queue under the Job Info section of the Grid menu bar. This display is similar to the output of the LSF bqueues command, except that it also includes the average and maximum run time of jobs in each queue, and the average and maximum pending time for each queue.

The information that is shown on the page is as follows:

Queue Name. The name of the LSF queue. Click a queue name to show running jobs in this queue (on the Job Info > Details page).
Cluster Name. The LSF cluster to which this queue belongs.
Priority. The priority of the queue.
Status Reason. The status of the queue, with further detail about the status.
Max Slots. The maximum number of job slots that can be used by the jobs in the queue.
Num Slots. The total number of available slots for this queue.
Run Slots. The number of job slots that are used by running jobs in the queue.
Pend Slots. The number of job slots that are used by pending jobs in the queue.
Suspend Slots. The number of job slots that are used by suspended jobs in the queue.
AVG Pend. The average number of job slots that are held by pending jobs in the queue.
MAX Pend. The maximum number of job slots that are held by pending jobs in the queue.
AVG Run. The average number of job slots that are held by running jobs in the queue.
MAX Run. The maximum number of job slots that are held by running jobs in the queue.

If you select any queue, you are directed to a display of all “RUNNING” jobs within that queue.

Viewing a Fairshare Tree

Fairshare shows the slot usage and a ratio of all the charge groups on the same level in a Fairshare tree.

To access the queue Fairshare Tree in the Queue Viewer, go to Grid > Job Info > By Queue. When Fairshare tree data is found in a queue, a View Fairshare Tree icon appears in the Actions icon list for the queue. The plugin shows the immediate sub-level of the selected node and displays the usage and configured ratio in a Fairshare Tree tab.

To configure the Fairshare Tree Host Slot Selection Criteria, go to Console > Grid Management. Select an existing cluster or add a cluster, and then select the Advanced tab.

View Job Array Listing page

Go to the View Job Array Listing page by clicking By Array under the Job Info section of the Grid menu bar. The page shows information similar to the LSF bjobs -A <job_id> command, but also includes aggregate information for the job array as a whole.

The information that is shown on the page is as follows:

Array ID. The job array ID.
Job Name. The name of the job.
User ID. The identifier of the user who submitted the job array.
Total Jobs. The total number of jobs in the job array.
Pending Jobs. The number of jobs that remain pending in the job array.
Running Jobs. The number of currently running jobs.
Done Jobs. The number of jobs that are completed without error.
Exit Jobs. The number of jobs where errors prevented the job from completing.
Array Effic. The average CPU efficiency of jobs in the job array.
Avg Memory. The average memory that is used by jobs in the array.
Avg Swap. The average swap space that is used by jobs in the array.
Total CPU Time. The total CPU time that is used by all started jobs in the job array.

If you select any array, you are directed to a display of all "ACTIVE" and "FINISHED" jobs within the job array.

View Job Listing page

Go to the View Job Listing page by clicking Details under the Job Info section of the Grid menu bar. Filter batch job information to view only the job types you are interested in.

You can filter your view of the data by providing a resource string that conforms to the LSF bhosts -R command format. The page then displays jobs that are running on hosts that match the resource requirement. The filter does not take job-specific resource requirements into account.

Clear the Dynamic check box if you do not want to immediately update page information each time you change a filter setting, and instead want to wait until you complete all filter settings and then click Go.

The information that is shown on the page is as follows:

Job ID. The job ID that LSF® assigned to the job. Click a job number to view an information page that contains details about that job (including general job information, job submission details, the job execution environment, current/last job status, and a graphical job history).
Job Name. The name of the job.
Status. The status of the job.
State Changes. The number of times that the status of the job was changed.
User ID. The LSF user who submitted the job.
Mem Usage. Total resident memory usage of all processes in the job, in MB.
VM Size. Total virtual memory usage of all processes in the job, in MB.
CPU Usage. CPU utilization for this job.
CPU Effic. The efficiency with which this job is using the CPU allocated to it, expressed as a percentage.
Start Time. The time at which the job was started.
Pend. The length of time for which the job was in the pending state.
Run. The length of time for which the job was running.
SSusp. The length of time the job was suspended by the system.
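
As a rough illustration of the CPU Effic column, the following Python sketch computes efficiency as CPU time divided by the CPU time that the job's slots could have delivered over its run time. The function name and the formula are assumptions made for illustration; the exact calculation that RTM uses is not described here.

```python
def cpu_efficiency_pct(cpu_seconds: float, run_seconds: float, slots: int = 1) -> float:
    """Return CPU efficiency as a percentage (assumed definition, not RTM's own)."""
    if run_seconds <= 0 or slots <= 0:
        return 0.0  # nothing has run yet, so there is no efficiency to report
    return 100.0 * cpu_seconds / (run_seconds * slots)


# A job that used 90 CPU-minutes over a 60-minute run on 2 slots.
print(cpu_efficiency_pct(90 * 60, 60 * 60, slots=2))  # prints 75.0
```

Under this assumed definition, a low percentage suggests that the job is holding slots it does not fully use.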

At the bottom of the Details page, color codes indicate job efficiency thresholds, including Warning, Alarm, Flapping, and Dependencies. You can set the color for each of these thresholds, along with the thresholds themselves, from the Console tab on the Grid Settings > Status/Events page.

View Jobs by Application

Go to the Batch Application page by clicking By Application under the Job Info section of the Grid menu bar. You can quickly check job details by application on the page.

By default, only applications with running jobs are listed. To view all applications, select the Include Unused Applications check box when you filter the search.

Note:

LSF applications must first be configured in:

$LSF_ENVDIR/lsbatch/<clustername>/configdir/lsb.applications
Unused application statuses are cleared at regular intervals. To configure the interval, go to Console > Grid Settings > Poller tab and set it in the Cluster Graph Management section.

Click the View Active Jobs by Application action icon to view the active jobs by application.

If you want to see the running jobs by application, click the Application name link.

You can also click the View Application Graphs action icon to see the application graphs. The following six graphs are available:

GRID - Applications - Efficiency
GRID - Applications - Memory Stats
GRID - Applications - Pending Jobs
GRID - Applications - Running Jobs
GRID - Applications - Total CPU
GRID - Applications - VM Stats

View Application Profile in Job Detail page

Go to the Job Detail page by clicking Details under the Job Info section of the Grid menu bar. Filter batch job information to view only the job types you are interested in. You can filter the jobs by application by selecting All or the specific application through the Apps field. Click Go after you complete the filter setting.

Double-click the JobID of the job for which you want to see more details. Click the Job Detail tab.

On the Submission Details section of the page, the Application details are displayed.

Note:

Only users with the View Application Data permission can see the Applications Data. Permission can be set in the realm permission for each user account.

Purge Application information

Purge old application information by setting the application purge frequency.

Click the Console tab.
Under the Configuration section of the Console menu bar, click Grid Settings.
Click the Poller tab.
In the Cluster Graph Management section, set Queue Host Group and Application Purge Frequency to the length of time after an application is removed from the system that you want to wait before the corresponding graphs are removed.
Click Save.

View Jobs by Group

Go to the Batch Job Groups page by clicking By Group under the Job Info section of the Grid menu bar. You can quickly check job details by group from the page.

Click the View Active Jobs by Job Group action icon to view jobs by job group.

If you want to view running jobs by job group, click the corresponding Group Name link.

Note:

If you filter the Group by Level, data from hidden sublevels are shown in the upper level job group.

The maximum level on the Level drop-down list is based on the job group aggregation level setting.
By default, unused job groups are not displayed. You can filter your search to show unused groups by clicking the Include Unused Groups search criteria.

You can also click the View Job Group Graphs action icon to show more information about the group. It displays six graphs with statistics on job efficiency, memory, pending jobs, running jobs, total CPU, and VM usage. These graphs give an overview of your jobs so that you can manage them more flexibly:

GRID - Job Groups - Efficiency
GRID - Job Groups - Memory Stats
GRID - Job Groups - Pending Jobs
GRID - Job Groups - Running Jobs
GRID - Job Groups - Total CPU
GRID - Job Groups - VM Stats

Set Job Group Aggregation Level Limit

Set the job group aggregation level to limit the Level filter options.

Click the Console tab.
Under the Configuration section of the Console menu bar, click Grid Settings.
Click the Aggregation tab.
On the Job Group Tracking section, specify the Aggregation Level.

RTM aggregates job group information only by the specified level. If the level is set to 0, the job group data is not aggregated.

Note:
Only users with the View Job Group Data permission can see the Group Data. Permission can be set in the realm permission for each user account.
Click Save.
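
To illustrate how an aggregation level rolls up job group data, here is a minimal Python sketch. It assumes that a level of N means "keep the first N components of the job group path" and that level 0 leaves the data unchanged. The function name and the sample job group paths are hypothetical and are not taken from RTM.

```python
from collections import defaultdict


def aggregate_job_groups(running_jobs: dict[str, int], level: int) -> dict[str, int]:
    """Roll job counts up to the first `level` components of each job group path."""
    if level <= 0:
        return dict(running_jobs)  # level 0: data is left unaggregated (assumption)
    totals: dict[str, int] = defaultdict(int)
    for path, count in running_jobs.items():
        parts = [p for p in path.split("/") if p]
        # Sublevels deeper than `level` are folded into their ancestor group.
        totals["/" + "/".join(parts[:level])] += count
    return dict(totals)


groups = {"/eng/sim/nightly": 4, "/eng/sim/adhoc": 2, "/eng/build": 3, "/ops": 1}
print(aggregate_job_groups(groups, 2))
# {'/eng/sim': 6, '/eng/build': 3, '/ops': 1}
```

With level 1, the same sample data collapses to /eng and /ops, which mirrors how data from hidden sublevels appears in the upper-level job group.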

Purge Job Group information

Purge old job group information by setting the job group purge frequency.

Click the Console tab.
Under the Configuration section of the Console menu bar, click Grid Settings.
Click the Poller tab.
In the Cluster Graph Management section, set User, User Group, Job Group, Project, License Project Purge Frequency to the length of time after a job group was last updated that you want to wait before the corresponding graphs are automatically removed.
Click Save.

View Job Group in Job Detail page

Go to the Job Detail page by clicking Details under the Job Info section of the Grid menu bar. Filter batch job information to view only the job types you are interested in. You can filter the jobs by job group by selecting the group in the JGroup field. Click Go after you complete the filter setting.

Double-click the JobID of the job for which you want to see more details. Click the Job Detail tab.

On the Submission Details section of the page, the Job Group details are displayed.

View Exception Status in Job Detail page

Important:

You must define the following parameters in the LSF lsb.queues file:

JOB_IDLE for idle job exception handling
JOB_OVERRUN for exception handling of jobs that run longer than the specified run time
JOB_UNDERRUN for exception handling of jobs that exit before the specified number of minutes

After you configure these parameters, you can submit jobs to the new queue which now includes the exception status definition.
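
The following Python sketch shows one way to reason about which exception status a job would receive. It assumes that JOB_IDLE is a threshold on the ratio of CPU time to run time and that JOB_OVERRUN and JOB_UNDERRUN are expressed in minutes; check the LSF lsb.queues reference for the authoritative semantics. The function and its parameter names are hypothetical and are not part of LSF or RTM.

```python
from typing import Optional


def exception_status(cpu_seconds: float, run_seconds: float, finished: bool,
                     job_idle: Optional[float] = None,
                     job_overrun_min: Optional[float] = None,
                     job_underrun_min: Optional[float] = None) -> list[str]:
    """Return the exception labels a job would trigger under the given thresholds."""
    labels = []
    run_minutes = run_seconds / 60.0
    # Idle: the ratio of CPU time to run time falls below the JOB_IDLE threshold.
    if job_idle is not None and run_seconds > 0:
        if cpu_seconds / run_seconds < job_idle:
            labels.append("idle")
    # Overrun: the job has run longer than JOB_OVERRUN minutes.
    if job_overrun_min is not None and run_minutes > job_overrun_min:
        labels.append("overrun")
    # Underrun: the job exited before JOB_UNDERRUN minutes elapsed.
    if job_underrun_min is not None and finished and run_minutes < job_underrun_min:
        labels.append("underrun")
    return labels


# A running job with 30 CPU-seconds after one hour, checked against sample thresholds.
print(exception_status(30, 3600, False, job_idle=0.1, job_overrun_min=45))
# ['idle', 'overrun']
```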

Go to the Job Detail page by clicking Details under the Job Info section of the Grid menu bar. Filter batch job information to view only the job types you are interested in. Click Go after you complete the filter setting.

Double-click the JobID of the job for which you want to see more details. Click the Job Detail tab.

On the Current/Last Status section of the page, the Exception Status (such as idle, overrun, or underrun) details are shown. Abnormal exit shows job exit information that is predefined in LSF.

Monitor Host level rusage in the Job Details page

You can monitor host level memory usage for every job you submit using the blaunch command in LSF. Make sure that the LSF_HPC_EXTENSIONS = HOST_RUSAGE parameter is configured in the lsf.conf file.

Enable Pending Reason History Reporting

You can collect and store all pending reasons throughout a job’s lifecycle. From this information, you can identify the area where jobs are pending for the longest duration.

Note:

By default, the pending reason collection and aggregation feature is disabled.
Collecting pending reasons from LSF increases the polling time.
For large clusters with over 100K jobs per day, you must configure the CONDENSE_PENDING_REASON parameter in LSF. Refer to the LSF documentation for more details.

Click the Console tab.
Under the Configuration section of the Console menu bar, click Grid Settings.
Click the Poller tab.
In the Job Collection Settings section, select the Enable Pending Reason History Reporting check box.
Click Save.

Pending reasons duration analysis starts and a Gantt chart displays the Pending Reason History Report on the Pending Reasons tab in the job’s Details page.

Filter pending reasons on the Settings page

You can set the Pending Reason History Report to ignore specific pending reasons.

Click the Grid tab.
Click the Settings tab.
Click the Pending Reason tab.
Click the check box corresponding to the reasons you want the system to ignore.

You can also enter the reason in the Ignore Pending Reason by text box under the General Setting (by RegExp) section.
Click Save.

Note:

You can enter the reason to be ignored from General Setting (by RegExp).
The Pending Reason and Suspend Reason sections list the normal pending reasons (such as job slot limit reached, load information unavailable, idle time is not long enough).
The Load Indices section lists resource-based reasons (such as r15s, the load averaged over the last 15 seconds; r1m, the load averaged over the last minute; and io, the disk I/O rate averaged over the last minute).

The list includes all currently known RTM pending and suspend reasons. Not all LSF implemented pending and suspend reasons are included.
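
As an illustration of ignoring pending reasons by regular expression, the following Python sketch filters a list of reason strings with a pattern. The pattern and the reason strings are examples only, and Python's re syntax is used here as an assumption; the regular expression dialect that RTM accepts may differ.

```python
import re

# Hypothetical ignore pattern; the reason strings below are examples only.
ignore = re.compile(r"job slot limit|load information unavailable", re.IGNORECASE)

reasons = [
    "Job slot limit reached",
    "Load information unavailable",
    "Idle time is not long enough",
]

# Keep only the reasons that the ignore pattern does not match.
kept = [r for r in reasons if not ignore.search(r)]
print(kept)  # ['Idle time is not long enough']
```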

View Job Pending Reason History page

Go to the Job Pending Reason History page by clicking Details under the Job Info section of the Grid menu bar. Filter batch job information to view only the job types you are interested in. Click Go after you complete the filter settings.

Double-click the JobID of the job for which you want to see the pending reason history details.

Click the Pending Reasons tab. The Job Pending Reason History page displays the Pending Reason History report.

Note:

The Pending Reasons tab is only visible if a job has at least one pending reason that occurred in the past.
By default, the report is sorted by pending duration, lists the top 20 pending reasons, and applies the ignore setting of the RTM system. The report can be sorted by the following fields:
- Pending Duration: sorts by the total duration that the job spent pending for each reason
- Pending Reason Text: sorts by the pending reason text
- Start Time of the Pending Reason: sorts by the start time of each pending reason
- End Time of the Pending Reason: sorts by the end time of each pending reason
The time header unit automatically adapts to seconds, minutes, hours, or days, depending on the pending reason history.
The report shows a vertical Now line if the job is pending or suspended.
A red line is displayed in the report when the job changes to the running state. It marks the current time for a job that is not yet finished.

The Pending Reasons tab remains even if the job is finished.
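
The default report behavior described in the note (sum the time spent per reason, skip ignored reasons, keep the top 20, and pick a suitable time unit) can be sketched in Python as follows. This is an illustrative reconstruction, not RTM's implementation: the function names, the tuple layout, and the unit thresholds are all assumptions.

```python
from collections import defaultdict


def top_pending_reasons(intervals, ignored=(), limit=20):
    """Sum pending time per reason and return the longest `limit` reasons.

    `intervals` is an iterable of (reason, start_epoch, end_epoch) tuples.
    """
    totals = defaultdict(int)
    for reason, start, end in intervals:
        if reason not in ignored:
            totals[reason] += end - start
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:limit]


def time_unit(longest_seconds):
    """Pick a display unit for the time header (thresholds are assumptions)."""
    if longest_seconds < 60:
        return "second"
    if longest_seconds < 3600:
        return "minute"
    if longest_seconds < 86400:
        return "hour"
    return "day"


history = [
    ("Job slot limit reached", 0, 900),
    ("Load information unavailable", 900, 960),
    ("Job slot limit reached", 960, 4560),
]
report = top_pending_reasons(history)
print(report, time_unit(report[0][1]))
```

In this sample, the two intervals for the same reason are merged into one total, so the longest-pending reason appears first in the report.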