Monitoring jobs
The RTM JobIQ (Grid Heuristics) tab provides LSF users with a customizable view of their workload compared to the workload of other users. JobIQ also provides multiple cross sections of both user and cluster-wide data contained in RTM.
JobIQ displays information about all your jobs, and compares those jobs to jobs from all other users. It gives you insight into whether your jobs are running longer than they typically run. Two days of job history give you a view into your running and finished jobs. You can also control your jobs directly with actions such as requeue, kill, and suspend.
Use JobIQ to determine the status of all of your jobs by queue and project. See how many jobs are running, pending, and suspended, or where your job is in the queue compared to other users. You can also see the exit reasons for jobs that exited abnormally.
Filter the job view
Customize your view, including the panes to display, and the user to view. You can save those settings for the next time you use the page. The RTM administrator can also set the JobIQ page to be a user's default page on login.
Filter your job view by common problems to show just the information that you want to see. Choose the user, cluster, and queue.
Choose the Views and Charts that you want to see in your JobIQ dashboard.
Limit the number of rows displayed in each table (All, 5, 10, or 20 records), and choose how often to refresh the view (every 1 to 5 minutes, or never).
See your workload history
View your workload history for the last two days. The Summary view creates the Current Status by Cluster (all Queues) table, which shows pending, running, and suspended jobs in all queues, including the number of jobs finished and exited per hour, hourly throughput, and 5-minute throughput.
View your overall workload across multiple clusters
The Daily Throughput by Cluster table shows totals for finished and exited jobs for the current and previous day. Average TT Today shows the mean turnaround time for all jobs that ran today. Turnaround time is the total number of minutes from job submission to job finish.
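The turnaround time definition above can be illustrated with a minimal Python sketch. The function names and the sample timestamps are hypothetical and are not part of RTM; the sketch only shows the arithmetic of averaging submission-to-finish times in minutes.

```python
from datetime import datetime


def turnaround_minutes(submit_time, finish_time):
    """Minutes from job submission to job finish."""
    return (finish_time - submit_time).total_seconds() / 60.0


def average_turnaround(jobs):
    """Mean turnaround in minutes over (submit, finish) pairs; None if there are no jobs."""
    if not jobs:
        return None
    return sum(turnaround_minutes(s, f) for s, f in jobs) / len(jobs)


# Hypothetical sample data: two jobs that finished today.
jobs_today = [
    (datetime(2024, 1, 15, 9, 0), datetime(2024, 1, 15, 9, 30)),    # 30 minutes
    (datetime(2024, 1, 15, 10, 0), datetime(2024, 1, 15, 11, 30)),  # 90 minutes
]

print(average_turnaround(jobs_today))  # 60.0
```

Pending time counts toward turnaround in this sketch, because the interval starts at submission rather than at job start.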
- Your pending jobs per queue over the jobs that are pending ahead of yours, based on the LSF fairshare scheduling dynamic priority. For example, a pending job display of 670 / 230 means that you have 670 pending jobs for your project, and all other users have 230 pending jobs ahead of you in the same queue.
- Your running and suspended jobs over other users' running and suspended jobs in the same queue.
- Your hourly and 5-minute throughput over other users' throughput in the same queue, shown under TPut(1Hr) and TPut(5Min).
- Estimate shows when your jobs are expected to complete, based on throughput and on runtime heuristics (the 70th percentile of job run time for the project in that queue).
- Several warning types are shown in a "stoplight graph". The graph gives status for idle jobs, slow jobs, jobs with dependencies, and memory exceptions.
  - Idle Jobs is red if you have running jobs that are not using any CPU time.
  - Long Jobs is red if your jobs exceed the 90th percentile of jobs from the queue and project with the same resource requirements.
  - Pend Dpnd is yellow if the job has dependencies, and red if you have invalid job dependencies.
  - Mem Use is yellow if your jobs are using less memory than you reserved, and red if your jobs are using much more memory than you reserved.
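The warning rules in the list above can be sketched as simple Python functions. This is an illustration under stated assumptions, not RTM's implementation: the function names, the "green" default color, the nearest-rank percentile method, and the over_factor threshold for "much more memory" are all hypothetical.

```python
def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling of pct% of n
    return ordered[rank - 1]


def idle_status(cpu_seconds_of_running_jobs):
    """Red if any running job has used no CPU time. 'green' default is an assumption."""
    return "red" if any(cpu == 0 for cpu in cpu_seconds_of_running_jobs) else "green"


def long_status(my_run_minutes, comparable_run_minutes):
    """Red if any of your jobs exceeds the 90th percentile of comparable jobs."""
    threshold = percentile(comparable_run_minutes, 90)
    return "red" if any(r > threshold for r in my_run_minutes) else "green"


def dependency_status(has_dependency, dependency_valid):
    """Yellow for a job with dependencies, red if a dependency is invalid."""
    if has_dependency and not dependency_valid:
        return "red"
    return "yellow" if has_dependency else "green"


def memory_status(used_mb, reserved_mb, over_factor=1.5):
    """Yellow when under the reservation, red when far over it.

    over_factor is a hypothetical cutoff for 'much more memory'.
    """
    if used_mb > reserved_mb * over_factor:
        return "red"
    return "yellow" if used_mb < reserved_mb else "green"


print(long_status([10], list(range(1, 11))))  # red
print(memory_status(used_mb=512, reserved_mb=2048))  # yellow
```

The actual thresholds that RTM applies may differ, so treat the colors here as a way to read the graph rather than a way to reproduce it.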
In the Current Status by Queue/Project table, click the name of a queue or project to see charts in the Graphs pane. The pane displays multiple time series graphs that compare your workload to the workload of other users and summarize pending, running, and finished jobs over the time period selected in the filter area. On the General pane, the grid heuristics plug-in provides multiple high-level graphs, by queue, that summarize memory usage and job throughput over the selected time period.
View the resources that your workload is consuming
View the resources that your workload is consuming as compared to other users. In the Current Status by Queue/Project table, click the number of jobs that are pending, running, or suspended to show a list of jobs in that state. On that page, filter the job list by cluster, queue, project, or job status, and export the table to a .csv file for further analysis. The list shows the warning status for each job: idle jobs, slow jobs, jobs with dependencies, and memory exceptions.
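As one example of further analysis on an exported .csv file, the following Python sketch counts jobs per queue and status. The column names (job_id, queue, project, status) and the sample rows are assumptions for illustration; check the header row of your actual export and adjust the names accordingly.

```python
import csv
import io
from collections import Counter

# Hypothetical export contents; a real export may use different column names.
SAMPLE_EXPORT = """job_id,queue,project,status
101,normal,chipA,RUN
102,normal,chipA,PEND
103,priority,chipB,RUN
104,normal,chipB,PEND
"""


def count_jobs(csv_text, key_columns=("queue", "status")):
    """Count rows grouped by the given columns of a CSV document."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(tuple(row[col] for col in key_columns) for row in reader)


counts = count_jobs(SAMPLE_EXPORT)
for (queue, status), total in sorted(counts.items()):
    print(f"{queue:10s} {status:5s} {total}")
```

To read a file from disk instead of the sample string, pass the result of open(path, newline="").read() to count_jobs.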
Click a job ID in the job listing table to see details about that job.
In the job details view, use Job Graphs to see graphs of resource usage for the job. The graphs show memory consumption, CPU time, and running process IDs (PIDs) for the job, and the variation over time for those resources. The Hosts Graphs shows various host-based statistics for the job.
View license usage
The Feature Checkouts for User table allows you to examine license usage by user. The page also contains a column to show checked out licenses.
View Pending Reasons by Queue
The View Pending Reasons by Queue for User table displays the number of pending jobs per user and their reasons, as well as the time spent pending.
See which jobs have exceptions
Choose Exit Analysis to get insight into potential problem jobs. The Exit Analysis by Queue/Project table shows jobs that have runtime, idling, or memory exceptions over the last day. You can then drill down to control those jobs. The exit analysis page shows the number of jobs that exited and the LSF exit reason.
Control your jobs
On the page that lists your jobs, use the Choose an action menu to control any of your jobs in the list. Move jobs up and down in the queue, switch jobs to another queue, force jobs to run now (with appropriate permission), suspend, resume, kill, requeue, or signal jobs.
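For comparison, LSF provides command-line tools for similar job actions (bkill, bstop, bresume, brequeue, btop, bbot, and bswitch). The Python sketch below only builds the argument lists and does not run anything. The mapping from menu actions to commands is an assumption for illustration; this topic does not state how RTM carries out the actions.

```python
import shlex

# Assumed correspondence between job actions and LSF commands.
ACTION_COMMANDS = {
    "kill": ["bkill"],
    "suspend": ["bstop"],
    "resume": ["bresume"],
    "requeue": ["brequeue"],
    "move to top": ["btop"],
    "move to bottom": ["bbot"],
}


def build_command(action, job_id):
    """Return the argument list for a single-job action, without executing it."""
    if action not in ACTION_COMMANDS:
        raise ValueError(f"unsupported action: {action}")
    return ACTION_COMMANDS[action] + [str(job_id)]


def build_switch_command(queue, job_id):
    """Return the argument list that moves a job to another queue."""
    return ["bswitch", queue, str(job_id)]


# 1234 and "priority" are hypothetical values.
print(shlex.join(build_command("suspend", 1234)))
print(shlex.join(build_switch_command("priority", 1234)))
```

Building the command as a list keeps the job ID and queue name as separate arguments, which avoids shell quoting problems if the list is later passed to subprocess.run.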