GRID - Cluster GridJobs Stats

GridJobs shows statistics for the poller that extracts job-level detail from an LSF cluster. The Cluster GridJobs agent makes three discrete passes, one each for pending, running, and completed jobs. The data is extracted from the LSF management batch daemon (MBD), which is the daemon that the measures in the graph refer to.

The PeakMBD measure shows which of the three phases took the longest to respond. Typically, this is the pass that gathers data about pending jobs. The total MBD time (the purple line) is the sum of the three phases. In this case, the grid jobs collection is dominated by a high number of pending jobs in the cluster. This is not obvious from the graph, but a large job array was submitted before this snapshot was taken.
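The relationship between the three phases, the peak, and the total can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above: the function name, the dictionary keys, and the timing values are hypothetical and do not come from RTM or LSF.

```python
def summarize_mbd_phases(pending_s, running_s, completed_s):
    """Return the slowest phase and the total MBD time, in seconds.

    All names and values here are hypothetical; they only mirror the
    arithmetic behind the PeakMBD and total MBD measures.
    """
    phases = {
        "pending": pending_s,
        "running": running_s,
        "completed": completed_s,
    }
    # The peak is the phase with the longest MBD response time.
    peak_phase = max(phases, key=phases.get)
    return {
        "peak_phase": peak_phase,
        "peak_mbd_s": phases[peak_phase],
        # The total MBD time is the sum of the three phases.
        "total_mbd_s": sum(phases.values()),
    }


# Made-up timings for a cluster with a large backlog of pending jobs.
summary = summarize_mbd_phases(pending_s=42.0, running_s=11.5, completed_s=6.5)
print(summary)
```

With these made-up numbers the pending pass is the peak, which matches the typical case described above.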

The red line is the actual elapsed run time of the poller. The difference between the red line and the purple line is the overhead involved in writing all the job-level data to the SQL database. As the following figure shows, on busy clusters the run time of this poller can be significant. Choose the polling interval with care so that RTM does not place excessive load on the LSF cluster. This graph is useful for understanding the load characteristics of this agent.
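As a rough aid when reading the graph, the database overhead and the share of the polling interval consumed by one poller run can be estimated as follows. This is a minimal sketch under stated assumptions: the function name, the parameters, and the sample values are hypothetical, and RTM does not necessarily expose these figures in this form.

```python
def poller_overhead(elapsed_s, total_mbd_s, polling_interval_s):
    """Estimate database overhead and interval usage for one poller run.

    elapsed_s          -- elapsed run time of the poller (the red line)
    total_mbd_s        -- total MBD time (the purple line)
    polling_interval_s -- configured polling interval (assumed known)
    All three are hypothetical inputs read off the graph or the settings.
    """
    if polling_interval_s <= 0:
        raise ValueError("polling interval must be positive")
    if elapsed_s < total_mbd_s:
        raise ValueError("elapsed time cannot be less than total MBD time")
    # Overhead is the gap between elapsed run time and total MBD time.
    overhead_s = elapsed_s - total_mbd_s
    # Fraction of each polling interval that the poller spends running.
    interval_usage = elapsed_s / polling_interval_s
    return overhead_s, interval_usage


# Made-up example: a 90 s run, 60 s of it in MBD, polled every 300 s.
overhead_s, interval_usage = poller_overhead(90.0, 60.0, 300.0)
print(f"overhead: {overhead_s:.1f} s, interval usage: {interval_usage:.0%}")
```

An interval usage that approaches 100% would suggest that the poller is running almost continuously, which is the situation a longer polling interval is meant to avoid.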

Figure 1. GRID - Cluster GridJobs Stats