Monitor performance metrics in real time
Enable performance metric collection, tune the metric sampling period, and use badmin perfmon view to display current performance.
Enable metric collection
Set the SCHED_METRIC_ENABLE=Y parameter in the lsb.params file to enable performance metric collection.
Start performance metric collection dynamically:
badmin perfmon start sample_period
Optionally, you can set a sampling period, in seconds. If no sample period is specified, the default sample period set in the SCHED_METRIC_SAMPLE_PERIOD parameter in the lsb.params file is used.
Stop sampling:
badmin perfmon stop
SCHED_METRIC_ENABLE and SCHED_METRIC_SAMPLE_PERIOD can be specified independently. That is, you can specify SCHED_METRIC_SAMPLE_PERIOD without specifying SCHED_METRIC_ENABLE. In this case, when you turn on the feature dynamically (using badmin perfmon start), the sampling period value defined in SCHED_METRIC_SAMPLE_PERIOD is used.
The badmin perfmon start and badmin perfmon stop commands override the configuration setting in lsb.params. Regardless of the value of SCHED_METRIC_ENABLE, running badmin perfmon start starts performance metric collection, and running badmin perfmon stop stops it.
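The precedence rules above can be summarized as a small model. The following Python sketch is illustrative only: the function names and the way the state is represented are assumptions made for this example and are not part of LSF.

```python
from typing import Optional


def collection_active(sched_metric_enable: bool, last_command: Optional[str]) -> bool:
    """Return True if performance metric collection is running.

    last_command is the most recent dynamic command run since mbatchd
    started: "start", "stop", or None if neither badmin perfmon start
    nor badmin perfmon stop has been run (hypothetical representation).
    """
    if last_command == "start":
        return True   # badmin perfmon start overrides lsb.params
    if last_command == "stop":
        return False  # badmin perfmon stop overrides lsb.params
    # No dynamic command: fall back to SCHED_METRIC_ENABLE in lsb.params.
    return sched_metric_enable


def effective_period(command_period: Optional[int], configured_period: int) -> int:
    """Return the sampling period, in seconds, used by badmin perfmon start.

    command_period is the optional sample_period argument; configured_period
    stands for the value of SCHED_METRIC_SAMPLE_PERIOD.
    """
    if command_period is not None:
        return command_period
    return configured_period


print(collection_active(sched_metric_enable=False, last_command="start"))
print(effective_period(command_period=None, configured_period=120))
```

In this model, a dynamic command always wins over the configured value, which mirrors the override behavior described above.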
Tune the metric sampling period
Set SCHED_METRIC_SAMPLE_PERIOD in lsb.params to specify an initial cluster-wide performance metric sampling period.
Set a new sampling period in seconds:
badmin perfmon setperiod sample_period
Collecting and recording performance metric data may affect the performance of LSF. Smaller sampling periods cause the lsb.streams file to grow faster.
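To get a rough feel for the trade-off, the following Python sketch counts how many sampling periods complete in one day for a given period length. It assumes, purely for illustration, that one set of metric data is recorded per completed sampling period; the actual record size and format in lsb.streams are not modeled.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def samples_per_day(sample_period_seconds: int) -> int:
    """Number of complete sampling periods in one day (illustrative helper)."""
    if sample_period_seconds <= 0:
        raise ValueError("sample period must be a positive number of seconds")
    return SECONDS_PER_DAY // sample_period_seconds


for period in (30, 60, 300):
    print(period, samples_per_day(period))
# prints:
# 30 2880
# 60 1440
# 300 288
```

Halving the sampling period doubles the number of periods per day, which is why shorter periods make the file grow faster.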
Display current performance
Use the badmin perfmon view command to view real-time performance metric information. The following metrics are displayed:
- The number of queries handled by mbatchd
- The number of queries for each of jobs, queues, and hosts (bjobs, bqueues, and bhosts, as well as other daemon requests)
- The number of jobs submitted (divided into job submission requests and jobs actually submitted)
- The number of jobs dispatched
- The number of jobs reordered, that is, the number of jobs that reused the resource allocation of a finished job (applies when RELAX_JOB_DISPATCH_ORDER is enabled in lsb.params or lsb.queues)
- The number of jobs completed
- The number of jobs sent to a remote cluster
- The number of jobs accepted from a remote cluster
- Scheduler performance metrics:
  - Scheduling interval, in seconds. A shorter scheduling interval means that jobs are scheduled more quickly.
  - Host matching criteria: the number of different resource requirement patterns for jobs in use, which may lead to different candidate host groups. The more matching hosts required, the longer it takes to find them, which means a longer scheduling session. The complexity increases with the number of hosts in the cluster.
  - Job buckets: the number of scheduler buckets in which jobs are put based on resource requirements and different scheduling policies. More scheduler buckets means a longer scheduling session.
badmin perfmon view
Performance monitor start time: Fri Jan 19 15:07:54
End time of last sample period: Fri Jan 19 15:25:55
Sample period : 60 Seconds
------------------------------------------------------------------
Metrics                               Last   Max   Min   Avg Total
------------------------------------------------------------------
Processed requests: mbatchd              0    25     0     8   159
Jobs information queries                 0    13     0     2    46
Hosts information queries                0     0     0     0     0
Queue information queries                0     0     0     0     0
Job submission requests                  0    10     0     0    10
Jobs submitted                           0   100     0     5   100
Jobs dispatched                          0     0     0     0     0
Jobs reordered                           0     0     0     0     0
Jobs completed                           0    13     0     5   100
Jobs sent to remote cluster              0    12     0     5   100
Jobs accepted from remote cluster        0     0     0     0     0
------------------------------------------------------------------
File Descriptor Metrics               Free  Used Total
------------------------------------------------------------------
MBD file descriptor usage              800   424  1024
------------------------------------------------------------------
Scheduler Metrics                     Last   Max   Min   Avg
------------------------------------------------------------------
Scheduling interval in seconds(s)        5    12     5     8
Host matching criteria                   5     5     0     5
Job buckets                              5     5     0     5
Scheduler metrics are collected at the end of each scheduling session.
Performance metrics information is calculated at the end of each sampling period. Running badmin perfmon view before the end of the current sampling period displays the metric data collected from the sampling start time to the end of the last completed sample period.
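If you want to post-process this output in a script, the metric rows can be extracted from the text. The following Python sketch assumes that each metric row is a label followed by whitespace-separated integers, as in the sample output above; the function name parse_perfmon_view and the embedded sample are hypothetical and used only for illustration. Output from a real cluster may differ in layout.

```python
import re
from typing import Dict, List

# A metric row is a label followed by one or more whitespace-separated integers.
_ROW = re.compile(r"^(?P<label>.*?)(?P<values>(?:\s+\d+)+)\s*$")


def parse_perfmon_view(text: str) -> Dict[str, List[int]]:
    """Map each metric label to its list of integer column values."""
    metrics: Dict[str, List[int]] = {}
    for line in text.splitlines():
        match = _ROW.match(line)
        if match is None:
            continue  # skip headers, separator lines, and timestamp lines
        label = match.group("label").strip()
        if label:
            metrics[label] = [int(v) for v in match.group("values").split()]
    return metrics


# Abbreviated copy of the sample output shown above.
SAMPLE = """\
Performance monitor start time: Fri Jan 19 15:07:54
Sample period : 60 Seconds
Metrics Last Max Min Avg Total
Processed requests: mbatchd 0 25 0 8 159
Jobs submitted 0 100 0 5 100
MBD file descriptor usage 800 424 1024
Scheduling interval in seconds(s) 5 12 5 8
"""

parsed = parse_perfmon_view(SAMPLE)
print(parsed["Jobs submitted"])  # [0, 100, 0, 5, 100]
```

The values are returned in column order, so the caller must know which table a row belongs to (for example, Last, Max, Min, Avg, Total for the main metrics table).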
If no metrics have been collected because the first sampling period has not yet ended, badmin perfmon view displays:
badmin perfmon view
Performance monitor start time: Thu Jan 25 22:11:12
End time of last sample period: Thu Jan 25 22:11:12
Sample period : 120 Seconds
------------------------------------------------------------------
No performance metric data available. Please wait until first sample period ends.
badmin perfmon output
- Sample Period
  - The current sample period.
- Performance monitor start time
  - The start time of sampling.
- End time of last sample period
  - The end time of the last sampling period.
- Metric
  - The name of the metric.
- Total
  - The accumulated counter value for each metric. It is counted from Performance monitor start time to End time of last sample period.
- Last Period
  - The value of the metric in the last sampling period (shown in the Last column). It is calculated per sampling period. It is represented as the metric value per period, and normalized by the following formula:
    LastPeriod = (Metric Counter Value of Last Period / Sample Period Interval) * Sample Period
- Max
  - The maximum sampling value of the metric. It is reevaluated in each sampling period by comparing Max and Last Period. It is represented as the metric value per period.
- Min
  - The minimum sampling value of the metric. It is reevaluated in each sampling period by comparing Min and Last Period. It is represented as the metric value per period.
- Avg
  - The average sampling value of the metric. It is recalculated in each sampling period. It is represented as the metric value per period, and normalized by the following formula:
    Avg = (Total / (Last PeriodEndTime - SamplingStartTime)) * Sample Period
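As a worked example of the two normalization formulas, the following Python sketch recomputes the Avg value of the Jobs submitted metric from the sample output earlier in this topic (Total of 100, a 60-second sample period, sampling from 15:07:54 to 15:25:55). The function names are hypothetical, and the year in the timestamps is a placeholder because the output does not show one; only the elapsed time matters.

```python
from datetime import datetime


def normalized_last_period(counter_value: float, interval_seconds: float,
                           sample_period_seconds: float) -> float:
    """LastPeriod = (counter value of last period / interval) * sample period."""
    return counter_value / interval_seconds * sample_period_seconds


def normalized_average(total: float, sampling_start: datetime,
                       last_period_end: datetime,
                       sample_period_seconds: float) -> float:
    """Avg = (Total / (LastPeriodEndTime - SamplingStartTime)) * sample period."""
    elapsed = (last_period_end - sampling_start).total_seconds()
    if elapsed <= 0:
        raise ValueError("no sampling period has ended yet")
    return total / elapsed * sample_period_seconds


start = datetime(2024, 1, 19, 15, 7, 54)  # placeholder year
end = datetime(2024, 1, 19, 15, 25, 55)   # 1081 seconds after start

avg_jobs_submitted = normalized_average(100, start, end, 60)
print(round(avg_jobs_submitted, 2))  # 5.55
print(int(avg_jobs_submitted))       # 5
```

The sample output shows 5 for this metric, which is consistent with the computed value being displayed as a whole number. That is an observation about this sample only, not a documented rounding rule.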
Reconfigure your cluster with performance metric sampling enabled
If performance metric sampling is enabled dynamically with badmin perfmon start, you must enable it again after running badmin mbdrestart.
- If performance metric sampling is enabled by default, the sampling start time (Performance monitor start time) is reset to the time at which mbatchd is restarted.
- Run the badmin mbdrestart command after you change the SCHED_METRIC_ENABLE or SCHED_METRIC_SAMPLE_PERIOD parameters. In this context, the badmin reconfig command has the same effect as the badmin mbdrestart command.
Performance metric logging in lsb.streams
By default, collected metrics are written to the lsb.streams file.
If EVENT_STREAM_FILE is defined and is valid, collected metrics are written to the file specified by EVENT_STREAM_FILE.
Performance metric collection can still be turned on when ENABLE_EVENT_STREAM=N is defined, but in this case no metric data is logged.
Job arrays and job packs
Every job submitted in a job array or job pack is counted individually, except for the Job submission requests metric.
The entire job array or job pack counts as just one job submission request.
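The counting rule for job arrays and job packs can be sketched as follows. This Python model is illustrative only; the class and field names are assumptions for this example and do not correspond to any LSF interface.

```python
from dataclasses import dataclass


@dataclass
class SubmissionCounters:
    job_submission_requests: int = 0
    jobs_submitted: int = 0

    def submit(self, job_count: int = 1) -> None:
        """Record one submission that creates job_count jobs.

        job_count is 1 for an ordinary job, or the number of elements
        in a job array or the number of jobs in a job pack.
        """
        if job_count < 1:
            raise ValueError("a submission must contain at least one job")
        self.job_submission_requests += 1  # the whole array or pack is one request
        self.jobs_submitted += job_count   # every job is counted individually


counters = SubmissionCounters()
counters.submit(job_count=100)  # one job array with 100 elements
counters.submit()               # one ordinary job
print(counters)
# SubmissionCounters(job_submission_requests=2, jobs_submitted=101)
```

This is why the Job submission requests metric can be much smaller than the Jobs submitted metric on clusters that make heavy use of job arrays.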
Job rerun
Job rerun occurs when an execution host becomes unavailable while a job is running. The job is first returned to its original queue and is dispatched again later, when a suitable host is available.
In this case, LSF counts only one job submission request and one job submitted, but counts n jobs dispatched and n jobs completed, where n is the number of times the job reruns before it finishes successfully.
Job requeue
Requeued jobs may be dispatched, run, and exit because of specific errors repeatedly. The job data remains in memory throughout, so LSF counts only one job submission request and one job submitted, but counts more than one job dispatched.
For jobs completed, if a job is requeued with brequeue, LSF counts two jobs completed, because requeuing a job first kills the job and then puts it back in the pending list. If the job is requeued automatically, LSF counts one job completed when the job finishes successfully.
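The requeue counting rules can be sketched for the simplest case: a single job that is dispatched, requeued once, and then finishes successfully. The Python model below is illustrative only. The names are hypothetical, and the value of 2 for jobs dispatched is an assumption for this single-requeue scenario (the text above states only that more than one dispatch is counted).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricCounts:
    job_submission_requests: int
    jobs_submitted: int
    jobs_dispatched: int
    jobs_completed: int


def counts_after_one_requeue(requeued_with_brequeue: bool) -> MetricCounts:
    """Counts for one job that is requeued once and then finishes successfully."""
    return MetricCounts(
        job_submission_requests=1,  # the job data stays in memory
        jobs_submitted=1,
        jobs_dispatched=2,          # assumed: first dispatch plus one redispatch
        # brequeue kills the job first, which is counted as a completion.
        jobs_completed=2 if requeued_with_brequeue else 1,
    )


print(counts_after_one_requeue(requeued_with_brequeue=True).jobs_completed)   # 2
print(counts_after_one_requeue(requeued_with_brequeue=False).jobs_completed)  # 1
```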
Job replay
When job replay is finished, submitted jobs are not counted in the Job submission requests and Jobs submitted metrics, but are counted in the Jobs dispatched and Jobs completed metrics.