Logging mbatchd performance metrics

LSF provides a feature that lets you log performance metrics for mbatchd. This feature is useful for troubleshooting large clusters that have performance problems. In such cases, mbatchd may be slow to handle high volumes of requests such as:

Job submission
Job status requests
Job rusage requests
Client info requests causing mbatchd to fork

For example, the output for a large cluster may appear as follows:

Nov 14 20:03:25 2012 25408 4 10.1.0 sample period: 120 120
Nov 14 20:03:25 2012 25408 4 10.1.0 job_submission_log_jobfile logJobInfo: 14295 0 179 0 3280 0 10 0 160 0 10 0 990
Nov 14 20:03:25 2012 25408 4 10.1.0 job_submission do_submitReq: 14295 0 180 0 9409 0 100 0 4670 0 10 0 1750
Nov 14 20:03:25 2012 25408 4 10.1.0 job_status_update statusJob: 2089 0 1272 1 2840 0 10 0 170 0 10 0 120
Nov 14 20:03:25 2012 25408 4 10.1.0 job_dispatch_read_jobfile readLogJobInfo: 555 0 256 0 360 0 10 0 70 0 10 0 50
Nov 14 20:03:25 2012 25408 4 10.1.0 mbd_query_job fork: 0 0 0 0 0 0 0 0 0 0 0 0 0
Nov 14 20:03:25 2012 25408 4 10.1.0 mbd_channel chanSelect/chanPoll: 30171 0 358 0 30037 0 10 0 3930 0 10 0 1270
Nov 14 20:03:25 2012 25408 4 10.1.0 mbd_query_host fork: 0 0 0 0 0 0 0 0 0 0 0 0 0
Nov 14 20:03:25 2012 25408 4 10.1.0 mbd_query_queue fork: 0 0 0 0 0 0 0 0 0 0 0 0 0
Nov 14 20:03:25 2012 25408 4 10.1.0 mbd_query_child fork: 19 155 173 160 3058 0 0 0 0 150 170 160 3040
Nov 14 20:03:25 2012 25408 4 10.1.0 mbd_other_query fork: 0 0 0 0 0 0 0 0 0 0 0 0 0
Nov 14 20:03:25 2012 25408 4 10.1.0 mbd_non_query_fork fork: 0 0 0 0 0 0 0 0 0 0 0 0 0

In the first line (sample period: 120 120), the first value is the configured sample period in seconds. The second value is the actual sample period in seconds.
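As a minimal sketch, the two sample period values can be pulled out of that first line with a few lines of Python. The function name parse_sample_period is hypothetical, and the sketch assumes each record occupies a single physical line in the log file. The leading fields before "sample period" are not interpreted here.

```python
from typing import Tuple


def parse_sample_period(line: str) -> Tuple[int, int]:
    """Return (configured_seconds, actual_seconds) from a 'sample period' line."""
    # The timestamp also contains colons, so split on the last colon only.
    head, sep, tail = line.rpartition(":")
    if not sep or not head.rstrip().endswith("sample period"):
        raise ValueError("not a sample period line")
    # Unpacking raises ValueError if there are not exactly two values.
    configured, actual = (int(value) for value in tail.split())
    return configured, actual


line = "Nov 14 20:03:25 2012 25408 4 10.1.0 sample period: 120 120"
configured, actual = parse_sample_period(line)
print(configured, actual)  # prints: 120 120
```

Comparing the two values is a quick way to see whether the actual sampling interval drifted from the configured one.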

The format for each remaining line is:

metricsCategoryName functionName count rt_min rt_max rt_avg rt_total ut_min ut_max ut_avg ut_total st_min st_max st_avg st_total

Where:

count: Total number of calls to this function in this sample period
rt_min: Minimum runtime of one call to the function in this sample period
rt_max: Maximum runtime of one call to the function in this sample period
rt_avg: Average runtime of the calls to the function in this sample period
rt_total: Total runtime of all the calls to the function in this sample period
ut_min: Minimum user mode CPU time of one call to the function in this sample period
ut_max: Maximum user mode CPU time of one call to the function in this sample period
ut_avg: Average user mode CPU time of the calls to the function in this sample period
ut_total: Total user mode CPU time of all the calls to the function in this sample period
st_min: Minimum system mode CPU time of one call to the function in this sample period
st_max: Maximum system mode CPU time of one call to the function in this sample period
st_avg: Average system mode CPU time of the calls to the function in this sample period
st_total: Total system mode CPU time of all the calls to the function in this sample period

All time values are in milliseconds.
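The record format above maps naturally onto a small parser. The following is a sketch only: the names FIELDS, Metric, and parse_metric_line are hypothetical, each record is assumed to occupy a single physical line in the log file, and the leading fields before the category name are skipped rather than interpreted. The usage at the end estimates what share of a 120-second sample period one function accounted for, assuming runtime is read as elapsed time.

```python
from typing import Dict, NamedTuple

# Field order taken from the format line in this section.
FIELDS = (
    "count",
    "rt_min", "rt_max", "rt_avg", "rt_total",
    "ut_min", "ut_max", "ut_avg", "ut_total",
    "st_min", "st_max", "st_avg", "st_total",
)


class Metric(NamedTuple):
    category: str
    function: str
    values: Dict[str, int]  # times are in milliseconds


def parse_metric_line(line: str) -> Metric:
    """Parse one metrics record into its category, function, and values."""
    # The timestamp also contains colons, so split on the last colon only.
    head, sep, tail = line.rpartition(":")
    numbers = [int(value) for value in tail.split()]
    if not sep or len(numbers) != len(FIELDS):
        raise ValueError("not a metrics record")
    # The last two tokens before the colon are the category and function.
    category, function = head.split()[-2:]
    return Metric(category, function, dict(zip(FIELDS, numbers)))


line = (
    "Nov 14 20:03:25 2012 25408 4 10.1.0 mbd_channel chanSelect/chanPoll: "
    "30171 0 358 0 30037 0 10 0 3930 0 10 0 1270"
)
metric = parse_metric_line(line)
sample_period_ms = 120 * 1000  # from the "sample period" line of the sample
share = metric.values["rt_total"] / sample_period_ms
print(metric.category, metric.function, metric.values["count"], f"{share:.1%}")
# prints: mbd_channel chanSelect/chanPoll 30171 25.0%
```

Passing the "sample period" line to this function raises ValueError, because that line carries two values rather than thirteen.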

The mbatchd performance logging feature can be enabled and controlled statically through the following parameters in lsf.conf:

LSB_ENABLE_PERF_METRICS_LOG: Lets you enable or disable this feature.
LSB_PERF_METRICS_LOGDIR: Sets the directory in which performance metric data is logged.
LSB_PERF_METRICS_SAMPLE_PERIOD: Determines the sampling period for performance metric data.

For more information on these parameters, see the IBM Platform Configuration Reference.

You can also enable the mbatchd performance metric logging feature dynamically with the badmin perflog command. The -t, -d, and -f options let you specify the sample period, the duration for data logging, and the output directory, respectively. To turn off mbatchd performance metric logging, use the badmin perflog -o command.

For more information, see badmin.

If you enable this feature statically, performance metrics are logged in the mbatchd.perflog.<hostname> file. If you enable the feature dynamically, performance metrics are logged in the log file defined in the command. If you enable the feature statically and then dynamically, the sample period, the log file directory, and the duration defined by the command take effect. After the duration expires, or after you turn off the feature dynamically, the statically defined settings are restored.
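The precedence described above can be modeled in a few lines. This is an illustrative model only, not LSF code: PerfLogSettings, effective_settings, and the directory paths are hypothetical, and the dynamic_active flag stands in for "the duration has not expired and the feature has not been turned off with badmin perflog -o".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PerfLogSettings:
    sample_period: int  # in whatever unit the setting was given
    log_dir: str


def effective_settings(
    static: Optional[PerfLogSettings],
    dynamic: Optional[PerfLogSettings],
    dynamic_active: bool,
) -> Optional[PerfLogSettings]:
    """Return the settings in force, or None when logging is off."""
    # Dynamic settings win while they are active.
    if dynamic is not None and dynamic_active:
        return dynamic
    # Otherwise fall back to the static settings, if any were defined.
    return static


static = PerfLogSettings(sample_period=5, log_dir="/example/static/logdir")
dynamic = PerfLogSettings(sample_period=1, log_dir="/example/dynamic/logdir")

print(effective_settings(static, dynamic, dynamic_active=True).log_dir)
print(effective_settings(static, dynamic, dynamic_active=False).log_dir)
```

The first call reports the directory given on the command; the second reports the static one, which mirrors the restore step after the duration expires.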