Maintaining cluster performance under high query load

LSF provides basic performance metrics (badmin perfmon) that sample the three major query operations (bjobs, bhosts, and bqueues). LSF also provides the badmin diagnose -c query command to trace query sources for further troubleshooting.

About this task

Frequent LSF query activities can put a heavy load on the management host, causing it to compete with other system components for CPU, memory, and network bandwidth. The following steps describe how to use the LSF commands to check the cluster performance under high query load.

Procedure

  1. Check if the cluster is under heavy query load.
    1. Enable LSF performance monitoring by running the badmin perfmon start command.
    2. Run the badmin perfmon view command periodically to view the data for reference.

      Optionally, you can specify a sample period, in seconds, when you start performance monitoring.

      % badmin perfmon view
      Performance monitor start time: Thu Sep  26 07:00:08
      End time of last sample period: Thu Sep  26 12:56:48
      Sample period:                  60 Second(s)
      
      ------------------------------------------------------------------------------
      Metrics                                Last     Max     Min     Avg     Total
      ------------------------------------------------------------------------------
      Processed requests: mbatchd           13568   20085    8627   14244   5080543
      Job information queries                2311    4969    1938    2989   1066407
      Host information queries                  6      73       1      12      4611
      Queue information queries                 9      42       4      10      3885
      Job submission requests                1703    2518     554    1525    544160
      ...

    Set SCHED_METRIC_ENABLE=Y in the lsb.params file to enable performance metric collection by default.

    If no sample period is specified with the badmin perfmon start command, LSF uses the default sample period from the SCHED_METRIC_SAMPLE_PERIOD parameter in the lsb.params file.
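
    For example, a minimal lsb.params snippet that enables metric collection with a 60-second sample period (the values shown are illustrative):

      SCHED_METRIC_ENABLE=Y
      SCHED_METRIC_SAMPLE_PERIOD=60

    When you no longer need the metrics, stop monitoring by running the badmin perfmon stop command.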

  2. Enable query diagnosis if the query load is high.

    If the current query load is higher than the average query load in the past, enable query diagnosis to get detailed query records with one of the following methods.

    • Run a dynamic query diagnosis by using the badmin diagnose -c query command:

      badmin diagnose -c query -f file_path -d query_duration

      The -f option specifies where the query log file is written, and the -d option specifies the duration, in minutes, to track query information.

      For example,

      % badmin diagnose -c query -f /tmp/queryload -d 60
      Dynamic query diagnosis started and will last 60 minutes.
    • Enable query diagnosis tracking by default by setting the following parameters in the lsb.params file:
      ENABLE_DIAGNOSE=query
      DIAGNOSE_LOGDIR=file_path

      You can manually turn off query diagnosis by using the -o option:

      % badmin diagnose -c query -o
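
      For example, a minimal lsb.params snippet (/tmp/diagnose is an example directory; use any directory that the LSF administrator can write to):

      ENABLE_DIAGNOSE=query
      DIAGNOSE_LOGDIR=/tmp/diagnose

      Run the badmin reconfig command to apply the lsb.params changes.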

  3. Check query data to identify excessive load.

    LSF provides the following fields for each query request:

    • Time stamp: The time that LSF handled the request
    • Command: The LSF command that was issued (API calls map to the equivalent command)
    • User: The user that issued the request
    • Host: The host from which the request was issued
    • Data Size: The number of bytes of data that were returned to the client
    • Command options: The query command options that were used. Key options include -u all and -l for bjobs.

    For example, run the following queries:

    % bjobs 2 > /dev/null
    % bjobs -uall -l | wc
    % bhosts | wc
    % bhosts -l | wc
    % bqueues | wc
    % bqueues -lr | wc
    % cat /tmp/queryload.querylog.hostA
    Sep 26 17:33:37 2022 bjobs,lsfadmin,hostA,124,0x1A
    Sep 26 17:34:37 2022 bjobs,lsfadmin,hostA,755635420,0xC00010
    Sep 26 17:34:52 2022 bhosts,lsfadmin,hostA,2975544,-
    Sep 26 17:34:58 2022 bhosts,lsfadmin,hostA,2975524,-
    Sep 26 17:34:58 2022 bhosts,lsfadmin,hostA,6482088,0x0
    Sep 26 17:35:13 2022 bqueues,lsfadmin,hostA,29040,0x401
    Sep 26 17:35:23 2022 bqueues,lsfadmin,hostA,32598264,0x1

    The second record shows that bjobs -uall -l sent about 720 MB (755,635,420 bytes) of job data to the client, which indicates over half a million jobs in the cluster. The last record shows that bqueues -lr transferred about 31 MB of data (with a fair share tree).

What to do next

The following are tips to analyze the data:

  • Sort data based on hosts. Analyze the data to see whether a client, such as a script, is performing frequent query operations (for example, a script that runs queries in a loop). After identifying the host, administrators can log in to it to find the processes that are imposing a high query load.
  • Sort data based on users to check who is generating the most queries. After understanding the reasons for the high usage, administrators can devise a better solution that achieves the same goals.
  • Sort data based on data size. Large data transfers can use up network bandwidth at the management host and affect the responsiveness and performance of the entire cluster. Look for ways to replace the heavy queries, or upgrade the management host to a faster network. Example sorting commands follow this list.
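
For example, because each diagnosis record is a comma-separated line (time stamp and command, then user, host, data size, and command options), you can sort and count the records with standard UNIX tools. The following sketch assumes the log file from the earlier example; adjust the field numbers if your log format differs:

  % awk -F, '{print $3}' /tmp/queryload.querylog.hostA | sort | uniq -c | sort -rn | head
  % awk -F, '{print $2}' /tmp/queryload.querylog.hostA | sort | uniq -c | sort -rn | head
  % sort -t, -k4,4rn /tmp/queryload.querylog.hostA | head

The first command counts queries per host, the second counts queries per user, and the third lists the records with the largest data transfers first.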

The following are tips to mitigate query impacts on the cluster:

  • Enable LSF multithreaded queries. Instead of forking a child mbatchd daemon to serve user queries, LSF supports multithreaded mbatchd queries to serve most LSF query operations. Multithreaded mbatchd is useful for large clusters (a thousand hosts or more) with a large job load (hundreds of thousands of jobs).

    Use the following parameters in the lsf.conf file to enable multithreaded mbatchd queries:

    LSB_QUERY_PORT=free_port_number
    LSB_QUERY_ENH=y
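
    After you change these parameters, restart the mbatchd daemon so that the settings take effect; for example, run the following command as the LSF administrator on the management host:

    % badmin mbdrestart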
  • Set a maximum concurrent query threshold. LSF query clients normally retry after waiting for a short time, which can trigger a large number of query threads or processes that consume CPU, memory, and network bandwidth at the same time. In large clusters, the job query load can grow quickly. If your site sees heavy job query traffic, you can tune LSF to limit the number of job queries that mbatchd can handle, which decreases the load on the management host. Prevent a peak query load from overloading the system by setting the MAX_CONCURRENT_QUERY parameter in the lsb.params file to define the maximum number of concurrent job queries that mbatchd handles.

    For example,

    MAX_CONCURRENT_QUERY = 50
  • Avoid using heavy query operations. LSF provides several methods to minimize the volume of data that is processed, which help with common scripting and administration needs:
    1. Write scripts to parse job information.

      Use the bjobs -o field_list command to output only the specified data fields in a format that scripts can parse more easily. Compared with the full bjobs -l output, which sends all job information back to the client, bjobs -o saves unnecessary network traffic.

      For example:

      % bjobs -o "jobid stat queue job_group delimiter='|'" 3
      JOBID|STAT|QUEUE|JOB_GROUP
      3|RUN|low_normal|/myproj/user1
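
      Scripts can then split the output on the delimiter. The following sketch lists the IDs of all running jobs; it assumes that your LSF version supports the bjobs -noheader option, which suppresses the header line:

      % bjobs -u all -noheader -o "jobid stat queue delimiter='|'" | awk -F'|' '$2 == "RUN" {print $1}'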
    2. Gather cluster status information.

      Use the badmin showstatus command to show cluster summary information, instead of a script that runs bjobs -u all, bqueues, and bhosts to get the same data.

      % badmin showstatus
      LSF runtime mbatchd information
      ...
    3. Get user job summary information.

      Use the bjobs -sum command to summarize job information for a user.

      For example,

      % bjobs -u user1 -q regression -sum
      RUN        SSUSP      USUSP      UNKNOWN    PEND       FWD_PEND
      0          0          0          0          640369     0