Maintaining cluster performance under high query load
LSF provides basic performance metrics through the badmin perfmon command, which samples three major query operations (bjobs, bhosts, and bqueues). LSF also provides the badmin diagnose -c query command option to trace query sources for further troubleshooting.
About this task
Frequent LSF query activities can put a heavy load on the management host, causing it to compete with other system components for CPU, memory, and network bandwidth. The following steps describe how to use the LSF commands to check the cluster performance under high query load.
Procedure
What to do next
The following are tips to analyze the data:
- Sort data based on hosts. Analyze the data to see if a client, such as a script, is performing frequent query operations (for example, a script that runs queries in a loop). After identifying the host, administrators can log in to the host to find the processes that are imposing a high query load.
- Sort data based on users to check who is generating the most queries. After understanding the reasons for the high usage, administrators can devise a better solution to achieve the same goals.
- Sort data based on data size. Large data transfers consume network bandwidth at the management host and affect the responsiveness and performance of the entire cluster. Look for ways to replace the heavy queries, or upgrade the management host to a faster network.
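The three sorting tips above can be sketched in a few lines of Python. This is only an illustration: the records are hypothetical, already-parsed tuples of (host, user, command, bytes), and the actual layout of the badmin diagnose -c query log is not assumed here, so the parsing step is left out. The names summarize and largest_queries are made up for this example.

```python
from collections import defaultdict

# Hypothetical, already-parsed query records: (host, user, command, bytes_sent).
# These values are invented for illustration; they are not real LSF log data.
records = [
    ("hostA", "user1", "bjobs", 120_000),
    ("hostA", "user1", "bjobs", 118_000),
    ("hostB", "user2", "bhosts", 4_000),
    ("hostA", "user3", "bqueues", 2_500),
    ("hostC", "user2", "bjobs", 950_000),
]


def summarize(records, key_index):
    """Group records by one field and return (key, query_count, total_bytes)
    tuples, sorted so that the key with the most queries comes first."""
    counts = defaultdict(int)
    sizes = defaultdict(int)
    for rec in records:
        key = rec[key_index]
        counts[key] += 1
        sizes[key] += rec[3]
    rows = [(key, counts[key], sizes[key]) for key in counts]
    return sorted(rows, key=lambda row: row[1], reverse=True)


def largest_queries(records, n=3):
    """Return the n individual queries that transferred the most data."""
    return sorted(records, key=lambda rec: rec[3], reverse=True)[:n]


by_host = summarize(records, 0)   # which hosts issue the most queries
by_user = summarize(records, 1)   # which users issue the most queries
top = largest_queries(records, 2)  # which queries move the most data

for host, count, total in by_host:
    print(f"{host}: {count} queries, {total} bytes")
```

Sorting by query count points to busy hosts and users, while sorting individual records by size points to the heavy queries that are candidates for replacement.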
The following are tips to mitigate query impacts on the cluster:
- Enable LSF multithreaded queries. Instead of forking another mbatchd daemon to serve user queries, LSF can use multithreaded mbatchd queries to serve most LSF query operations. Multithreaded mbatchd is useful for large clusters (a thousand hosts or more) with a large job load (hundreds of thousands of jobs). Use the following parameters in the lsf.conf file to enable multithreaded mbatchd queries:
  LSB_QUERY_PORT=free_port_number
  LSB_QUERY_ENH=y
- Set a maximum concurrent query threshold. LSF query clients normally keep trying after waiting for a short time, which can trigger a large number of query threads or processes that consume CPU, memory, and network bandwidth at the same time. In large clusters, job querying can grow quickly. If your site sees heavy job query traffic, you can tune LSF to limit the number of job queries that mbatchd can handle, which decreases the load on the management host. To prevent a peak query load from overloading the system, set the MAX_CONCURRENT_QUERY parameter in the lsb.params file to define the maximum number of concurrent job queries that mbatchd handles.
  For example,
  MAX_CONCURRENT_QUERY = 50
- Avoid using heavy query operations. LSF provides a few methods to minimize data processing volume, which helps with common scripting and administration needs:
  - Write scripts to parse job information.
    Use the bjobs -o field_list command to output only the specified data fields in a format that scripts can parse more easily. Compared to the default bjobs -l output, which sends all job information back to the client, bjobs -o avoids unnecessary network traffic.
    For example:
    % bjobs -o "jobid stat queue job_group delimiter='|'" 3
    JOBID|STAT|QUEUE|JOB_GROUP
    3|RUN|low_normal|/myproj/user1
  - Gather cluster status information.
    Use the badmin showstatus command to show cluster summary information, instead of a script that runs bjobs -u all, bqueues, and bhosts to get the same data.
    % badmin showstatus
    LSF runtime mbatchd information
    ...
  - Get user job summary information.
    Use the bjobs -sum command to summarize a user's job information.
    For example,
    % bjobs -u user1 -q regression -sum
    RUN     SSUSP   USUSP   UNKNOWN   PEND     FWD_PEND
    0       0       0       0         640369   0
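As a rough illustration of why the compact output forms above are easier to script against, the following Python sketch parses the two sample outputs shown in this topic. It assumes that the output has a header line followed by value lines, which is inferred from the samples rather than from a format specification. The function names parse_delimited and parse_summary are hypothetical, and the sample strings stand in for text that a script would capture from the real commands.

```python
def parse_delimited(output, delimiter="|"):
    """Parse output in the style of bjobs -o "... delimiter='|'" into a list
    of dictionaries, one per job, keyed by the header field names."""
    lines = [line for line in output.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split(delimiter)
    return [dict(zip(header, line.split(delimiter))) for line in lines[1:]]


def parse_summary(output):
    """Parse output in the style of bjobs -sum (a line of state names followed
    by a line of counts) into a dictionary of state name to job count."""
    lines = [line.split() for line in output.splitlines() if line.strip()]
    if len(lines) < 2:
        return {}
    header, values = lines[0], lines[1]
    return {name: int(value) for name, value in zip(header, values)}


# Sample text copied from the examples in this topic (assumed line layout).
sample_jobs = "JOBID|STAT|QUEUE|JOB_GROUP\n3|RUN|low_normal|/myproj/user1\n"
sample_sum = "RUN SSUSP USUSP UNKNOWN PEND FWD_PEND\n0 0 0 0 640369 0\n"

jobs = parse_delimited(sample_jobs)
summary = parse_summary(sample_sum)

print(jobs[0]["QUEUE"])
print(summary["PEND"])
```

Because each command returns only the fields or totals that the script needs, the parsing stays trivial and the management host sends far less data than it would for a full bjobs -l listing.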