Troubleshooting
Problem
mbatchd (MBD) does not respond to any user commands
Symptom
Every b* command returns the following output:
batch system daemon not responding ...... still trying
Cause
There are too many interactive or block mode jobs in the cluster
Diagnosing The Problem
Use the LSF Support Tool to find out how many such jobs are in your cluster. Run the following command with the current lsb.events file as its parameter:
~>support_tools -E <path>/lsb.events
The following output tells you the total number of jobs of each kind:
......
3121 interactive jobs (bsub -I)
1420 interactive jobs with pty (bsub -Ip or -Is)
1292 block mode jobs (bsub -K)
......
If the total reaches the thousands or tens of thousands, it can significantly slow down mbatchd.
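To get a single total from that report, the three counts can be added up with a short script. The following is a minimal sketch, not part of the Support Tool: the function name count_callback_jobs is hypothetical, the line format is assumed from the sample output above, and it assumes the three categories do not overlap.

```python
import re

# Matches count lines such as "3121 interactive jobs (bsub -I)".
# The line format is an assumption taken from the sample output above.
_COUNT_LINE = re.compile(r"^\s*(\d+)\s+(interactive|block mode) jobs\b")


def count_callback_jobs(report):
    """Sum the interactive and block mode job counts found in the report text."""
    total = 0
    for line in report.splitlines():
        match = _COUNT_LINE.match(line)
        if match:
            total += int(match.group(1))
    return total


# The sample lines shown above, used here as test input.
sample = """\
......
3121 interactive jobs (bsub -I)
1420 interactive jobs with pty (bsub -Ip or -Is)
1292 block mode jobs (bsub -K)
......
"""

print(count_callback_jobs(sample))  # 5833
```

In practice, the saved output of the support_tools command would be passed in place of the sample string.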
Resolving The Problem
By default, mbatchd spawns a process for each finished interactive or block mode job. This is expensive in both time and resources. The LSB_NUM_NIOS_CALLBACK_THREADS parameter enables mbatchd to use threads instead of processes to handle these jobs, which is much more efficient.
To enable and define the parameter, follow these steps:
1) In lsf.conf, find the line LSB_NUM_NIOS_CALLBACK_THREADS=<n>, or add it if it is not present.
2) Set <n> to the number of threads in the thread pool. A value of 4 is a reasonable starting point.
3) Restart mbatchd by running "badmin mbdrestart".
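Steps 1 and 2 can be sketched as a small text transformation on the contents of lsf.conf. This is an illustrative sketch only: the helper name set_callback_threads is hypothetical, and it assumes lsf.conf holds one NAME=value assignment per line with "#" marking comments.

```python
PARAM = "LSB_NUM_NIOS_CALLBACK_THREADS"


def set_callback_threads(conf_text, threads=4):
    """Return lsf.conf text with PARAM set to `threads`.

    An existing uncommented assignment is replaced in place;
    if none is found, the setting is appended at the end.
    """
    new_line = f"{PARAM}={threads}"
    lines = conf_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        # Commented-out lines start with "#" and are left untouched.
        if line.strip().startswith(PARAM + "="):
            lines[i] = new_line
            replaced = True
    if not replaced:
        lines.append(new_line)
    return "\n".join(lines) + "\n"


# Example with made-up file contents: the existing value 2 is raised to 4.
example = "LSF_LOGDIR=/var/log/lsf\nLSB_NUM_NIOS_CALLBACK_THREADS=2\n"
print(set_callback_threads(example), end="")
```

Back up lsf.conf before writing the result to it, then restart mbatchd as described in step 3 so that the new value takes effect.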
Finally, monitor the b* commands; they should now respond faster.
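One simple way to monitor response time is to measure how long a command takes before and after the change. The sketch below is an assumption about how you might do this, not an LSF facility; time_command is a hypothetical helper, and the demonstration times the Python interpreter itself so that it runs on a host without LSF.

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run a command, discard its output, and return elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(
        argv,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,  # a non-zero exit status is not treated as an error here
    )
    return time.perf_counter() - start


# Harmless demonstration: time a no-op run of the Python interpreter.
elapsed = time_command([sys.executable, "-c", "pass"])
print(f"{elapsed:.3f} s")
```

On an LSF host, pass a b* command instead, for example time_command(["bjobs"]), and compare the timings taken before and after the restart.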
Document Information
Modified date:
17 June 2018
UID
isg3T1022426