IBM Support

LSF Cluster not responding because of too many interactive or block mode jobs

Troubleshooting


Problem

mbatchd (MBD) does not respond to any user commands.

Symptom

All b* commands return the following message:

batch system daemon not responding ...... still trying

Cause

There are too many interactive (bsub -I, -Ip, or -Is) or block mode (bsub -K) jobs in the cluster.

Diagnosing The Problem

Use the LSF Support Tool to find out how many such jobs are in your cluster. Run the following command with the current lsb.events file as its parameter:

~>support_tools -E <path>/lsb.events

The following output tells you the total number of jobs of each kind:

......


3121 interactive jobs (bsub -I)
1420 interactive jobs with pty (bsub -Ip or -Is)
1292 block mode jobs (bsub -K)

......

If the total reaches the thousands, or even tens of thousands, it can significantly slow down mbatchd.
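The total can be added up by hand, or with a small helper such as the sketch below. It assumes the tool prints one count per line in the format shown above and that the three categories do not overlap. The function name sum_nios_jobs is hypothetical and is not part of LSF.

```shell
# Sum the per-category counts printed by the support tool.
# Reads the tool's output on stdin and prints the combined total.
sum_nios_jobs() {
    awk '/interactive jobs|block mode jobs/ { total += $1 } END { print total + 0 }'
}

# Sample lines copied from the output shown above.
total=$(sum_nios_jobs <<'EOF'
3121 interactive jobs (bsub -I)
1420 interactive jobs with pty (bsub -Ip or -Is)
1292 block mode jobs (bsub -K)
EOF
)
echo "$total"   # prints 5833
```

On a real cluster, the output of the support_tools command shown earlier could be piped into sum_nios_jobs instead of the sample text, provided the output format matches.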

Resolving The Problem

By default, mbatchd spawns a process for each finished interactive or block mode job, which is costly in both time and system resources. The LSB_NUM_NIOS_CALLBACK_THREADS parameter in lsf.conf makes mbatchd handle these jobs with a pool of threads instead of processes, which is much more efficient.

To enable and define the parameter, please follow these steps:

1) In lsf.conf, set LSB_NUM_NIOS_CALLBACK_THREADS=<n>. Add the line if it is not already present.

2) "n" is the number of threads in the thread pool. A value of "4" is a reasonable starting point.

3) Restart mbatchd by running "badmin mbdrestart".
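The edit in steps 1 and 2 could be scripted roughly as follows. This is a sketch only: the helper name set_nios_threads is hypothetical, the demonstration runs against a scratch file rather than the live lsf.conf, and the LSF_LOGDIR path in that scratch file is made up for illustration.

```shell
# Set or update LSB_NUM_NIOS_CALLBACK_THREADS in an lsf.conf-style file.
# Usage: set_nios_threads <conf_file> <thread_count>
set_nios_threads() {
    conf_file=$1
    threads=$2
    key=LSB_NUM_NIOS_CALLBACK_THREADS
    if grep -q "^${key}=" "$conf_file"; then
        # Parameter already present: rewrite its value via a temp file.
        tmp=$(mktemp)
        sed "s/^${key}=.*/${key}=${threads}/" "$conf_file" > "$tmp" && cat "$tmp" > "$conf_file"
        rm -f "$tmp"
    else
        # Parameter absent: append it.
        printf '%s=%s\n' "$key" "$threads" >> "$conf_file"
    fi
}

# Demonstration on a scratch file (hypothetical contents), not the live lsf.conf.
demo_conf=$(mktemp)
printf 'LSF_LOGDIR=/tmp/lsf/log\n' > "$demo_conf"
set_nios_threads "$demo_conf" 4
cat "$demo_conf"
```

After changing the real lsf.conf, mbatchd still has to be restarted with badmin mbdrestart, as in step 3, for the new value to take effect.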

Lastly, monitor the b* commands. They should now respond faster.
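One rough way to compare response times before and after the change is to time a b* command. The sketch below uses a hypothetical helper, elapsed_seconds, and times a stand-in command so that it runs without LSF installed. On a cluster, a command such as bjobs would take the place of the stand-in.

```shell
# Print the wall-clock time a command takes, in whole seconds.
# The command's own output is discarded.
elapsed_seconds() {
    start=$(date +%s)
    "$@" > /dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Stand-in command so the sketch runs anywhere; on a cluster use e.g. bjobs.
elapsed_seconds sleep 1
```

Whole-second resolution is coarse, but it is enough to tell a command that hangs in the "still trying" state from one that returns promptly.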

Product: Platform LSF, versions 9.1.0, 9.1.1, 9.1.2, 9.1.3 (Platform Independent)

Document Information

Modified date:
17 June 2018

UID

isg3T1022426