Troubleshooting
Problem
Interactive jobs fail with "ls_rsetenv: A connect sys call failed: Connection refused"
Symptom
Interactive jobs (lsrun, lsgrun, bsub -I) fail with the following error for specific hosts:
ls_rsetenv: A connect sys call failed: Connection refused
Cause
As the error message indicates, lsrun most likely failed because the submission host could not
connect to the execution host. Possible causes are a communication failure between the two hosts,
the RES daemon not running on the execution host, or RES not listening on the designated port.
Diagnosing The Problem
Use the following checklist to diagnose the possible causes of the issue.
1. Verify the following:
a. The submission host can resolve and connect to the execution host by host name (nslookup).
b. The RES daemon is up and listening on the port defined in lsf.conf (ps -ef | grep res).
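The two checks in step 1 can be sketched as a small shell helper, run from the submission host. This is an illustration only: the function names res_port_from_conf and check_res are hypothetical, it assumes the RES port is set as LSF_RES_PORT in lsf.conf, and it assumes getent and a netcat that supports -z and -w are installed.

```shell
# Print the RES port set in an lsf.conf-style file (last LSF_RES_PORT= line wins).
# Hypothetical helper, not part of LSF.
res_port_from_conf() {
    awk -F= '
        /^[ \t]*LSF_RES_PORT[ \t]*=/ {
            sub(/#.*/, "", $2)        # drop a trailing comment, if any
            gsub(/[^0-9]/, "", $2)    # keep digits only (strips quotes and spaces)
            port = $2
        }
        END { if (port != "") print port }
    ' "$1"
}

# Check name resolution and TCP reachability of RES from the submission host.
# Hypothetical helper; assumes getent and nc (-z, -w) are available.
check_res() {
    exec_host=$1
    conf_file=$2
    port=$(res_port_from_conf "$conf_file")
    if [ -z "$port" ]; then
        echo "LSF_RES_PORT is not set in $conf_file"
        return 2
    fi
    if ! getent hosts "$exec_host" >/dev/null; then
        echo "cannot resolve $exec_host"
        return 1
    fi
    if nc -z -w 3 "$exec_host" "$port"; then
        echo "RES port $port on $exec_host accepts connections"
    else
        echo "nothing accepts connections on $exec_host:$port"
        return 1
    fi
}
```

For example, check_res servername "$LSF_ENVDIR/lsf.conf" would test the host named servername against the port found in that file. A refused connection here points at RES not running or not listening on that port, rather than at name resolution.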
2. If /etc/services or the NIS or NIS+ database is used to define the daemon port numbers,
isolate the issue by defining the RES daemon port in lsf.conf, which automatically
overrides definitions in other places.
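As an illustration of step 2, an lsf.conf entry for the RES port might look like the fragment below. The value 6878 is only an example (it is the commonly documented default for RES); use the port your site actually assigns.

```shell
# Fragment of lsf.conf (example value, adjust to your site)
# Defining the port here takes precedence over /etc/services and NIS.
LSF_RES_PORT=6878
```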
3. Run strace lsrun -m servername csh, which gives a detailed trace of the connection,
and check whether any error message is logged.
a. If an error is logged, examine the trace to find the call that fails and raises the error.
b. If no error is logged, NIS or /etc/services is probably supplying the wrong RES port,
which triggers the connection failure. Review the relevant settings. To avoid this
misconfiguration, you can always define the LSF daemon ports in lsf.conf.
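To make the trace in step 3 easier to read, strace can be limited to connect calls and written to a file, and the failed calls can then be extracted. The sketch below assumes the trace was captured with strace -f -e trace=connect -o FILE and that the connections are IPv4; the function name failed_connects and the trace file path are hypothetical.

```shell
# Capture only connect() calls into a file, for example:
#   strace -f -e trace=connect -o /tmp/lsrun.trace lsrun -m servername csh

# List failed IPv4 connect() calls from a trace file as "address:port ERRNO".
# Hypothetical helper, not part of LSF or strace.
failed_connects() {
    sed -n 's/.*sin_port=htons(\([0-9]*\)).*inet_addr("\([^"]*\)").*= -1 \([A-Z]*\).*/\2:\1 \3/p' "$1"
}
```

Running failed_connects /tmp/lsrun.trace prints one line per failed connect. A line ending in ECONNREFUSED shows the address and port that lsrun actually tried, which can be compared with the port RES is configured to use. Non-blocking connects may also appear with EINPROGRESS, which is not by itself a failure.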
Product: Platform LSF. Platform: Linux. Versions: 7.0.5, 7.0.6, 8.0, 8.3, 9.1.0, 9.1.1, 9.1.2, 9.1.3.
Document Information
Modified date:
17 June 2018
UID
isg3T1020458