APAR status
Closed as program error.
Error description
<PLATFORMS e.g. 'uname -a', 'cat /etc/*release'>: [ rhel5.x ]
<COMMANDS OR DAEMONS>: -- list all related components [ pim ]
<DESCRIPTION>: -- symptom of the problem a customer would see [
LSF_PIM_LINUX_ENHANCE=Y is set in lsf.conf.
The customer is noticing that several jobs go into an unknown state and are then killed by a monitoring script.
1. bjobs output
Wed Oct 7 18:50:59 2015: Starting (Pid 25536);
Wed Oct 7 18:51:21 2015: Unknown; unable to reach the execution host;
Wed Oct 7 18:51:38 2015: External Message "preExec ok Oct07-18:51
Wed Oct 7 18:51:47 2015: Running;
Wed Oct 7 18:52:29 2015: Unknown; unable to reach the execution host;
Wed Oct 7 18:52:49 2015: Running;
Wed Oct 7 18:52:50 2015: Running with execution home
Wed Oct 7 18:53:24 2015: Unknown; unable to reach the execution host;
Wed Oct 7 18:54:02 2015: Running;
Wed Oct 7 18:54:24 2015: Unknown; unable to reach the execution host;
Wed Oct 7 18:54:33 2015: Running;
Wed Oct 7 19:01:25 2015: Unknown; unable to reach the execution host;
Wed Oct 7 19:02:25 2015: Running;
Wed Oct 7 19:02:25 2015: Exited; job has been forced to exit. The CPU time used is unknown;
Wed Oct 7 19:02:25 2015: Completed <exit>
Summary of time in seconds spent in various states by Wed Oct 7 19:04:29 2015
PEND  PSUSP  RUN  USUSP  SSUSP  UNKWN  TOTAL
344   0      533  0      0      153    1030
From the mbatchd log, mbatchd was unable to reach sbatchd on the execution host.
Oct 8 03:05:50 2015 98414 3 9.1.3 EM_jobCtrlDecsn: job <4142139> failed to be dispatched to host
Oct 8 03:05:52 2015 98414 3 9.1.3 start_ajob: Failed to call sbatchd on host <xxxx>: Timeout on connect call to server
Oct 8 03:05:52 2015 98414 3 9.1.3 EM_jobCtrlDecsn: job <4142142> failed to be dispatched to host xxxx
Oct 8 03:22:56 2015 98414 3 9.1.3 signal_job: Failed to call sbatchd on host <xxxx>: Timeout on connect call to server
Oct 8 03:26:01 2015 98414 3 9.1.3 signal_job: Failed to call sbatchd on host <xxxx>: Timeout on connect call to server
mbatchd tried to dispatch jobs to the host, but communication with sbatchd timed out. On the execution host, the sbatchd log shows that communication between sbatchd and PIM also timed out.
Oct 5 01:34:58 2015 22395 3 9.1.3 lib.pim.c/getJInfo_(): select() failed. Communication time out.
Oct 5 01:36:46 2015 22395 Last message repeated 8 time(s).
Oct 5 01:36:49 2015 22395 4 9.1.3 createJobTmpDir: Job level tmp directory is set to </tmp/1543901.tmpdir> for job <1543901>
Oct 5 01:37:06 2015 22395 3 9.1.3 lib.pim.c/getJInfo_(): select() failed. Communication time out.
Oct 5 01:39:19 2015 22395 Last message repeated 13 time(s).
Oct 5 01:39:25 2015 22395 4 9.1.3 createJobTmpDir: Job level tmp directory is set to </tmp/1544075.tmpdir> for job <1544075>
Oct 5 01:39:25 2015 22395 Last message repeated 1 time(s).
Oct 5 01:40:05 2015 22395 3 9.1.3 lib.pim.c/getJInfo_(): select() failed. Communication time out.
pim.log
Nov 23 10:20:46 2015 21715 6 1.2.10 L0 addDeadProcesses rtime 3.64 ms, utime 0.00 ms, stime 0.00 ms
Nov 23 10:20:46 2015 21715 6 1.2.10 L0 updateProcs rtime 14386.38 ms, utime 1710.00 ms, stime 12400.00 ms
Nov 23 10:21:30 2015 21715 6 1.2.10 L0 addDeadProcesses rtime 3.76 ms, utime 10.00 ms, stime 0.00 ms
Nov 23 10:21:30 2015 21715 6 1.2.10 L0 updateProcs rtime 14806.72 ms, utime 1670.00 ms, stime 12680.00 ms
Nov 23 10:22:14 2015 21715 6 1.2.10 L0 addDeadProcesses rtime 3.48 ms, utime 0.00 ms, stime 0.00 ms
Nov 23 10:22:14 2015 21715 6 1.2.10 L0 updateProcs rtime 13551.52 ms, utime 1580.00 ms, stime 11540.00 ms
Nov 23 10:22:56 2015 21715 6 1.2.10 L0 addDeadProcesses rtime 3.00 ms, utime 10.00 ms, stime 0.00 ms
Nov 23 10:22:56 2015 21715 6 1.2.10 L0 updateProcs rtime 12167.89 ms, utime 1400.00 ms, stime 10650.00 ms
Nov 23 10:23:38 2015 21715 6 1.2.10 L0 addDeadProcesses rtime 2.88 ms, utime 10.00 ms, stime 0.00 ms
Nov 23 10:23:38 2015 21715 6 1.2.10 L0 updateProcs rtime 12246.14 ms, utime 1450.00 ms, stime 10640.00 ms
Nov 23 10:24:07 2015 21715 6 1.2.10 L0 addDeadProcesses rtime 2.92 ms, utime 0.00 ms, stime 0.00 ms
Nov 23 10:24:07 2015 21715 6 1.2.10 L0 updateProcs rtime 13746.43 ms, utime 1460.00 ms, stime 11990.00 ms
Nov 23 10:24:25 2015 21715 6 1.2.10 L0 addDeadProcesses rtime 2.96 ms, utime 0.00 ms, stime 0.00 ms
Nov 23 10:24:25 2015 21715 6 1.2.10 L0 updateProcs rtime 14792.26 ms, utime 1420.00 ms, stime 12350.00 ms
Nov 23 10:24:44 2015 21715 6 1.2.10 L0 addDeadProcesses rtime 2.96 ms, utime 0.00 ms, stime 0.00 ms
Nov 23 10:24:44 2015 21715 6 1.2.10 L0 updateProcs rtime 16189.69 ms, utime 1470.00 ms, stime 13150.00 ms
Nov 23 10:25:27 2015 21715 6 1.2.10 L0 addDeadProcesses rtime 3.16 ms, utime 10.00 ms, stime 0.00 ms
Nov 23 10:25:27 2015 21715 6 1.2.10 L0 updateProcs rtime 13459.57 ms, utime 1440.00 ms, stime 11740.00 ms
Nov 23 10:26:11 2015 21715 6 1.2.10 L0 addDeadProcesses rtime 3.36 ms, utime 0.00 ms, stime 0.00 ms
Nov 23 10:26:11 2015 21715 6 1.2.10 L0 updateProcs rtime 14060.33 ms, utime 1500.00 ms, stime 12270.00 ms
Nov 23 10:26:55 2015 21715 6 1.2.10 L0 addDeadProcesses rtime 3.44 ms, utime 0.00 ms, stime 0.00 ms
Nov 23 10:26:55 2015 21715 6 1.2.10 L0 updateProcs rtime 14184.08 ms, utime 1520.00 ms, stime 12290.00 ms
Nov 23 10:27:40 2015 21715 6 1.2.10 L0 addDeadProcesses rtime 3.52 ms, utime 0.00 ms, stime 0.00 ms
Nov 23 10:27:40 2015 21715 6 1.2.10 L0 updateProcs rtime 14985.64 ms, utime 1600.00 ms, stime 13130.00 ms
Nov 23 10:28:26 2015 21715 6 1.2.10 L0 addDeadProcesses rtime 4.12 ms, utime 0.00 ms, stime 0.00 ms
Nov 23 10:28:26 2015 21715 6 1.2.10 L0 updateProcs rtime 15211.35 ms, utime 1620.00 ms, stime 12920.00 ms
Nov 23 10:29:13 2015 21715 6 1.2.10 L0
]
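In the pim.log excerpt above, every updateProcs entry reports an rtime above 12 seconds, which is longer than the 10-second default timeout mentioned in the problem summary. The following Python sketch shows one way to pick out such entries from a log. The regular expression is inferred only from the lines quoted in this APAR, and the function name slow_updates and its parameters are hypothetical; this is not an LSF tool.

```python
import re

# Matches timing entries in the form quoted in this APAR, for example:
# "Nov 23 10:20:46 2015 21715 6 1.2.10 L0 updateProcs rtime 14386.38 ms, ..."
# The layout is an assumption based on the excerpt only.
TIMING_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2}\s+\d{4})"
    r"\s+\d+\s+\d+\s+\S+\s+L\d+\s+"
    r"(?P<func>\w+)\s+rtime\s+(?P<rtime>[\d.]+)\s+ms"
)


def slow_updates(lines, timeout_seconds=10.0, func="updateProcs"):
    """Return (timestamp, rtime in seconds) for entries slower than the timeout."""
    slow = []
    for line in lines:
        match = TIMING_RE.match(line.strip())
        if match is None or match.group("func") != func:
            continue  # skip truncated lines and other functions
        seconds = float(match.group("rtime")) / 1000.0
        if seconds > timeout_seconds:
            slow.append((match.group("stamp"), seconds))
    return slow


sample = [
    "Nov 23 10:20:46 2015 21715 6 1.2.10 L0 addDeadProcesses rtime 3.64 ms, utime 0.00 ms, stime 0.00 ms",
    "Nov 23 10:20:46 2015 21715 6 1.2.10 L0 updateProcs rtime 14386.38 ms, utime 1710.00 ms, stime 12400.00 ms",
    "Nov 23 10:29:13 2015 21715 6 1.2.10 L0",
]

for stamp, seconds in slow_updates(sample):
    print(f"{stamp}: updateProcs took {seconds:.2f} s")
```

The threshold is passed in seconds so that it can be set to whatever timeout value is configured on the host.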
Local fix
n/a
Problem summary
P101511. This fix reduces the PIM reply time when sbatchd calls PIM. If the previous PIM update of the pim.info file took more than the LSF_CALL_PIM_SELET_TIMEOUT value in seconds (the default is 10 seconds), PIM replies to the next sbatchd call first, before updating pim.info, to reduce the time that sbatchd hangs.
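The ordering described in the summary can be illustrated with a toy model. This is a sketch only, not LSF source code: the class PimModel, its method names, and the simulated durations are all hypothetical, and it assumes that without the condition being met PIM updates pim.info before it replies.

```python
class PimModel:
    """Toy model of the reply ordering described above; not LSF source code."""

    def __init__(self, select_timeout=10.0):
        # Stands in for the LSF_CALL_PIM_SELET_TIMEOUT value, in seconds.
        self.select_timeout = select_timeout
        # Duration of the previous pim.info update, in seconds.
        self.last_update_seconds = 0.0
        # Order of actions taken, kept for inspection.
        self.events = []

    def _update_pim_info(self, duration):
        # Simulate an update that takes `duration` seconds.
        self.events.append("update")
        self.last_update_seconds = duration

    def handle_request(self, next_update_duration):
        """Serve one sbatchd call and return how long sbatchd waited."""
        if self.last_update_seconds > self.select_timeout:
            # Previous update was slow: reply first, then refresh pim.info.
            self.events.append("reply")
            waited = 0.0
            self._update_pim_info(next_update_duration)
        else:
            # Assumed ordinary path: refresh pim.info, then reply.
            self._update_pim_info(next_update_duration)
            self.events.append("reply")
            waited = next_update_duration
        return waited


pim = PimModel()
first_wait = pim.handle_request(14.4)   # sbatchd waits for the slow update
second_wait = pim.handle_request(14.8)  # reply is sent before the update
print(first_wait, second_wait, pim.events)
```

In this model the first call still waits for the slow update, and only the following call benefits, which matches the wording "replies to the next sbatchd call first".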
Problem conclusion
The fix resolves this issue.
Temporary fix
Comments
APAR Information
APAR number
P101511
Reported component name
LSF STAND EDITI
Reported component ID
5725G8201
Reported release
913
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2015-12-07
Closed date
2016-07-26
Last modified date
2016-07-26
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Fixed component name
LSF STAND EDITI
Fixed component ID
5725G8201
Applicable component levels
R913 PSY
UP
Document Information
Modified date:
26 July 2016