IBM Support

P101511: SBATCHD CALL PIM TIMEOUT, BLOCKING COMMUNICATION BETWEEN MBATCHD AND SBATCHD


APAR status

  • Closed as program error.

Error description

  • <PLATFORMS e.g. 'uname -a', 'cat /etc/*release'>:
    [rhel5.x ]
    <COMMANDS OR DAEMONS>: -- list all related components
    [ pim    ]
    <DESCRIPTION>: -- symptom of the problem a customer would see
    [
    LSF_PIM_LINUX_ENHANCE=Y is set in lsf.conf.
    The customer notices that several jobs go into an unknown
    state and are killed by a monitoring script.
    
    1. bjobs output
    Wed Oct  7 18:50:59 2015: Starting (Pid 25536);
    Wed Oct  7 18:51:21 2015: Unknown; unable to reach the execution
    host;
    Wed Oct  7 18:51:38 2015: External Message "preExec ok
    Oct07-18:51
    Wed Oct  7 18:51:47 2015: Running;
    Wed Oct  7 18:52:29 2015: Unknown; unable to reach the execution
    host;
    Wed Oct  7 18:52:49 2015: Running;
    Wed Oct  7 18:52:50 2015: Running with execution home
    Wed Oct  7 18:53:24 2015: Unknown; unable to reach the execution
    host;
    Wed Oct  7 18:54:02 2015: Running;
    Wed Oct  7 18:54:24 2015: Unknown; unable to reach the execution
    host;
    Wed Oct  7 18:54:33 2015: Running;
    Wed Oct  7 19:01:25 2015: Unknown; unable to reach the execution
    host;
    Wed Oct  7 19:02:25 2015: Running;
    Wed Oct  7 19:02:25 2015: Exited; job has been forced to exit.
    The CPU
    time used is unknown;
    Wed Oct  7 19:02:25 2015: Completed <exit>
    
    Summary of time in seconds spent in various states by
    Wed Oct  7 19:04:29 2015
      PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
      344      0        533      0        0        153      1030
    
    The mbatchd log shows that mbatchd was unable to reach sbatchd
    on the execution host.
    
    Oct  8 03:05:50 2015 98414 3 9.1.3 EM_jobCtrlDecsn: job
    <4142139>
    failed to be dispatched to host
    Oct  8 03:05:52 2015 98414 3 9.1.3 start_ajob: Failed to call
    sbatchd
    on host <xxxx>: Timeout on connect call to server
    Oct  8 03:05:52 2015 98414 3 9.1.3 EM_jobCtrlDecsn: job
    <4142142>
    failed to be dispatched to host xxxx
    Oct  8 03:22:56 2015 98414 3 9.1.3 signal_job: Failed to call
    sbatchd
    on host <xxxx>: Timeout on connect call to server
    Oct  8 03:26:01 2015 98414 3 9.1.3 signal_job: Failed to call
    sbatchd
    on host <xxxx>: Timeout on connect call to server
    
    mbatchd tried to dispatch jobs to the host, but communication
    with sbatchd timed out.
    
    On the execution host, the sbatchd log shows that communication
    between PIM and sbatchd also timed out.
    
    Oct  5 01:34:58 2015 22395 3 9.1.3 lib.pim.c/getJInfo_():
    select()
    failed. Communication time out.
    Oct  5 01:36:46 2015 22395 Last message repeated 8 time(s).
    Oct  5 01:36:49 2015 22395 4 9.1.3 createJobTmpDir: Job level
    tmp
    directory is set to  </tmp/1543901.tmpdir> for job <1543901>
    Oct  5 01:37:06 2015 22395 3 9.1.3 lib.pim.c/getJInfo_():
    select()
    failed. Communication time out.
    Oct  5 01:39:19 2015 22395 Last message repeated 13 time(s).
    Oct  5 01:39:25 2015 22395 4 9.1.3 createJobTmpDir: Job level
    tmp
    directory is set to  </tmp/1544075.tmpdir> for job <1544075>
    Oct  5 01:39:25 2015 22395 Last message repeated 1 time(s).
    Oct  5 01:40:05 2015 22395 3 9.1.3 lib.pim.c/getJInfo_():
    select()
    failed. Communication time out.
    
    pim.log
    
    Nov 23 10:20:46 2015 21715 6 1.2.10 L0 addDeadProcesses rtime
    3.64 ms, utime 0.00 ms, stime 0.00 ms
    Nov 23 10:20:46 2015 21715 6 1.2.10 L0 updateProcs rtime
    14386.38 ms, utime 1710.00 ms, stime 12400.00 ms
    Nov 23 10:21:30 2015 21715 6 1.2.10 L0 addDeadProcesses rtime
    3.76 ms, utime 10.00 ms, stime 0.00 ms
    Nov 23 10:21:30 2015 21715 6 1.2.10 L0 updateProcs rtime
    14806.72 ms, utime 1670.00 ms, stime 12680.00 ms
    Nov 23 10:22:14 2015 21715 6 1.2.10 L0 addDeadProcesses rtime
    3.48 ms, utime 0.00 ms, stime 0.00 ms
    Nov 23 10:22:14 2015 21715 6 1.2.10 L0 updateProcs rtime
    13551.52 ms, utime 1580.00 ms, stime 11540.00 ms
    Nov 23 10:22:56 2015 21715 6 1.2.10 L0 addDeadProcesses rtime
    3.00 ms, utime 10.00 ms, stime 0.00 ms
    Nov 23 10:22:56 2015 21715 6 1.2.10 L0 updateProcs rtime
    12167.89 ms, utime 1400.00 ms, stime 10650.00 ms
    Nov 23 10:23:38 2015 21715 6 1.2.10 L0 addDeadProcesses rtime
    2.88 ms, utime 10.00 ms, stime 0.00 ms
    Nov 23 10:23:38 2015 21715 6 1.2.10 L0 updateProcs rtime
    12246.14 ms, utime 1450.00 ms, stime 10640.00 ms
    Nov 23 10:24:07 2015 21715 6 1.2.10 L0 addDeadProcesses rtime
    2.92 ms, utime 0.00 ms, stime 0.00 ms
    Nov 23 10:24:07 2015 21715 6 1.2.10 L0 updateProcs rtime
    13746.43 ms, utime 1460.00 ms, stime 11990.00 ms
    Nov 23 10:24:25 2015 21715 6 1.2.10 L0 addDeadProcesses rtime
    2.96 ms, utime 0.00 ms, stime 0.00 ms
    Nov 23 10:24:25 2015 21715 6 1.2.10 L0 updateProcs rtime
    14792.26 ms, utime 1420.00 ms, stime 12350.00 ms
    Nov 23 10:24:44 2015 21715 6 1.2.10 L0 addDeadProcesses rtime
    2.96 ms, utime 0.00 ms, stime 0.00 ms
    Nov 23 10:24:44 2015 21715 6 1.2.10 L0 updateProcs rtime
    16189.69 ms, utime 1470.00 ms, stime 13150.00 ms
    Nov 23 10:25:27 2015 21715 6 1.2.10 L0 addDeadProcesses rtime
    3.16 ms, utime 10.00 ms, stime 0.00 ms
    Nov 23 10:25:27 2015 21715 6 1.2.10 L0 updateProcs rtime
    13459.57 ms, utime 1440.00 ms, stime 11740.00 ms
    Nov 23 10:26:11 2015 21715 6 1.2.10 L0 addDeadProcesses rtime
    3.36 ms, utime 0.00 ms, stime 0.00 ms
    Nov 23 10:26:11 2015 21715 6 1.2.10 L0 updateProcs rtime
    14060.33 ms, utime 1500.00 ms, stime 12270.00 ms
    Nov 23 10:26:55 2015 21715 6 1.2.10 L0 addDeadProcesses rtime
    3.44 ms, utime 0.00 ms, stime 0.00 ms
    Nov 23 10:26:55 2015 21715 6 1.2.10 L0 updateProcs rtime
    14184.08 ms, utime 1520.00 ms, stime 12290.00 ms
    Nov 23 10:27:40 2015 21715 6 1.2.10 L0 addDeadProcesses rtime
    3.52 ms, utime 0.00 ms, stime 0.00 ms
    Nov 23 10:27:40 2015 21715 6 1.2.10 L0 updateProcs rtime
    14985.64 ms, utime 1600.00 ms, stime 13130.00 ms
    Nov 23 10:28:26 2015 21715 6 1.2.10 L0 addDeadProcesses rtime
    4.12 ms, utime 0.00 ms, stime 0.00 ms
    Nov 23 10:28:26 2015 21715 6 1.2.10 L0 updateProcs rtime
    15211.35 ms, utime 1620.00 ms, stime 12920.00 ms
    Nov 23 10:29:13 2015 21715 6 1.2.10 L0
    ]
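
The pim.log excerpt above shows each updateProcs pass taking 12 to 16 seconds of real time, which is longer than the 10-second default named in the problem summary. The following Python sketch is one way to pull those entries out of a pim.log and flag the slow ones. The regular expression is an assumption based only on the line layout shown in this excerpt, and the function name slow_updates and its threshold_s parameter are hypothetical.

```python
import re

# Pattern inferred from the pim.log excerpt in this APAR (an assumption,
# not a documented log format): timestamp, pid, level, version, "L0",
# then the function name and its real-time cost in milliseconds.
UPDATE_RE = re.compile(
    r"(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2}\s+\d{4})"
    r"\s+\d+\s+\d+\s+\S+\s+L0\s+"
    r"updateProcs\s+rtime\s+(?P<rtime>[\d.]+)\s+ms"
)


def slow_updates(log_text, threshold_s=10.0):
    """Return (timestamp, seconds) for updateProcs passes slower than threshold_s."""
    # Collapse all whitespace so entries wrapped across lines still match.
    flat = " ".join(log_text.split())
    hits = []
    for match in UPDATE_RE.finditer(flat):
        seconds = float(match.group("rtime")) / 1000.0
        if seconds > threshold_s:
            hits.append((match.group("stamp"), seconds))
    return hits


# Three entries copied from the excerpt above, wrapped as they appear there.
sample = """
Nov 23 10:20:46 2015 21715 6 1.2.10 L0 addDeadProcesses rtime
3.64 ms, utime 0.00 ms, stime 0.00 ms
Nov 23 10:20:46 2015 21715 6 1.2.10 L0 updateProcs rtime
14386.38 ms, utime 1710.00 ms, stime 12400.00 ms
Nov 23 10:22:56 2015 21715 6 1.2.10 L0 updateProcs rtime
12167.89 ms, utime 1400.00 ms, stime 10650.00 ms
"""

for stamp, seconds in slow_updates(sample):
    print(f"{stamp}: updateProcs took {seconds:.2f} s")
```

Most of the real time in these entries is system time (stime), so a check like this only shows that PIM is slow to refresh, not why.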
    

Local fix

  • n/a
    

Problem summary

  • P101511. This fix reduces the PIM reply time when sbatchd calls
    PIM. If the previous PIM update to the pim.info file took longer
    than the LSF_CALL_PIM_SELET_TIMEOUT value in seconds (default is
    10 seconds), PIM replies to the next sbatchd call first, before
    updating pim.info, to reduce the time that sbatchd hangs.
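
The ordering change described in the summary can be illustrated with a small Python model. This is only a sketch and not the PIM implementation: the class PimSketch and its methods are hypothetical, the comparison is read as applying to the duration of the previous pim.info update, and the update-then-reply order in the other branch is an assumption about the behavior before the fix.

```python
from dataclasses import dataclass, field


@dataclass
class PimSketch:
    """Toy model of the reply ordering described in the problem summary."""

    # Stands in for the LSF_CALL_PIM_SELET_TIMEOUT value (default 10 seconds).
    select_timeout_s: float = 10.0
    # Duration of the previous pim.info update, in seconds.
    last_update_s: float = 0.0
    events: list = field(default_factory=list)

    def update_pim_info(self, duration_s):
        # Hypothetical stand-in for the real process scan and file write.
        self.events.append("update")
        self.last_update_s = duration_s

    def reply(self):
        self.events.append("reply")

    def handle_sbatchd_call(self, next_update_duration_s):
        if self.last_update_s > self.select_timeout_s:
            # Previous update was slow: answer sbatchd first, refresh afterwards.
            self.reply()
            self.update_pim_info(next_update_duration_s)
        else:
            # Assumed ordering otherwise: refresh pim.info, then answer.
            self.update_pim_info(next_update_duration_s)
            self.reply()


pim = PimSketch()
pim.handle_sbatchd_call(14.4)   # first call: no slow update recorded yet
pim.handle_sbatchd_call(14.8)   # second call: previous update took 14.4 s
print(pim.events)
```

In this model the second call is answered before the slow refresh starts, which is the effect the fix is described as having on the time sbatchd spends waiting.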
    

Problem conclusion

  • The fix resolves this issue.
    

Temporary fix

Comments

APAR Information

  • APAR number

    P101511

  • Reported component name

    LSF STAND EDITI

  • Reported component ID

    5725G8201

  • Reported release

    913

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2015-12-07

  • Closed date

    2016-07-26

  • Last modified date

    2016-07-26

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    LSF STAND EDITI

  • Fixed component ID

    5725G8201

Applicable component levels

  • R913 PSY

       UP


Document Information

Modified date:
26 July 2016