IBM Support

P102164: GPU JOBS OFTEN GET TERMINATED.

APAR status

  • Closed as program error.

Error description

  • When cgroup enforcement is enabled for GPUs, jobs that request
    a larger share of the available GPUs (2 of 2 available GPUs, or
    3 or 4 of 4 available GPUs) are often terminated. The failure
    rate is close to 100%. Jobs that request fewer GPUs (1 of 2, or
    1 or 2 of 4) always succeed.
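
    The reported cases can be summarized as a simple triage check. The
    sketch below is illustrative only: the function name is hypothetical,
    and the "more than half of the available GPUs" rule is an assumption
    inferred from the six cases listed above, not a rule stated by the
    APAR.

```python
def matches_reported_failure_pattern(requested: int, available: int) -> bool:
    """Return True if a GPU request resembles the failing cases in this APAR.

    Assumption (not stated in the APAR): the failing cases are those that
    request more than half of the GPUs available on the host.
    """
    if requested < 1 or available < 1 or requested > available:
        raise ValueError("requested must be between 1 and available")
    return 2 * requested > available


# The six (requested, available) cases taken from the error description.
REPORTED_CASES = {
    (2, 2): True,   # terminated
    (3, 4): True,   # terminated
    (4, 4): True,   # terminated
    (1, 2): False,  # succeeded
    (1, 4): False,  # succeeded
    (2, 4): False,  # succeeded
}

for (req, avail), terminated in REPORTED_CASES.items():
    assert matches_reported_failure_pattern(req, avail) == terminated
```

    A check like this only helps decide whether a failing job plausibly
    matches this APAR; it says nothing about the underlying cause.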
    

Local fix

  • n/a
    

Problem summary

  • Fix to ensure that GPU jobs can run successfully on
    linux3.10-glibc2.17-x86_64.
    

Problem conclusion

  • Fix it.
    

Temporary fix

Comments

APAR Information

  • APAR number

    P102164

  • Reported component name

    LSF STAND EDITI

  • Reported component ID

    5725G8201

  • Reported release

    A10

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2017-03-28

  • Closed date

    2017-04-11

  • Last modified date

    2017-04-11

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    LSF STAND EDITI

  • Fixed component ID

    5725G8201

Applicable component levels

  • RA10 PSY

       UP

Document Information

Modified date:
11 April 2017