APAR status
Closed as program error.
Error description
When cgroup enforcement is enabled for GPUs, jobs that request most or all of a host's GPUs (2 of 2 available GPUs, or 3 or 4 of 4 available GPUs) are often terminated, with a failure rate close to 100%. Jobs that request fewer GPUs (1 of 2, or 1 or 2 of 4) always succeed.
Local fix
n/a
Problem summary
The fix ensures that GPU jobs run successfully on linux3.10-glibc2.17-x86_64.
Problem conclusion
Fix it.
Temporary fix
Comments
APAR Information
APAR number
P102164
Reported component name
LSF STAND EDITI
Reported component ID
5725G8201
Reported release
A10
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2017-03-28
Closed date
2017-04-11
Last modified date
2017-04-11
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Fixed component name
LSF STAND EDITI
Fixed component ID
5725G8201
Applicable component levels
RA10 PSY
UP
Document Information
Modified date:
11 April 2017