Troubleshooting
Problem
After GPU jobs finish, bjobs -l shows no GPU usage information, even though nvidia-smi and "dcgmi stats --pid xxx -v" do show the usage.
Symptom
The sbatchd log shows an error like the following:
Feb 4 17:36:42 2021 40899 3 10.1 checkGPUStatus: Fail to get GPU healthy status: API version mismatch
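To check whether a host is hitting this symptom, you can scan its sbatchd log for the checkGPUStatus error. The sketch below is a minimal example and makes some assumptions. It assumes the line layout shown in the sample above, read as timestamp, PID, log level, LSF version, and then the message. That field interpretation is inferred from the single sample line. The function name find_gpu_health_errors is hypothetical and is not part of LSF.

```python
import re
from typing import Dict, Iterable, List

# Assumed layout of an sbatchd log line, inferred from the sample above:
# "<Mon> <day> <HH:MM:SS> <year> <pid> <level> <version> <message>"
LOG_LINE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+\d{4})\s+"
    r"(?P<pid>\d+)\s+(?P<level>\d+)\s+(?P<version>[\d.]+)\s+(?P<message>.*)$"
)


def find_gpu_health_errors(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Return parsed entries where checkGPUStatus reports an API version mismatch.

    Hypothetical helper: the field names are assumptions, not an LSF API.
    """
    hits = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue  # skip lines that do not follow the assumed layout
        message = match.group("message")
        if message.startswith("checkGPUStatus:") and "API version mismatch" in message:
            hits.append(match.groupdict())
    return hits


if __name__ == "__main__":
    sample = [
        "Feb 4 17:36:42 2021 40899 3 10.1 checkGPUStatus: "
        "Fail to get GPU healthy status: API version mismatch",
        "Feb 4 17:36:43 2021 40899 6 10.1 some unrelated message",
    ]
    for entry in find_gpu_health_errors(sample):
        print(entry["timestamp"], entry["pid"], entry["message"])
```

In practice, you would pass the lines of the host's sbatchd log file, for example through open(path), instead of the inline sample. Matching on both the checkGPUStatus prefix and the "API version mismatch" text keeps other GPU status messages out of the results.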
Document Location
Worldwide
Product: IBM Spectrum LSF (IBM Software, Automation Platform)
Category: GPU
Case number: TS004901860
Platform: Platform Independent
Version: All Version(s)
Document Information
Modified date:
09 February 2021
UID
ibm16413727