Troubleshooting
Problem
An LSF GPU job exits with the error message "No devices were found".
Symptom
$ bsub -gpu num=6 -I -m hostA -q testq nvidia-smi
Job <110> is submitted to queue <testq>.
<<Waiting for dispatch ...>>
<<Starting on hostB>>
No devices were found
However, running "nvidia-smi" directly on the host, without going through LSF, works fine.
The "lsload" command still shows all GPUs in OK status with no errors.
GPU resource enforcement through Linux cgroups is enabled with LSB_RESOURCE_ENFORCE="gpu" in lsf.conf.
The NVIDIA DCGM feature is enabled in LSF with LSF_DCGM_PORT in lsf.conf.
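To confirm that a host matches the configuration described here, the two lsf.conf parameters can be listed with a small shell helper. This is only a sketch: the function name show_gpu_settings is hypothetical, and the path to lsf.conf must be supplied by the caller (commonly "$LSF_ENVDIR/lsf.conf", which is an assumption about the local installation).

```shell
# Hypothetical helper: print the uncommented LSB_RESOURCE_ENFORCE and
# LSF_DCGM_PORT assignments found in the given lsf.conf file.
show_gpu_settings() {
    if [ "$#" -ne 1 ]; then
        echo "usage: show_gpu_settings /path/to/lsf.conf" >&2
        return 2
    fi
    # Lines starting with '#' do not match, so commented-out settings are ignored.
    grep -E '^[[:space:]]*(LSB_RESOURCE_ENFORCE|LSF_DCGM_PORT)[[:space:]]*=' "$1"
}
```

For example, show_gpu_settings "$LSF_ENVDIR/lsf.conf" prints the matching lines. If both parameters appear, the host has the cgroup enforcement and DCGM settings listed in this symptom description.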
Document Location
Worldwide
Product: IBM Spectrum LSF
Version: 10.1.0
Platform: Platform Independent
Category: GPU
Document Information
Modified date:
25 November 2021
UID
ibm16510044