Troubleshooting
Problem
Distributed training works in a 2-GPU environment but fails in a 3- or 4-GPU environment in a Python notebook.
Symptom
When distributed training is launched with 3 or 4 GPUs from a Python notebook, the worker process fails with the following error:
[2024-02-23 14:30:12,934] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -7) local_rank: 0 (pid: 590) of binary: /home/wsuser/.conda/envs/pytorch/bin/python
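For context, a negative exit code reported by torch.distributed.elastic means the worker process was terminated by the corresponding signal; exit code -7 is signal 7 (SIGBUS) on Linux. As background only, the following is a minimal, hypothetical sketch of the kind of torchrun-launched training script in which this class of failure surfaces. The script body (a single all_reduce) is an illustrative assumption, not code from the affected notebook.

# Minimal sketch (illustrative assumption, not the customer's code) of a
# training script run under torchrun, the elastic launcher named in the
# error message above.
import os

import torch
import torch.distributed as dist

def main() -> None:
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for each worker it spawns.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # The default env:// init method reads the rank and world size from the
    # environment variables set by torchrun.
    dist.init_process_group(backend="nccl")

    # A trivial collective stands in for real training work.
    t = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(t)  # sums the tensor across all ranks

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

A script like this would typically be started with, for example, torchrun --standalone --nproc_per_node=N train.py, where N matches the number of GPUs; in the scenario above, N=2 succeeds while N=3 or N=4 fails with the SIGBUS shown.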
Document Location
Worldwide
[{"Type":"MASTER","Line of Business":{"code":"LOB76","label":"Data Platform"},"Business Unit":{"code":"BU048","label":"IBM Software"},"Product":{"code":"SSHGYS","label":"IBM Cloud Pak for Data"},"ARM Category":[{"code":"a8m3p0000006xtiAAA","label":"Services-\u003EData Science Tools-\u003EWatson Machine Learning"}],"ARM Case Number":"TS015456393","Platform":[{"code":"PF040","label":"Red Hat OpenShift"}],"Version":"4.6.6"}]
Document Information
Modified date:
29 April 2024
UID
ibm17149745