Troubleshooting
Problem
Distributed training works in a 2-GPU environment but fails in a 3- or 4-GPU environment in a Python notebook.
Symptom
When distributed training is launched with 3 or 4 GPUs from a Python notebook, the worker process fails with the following error:
[2024-02-23 14:30:12,934] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -7) local_rank: 0 (pid: 590) of binary: /home/wsuser/.conda/envs/pytorch/bin/python
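For context, a negative exit code reported by torch.distributed.elastic means the worker process was terminated by the corresponding signal; exit code -7 is signal 7 (SIGBUS) on Linux. As background only, the following is a minimal, hypothetical sketch of the kind of torchrun-launched training script in which this class of failure surfaces. The script body (a single all_reduce) is an illustrative assumption, not code from the affected notebook.

# Minimal sketch (illustrative assumption, not the customer's code) of a
# training script run under torchrun, the elastic launcher named in the
# error message above.
import os

import torch
import torch.distributed as dist

def main() -> None:
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for each worker it spawns.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # The default env:// init method reads the rank and world size from the
    # environment variables set by torchrun.
    dist.init_process_group(backend="nccl")

    # A trivial collective stands in for real training work.
    t = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(t)  # sums the tensor across all ranks

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

A script like this would typically be started with, for example, torchrun --standalone --nproc_per_node=N train.py, where N matches the number of GPUs; in the scenario above, N=2 succeeds while N=3 or N=4 fails with the SIGBUS shown.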
Document Location
Worldwide
[{"Type":"MASTER","Line of Business":{"code":"LOB76","label":"Data Platform"},"Business Unit":{"code":"BU048","label":"IBM Software"},"Product":{"code":"SSHGYS","label":"IBM Cloud Pak for Data"},"ARM Category":[{"code":"a8m3p0000006xtiAAA","label":"Services-\u003EData Science Tools-\u003EWatson Machine Learning"}],"ARM Case Number":"TS015456393","Platform":[{"code":"PF040","label":"Red Hat OpenShift"}],"Version":"4.6.6"}]
Document Information
Modified date:
29 April 2024
UID
ibm17149745