Distributed training works in a 2-GPU environment but fails in a 3-GPU or 4-GPU environment in a Python notebook.

Troubleshooting


Problem

Distributed training works in a 2-GPU environment but fails in a 3-GPU or 4-GPU environment in a Python notebook.

Symptom

Distributed training works in a 2-GPU environment but fails in a 3-GPU or 4-GPU environment in a Python notebook, with the following error:
[2024-02-23 14:30:12,934] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -7) local_rank: 0 (pid: 590) of binary: /home/wsuser/.conda/envs/pytorch/bin/python
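The torch.distributed.elastic entry in the traceback indicates a torchrun-style launch, and the negative exit code means the worker process was killed by a signal (signal 7 is SIGBUS on Linux) rather than exiting on its own. For context, below is a minimal sketch of the kind of job that can surface this error; the script name train.py, the model, and the training loop are hypothetical placeholders for the actual workload.

# train.py - hypothetical minimal DDP job; model and sizes are placeholders.
# Launched from the notebook with, e.g.: !torchrun --nproc_per_node=4 train.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main() -> None:
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT
    # for each worker, so init_process_group() reads them from the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each worker drives one GPU; gradients are synchronized by DDP.
    model = DDP(torch.nn.Linear(10, 1).cuda(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(10):
        optimizer.zero_grad()
        loss = model(torch.randn(32, 10, device=local_rank)).sum()
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

With a launch like this, the failure described above appears when --nproc_per_node is raised from 2 to 3 or 4.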

Document Location

Worldwide

[{"Type":"MASTER","Line of Business":{"code":"LOB76","label":"Data Platform"},"Business Unit":{"code":"BU048","label":"IBM Software"},"Product":{"code":"SSHGYS","label":"IBM Cloud Pak for Data"},"ARM Category":[{"code":"a8m3p0000006xtiAAA","label":"Services-\u003EData Science Tools-\u003EWatson Machine Learning"}],"ARM Case Number":"TS015456393","Platform":[{"code":"PF040","label":"Red Hat OpenShift"}],"Version":"4.6.6"}]

Document Information

Modified date:
29 April 2024

UID

ibm17149745