CUDA Initialization errors with persistence mode disabled

Troubleshooting

Problem

CUDA fails with cudaErrorInitializationError when accessing the GPUs, and nvidia-smi shows the GPUs with persistence mode 'Off'.

Symptom

The GPUs are unusable, and nvidia-smi shows 'Off' in the Persistence-M column:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-SXM2...  Off   | 00000002:01:00.0 Off |                    0 |
| N/A   28C    P0    30W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-SXM2...  Off   | 00000006:01:00.0 Off |                    0 |
| N/A   31C    P0    31W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Cause

Two main causes of this problem are:

1. The nvidia-persistenced.service daemon is not running. If the service is not active, start it and check whether the GPUs are now accessible:

# systemctl start nvidia-persistenced.service

2. The nvidia-persistenced service is active but the problem persists. In this case, look for the following message in the output of 'systemctl status nvidia-persistenced -l':

Device NUMA memory is already online. This likely means that some non-NVIDIA software has auto-online the device memory before nvidia-persistenced could.

This message likely indicates that the server is missing the udev rules required for CUDA/NVIDIA.
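
The two checks above can be combined into one small helper. This is a minimal sketch, not part of the original procedure: the function name check_persistenced and its return codes are hypothetical. It also assumes the message is still within the recent log lines that 'systemctl status' prints.

```shell
#!/bin/sh
# check_persistenced (hypothetical helper): tells the two causes apart.
# Returns 1 if the service is not active (cause 1),
#         2 if the NUMA auto-online message is present (cause 2),
#         0 if neither condition is detected.
check_persistenced() {
    unit="nvidia-persistenced.service"

    # Cause 1: the daemon is not running.
    if ! systemctl is-active --quiet "$unit"; then
        echo "cause 1: $unit is not active"
        return 1
    fi

    # Cause 2: the daemon runs, but something else onlined the device memory first.
    if systemctl status -l --no-pager "$unit" 2>/dev/null |
        grep -q "Device NUMA memory is already online"; then
        echo "cause 2: device NUMA memory was onlined before nvidia-persistenced"
        return 2
    fi

    echo "ok: $unit is active and the NUMA message was not found"
    return 0
}
```

Run check_persistenced as root. A result of "cause 1" points to starting the service; "cause 2" points to the udev rule change described under Resolving The Problem.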

Environment

RHEL 7.6 with NVIDIA/CUDA toolkit.

Diagnosing The Problem

Check whether /etc/udev/rules.d/40-redhat.rules exists and whether its "Memory hotadd request" section has been commented out. By default, Red Hat auto-onlines the NUMA memory; for CUDA to work, this default action must be disabled.
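
This check can be scripted as a rough sketch. The function name memory_rule_active and its return codes are hypothetical. It assumes the memory hot-add rules are the only lines in the file that mention memory_hotplug_end, which matches the section listed under Resolving The Problem.

```shell
#!/bin/sh
# memory_rule_active (hypothetical helper): reports whether the memory
# hot-add rules are still enabled in a udev rules file.
# Returns 0 if an uncommented rule is found, 1 if all are commented out,
# and 2 if the override file does not exist.
memory_rule_active() {
    rules_file="${1:-/etc/udev/rules.d/40-redhat.rules}"

    if [ ! -f "$rules_file" ]; then
        echo "missing: $rules_file does not exist, so the default rules apply"
        return 2
    fi

    # Match lines whose first non-blank character is not '#'
    # and that reference the memory_hotplug_end label.
    if grep -Eq '^[[:space:]]*[^#[:space:]].*memory_hotplug_end' "$rules_file"; then
        echo "active: memory hot-add rules are still enabled in $rules_file"
        return 0
    fi

    echo "disabled: memory hot-add rules are commented out in $rules_file"
    return 1
}
```

Calling memory_rule_active with no argument inspects /etc/udev/rules.d/40-redhat.rules. A "missing" or "active" result means the default auto-online behavior is still in effect.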

Resolving The Problem

1. Copy the /lib/udev/rules.d/40-redhat.rules file to the directory for user overridden rules:
# cp /lib/udev/rules.d/40-redhat.rules /etc/udev/rules.d/

2. Edit the /etc/udev/rules.d/40-redhat.rules file:
# vi /etc/udev/rules.d/40-redhat.rules

3. Comment out the entire "Memory hotadd request" section and save the change:

# Memory hotadd request
#SUBSYSTEM!="memory", ACTION!="add", GOTO="memory_hotplug_end"
#PROGRAM="/bin/uname -p", RESULT=="s390*", GOTO="memory_hotplug_end"

#ENV{.state}="online"
#PROGRAM="/bin/systemd-detect-virt", RESULT=="none", ENV{.state}="online_movable"
#ATTR{state}=="offline", ATTR{state}="$env{.state}"
#LABEL="memory_hotplug_end"

4. Restart the system for the changes to take effect:
# reboot
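
Steps 1 to 3 above can also be scripted before rebooting, instead of editing the file by hand. The following is a sketch under stated assumptions, not the documented procedure. The function name disable_memory_hotadd is hypothetical. It assumes the section starts with the "# Memory hotadd request" comment line and ends with the LABEL="memory_hotplug_end" line, exactly as in the listing in step 3. If the end line is missing from the file, sed would comment out everything to the end of the file, so review the result before rebooting.

```shell
#!/bin/sh
# disable_memory_hotadd (hypothetical helper): writes a copy of the stock
# rules file to the override directory with the "Memory hotadd request"
# section commented out.
disable_memory_hotadd() {
    src="${1:-/lib/udev/rules.d/40-redhat.rules}"
    dst="${2:-/etc/udev/rules.d/40-redhat.rules}"

    # Refuse to create an empty override if the stock file is unreadable.
    [ -r "$src" ] || { echo "cannot read $src" >&2; return 1; }

    # Within the section, prefix '#' to every line that starts with a
    # character other than '#'. Blank lines and existing comments are
    # left unchanged, as are all lines outside the section.
    sed '/^# Memory hotadd request/,/^LABEL="memory_hotplug_end"/ s/^\([^#]\)/#\1/' \
        "$src" > "$dst"
}
```

Run as root with no arguments, the function reads /lib/udev/rules.d/40-redhat.rules and writes /etc/udev/rules.d/40-redhat.rules. Passing two scratch paths first makes it possible to inspect the output without touching the live rules.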

Document Location

Worldwide
