
CUDA Initialization errors with persistence mode disabled

Troubleshooting


Problem

While accessing the GPUs, CUDA fails with cudaErrorInitializationError, and nvidia-smi shows the GPUs with persistence mode 'Off'.

Symptom

GPUs are unusable. nvidia-smi shows persistence mode (Persistence-M) as 'Off':
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-SXM2...  Off  | 00000002:01:00.0 Off |                    0 |
| N/A   28C    P0    30W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-SXM2...  Off  | 00000006:01:00.0 Off |                    0 |
| N/A   31C    P0    31W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
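
Persistence mode can also be queried directly instead of reading the banner. The query below uses standard nvidia-smi query fields; the output shown is illustrative of the failing state:
   # nvidia-smi --query-gpu=index,name,persistence_mode --format=csv
   index, name, persistence_mode
   0, Tesla P100-SXM2-16GB, Disabled
   1, Tesla P100-SXM2-16GB, Disabled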

Cause

Two main causes of this problem are:
1. The nvidia-persistenced.service daemon is not running. If the service is not active, start it and check whether the GPUs are now accessible:
# systemctl start nvidia-persistenced.service
2. The nvidia-persistenced service is active, but the problem persists. In this case, look for the following message in the output of 'systemctl status nvidia-persistenced -l':
Device NUMA memory is already online. This likely means that some non-NVIDIA software has auto-onlined the device memory before nvidia-persistenced could.
If this message appears, the server is missing the udev rule changes required for CUDA/NVIDIA (a combined check is sketched after this list).
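
Both checks can be combined into a short triage sequence. This is a sketch; the sample output is illustrative, and the grep pattern simply matches the message quoted above:
   # systemctl is-active nvidia-persistenced.service
   inactive
   # systemctl start nvidia-persistenced.service
   # systemctl enable nvidia-persistenced.service
   # journalctl -u nvidia-persistenced.service | grep -i "already online"
If the grep matches, proceed to the udev rules check described under "Diagnosing The Problem".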

Environment

RHEL 7.6 with the NVIDIA driver and CUDA toolkit.

Diagnosing The Problem

Check for the presence of /etc/udev/rules.d/40-redhat.rules and confirm whether the "Memory hotadd request" section has been commented out. By default, Red Hat auto-onlines NUMA memory; for CUDA to work, this default action must be disabled so that nvidia-persistenced can bring the GPU device memory online itself.
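
As a concrete check (the path is the one named above; the grep options are standard):
   # ls -l /etc/udev/rules.d/40-redhat.rules
   # grep -A 6 "Memory hotadd request" /etc/udev/rules.d/40-redhat.rules
If the file is missing, or the lines following "# Memory hotadd request" are printed without leading '#' characters, the auto-onlining rules are still active and must be disabled as described in the next section.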

Resolving The Problem

1. Copy the /lib/udev/rules.d/40-redhat.rules file to the local rules directory, where it overrides the packaged version:
   # cp /lib/udev/rules.d/40-redhat.rules /etc/udev/rules.d/
2. Edit the /etc/udev/rules.d/40-redhat.rules file:
   # vi /etc/udev/rules.d/40-redhat.rules
3. Comment out the entire "Memory hotadd request" section and save the change:
   # Memory hotadd request
   #SUBSYSTEM!="memory", ACTION!="add", GOTO="memory_hotplug_end"
   #PROGRAM="/bin/uname -p", RESULT=="s390*", GOTO="memory_hotplug_end"
   #ENV{.state}="online"
   #PROGRAM="/bin/systemd-detect-virt", RESULT=="none", ENV{.state}="online_movable"
   #ATTR{state}=="offline", ATTR{state}="$env{.state}"
   #LABEL="memory_hotplug_end"
4. Restart the system for the changes to take effect:
   # reboot
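
After the reboot, the fix can be verified with a quick check; the expected output below is illustrative (persistence mode should be reported as enabled once nvidia-persistenced starts cleanly):
   # systemctl status nvidia-persistenced -l
   (the "Device NUMA memory is already online" message should no longer appear)
   # nvidia-smi --query-gpu=index,persistence_mode --format=csv
   index, persistence_mode
   0, Enabled
   1, Enabled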

Document Location

Worldwide

[{"Business Unit":{"code":"BU054","label":"Systems w\/TPS"},"Product":{"code":"HW1W1","label":"Power ->PowerLinux"},"Component":"","Platform":[{"code":"PF043","label":"Red Hat"}],"Version":"All Versions","Edition":"","Line of Business":{"code":"","label":""}}]

Document Information

Modified date:
16 December 2019

UID

ibm11136344