Troubleshooting
Problem
When accessing the GPUs, CUDA fails with cudaErrorInitializationError and nvidia-smi shows 'Off' for the GPUs.
Symptom
The GPUs are unusable, and nvidia-smi shows 'Off' for each GPU:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-SXM2...  Off  | 00000002:01:00.0 Off |                    0 |
| N/A   28C    P0    30W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-SXM2...  Off  | 00000006:01:00.0 Off |                    0 |
| N/A   31C    P0    31W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Cause
The two main causes of this problem are:
1. The nvidia-persistenced.service daemon is not running. If the service is not active, start it and check whether the GPUs are now accessible:
# systemctl start nvidia-persistenced.service
2. The nvidia-persistenced service is active but the problem persists. In that case, look for the following message in the output of 'systemctl status nvidia-persistenced -l':
Device NUMA memory is already online. This likely means that some non-NVIDIA software has auto-online the device memory before nvidia-persistenced could.
This message likely indicates that the server is missing the udev rules required for CUDA/NVIDIA.
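The two checks above can be combined into a short script. This is a minimal sketch, not an official tool: the function name has_numa_online_message is hypothetical, and the search string is taken from the message quoted above, so it may need adjusting if the driver words the message differently.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if the text on stdin contains the
# "already online" message quoted in this document.
has_numa_online_message() {
    grep -q 'NUMA memory is already online'
}

# Run the live checks only on systems that have systemctl.
if command -v systemctl >/dev/null 2>&1; then
    if ! systemctl is-active --quiet nvidia-persistenced.service; then
        # Cause 1: the daemon is not running.
        echo "nvidia-persistenced is not active; try: systemctl start nvidia-persistenced.service"
    elif systemctl status nvidia-persistenced -l 2>&1 | has_numa_online_message; then
        # Cause 2: the daemon runs, but the memory was onlined before it could act.
        echo "NUMA memory was auto-onlined before nvidia-persistenced; check the udev rules"
    else
        echo "nvidia-persistenced is active and the NUMA message was not found"
    fi
fi
```

The script only reports what it finds; it does not start the service or change any configuration.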
Environment
RHEL 7.6 with the NVIDIA/CUDA toolkit.
Diagnosing The Problem
Check whether /etc/udev/rules.d/40-redhat.rules exists and whether its "Memory hotadd request" section has been commented out. By default, Red Hat auto-onlines the NUMA memory; for CUDA to work, this default action must be disabled.
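The diagnosis can be sketched as a small script. It assumes the section starts with the "# Memory hotadd request" comment and ends at the LABEL="memory_hotplug_end" line, as shown in this document. The function name active_memory_rules is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: prints every uncommented, non-blank line between
# the "# Memory hotadd request" comment and the LABEL="memory_hotplug_end"
# line (commented or not) of the rules file given as $1.
# Returns non-zero when no active line is left in the section.
active_memory_rules() {
    sed -n '/^# Memory hotadd request/,/^#\{0,1\} *LABEL="memory_hotplug_end"/p' "$1" \
        | grep -v -e '^#' -e '^[[:space:]]*$'
}

rules=/etc/udev/rules.d/40-redhat.rules
if [ ! -f "$rules" ]; then
    echo "$rules not found; the copy in /lib/udev/rules.d is in effect"
elif active_memory_rules "$rules" >/dev/null; then
    echo "Memory hotadd section is still active in $rules"
else
    echo "Memory hotadd section is commented out in $rules"
fi
```

Calling active_memory_rules on its own without the redirect lists the lines that still need to be commented out.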
Resolving The Problem
1. Copy the /lib/udev/rules.d/40-redhat.rules file to the directory for user overridden rules:
# cp /lib/udev/rules.d/40-redhat.rules /etc/udev/rules.d/
2. Edit the /etc/udev/rules.d/40-redhat.rules file:
# vi /etc/udev/rules.d/40-redhat.rules
3. Comment out the entire "Memory hotadd request" section and save the change:
# Memory hotadd request
#SUBSYSTEM!="memory", ACTION!="add", GOTO="memory_hotplug_end"
#PROGRAM="/bin/uname -p", RESULT=="s390*", GOTO="memory_hotplug_end"
#ENV{.state}="online"
#PROGRAM="/bin/systemd-detect-virt", RESULT=="none", ENV{.state}="online_movable"
#ATTR{state}=="offline", ATTR{state}="$env{.state}"
#LABEL="memory_hotplug_end"
4. Restart the system for the changes to take effect:
# reboot
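As an alternative to editing the file by hand in steps 2 and 3, the section can be commented out with sed. This is a minimal sketch that assumes GNU sed (as shipped with RHEL) and the section layout shown in step 3. The function name comment_out_memory_hotadd is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: prefixes '#' to every active line from the
# "# Memory hotadd request" comment through the LABEL="memory_hotplug_end"
# line in the rules file given as $1. Blank lines and lines that are
# already commented are left alone, so running it twice is harmless.
# Uses GNU sed in-place editing (-i).
comment_out_memory_hotadd() {
    sed -i '/^# Memory hotadd request/,/^#\{0,1\}LABEL="memory_hotplug_end"/ s/^\([^#]\)/#\1/' "$1"
}

# Usage as root, mirroring steps 1 to 3 (step 4, the reboot, stays manual):
#   cp /lib/udev/rules.d/40-redhat.rules /etc/udev/rules.d/
#   comment_out_memory_hotadd /etc/udev/rules.d/40-redhat.rules
```

Review the resulting file before rebooting, since the sketch has only been reasoned through against the section layout shown above.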
Document Location
Worldwide
Document Information
Modified date:
16 December 2019
UID
ibm11136344