What to do with “cudaSuccess (3 vs. 0) initialization error” on an IBM POWER9 processor-based system?
Author: Douglas Lehr
tl;dr
If you’re on an IBM® Power® System AC922 server and are experiencing CUDA-related initialization or memory errors when running in a containerized platform (such as Docker, Kubernetes, or Red Hat® OpenShift®), you may have a mismatch in your platform’s cpuset slice due to a race condition that brings the GPU memory online.
Run https://github.com/IBM/powerai/blob/master/support/cpuset_fix/cpuset_check.sh on the host to see if you’re affected. The script also provides a --correct parameter to fix any affected slices.
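For reference, a typical invocation might look like the following. This is only a sketch: I’m assuming the script is fetched using GitHub’s usual raw-file URL pattern and that root privileges are needed to modify cgroup directories.

# Fetch the check script (raw-file form of the repository URL above)
curl -LO https://raw.githubusercontent.com/IBM/powerai/master/support/cpuset_fix/cpuset_check.sh
chmod +x cpuset_check.sh

# Report-only run: flags any slice whose cpuset.mems disagrees with the master copy
sudo ./cpuset_check.sh

# Fix run: removes the mismatched slice directories so they can be regenerated
sudo ./cpuset_check.sh --correct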
Introduction
Over the past few years, computing has followed two dominant trends: containerization of workloads and accelerated machine learning with GPUs, specifically from NVIDIA.
As these two technologies started to overlap, problems arose. One of them was exposing a vendor’s kernel modules and kernel devices inside a container, in our case NVIDIA’s driver modules and the NVIDIA GPUs themselves. To reconcile these issues, NVIDIA has provided a fantastic set of tools for both vanilla Docker (using nvidia-docker) and Kubernetes-like environments (using k8s-device-plugin).
Though these tools are good, there was still room for users to experience issues running GPU workloads in containers. One such issue is the dreaded cudaSuccess (3 vs. 0) initialization error. If you have spent a lot of time running containers with GPUs enabled, you have probably hit this error at least once. The message flatly states that CUDA failed to initialize. Why? Well, there are many possible reasons: some revolve around software making improper CUDA calls, sometimes the devices aren’t set up properly, or maybe your device driver isn’t passed into the container correctly. While most of these cases are documented, there’s one that is light on descriptions, and I’m going to tackle it here.
The issue I’m referring to is tied to Linux® cpusets, the IBM Power System AC922 (ppc64le architecture) server (based on the IBM POWER9™ processor technology), and containerization.
Background
Power AC922 server (hardware configuration)
Before we jump into how cpusets affect running NVIDIA GPUs in a container, we need to understand what IBM and NVIDIA did with their joint POWER9/NVLink 2.0 venture. POWER9 processor-based servers, specifically the Power AC922 servers, come with two physical POWER9 processors and up to four NVIDIA Tesla V100 GPUs. Section 2.1 of IBM Power System AC922 from IBM Redbooks® describes the hardware layout. In short, Power AC922 servers use NVLink 2.0 to connect the GPUs directly to the CPUs instead of the traditional PCIe bus. This allows for higher bandwidth, lower latency, and the most important part of this whole discussion: coherent access to GPU memory.
It is because of this coherency that we experience the uniqueness of this problem. To allow the GPU memory to be accessible by applications running on the CPU, the decision was made to treat the GPUs as additional nonuniform memory access (NUMA) nodes.
A sample numactl --hardware command from an IBM Power AC922 server illustrates this setup:
numactl --hardware
available: 6 nodes (0,8,252-255)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
node 0 size: 257742 MB
node 0 free: 48358 MB
node 8 cpus: 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159
node 8 size: 261735 MB
node 8 free: 186807 MB
node 252 cpus:
node 252 size: 16128 MB
node 252 free: 16115 MB
node 253 cpus:
node 253 size: 16128 MB
node 253 free: 16117 MB
node 254 cpus:
node 254 size: 16128 MB
node 254 free: 16117 MB
node 255 cpus:
node 255 size: 16128 MB
node 255 free: 16117 MB
Note: To avoid any potential collisions between CPU nodes and GPU nodes, the numbering for GPUs starts at 255 and goes backwards, while CPUs start at 0. On a Power AC922 server, we have two CPUs (0,8) with 80 threads and 256 GB of memory each, and four GPUs (252-255) with 16 GB of memory each. (GPU threads aren’t listed here.)
cpusets
Now that you understand the hardware makeup of a Power AC922 server, let’s dive into a little bit of background on cpusets (https://www.kernel.org/doc/Documentation/cgroup-v1/cpusets.txt). Cpusets are a mechanism that allows CPU and memory nodes to be assigned to tasks, services, virtual machines, or containers, which lets the kernel limit which resources can be seen. There are many aspects to cpusets and you can spend hours reading about all of them. In our case, we’re mostly interested in the cpuset.mems file under sysfs. cpuset.mems lists which memory nodes are available at a given time. The default values are kept in /sys/fs/cgroup/cpuset/cpuset.mems, with various subdirectories keeping their own copy of cpuset.mems.
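If you want to poke at this yourself, the commands below show the master copy and every per-directory copy of cpuset.mems on the host. This is a minimal sketch that assumes the cgroup v1 layout described above.

# Master list of memory nodes
cat /sys/fs/cgroup/cpuset/cpuset.mems

# Every slice, scope, and container keeps its own copy under the same hierarchy
find /sys/fs/cgroup/cpuset -name cpuset.mems -exec grep -H . {} \; 2>/dev/null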
The GPU nodes, however, do not get enabled by default. The systemd service nvidia-persistenced brings the GPU memory online, and the cpusets get updated.
For example:
With the nvidia-persistenced service up:

systemctl start nvidia-persistenced
cat /sys/fs/cgroup/cpuset/cpuset.mems
0,8,252-255

With the nvidia-persistenced service down:

systemctl stop nvidia-persistenced
cat /sys/fs/cgroup/cpuset/cpuset.mems
0,8
Slices
One last piece of background before getting to the crux of the issue is the concept of a slice unit. In systemd’s words, “A slice unit is a concept for hierarchically managing resources of a group of processes.”
In the example considered in this article, there are three slices that we need to be concerned about. With RHEL 7.6, using Red Hat’s version of Docker or Podman, the slice in question is system.slice (normally located at /sys/fs/cgroup/cpuset/system.slice).
For Kubernetes or OpenShift, the slice is kubepods.slice, located at /sys/fs/cgroup/cpuset/kubepods.slice.
Finally, later docker-ce versions appear to use the docker slice, which is at /sys/fs/cgroup/cpuset/docker. I’m not sure why they dropped the “.slice” from the name, but that’s neither here nor there.
Within these slices, a subslice is created each time a container gets spun up, passing along the necessary cpuset information. Each slice and subslice contains various details, including the cpuset.mems file that contains the memory nodes.
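To see whether a given slice has drifted out of sync, you can compare its cpuset.mems with the master copy. A quick sketch, assuming kubepods.slice is the slice your platform uses; substitute system.slice or docker as appropriate:

# Master list of memory nodes versus the list the slice was created with;
# any diff output means the slice is out of sync
diff /sys/fs/cgroup/cpuset/cpuset.mems \
     /sys/fs/cgroup/cpuset/kubepods.slice/cpuset.mems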
So, what happened?
We talked about the Power AC922 CPU memory being coherently attached to the GPU memory. For that to work, the GPU memory needs to stay online at all times. Normally, when a device is no longer in use, the kernel tears down the corresponding kernel module and devices. To keep the GPUs online, a systemd service was created (aptly named nvidia-persistenced). With this service, we can guarantee that the GPU memory stays online regardless of whether the GPUs are actively in use. The problem? This service is started by systemd, just like Docker and Kubernetes. Unless Docker or Kubernetes explicitly waits for the nvidia-persistenced service to start up and finish bringing the GPU memory online, which can take up to 5 minutes past startup, they will take whatever is available in the master cpuset and use it as the base system configuration.
When a process grabs the Linux control group (cgroup) too early, the cpuset.mems file will reflect an incomplete list of memory resources. For example, "0,8,253-255", which tells us there are two CPU nodes and only three GPU nodes. If the system really had just three GPU nodes, this would be a valid description, but odds are the system has four GPUs and the value should have been "0,8,252-255" to signify that all four are present.
Even when a containerization platform has an incomplete list of the GPU memory nodes, the problem stays masked until CUDA tries to initialize memory against the missing node. On starting up a container, the NVIDIA driver and devices will be passed through, depending on the rules you have set up, regardless of the memory nodes that are specified in the cgroup.
That is, although your cpuset.mems indicates that you have nodes 253-255 (nvidia0-nvidia2) and that 252 (nvidia3) is missing, the NVIDIA container plug-ins or hooks can still pass nvidia3 into a container, because by the time the container was started, all four GPUs were online. We now have a case where a GPU whose memory node, as far as the cgroup is concerned, doesn’t exist is being passed into the container.
Why doesn’t this fail all the time?
When a machine is in this incorrect state, GPU devices and drivers can still be added to a container, and even driver-based commands such as nvidia-smi can produce correct output. This is because none of those commands try to allocate memory on the GPU. (I’m sure someone will speak up and tell me that driver commands do in fact allocate some memory on a GPU, and they’re probably right, but they’re not using the cgroup values to do so; odds are the request is sent to the host and handled by the driver itself.)
When the code in a container tries to allocate CUDA memory against a device that doesn’t have a corresponding value in the cpuset.mems file, errors occur. Normally it shows up as the cudaSuccess (3 vs. 0) initialization error, but other flavors can appear depending on how the memory is being allocated. A lot of code, such as the deviceQuery program from the CUDA samples, will try to query every device available to it, and as soon as CUDA tries to allocate memory against the missing node, things start to go wrong. If you knew which device wasn’t in the cpuset.mems file, you could set CUDA_VISIBLE_DEVICES to cordon off that device, and the rest of the code should work; a sketch of that follows. However, this isn’t a viable long-term solution as it effectively makes a GPU unusable in a containerized environment.
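As a stopgap, cordoning off the affected device inside a container looks roughly like this. It is a sketch only: the image name and command are hypothetical, and the device index to exclude depends on which memory node is missing from your cpuset.mems.

# Suppose nvidia3 (memory node 252) is the one missing from the container's cgroup.
# Exposing only devices 0-2 to CUDA keeps allocations away from the missing node.
docker run --rm --runtime=nvidia \
  -e CUDA_VISIBLE_DEVICES=0,1,2 \
  my-gpu-image python train.py   # hypothetical image and entry point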
The solution?
A bug has been created to track this problem: https://bugzilla.redhat.com/show_bug.cgi?id=1746415. While it is being worked on, there are some workarounds, most of which involve correcting the problematic cpuset slices. I’ve written a script (https://github.com/IBM/powerai/blob/master/support/cpuset_fix/cpuset_check.sh) that checks the slices used by the common containerization platforms (Docker, Kubernetes, and OpenShift). If it detects a mismatch between a slice folder’s cpuset.mems and the master cpuset.mems, it notifies the user. If desired, the script will also correct the problem by removing the slice folders altogether. This needs to be done because the slice folders aren’t deleted when the respective services are shut down or restarted, so bouncing Kubernetes, for example, keeps the same kubepods.slice as before and you’ll still have the problem.
If we remove the slice folders altogether prior to (re)starting the respective service, the service regenerates the cgroup slice from the master version, so the correct values are ingested and applied.
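For completeness, a rough manual equivalent of that recovery is shown below. This is my own sketch rather than the script itself; it assumes a Kubernetes host using the kubepods.slice shown earlier and that no GPU containers are running at the time.

# Stop the service that owns the slice
sudo systemctl stop kubelet

# cgroup directories are removed with rmdir, children before parents,
# hence the -depth traversal
sudo find /sys/fs/cgroup/cpuset/kubepods.slice -depth -type d -exec rmdir {} \;

# Restart the service; it recreates the slice from the master cpuset.mems
sudo systemctl start kubelet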
I tried editing the cpuset.mems file for certain slice groups, and with the right permissions you can do this. However, I don’t recommend it: you’ll end up with containers that have differing copies of the cpuset.mems file within a single orchestration, leading to some pretty unpredictable results. The best approach I can think of at the moment is to bring down the service, run the script to remove the existing incorrect values, and let the service come up naturally.
One last caveat to mention: cgroups and cpusets all reside under sysfs, which means they are regenerated after each reboot. So, any time a system is restarted, there’s a risk that this issue could happen again. One workaround is to delay the startup of Docker, Kubernetes, OpenShift, and so on until the NVIDIA GPUs have had time to come online, as shown in the sketch below. This may not be ideal, but it is still a better alternative than having to shut down the service mid-production to address the problem.
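One way to express that delay is a systemd drop-in that orders Docker after nvidia-persistenced and waits for the GPU memory nodes to appear. Treat it as an illustrative sketch: the drop-in file name, the five-minute timeout, and the check for node 252 are my assumptions based on the four-GPU layout above.

sudo mkdir -p /etc/systemd/system/docker.service.d
sudo tee /etc/systemd/system/docker.service.d/wait-for-gpu-mem.conf <<'EOF'
[Unit]
# Only start once the persistence daemon has been launched
After=nvidia-persistenced.service
Wants=nvidia-persistenced.service

[Service]
# Poll until the GPU memory nodes (252-255 on this system) show up in the
# master cpuset, giving up after about 5 minutes so Docker still starts
ExecStartPre=/bin/bash -c 'for i in $(seq 1 60); do grep -q 252 /sys/fs/cgroup/cpuset/cpuset.mems && break; sleep 5; done'
EOF
sudo systemctl daemon-reload

The break-instead-of-fail choice is deliberate: if the nodes never appear, Docker still comes up after the timeout and you can investigate, rather than blocking the boot.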
In summary, this is an issue that is unique to a specific server because of its coherently attached GPU memory. In creating this feature, an exposure in cgroups was uncovered: memory nodes can be brought online after startup without being passed along to existing slices.
Thanks for your time and, as always, feel free to contact me if you have any questions!
Document Information
Modified date: 08 February 2021
UID: ibm16412651