What to do with “cudaSuccess (3 vs. 0) initialization error” on an IBM POWER9 processor-based system?
Author: Douglas Lehr
tl;dr
If you’re on an IBM® Power® System AC922 server and are experiencing CUDA-related initialization or memory errors when running in a containerized platform (such as Docker, Kubernetes, or Red Hat® OpenShift®), you may have a mismatch in your platform’s cpuset slice due to a race condition that brings the GPU memory online.
Run https://github.com/IBM/powerai/blob/master/support/cpuset_fix/cpuset_check.sh on the host to see if you’re affected. The script also provides a --correct parameter to fix any affected slices.
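For reference, a typical invocation might look like the following. This is only a sketch: I’m assuming the script is fetched using GitHub’s usual raw-file URL pattern and that root privileges are needed to modify cgroup directories.

# Fetch the check script (raw-file form of the repository URL above)
curl -LO https://raw.githubusercontent.com/IBM/powerai/master/support/cpuset_fix/cpuset_check.sh
chmod +x cpuset_check.sh

# Report-only run: flags any slice whose cpuset.mems disagrees with the master copy
sudo ./cpuset_check.sh

# Fix run: removes the mismatched slice directories so they can be regenerated
sudo ./cpuset_check.sh --correct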
Introduction
Over the past few years, computing has followed two dominant trends: containerization of workloads and accelerated machine learning with GPUs, specifically from NVIDIA.
As these two technologies started to overlap, problems arose. One of them was exposing a vendor’s kernel modules and kernel devices inside a container, in our case NVIDIA’s driver modules and the NVIDIA GPUs themselves. To reconcile these issues, NVIDIA has provided a fantastic set of tools for both vanilla Docker (using nvidia-docker) and Kubernetes-like environments (using k8s-device-plugin).
Though these tools are good, there was still room for users to experience issues running GPU workloads in containers. One such issue is the dreaded cudaSuccess (3 vs. 0) initialization error. If you have spent a lot of time running containers with GPUs enabled, you have probably hit this error at least once. The message flatly states that CUDA failed to initialize. Why? Well, there are many possible reasons: some revolve around software making improper CUDA calls, sometimes the devices aren’t set up properly, or maybe your device driver isn’t passed into the container correctly. While most of these cases are documented, there’s one that is light on descriptions, and I’m going to tackle it here.
The issue I’m referring to is tied to Linux® cpusets, the IBM Power System AC922 (ppc64le architecture) server (based on the IBM POWER9™ processor technology), and containerization.
Background
Power AC922 server (hardware configuration)
Before we jump into how cpusets affect running NVIDIA GPUs in a container, we need to understand what IBM and NVIDIA did with their joint POWER9/NVLink 2.0 venture. POWER9 processor-based servers, specifically the Power AC922 servers, come with two physical POWER9 processors and up to four NVIDIA Tesla V100 GPUs. Section 2.1 of IBM Power System AC922 from IBM Redbooks® describes the hardware layout. In short, Power AC922 servers use NVLink 2.0 to connect the GPUs directly to the CPUs instead of the traditional PCIe bus. This allows for higher bandwidth, lower latency, and the most important part of this whole discussion: coherent access to GPU memory.
It is because of this coherency that we experience the uniqueness of this problem. To allow the GPU memory to be accessible by applications running on the CPU, the decision was made to treat the GPUs as additional nonuniform memory access (NUMA) nodes.
A sample numactl --hardware command from an IBM Power AC922 server illustrates this setup:
numactl --hardware
available: 6 nodes (0,8,252-255)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
node 0 size: 257742 MB
node 0 free: 48358 MB
node 8 cpus: 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159
node 8 size: 261735 MB
node 8 free: 186807 MB
node 252 cpus:
node 252 size: 16128 MB
node 252 free: 16115 MB
node 253 cpus:
node 253 size: 16128 MB
node 253 free: 16117 MB
node 254 cpus:
node 254 size: 16128 MB
node 254 free: 16117 MB
node 255 cpus:
node 255 size: 16128 MB
node 255 free: 16117 MB
Note: To avoid any potential collisions between CPU nodes and GPU nodes, the numbering for GPUs starts at 255 and goes backwards, while CPUs start at 0. On a Power AC922 server, we have two CPUs (0,8) with 80 threads and 256 GB of memory each, and four GPUs (252-255) with 16 GB of memory each. (GPU threads aren’t listed here.)
cpusets
Now that you understand the hardware makeup of a Power AC922 server, let’s dive into a little bit of background on cpusets (https://www.kernel.org/doc/Documentation/cgroup-v1/cpusets.txt). Cpusets are a mechanism that allows CPU and memory nodes to be assigned to tasks, services, virtual machines, or containers, which lets the kernel limit which resources can be seen. There are many aspects to cpusets and you can spend hours reading about all of them. In our case, we’re mostly interested in the cpuset.mems file under sysfs. cpuset.mems lists which memory nodes are available at a given time. The default values are kept in /sys/fs/cgroup/cpuset/cpuset.mems, with various subdirectories keeping their own copy of cpuset.mems.
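If you want to poke at this yourself, the commands below show the master copy and every per-directory copy of cpuset.mems on the host. This is a minimal sketch that assumes the cgroup v1 layout described above.

# Master list of memory nodes
cat /sys/fs/cgroup/cpuset/cpuset.mems

# Every slice, scope, and container keeps its own copy under the same hierarchy
find /sys/fs/cgroup/cpuset -name cpuset.mems -exec grep -H . {} \; 2>/dev/null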
The GPU nodes, however, do not get enabled by default. The systemd service nvidia-persistenced brings the GPU memory online, and the cpusets get updated.
For example:
With the nvidia-persistenced service up:

systemctl start nvidia-persistenced
cat /sys/fs/cgroup/cpuset/cpuset.mems
0,8,252-255

With the nvidia-persistenced service down:

systemctl stop nvidia-persistenced
cat /sys/fs/cgroup/cpuset/cpuset.mems
0,8
Slices
One last piece of background before getting to the crux of the issue is the concept of a slice unit. In systemd’s words, “A slice unit is a concept for hierarchically managing resources of a group of processes.”
In the example considered in this article, there are three slices that we need to be concerned about. With RHEL 7.6, using Red Hat’s version of Docker or Podman, the slice in question is system.slice (normally located at /sys/fs/cgroup/cpuset/system.slice).
For Kubernetes or OpenShift, the slice is kubepods.slice, located at /sys/fs/cgroup/cpuset/kubepods.slice.
Finally, later docker-ce versions appear to use the docker slice, which is at /sys/fs/cgroup/cpuset/docker. I’m not sure why they dropped the “.slice” from the name, but that’s neither here nor there.
Within these slices, a subslice is created each time a container gets spun up, passing along the necessary cpuset information. Each slice and subslice contains various details, including the cpuset.mems file that contains the memory nodes.
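To see whether a given slice has drifted out of sync, you can compare its cpuset.mems with the master copy. A quick sketch, assuming kubepods.slice is the slice your platform uses; substitute system.slice or docker as appropriate:

# Master list of memory nodes versus the list the slice was created with;
# any diff output means the slice is out of sync
diff /sys/fs/cgroup/cpuset/cpuset.mems \
     /sys/fs/cgroup/cpuset/kubepods.slice/cpuset.mems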
So, what happened?
We talked about the Power AC922 CPU memory being coherently attached to the GPU memory. For that to work, the GPU memory needs to stay online at all times. Normally, when a device is no longer in use, the kernel tears down the corresponding kernel module and devices. To keep the GPUs online, a systemd service was created (aptly named nvidia-persistenced). With this service, we can guarantee that the GPU memory stays online regardless of whether the GPUs are actively in use. The problem? This service is started by systemd, just like Docker and Kubernetes. Unless Docker or Kubernetes explicitly waits for the nvidia-persistenced service to start up and finish bringing the GPU memory online, which can take up to 5 minutes past startup, they will take whatever is available in the master cpuset and use it as the base system configuration.
When a process grabs the Linux control group (cgroup) too early, the cpuset.mems file will reflect an incomplete list of memory resources. For example, "0,8,253-255", which tells us there are two CPU nodes and only three GPU nodes. If the system really had just three GPU nodes, this would be a valid description, but odds are the system has four GPUs and the value should have been "0,8,252-255" to signify that all four are present.
Even when a containerization platform has an incomplete list of the GPU memory nodes, the problem stays masked until CUDA tries to initialize memory against the missing node. On starting up a container, the NVIDIA driver and devices will be passed through, depending on the rules you have set up, regardless of the memory nodes that are specified in the cgroup.
That is, although your cpuset.mems indicates that you have nodes 253-255 (nvidia0-nvidia2) and that 252 (nvidia3) is missing, the NVIDIA container plug-ins or hooks can still pass nvidia3 into a container, because by the time the container was started, all four GPUs were online. We now have a case where a GPU whose memory node, as far as the cgroup is concerned, doesn’t exist is being passed into the container.
Why doesn’t this fail all the time?
When a machine is in this incorrect state, GPU devices and drivers can still be added to a container, and even driver-based commands such as nvidia-smi can produce correct output. This is because none of those commands try to allocate memory on the GPU. (I’m sure someone will speak up and tell me that driver commands do in fact allocate some memory on a GPU, and they’re probably right, but they’re not using the cgroup values to do so; odds are the request is sent to the host and handled by the driver itself.)
When the code in a container tries to allocate CUDA memory against a device that doesn’t have a corresponding value in the cpuset.mems file, errors occur. Normally it shows up as the cudaSuccess (3 vs. 0) initialization error, but other flavors can appear depending on how the memory is being allocated. A lot of code, such as the deviceQuery program from the CUDA samples, will try to query every device available to it, and as soon as CUDA tries to allocate memory against the missing node, things start to go wrong. If you knew which device wasn’t in the cpuset.mems file, you could set CUDA_VISIBLE_DEVICES to cordon off that device, and the rest of the code should work; a sketch of that follows. However, this isn’t a viable long-term solution as it effectively makes a GPU unusable in a containerized environment.
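As a stopgap, cordoning off the affected device inside a container looks roughly like this. It is a sketch only: the image name and command are hypothetical, and the device index to exclude depends on which memory node is missing from your cpuset.mems.

# Suppose nvidia3 (memory node 252) is the one missing from the container's cgroup.
# Exposing only devices 0-2 to CUDA keeps allocations away from the missing node.
docker run --rm --runtime=nvidia \
  -e CUDA_VISIBLE_DEVICES=0,1,2 \
  my-gpu-image python train.py   # hypothetical image and entry point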
The solution?
A bug has been created to track this problem: https://bugzilla.redhat.com/show_bug.cgi?id=1746415. While it is being worked on, there are some workarounds, most of which involve correcting the problematic cpuset slices. I’ve written a script (https://github.com/IBM/powerai/blob/master/support/cpuset_fix/cpuset_check.sh) that checks the slices used by the common containerization platforms (Docker, Kubernetes, and OpenShift). If it detects a mismatch between a slice folder’s cpuset.mems and the master cpuset.mems, it notifies the user. If desired, the script will also correct the problem by removing the slice folders altogether. This needs to be done because the slice folders aren’t deleted when the respective services are shut down or restarted, so bouncing Kubernetes, for example, keeps the same kubepods.slice as before and you’ll still have the problem.
If we remove the slice folders altogether prior to (re)starting the respective service, the service regenerates the cgroup slice from the master version, so the correct values are ingested and applied.
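For completeness, a rough manual equivalent of that recovery is shown below. This is my own sketch rather than the script itself; it assumes a Kubernetes host using the kubepods.slice shown earlier and that no GPU containers are running at the time.

# Stop the service that owns the slice
sudo systemctl stop kubelet

# cgroup directories are removed with rmdir, children before parents,
# hence the -depth traversal
sudo find /sys/fs/cgroup/cpuset/kubepods.slice -depth -type d -exec rmdir {} \;

# Restart the service; it recreates the slice from the master cpuset.mems
sudo systemctl start kubelet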
I tried editing the cpuset.mems file for certain slice groups, and with the right permissions you can do this. However, I don’t recommend it: you’ll end up with containers that have differing copies of the cpuset.mems file within a single orchestration, leading to some pretty unpredictable results. The best approach I can think of at the moment is to bring down the service, run the script to remove the existing incorrect values, and let the service come up naturally.
One last caveat to mention: cgroups and cpusets all reside under sysfs, which means they are regenerated after each reboot. So, any time a system is restarted, there’s a risk that this issue could happen again. One workaround is to delay the startup of Docker, Kubernetes, OpenShift, and so on until the NVIDIA GPUs have had time to come online, as shown in the sketch below. This may not be ideal, but it is still a better alternative than having to shut down the service mid-production to address the problem.
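One way to express that delay is a systemd drop-in that orders Docker after nvidia-persistenced and waits for the GPU memory nodes to appear. Treat it as an illustrative sketch: the drop-in file name, the five-minute timeout, and the check for node 252 are my assumptions based on the four-GPU layout above.

sudo mkdir -p /etc/systemd/system/docker.service.d
sudo tee /etc/systemd/system/docker.service.d/wait-for-gpu-mem.conf <<'EOF'
[Unit]
# Only start once the persistence daemon has been launched
After=nvidia-persistenced.service
Wants=nvidia-persistenced.service

[Service]
# Poll until the GPU memory nodes (252-255 on this system) show up in the
# master cpuset, giving up after about 5 minutes so Docker still starts
ExecStartPre=/bin/bash -c 'for i in $(seq 1 60); do grep -q 252 /sys/fs/cgroup/cpuset/cpuset.mems && break; sleep 5; done'
EOF
sudo systemctl daemon-reload

The break-instead-of-fail choice is deliberate: if the nodes never appear, Docker still comes up after the timeout and you can investigate, rather than blocking the boot.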
In summary, this is an issue that is unique to a specific server because of its coherently attached GPU memory. In creating this feature, an exposure in cgroups was uncovered: memory nodes can be brought online after startup without being passed along to existing slices.
Thanks for your time and, as always, feel free to contact me if you have any questions!
Document Information
Modified date: 08 February 2021
UID: ibm16412651