NVIDIA GPU driver fails to initialize

After installation, the NVIDIA GPU driver fails to load.

Symptoms

When you run an application in your IBM Cloud Private cluster that requests a GPU resource, the GPU driver library can fail to initialize. To check whether you are experiencing this issue, complete the following steps:

  1. Deploy the following sample GPU test application in your cluster:

     apiVersion: apps/v1beta2
     kind: Deployment
     metadata:
       name: cuda-vector-add
     spec:
       replicas: 1
       selector:
         matchLabels:
           run: cuda-vector-add
       template:
         metadata:
           labels:
             run: cuda-vector-add
         spec:
           containers:
           - name: cuda-vector-add
             image: gcr.io/kubernetes-e2e-test-images/cuda-vector-add:2.0
             command:
             - "/bin/sh"
             - "-c"
             args:
             - nvidia-smi && tail -f /dev/null
             resources:
               limits:
                 nvidia.com/gpu: 2
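
     The container command runs nvidia-smi as a quick diagnostic and then blocks on tail -f /dev/null to keep the pod alive for interactive debugging. The short-circuit behavior of && can be sketched locally, using true and false as stand-ins for nvidia-smi (which is only available inside the GPU container):

```shell
# Stand-in for a successful nvidia-smi: the keep-alive step runs.
true && echo "diagnostic ok: container stays up for debugging"

# Stand-in for a failed nvidia-smi: the keep-alive step is skipped,
# so the container would exit instead of idling.
false && echo "never printed"

echo "demo finished"
```

     Because of the short circuit, a pod whose diagnostic fails exits immediately rather than lingering in a Running state.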
    
  2. Run the deviceQuery test case within the container:

     root@cuda-vector-add-764ff484-cqjph:/usr/local/cuda-10.0/samples/1_Utilities/deviceQuery# ./deviceQuery
    

    If the command results in an initialization error and Result = FAIL, you might have encountered this issue. The following output shows a similar result:

     ./deviceQuery Starting...
    
     CUDA Device Query (Runtime API) version (CUDART static linking)
    
     cudaGetDeviceCount returned 3
     -> initialization error
     Result = FAIL
    

Solution

This issue can occur when the GPU driver library was not successfully installed at the time the GPU device plug-in was first created. To resolve this issue, complete the following steps:

  1. Remove the stale NVIDIA driver volume from the kubelet device-plugin directory on the GPU node:

     rm -rf /var/lib/kubelet/device-plugins/nvidia-driver/
    
  2. Restart the NVIDIA GPU device plug-in:

     kubectl -n kube-system delete pods $(kubectl -n kube-system get pods | grep nvidia-device-plugin | awk '{print $1}')
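
     The restart command selects the device plug-in pods by piping the pod listing through grep and extracting the first column (the pod name) with awk. That filtering step can be checked on sample output; the pod names below are illustrative:

```shell
# Simulated `kubectl -n kube-system get pods` output piped through the
# same grep/awk filter the restart command uses to select pod names.
printf '%s\n' \
  'NAME                          READY   STATUS    RESTARTS   AGE' \
  'nvidia-device-plugin-x7k2p    1/1     Running   0          3d' \
  'coredns-5644d7b6d9-abcde      1/1     Running   0          3d' \
  | grep nvidia-device-plugin | awk '{print $1}'
# prints: nvidia-device-plugin-x7k2p
```

     Deleting the pods is sufficient because the device plug-in runs as a DaemonSet, which recreates them automatically.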
    
  3. Redeploy the sample GPU application and run the sample test again.

    1. Run the following command to ensure that the sample GPU application is running:

      # kubectl get pods
      

      Sample output:

      NAME                               READY   STATUS    RESTARTS   AGE
      cuda-vector-add-858b9445cb-mdlrp   1/1     Running   0          70s
      
    2. Run the following sample code inside the container to run the test:

      # kubectl exec -it cuda-vector-add-858b9445cb-mdlrp bash
      root@cuda-vector-add-858b9445cb-mdlrp:/usr/local/cuda-10.0/samples/0_Simple/vectorAdd# cd ../../1_Utilities/deviceQuery
      root@cuda-vector-add-858b9445cb-mdlrp:/usr/local/cuda-10.0/samples/1_Utilities/deviceQuery# make
      /usr/local/cuda-10.0/bin/nvcc -ccbin g++ -I../../common/inc  -m64    -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -o deviceQuery.o -c deviceQuery.cpp
      /usr/local/cuda-10.0/bin/nvcc -ccbin g++   -m64      -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -o deviceQuery deviceQuery.o
      mkdir -p ../../bin/ppc64le/linux/release
      cp deviceQuery ../../bin/ppc64le/linux/release
      
    3. Run the deviceQuery test:

      root@cuda-vector-add-858b9445cb-mdlrp:/usr/local/cuda-10.0/samples/1_Utilities/deviceQuery# ./deviceQuery
      

      If the command returns the device information, your driver library is initialized successfully. The following output shows a similar result:

      ./deviceQuery Starting...
      
      CUDA Device Query (Runtime API) version (CUDART static linking)
      
      Detected 1 CUDA Capable device(s)
      
      Device 0: "Tesla V100-SXM2-16GB"
      CUDA Driver Version / Runtime Version          10.1 / 10.0
      CUDA Capability Major/Minor version number:    7.0