NVIDIA GPU driver fails to initialize
After installation, the NVIDIA GPU driver fails to load successfully.
Symptoms
When you run an application that uses a GPU resource in your IBM Cloud Private cluster, the GPU library can fail to initialize. To check whether you are experiencing this issue, complete the following steps:
- Run the following sample GPU test in your container:

  ```yaml
  apiVersion: apps/v1beta2
  kind: Deployment
  metadata:
    name: cuda-vector-add
  spec:
    replicas: 1
    selector:
      matchLabels:
        run: cuda-vector-add
    template:
      metadata:
        labels:
          run: cuda-vector-add
      spec:
        containers:
        - name: cuda-vector-add
          image: gcr.io/kubernetes-e2e-test-images/cuda-vector-add:2.0
          command:
          - "/bin/sh"
          - "-c"
          args:
          - nvidia-smi && tail -f /dev/null
          resources:
            limits:
              nvidia.com/gpu: 2
  ```
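  You can create this deployment with `kubectl`. A minimal sketch, assuming the manifest is saved as `cuda-vector-add.yaml` (the file name is an assumption, not part of the original procedure):

  ```
  # cuda-vector-add.yaml is an assumed name for the manifest above
  kubectl apply -f cuda-vector-add.yaml
  # Confirm that the test pod is scheduled; the label matches the manifest's template labels
  kubectl get pods -l run=cuda-vector-add
  ```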
- Run the `deviceQuery` test case within the container:

  ```
  root@cuda-vector-add-764ff484-cqjph:/usr/local/cuda-10.0/samples/1_Utilities/deviceQuery# ./deviceQuery
  ```

  If the command results in an `initialization error` and `Result = FAIL`, you might have encountered this issue. The following output shows a similar result:

  ```
  ./deviceQuery Starting...

   CUDA Device Query (Runtime API) version (CUDART static linking)

  cudaGetDeviceCount returned 3
  -> initialization error
  Result = FAIL
  ```
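  A possible further check before you apply the fix is to look at the NVIDIA device plug-in itself. A sketch, assuming the plug-in pods run in the `kube-system` namespace, as they do in the solution steps that follow:

  ```
  # List the NVIDIA device plug-in pods and inspect the first one's log for driver errors
  kubectl -n kube-system get pods | grep nvidia-device-plugin
  kubectl -n kube-system logs $(kubectl -n kube-system get pods | grep nvidia-device-plugin | awk '{print $1}' | head -n 1)
  ```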
Solution
This issue can occur when the GPU driver library was not installed successfully at the time the GPU device plug-in was first created. To resolve the issue, complete the following steps:
- Remove the GPU device volume of the kubelet on the GPU node:

  ```
  rm -rf /var/lib/kubelet/device-plugins/nvidia-driver/
  ```
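  Before you restart the plug-in, you can confirm that the stale driver volume is gone; this check is an addition of mine, not part of the original procedure:

  ```
  # The nvidia-driver directory should no longer be listed
  ls /var/lib/kubelet/device-plugins/
  ```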
- Restart the NVIDIA GPU device plug-in:

  ```
  kubectl -n kube-system delete pods $(kubectl -n kube-system get pods | grep nvidia-device-plugin | awk '{print $1}')
  ```
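  Assuming the plug-in is managed by a DaemonSet (typical for NVIDIA device plug-in deployments, though not stated in this procedure), the deleted pods are recreated automatically. One way to confirm that they are back and that the node advertises GPUs again, where `<gpu-node>` is a placeholder for your node name:

  ```
  # Wait for the recreated plug-in pods to reach Running status
  kubectl -n kube-system get pods | grep nvidia-device-plugin
  # Verify that the node reports the nvidia.com/gpu resource
  kubectl describe node <gpu-node> | grep nvidia.com/gpu
  ```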
- Redeploy the sample GPU application and run the sample test again; a possible redeploy sequence is sketched below.
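  A minimal sketch, reusing the assumed `cuda-vector-add.yaml` manifest file from the symptoms section:

  ```
  # Remove the old test deployment and create a fresh one from the assumed manifest file
  kubectl delete deployment cuda-vector-add
  kubectl apply -f cuda-vector-add.yaml
  ```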
- Run the following command to ensure that the sample GPU application is running:

  ```
  # kubectl get pods
  ```

  Sample output:

  ```
  NAME                               READY   STATUS    RESTARTS   AGE
  cuda-vector-add-858b9445cb-mdlrp   1/1     Running   0          70s
  ```
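  Because the sample manifest's container runs `nvidia-smi` before it sleeps, the pod log is another quick health signal; the pod name below is the one from the sample output:

  ```
  # The args in the sample manifest run nvidia-smi first, so its output lands in the pod log
  kubectl logs cuda-vector-add-858b9445cb-mdlrp
  ```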
- Run the following commands inside the container to build the test:

  ```
  # kubectl exec -it cuda-vector-add-858b9445cb-mdlrp bash
  root@cuda-vector-add-858b9445cb-mdlrp:/usr/local/cuda-10.0/samples/0_Simple/vectorAdd# cd ../../1_Utilities/deviceQuery
  root@cuda-vector-add-858b9445cb-mdlrp:/usr/local/cuda-10.0/samples/1_Utilities/deviceQuery# make
  /usr/local/cuda-10.0/bin/nvcc -ccbin g++ -I../../common/inc -m64 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -o deviceQuery.o -c deviceQuery.cpp
  /usr/local/cuda-10.0/bin/nvcc -ccbin g++ -m64 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -o deviceQuery deviceQuery.o
  mkdir -p ../../bin/ppc64le/linux/release
  cp deviceQuery ../../bin/ppc64le/linux/release
  ```
- Run the `deviceQuery` test:

  ```
  root@cuda-vector-add-858b9445cb-mdlrp:/usr/local/cuda-10.0/samples/1_Utilities/deviceQuery# ./deviceQuery
  ```

  If the command returns the device information, your driver library initialized successfully. The following output shows a similar result:

  ```
  ./deviceQuery Starting...

   CUDA Device Query (Runtime API) version (CUDART static linking)

  Detected 1 CUDA Capable device(s)

  Device 0: "Tesla V100-SXM2-16GB"
    CUDA Driver Version / Runtime Version          10.1 / 10.0
    CUDA Capability Major/Minor version number:    7.0
  ```
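  As a final end-to-end check, you can also run the `vectorAdd` sample; its presence at the path below is an assumption based on the container's initial working directory in the exec step:

  ```
  # Exercise the GPU with the image's vector addition sample (path is an assumption)
  root@cuda-vector-add-858b9445cb-mdlrp:/usr/local/cuda-10.0/samples/1_Utilities/deviceQuery# cd ../../0_Simple/vectorAdd && ./vectorAdd
  ```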