NVIDIA GPU driver fails to initialize

After installation, the NVIDIA GPU driver fails to load.

Symptoms

When you run an application in your IBM Cloud Private cluster that requests a GPU resource, the GPU driver library can fail to initialize. To check whether you are experiencing this issue, complete the following steps:

  1. Deploy the following sample GPU test application in your cluster:

     apiVersion: apps/v1beta2
     kind: Deployment
     metadata:
       name: cuda-vector-add
     spec:
       replicas: 1
       selector:
         matchLabels:
           run: cuda-vector-add
       template:
         metadata:
           labels:
             run: cuda-vector-add
         spec:
           containers:
           - name: cuda-vector-add
             image: gcr.io/kubernetes-e2e-test-images/cuda-vector-add:2.0
             command:
             - "/bin/sh"
             - "-c"
             args:
             - nvidia-smi && tail -f /dev/null
             resources:
               limits:
                 nvidia.com/gpu: 2
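
     The container command runs nvidia-smi as a quick diagnostic and then blocks on tail -f /dev/null to keep the pod alive for interactive debugging. The short-circuit behavior of && can be sketched locally, using true and false as stand-ins for nvidia-smi (which is only available inside the GPU container):

```shell
# Stand-in for a successful nvidia-smi: the keep-alive step runs.
true && echo "diagnostic ok: container stays up for debugging"

# Stand-in for a failed nvidia-smi: the keep-alive step is skipped,
# so the container would exit instead of idling.
false && echo "never printed"

echo "demo finished"
```

     Because of the short circuit, a pod whose diagnostic fails exits immediately rather than lingering in a Running state.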
    
  2. Run the deviceQuery test case within the container:

     root@cuda-vector-add-764ff484-cqjph:/usr/local/cuda-10.0/samples/1_Utilities/deviceQuery# ./deviceQuery
    

    If the command results in an initialization error and Result = FAIL, you might have encountered this issue. The following output shows a similar result:

     ./deviceQuery Starting...
    
     CUDA Device Query (Runtime API) version (CUDART static linking)
    
     cudaGetDeviceCount returned 3
     -> initialization error
     Result = FAIL
    

Solution

This issue can occur when the GPU driver library was not successfully installed at the time the GPU device plug-in was first created. To resolve this issue, complete the following steps:

  1. Remove the stale NVIDIA driver volume from the kubelet device-plugin directory on the GPU node:

     rm -rf /var/lib/kubelet/device-plugins/nvidia-driver/
    
  2. Restart the NVIDIA GPU device plug-in:

     kubectl -n kube-system delete pods $(kubectl -n kube-system get pods | grep nvidia-device-plugin | awk '{print $1}')
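
     The restart command selects the device plug-in pods by piping the pod listing through grep and extracting the first column (the pod name) with awk. That filtering step can be checked on sample output; the pod names below are illustrative:

```shell
# Simulated `kubectl -n kube-system get pods` output piped through the
# same grep/awk filter the restart command uses to select pod names.
printf '%s\n' \
  'NAME                          READY   STATUS    RESTARTS   AGE' \
  'nvidia-device-plugin-x7k2p    1/1     Running   0          3d' \
  'coredns-5644d7b6d9-abcde      1/1     Running   0          3d' \
  | grep nvidia-device-plugin | awk '{print $1}'
# prints: nvidia-device-plugin-x7k2p
```

     Deleting the pods is sufficient because the device plug-in runs as a DaemonSet, which recreates them automatically.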
    
  3. Redeploy the sample GPU application and run the sample test again.

    1. Run the following command to ensure that the sample GPU application is running:

      # kubectl get pods
      

      Sample output:

      NAME                               READY   STATUS    RESTARTS   AGE
      cuda-vector-add-858b9445cb-mdlrp   1/1     Running   0          70s
      
    2. Run the following sample code inside the container to run the test:

      # kubectl exec -it cuda-vector-add-858b9445cb-mdlrp bash
      root@cuda-vector-add-858b9445cb-mdlrp:/usr/local/cuda-10.0/samples/0_Simple/vectorAdd# cd ../../1_Utilities/deviceQuery
      root@cuda-vector-add-858b9445cb-mdlrp:/usr/local/cuda-10.0/samples/1_Utilities/deviceQuery# make
      /usr/local/cuda-10.0/bin/nvcc -ccbin g++ -I../../common/inc  -m64    -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -o deviceQuery.o -c deviceQuery.cpp
      /usr/local/cuda-10.0/bin/nvcc -ccbin g++   -m64      -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -o deviceQuery deviceQuery.o
      mkdir -p ../../bin/ppc64le/linux/release
      cp deviceQuery ../../bin/ppc64le/linux/release
      
    3. Run the deviceQuery test:

      root@cuda-vector-add-858b9445cb-mdlrp:/usr/local/cuda-10.0/samples/1_Utilities/deviceQuery# ./deviceQuery
      

      If the command returns the device information, your driver library is initialized successfully. The following output shows a similar result:

      ./deviceQuery Starting...
      
      CUDA Device Query (Runtime API) version (CUDART static linking)
      
      Detected 1 CUDA Capable device(s)
      
      Device 0: "Tesla V100-SXM2-16GB"
      CUDA Driver Version / Runtime Version          10.1 / 10.0
      CUDA Capability Major/Minor version number:    7.0