Installing the NVIDIA GPU Operator

Install the NVIDIA GPU Operator on an air-gapped cluster.

In an air-gapped cluster, the GPU Operator requires all images to be hosted in a local image registry that is accessible to each node in the cluster. To allow the GPU Operator to work with a local registry, you must modify the values.yaml file.
Note: You require a jump host that can access both the internet and the intranet. On this host, you can download external resources and push them to the local storage that the air-gapped cluster uses.
Complete the following steps to install the GPU Operator on an air-gapped cluster. To install the NVIDIA GPU Operator on a cluster that is connected to the internet, see OpenShift on NVIDIA GPU Accelerated Clusters instead.

Step 1: Local image registry

Create a local image registry. This registry must be accessible to all nodes in the cluster.
  1. Log in to your OpenShift® cluster as an administrator:
    oc login OpenShift_URL:port
  2. Create a namespace named gpu-operator-resources:
    oc new-project gpu-operator-resources
  3. Set up a local image registry. You can use the default OpenShift internal registry or your own local image registry.
    To use the default OpenShift internal registry, make sure to do the following:
    1. Allow the OpenShift Docker registry to be accessible from outside the cluster:
      oc patch configs.imageregistry.operator.openshift.io/cluster --patch '{"spec":{"defaultRoute":true}}' --type=merge
    2. Get the OpenShift image registry URL:
      oc get route/default-route -n openshift-image-registry --template='{{ .spec.host }}'
      For example, the image registry URL used in the next steps is default-route-openshift-image-registry.your.image.registry.ibm.com/wmla where we assume the image namespace is wmla.
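      Before you push images in the next step, log in to the exposed internal registry from the jump host. A minimal sketch using podman and your OpenShift token (docker login works the same way; the registry host is the example value above, and --tls-verify=false is only needed if the route uses a self-signed certificate):
      podman login -u "$(oc whoami)" -p "$(oc whoami -t)" --tls-verify=false default-route-openshift-image-registry.your.image.registry.ibm.com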
  4. Download and push the following images to your local registry:
    nvcr.io/nvidia/gpu-operator:1.5.1
    nvcr.io/nvidia/driver:450.80.02-rhcos4.6
    nvcr.io/nvidia/gpu-feature-discovery:v0.3.0
    quay.io/kubernetes_incubator/node-feature-discovery:v0.6.0
    nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.2-ubuntu20.04
    nvcr.io/nvidia/k8s-device-plugin:v0.7.3
    nvcr.io/nvidia/k8s/container-toolkit:1.4.3-ubuntu18.04
    nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
    nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59
    Note: The last image is referenced by digest only; tag it and push it with an explicit image tag, for example:
    docker tag nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59 default-route-openshift-image-registry.your.image.registry.ibm.com/wmla/cuda:v1-x86_64
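    The general workflow on the jump host is to pull each image, retag it for the local registry, and push it. A sketch for the first image, using the example registry URL and the wmla namespace from above (podman can be used in place of docker):
    docker pull nvcr.io/nvidia/gpu-operator:1.5.1
    docker tag nvcr.io/nvidia/gpu-operator:1.5.1 default-route-openshift-image-registry.your.image.registry.ibm.com/wmla/gpu-operator:1.5.1
    docker push default-route-openshift-image-registry.your.image.registry.ibm.com/wmla/gpu-operator:1.5.1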
  5. Edit the repository and imagePullSecrets values in the values.yaml file:
    operator:
      repository: <repo.example.com:port>
      image: gpu-operator
      version: 1.5.1
      imagePullSecrets: []
      validator:
        image: cuda-sample
        repository: <my-repository:port>
        version: vectoradd-cuda10.2
        imagePullSecrets: []
    ...
    1. Replace all occurrences of <repo.example.com:port> and <my-repository:port> in values.yaml with your local image registry URL and port, or with your registry URL and namespace.
    2. If your local image registry requires authentication, add an image pull secret by updating the imagePullSecrets value in values.yaml.
      Note:
      If you are using the default OpenShift internal registry, you must first create the image pull secret:
      kubectl create secret docker-registry local-registry-sec  -n gpu-operator-resources --docker-username=admin --docker-password=admin --docker-server=registry.ocp4.wmlagc.org:5000/wmla
      For example, set the value of imagePullSecrets to local-registry-sec:
      ...
      operator:
        repository: default-route-openshift-image-registry.your.image.registry.ibm.com/wmla
        image: gpu-operator
        version: 1.5.1
        imagePullSecrets: []
        validator:
          image: cuda-sample
          repository: <my-repository:port>
          version: vectoradd-cuda10.2
          imagePullSecrets: ["local-registry-sec"]
      ...
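    If you do not already have a values.yaml file to edit, you can extract the chart defaults with Helm on the jump host. A sketch, assuming the nvidia Helm repository from Step 3 has been added and that chart version 1.6.2 is used:
    helm show values nvidia/gpu-operator --version 1.6.2 > values.yaml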

Step 2: Local package repository

Create a local package repository.
  1. Prepare the local package mirror. See Local Package Repository.
  2. After packages are mirrored to the local repository, create a ConfigMap with the repo list file in the gpu-operator-resources namespace:
    kubectl create configmap repo-config -n gpu-operator-resources --from-file=<path-to-repo-list-file>
    Replace <path-to-repo-list-file> with the location of the repo list file.
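    The format of the repo list file depends on the operating system of the driver container. A minimal sketch of a yum-style .repo file for a RHEL-based driver image, where http://repo.example.com/rhel8 is a placeholder for your local mirror:
    [local-baseos]
    name=Local BaseOS mirror
    baseurl=http://repo.example.com/rhel8/baseos
    enabled=1
    gpgcheck=0

    [local-appstream]
    name=Local AppStream mirror
    baseurl=http://repo.example.com/rhel8/appstream
    enabled=1
    gpgcheck=0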

Step 3: Install the NVIDIA GPU Operator

  1. Obtain the Helm command line tool using one of the following options:
    • Download and extract the release archive with wget:
      wget https://get.helm.sh/helm-v3.5.1-linux-amd64.tar.gz
      tar -zxvf helm-v3.5.1-linux-amd64.tar.gz && mv linux-amd64/helm /usr/local/bin/helm
    • Use the Helm installer script:
      curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
      && chmod 700 get_helm.sh \
      && ./get_helm.sh
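    After installation, verify that the helm binary is available on your PATH:
      helm version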
  2. Add the NVIDIA Helm repository:
    helm repo add nvidia https://nvidia.github.io/gpu-operator \
    && helm repo update
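    Note: If the host where you run helm cannot reach nvidia.github.io directly, you can download the chart on the jump host and copy it to the cluster; the install command in the next step then takes the local archive in place of nvidia/gpu-operator. A sketch, assuming chart version 1.6.2:
    helm pull nvidia/gpu-operator --version 1.6.2
    # produces gpu-operator-1.6.2.tgz in the current directory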
  3. Install the GPU Operator:
    helm install --generate-name \
    nvidia/gpu-operator --version="1.6.2" \
    --set operator.defaultRuntime=crio -f values.yaml
  4. Patch the node-feature-discovery worker daemonset so that it uses the nvidia-gpu-feature-discovery service account:
    export worker_ds=$(kubectl get ds | grep node-feature-discovery-worker | awk '{print $1}')
    kubectl patch ds $worker_ds -p '{"spec":{"template":{"spec":{"serviceAccount":"nvidia-gpu-feature-discovery","serviceAccountName":"nvidia-gpu-feature-discovery"}}}}'
    During the deployment, pods prefixed with gpu- and nvidia- are created. Run the oc get po command to see the list of new pods:
    NAME                                                              READY   STATUS     RESTARTS   AGE
    gpu-feature-discovery-88h7p                                       1/1     Running    0          3m59s
    gpu-operator-1612340379-node-feature-discovery-master-868frv7xm   1/1     Running    0          4m8s
    gpu-operator-7d96948b44-fwr4l                                     1/1     Running    0          4m8s
    nvidia-container-toolkit-daemonset-lzk6z                          0/1     Init:0/1   0          3m39s
    nvidia-driver-daemonset-fm8wm                                     1/1     Running    0          3m59s
    Note:
    • If the nvidia-container-toolkit-daemonset pod reports an image pull error, patch it to use the cuda image that you pushed in Step 1:
      kubectl patch ds nvidia-container-toolkit-daemonset -p '{"spec":{"template":{"spec":{"initContainers":[{"name": "driver-validation","image":"your.image.registry:5000/user/cuda:v1-x86_64"}]}}}}'
      Replace your.image.registry:5000/user with your image registry.
    • If the nvidia-dcgm-exporter pod reports an image pull error, patch it in the same way:
      kubectl patch ds nvidia-dcgm-exporter -p '{"spec":{"template":{"spec":{"initContainers":[{"name": "init-pod-nvidia-metrics-exporter","image":"your.image.registry:5000/user/cuda:v1-x86_64"}]}}}}'
      Replace your.image.registry:5000/user with your image registry.
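    After applying the patches, you can watch the affected daemonsets roll out the corrected image (this assumes your current project is still gpu-operator-resources from Step 1):
      kubectl rollout status ds/nvidia-container-toolkit-daemonset
      kubectl rollout status ds/nvidia-dcgm-exporter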
  5. After several minutes, check the status of your pods by running the oc get pod command. All pods should be in the Running state, except the nvidia-device-plugin-validation pod, which should show Completed:
    NAME                                                              READY   STATUS      RESTARTS   AGE
    gpu-feature-discovery-88h7p                                       1/1     Running     0          20m
    gpu-operator-1612340379-node-feature-discovery-master-868frv7xm   1/1     Running     0          20m
    gpu-operator-1612340379-node-feature-discovery-worker-4ptrj       1/1     Running     0          4m1s
    gpu-operator-1612340379-node-feature-discovery-worker-gf78x       1/1     Running     0          4m1s
    gpu-operator-1612340379-node-feature-discovery-worker-lkzj8       1/1     Running     0          4m1s
    gpu-operator-1612340379-node-feature-discovery-worker-mmz8r       1/1     Running     0          4m1s
    gpu-operator-1612340379-node-feature-discovery-worker-nrsnj       1/1     Running     0          4m1s
    gpu-operator-1612340379-node-feature-discovery-worker-phb5m       1/1     Running     0          4m1s
    gpu-operator-7d96948b44-fwr4l                                     1/1     Running     0          20m
    nvidia-container-toolkit-daemonset-lzk6z                          1/1     Running     0          20m
    nvidia-dcgm-exporter-czz4x                                        1/1     Running     0          8m57s
    nvidia-device-plugin-daemonset-nr7xz                              1/1     Running     0          15m
    nvidia-device-plugin-validation                                   0/1     Completed   0          15m
    nvidia-driver-daemonset-fm8wm                                     1/1     Running     0          20m
    Note:
    If the nvidia-driver-daemonset-* pod is not created, the gpu-operator-* pods might not have detected any GPU nodes. You might need to manually label the nodes as GPU nodes. For example, for a Tesla T4 GPU, the following label must be set:
    feature.node.kubernetes.io/pci-0302_10de.present=true
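    For example, to apply the label manually, replace <node-name> with the name of the GPU node (the label shown is the Tesla T4 example above):
    oc label node <node-name> feature.node.kubernetes.io/pci-0302_10de.present=true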
  6. Check the GPU devices on the Kubernetes nodes:
    oc describe nodes | grep nvidia.com
    nvidia.com/gpu.present=true
    nvidia.com/gpu:     1
    nvidia.com/gpu:     1
    nvidia.com/gpu     0            0
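    As a final check, you can run nvidia-smi inside the driver pod to confirm that the driver sees the GPU (replace the pod name with the nvidia-driver-daemonset pod from your cluster):
    oc exec -it nvidia-driver-daemonset-fm8wm -- nvidia-smi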