Installing the NVIDIA GPU Operator

Install the NVIDIA GPU Operator on an air-gapped cluster.

In an air-gapped cluster, the GPU Operator requires all images to be hosted in a local image registry that is accessible to each node in the cluster. To allow the GPU Operator to work with a local registry, you must modify the values.yaml file.
Note: You require a jump host that can access both the internet and the intranet. On this host, you can download external resources and push them to the local storage that the air-gapped cluster uses.
Complete the following steps to install the GPU Operator on an air-gapped cluster. To install the NVIDIA GPU Operator on a cluster that is connected to the internet, see OpenShift on NVIDIA GPU Accelerated Clusters instead.

Step 1: Local image registry

Create a local image registry. This registry must be accessible to all nodes in the cluster.
  1. Log in to your OpenShift® cluster as an administrator:
    oc login OpenShift_URL:port
  2. Create a namespace named gpu-operator-resources:
    oc new-project gpu-operator-resources
  3. Set up a local image registry. You can use the default OpenShift internal registry or your own local image registry.
    To use the default OpenShift internal registry, make sure to do the following:
    1. Allow the OpenShift Docker registry to be accessible from outside the cluster:
      oc patch configs.imageregistry.operator.openshift.io/cluster --patch '{"spec":{"defaultRoute":true}}' --type=merge
    2. Get the OpenShift image registry URL:
      oc get route/default-route -n openshift-image-registry --template='{{ .spec.host }}'
      For example, the image registry URL used in the next steps is default-route-openshift-image-registry.your.image.registry.ibm.com/wmla where we assume the image namespace is wmla.
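      Before you push images in the next step, log in to the exposed internal registry from the jump host. A minimal sketch using podman and your OpenShift token (docker login works the same way; the registry host is the example value above, and --tls-verify=false is only needed if the route uses a self-signed certificate):
      podman login -u "$(oc whoami)" -p "$(oc whoami -t)" --tls-verify=false default-route-openshift-image-registry.your.image.registry.ibm.com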
  4. Download and push the following images to your local registry:
    nvcr.io/nvidia/gpu-operator:1.5.1
    nvcr.io/nvidia/driver:450.80.02-rhcos4.6
    nvcr.io/nvidia/gpu-feature-discovery:v0.3.0
    quay.io/kubernetes_incubator/node-feature-discovery:v0.6.0
    nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.2-ubuntu20.04
    nvcr.io/nvidia/k8s-device-plugin:v0.7.3
    nvcr.io/nvidia/k8s/container-toolkit:1.4.3-ubuntu18.04
    nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
    nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59
    Note: The last image is referenced by digest only; tag it and push it with an explicit image tag, for example:
    docker tag nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59 default-route-openshift-image-registry.your.image.registry.ibm.com/wmla/cuda:v1-x86_64
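    The general workflow on the jump host is to pull each image, retag it for the local registry, and push it. A sketch for the first image, using the example registry URL and the wmla namespace from above (podman can be used in place of docker):
    docker pull nvcr.io/nvidia/gpu-operator:1.5.1
    docker tag nvcr.io/nvidia/gpu-operator:1.5.1 default-route-openshift-image-registry.your.image.registry.ibm.com/wmla/gpu-operator:1.5.1
    docker push default-route-openshift-image-registry.your.image.registry.ibm.com/wmla/gpu-operator:1.5.1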
  5. Edit the repository and imagePullSecrets values in the values.yaml file:
    operator:
      repository: <repo.example.com:port>
      image: gpu-operator
      version: 1.5.1
      imagePullSecrets: []
      validator:
        image: cuda-sample
        repository: <my-repository:port>
        version: vectoradd-cuda10.2
        imagePullSecrets: []
    ...
    1. Replace all occurrences of <repo.example.com:port> and <my-repository:port> in values.yaml with your local image registry URL and port, or with your registry URL and namespace.
    2. If your local image registry requires authentication, add an image pull secret by updating the imagePullSecrets value in values.yaml.
      Note:
      If you are using the default OpenShift internal registry, you must first create the image pull secret:
      kubectl create secret docker-registry local-registry-sec  -n gpu-operator-resources --docker-username=admin --docker-password=admin --docker-server=registry.ocp4.wmlagc.org:5000/wmla
      For example, set the value of imagePullSecrets to local-registry-sec:
      ...
      operator:
        repository: default-route-openshift-image-registry.your.image.registry.ibm.com/wmla
        image: gpu-operator
        version: 1.5.1
        imagePullSecrets: []
        validator:
          image: cuda-sample
          repository: <my-repository:port>
          version: vectoradd-cuda10.2
          imagePullSecrets: ["local-registry-sec"]
      ...
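    If you do not already have a values.yaml file to edit, you can extract the chart defaults with Helm on the jump host. A sketch, assuming the nvidia Helm repository from Step 3 has been added and that chart version 1.6.2 is used:
    helm show values nvidia/gpu-operator --version 1.6.2 > values.yaml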

Step 2: Local package repository

Create a local package repository.
  1. Prepare the local package mirror. See Local Package Repository.
  2. After packages are mirrored to the local repository, create a ConfigMap with the repo list file in the gpu-operator-resources namespace:
    kubectl create configmap repo-config -n gpu-operator-resources --from-file=<path-to-repo-list-file>
    Replace <path-to-repo-list-file> with the location of the repo list file.
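    The format of the repo list file depends on the operating system of the driver container. A minimal sketch of a yum-style .repo file for a RHEL-based driver image, where http://repo.example.com/rhel8 is a placeholder for your local mirror:
    [local-baseos]
    name=Local BaseOS mirror
    baseurl=http://repo.example.com/rhel8/baseos
    enabled=1
    gpgcheck=0

    [local-appstream]
    name=Local AppStream mirror
    baseurl=http://repo.example.com/rhel8/appstream
    enabled=1
    gpgcheck=0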

Step 3: Install the NVIDIA GPU Operator

  1. Obtain the Helm command line tool using one of the following options:
    • Download and extract the release archive with wget:
      wget https://get.helm.sh/helm-v3.5.1-linux-amd64.tar.gz
      tar -zxvf helm-v3.5.1-linux-amd64.tar.gz && mv linux-amd64/helm /usr/local/bin/helm
    • Use the Helm installer script:
      curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
      && chmod 700 get_helm.sh \
      && ./get_helm.sh
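    After installation, verify that the helm binary is available on your PATH:
      helm version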
  2. Add the NVIDIA Helm repository:
    helm repo add nvidia https://nvidia.github.io/gpu-operator \
    && helm repo update
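    Note: If the host where you run helm cannot reach nvidia.github.io directly, you can download the chart on the jump host and copy it to the cluster; the install command in the next step then takes the local archive in place of nvidia/gpu-operator. A sketch, assuming chart version 1.6.2:
    helm pull nvidia/gpu-operator --version 1.6.2
    # produces gpu-operator-1.6.2.tgz in the current directory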
  3. Install the GPU Operator:
    helm install --generate-name \
    nvidia/gpu-operator --version="1.6.2" \
    --set operator.defaultRuntime=crio -f values.yaml
  4. Patch the node-feature-discovery worker daemonset so that it uses the nvidia-gpu-feature-discovery service account:
    export worker_ds=$(kubectl get ds | grep node-feature-discovery-worker | awk '{print $1}')
    kubectl patch ds $worker_ds -p '{"spec":{"template":{"spec":{"serviceAccount":"nvidia-gpu-feature-discovery","serviceAccountName":"nvidia-gpu-feature-discovery"}}}}'
    During the deployment, pods prefixed with gpu- and nvidia- are created. Run the oc get po command to see the list of new pods:
    NAME                                                              READY   STATUS     RESTARTS   AGE
    gpu-feature-discovery-88h7p                                       1/1     Running    0          3m59s
    gpu-operator-1612340379-node-feature-discovery-master-868frv7xm   1/1     Running    0          4m8s
    gpu-operator-7d96948b44-fwr4l                                     1/1     Running    0          4m8s
    nvidia-container-toolkit-daemonset-lzk6z                          0/1     Init:0/1   0          3m39s
    nvidia-driver-daemonset-fm8wm                                     1/1     Running    0          3m59s
    Note:
    • If the nvidia-container-toolkit-daemonset pod reports an image pull error, patch it to use the cuda image that you pushed in Step 1:
      kubectl patch ds nvidia-container-toolkit-daemonset -p '{"spec":{"template":{"spec":{"initContainers":[{"name": "driver-validation","image":"your.image.registry:5000/user/cuda:v1-x86_64"}]}}}}'
      Replace your.image.registry:5000/user with your image registry.
    • If the nvidia-dcgm-exporter pod reports an image pull error, patch it in the same way:
      kubectl patch ds nvidia-dcgm-exporter -p '{"spec":{"template":{"spec":{"initContainers":[{"name": "init-pod-nvidia-metrics-exporter","image":"your.image.registry:5000/user/cuda:v1-x86_64"}]}}}}'
      Replace your.image.registry:5000/user with your image registry.
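    After applying the patches, you can watch the affected daemonsets roll out the corrected image (this assumes your current project is still gpu-operator-resources from Step 1):
      kubectl rollout status ds/nvidia-container-toolkit-daemonset
      kubectl rollout status ds/nvidia-dcgm-exporter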
  5. After several minutes, check the status of your pods by running the oc get pod command. All pods should be in the Running state, except the nvidia-device-plugin-validation pod, which should show Completed:
    NAME                                                              READY   STATUS      RESTARTS   AGE
    gpu-feature-discovery-88h7p                                       1/1     Running     0          20m
    gpu-operator-1612340379-node-feature-discovery-master-868frv7xm   1/1     Running     0          20m
    gpu-operator-1612340379-node-feature-discovery-worker-4ptrj       1/1     Running     0          4m1s
    gpu-operator-1612340379-node-feature-discovery-worker-gf78x       1/1     Running     0          4m1s
    gpu-operator-1612340379-node-feature-discovery-worker-lkzj8       1/1     Running     0          4m1s
    gpu-operator-1612340379-node-feature-discovery-worker-mmz8r       1/1     Running     0          4m1s
    gpu-operator-1612340379-node-feature-discovery-worker-nrsnj       1/1     Running     0          4m1s
    gpu-operator-1612340379-node-feature-discovery-worker-phb5m       1/1     Running     0          4m1s
    gpu-operator-7d96948b44-fwr4l                                     1/1     Running     0          20m
    nvidia-container-toolkit-daemonset-lzk6z                          1/1     Running     0          20m
    nvidia-dcgm-exporter-czz4x                                        1/1     Running     0          8m57s
    nvidia-device-plugin-daemonset-nr7xz                              1/1     Running     0          15m
    nvidia-device-plugin-validation                                   0/1     Completed   0          15m
    nvidia-driver-daemonset-fm8wm                                     1/1     Running     0          20m
    Note:
    If the nvidia-driver-daemonset-* pod is not created, the gpu-operator-* pods might not have detected any GPU nodes. You might need to manually label the nodes as GPU nodes. For example, for a Tesla T4 GPU, the following label must be set:
    feature.node.kubernetes.io/pci-0302_10de.present=true
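    For example, to apply the label manually, replace <node-name> with the name of the GPU node (the label shown is the Tesla T4 example above):
    oc label node <node-name> feature.node.kubernetes.io/pci-0302_10de.present=true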
  6. Check the GPU devices on the Kubernetes nodes:
    oc describe nodes | grep nvidia.com
    nvidia.com/gpu.present=true
    nvidia.com/gpu:     1
    nvidia.com/gpu:     1
    nvidia.com/gpu     0            0
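    As a final check, you can run nvidia-smi inside the driver pod to confirm that the driver sees the GPU (replace the pod name with the nvidia-driver-daemonset pod from your cluster):
    oc exec -it nvidia-driver-daemonset-fm8wm -- nvidia-smi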