GPU node settings
Install the NVIDIA GPU Operator on an air-gapped cluster.
In an air-gapped cluster, the GPU Operator requires all images to be hosted in a local image registry that is accessible to each node in the cluster. To allow the GPU Operator to work with the local registry, you must modify the values.yaml file.
Note: You need a jump host that can access both the internet and the intranet. On this host, you can download external resources and push them to the local storage used by the air-gapped cluster.
Complete the following steps to install the GPU Operator:
- Step 1: Local image registry
- Step 2: Local package repository
- Step 3: Install the NVIDIA GPU Operator
To install the NVIDIA GPU Operator on a cluster connected to the internet, see: OpenShift on NVIDIA GPU Accelerated Clusters.
Step 1: Local image registry
Create a local image registry. This registry must be accessible to all nodes in the cluster.
- Log in to your OpenShift cluster as an administrator:

  oc login OpenShift_URL:port

- Create a namespace named gpu-operator-resources:

  oc new-project gpu-operator-resources

- Set up a local image registry. You can use the default OpenShift internal registry or your own local image registry. To use the default OpenShift internal registry, do the following:

  a. Allow the OpenShift image registry to be accessible from outside the cluster:

  oc patch configs.imageregistry.operator.openshift.io/cluster --patch '{"spec":{"defaultRoute":true}}' --type=merge

  b. Get the OpenShift image registry URL:

  oc registry info --public
- Download and push the following images to your local registry (a scripted sketch of this step appears at the end of Step 1):

  nvcr.io/nvidia/gpu-operator:1.5.1
  nvcr.io/nvidia/driver:450.80.02-rhcos4.6
  nvcr.io/nvidia/gpu-feature-discovery:v0.3.0
  quay.io/kubernetes_incubator/node-feature-discovery:v0.6.0
  nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.2-ubuntu20.04
  nvcr.io/nvidia/k8s-device-plugin:v0.7.3
  nvcr.io/nvidia/k8s/container-toolkit:1.4.3-ubuntu18.04
  nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
  nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59

  Note: For the last image, tag and push it with an image tag similar to the following:

  docker tag nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59 default-route-openshift-image-registry.your.image.registry.ibm.com/<zen_namespace>/cuda:v1-x86_64

  Note: zen_namespace is the namespace being used to install Cloud Pak for Data.
- Edit the repository and imagePullSecrets values in the values.yaml file.

  a. Replace all occurrences of <repo.example.com:port> in values.yaml with your local image registry URL and port, or with your registry URL and namespace.

  b. If your local image registry requires authentication, add an image pull secret by updating the imagePullSecrets value in values.yaml.

  Note: If you are using the default OpenShift internal registry, you must first create the image pull secret. For example:

  oc create secret docker-registry local-registry-sec -n gpu-operator-resources --docker-username=admin --docker-password=admin --docker-server=registry.ocp4.wmlagc.org:5000/zen_namespace

  For example, set the value of imagePullSecrets to local-registry-sec:
```
...
operator:
  repository: default-route-openshift-image-registry.your.image.registry.ibm.com/<zen_namespace>
image: gpu-operator
version: 1.5.1
imagePullSecrets: []
validator:
image: cuda-sample
repository: <my-repository:port>
version: vectoradd-cuda10.2
imagePullSecrets: ["local-registry-sec"]
...
```
**Note:** `zen_namespace` is the namespace being used to install Cloud Pak for Data.
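The image mirroring in this step can be scripted. The following is a minimal sketch, not a definitive procedure: it assumes docker is available on the jump host, that you use the default OpenShift internal registry, and that LOCAL_REGISTRY and ZEN_NAMESPACE are placeholders you replace with your own values.

```
# Sketch only: mirror the GPU Operator images to the OpenShift internal registry.
# LOCAL_REGISTRY and ZEN_NAMESPACE are placeholders; adjust them for your environment.
LOCAL_REGISTRY=$(oc registry info --public)   # exposed registry route from the previous step
ZEN_NAMESPACE="zen-namespace"                 # namespace used to install Cloud Pak for Data

# Log in to the internal registry with the current OpenShift token.
# The registry route certificate must be trusted by Docker (or the route
# added to Docker's insecure-registries list).
docker login -u "$(oc whoami)" -p "$(oc whoami -t)" "$LOCAL_REGISTRY"

IMAGES="
nvcr.io/nvidia/gpu-operator:1.5.1
nvcr.io/nvidia/driver:450.80.02-rhcos4.6
nvcr.io/nvidia/gpu-feature-discovery:v0.3.0
quay.io/kubernetes_incubator/node-feature-discovery:v0.6.0
nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.2-ubuntu20.04
nvcr.io/nvidia/k8s-device-plugin:v0.7.3
nvcr.io/nvidia/k8s/container-toolkit:1.4.3-ubuntu18.04
nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
"

for img in $IMAGES; do
  # Keep the original image name and tag, but point it at the local registry and namespace.
  local_img="$LOCAL_REGISTRY/$ZEN_NAMESPACE/$(basename "$img")"
  docker pull "$img"
  docker tag "$img" "$local_img"
  docker push "$local_img"
done

# The CUDA base image is referenced by digest, so tag and push it explicitly.
docker pull nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59
docker tag nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59 "$LOCAL_REGISTRY/$ZEN_NAMESPACE/cuda:v1-x86_64"
docker push "$LOCAL_REGISTRY/$ZEN_NAMESPACE/cuda:v1-x86_64"
```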
Step 2: Local package repository
Create a local package repository.
- Prepare the local package mirror. See: Local Package Repository.
- After packages are mirrored to the local repository, create a ConfigMap with the repo list file in the gpu-operator-resources namespace:

  oc create configmap repo-config -n gpu-operator-resources --from-file=<path-to-repo-list-file>

  Replace <path-to-repo-list-file> with the location of the repo list file. A sketch of a repo list file appears at the end of this step.
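The exact contents of the repo list file depend on how you mirrored the packages. The following is a rough sketch only: the repository IDs, paths, and the mirror.example.com hostname are placeholders, not values from this procedure, and you should substitute the repositories that your local mirror actually serves.

```
# Sketch only: write a repo list file that points at a local package mirror,
# then create the ConfigMap from it. mirror.example.com is a placeholder.
cat <<'EOF' > local-gpu.repo
[local-baseos]
name=Local mirror - BaseOS
baseurl=http://mirror.example.com/rhel8/baseos
enabled=1
gpgcheck=0

[local-appstream]
name=Local mirror - AppStream
baseurl=http://mirror.example.com/rhel8/appstream
enabled=1
gpgcheck=0

[local-cuda]
name=Local mirror - CUDA driver packages
baseurl=http://mirror.example.com/cuda
enabled=1
gpgcheck=0
EOF

# Pass the file to the ConfigMap command from this step.
oc create configmap repo-config -n gpu-operator-resources --from-file=./local-gpu.repo
```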
Step 3: Install the NVIDIA GPU Operator
- Obtain the Helm command line tool using one of the following options:

  - Use wget:

    wget https://get.helm.sh/helm-v3.5.1-linux-amd64.tar.gz

  - Use the Helm installer script:

    curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
    && chmod 700 get_helm.sh \
    && ./get_helm.sh
- Add the NVIDIA Helm repository:

  helm repo add nvidia https://nvidia.github.io/gpu-operator \
  && helm repo update

- Install the GPU Operator:

  helm install --generate-name \
    nvidia/gpu-operator --version="1.5.1" \
    --set operator.defaultRuntime=crio -f values.yaml
- Patch the daemonset and deployment:

  export worker_ds=`oc get ds | grep node-feature-discovery-worker | awk '{print $1}'`
  oc patch ds $worker_ds -p '{"spec":{"template":{"spec":{"serviceAccount":"nvidia-gpu-feature-discovery","serviceAccountName":"nvidia-gpu-feature-discovery"}}}}'

  During the deployment, gpu-* and nvidia-* pods are created. Run the oc get po command to see a list of the new pods:

  NAME                                                              READY   STATUS     RESTARTS   AGE
  gpu-feature-discovery-88h7p                                       1/1     Running    0          3m59s
  gpu-operator-1612340379-node-feature-discovery-master-868frv7xm   1/1     Running    0          4m8s
  gpu-operator-7d96948b44-fwr4l                                     1/1     Running    0          4m8s
  nvidia-container-toolkit-daemonset-lzk6z                          0/1     Init:0/1   0          3m39s
  nvidia-driver-daemonset-fm8wm                                     1/1     Running    0          3m59s

  Note:
  - To resolve the image pull error for the nvidia-container-toolkit-daemonset pod:

    oc patch ds nvidia-container-toolkit-daemonset -p '{"spec":{"template":{"spec":{"initContainers":[{"name": "driver-validation","image":"your.image.registry:5000/user/cuda:v1-x86_64"}]}}}}'

    Replace your.image.registry:5000/user with your image registry.

  - To resolve the image pull error for the nvidia-dcgm-exporter pod:

    oc patch ds nvidia-dcgm-exporter -p '{"spec":{"template":{"spec":{"initContainers":[{"name": "init-pod-nvidia-metrics-exporter","image":"your.image.registry:5000/user/cuda:v1-x86_64"}]}}}}'

    Replace your.image.registry:5000/user with your image registry.
- After several minutes, check the status of your pods by running the oc get pod command. All pods should be running successfully and in the Running state:

  NAME                                                              READY   STATUS      RESTARTS   AGE
  gpu-feature-discovery-88h7p                                       1/1     Running     0          20m
  gpu-operator-1612340379-node-feature-discovery-master-868frv7xm   1/1     Running     0          20m
  gpu-operator-1612340379-node-feature-discovery-worker-4ptrj       1/1     Running     0          4m1s
  gpu-operator-1612340379-node-feature-discovery-worker-gf78x       1/1     Running     0          4m1s
  gpu-operator-1612340379-node-feature-discovery-worker-lkzj8       1/1     Running     0          4m1s
  gpu-operator-1612340379-node-feature-discovery-worker-mmz8r       1/1     Running     0          4m1s
  gpu-operator-1612340379-node-feature-discovery-worker-nrsnj       1/1     Running     0          4m1s
  gpu-operator-1612340379-node-feature-discovery-worker-phb5m       1/1     Running     0          4m1s
  gpu-operator-7d96948b44-fwr4l                                     1/1     Running     0          20m
  nvidia-container-toolkit-daemonset-lzk6z                          1/1     Running     0          20m
  nvidia-dcgm-exporter-czz4x                                        1/1     Running     0          8m57s
  nvidia-device-plugin-daemonset-nr7xz                              1/1     Running     0          15m
  nvidia-device-plugin-validation                                   0/1     Completed   0          15m
  nvidia-driver-daemonset-fm8wm                                     1/1     Running     0          20m

  Note: If the nvidia-driver-daemonset-* pod is not present, there might be no GPU nodes detected by the gpu-operator-* pods. You might need to manually label the nodes as GPU nodes. For example, on a Tesla T4 GPU, the following label needs to be set: feature.node.kubernetes.io/pci-0302_10de.present=true (a labeling sketch appears after this list).
- Check the GPU devices on the Kubernetes nodes:

  oc describe nodes | grep nvidia.com

  The output is similar to the following:

  nvidia.com/gpu.present=true
  nvidia.com/gpu:     1
  nvidia.com/gpu:     1
  nvidia.com/gpu      0           0
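If you do need to label a GPU node manually (see the note above), the following is one possible approach, shown as a sketch only. <gpu-node-name> is a placeholder for the node that carries the GPU, and the pci-0302_10de label value for a Tesla T4 comes from the note above; the nvidia-smi check simply reuses the grep/awk pattern from earlier in this procedure to find the driver pod.

```
# Sketch only: manually label a GPU node if node-feature-discovery did not detect it.
# <gpu-node-name> is a placeholder for your GPU node.
oc label node <gpu-node-name> feature.node.kubernetes.io/pci-0302_10de.present=true

# Confirm the label and the GPU resource on the node.
oc get node <gpu-node-name> --show-labels | grep pci-0302_10de
oc describe node <gpu-node-name> | grep nvidia.com/gpu

# Optionally, run nvidia-smi inside the driver daemonset pod to verify that the driver loaded.
driver_pod=$(oc get pods | grep nvidia-driver-daemonset | awk '{print $1}')
oc exec "$driver_pod" -- nvidia-smi
```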
Parent topic: Administering Jupyter notebooks with Python 3.7 GPU