Installing the NVIDIA GPU Operator
Install the NVIDIA GPU Operator on an air-gapped cluster.
In an air-gapped cluster, the GPU Operator requires all images to be hosted in a local image registry that is accessible to each node in the cluster. To allow the GPU Operator to work with a local registry, you must modify the values.yaml file.
Note: You will require a jump host that can access both the internet and the intranet. On this host you can download external resources and push them to the local storage used by the air-gapped cluster.
Complete the following steps to install the GPU Operator:
- Step 1: Local image registry
- Step 2: Local package repository
- Step 3: Install the NVIDIA GPU Operator
Step 1: Local image registry
Create a local image registry. This registry must be accessible to all nodes in the cluster.
- Log in to your OpenShift® cluster as an administrator:
oc login OpenShift_URL:port
- Create a namespace named gpu-operator-resources:
oc new-project gpu-operator-resources
- Set up a local image registry. You can use the default OpenShift internal registry or your own local image registry. To use the default OpenShift internal registry, do the following (a login sketch follows this list):
- Allow the OpenShift Docker registry to be accessible from outside the cluster:
oc patch configs.imageregistry.operator.openshift.io/cluster --patch '{"spec":{"defaultRoute":true}}' --type=merge
- Get the OpenShift image registry URL:
oc get route/default-route -n openshift-image-registry --template='{{ .spec.host }}'
For example, the image registry URL used in the next steps is default-route-openshift-image-registry.your.image.registry.ibm.com/wmla, where we assume the image namespace is wmla.
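Before pushing images, authenticate to the internal registry from the jump host. A minimal sketch, assuming docker is installed on the jump host and you are logged in to OpenShift as a user with permission to push to the registry (REGISTRY_HOST is an illustrative variable name):
# Look up the exposed registry route and log in with the current OpenShift token.
REGISTRY_HOST=$(oc get route/default-route -n openshift-image-registry --template='{{ .spec.host }}')
docker login -u "$(oc whoami)" -p "$(oc whoami -t)" "$REGISTRY_HOST"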
- Download and push the following images to your local registry (a mirroring sketch follows this note):
nvcr.io/nvidia/gpu-operator:1.5.1
nvcr.io/nvidia/driver:450.80.02-rhcos4.6
nvcr.io/nvidia/gpu-feature-discovery:v0.3.0
quay.io/kubernetes_incubator/node-feature-discovery:v0.6.0
nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.2-ubuntu20.04
nvcr.io/nvidia/k8s-device-plugin:v0.7.3
nvcr.io/nvidia/k8s/container-toolkit:1.4.3-ubuntu18.04
nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59
Note: The last image is referenced by digest, so you should tag and push it with an image tag, for example:
docker tag nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59 default-route-openshift-image-registry.your.image.registry.ibm.com/wmla/cuda:v1-x86_64
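The pull, tag, and push pattern is the same for each tagged image. A minimal sketch of mirroring them, assuming docker on the jump host and the example registry URL from above (LOCAL_REGISTRY is an illustrative variable name; extend the list to cover all of the tagged images, and handle the digest-referenced cuda image separately as in the note):
LOCAL_REGISTRY=default-route-openshift-image-registry.your.image.registry.ibm.com/wmla
for IMAGE in \
    nvcr.io/nvidia/gpu-operator:1.5.1 \
    nvcr.io/nvidia/driver:450.80.02-rhcos4.6 \
    nvcr.io/nvidia/gpu-feature-discovery:v0.3.0; do
  docker pull "$IMAGE"
  # Re-tag under the local registry, keeping the image name and tag.
  LOCAL_IMAGE="$LOCAL_REGISTRY/$(basename "$IMAGE")"
  docker tag "$IMAGE" "$LOCAL_IMAGE"
  docker push "$LOCAL_IMAGE"
done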
- Edit the repository and imagePullSecrets values in the values.yaml file:
operator:
  repository: <repo.example.com:port>
  image: gpu-operator
  version: 1.5.1
  imagePullSecrets: []
validator:
  image: cuda-sample
  repository: <my-repository:port>
  version: vectoradd-cuda10.2
  imagePullSecrets: []
...
- Replace all occurrences of <repo.example.com:port> in values.yaml with your local image registry URL and port, or with your registry URL and namespace.
- If your local image registry requires authentication, add an image pull secret by updating the imagePullSecrets value in values.yaml.
Note: If you are using the default OpenShift internal registry, you must first create the image pull secret:
kubectl create secret docker-registry local-registry-sec -n gpu-operator-resources --docker-username=admin --docker-password=admin --docker-server=registry.ocp4.wmlagc.org:5000/wmla
For example, set the value of imagePullSecrets to local-registry-sec:
...
operator:
  repository: default-route-openshift-image-registry.your.image.registry.ibm.com/wmla
  image: gpu-operator
  version: 1.5.1
  imagePullSecrets: []
validator:
  image: cuda-sample
  repository: <my-repository:port>
  version: vectoradd-cuda10.2
  imagePullSecrets: ["local-registry-sec"]
...
Step 2: Local package repository
Create a local package repository.
- Prepare the local package mirror, see: Local Package Repository
- After packages are mirrored to the local repository, create a ConfigMap with the repo list file in the gpu-operator-resources namespace:
kubectl create configmap repo-config -n gpu-operator-resources --from-file=<path-to-repo-list-file>
Replace <path-to-repo-list-file> with the location of the repo list file.
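A minimal sketch of a repo list file and the corresponding ConfigMap command, assuming a yum-style package mirror reachable from inside the air gap (the repo ID, name, and baseurl are illustrative placeholders):
# Write an illustrative repo list file pointing at the local mirror.
cat > local.repo <<'EOF'
[local-mirror]
name=Local package mirror
baseurl=http://repo.example.com/mirror
enabled=1
gpgcheck=0
EOF
kubectl create configmap repo-config -n gpu-operator-resources --from-file=local.repo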
Step 3: Install the NVIDIA GPU Operator
- Obtain the Helm command line tool using one of the following options (an extraction sketch follows this list):
- Use wget to download the Helm binary archive:
wget https://get.helm.sh/helm-v3.5.1-linux-amd64.tar.gz
- Use curl to download and run the Helm installer script:
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
  && chmod 700 get_helm.sh \
  && ./get_helm.sh
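If you downloaded the archive with wget, you still need to extract the binary and put it on your PATH. A minimal sketch, assuming a Linux x86_64 jump host (the install path is illustrative):
tar -zxvf helm-v3.5.1-linux-amd64.tar.gz
# The archive unpacks to linux-amd64/helm.
sudo mv linux-amd64/helm /usr/local/bin/helm
helm version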
- Add the NVIDIA Helm repository:
helm repo add nvidia https://nvidia.github.io/gpu-operator \
  && helm repo update
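These repository commands reach the internet, so run them on the jump host. If you prefer to carry a local chart archive into the air gap instead, a minimal sketch, assuming Helm 3 (the resulting .tgz file can be passed to helm install in place of nvidia/gpu-operator):
helm pull nvidia/gpu-operator --version 1.6.2
# Produces gpu-operator-1.6.2.tgz in the current directory.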
- Install the GPU Operator:
helm install --generate-name \
  nvidia/gpu-operator --version="1.6.2" \
  --set operator.defaultRuntime=crio -f values.yaml
- Patch the daemonset and deployment:
export worker_ds=`kubectl get ds|grep node-feature-discovery-worker|awk '{print $1}'`
kubectl patch ds $worker_ds -p '{"spec":{"template":{"spec":{"serviceAccount":"nvidia-gpu-feature-discovery","serviceAccountName":"nvidia-gpu-feature-discovery"}}}}'
During the deployment, gpu- and nvidia- pods are created. Run the oc get po command to see a list of new pods:
NAME                                                              READY   STATUS     RESTARTS   AGE
gpu-feature-discovery-88h7p                                       1/1     Running    0          3m59s
gpu-operator-1612340379-node-feature-discovery-master-868frv7xm   1/1     Running    0          4m8s
gpu-operator-7d96948b44-fwr4l                                     1/1     Running    0          4m8s
nvidia-container-toolkit-daemonset-lzk6z                          0/1     Init:0/1   0          3m39s
nvidia-driver-daemonset-fm8wm                                     1/1     Running    0          3m59s
Note:
- To resolve the image pull error for the nvidia-container-toolkit-daemonset pod:
kubectl patch ds nvidia-container-toolkit-daemonset -p '{"spec":{"template":{"spec":{"initContainers":[{"name": "driver-validation","image":"your.image.registry:5000/user/cuda:v1-x86_64"}]}}}}'
Replace your.image.registry:5000/user with your image registry.
- To resolve the image pull error for the nvidia-dcgm-exporter pod:
kubectl patch ds nvidia-dcgm-exporter -p '{"spec":{"template":{"spec":{"initContainers":[{"name": "init-pod-nvidia-metrics-exporter","image":"your.image.registry:5000/user/cuda:v1-x86_64"}]}}}}'
Replace your.image.registry:5000/user with your image registry.
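To verify that a patch took effect, you can read back the init container image; a minimal check, assuming the daemonset names from this note:
kubectl get ds nvidia-container-toolkit-daemonset -o jsonpath='{.spec.template.spec.initContainers[0].image}'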
- After several minutes, check the status of your pods by running the oc get pod command. All pods should be in the Running state (the nvidia-device-plugin-validation pod shows Completed):
NAME                                                              READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-88h7p                                       1/1     Running     0          20m
gpu-operator-1612340379-node-feature-discovery-master-868frv7xm   1/1     Running     0          20m
gpu-operator-1612340379-node-feature-discovery-worker-4ptrj       1/1     Running     0          4m1s
gpu-operator-1612340379-node-feature-discovery-worker-gf78x       1/1     Running     0          4m1s
gpu-operator-1612340379-node-feature-discovery-worker-lkzj8       1/1     Running     0          4m1s
gpu-operator-1612340379-node-feature-discovery-worker-mmz8r       1/1     Running     0          4m1s
gpu-operator-1612340379-node-feature-discovery-worker-nrsnj       1/1     Running     0          4m1s
gpu-operator-1612340379-node-feature-discovery-worker-phb5m       1/1     Running     0          4m1s
gpu-operator-7d96948b44-fwr4l                                     1/1     Running     0          20m
nvidia-container-toolkit-daemonset-lzk6z                          1/1     Running     0          20m
nvidia-dcgm-exporter-czz4x                                        1/1     Running     0          8m57s
nvidia-device-plugin-daemonset-nr7xz                              1/1     Running     0          15m
nvidia-device-plugin-validation                                   0/1     Completed   0          15m
nvidia-driver-daemonset-fm8wm                                     1/1     Running     0          20m
Note: If the nvidia-driver-daemonset-* pod is not created, that can indicate that no GPU nodes were detected by the gpu-operator-* pods. You might need to manually label the nodes as GPU nodes. For example, for a Tesla T4 GPU, the following label needs to be set: feature.node.kubernetes.io/pci-0302_10de.present=true
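A minimal sketch of applying that label, assuming <gpu-node-name> is the name of the worker node with the GPU:
oc label node <gpu-node-name> feature.node.kubernetes.io/pci-0302_10de.present=true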
- Check the GPU devices on the Kubernetes nodes:
oc describe nodes|grep nvidia.com
nvidia.com/gpu.present=true
nvidia.com/gpu:     1
nvidia.com/gpu:     1
nvidia.com/gpu      0           0
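As a final smoke test, you can run the mirrored cuda-sample image on a GPU. A minimal sketch, assuming the example registry URL and wmla namespace from Step 1 (the pod name is illustrative):
cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: default-route-openshift-image-registry.your.image.registry.ibm.com/wmla/cuda-sample:vectoradd-cuda10.2
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
# After the pod completes, expect "Test PASSED" in its log output.
oc logs cuda-vectoradd-test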