GPU node settings

Install the NVIDIA GPU Operator on an air-gapped cluster.

In an air-gapped cluster, the GPU Operator requires all images to be hosted in a local image registry that is accessible to each node in the cluster. To allow the GPU Operator to work with the local registry, you must modify the values.yaml file.

Note: You will require a jump host that can access both the internet and the intranet. On this host, you can download external resources and push them to the local storage used by the air-gapped cluster.

Note: To install the NVIDIA GPU Operator on a cluster that is connected to the internet, see: OpenShift on NVIDIA GPU Accelerated Clusters.

Complete the following steps to install the GPU Operator on an air-gapped cluster:

Step 1: Local image registry

Create a local image registry. This registry must be accessible to all nodes in the cluster.

  1. Log in to your OpenShift cluster as an administrator:

     oc login OpenShift_URL:port
    
  2. Create a namespace named gpu-operator-resources:

     oc new-project gpu-operator-resources
    
  3. Set up a local image registry. You can use the default OpenShift internal registry or your own local image registry. To use the default OpenShift internal registry, do the following:

    a. Allow the OpenShift Docker registry to be accessible from outside the cluster:

     oc patch configs.imageregistry.operator.openshift.io/cluster --patch '{"spec":{"defaultRoute":true}}' --type=merge
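
    If the route was exposed successfully, a route named default-route should appear in the openshift-image-registry namespace; you can verify this with:

     oc get route default-route -n openshift-image-registry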
    

    b. Get the OpenShift image registry URL:

     oc registry info --public
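
    Before pushing images from the jump host, you typically need to log in to this registry route. The following is a minimal sketch, assuming the docker CLI is installed on the jump host and the registry certificate is trusted (or the route is configured as an insecure registry):

     # Log in to the internal registry route with the current OpenShift user's token
     REGISTRY=$(oc registry info --public)
     docker login -u "$(oc whoami)" -p "$(oc whoami -t)" "$REGISTRY"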
    
  4. Download and push the following images to your local registry:

     nvcr.io/nvidia/gpu-operator:1.5.1
     nvcr.io/nvidia/driver:450.80.02-rhcos4.6
     nvcr.io/nvidia/gpu-feature-discovery:v0.3.0
     quay.io/kubernetes_incubator/node-feature-discovery:v0.6.0
     nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.2-ubuntu20.04
     nvcr.io/nvidia/k8s-device-plugin:v0.7.3
     nvcr.io/nvidia/k8s/container-toolkit:1.4.3-ubuntu18.04
     nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
     nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59
    

    Note: The last image is referenced by digest. Tag it and push it with an image tag similar to the following:

     docker tag nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59 default-route-openshift-image-registry.your.image.registry.ibm.com/<zen_namespace>/cuda:v1-x86_64
    

    Note: zen_namespace is the namespace being used to install Cloud Pak for Data.
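
    For the other images in the list, a minimal sketch of mirroring one image from the jump host with docker; <repo.example.com:port> is a placeholder for your local registry:

     # Pull from nvcr.io on the jump host, retag for the local registry, then push
     docker pull nvcr.io/nvidia/gpu-operator:1.5.1
     docker tag nvcr.io/nvidia/gpu-operator:1.5.1 <repo.example.com:port>/gpu-operator:1.5.1
     docker push <repo.example.com:port>/gpu-operator:1.5.1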

  5. Edit the repository and imagePullSecrets values in the values.yaml file.


    a. Replace all occurrences of <repo.example.com:port> in values.yaml with your local image registry URL and port, or with your registry URL and namespace.

    b. If your local image registry requires authentication, add an image pull secret by updating the imagePullSecrets value in values.yaml.

    Note: If you are using the default OpenShift internal registry, you must first create the image pull secret. For example:

     oc create secret docker-registry local-registry-sec  -n gpu-operator-resources --docker-username=admin --docker-password=admin --docker-server=registry.ocp4.wmlagc.org:5000/zen_namespace
    

    For example, set the value of imagePullSecrets to local-registry-sec:

```
...
operator:
  repository: default-route-openshift-image-registry.your.image.registry.ibm.com/<zen_namespace>
  image: gpu-operator
  version: 1.5.1
  imagePullSecrets: []
validator:
  image: cuda-sample
  repository: <my-repository:port>
  version: vectoradd-cuda10.2
  imagePullSecrets: ["local-registry-sec"]
...
```

**Note:** `zen_namespace` is the namespace being used to install Cloud Pak for Data. 

Step 2: Local package repository

Create a local package repository.

  1. Prepare the local package mirror. For instructions, see: Local Package Repository.
  2. After packages are mirrored to the local repository, create a ConfigMap with the repo list file in the gpu-operator-resources namespace:

     oc create configmap repo-config -n gpu-operator-resources --from-file=<path-to-repo-list-file>
    

    Replace <path-to-repo-list-file> with the location of the repo list file.
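
    For reference, a repo list file for a local mirror typically looks similar to the following; the file name and the mirror URL are placeholders for your environment:

     # local.repo (hypothetical example)
     [local-mirror]
     name=Local Package Mirror
     baseurl=http://repo.example.com/packages/
     enabled=1
     gpgcheck=0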

Step 3: Install the NVIDIA GPU Operator

  1. Obtain the Helm command line tool using one of the following options:

    • Use wget:

      wget https://get.helm.sh/helm-v3.5.1-linux-amd64.tar.gz
      
    • Use the Helm installer script:

      curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
      && chmod 700 get_helm.sh \
      && ./get_helm.sh
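
    If you downloaded the tarball with wget, a minimal sketch of installing the binary on the jump host:

      tar -zxvf helm-v3.5.1-linux-amd64.tar.gz
      mv linux-amd64/helm /usr/local/bin/helm   # might require sudo
      helm version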
      
  2. Add the NVIDIA Helm repository:

     helm repo add nvidia https://nvidia.github.io/gpu-operator \
     && helm repo update
    
  3. Install the GPU Operator:

     helm install --generate-name \
     nvidia/gpu-operator --version="1.5.1" \
     --set operator.defaultRuntime=crio -f values.yaml
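
    To confirm that the chart deployed, you can list the Helm releases and check that the release name generated by --generate-name appears:

     helm list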
    
  4. Patch the node-feature-discovery worker daemonset so that it uses the nvidia-gpu-feature-discovery service account:

     export worker_ds=`oc get ds|grep node-feature-discovery-worker|awk '{print $1}'`
     oc patch ds $worker_ds -p '{"spec":{"template":{"spec":{"serviceAccount":"nvidia-gpu-feature-discovery","serviceAccountName":"nvidia-gpu-feature-discovery"}}}}'
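
    To confirm that the patch was applied, you can check the service account name on the daemonset:

     oc get ds $worker_ds -o jsonpath='{.spec.template.spec.serviceAccountName}'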
    

    During the deployment, gpu-* and nvidia-* pods are created. Run the oc get po command to see a list of the new pods:

     NAME                                                              READY   STATUS     RESTARTS   AGE
     gpu-feature-discovery-88h7p                                       1/1     Running    0          3m59s
     gpu-operator-1612340379-node-feature-discovery-master-868frv7xm   1/1     Running    0          4m8s
     gpu-operator-7d96948b44-fwr4l                                     1/1     Running    0          4m8s
     nvidia-container-toolkit-daemonset-lzk6z                          0/1     Init:0/1   0          3m39s
     nvidia-driver-daemonset-fm8wm                                     1/1     Running    0          3m59s
    

    Note:

    • To resolve the image pull error for the nvidia-container-toolkit-daemonset pod:

      oc patch ds nvidia-container-toolkit-daemonset -p '{"spec":{"template":{"spec":{"initContainers":[{"name": "driver-validation","image":"your.image.registry:5000/user/cuda:v1-x86_64"}]}}}}'
      

      Replace your.image.registry:5000/user with your image registry.

    • To resolve the image pull error for the nvidia-dcgm-exporter pod:

      oc patch ds nvidia-dcgm-exporter -p '{"spec":{"template":{"spec":{"initContainers":[{"name": "init-pod-nvidia-metrics-exporter","image":"your.image.registry:5000/user/cuda:v1-x86_64"}]}}}}'
      

      Replace your.image.registry:5000/user with your image registry.
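
    To identify which pods are stuck on image pull errors, a quick check is:

     oc get pods -n gpu-operator-resources | grep -E 'ErrImagePull|ImagePullBackOff'
     oc describe pod <pod-name> -n gpu-operator-resources | grep -i image

    Replace <pod-name> with the name of the failing pod.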

  5. After several minutes, check the status of the pods by running the oc get pod command. All pods should be in the Running state, except the nvidia-device-plugin-validation pod, which shows Completed after it finishes:

     NAME                                                              READY   STATUS      RESTARTS   AGE
     gpu-feature-discovery-88h7p                                       1/1     Running     0          20m
     gpu-operator-1612340379-node-feature-discovery-master-868frv7xm   1/1     Running     0          20m
     gpu-operator-1612340379-node-feature-discovery-worker-4ptrj       1/1     Running     0          4m1s
     gpu-operator-1612340379-node-feature-discovery-worker-gf78x       1/1     Running     0          4m1s
     gpu-operator-1612340379-node-feature-discovery-worker-lkzj8       1/1     Running     0          4m1s
     gpu-operator-1612340379-node-feature-discovery-worker-mmz8r       1/1     Running     0          4m1s
     gpu-operator-1612340379-node-feature-discovery-worker-nrsnj       1/1     Running     0          4m1s
     gpu-operator-1612340379-node-feature-discovery-worker-phb5m       1/1     Running     0          4m1s
     gpu-operator-7d96948b44-fwr4l                                     1/1     Running     0          20m
     nvidia-container-toolkit-daemonset-lzk6z                          1/1     Running     0          20m
     nvidia-dcgm-exporter-czz4x                                        1/1     Running     0          8m57s
     nvidia-device-plugin-daemonset-nr7xz                              1/1     Running     0          15m
     nvidia-device-plugin-validation                                   0/1     Completed   0          15m
     nvidia-driver-daemonset-fm8wm                                     1/1     Running     0          20m
    

    Note: If the nvidia-driver-daemonset-* pod is not created, the gpu-operator-* pods might not have detected any GPU nodes. You might need to manually label the nodes as GPU nodes. For example, for a Tesla T4 GPU, the following label needs to be set:

     feature.node.kubernetes.io/pci-0302_10de.present=true
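
    A minimal sketch of applying that label manually, where <node-name> is a placeholder for your GPU node:

     oc label node <node-name> feature.node.kubernetes.io/pci-0302_10de.present=true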
    
  6. Check the GPU devices on the Kubernetes nodes: oc describe nodes | grep nvidia.com

     nvidia.com/gpu.present=true
     nvidia.com/gpu:     1
     nvidia.com/gpu:     1
     nvidia.com/gpu     0            0
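
    To verify that a workload can be scheduled on a GPU, you can run the mirrored cuda-sample image as a test pod. The following pod specification is a hedged sketch; the image path and pod name are placeholders based on the images that you pushed to your local registry. Save it as cuda-vectoradd-test.yaml, create it with oc create -f cuda-vectoradd-test.yaml -n gpu-operator-resources, and check the output with oc logs cuda-vectoradd-test -n gpu-operator-resources:

     apiVersion: v1
     kind: Pod
     metadata:
       name: cuda-vectoradd-test
     spec:
       restartPolicy: OnFailure
       containers:
       - name: cuda-vectoradd
         image: <repo.example.com:port>/cuda-sample:vectoradd-cuda10.2
         resources:
           limits:
             nvidia.com/gpu: 1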
    

Parent topic: Administering Jupyter notebooks with Python 3.7 GPU