GPU node settings

Before you can install the Jupyter Notebooks with Python 3.7 with GPU service on Cloud Pak for Data and create GPU environment definitions for running analytical tools in Watson Studio, you must configure the GPU nodes on the Red Hat OpenShift cluster that hosts Cloud Pak for Data by completing the following steps.

  1. Install the NVIDIA driver by following the instructions in Part 1: NVIDIA Driver Installation in How to use GPUs with DevicePlugin in OpenShift. The installation instructions can be used for Red Hat OpenShift versions 4.5 and 4.6.
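     After the driver installation completes, you can verify that the driver loaded correctly by running nvidia-smi on each GPU node; the utility ships with the driver:
      # Lists the driver version and all GPUs that the node detects
      nvidia-smi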
  2. Add the runtime hook file.

    1. Install libnvidia-container and the nvidia-container-runtime repository:
       # curl -so /etc/yum.repos.d/nvidia-container-runtime.repo https://nvidia.github.io/nvidia-container-runtime/centos7/nvidia-container-runtime.repo
      
       # yum -y install nvidia-container-runtime-hook
      
    2. Create the hook file:
       /etc/containers/oci/hooks.d/oci-nvidia-hook.json
      
    3. Add the following content to the oci-nvidia-hook.json file:
       {
           "version": "1.0.0",
           "hook": {
               "path": "/usr/bin/nvidia-container-runtime-hook",
               "args": ["nvidia-container-runtime-hook", "prestart"],
               "env": [
                   "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
               ]
           },
           "when": {
               "always": true,
               "commands": [".*"]
           },
           "stages": ["prestart"]
       }
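       As an optional check before moving on, you can confirm that the hook binary is installed and that the hook definition parses as valid JSON (the last command assumes jq is available on the node):
        # Confirm that the package installed the prestart hook binary
        rpm -q nvidia-container-runtime-hook
        ls -l /usr/bin/nvidia-container-runtime-hook
        # Confirm that the hook definition is valid JSON
        jq . /etc/containers/oci/hooks.d/oci-nvidia-hook.json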
      
  3. Create a service account and a Security Context Constraint (SCC). The security settings for Red Hat OpenShift and Cloud Pak for Data do not allow you to create arbitrarily powerful service accounts in all namespaces. To enable the NVIDIA GPU device plugin:

     1. Switch to the kube-system project:
       oc project kube-system
      
     2. Create the service account:
       oc create sa nvidia-deviceplugin
      
    3. Download the nvidia-device-plugin-scc.yaml file from OpenShift Performance-Sensitive Application Platform Artifacts.
     4. Update the users value in the downloaded file to system:serviceaccount:kube-system:nvidia-deviceplugin (an example of the expected entry is shown at the end of this step).
    5. Create the SCC configuration:
        oc create -f nvidia-device-plugin-scc.yaml
      
    6. Verify the SCC creation by running:
       # oc get scc | grep nvidia
      

      The output should be like:

        nvidia-deviceplugin true [*] RunAsAny RunAsAny RunAsAny RunAsAny 10 false [*]
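       The layout of the downloaded SCC manifest can vary, so treat the following as a rough sketch: after the edit in sub-step 4, the file (assuming it keeps the downloaded name nvidia-device-plugin-scc.yaml) should contain a users entry for the new service account, and after sub-step 5 the SCC should list that account:
        # The edited manifest should contain an entry like:
        #   users:
        #   - system:serviceaccount:kube-system:nvidia-deviceplugin
        grep -A1 "^users:" nvidia-device-plugin-scc.yaml
        # After the SCC is created, confirm that it picked up the service account
        oc describe scc nvidia-deviceplugin | grep -i users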
      
  4. Label the GPU nodes that you want to schedule the NVIDIA GPU device plugin to run on:
      oc label node <node_name> openshift.com/gpu-accelerator=true
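     For example, assuming a GPU worker node named worker-gpu-1 (the node name is illustrative), you would run:
      # Apply the label that the device plugin daemon set uses to select GPU nodes
      oc label node worker-gpu-1 openshift.com/gpu-accelerator=true
      # Confirm which nodes now carry the label
      oc get nodes -l openshift.com/gpu-accelerator=true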
    
  5. Deploy the NVIDIA GPU device plugin. Note that the NVIDIA device plugin is supported and maintained by the vendor, and is not shipped or supported by Red Hat.

     1. Download the nvidia-device-plugin-daemonset.yaml file from OpenShift Performance-Sensitive Application Platform Artifacts.
    2. Run the command:
       oc create -f nvidia-device-plugin-daemonset.yaml
      
    3. Verify the plugin deployment:
       oc get pods | grep -i nvidia-device-plugin
      

      The output should look like:

       NAME READY STATUS RESTARTS AGE
       nvidia-device-plugin-daemonset-s9ngg 1/1 Running 0 1m
      
    4. Test the deployment by following the instructions in Deploy a pod that requires a GPU in How to use GPUs with DevicePlugin in OpenShift. The instructions can be used for Red Hat OpenShift versions 3.11 and 4.5.
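       If you want a quick inline check in addition to those instructions, the following is a minimal sketch of such a test pod; the file name, pod name, and CUDA image are illustrative, and nvidia.com/gpu is the resource that the NVIDIA device plugin advertises. Save the content to a file such as gpu-smoke-test.yaml:
        apiVersion: v1
        kind: Pod
        metadata:
          name: gpu-smoke-test
        spec:
          restartPolicy: Never
          containers:
          - name: cuda
            image: nvidia/cuda:11.0-base    # any CUDA image that includes nvidia-smi
            command: ["nvidia-smi"]
            resources:
              limits:
                nvidia.com/gpu: 1           # request one GPU from the device plugin
       Then create the pod and, once it completes, check that its log shows the nvidia-smi output:
        oc create -f gpu-smoke-test.yaml
        oc logs gpu-smoke-test
        oc delete pod gpu-smoke-test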
  6. Update the SELinux permissions. As Cloud Pak for Data doesn’t use privileged pods, you have to update some of the SELinux labels on the NVIDIA driver files.

    1. Check the current SELinux labels:
       ls -Z /dev/nvidia*
      

      The output of this command can be different on each GPU node. An example could look like this:

        # ls -Z /dev/nvidia*
        crw-rw-rw-. root root system_u:object_r:xserver_misc_device_t:s0 /dev/nvidia0
        crw-rw-rw-. root root system_u:object_r:xserver_misc_device_t:s0 /dev/nvidiactl
        crw-rw-rw-. root root system_u:object_r:xserver_misc_device_t:s0 /dev/nvidia-modeset
        crw-rw-rw-. root root system_u:object_r:xserver_misc_device_t:s0 /dev/nvidia-uvm
        crw-rw-rw-. root root system_u:object_r:xserver_misc_device_t:s0 /dev/nvidia-uvm-tools
      
    2. If the SELinux labels for the NVIDIA files in the returned output don’t contain container_file_t, update the labels by running the following commands:
       semanage fcontext -a -t container_file_t "/dev/nvidia(.*)"
       restorecon -v /dev/nvidia*
      
    3. Then rerun:
       ls -Z /dev/nvidia*
      

      The output should now look like this:

       # ls -Z /dev/nvidia*
       crw-rw-rw-. root root system_u:object_r:container_file_t:s0 /dev/nvidia0
       crw-rw-rw-. root root system_u:object_r:container_file_t:s0 /dev/nvidiactl
        crw-rw-rw-. root root system_u:object_r:container_file_t:s0 /dev/nvidia-modeset
        crw-rw-rw-. root root system_u:object_r:container_file_t:s0 /dev/nvidia-uvm
        crw-rw-rw-. root root system_u:object_r:container_file_t:s0 /dev/nvidia-uvm-tools
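       Optionally, you can also confirm that the file-context rule added in sub-step 2 was recorded in the local SELinux policy:
        # The nvidia rule should appear among the local file-context customizations
        semanage fcontext -l | grep nvidia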
      

Next step