GPU node settings
Before you can install the Jupyter Notebooks with Python 3.7 with GPU service on Cloud Pak for Data and create GPU environment definitions in which to run analytical tools in Watson Studio, you need to perform the following steps to configure GPU nodes on the Red Hat OpenShift cluster in Cloud Pak for Data.
- Install the NVIDIA driver by following the instructions in Part 1: NVIDIA Driver Installation in How to use GPUs with DevicePlugin in OpenShift. The installation instructions can be used for Red Hat OpenShift versions 4.5 and 4.6.
- Add the runtime hook file.
  - Install the `libnvidia-container` and `nvidia-container-runtime` repository, and then install the hook package:

    ```
    # curl -so /etc/yum.repos.d/nvidia-container-runtime.repo https://nvidia.github.io/nvidia-container-runtime/centos7/nvidia-container-runtime.repo
    # yum -y install nvidia-container-runtime-hook
    ```

  - Create the hook file `/etc/containers/oci/hooks.d/oci-nvidia-hook.json`.
  - Add the following content to the `oci-nvidia-hook.json` file:

    ```
    {
        "version": "1.0.0",
        "hook": {
            "path": "/usr/bin/nvidia-container-runtime-hook",
            "args": ["nvidia-container-runtime-hook", "prestart"],
            "env": [
                "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
            ]
        },
        "when": {
            "always": true,
            "commands": [".*"]
        },
        "stages": ["prestart"]
    }
    ```
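As a quick sanity check of the hook file, a small script can confirm that the JSON parses and names the hook binary. This is only a sketch: `validate_hook` is a hypothetical helper, and it is run here against a temporary copy rather than the real `/etc/containers/oci/hooks.d/oci-nvidia-hook.json`.

```shell
# validate_hook FILE: succeed only if FILE is valid JSON and references
# the nvidia-container-runtime-hook binary. Hypothetical helper for sketching.
validate_hook() {
    python3 -m json.tool "$1" > /dev/null 2>&1 || return 1
    grep -q '/usr/bin/nvidia-container-runtime-hook' "$1" || return 1
}

# Illustrative run against a temporary copy of the hook content:
hook_file="$(mktemp)"
cat > "$hook_file" <<'EOF'
{
    "version": "1.0.0",
    "hook": {
        "path": "/usr/bin/nvidia-container-runtime-hook",
        "args": ["nvidia-container-runtime-hook", "prestart"]
    },
    "when": { "always": true, "commands": [".*"] },
    "stages": ["prestart"]
}
EOF
validate_hook "$hook_file" && echo "oci-nvidia-hook.json content looks OK"
rm -f "$hook_file"
```

On a GPU node you would point `validate_hook` at the real hook file instead of a temporary copy.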
- Create a service account and Security Context Constraints (SCC). The security settings for Red Hat OpenShift and Cloud Pak for Data do not allow you to create arbitrarily powerful service accounts in all namespaces. To enable the NVIDIA GPU device plugin:
  - Switch to the `kube-system` project:

    ```
    oc project kube-system
    ```

  - Create the service account:

    ```
    oc create sa nvidia-deviceplugin
    ```

  - Download the `nvidia-deviceplugin-scc.yaml` file from OpenShift Performance-Sensitive Application Platform Artifacts.
  - Update the `users` value in the downloaded file to `system:serviceaccount:kube-system:nvidia-deviceplugin`.
  - Create the SCC configuration:

    ```
    oc create -f nvidia-deviceplugin-scc.yaml
    ```

  - Verify the SCC creation:

    ```
    # oc get scc | grep nvidia
    ```

    The output should look like this:

    ```
    nvidia-deviceplugin   true   [*]   RunAsAny   RunAsAny   RunAsAny   RunAsAny   10   false   [*]
    ```
  - Label each GPU node that you want the NVIDIA GPU device plugin to be scheduled on:

    ```
    oc label node <node_name> openshift.com/gpu-accelerator=true
    ```

    Replace `<node_name>` with the name of the GPU node.
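When a cluster has several GPU nodes, picking them out by hand is error-prone. The sketch below filters nodes by their reported `nvidia.com/gpu` capacity; `select_gpu_nodes` is a hypothetical helper, and the commented `oc` invocation assumes you are logged in to the cluster.

```shell
# select_gpu_nodes: read "node-name gpu-count" lines on stdin and print the
# names of nodes that report at least one GPU. Hypothetical helper.
select_gpu_nodes() {
    awk '$2 != "" && $2 != "0" && $2 != "<none>" { print $1 }'
}

# On a live cluster, you might feed it from `oc` and label the results:
#   oc get nodes --no-headers \
#       -o custom-columns='NAME:.metadata.name,GPU:.status.capacity.nvidia\.com/gpu' \
#     | select_gpu_nodes \
#     | xargs -r -I{} oc label node {} openshift.com/gpu-accelerator=true
```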
- Deploy the NVIDIA GPU device plugin. Note that the NVIDIA device plugin is supported and maintained by the vendor, and is not shipped or supported by Red Hat.
  - Download the `nvidia-device-plugin-daemonset.yaml` file from OpenShift Performance-Sensitive Application Platform Artifacts.
  - Create the daemon set:

    ```
    oc create -f nvidia-device-plugin-daemonset.yaml
    ```

  - Verify the plugin deployment:

    ```
    oc get pods | grep -i nvidia-device-plugin
    ```

    The output should look like this:

    ```
    NAME                                   READY   STATUS    RESTARTS   AGE
    nvidia-device-plugin-daemonset-s9ngg   1/1     Running   0          1m
    ```

  - Test the deployment by following the instructions in Deploy a pod that requires a GPU in How to use GPUs with DevicePlugin in OpenShift. The instructions can be used for Red Hat OpenShift versions 3.11 and 4.5.
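In the spirit of the referenced test instructions, a minimal pod can request one GPU through the `nvidia.com/gpu` resource that the device plugin advertises. The pod name and image below are illustrative, not taken from the referenced guide; any CUDA sample image that runs a short workload will do.

```shell
# Create a throwaway pod that requests a single GPU (pod name and image are
# illustrative; assumes `oc` is logged in to the cluster).
cat <<'EOF' | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda-vector-add
    image: nvidia/samples:vectoradd-cuda11.2.1
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# The pod should be scheduled on a labeled GPU node and run to completion:
#   oc get pod gpu-test -o wide
#   oc logs gpu-test
#   oc delete pod gpu-test
```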
- Update the SELinux permissions. Because Cloud Pak for Data doesn't use privileged pods, you must update some of the SELinux labels on the NVIDIA driver device files.
  - Check the current SELinux labels:

    ```
    ls -Z /dev/nvidia*
    ```

    The output of this command can differ on each GPU node. For example:

    ```
    # ls -Z /dev/nvidia*
    crw-rw-rw-. root root system_u:object_r:xserver_misc_device_t:s0 /dev/nvidia0
    crw-rw-rw-. root root system_u:object_r:xserver_misc_device_t:s0 /dev/nvidiactl
    crw-rw-rw-. root root system_u:object_r:xserver_misc_device_t:s0 /dev/nvidia-modeset
    crw-rw-rw-. root root system_u:object_r:xserver_misc_device_t:s0 /dev/nvidia-uvm
    crw-rw-rw-. root root system_u:object_r:xserver_misc_device_t:s0 /dev/nvidia-uvm-tools
    ```

  - If the SELinux labels for the NVIDIA files in the returned output don't contain `container_file_t`, update the labels by running the following commands:

    ```
    semanage fcontext -a -t container_file_t "/dev/nvidia(.*)"
    restorecon -v /dev/nvidia*
    ```

  - Rerun the check:

    ```
    ls -Z /dev/nvidia*
    ```

    The output should now look like this:

    ```
    # ls -Z /dev/nvidia*
    crw-rw-rw-. root root system_u:object_r:container_file_t:s0 /dev/nvidia0
    crw-rw-rw-. root root system_u:object_r:container_file_t:s0 /dev/nvidiactl
    crw-rw-rw-. root root system_u:object_r:container_file_t:s0 /dev/nvidia-modeset
    crw-rw-rw-. root root system_u:object_r:container_file_t:s0 /dev/nvidia-uvm
    crw-rw-rw-. root root system_u:object_r:container_file_t:s0 /dev/nvidia-uvm-tools
    ```
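The check and the relabel can be combined into one idempotent snippet. `needs_relabel` is a hypothetical helper that scans `ls -Z` output; the `semanage` and `restorecon` commands are the same ones shown above and only run when a label is wrong.

```shell
# needs_relabel: read `ls -Z /dev/nvidia*` output on stdin and succeed if any
# line is missing the container_file_t label. Hypothetical helper.
needs_relabel() {
    grep -q -v 'container_file_t'
}

# Only relabel when at least one device file carries the wrong label
# (run as root on each GPU node):
if ls -Z /dev/nvidia* 2>/dev/null | needs_relabel; then
    semanage fcontext -a -t container_file_t "/dev/nvidia(.*)"
    restorecon -v /dev/nvidia*
fi
```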