Frequently asked questions about graphics processing units

Graphics processing units (GPUs) provide the high-throughput computation resources that Maximo® Visual Inspection requires. After you ensure that at least one worker node in the Red Hat® OpenShift® cluster has GPU devices, you can configure your cluster to use GPUs by installing the NVIDIA GPU operator.

Does Maximo Visual Inspection require GPUs to operate?

Yes. Maximo Visual Inspection requires GPUs to train all model types. However, in some configurations, Maximo Visual Inspection can perform model inferencing without using GPUs. In some cases, models might use only the central processing unit (CPU) or might be optimized for edge devices. For more information about model-specific requirements, capabilities, and optimizations, see the Models and supported functions topic.

Why do I need a GPU for this application?

Maximo Visual Inspection uses deep learning to create AI models for computer vision tasks. GPUs provide highly parallel computation resources, are suited to performing deep learning tasks, and greatly reduce the training and inference time for the resulting models.

Must I allocate a fixed number of GPUs?

No. Maximo Visual Inspection relies on Kubernetes, Red Hat OpenShift, and the NVIDIA GPU operator to allocate GPUs as requested Kubernetes resources when it needs to use them. Maximo Visual Inspection and IBM Maximo Application Suite share GPU resources with other applications, in the same way that applications share other Kubernetes resources, such as memory or CPUs.
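
For example, when a component needs a GPU, its pod specification requests the nvidia.com/gpu resource in the same way that it requests CPU or memory. You do not create these pods yourself; the following minimal sketch only illustrates the mechanism, and the pod name and container image are placeholders:

cat <<EOF | oc apply -f -
# Hypothetical pod that requests one GPU as a Kubernetes resource.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-request-example
spec:
  restartPolicy: Never
  containers:
  - name: cuda-smoke-test
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubi8   # placeholder image; any image that provides nvidia-smi works
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1   # the resource that the NVIDIA GPU operator advertises
EOF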

What happens if no GPUs are available?

During training, Maximo Visual Inspection queues training jobs and periodically checks whether any GPUs are available. When a GPU becomes available, Maximo Visual Inspection assigns it to the job that has been queued the longest.

For some models, when you deploy the model for inferencing, you might choose to deploy the model in CPU mode or to run the model on an edge device by using Maximo Visual Inspection Edge or IBM Maximo Visual Inspection Mobile. In these configurations, no GPU is consumed during inferencing.

Which GPUs are supported?

IBM collaborates with NVIDIA to certify the software. In the current release, IBM certifies NVIDIA Ampere, Turing, Pascal, and Volta devices, such as the NVIDIA T4, P40, P100, V100, A10, and A100. At least 16 GB of GPU memory is required during model training.

Do I need to configure GPU devices in every node in my Red Hat OpenShift cluster?

No. Red Hat OpenShift clusters might have different configurations for different sets of machines. For Maximo Visual Inspection to function, at least one worker node in the cluster must have valid GPU devices. For fault tolerance, two worker nodes that have GPU devices are required. Other general-purpose workloads in the cluster, such as running web services or databases, do not require worker nodes that have GPU devices.

How do I assign Maximo Visual Inspection to my GPU nodes in Red Hat OpenShift?

You do not need to manage this allocation manually. Maximo Visual Inspection is a certified Red Hat OpenShift application, which means that it requests resources from Red Hat OpenShift when it needs them. Maximo Visual Inspection works with the Red Hat OpenShift scheduler to place its internal components on the most suitable nodes in the cluster and to move and balance work within the cluster over time.

Can the GPU nodes in my cluster run non-GPU workloads?

Yes. Red Hat OpenShift automatically schedules workloads within the cluster to the most suitable nodes and balances allocations as needed over time.

How do I configure my Red Hat OpenShift cluster for GPUs?

NVIDIA provides a getting started guide to help you configure Red Hat OpenShift by using the NVIDIA GPU operator. If you installed Red Hat OpenShift and have GPU devices in your cluster’s worker nodes, you can install the operator and its prerequisite components without reinstalling Maximo Visual Inspection or the IBM Maximo Application Suite.
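
If you prefer the command line to the OperatorHub console, you can install the operator with an Operator Lifecycle Manager (OLM) subscription that is similar to the following sketch. The namespace and channel that are shown here are assumptions; confirm the current values in the NVIDIA getting started guide:

cat <<EOF | oc apply -f -
# Hypothetical OLM objects for the NVIDIA GPU operator. Verify the
# namespace, channel, and catalog source against the NVIDIA guide.
apiVersion: v1
kind: Namespace
metadata:
  name: nvidia-gpu-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: nvidia-gpu-operator-group
  namespace: nvidia-gpu-operator
spec:
  targetNamespaces:
  - nvidia-gpu-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: nvidia-gpu-operator
spec:
  channel: stable                       # assumed channel name
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
EOF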

I installed the NVIDIA GPU operator, but nothing happened.

Verify that the Red Hat OpenShift node feature discovery (NFD) operator is functioning correctly. The Red Hat OpenShift NFD operator allows the NVIDIA GPU operator to discover the GPUs in your worker nodes. Ensure that the NFD operator version that you install matches your Red Hat OpenShift version. For example, deploy NFD operator version 4.6 on Red Hat OpenShift version 4.6.
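
You can confirm that the NFD operator pods are running from the command line. The namespace in this sketch is an assumption; the Red Hat NFD operator is commonly installed in openshift-nfd:

# The operator pod and one nfd-worker pod per node should be in the Running state.
oc get pods -n openshift-nfd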

Multiple node feature discovery operators are available. Which one do I choose?

Choose the official NFD operator that is provided by Red Hat OpenShift. Do not choose the community edition of the NFD operator.

How do I verify that the node feature discovery operator is functioning correctly?

In the Red Hat OpenShift administration user interface, examine a worker node that you know contains a GPU. Verify that the following label is present: feature.node.kubernetes.io/pci-10de.present=true. 0x10de is the PCI vendor ID that is assigned to NVIDIA.

You can also search for the label from the Red Hat OpenShift command-line interface by running the following command:

oc get node --selector=feature.node.kubernetes.io/pci-10de.present=true

If the NFD operator is functioning correctly, this command returns results that are similar to the following output:

NAME                                         STATUS   ROLES    AGE  VERSION
worker0.example.maximovisualinspection.com   Ready    worker   8d   v1.17.1+40d7dbd
worker1.example.maximovisualinspection.com   Ready    worker   8d   v1.17.1+40d7dbd
worker2.example.maximovisualinspection.com   Ready    worker   8d   v1.17.1+40d7dbd

These results indicate that three worker nodes in this cluster, worker0, worker1, and worker2, have at least one NVIDIA PCI GPU device.
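
To review the NFD labels on a specific node, you can also describe that node and filter for the NVIDIA vendor ID. The node name here is taken from the example output:

oc describe node worker0.example.maximovisualinspection.com | grep pci-10de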

I installed the NVIDIA GPU operator in Red Hat OpenShift and tried to deploy a ClusterPolicy, but the deployment failed.

When you install the NVIDIA GPU operator in Red Hat OpenShift, a custom resource definition for a ClusterPolicy is created. The ClusterPolicy controls the versions of the internal components of the GPU operator. However, if you create a ClusterPolicy that contains an empty specification, such as spec: {}, the ClusterPolicy fails to deploy.

The operator includes a template for ClusterPolicy deployment that contains the correct values of the operator component versions. Use this template to deploy the ClusterPolicy by opening the Details page for the operator in the Red Hat OpenShift administration dashboard and clicking Provided APIs > ClusterPolicy > Create instance.
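
If you prefer the command line, the same template is published in the operator's ClusterServiceVersion as the alm-examples annotation. The following sketch assumes that the operator is installed in the nvidia-gpu-operator namespace and that the ClusterPolicy is the first entry in the annotation; adjust both to match your installation:

# Extract the provided ClusterPolicy template and create it.
oc get csv -n nvidia-gpu-operator \
  -o jsonpath='{.items[0].metadata.annotations.alm-examples}' \
  | python3 -c 'import json,sys; print(json.dumps(json.load(sys.stdin)[0]))' \
  | oc apply -f -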

The GPU operator seems to crash during the installation of driver containers. No GPUs are available to Maximo Visual Inspection.

Examine the NVIDIA GPU operator driver daemon set output for errors that warn about missing or non-matching packages, such as the following error:

Installing Linux kernel headers...
+ echo 'Installing Linux kernel headers...'
+ dnf -q -y --releasever=8.2 install kernel-headers-4.18.0-193.24.1.el8_2.dt1.x86_64 kernel-devel-4.18.0-193.24.1.el8_2.dt1.x86_64
Error: Unable to find a match: kernel-headers-4.18.0-193.24.1.el8_2.dt1.x86_64 kernel-devel-4.18.0-193.24.1.el8_2.dt1.x86_64

This error indicates that your cluster is not able to create entitled containers, which are required by the NVIDIA GPU operator. During installation, the NVIDIA GPU operator installs and configures the GPU device driver within a container. This step fails if the container cannot access a valid Red Hat package repository or Red Hat Satellite server.

Creating entitled containers requires that you assign a machine configuration that contains a valid Red Hat entitlement certificate to your worker nodes. This step is necessary because Red Hat OpenShift CoreOS nodes are not yet automatically entitled. Your rhsm.conf configuration file and entitlement certificate must match, and the remote repository must recognize your entitlement certificates. When it applies the machine configuration, Red Hat OpenShift restarts the nodes in your cluster, which might take several minutes. For more information, see the Red Hat OpenShift documentation on using entitled image builds to build DriverContainers with UBI on Red Hat OpenShift.
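
Before you reinstall the driver containers, you can check whether entitled builds work by running a temporary Universal Base Image (UBI) pod and searching for a kernel package. This is a sketch, not part of Maximo Visual Inspection; the pod name is arbitrary, and the test assumes that your cluster can pull the UBI image:

# A kernel-devel match in the pod logs indicates that entitled repositories are reachable.
oc run entitlement-check --restart=Never \
  --image=registry.access.redhat.com/ubi8/ubi \
  --command -- dnf search kernel-devel
# When the pod completes, review the output and then clean up.
oc logs entitlement-check
oc delete pod entitlement-check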

After I update my GPU operator, I cannot see GPUs in one of my nodes, and the operator’s driver daemon set warns about missing packages.

Verify that the entitlement certificates for your Red Hat OpenShift cluster are still valid. The GPU operator builds and loads the binaries for the driver container dynamically each time the container starts. If the node that hosts the container cannot provide valid entitlement certificates, the installation of the driver container fails.
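
To check the certificates from the command line, you can inspect them directly on a GPU worker node. This is a minimal sketch; the node name is taken from the examples in this topic, and the certificate path and file name depend on the machine configuration that you applied for entitled builds:

# List the entitlement files that the machine configuration placed on the node.
oc debug node/worker0.example.maximovisualinspection.com -- chroot /host ls /etc/pki/entitlement
# Check the expiry date of the certificate file (not the -key file) that the listing shows.
oc debug node/worker0.example.maximovisualinspection.com -- chroot /host \
  openssl x509 -noout -enddate -in /etc/pki/entitlement/entitlement.pem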

I installed the node feature discovery operator, a valid cluster entitlement, and the GPU operator, and I deployed a cluster policy successfully. How do I verify that the GPU operator is functioning correctly?

The NVIDIA GPU operator includes a self-test function. If those self-tests pass, the operator notifies Red Hat OpenShift that GPU resources are available for scheduling. To see whether GPUs are available in your cluster, run the following command from the Red Hat OpenShift command-line interface:

oc describe nodes | egrep '^Name|^Capacity:|^Allocatable:|nvidia.com/gpu:'

Results that are similar to the following output are returned:

Name:               master0.example.maximovisualinspection.com
Capacity:
Allocatable:
Name:               master1.example.maximovisualinspection.com
Capacity:
Allocatable:
Name:               master2.example.maximovisualinspection.com
Capacity:
Allocatable:
Name:               worker0.example.maximovisualinspection.com
Capacity:
  nvidia.com/gpu:     4
Allocatable:
  nvidia.com/gpu:     4
Name:               worker1.example.maximovisualinspection.com
Capacity:
  nvidia.com/gpu:     4
Allocatable:
  nvidia.com/gpu:     4
Name:               worker2.example.maximovisualinspection.com
Capacity:
Allocatable:

In this example, the Red Hat OpenShift cluster has three master nodes, two GPU-enabled worker nodes, and one general-purpose worker node. The worker0 and worker1 nodes contain four GPUs each, for a total of eight GPUs in the cluster, and can be used for deep learning tasks. The worker2 node has no GPUs and can be used for other types of work in the cluster. Because the number of allocatable GPUs matches the total GPU capacity of eight, the GPU operator is functioning correctly.
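
You can also check how many of those GPUs are currently requested by running workloads. The node name in the following sketch is taken from the example output; the Allocated resources section of the node description lists the nvidia.com/gpu requests and limits:

oc describe node worker0.example.maximovisualinspection.com | grep -A 15 'Allocated resources' | grep nvidia.com/gpu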