Upgrading the NVIDIA GPU driver with the NVIDIA GPU operator

After you install IBM Fusion HCI and add a GPU node to the OpenShift® cluster, the Node Feature Discovery (NFD) component and the NVIDIA GPU operator are installed automatically. The NFD component labels the GPU node with its hardware capabilities, and the NVIDIA GPU operator automates the deployment and management of the NVIDIA drivers and related components that GPU-accelerated workloads require.
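To see which nodes NFD has labeled, you can filter nodes by the NFD PCI vendor label. The following is a minimal sketch, not an official IBM procedure. It assumes that NFD applies the label feature.node.kubernetes.io/pci-10de.present=true (0x10de is the NVIDIA PCI vendor ID), and the function name list_nvidia_gpu_nodes is hypothetical.

```shell
# Hypothetical helper: list the nodes that NFD has labeled as having an NVIDIA PCI device.
# Assumption: NFD sets feature.node.kubernetes.io/pci-10de.present=true on such nodes
# (0x10de is the NVIDIA PCI vendor ID). Verify the label on your cluster.
list_nvidia_gpu_nodes() {
  oc get nodes -l feature.node.kubernetes.io/pci-10de.present=true
}
```

After you log in with oc login, run list_nvidia_gpu_nodes. An empty list suggests that NFD has not labeled any node yet or that your cluster uses a different label.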

During Stage 2 deployment, if the system detects an NVIDIA GPU node, the Fusion operator automatically sets the manageNvidiaGPUOperator field to true in the ComputeInit custom resource and installs the latest supported version of the NVIDIA GPU operator.

The manageNvidiaGPUOperator setting controls the automatic installation and upgrade of the NVIDIA GPU operator.

Value           Description
true (default)  The NVIDIA GPU operator is installed and upgraded automatically when GPU nodes are detected.
false           NVIDIA GPU operator installation and upgrades are managed manually.
Manual NVIDIA GPU operator management
To manage the NVIDIA GPU operator lifecycle manually, set the manageNvidiaGPUOperator field in the ComputeInit custom resource to false. Do this before you perform NVIDIA GPU node upsize operations or use an NVIDIA GPU operator version other than the latest supported stable version that IBM Fusion HCI installs.
  1. Run the following command to update the manageNvidiaGPUOperator field to false:
    oc patch computeinit computeinit \
      -n ibm-spectrum-fusion-ns \
      --type=merge \
      -p '{"spec":{"manageNvidiaGPUOperator":false}}'
  2. Run the following command to verify that the update is applied successfully:
    oc get computeinit computeinit \
      -n ibm-spectrum-fusion-ns \
      -o jsonpath='{.spec.manageNvidiaGPUOperator}'

When the manageNvidiaGPUOperator field is set to false, IBM Fusion HCI does not automatically install, upgrade, or reinstall the NVIDIA GPU operator. You must manage the NVIDIA GPU operator lifecycle manually, which includes maintaining any version other than the latest supported stable version that IBM Fusion HCI installs.

To restore automatic NVIDIA GPU operator lifecycle management, update the field to true.
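Restoring the field uses the same patch as the manual procedure, with the value reversed. The following is a minimal sketch that wraps the documented oc patch command in a small helper so that only true or false can be passed. The function names gpu_operator_patch and set_manage_nvidia_gpu_operator are hypothetical. The resource name, namespace, and field name are taken from the commands in this topic.

```shell
# Hypothetical helper: print the merge patch for the requested value (true or false).
gpu_operator_patch() {
  case "$1" in
    true|false) printf '{"spec":{"manageNvidiaGPUOperator":%s}}\n' "$1" ;;
    *) echo "usage: gpu_operator_patch true|false" >&2; return 1 ;;
  esac
}

# Hypothetical helper: apply the patch to the ComputeInit custom resource.
# Uses the same resource name and namespace as the manual procedure in this topic.
set_manage_nvidia_gpu_operator() {
  patch="$(gpu_operator_patch "$1")" || return 1
  oc patch computeinit computeinit \
    -n ibm-spectrum-fusion-ns \
    --type=merge \
    -p "$patch"
}
```

For example, set_manage_nvidia_gpu_operator true restores automatic lifecycle management. You can confirm the result with the oc get computeinit verification command from the manual procedure.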

Upgrading the NVIDIA GPU operator
In IBM Fusion HCI 2.13 and later releases, the NVIDIA GPU operator is automatically upgraded to the latest supported stable version.
Automatic upgrade behavior
  • The latest supported NVIDIA GPU operator version is installed by default.
  • Upgrading to IBM Fusion HCI 2.13 or later automatically upgrades the NVIDIA GPU operator.
Manual upgrade procedure
Manual upgrade procedures are required only when the manageNvidiaGPUOperator field is set to false.
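Before a manual upgrade, it can help to check which NVIDIA GPU operator version is currently installed. The following is a sketch under stated assumptions, not an IBM procedure: it assumes that the operator is installed through Operator Lifecycle Manager in the nvidia-gpu-operator namespace, which is the usual default on OpenShift but might differ on your cluster. The variable GPU_OPERATOR_NS and the function name show_gpu_operator_version are hypothetical.

```shell
# Assumption: the NVIDIA GPU operator runs in the nvidia-gpu-operator namespace.
# Override GPU_OPERATOR_NS if your cluster uses a different namespace.
GPU_OPERATOR_NS="${GPU_OPERATOR_NS:-nvidia-gpu-operator}"

# Hypothetical helper: show the OLM subscription and the installed
# ClusterServiceVersion (CSV), which carries the operator version.
show_gpu_operator_version() {
  oc get subscriptions.operators.coreos.com -n "$GPU_OPERATOR_NS"
  oc get csv -n "$GPU_OPERATOR_NS"
}
```

Run show_gpu_operator_version and compare the version in the CSV output with the version that you plan to install.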
NVIDIA DCGM exporter dashboard
The NVIDIA DCGM exporter dashboard displays GPU-related metrics and monitoring graphs. The dashboard is configured automatically.
Viewing GPU metrics
To monitor GPU performance, you can access GPU metrics through the OpenShift Container Platform web console. Ensure that you have access to the OpenShift Container Platform web console and that the NVIDIA DCGM exporter dashboard is available in the console.
Administrator perspective:
In the OpenShift Container Platform web console, switch to the Administrator perspective from the side menu. Then go to Observe > Dashboards and select NVIDIA DCGM Exporter Dashboard from the Dashboard list.
Developer perspective:
If the dashboard is added to the Developer perspective, switch to the Developer perspective from the side menu of the OpenShift Container Platform web console. Then go to Observe > Dashboards and select NVIDIA DCGM Exporter Dashboard from the Dashboard list.
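To check from the command line whether the dashboard is exposed to each perspective, you can inspect the labels on the dashboard ConfigMap. The following is a sketch that assumes the naming used in NVIDIA's manual setup: a ConfigMap named nvidia-dcgm-exporter-dashboard in the openshift-config-managed namespace, with the label console.openshift.io/dashboard=true for the Administrator perspective and console.openshift.io/odc-dashboard=true for the Developer perspective. The automated IBM Fusion HCI configuration might use different names, and the function name show_dcgm_dashboard_labels is hypothetical.

```shell
# Hypothetical helper: show the labels on the DCGM exporter dashboard ConfigMap.
# Assumption: the ConfigMap name and namespace follow NVIDIA's manual setup;
# the automated configuration on your cluster might differ.
show_dcgm_dashboard_labels() {
  oc get configmap nvidia-dcgm-exporter-dashboard \
    -n openshift-config-managed \
    --show-labels
}
```

If the output of show_dcgm_dashboard_labels does not include the Developer perspective label, the dashboard is likely visible only in the Administrator perspective.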

For more information, see the NVIDIA documentation.