Upgrading the NVIDIA GPU driver with the NVIDIA GPU operator
After you install IBM Fusion HCI and add a GPU node to the OpenShift® cluster, the Node Feature Discovery (NFD) component and the NVIDIA GPU operator are automatically installed. The NFD component labels the hardware capabilities of the GPU node, and the NVIDIA GPU operator automates the deployment and management of NVIDIA drivers and related components required for GPU-accelerated workloads.
During Stage 2 deployment, if the system detects an NVIDIA GPU node, the Fusion operator
automatically sets the manageNvidiaGPUOperator field to true in
the ComputeInit custom resource and installs the latest supported version of the
NVIDIA GPU operator.
The manageNvidiaGPUOperator setting controls the automatic installation and
upgrade of the NVIDIA GPU operator.
| Value | Description |
|---|---|
| true (default) | The NVIDIA GPU operator is installed and upgraded automatically when GPU nodes are detected. |
| false | NVIDIA GPU operator installation and upgrades are managed manually. |
- Manual NVIDIA GPU operator management
- To manually manage the NVIDIA GPU operator lifecycle, update the manageNvidiaGPUOperator field in the ComputeInit custom resource before you perform NVIDIA GPU node upsize operations or use an NVIDIA GPU operator version other than the latest supported stable version that is installed by IBM Fusion HCI.
  - Run the following command to update the manageNvidiaGPUOperator field to false:
    oc patch computeinit computeinit \
      -n ibm-spectrum-fusion-ns \
      --type=merge \
      -p '{"spec":{"manageNvidiaGPUOperator":false}}'
  - Run the following command to verify that the update is applied successfully:
    oc get computeinit computeinit \
      -n ibm-spectrum-fusion-ns \
      -o jsonpath='{.spec.manageNvidiaGPUOperator}'
  When the manageNvidiaGPUOperator attribute is set to false, IBM Fusion HCI does not automatically install, upgrade, or reinstall the NVIDIA GPU operator. You must manually manage the NVIDIA GPU operator lifecycle, including maintaining a version other than the latest supported stable version that is installed by IBM Fusion HCI.
  To restore automatic NVIDIA GPU operator lifecycle management, update the field to true.
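Restoring automatic management is the reverse of the patch shown in this section. The following is a minimal sketch, assuming the same resource name (computeinit) and namespace (ibm-spectrum-fusion-ns) as the documented patch command. The helper names build_patch and restore_auto are illustrative only and are not part of IBM Fusion HCI.

```shell
# Build the merge patch for the desired field value (true or false).
# build_patch is a hypothetical helper, not an IBM Fusion HCI command.
build_patch() {
  case "$1" in
    true|false)
      printf '{"spec":{"manageNvidiaGPUOperator":%s}}' "$1"
      ;;
    *)
      echo "usage: build_patch true|false" >&2
      return 1
      ;;
  esac
}

# Set the field back to true so that IBM Fusion HCI manages the
# NVIDIA GPU operator again. Requires a logged-in oc session.
restore_auto() {
  oc patch computeinit computeinit \
    -n ibm-spectrum-fusion-ns \
    --type=merge \
    -p "$(build_patch true)"
}
```

After you call restore_auto, rerun the verification command from this section; it should print true.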
- Upgrading the NVIDIA GPU operator
- In IBM Fusion HCI 2.13 and later releases, the
NVIDIA GPU operator is automatically upgraded to the latest supported stable version.
- Automatic upgrade behavior
- The latest supported NVIDIA GPU operator version is installed by default.
- Upgrading to IBM Fusion HCI 2.13 or later automatically upgrades the NVIDIA GPU operator.
- Manual upgrade procedure
- Manual upgrade procedures are required only when the manageNvidiaGPUOperator parameter is set to false.
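Before planning a manual upgrade, it can help to confirm which mode the cluster is in. The following is a minimal sketch that reuses the documented verification command against the ComputeInit custom resource. The helper names upgrade_mode and current_upgrade_mode are hypothetical, and reporting any value other than true or false as "unknown" is a choice made for this example.

```shell
# Map a manageNvidiaGPUOperator value to the upgrade mode it implies.
# upgrade_mode is a hypothetical helper for illustration.
upgrade_mode() {
  case "$1" in
    true)  echo "automatic" ;;
    false) echo "manual" ;;
    *)     echo "unknown" ;;
  esac
}

# Read the current value from the cluster and report the mode.
# Requires a logged-in oc session; an empty result maps to "unknown".
current_upgrade_mode() {
  value=$(oc get computeinit computeinit \
    -n ibm-spectrum-fusion-ns \
    -o jsonpath='{.spec.manageNvidiaGPUOperator}')
  upgrade_mode "$value"
}
```

If current_upgrade_mode prints manual, the NVIDIA GPU operator is not upgraded automatically and the manual procedure applies.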
- NVIDIA DCGM exporter dashboard
- The NVIDIA DCGM exporter dashboard displays GPU-related metrics and monitoring graphs. The configuration of the NVIDIA DCGM exporter dashboard is automated.
- Viewing GPU metrics
- To monitor GPU performance, you can access GPU metrics through the OpenShift Container Platform web console. Ensure that you have access to the
OpenShift Container Platform web console and that the NVIDIA DCGM
exporter dashboard is available in the console.
- Administrator perspective:
- From the side menu of the OpenShift Container Platform web console, switch to the Administrator perspective, then navigate to the Dashboard list and select NVIDIA DCGM Exporter Dashboard.
- Developer perspective:
- If the dashboard is added to the Developer perspective, from the side menu of the OpenShift Container Platform web console, switch to the Developer perspective, then navigate to the Dashboard list and select NVIDIA DCGM Exporter Dashboard.
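To check from the command line which perspectives the dashboard is exposed in, a sketch along the following lines may help. It rests on several assumptions that this document does not confirm: that the dashboard is stored as a ConfigMap named nvidia-dcgm-exporter-dashboard in the openshift-config-managed namespace, and that the labels console.openshift.io/dashboard=true and console.openshift.io/odc-dashboard=true control the Administrator and Developer perspectives, as in NVIDIA's OpenShift documentation. The helper names are hypothetical.

```shell
# Report which console perspectives a comma-separated label list enables.
# The label keys are assumptions taken from NVIDIA's OpenShift documentation.
dashboard_perspectives() {
  labels=",$1,"
  case "$labels" in
    *,console.openshift.io/dashboard=true,*) echo "administrator" ;;
  esac
  case "$labels" in
    *,console.openshift.io/odc-dashboard=true,*) echo "developer" ;;
  esac
}

# Read the labels of the assumed dashboard ConfigMap and report the result.
# Requires a logged-in oc session; the ConfigMap name and namespace are
# assumptions and may differ on an IBM Fusion HCI cluster.
check_dashboard() {
  raw=$(oc get configmap nvidia-dcgm-exporter-dashboard \
    -n openshift-config-managed \
    --show-labels --no-headers | awk '{print $NF}')
  dashboard_perspectives "$raw"
}
```

If check_dashboard prints nothing, either the ConfigMap does not carry these labels or the assumed name and namespace do not match the cluster.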