Virtualization issues
This topic covers issues related to virtualization and their resolution.
GPU node power cycle issues
- Problem statement
- After a planned or unplanned power cycle of a GPU node, any virtual machines (VMs) that use a vGPU enter a CrashLoopBackOff state.
- Resolution
- Restart the vgpu-device-config pod in the nvidia-gpu-operator namespace.
- Verify the available vGPU profiles by running the following command:
    oc get node <gpu_node_name> -o json | jq '.status.allocatable | with_entries(select(.key | startswith("nvidia.com/"))) | with_entries(select(.value != "0"))'
  Example output:
    { "nvidia.com/NVIDIA_RTX_Pro_6000_Blackwell_DC-48Q": "4" }
- Start the VM.
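The resolution steps can be sketched as a small set of shell helpers. This is a minimal sketch, not a supported procedure. It assumes that oc, jq, and virtctl are installed and that you are logged in to the cluster. The function names are hypothetical. It also assumes that the pod name contains the string vgpu-device-config and that the pod is managed by a controller, so that deleting the pod causes it to be recreated.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the recovery steps; nothing runs until a function is called.

NAMESPACE="nvidia-gpu-operator"

# Keep only the pod names on stdin that contain "vgpu-device-config".
# Assumption: the pod name contains this string. "|| true" keeps a
# no-match result from being treated as an error.
filter_vgpu_config_pods() {
  grep -F 'vgpu-device-config' || true
}

# Step 1: delete the vgpu-device-config pod on the given node.
# Assumption: a controller recreates the pod after deletion.
restart_vgpu_device_config() {
  local node="$1" pods
  pods="$(oc get pods -n "$NAMESPACE" \
    --field-selector "spec.nodeName=${node}" -o name | filter_vgpu_config_pods)"
  if [ -z "$pods" ]; then
    echo "no vgpu-device-config pod found on ${node}" >&2
    return 1
  fi
  # $pods is intentionally unquoted: one argument per pod name.
  oc delete -n "$NAMESPACE" $pods
}

# Step 2: print the nvidia.com/ allocatable resources with a non-zero count.
list_vgpu_profiles() {
  local node="$1"
  oc get node "$node" -o json | jq '.status.allocatable
    | with_entries(select(.key | startswith("nvidia.com/")))
    | with_entries(select(.value != "0"))'
}

# Step 3: start the VM once the expected vGPU profile is listed.
start_vm() {
  local vm="$1" vm_namespace="$2"
  virtctl start "$vm" -n "$vm_namespace"
}
```

To use the sketch, source the file and call restart_vgpu_device_config <gpu_node_name>, then list_vgpu_profiles <gpu_node_name>, then start_vm <vm_name> <vm_namespace>. The sketch does not wait for the recreated pod to become ready, so repeat the profile check until the expected profile appears before you start the VM.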