Virtualization issues

This topic describes known virtualization issues and how to resolve them.

GPU node power cycle issues

Problem statement
After a planned or unplanned power cycle of a GPU node, any virtual machines (VMs) that use a vGPU enter a CrashLoopBackOff state.
Resolution
  1. Restart the vgpu-device-config pod in the nvidia-gpu-operator namespace.
  2. Verify that the node reports vGPU profiles with a nonzero allocatable count by running the following command:
    oc get node <gpu_node_name> -o json | jq '.status.allocatable | with_entries(select(.key | startswith("nvidia.com/"))) | with_entries(select(.value != "0"))'
    Example output:
    {
      "nvidia.com/NVIDIA_RTX_Pro_6000_Blackwell_DC-48Q": "4"
    }
  3. Start the VM.
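
Step 1 does not say how to restart the pod. One common approach is to delete it so that its controller creates a replacement. The following is a minimal sketch of that approach, not a confirmed procedure: it assumes the pod is managed by a controller (for example, a DaemonSet) that re-creates it after deletion, and that the pod name contains the string vgpu-device-config. The function names are hypothetical and used only for illustration.

```shell
#!/usr/bin/env bash
# Sketch only. The namespace and pod name fragment come from step 1;
# the function names are hypothetical.
NAMESPACE="nvidia-gpu-operator"
POD_MATCH="vgpu-device-config"

# Print the names (pod/<name>) of pods in NAMESPACE whose name contains POD_MATCH.
find_vgpu_config_pods() {
  oc get pods -n "$NAMESPACE" -o name | grep -- "$POD_MATCH"
}

# Delete the matching pods. ASSUMPTION: a controller re-creates them afterwards.
restart_vgpu_config_pods() {
  local pods
  pods="$(find_vgpu_config_pods)" || {
    echo "No pod matching '$POD_MATCH' found in namespace '$NAMESPACE'" >&2
    return 1
  }
  # Word splitting is intentional so that each pod name is a separate argument.
  # shellcheck disable=SC2086
  oc delete -n "$NAMESPACE" $pods
}
```

After sourcing the script, run restart_vgpu_config_pods, wait for the replacement pod to reach the Running state, and then continue with step 2. If no matching pod is found, the function prints a message and returns a nonzero status without deleting anything.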