Resolving a GPU, PCIe adapter, or device problem

Learn how to access the operating system log files, identify the types of events that they record, and determine the service action for each potential problem.

  1. Are all of the adapters in the system missing or failed?
    Yes: Replace the system backplane.
      • If your system is an 8335-GCA or 8335-GTA, go to 8335-GCA and 8335-GTA locations to identify the physical location and the removal and replacement procedure.
      • If your system is an 8335-GTB, go to 8335-GTB locations to identify the physical location and the removal and replacement procedure.
      • If your system is an 8348-21C, go to 8348-21C locations to identify the physical location and the removal and replacement procedure.
    No: Continue with the next step.
  2. To use operating system log information to identify the correct service procedure, complete the following steps:
    1. Log in as the root user.
    2. At the command prompt, type dmesg and press Enter.
  3. Scan the operating system logs for the first occurrence of keywords, such as fail, failure, or failed. When you find a keyword that accompanies one or more of the resource names in the following table, a service action is required. Use the following table to determine the service procedure to perform for your type of problem.
    Table 1. Resource names, examples, and service procedures for different types of operating system logs.

    | Resource name | Example of a log requiring a service action | Type of problem | Service procedure |
    |---|---|---|---|
    | aacraid | PCI error detected 2 | RAID (Note: This adapter is available only for 8348-21C systems.) | Go to Resolving a RAID adapter problem. |
    | eth1, eth2, eth3 | Failed to re-initialize device | Network | Go to Resolving a network adapter problem. |
    | NVRM | aborting RmInitAdapter failed! | Graphics | Go to Resolving a graphics processing unit problem. |
    | nvidia-nvlink | IBMNPU: NPU FENCE detected, machine power cycle required | Graphics | Go to Resolving a graphics processing unit problem. |
    | nvme | Failed status: ffffffff, reset controller | NVMe Flash adapter (Note: This adapter is available only for 8335-GCA systems.) | Go to Resolving an NVMe Flash adapter problem. |
    | ata1, ata2 | SError: { RecovComm PHYRdyChg 10B8B Dispar } | Marvell storage adapter (Note: This adapter is available only for 8348-21C systems.) | Go to Resolving a storage device problem. |
    | sda, sdb, sdc | FAILED Result | Storage | |
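
The manual scan in steps 2 and 3 could be approximated with a short script. The following is a minimal sketch, not an IBM-supplied tool: the names `first_service_hit`, `read_dmesg`, `RESOURCES`, and `sample` are hypothetical, the regular expressions are one possible reading of the resource names in Table 1, and the sample lines are made up for illustration from the table examples rather than captured from a real system. Because it filters on the fail keywords only, table examples that do not contain such a keyword (for example, the aacraid line) are not caught and still need a manual check.

```python
import re
import subprocess

# Step 3 keywords: "fail" also covers "failure" and "failed".
KEYWORD = re.compile(r"fail", re.IGNORECASE)

# Resource-name patterns paired with the "Type of problem" column of Table 1.
# The patterns are an assumption about how the names appear in real logs.
RESOURCES = [
    (re.compile(r"\baacraid\b"), "RAID"),
    (re.compile(r"\beth[1-3]\b"), "Network"),
    (re.compile(r"\bNVRM\b"), "Graphics"),
    (re.compile(r"\bnvidia-nvlink\b"), "Graphics"),
    (re.compile(r"\bnvme"), "NVMe Flash adapter"),
    (re.compile(r"\bata[12]\b"), "Marvell storage adapter"),
    (re.compile(r"\bsd[abc]\b"), "Storage"),
]


def read_dmesg():
    """Return the kernel log as text (step 2; typically requires root)."""
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    )
    return result.stdout


def first_service_hit(log_text):
    """Return (line, problem_type) for the first line that contains both
    a fail keyword and a Table 1 resource name, or None if there is none."""
    for line in log_text.splitlines():
        if not KEYWORD.search(line):
            continue
        for pattern, problem in RESOURCES:
            if pattern.search(line):
                return line, problem
    return None


# Made-up sample log, for illustration only.
sample = "\n".join([
    "[    1.0] usb 1-1: new high-speed USB device",
    "[   12.3] NVRM: aborting RmInitAdapter failed!",
    "[   15.8] sd 0:0:0:0: [sda] FAILED Result",
])

print(first_service_hit(sample))
```

On a live system, `first_service_hit(read_dmesg())` would apply the same scan to the actual kernel log. The returned problem type then points to the matching row of Table 1 for the service procedure.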



Last updated: Thu, December 02, 2021