Resolving a GPU, PCIe adapter, or device problem

Learn how to access the operating system logs, identify the types of events that require a service action, and find the service procedure for each type of problem.

  1. To use operating system log information to identify the correct service procedure, complete the following steps:
    1. Log in as the root user.
    2. At the command prompt, type dmesg and press Enter.
  2. Scan the operating system logs for the first occurrence of keywords, such as fail, failure, or failed. When you find a keyword that accompanies one or more of the resource names in Table 1, a service action is required.

    Did you find an operating system log that requires a service action?

    If Then
    Yes: Use Table 1 to determine the service procedure to perform for your type of problem. This ends the procedure.
    No: Continue with the next step.
    Table 1. Resource names, examples, and service procedures for different types of operating system logs.

    Resource name    | Example of a log requiring a service action | Type of problem         | Service procedure
    -----------------|---------------------------------------------|-------------------------|---------------------------------------------------
    mpt3sas          | PCI error detected 2                        | RAID                    | Go to Resolving a RAID adapter problem.
    eth1, eth2, eth3 | Failed to re-initialize device              | Network                 | Go to Resolving a network adapter problem.
    NVRM             | aborting RmInitAdapter failed!              | Graphics                | Go to Resolving a graphics processing unit problem.
    nvme             | Failed status: ffffffff, reset controller   | NVMe Flash adapter      | Go to Resolving an NVMe Flash adapter problem.
    ata1, ata2       | SError: { RecovComm PHYRdyChg 10B8B Dispar } | Marvell storage adapter | Go to Resolving a storage device problem.
    sda, sdb, sdc    | FAILED Result                               | Storage                 |
  3. Are all of the adapters in the system missing or failed?
    If Then
    Yes: Perform the following actions, one at a time until the problem is resolved:
      1. Ensure that the PCIe risers are fully seated in the system.
      2. Replace system processor CPU 1.
      3. Replace the system backplane.
      Note:
      • If your system is an 8001-12C or 8005-12N, go to 8001-12C or 8005-12N locations to identify the physical location and the removal and replacement procedure.
      • If your system is an 8001-22C or 8005-22N, go to 8001-22C or 8005-22N locations to identify the physical location and the removal and replacement procedure.
    No: Go to Collecting diagnostic data. Then, go to Contacting IBM service and support.
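
The log scan in steps 1 and 2 can also be done with a small script. The following Python sketch is not part of the documented procedure. It assumes that a log line needs attention when it contains the text "fail" (which covers fail, failure, and failed) together with one of the resource names from Table 1. The names TABLE_1, find_first_service_event, and read_dmesg are hypothetical and chosen only for illustration. Because step 2 lists its keywords with "such as", the keyword pattern may need to be extended for your system.

```python
import re
import subprocess

# Resource names and problem types transcribed from Table 1.
TABLE_1 = {
    "mpt3sas": "RAID",
    "eth1": "Network",
    "eth2": "Network",
    "eth3": "Network",
    "NVRM": "Graphics",
    "nvme": "NVMe Flash adapter",
    "ata1": "Marvell storage adapter",
    "ata2": "Marvell storage adapter",
    "sda": "Storage",
    "sdb": "Storage",
    "sdc": "Storage",
}

# Assumption: "fail" (case-insensitive) stands in for the keywords fail,
# failure, and failed named in step 2. Extend the pattern if needed.
KEYWORD = re.compile(r"fail", re.IGNORECASE)


def find_first_service_event(log_lines):
    """Return (line, resource name, type of problem) for the first log line
    that contains a keyword and a Table 1 resource name, or None."""
    for line in log_lines:
        if not KEYWORD.search(line):
            continue
        for resource, problem in TABLE_1.items():
            # Match the resource name as a whole word so that, for
            # example, "eth1" does not match inside "eth10".
            if re.search(r"\b" + re.escape(resource) + r"\b", line):
                return line, resource, problem
    return None


def read_dmesg():
    """Run dmesg (as the root user, per step 1) and return its output lines."""
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()
```

To use the sketch, run find_first_service_event(read_dmesg()) as the root user. If it returns a result, look up the returned type of problem in Table 1 to find the service procedure. If it returns None, continue with step 3.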



Last updated: Thu, December 02, 2021