Resolving a GPU, PCIe adapter, or device problem

Learn how to access the operating system logs, how to identify the types of events that they record, and which service action applies to each potential problem.

Procedure

  1. Are all of the adapters in the system missing or failed?
    If Then
    Yes: Replace the system backplane.
    No: Continue with the next step.
  2. To identify the correct service procedure to perform by using operating system log information, complete the following steps:
    1. Log in as the root user.
    2. To display the operating system logs, type dmesg and press Enter.
  3. Scan the operating system log entries that were recorded around the time that the problem started for the first occurrence of a keyword such as fail, failure, or failed. When a keyword accompanies one or more of the resource names in the following table, a service action is required. Use the table to determine the service procedure to perform for your type of problem.
    Table 1. Resource names, examples, and service procedures for different types of operating system logs.
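
    Resource name: eth1, eth2, eth3, enPxxxxx, where xxxxx indicates the network port.
    Example of a log requiring a service action: Failed to re-initialize device
    Type of problem: Network
    Service procedure: Go to Resolving a network adapter problem.

    Resource name: mlx5_core
    Example of a log requiring a service action:
      Link Down
      health_care: handling bad device here
    Type of problem: Network
    Service procedure: Go to Resolving a network adapter problem.

    Resource name: tg3
    Example of a log requiring a service action:
      PCI I/O error detected.
      Link is Down
    Type of problem: Network
    Service procedure: Go to Resolving a network adapter problem.

    Resource name: NVRM
    Example of a log requiring a service action: aborting RmInitAdapter failed!
    Type of problem: Graphics
    Service procedure: Go to Resolving a graphics processing unit problem.

    Resource name: nvidia-nvlink
    Example of a log requiring a service action: IBMNPU: NPU FENCE detected, machine power cycle required
    Type of problem: Graphics
    Service procedure: Go to Resolving a graphics processing unit problem.

    Resource name: nvme
    Example of a log requiring a service action: Failed status: ffffffff, reset controller
    Type of problem: NVMe Flash adapter
    Service procedure: Go to Resolving an NVMe Flash adapter problem.

    Resource name: sda, sdb, sdc
    Example of a log requiring a service action: FAILED Result
    Type of problem: Storage
    Service procedure: Go to Resolving a storage device problem.

    Resource name: EEH
    Example of a log requiring a service action: Detected error on PHB#xxx, where xxx is the PHB number.
    Type of problem: PCIe bus or adapter
    Service procedure: Resolve any device driver errors that are related to I/O and that occurred near the time of this operating system log entry.

    Resource name: EEH
    Example of a log requiring a service action: xxx has failed 6 times in the last hour and has been permanently disabled, where xxx is the PCI bus number.
    Type of problem: PCIe bus or adapter
    Service procedure: Ensure that the correct device drivers are properly installed for the device. If the problem persists, replace the adapter in the PCIe slot that is specified in the operating system log entry.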
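
The log scan in steps 2 and 3 can be partly automated. The following Python sketch is illustrative only and is not part of the documented service procedure. It looks for the keyword fail (which also covers failure and failed) on a log line that contains one of the resource names from Table 1, and it reports the matching type of problem and service procedure. The names RESOURCES, classify, first_actionable, and read_dmesg are hypothetical, and the regular expressions are assumptions about how the resource names appear in the log. Entries that require a service action but carry no fail keyword, such as Link Down, are not detected unless you extend the keyword pattern.

```python
import re
import subprocess

# Keyword from step 3; "fail" also matches "failure" and "failed".
KEYWORD = re.compile(r"fail", re.IGNORECASE)

# (pattern, type of problem, service procedure), based on Table 1.
# The patterns are assumptions about how resource names appear in the log.
RESOURCES = [
    (re.compile(r"\b(?:eth\d+|enP\w+)\b"), "Network",
     "Resolving a network adapter problem"),
    (re.compile(r"\bmlx5_core\b"), "Network",
     "Resolving a network adapter problem"),
    (re.compile(r"\btg3\b"), "Network",
     "Resolving a network adapter problem"),
    (re.compile(r"\bNVRM\b"), "Graphics",
     "Resolving a graphics processing unit problem"),
    (re.compile(r"\bnvidia-nvlink\b"), "Graphics",
     "Resolving a graphics processing unit problem"),
    (re.compile(r"\bnvme\w*"), "NVMe Flash adapter",
     "Resolving an NVMe Flash adapter problem"),
    (re.compile(r"\bsd[a-z]+\b"), "Storage",
     "Resolving a storage device problem"),
    (re.compile(r"\bEEH\b"), "PCIe bus or adapter",
     "See the EEH entries in Table 1"),
]


def classify(line):
    """Return (type of problem, service procedure) for a log line, or None."""
    if not KEYWORD.search(line):
        return None
    for pattern, problem, procedure in RESOURCES:
        if pattern.search(line):
            return problem, procedure
    return None


def first_actionable(lines):
    """Return (line, classification) for the first matching line, or None."""
    for line in lines:
        result = classify(line)
        if result is not None:
            return line, result
    return None


def read_dmesg():
    """Return the dmesg output as a list of lines (usually requires root)."""
    completed = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    )
    return completed.stdout.splitlines()
```

To use the sketch on a live system, log in as the root user and call first_actionable(read_dmesg()). Treat the result as a pointer to the relevant Table 1 entry, and confirm it against the log entries recorded around the time that the problem started before you perform a service action.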