Resolving a GPU, PCIe adapter, or device problem
Learn how to access log files, identify types of events, and find the potential problems and their service actions.
Procedure
- Are all of the adapters in the system missing or failed?
  Yes: Replace the system backplane. If your system is an 8335-GTC, 8335-GTG, 8335-GTH, 8335-GTW, or 8335-GTX, go to 8335-GTC, 8335-GTG, 8335-GTH, 8335-GTW, or 8335-GTX locations to identify the physical location and the removal and replacement procedure.
  No: Continue with the next step.
- To identify the correct service procedure to perform by using operating system log information, complete the following steps:
- Log in as the root user.
- To display the operating system logs, type dmesg and press Enter.
- Scan the operating system log entries from around the time that the problem started for the first occurrence of a keyword such as fail, failure, or failed. When a keyword accompanies one or more of the resource names in the following table, a service action is required.
Use the following table to determine the service procedure to perform for your type of problem.
Table 1. Resource names, examples, and service procedures for different types of operating system logs.

| Resource name | Example of a log requiring a service action | Type of problem | Service procedure |
| --- | --- | --- | --- |
| eth1, eth2, eth3, enPxxxxx, where xxxxx indicates the network port. | Failed to re-initialize device | Network | Go to Resolving a network adapter problem. |
| mlx5_core | Link Down / health_care: handling bad device here | Network | Go to Resolving a network adapter problem. |
| tg3 | PCI I/O error detected. / Link is Down | Network | Go to Resolving a network adapter problem. |
| NVRM | aborting RmInitAdapter failed! | Graphics | Go to Resolving a graphics processing unit problem. |
| nvidia-nvlink | IBMNPU: NPU FENCE detected, machine power cycle required | Graphics | Go to Resolving a graphics processing unit problem. |
| nvme | Failed status: ffffffff, reset controller | NVMe Flash adapter | Go to Resolving an NVMe Flash adapter problem. |
| sda, sdb, sdc | FAILED Result | Storage | Go to Resolving a storage device problem. |
| EEH | Detected error on PHB#xxx, where xxx is the PHB number. | PCIe bus or adapter | Resolve any device driver errors that are related to I/O and that occurred near the time of this operating system log entry. |
| EEH | xxx has failed 6 times in the last hour and has been permanently disabled, where xxx is the PCI bus number. | PCIe bus or adapter | Ensure that the correct device drivers are properly installed for the device. If the problem persists, replace the adapter in the PCIe slot that is specified in the operating system log entry. |
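The log scan in this procedure can be sketched as a small Python script. This is only an illustration, not part of the official service procedure: the function names (read_dmesg, first_service_hint) and the regular expressions are assumptions made for this example, and the patterns only approximate the resource names in Table 1. The sketch looks only for the keywords fail, failure, and failed, so entries such as Link Down that contain no keyword still need to be checked by eye.

```python
import re
import subprocess

# Keywords named in the procedure: fail, failure, failed (any letter case).
KEYWORD = re.compile(r"fail(ed|ure)?", re.IGNORECASE)

# Approximate patterns for the Table 1 resource names, mapped to the
# "Type of problem" column. These regexes are assumptions for illustration.
RESOURCES = [
    (re.compile(r"\b(eth\d+|enP\w+)\b"), "Network"),
    (re.compile(r"\bmlx5_core\b"), "Network"),
    (re.compile(r"\btg3\b"), "Network"),
    (re.compile(r"\bNVRM\b"), "Graphics"),
    (re.compile(r"\bnvidia-nvlink\b"), "Graphics"),
    (re.compile(r"\bnvme\w*"), "NVMe Flash adapter"),
    (re.compile(r"\bsd[a-z]\b"), "Storage"),
    (re.compile(r"\bEEH\b"), "PCIe bus or adapter"),
]


def read_dmesg():
    """Return the dmesg output as a list of lines (run as root)."""
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()


def first_service_hint(log_lines):
    """Return (line, problem_type) for the first line that contains both
    a keyword and a known resource name, or None if no line qualifies."""
    for line in log_lines:
        if not KEYWORD.search(line):
            continue
        for pattern, problem_type in RESOURCES:
            if pattern.search(line):
                return line, problem_type
    return None
```

On the system under service, first_service_hint(read_dmesg()) returns the first matching entry and its type of problem, which can then be looked up in Table 1 to find the service procedure. Restricting the input to entries from around the time that the problem started is left to the reader.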