Resolving a GPU, PCIe adapter, or device problem
Learn how to access log files, identify types of events, and find the potential problems and their service actions.
- To identify the correct service procedure to perform by using operating system log information, complete the following steps:
  - Log in as the root user.
  - At the command prompt, type dmesg and press Enter.
  - Scan the operating system logs for the first occurrence of keywords, such as fail, failure, or failed. When you find a keyword that accompanies one or more of the resource names in Table 1, a service action is required.
  Did you find an operating system log that requires a service action?
  - Yes: Use Table 1 to determine the service procedure to perform for your type of problem. This ends the procedure.
  - No: Continue with the next step.

Table 1. Resource names, examples, and service procedures for different types of operating system logs

| Resource name | Example of a log requiring a service action | Type of problem | Service procedure |
| --- | --- | --- | --- |
| mpt3sas | PCI error detected 2 | RAID | Go to Resolving a RAID adapter problem. |
| eth1, eth2, eth3 | Failed to re-initialize device | Network | Go to Resolving a network adapter problem. |
| NVRM | aborting RmInitAdapter failed! | Graphics | Go to Resolving a graphics processing unit problem. |
| nvme | Failed status: ffffffff, reset controller | NVMe Flash adapter | Go to Resolving an NVMe Flash adapter problem. |
| ata1, ata2 | SError: { RecovComm PHYRdyChg 10B8B Dispar } | Marvell storage adapter | Go to Resolving a storage device problem. |
| sda, sdb, sdc | FAILED Result | Storage | |

- Are all of the adapters in the system missing or failed?
  - Yes: Perform the following actions, one at a time, until the problem is resolved:
    - Ensure that the PCIe risers are fully seated in the system.
    - Replace system processor CPU 1.
    - Replace the system backplane.
    Note:
    - If your system is an 8001-12C or 8005-12N, go to 8001-12C or 8005-12N locations to identify the physical location and the removal and replacement procedure.
    - If your system is an 8001-22C or 8005-22N, go to 8001-22C or 8005-22N locations to identify the physical location and the removal and replacement procedure.
  - No: Go to Collecting diagnostic data. Then, go to Contacting IBM service and support.
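
The log-scanning step of this procedure (run dmesg, then look for a failure keyword on a line that also names a Table 1 resource) can be sketched in Python. This is an illustrative helper, not part of the documented service procedure. The function names first_service_hit and read_dmesg are hypothetical, and the regular expressions are assumptions that generalize the resource names listed in Table 1.

```python
import re
import subprocess

# Failure keywords named in the procedure: fail, failure, failed.
KEYWORD = re.compile(r"\bfail(?:ed|ure)?\b", re.IGNORECASE)

# Resource names and problem types from Table 1. The regular expressions
# are an assumption: they generalize eth1..eth3, ata1/ata2, and sda..sdc.
RESOURCES = [
    (re.compile(r"\bmpt3sas\b"), "RAID"),
    (re.compile(r"\beth\d+\b"), "Network"),
    (re.compile(r"\bNVRM\b"), "Graphics"),
    (re.compile(r"\bnvme"), "NVMe Flash adapter"),
    (re.compile(r"\bata\d+"), "Marvell storage adapter"),
    (re.compile(r"\bsd[a-z]+\b"), "Storage"),
]


def first_service_hit(log_lines):
    """Return (problem_type, line) for the first line that pairs a failure
    keyword with a Table 1 resource name, or None if no line qualifies."""
    for line in log_lines:
        if not KEYWORD.search(line):
            continue
        for pattern, problem_type in RESOURCES:
            if pattern.search(line):
                return problem_type, line.strip()
    return None


def read_dmesg():
    """Run dmesg (usually requires root) and return its output as lines."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
    return result.stdout.splitlines()
```

On the system being serviced, first_service_hit(read_dmesg()) returns the problem type to look up in Table 1. Some Table 1 examples, such as "PCI error detected", contain none of the three keywords, so a None result is a prompt to read the log manually and does not prove that no service action is needed.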
- Resolving a RAID adapter problem
  Learn about the possible problems and service actions that you can perform to resolve a RAID adapter problem.
- Resolving a network adapter problem
  Learn about the possible problems and service actions that you can perform to resolve a network adapter problem.
- Resolving a graphics processing unit problem
  Learn about the possible problems and service actions that you can perform to resolve a graphics processing unit (GPU) problem.
- Resolving an NVMe Flash adapter problem
  Learn about the possible problems and service actions that you can perform to resolve a Non-Volatile Memory Express (NVMe) Flash adapter problem.
- Resolving a storage device problem
  Learn about the possible problems and service actions that you can perform to resolve a storage device problem.
- Identifying the location of the PCIe adapter by using the slot number
  The error message provides information to help you to determine the location of the PCIe adapter.
- Identifying the location of the GPU by using the slot number
  The error message provides information to help you to determine the location of the graphics processing unit (GPU).
- Identifying the location of the NVMe Flash adapter
  Use this procedure to identify the location of a Non-Volatile Memory Express (NVMe) Flash adapter.
- Identifying the location of the storage device
  Use this procedure to identify the location of a storage device.
- User guides for GPUs and PCIe adapters
  Use this information to find the user guide for your graphics processing unit (GPU) or PCIe adapter.
Parent topic: Beginning troubleshooting and problem analysis