Resolving a GPU, PCIe adapter, or device problem
Learn how to access log files, identify the types of events that they record, and find the potential problems and their service actions.
- Are all of the adapters in the system missing or failed?
  - Yes: Replace the system backplane.
    - If your system is an 8335-GCA or 8335-GTA, go to 8335-GCA and 8335-GTA locations to identify the physical location and the removal and replacement procedure.
    - If your system is an 8335-GTB, go to 8335-GTB locations to identify the physical location and the removal and replacement procedure.
    - If your system is an 8348-21C, go to 8348-21C locations to identify the physical location and the removal and replacement procedure.
  - No: Continue with the next step.
- To identify the correct service procedure to perform by using operating system log information, complete the following steps:
  - Log in as the root user.
  - At the command prompt, type dmesg and press Enter.
  - Scan the operating system logs for the first occurrence of keywords, such as fail, failure, or failed. When you find a keyword that accompanies one or more of the resource names in the following table, a service action is required. Use the following table to determine the service procedure to perform for your type of problem.
Table 1. Resource names, examples, and service procedures for different types of operating system logs.

| Resource name | Example of a log requiring a service action | Type of problem | Service procedure |
| --- | --- | --- | --- |
| aacraid | PCI error detected 2 | RAID | Note: This adapter is available only for 8348-21C systems. Go to Resolving a RAID adapter problem. |
| eth1, eth2, eth3 | Failed to re-initialize device | Network | Go to Resolving a network adapter problem. |
| NVRM | aborting RmInitAdapter failed! | Graphics | Go to Resolving a graphics processing unit problem. |
| nvidia-nvlink | IBMNPU: NPU FENCE detected, machine power cycle required | Graphics | Go to Resolving a graphics processing unit problem. |
| nvme | Failed status: ffffffff, reset controller | NVMe Flash adapter | Note: This adapter is available only for 8335-GCA systems. Go to Resolving an NVMe Flash adapter problem. |
| ata1, ata2 | SError: { RecovComm PHYRdyChg 10B8B Dispar } | Marvell storage adapter | Note: This adapter is available only for 8348-21C systems. Go to Resolving a storage device problem. |
| sda, sdb, sdc | FAILED Result | Storage | |
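The scan described in the steps above can be sketched in Python. This is only an illustration under stated assumptions, not a supported tool: the function name `first_service_hit`, the regular expressions that generalize the resource names in Table 1 (for example, `eth\d+` for eth1, eth2, eth3), and the sample log lines are all hypothetical. The sketch matches only the keywords fail, failure, and failed, so log entries that signal a problem with other wording still need a manual review.

```python
import re

# Resource-name patterns and problem types taken from Table 1.
# Generalizing the listed names (eth1..eth3 -> eth<N>, sda..sdc -> sd<x>, ...)
# is an assumption of this sketch.
RESOURCES = [
    (re.compile(r"\baacraid\b"), "RAID"),
    (re.compile(r"\beth\d+\b"), "Network"),
    (re.compile(r"\bNVRM\b"), "Graphics"),
    (re.compile(r"\bnvidia-nvlink\b"), "Graphics"),
    (re.compile(r"\bnvme\w*\b"), "NVMe Flash adapter"),
    (re.compile(r"\bata\d+\b"), "Marvell storage adapter"),
    (re.compile(r"\bsd[a-z]\b"), "Storage"),
]

# "fail" also matches failure, failed, and FAILED.
KEYWORD = re.compile(r"fail", re.IGNORECASE)


def first_service_hit(log_lines):
    """Return (problem_type, line) for the first log line that contains a
    failure keyword together with a known resource name, or None."""
    for line in log_lines:
        if not KEYWORD.search(line):
            continue
        for pattern, problem in RESOURCES:
            if pattern.search(line):
                return problem, line
    return None


# Hypothetical log lines, loosely modeled on the examples in Table 1.
SAMPLE = [
    "[    1.0] usb 1-1: new high-speed USB device",
    "[   12.3] eth2: Failed to re-initialize device",
    "[   15.9] NVRM: RmInitAdapter failed!",
]

print(first_service_hit(SAMPLE))
```

On a live system, the input lines could come from the dmesg output itself, for example `subprocess.run(["dmesg"], capture_output=True, text=True).stdout.splitlines()`, run as the root user. The returned problem type then points to the matching row and service procedure in Table 1.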
- Resolving a RAID adapter problem
  Learn about the possible problems and service actions that you can perform to resolve a RAID adapter problem.
- Resolving a network adapter problem
  Learn about the possible problems and service actions that you can perform to resolve a network adapter problem.
- Resolving a graphics processing unit problem
  Learn about the possible problems and service actions that you can perform to resolve a graphics processing unit (GPU) problem.
- Resolving an NVMe Flash adapter problem
  Learn about the possible problems and service actions that you can perform to resolve a Non-Volatile Memory Express (NVMe) Flash adapter problem.
- Resolving a storage device problem
  Learn about the possible problems and service actions that you can perform to resolve a storage device problem.
- Identifying the location of the PCIe adapter by using the slot number
  The error message provides information to help you to determine the location of the PCIe adapter.
- Identifying the location of the GPU
  The error message provides information to help you to determine the location of the graphics processing unit (GPU).
- Identifying the location of the NVMe Flash adapter
  Use this procedure to identify the location of a Non-Volatile Memory Express (NVMe) Flash adapter.
- Identifying the location of the storage device
  Use this procedure to identify the location of a storage device.
- User guides for GPUs and PCIe adapters
  Use this information to find the user guide for your graphics processing unit (GPU) or PCIe adapter.