IBM Support

Recoverable PCIe error might not recover on Power10 Systems

Flashes (Alerts)


Abstract

During PCIe error injection testing on Power10 systems, the Linux® OS occasionally failed to recover from recoverable PCIe errors and took the affected adapter offline.

Content

Linux releases affected:
Red Hat® Enterprise Linux 8.2
Red Hat Enterprise Linux 8.4
SUSE Linux Enterprise Server 12, Service Pack 5
SUSE Linux Enterprise Server 15, Service Pack 3

IBM systems affected: 
All IBM Power10 systems 
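
To check whether a given LPAR matches the lists above, a small script along the following lines might help. This is only a sketch under stated assumptions: it assumes the distribution reports `ID` and `VERSION_ID` in `/etc/os-release` as `rhel` with `8.2` or `8.4`, and `sles` with `12.5` or `15.3`, and that the `cpu` line of `/proc/cpuinfo` contains the string `POWER10`. The function names are illustrative and are not part of any IBM tool.

```python
# Assumed (ID, VERSION_ID) pairs for the affected releases listed above.
AFFECTED_RELEASES = {
    ("rhel", "8.2"),
    ("rhel", "8.4"),
    ("sles", "12.5"),
    ("sles", "15.3"),
}


def parse_os_release(text):
    """Parse os-release style KEY=value lines into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Values may be wrapped in single or double quotes.
        fields[key] = value.strip().strip('"').strip("'")
    return fields


def is_affected_release(os_release_text):
    """Return True if the os-release text names one of the listed releases."""
    fields = parse_os_release(os_release_text)
    return (fields.get("ID"), fields.get("VERSION_ID")) in AFFECTED_RELEASES


def is_power10(cpuinfo_text):
    """Return True if any 'cpu' line mentions POWER10 (assumed format)."""
    for line in cpuinfo_text.splitlines():
        lowered = line.lower()
        if lowered.startswith("cpu") and "power10" in lowered:
            return True
    return False
```

In practice the two functions would be fed the contents of `/etc/os-release` and `/proc/cpuinfo`; an LPAR is a candidate for this issue only when both return `True`.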

Symptoms
When PCIe error recovery is attempted, errors similar to the following are seen in the kernel log and the adapter is taken offline:

[ 7668.221060] bnx2x: [bnx2x_io_slot_reset:14359(enP21p1s0f1)]IO slot reset initializing...
[ 7668.221124] bnx2x 0015:01:00.1: enabling device (0140 -> 0142)
[ 7668.225177] bnx2x: [bnx2x_io_slot_reset:14375(enP21p1s0f1)]IO slot reset --> driver unload
[ 7862.256292] INFO: task kworker/u48:1:14577 blocked for more than 120 seconds.
[ 7862.256303]       Tainted: G        W        --------- -  - 4.18.0-305.el8.ppc64le #1
[ 7862.256305] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7862.256307] kworker/u48:1   D    0 14577      2 0x00000888
[ 7862.256313] Workqueue: netns cleanup_net
[ 7862.256315] Call Trace:
[ 7862.256318] [c00000017ae736a0] [c0000000001a2d88] kthread+0x8/0x1c0 (unreliable)
[ 7862.256322] [c00000017ae73880] [c000000000018400] __switch_to+0x2e0/0x500
[ 7862.256325] [c00000017ae738e0] [c000000000ed90a8] __schedule+0x2f8/0x9c0
[ 7862.256328] [c00000017ae739b0] [c000000000ed97d8] schedule+0x68/0x130
[ 7862.256331] [c00000017ae739e0] [c000000000ed9f30] schedule_preempt_disabled+0x20/0x30
[ 7862.256334] [c00000017ae73a00] [c000000000edba18] __mutex_lock.isra.1+0x388/0x760
[ 7862.256337] [c00000017ae73aa0] [c000000000c4a1d8] rtnl_lock+0x28/0x40
[ 7862.256340] [c00000017ae73ac0] [c000000000c2da1c] default_device_exit+0x3c/0x1a0
[ 7862.256342] [c00000017ae73b70] [c000000000c17134] cleanup_net+0x404/0x720
[ 7862.256345] [c00000017ae73c60] [c0000000001981b4] process_one_work+0x304/0x5d0
[ 7862.256347] [c00000017ae73d00] [c000000000198cfc] worker_thread+0xcc/0x7a0
[ 7862.256349] [c00000017ae73db0] [c0000000001a2f30] kthread+0x1b0/0x1c0
[ 7862.256353] [c00000017ae73e20] [c00000000000b7d8] ret_from_kernel_thread+0x5c/0x64
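
The signature in the log above (a slot reset message followed by a hung task whose call trace is waiting in `rtnl_lock`) can be searched for in a saved kernel log. The following is a minimal sketch, not an IBM-provided tool: the function name `find_symptom` is hypothetical, the `IO slot reset` text is taken from the bnx2x sample above and might differ for other drivers, and a match only suggests, rather than confirms, this issue.

```python
import re

# Hung-task report line, as printed by the kernel in the sample above.
HUNG_TASK = re.compile(
    r"INFO: task (.+):(\d+) blocked for more than (\d+) seconds"
)
# Slot reset message text taken from the bnx2x sample (driver-specific).
SLOT_RESET_TEXT = "IO slot reset"


def find_symptom(log_text):
    """Return (saw_slot_reset, hung_tasks_waiting_on_rtnl_lock)."""
    saw_slot_reset = False
    current = None
    reports = []
    for line in log_text.splitlines():
        if SLOT_RESET_TEXT in line:
            saw_slot_reset = True
        match = HUNG_TASK.search(line)
        if match:
            # Start a new hung-task report; later trace lines belong to it.
            current = {
                "task": match.group(1),
                "pid": int(match.group(2)),
                "rtnl_lock": False,
            }
            reports.append(current)
        elif current is not None and "rtnl_lock+" in line:
            current["rtnl_lock"] = True
    waiting = [r for r in reports if r["rtnl_lock"]]
    return saw_slot_reset, waiting
```

The function takes the log as a string, for example the saved output of `dmesg`, and returns whether a slot reset was logged together with the hung tasks whose traces include `rtnl_lock`.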

Workaround
If a device is taken offline because of this issue, use a DLPAR operation to remove the device from the LPAR and then add it back to recover the device.

Fix Outlook
IBM is working with Red Hat and SUSE to release a fix for this issue. The fix is targeted for the next minor release of Red Hat Enterprise Linux and of SUSE Linux Enterprise Server.

Document Information

Modified date:
23 September 2021

UID

ibm16490839