Flashes (Alerts)
Abstract
When running a TCP RR test with the PCIe3 2-PORT 25/10 Gb NIC & RoCE SFP28 Adapter, the Mellanox driver (mlx5) may log completion errors in dmesg, and these errors can lead to a system crash on Power9 servers.
Content
Linux Releases Affected
SUSE Linux Enterprise Server 12 Service Pack 3
IBM Systems Affected
9008-22L, 9009-22A, 9009-41A, 9009-42A, 9223-22H, 9223-42H
9040-MR9, 9225-50H
9080-M9S, 9222-80H
Affected Hardware
PCIe3 2-PORT 25/10 Gb NIC & RoCE SFP28 Adapter (FC EC2T and EC2U)
Symptoms
When running TCP traffic with the PCIe3 2-PORT 25/10 Gb NIC & RoCE SFP28 Adapter, completion errors like the following may appear in dmesg:
mlx5_core 0003:01:00.1 eth14: Error cqe on cqn 0x8d3, ci 0x144, sqn 0x19df, syndrome 0x2, vendor syndrome 0x68
00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000030: 00 00 00 00 12 00 68 02 0e 00 19 df 00 00 27 d3
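To check a saved dmesg capture for this symptom, the error line can be matched with a regular expression. The following is a minimal sketch, not an IBM or Mellanox tool: the pattern is modeled only on the sample message above, other driver versions may format the line differently, and the names ERROR_CQE and find_error_cqes are hypothetical.

```python
import re

# Pattern modeled on the sample message in this flash (an assumption:
# other mlx5 driver versions may word or order the fields differently).
ERROR_CQE = re.compile(
    r"mlx5_core (?P<pci>[0-9a-fA-F:.]+) (?P<iface>\S+): Error cqe on cqn "
    r"(?P<cqn>0x[0-9a-fA-F]+), ci (?P<ci>0x[0-9a-fA-F]+), "
    r"sqn (?P<sqn>0x[0-9a-fA-F]+), syndrome (?P<syndrome>0x[0-9a-fA-F]+), "
    r"vendor syndrome (?P<vendor_syndrome>0x[0-9a-fA-F]+)"
)


def find_error_cqes(log_text):
    """Return one dict of fields per error-CQE message found in log_text."""
    return [match.groupdict() for match in ERROR_CQE.finditer(log_text)]


if __name__ == "__main__":
    import sys

    # Usage: dmesg | python3 this_script.py
    for hit in find_error_cqes(sys.stdin.read()):
        print(hit["pci"], hit["iface"], hit["syndrome"], hit["vendor_syndrome"])
```

Piping the output of dmesg into the script prints the PCI address, interface name, syndrome, and vendor syndrome for each match. No output means that no line matching the sample format was found, which does not by itself rule out the problem.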
During recovery from these completion errors, Power9 servers may crash with the following stack trace:
[ 2965.875443] NIP [c0000000004c4c84] dql_completed+0x24/0x210
[ 2965.875477] LR [d00000000c67a7f4] mlx5e_poll_tx_cq+0x4b4/0x600 [mlx5_core]
[ 2965.875479] Call Trace:
[ 2965.875502] [c000001dffd7fbd0] [d00000000c67a614] mlx5e_poll_tx_cq+0x2d4/0x600 [mlx5_core]
[ 2965.875524] [c000001dffd7fcd0] [d00000000c67fb68] mlx5e_napi_poll+0x88/0x400 [mlx5_core]
[ 2965.875528] [c000001dffd7fd30] [c00000000069c6f8] net_rx_action+0x2d8/0x460
[ 2965.875534] [c000001dffd7fe40] [c0000000000c2ed4] __do_softirq+0x174/0x420
[ 2965.875538] [c000001dffd7ff30] [c0000000000c36a8] irq_exit+0x1e8/0x200
[ 2965.875543] [c000001dffd7ff60] [c000000000014c70] __do_irq+0x90/0x1d0
[ 2965.875547] [c000001dffd7ff90] [c000000000027944] call_do_irq+0x14/0x24
[ 2965.875551] [c000000f6d333a10] [c000000000014e50] do_IRQ+0xa0/0x120
[ 2965.875555] [c000000f6d333a60] [c000000000002694] hardware_interrupt_common+0x114/0x180
Workaround
There is currently no known workaround for this issue.
Fix Outlook
IBM is working closely with Mellanox to release a fix for this issue. The fix is expected to ship as part of the Mellanox MOFED ISO.
Document Information
Modified date:
26 September 2022
UID
ibm10725899