IBM Support

PCIe3 2-PORT 25/10 Gb NIC & RoCE SFP28 Adapter (EC2T / EC2U) issue on Power 9 systems

Flashes (Alerts)


Abstract

When running a TCP RR test with the PCIe3 2-PORT 25/10 Gb NIC & RoCE SFP28 Adapter, the Mellanox driver (mlx5) may log completion errors in dmesg. These errors can lead to a system crash on Power 9 servers.

Content

Linux Releases Affected
        SUSE Linux Enterprise Server 12 Service Pack 3


IBM Systems Affected
        9008-22L, 9009-22A, 9009-41A, 9009-42A, 9223-22H, 9223-42H
        9040-MR9, 9225-50H
        9080-M9S, 9222-80H
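
As a rough aid for checking a machine against the list above, the following Python sketch compares a machine type-model string with the affected models. The function names are hypothetical, and the default path /proc/device-tree/model and the "IBM," vendor prefix are assumptions about how the model is exposed on the running system; verify both before relying on the result.

```python
# Machine type-model values copied from the "IBM Systems Affected" list.
AFFECTED_MODELS = {
    "9008-22L", "9009-22A", "9009-41A", "9009-42A", "9223-22H", "9223-42H",
    "9040-MR9", "9225-50H",
    "9080-M9S", "9222-80H",
}


def normalize_model(raw):
    """Strip NUL padding, whitespace, and an optional 'IBM,' vendor prefix."""
    model = raw.replace("\x00", "").strip()
    if model.upper().startswith("IBM,"):
        model = model[4:]
    return model.upper()


def is_affected_model(raw):
    """Return True if the given model string is in the affected list."""
    return normalize_model(raw) in AFFECTED_MODELS


def read_local_model(path="/proc/device-tree/model"):
    """Assumption: this path holds the model string on the running system."""
    with open(path, "rb") as handle:
        return handle.read().decode("ascii", errors="replace")
```

On a system where the assumed path exists, is_affected_model(read_local_model()) gives a first indication only. The Linux release and the installed adapter feature code still need to be checked separately.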
        
Affected Hardware
        PCIe3 2-PORT 25/10 Gb NIC & RoCE SFP28 Adapter (FC EC2T and EC2U)    

Symptoms
When running TCP traffic with the PCIe3 2-PORT 25/10 Gb NIC & RoCE SFP28 Adapter, completion errors like the following may appear in dmesg:


mlx5_core 0003:01:00.1 eth14: Error cqe on cqn 0x8d3, ci 0x144, sqn 0x19df, syndrome 0x2, vendor syndrome 0x68
00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000030: 00 00 00 00 12 00 68 02 0e 00 19 df 00 00 27 d3
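
To spot this symptom in a saved log, a minimal Python sketch such as the following can search for the error line. The regular expression is derived only from the single sample line above, so other driver or kernel versions may format the message differently. The name find_error_cqes is hypothetical.

```python
import re

# Pattern modeled on the sample "Error cqe" line in this document.
# Assumption: other mlx5 driver versions may word the message differently.
ERROR_CQE = re.compile(
    r"mlx5_core (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) "
    r"(?P<iface>\S+): "
    r"Error cqe on cqn (?P<cqn>0x[0-9a-f]+), ci (?P<ci>0x[0-9a-f]+), "
    r"sqn (?P<sqn>0x[0-9a-f]+), syndrome (?P<syndrome>0x[0-9a-f]+), "
    r"vendor syndrome (?P<vendor_syndrome>0x[0-9a-f]+)"
)

NUMERIC_FIELDS = ("cqn", "ci", "sqn", "syndrome", "vendor_syndrome")


def find_error_cqes(log_text):
    """Return one dict per mlx5 'Error cqe' line found in log_text."""
    events = []
    for match in ERROR_CQE.finditer(log_text):
        event = match.groupdict()
        # Convert the hexadecimal fields to integers for easier comparison.
        for key in NUMERIC_FIELDS:
            event[key] = int(event[key], 16)
        events.append(event)
    return events


SAMPLE = (
    "mlx5_core 0003:01:00.1 eth14: Error cqe on cqn 0x8d3, ci 0x144, "
    "sqn 0x19df, syndrome 0x2, vendor syndrome 0x68"
)
```

Calling find_error_cqes on the text of a captured dmesg log returns an empty list when no matching line is present. The pattern uses a search rather than a full-line match, so lines that carry a timestamp prefix are also found.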


During recovery from these completion errors, the system may crash on Power 9 servers with a stack trace like the following:

[ 2965.875443] NIP [c0000000004c4c84] dql_completed+0x24/0x210
[ 2965.875477] LR [d00000000c67a7f4] mlx5e_poll_tx_cq+0x4b4/0x600 [mlx5_core]
[ 2965.875479] Call Trace:
[ 2965.875502] [c000001dffd7fbd0] [d00000000c67a614] mlx5e_poll_tx_cq+0x2d4/0x600 [mlx5_core]
[ 2965.875524] [c000001dffd7fcd0] [d00000000c67fb68] mlx5e_napi_poll+0x88/0x400 [mlx5_core]
[ 2965.875528] [c000001dffd7fd30] [c00000000069c6f8] net_rx_action+0x2d8/0x460
[ 2965.875534] [c000001dffd7fe40] [c0000000000c2ed4] __do_softirq+0x174/0x420
[ 2965.875538] [c000001dffd7ff30] [c0000000000c36a8] irq_exit+0x1e8/0x200
[ 2965.875543] [c000001dffd7ff60] [c000000000014c70] __do_irq+0x90/0x1d0
[ 2965.875547] [c000001dffd7ff90] [c000000000027944] call_do_irq+0x14/0x24
[ 2965.875551] [c000000f6d333a10] [c000000000014e50] do_IRQ+0xa0/0x120
[ 2965.875555] [c000000f6d333a60] [c000000000002694] hardware_interrupt_common+0x114/0x180

Workaround
There is currently no known workaround for this issue.

Fix Outlook
IBM is working closely with Mellanox to release a fix for this issue. The fix is expected to ship as part of the Mellanox MOFED ISO.

 


Document Information

Modified date:
26 September 2022

UID

ibm10725899