IBM Support

On systems with Broadcom Emulex Fibre Channel adapters, DLPAR operations may cause CPU soft lockups that can crash the system

Flashes (Alerts)


Abstract

On Power Systems with Emulex FC adapters, DLPAR operations can cause CPU soft lockups, and the system might then crash due to out-of-memory errors. This failure can be seen in Red Hat Enterprise Linux 9.

Content

Linux Releases Affected

Red Hat Enterprise Linux 9.1

IBM Systems Affected

Power Systems with Emulex FC adapters

Symptoms

When DLPAR operations are performed on CPUs while an Emulex FC adapter is installed, the driver (lpfc) might not register the addition of new CPUs or the removal of active CPUs. This failure might cause a soft lockup with a trace similar to the following:

watchdog: BUG: soft lockup - CPU#81 stuck for 26s! [kworker/81:1H:1036]

NIP [c0080000071c59e8] lpfc_sli4_process_eq+0x50/0x200 [lpfc]
LR [c0080000071e16f0] lpfc_sli4_poll_hbtimer+0x78/0xe0 [lpfc]
Call Trace:
[c00000005475b5d0] [c00000039a990480] 0xc00000039a990480 (unreliable)
[c00000005475b630] [c0080000071e16f0] lpfc_sli4_poll_hbtimer+0x78/0xe0 [lpfc]
[c00000005475b670] [c000000000238050] call_timer_fn+0x50/0x1c0
[c00000005475b700] [c0000000002384e4] __run_timers.part.0+0x324/0x480
[c00000005475b7e0] [c000000000238694] run_timer_softirq+0x54/0xa0
[c00000005475b810] [c000000000ece3cc] __do_softirq+0x15c/0x3e0
[c00000005475b910] [c000000000158ac8] __irq_exit_rcu+0x158/0x190
[c00000005475b940] [c000000000158d00] irq_exit+0x20/0x40
[c00000005475b960] [c00000000002805c] timer_interrupt+0x14c/0x2b0
[c00000005475b9c0] [c000000000016dc4] replay_soft_interrupts+0x134/0x2f0
[c00000005475bbb0] [c000000000017088] arch_local_irq_restore+0x108/0x170
[c00000005475bbe0] [c000000000ecdcc0] _raw_spin_unlock_irqrestore+0x80/0xb0
[c00000005475bc10] [c000000000964a94] mix_interrupt_randomness+0xe4/0x1b0
[c00000005475bc70] [c00000000017ad98] process_one_work+0x298/0x580
[c00000005475bd10] [c00000000017b128] worker_thread+0xa8/0x630
[c00000005475bda0] [c000000000188428] kthread+0x1b8/0x1c0
[c00000005475be10] [c00000000000cd64] ret_from_kernel_thread+0x5c/0x64

The CPU soft lockups can cause other failures. If the DLPAR memory removal operations continue, they might cause an out-of-memory error that results in a system crash with a trace similar to the following:

Oops: Kernel access of bad area, sig: 11 [#1]

NIP [c0000000004ebf3c] deactivate_slab+0x15c/0x6f0
LR [c0000000004ec790] flush_cpu_slab+0x90/0x130
Call Trace:
[c000000591b07af0] [c000000000ec4ab0] schedule+0x60/0x110 (unreliable)
[c000000591b07c30] [c0000000004ec790] flush_cpu_slab+0x90/0x130
[c000000591b07c70] [c00000000017ad98] process_one_work+0x298/0x580
[c000000591b07d10] [c00000000017b128] worker_thread+0xa8/0x630
[c000000591b07da0] [c000000000188428] kthread+0x1b8/0x1c0
[c000000591b07e10] [c00000000000cd64] ret_from_kernel_thread+0x5c/0x64
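
The two traces above can be recognized mechanically in kernel log text. The following Python sketch is illustrative only: the function name `scan_kernel_log` and the regular expressions are assumptions derived from the example traces in this document, and real logs might be worded differently.

```python
import re

# Patterns modeled on the example traces in this document (assumed, not exhaustive).
SOFT_LOCKUP = re.compile(r"watchdog: BUG: soft lockup - CPU#(\d+) stuck for (\d+)s!")
LPFC_FRAME = re.compile(r"\blpfc_\w+\+0x[0-9a-f]+/0x[0-9a-f]+ \[lpfc\]")
SLAB_FRAME = re.compile(r"\b(?:deactivate_slab|flush_cpu_slab)\+0x")
OOPS_HEADER = "Oops: Kernel access of bad area"


def scan_kernel_log(text):
    """Summarize which of the two failure signatures appear in kernel log text."""
    lockups = [(int(cpu), int(secs)) for cpu, secs in SOFT_LOCKUP.findall(text)]
    return {
        # (CPU number, seconds stuck) for every soft lockup report found
        "soft_lockups": lockups,
        # True if any stack frame belongs to the lpfc module
        "lpfc_in_trace": bool(LPFC_FRAME.search(text)),
        # True if an Oops header and a slab-flush frame are both present
        "slab_oops": OOPS_HEADER in text and bool(SLAB_FRAME.search(text)),
    }


if __name__ == "__main__":
    sample = (
        "watchdog: BUG: soft lockup - CPU#81 stuck for 26s! [kworker/81:1H:1036]\n"
        "NIP [c0080000071c59e8] lpfc_sli4_process_eq+0x50/0x200 [lpfc]\n"
    )
    print(scan_kernel_log(sample))
```

The function takes plain text, so it can be given the output of `dmesg` or `journalctl -k`. A match is only an indication that the log resembles the traces shown here, not a confirmed diagnosis.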

Workaround

There is currently no workaround for this issue. Shut down the logical partition before adding or removing CPUs from the configuration, rather than using a DLPAR operation. If a CPU soft lockup occurs, do not perform further DLPAR operations until the error is resolved.

For more information about DLPAR, see Dynamic logical partitioning.
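
The advice in this section can be expressed as a small pre-check to run before changing the CPU configuration. This is a minimal sketch, not an IBM-provided tool: the function names `module_loaded` and `cpu_dlpar_advice` are hypothetical, and treating a loaded `lpfc` module as a sign that an Emulex FC adapter is in use is an assumption.

```python
from pathlib import Path


def module_loaded(name, modules_text):
    """Return True if `name` is the first field of a line in /proc/modules-style text."""
    return any(
        line.split()[0] == name
        for line in modules_text.splitlines()
        if line.strip()
    )


def cpu_dlpar_advice(modules_text, log_text):
    """Map the guidance in this document to a one-line recommendation (illustrative)."""
    # Assumption: without the lpfc driver loaded, the lockup described here cannot occur.
    if not module_loaded("lpfc", modules_text):
        return "lpfc not loaded: the issue described here is not expected"
    # A soft lockup has already been reported in the supplied kernel log text.
    if "BUG: soft lockup" in log_text:
        return "soft lockup seen: avoid DLPAR operations until the error is resolved"
    return "lpfc loaded: shut down the logical partition before adding or removing CPUs"


if __name__ == "__main__":
    proc_modules = Path("/proc/modules")
    modules = proc_modules.read_text() if proc_modules.exists() else ""
    # Pass real kernel log text here, for example the output of dmesg.
    print(cpu_dlpar_advice(modules, log_text=""))
```

Both inputs are plain strings, so the check can also be run against saved copies of `/proc/modules` and the kernel log from another partition.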

Fix Outlook

The resolution to this issue is still under investigation and will be applied to a future z-stream kernel version or maintweb release. Once the fix is available, upgrading the kernel should resolve the issue. If the issue persists after the fix is deployed, contact IBM Support for further assistance.

Document Information

Modified date:
20 February 2023

UID

ibm16955489