IBM Support

Methods to debug hung or unresponsive Linux systems

Question & Answer


Question

The system may still respond to ping, but active login sessions stop responding and new ones cannot be opened.

Answer

These types of problems are often difficult to resolve from the software side alone. In our experience, a true hard lockup can be a hardware problem. Make sure that all firmware, BIOS, and diagnostics have been examined and, where necessary, updated.

Your first step should be to confirm the state of the machine. Some suggestions:

- Is the system pingable?

- Get kdump set up, configured, and tested before the next occurrence.
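
One way to confirm that kdump is ready before the next hang is to check that memory was reserved for the crash kernel at boot and that a crash kernel is currently loaded. The following is a minimal sketch, assuming a kernel that exposes /sys/kernel/kexec_crash_loaded. The function name kdump_ready and its optional file arguments are illustrative only and are not part of any kdump tooling.

```shell
#!/bin/sh
# kdump_ready [cmdline_file] [loaded_file]
# Prints "ready" if the boot command line contains crashkernel= and the
# kernel reports a loaded crash kernel; otherwise prints the reason.
# The file arguments are hypothetical conveniences: they default to the
# live system paths and exist only so the check can be tried on samples.
kdump_ready() {
    cmdline_file=${1:-/proc/cmdline}
    loaded_file=${2:-/sys/kernel/kexec_crash_loaded}

    # Memory for the crash kernel must be reserved at boot.
    if ! grep -q 'crashkernel=' "$cmdline_file" 2>/dev/null; then
        echo "no crashkernel= reservation on the kernel command line"
        return 1
    fi
    # The kernel reports 1 here once a crash kernel has been loaded.
    if [ "$(cat "$loaded_file" 2>/dev/null)" != "1" ]; then
        echo "crash kernel not loaded"
        return 1
    fi
    echo "ready"
}
```

Calling kdump_ready with no arguments checks the running system. A "ready" result only shows that the pieces are in place; a test dump is still needed to confirm that a vmcore is actually written.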

Sysrq:

Sysrq gives you a "backdoor" into the kernel to gather information or trigger a crashdump, provided the kernel is still alive (for example, the system is still pingable).

See /usr/src/<kernelversion>/Documentation/sysrq.txt for details and additional options.

Make sure that /etc/sysctl.conf has the following (setting kernel.panic_on_oops is a best practice):

kernel.sysrq = 1
kernel.panic_on_oops = 1

- Then run sysctl -p

- You can confirm the sysctl values with the following command: sysctl -A | less

- You can then test it by echoing a command letter into /proc/sysrq-trigger, for example:

echo m > /proc/sysrq-trigger

- Memory status information will be logged to /var/log/messages.

- If the system hangs, go to the console and use the Alt-Sysrq sequences below. The Sysrq key is often labeled PrintScreen.


    Alt-Sysrq-c (trigger a crashdump)

    NOTE: On a System P (ppc64) system, Alt-Sysrq-c will test the crashdump just as expected. In the event of a real panic, the debugger (mon>) will be entered and you will need to type "X" to trigger the dump.

In the event that you have enabled Sysrq but have not yet completed setting up crashdumps, it may still be useful to use the Sysrq mechanism to gather data. Use the following key sequences at the console:

    Alt-Sysrq-p (registers and flags of the current CPU)
    Alt-Sysrq-t (task list with stacks)
    Alt-Sysrq-m (memory)
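
If a remote shell still responds, the same data can be requested without console access by writing the command letters to /proc/sysrq-trigger, in the same way as the echo m test earlier. The following is a minimal sketch, assuming root privileges. The function name sysrq_send and the SYSRQ_TRIGGER override variable are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# sysrq_send KEY...
# Writes each Sysrq command letter to the trigger file, one at a time.
# SYSRQ_TRIGGER is a hypothetical override so the function can be tried
# against an ordinary file; by default it targets /proc/sysrq-trigger.
sysrq_send() {
    trigger=${SYSRQ_TRIGGER:-/proc/sysrq-trigger}
    for key in "$@"; do
        # One letter per write, as in: echo m > /proc/sysrq-trigger
        echo "$key" > "$trigger" || return 1
        echo "sent sysrq-$key"
    done
}

# Example (run as root): sysrq_send p t m
# Do not pass c unless a crash is intended: it crashes the system.
```

The output of each request goes to the kernel log, so look for it in /var/log/messages once the requests have been sent.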

unknown_nmi_panic:

The value kernel.unknown_nmi_panic makes the kernel panic when it receives an NMI from an unknown source. If your hardware has an NMI button, this lets you trigger an NMI and an Oops on demand. This is suggested when the system is not pingable or otherwise accessible. It is intended to help debug what are generally hardware problems. Normally you would enable this only after a discussion with Customer Support.

Before you enable kernel.unknown_nmi_panic, check to see if nmi_watchdog is enabled by doing the following:

cat /proc/interrupts | grep NMI

If there are nonzero values, you will need to disable nmi_watchdog in the bootloader. Edit the kernel command line to include:

nmi_watchdog=0

Make sure to reboot after the bootloader change and check to make sure it's disabled with:

cat /proc/interrupts | grep NMI
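
On systems with many CPUs the NMI line is long, so it can be easier to add the per-CPU counters into one number. The following is a minimal sketch; the function name nmi_count and its optional file argument are hypothetical.

```shell
#!/bin/sh
# nmi_count [interrupts_file]
# Sums the per-CPU counters on the NMI line of /proc/interrupts and
# prints the total. Prints 0 if there is no NMI line.
# The file argument is a hypothetical convenience for testing on samples.
nmi_count() {
    awk '$1 == "NMI:" {
             # Add only the numeric fields; the trailing description
             # ("Non-maskable interrupts") is skipped.
             for (i = 2; i <= NF; i++)
                 if ($i ~ /^[0-9]+$/) total += $i
         }
         END { print total + 0 }' "${1:-/proc/interrupts}"
}
```

A total of 0 after the reboot matches the expected result of the check above. As a further, assumed, check, taking two readings a few seconds apart shows whether the counters are still increasing.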

Then, edit /etc/sysctl.conf to include the following:

kernel.unknown_nmi_panic = 1

Then run sysctl -p

You can confirm the sysctl values with the following command: sysctl -A | less
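
Instead of paging through the full list, sysctl -n kernel.unknown_nmi_panic prints only the running value of that key. To check what the configuration file itself contains, something like the following sketch could be used. The function name conf_value is hypothetical, and the parsing assumes simple key = value lines with comment lines that start with # or ;.

```shell
#!/bin/sh
# conf_value KEY [conf_file]
# Prints the value assigned to KEY in a sysctl-style configuration file.
# Returns non-zero if KEY is not set. Defaults to /etc/sysctl.conf.
conf_value() {
    awk -F= -v key="$1" '
        /^[ \t]*[#;]/ { next }                  # skip comment lines
        {
            name = $1; value = $2
            gsub(/[ \t]/, "", name)             # strip spaces around key
            gsub(/^[ \t]+|[ \t]+$/, "", value)  # trim the value
            if (name == key) found = value      # a later line overrides
        }
        END { if (found != "") print found; else exit 1 }
    ' "${2:-/etc/sysctl.conf}"
}

# Example: conf_value kernel.unknown_nmi_panic
```

If the value in the file and the value reported by sysctl -n differ, sysctl -p has probably not been run since the file was edited.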

NOTE: unknown_nmi_panic is incompatible with nmi_watchdog and the Oracle hangcheck_timer. Please contact Service for additional information.

For Red Hat and SUSE, provide:

The vmcore from the /var/crash/* directory.
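
To locate the dump files before sending them, the crash directory can be searched as in the following sketch. The function name list_vmcores is hypothetical, and the vmcore* file name pattern is an assumption about how the dump and any accompanying log files are named on your system.

```shell
#!/bin/sh
# list_vmcores [crash_dir]
# Lists every file whose name starts with "vmcore" under the crash
# directory (default /var/crash), one path per line, sorted.
list_vmcores() {
    find "${1:-/var/crash}" -type f -name 'vmcore*' | sort
}
```

An empty result after a hang suggests that no dump was captured, in which case the kdump configuration should be reviewed and tested again.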


Document Information

Modified date:
12 August 2021

UID

isg3T1010236