Firmware assisted dump support on PowerLinux systems
By: Mahesh Salgaonkar.
Any enterprise deployment needs facilities to trace, record, and analyze system state for serviceability. Ideally, we want to capture all the information needed to debug a problem the first time it occurs. In IBM parlance, this is called First Failure Data Capture (FFDC). While FFDC is important for every component of the system, it is critical for the operating system (OS), which is the heart of the running software. Gathering the state of the OS at the time of a crash or hang is the first step toward finding out what caused the problem, and an OS crash dump is the FFDC mechanism for this purpose.
Operating systems have come up with their own ways of configuring, capturing, and analyzing crash dumps, and Linux has been no different. The evolution of Linux's crash dump mechanism dates back to the early 2000s, when SGI's Linux Kernel Crash Dump (LKCD) was proposed. Other solutions included Netdump and Diskdump. None of these mechanisms, however, found favor with the upstream kernel community. Finally, in 2006, it was IBM's kdump mechanism that was accepted as the de facto OS crash dump mechanism for Linux.
In this article, I introduce how kernel dump capture works with fadump-enabled firmware, with a look at the emerging Linux distribution work to integrate kdump and fadump capabilities.
kdump uses kexec, a kernel-to-kernel boot loader that bypasses the firmware. Kexec had already been used for fast reboots, and extending it to boot a new (minimal) kernel from a reserved memory region, without disturbing the contents of the rest of RAM, was an appealing prospect for capturing the OS state at the time of failure. kdump therefore takes a two-kernel approach: the production kernel runs as normal, while a minimal kernel resides in a reserved memory area and is booted if the production kernel crashes. Once the minimal kernel is running, the contents of the (untouched) RAM can be accessed and written out in ELF format for analysis with tools such as crash. The system administrator can decide where the dump is stored and can filter it to reduce its size.
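Since the dump is written out in ELF format, its header can be inspected with a few lines of code before handing the file to an analysis tool. The following is a minimal sketch in Python, not part of kdump itself; the function name describe_elf_header is hypothetical, and the only facts it relies on are the standard ELF header layout (a 16-byte identification block followed by the 16-bit e_type and e_machine fields).

```python
import struct

ELF_MAGIC = b"\x7fELF"
ET_CORE = 4  # e_type value the ELF specification assigns to core files


def describe_elf_header(header: bytes) -> dict:
    """Decode the ELF header fields that identify a core dump image.

    Hypothetical helper for illustration; it only looks at the first 20 bytes.
    """
    if len(header) < 20 or header[:4] != ELF_MAGIC:
        raise ValueError("not an ELF image")
    bits = {1: 32, 2: 64}.get(header[4])      # EI_CLASS byte
    order = {1: "<", 2: ">"}.get(header[5])   # EI_DATA byte: little or big endian
    if bits is None or order is None:
        raise ValueError("unrecognized ELF class or byte order")
    # e_type and e_machine are two 16-bit fields right after the 16-byte e_ident.
    e_type, e_machine = struct.unpack_from(order + "HH", header, 16)
    return {
        "bits": bits,
        "big_endian": order == ">",
        "is_core": e_type == ET_CORE,
        "machine": e_machine,
    }
```

To try it, read the first few dozen bytes of a saved dump file in binary mode and pass them to the function; the path of that file depends on how the administrator configured dump storage.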
While kdump is an elegant solution to a critical problem, it has a few potential drawbacks. Once the OS crashes, the system, and especially its devices, is in an inconsistent state. Although the utmost care is taken to prevent failures, in rare cases a rogue DMA or an ill-behaved device driver can cause the kdump capture to fail. There is a continuing effort to make kdump more robust.
The IBM Power Systems platform has been a benchmark for RAS capabilities. While kdump is available on Linux running on Power Systems, engineers at the IBM Linux Technology Center envisaged that a more robust crash dump mechanism could be built using firmware features unique to Power Systems. Thus Firmware Assisted Dump (fadump) was born.
Fadump uses firmware features unique to Power Systems to capture the Linux kernel dump. When the OS crashes, the Power firmware is informed of the crash. It preserves the memory image as it was at the time of failure and reboots into a new kernel, resetting all device and system state along the way, which makes dump capture more robust. This feature is supported by the firmware on all POWER6 and later Power Systems servers.
For related information, see the following articles:
- fadump is now part of the upstream Linux kernel; more information about the mechanism can be found at https://lwn.net/Articles/488132/
- IBM InfoCenter provides an article on Configuring a Kernel Dump
- Red Hat provides an article on the kdump Crash Recovery Service
- SUSE provides an article on configuring the kernel core dump capture
Roughly, the following is what happens on a fadump operation:
1. At the crash, the kernel informs the Power firmware that it is crashing.
2. Firmware takes control and reboots the entire system, preserving only the memory contents; all other devices are reset by going through the firmware (BIOS) initialization.
3. The reboot follows the normal (non-kexec) boot process.
4. The boot loader loads the default kernel and initrd from /boot.
For an in-depth perspective:
The PowerLinux kernel uses the RTAS (Run Time Abstraction Services) interface to interact with Power platform hardware features. RTAS is run-time firmware that stays resident while the OS executes and that the OS calls to access platform hardware features on its behalf. The OS invokes every RTAS function by calling the rtas_call function with the function's RTAS token, which is published under /proc/device-tree/rtas/*.
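To see the tokens mentioned above from user space, one can read the corresponding device-tree property files. The sketch below is an illustration only: the helper name read_rtas_token is hypothetical, and it assumes that each property under /proc/device-tree/rtas/ holds a single 32-bit big-endian cell, which is the usual encoding of device-tree integer properties. It does not call RTAS; it only looks up the token value.

```python
import os
import struct


def read_rtas_token(name: str, rtas_dir: str = "/proc/device-tree/rtas") -> int:
    """Return the token for one RTAS function, for example "ibm,os-term".

    Assumption: the property file contains exactly one 32-bit big-endian cell.
    """
    path = os.path.join(rtas_dir, name)
    with open(path, "rb") as prop:
        raw = prop.read()
    if len(raw) != 4:
        raise ValueError(f"{path}: expected a 4-byte cell, got {len(raw)} bytes")
    return struct.unpack(">I", raw)[0]
```

On a PowerLinux system, read_rtas_token("ibm,os-term") would return the token if the platform publishes that function, and raise FileNotFoundError if it does not. The rtas_dir parameter exists only so the helper can be pointed at a copy of the tree for testing.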
In the normal power-off or reboot sequence, the PowerLinux kernel invokes the RTAS function power-off or system-reboot, which resets all processors and attached devices, including memory. In OS crash situations (for example, during the panic call), the kernel instead invokes the ibm,os-term RTAS call to tell the platform that it has terminated abnormally, so that the platform can take appropriate action.
Starting with POWER6, the Power platform supports an extended ibm,os-term behavior that preserves the memory contents if the kernel has registered for the platform-assisted kernel dump feature through the RTAS function ibm,configure-kernel-dump. With that registration in place, the kernel's ibm,os-term call on a crash both informs the platform of the OS crash and leads to a reboot in which memory is preserved.
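The relationship between registration and the crash-time call can be summarized in a toy model. This is a sketch of the decision described above, not a binding to real firmware: the class PlatformModel and its method names are hypothetical and merely mirror the RTAS function names used in the text.

```python
class PlatformModel:
    """Toy stand-in for the platform firmware; it performs no real RTAS calls."""

    def __init__(self) -> None:
        self.dump_registered = False

    def configure_kernel_dump(self) -> None:
        # Models a successful ibm,configure-kernel-dump registration.
        self.dump_registered = True

    def os_term(self) -> dict:
        # Models ibm,os-term: the platform learns of the abnormal termination,
        # and memory is preserved only if the kernel registered beforehand.
        return {
            "os_terminated": True,
            "memory_preserved": self.dump_registered,
        }
```

In this model, a kernel that crashes without having registered gets os_term() results with memory_preserved set to False, while one that called configure_kernel_dump() first gets True, which is the extended behavior fadump relies on.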
Because the firmware boots the new kernel only after fully resetting the system, all devices and ongoing DMAs are guaranteed to have been stopped properly. In particular, PCI and I/O devices have been reinitialized and are in a clean, consistent state. This improves Power serviceability by making fadump more robust than the current kdump mechanism on Linux. Like kdump, fadump exports the memory dump in ELF format, which allows the existing kdump infrastructure to be reused for dump capture and filtering.
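Whether fadump is active on a running system can be checked from user space. The sketch below assumes the sysfs files fadump_enabled and fadump_registered under /sys/kernel, as described in the upstream kernel's firmware-assisted dump documentation; these names are not given in this article and may differ between kernel versions, so treat them as assumptions. The helper name fadump_status is hypothetical.

```python
import os


def fadump_status(sysfs_dir: str = "/sys/kernel") -> dict:
    """Report fadump state as True, False, or None when a file is absent.

    Assumption: each file holds "1" when the feature is active and "0" otherwise.
    """
    status = {}
    for name in ("fadump_enabled", "fadump_registered"):
        try:
            with open(os.path.join(sysfs_dir, name)) as flag:
                status[name] = flag.read().strip() == "1"
        except FileNotFoundError:
            # Kernel built without fadump, or a different sysfs layout.
            status[name] = None
    return status
```

A result of None for both entries suggests that the kernel either lacks fadump support or exposes it under a different path; it does not by itself prove that the platform firmware lacks the feature.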
Engineers in the IBM Linux Technology Center (LTC) are working on integrating fadump with the existing kdump infrastructure so that users can benefit from a seamless migration to a more robust framework. We expect the fadump feature to be enabled in upcoming distributions that run on IBM PowerLinux systems.