Using the Ceph manager crash module
By default, daemon crash dumps are stored in /var/lib/ceph/crash. You can
change this location with the crash dir option. Crash directories are named by time, date,
and a randomly generated UUID, and contain a metadata file named meta and a recent log
file. The crash_id recorded in the metadata is the same as the directory name.
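Because a crash ID joins a timestamp and a UUID with an underscore, the two parts can be separated with plain shell parameter expansion. The following is a minimal sketch: split_crash_id is a hypothetical helper name, and the sample ID is taken from the examples in this section.

```shell
# Hypothetical helper: split a crash ID of the form TIMESTAMP_UUID.
split_crash_id() {
    crash_id=$1
    timestamp=${crash_id%%_*}   # text before the first underscore
    uuid=${crash_id#*_}         # text after the first underscore
    printf '%s %s\n' "$timestamp" "$uuid"
}

split_crash_id "2022-05-24T19:58:42.549073Z_b2382865-ea89-4be2-b46f-9a59af7b7a2d"
```

The function prints the timestamp and the UUID separated by a space, which makes the output easy to feed into other shell tools.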
You can use ceph-crash.service to submit these crashes automatically and persist
them in the Ceph Monitors. The ceph-crash.service watches the crash dump
directory and uploads new crash dumps with ceph crash post.
The RECENT_CRASH health message is one of the most common health messages in a Ceph
cluster. This health message means that one or more Ceph daemons have crashed recently, and the crash
has not yet been archived or acknowledged by the administrator. This might indicate a software bug,
a hardware problem such as a failing disk, or some other problem. The option
mgr/crash/warn_recent_interval controls how far back a crash still counts as recent, which
is two weeks by default. You can disable the warnings by running the following command:
Example
[ceph: root@host01 /]# ceph config set mgr mgr/crash/warn_recent_interval 0
The option mgr/crash/retain_interval controls how long crash reports are
retained before they are automatically purged. The default for this option is one
year.
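The two intervals can be worked out with a little shell arithmetic. The sketch below assumes that both options take their value in seconds; that is an assumption and should be checked against your release. days_to_seconds is a hypothetical helper, and the ceph command is only printed, not executed.

```shell
# Hypothetical helper: convert a number of days to seconds.
days_to_seconds() {
    echo $(( $1 * 24 * 60 * 60 ))
}

# Defaults described above, assuming the options are expressed in seconds.
echo "warn_recent_interval default: $(days_to_seconds 14)"    # two weeks
echo "retain_interval default: $(days_to_seconds 365)"        # one year

# Print (do not run) a command that would shorten the warning window to 7 days.
echo "ceph config set mgr mgr/crash/warn_recent_interval $(days_to_seconds 7)"
```

Printing the command first gives you a chance to review the value before applying it to the cluster.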
Prerequisites
-
A running IBM Storage Ceph cluster.
Procedure
-
Ensure the crash module is enabled:
Example
[ceph: root@host01 /]# ceph mgr module ls | more
{
    "always_on_modules": [
        "balancer",
        "crash",
        "devicehealth",
        "orchestrator_cli",
        "progress",
        "rbd_support",
        "status",
        "volumes"
    ],
    "enabled_modules": [
        "dashboard",
        "pg_autoscaler",
        "prometheus"
    ]
-
Save a crash dump: The metadata file is a JSON blob stored in the crash dir as
meta. You can also invoke the ceph command with the -i - option, which reads from stdin.
Example
[ceph: root@host01 /]# ceph crash post -i meta
-
List the timestamp and UUID crash IDs for all new and archived crash information:
Example
[ceph: root@host01 /]# ceph crash ls
-
List the timestamp and UUID crash IDs for all new crash information:
Example
[ceph: root@host01 /]# ceph crash ls-new
-
List the summary of saved crash information grouped by age:
Example
[ceph: root@host01 /]# ceph crash stat
8 crashes recorded
8 older than 1 days old:
2022-05-20T08:30:14.533316Z_4ea88673-8db6-4959-a8c6-0eea22d305c2
2022-05-20T08:30:14.590789Z_30a8bb92-2147-4e0f-a58b-a12c2c73d4f5
2022-05-20T08:34:42.278648Z_6a91a778-bce6-4ef3-a3fb-84c4276c8297
2022-05-20T08:34:42.801268Z_e5f25c74-c381-46b1-bee3-63d891f9fc2d
2022-05-20T08:34:42.803141Z_96adfc59-be3a-4a38-9981-e71ad3d55e47
2022-05-20T08:34:42.830416Z_e45ed474-550c-44b3-b9bb-283e3f4cc1fe
2022-05-24T19:58:42.549073Z_b2382865-ea89-4be2-b46f-9a59af7b7a2d
2022-05-24T19:58:44.315282Z_1847afbc-f8a9-45da-94e8-5aef0738954e
-
View the details of the saved crash:
Syntax
ceph crash info CRASH_ID
Example
[ceph: root@host01 /]# ceph crash info 2022-05-24T19:58:42.549073Z_b2382865-ea89-4be2-b46f-9a59af7b7a2d
{
    "assert_condition": "session_map.sessions.empty()",
    "assert_file": "/builddir/build/BUILD/ceph-16.1.0-486-g324d7073/src/mon/Monitor.cc",
    "assert_func": "virtual Monitor::~Monitor()",
    "assert_line": 287,
    "assert_msg": "/builddir/build/BUILD/ceph-16.1.0-486-g324d7073/src/mon/Monitor.cc: In function 'virtual Monitor::~Monitor()' thread 7f67a1aeb700 time 2022-05-24T19:58:42.545485+0000\n/builddir/build/BUILD/ceph-16.1.0-486-g324d7073/src/mon/Monitor.cc: 287: FAILED ceph_assert(session_map.sessions.empty())\n",
    "assert_thread_name": "ceph-mon",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12b30) [0x7f679678bb30]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f6798c8d37b]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x276544) [0x7f6798c8d544]",
        "(Monitor::~Monitor()+0xe30) [0x561152ed3c80]",
        "(Monitor::~Monitor()+0xd) [0x561152ed3cdd]",
        "main()",
        "__libc_start_main()",
        "_start()"
    ],
    "ceph_version": "16.2.8-65.el8cp",
    "crash_id": "2022-07-06T19:58:42.549073Z_b2382865-ea89-4be2-b46f-9a59af7b7a2d",
    "entity_name": "mon.ceph-adm4",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "8.5 (Ootpa)",
    "os_version_id": "8.5",
    "process_name": "ceph-mon",
    "stack_sig": "957c21d558d0cba4cee9e8aaf9227b3b1b09738b8a4d2c9f4dc26d9233b0d511",
    "timestamp": "2022-07-06T19:58:42.549073Z",
    "utsname_hostname": "host02",
    "utsname_machine": "x86_64",
    "utsname_release": "4.18.0-240.15.1.el8_3.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Wed Jul 06 03:12:15 EDT 2022"
}
-
Remove saved crashes older than KEEP days. Here, KEEP must be an integer.
Syntax
ceph crash prune KEEP
Example
[ceph: root@host01 /]# ceph crash prune 60
-
Archive a crash report so that it is no longer considered for the
RECENT_CRASH health check and does not appear in the crash ls-new output. It still appears in the crash ls output.
Syntax
ceph crash archive CRASH_ID
Example
[ceph: root@host01 /]# ceph crash archive 2022-05-24T19:58:42.549073Z_b2382865-ea89-4be2-b46f-9a59af7b7a2d
-
Archive all crash reports:
Example
[ceph: root@host01 /]# ceph crash archive-all
-
Remove the crash dump:
Syntax
ceph crash rm CRASH_ID
Example
[ceph: root@host01 /]# ceph crash rm 2022-05-24T19:58:42.549073Z_b2382865-ea89-4be2-b46f-9a59af7b7a2d
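The inspect and archive steps of this procedure can be combined into a small wrapper. This is a sketch only: triage_crash is a hypothetical function name, and the CEPH variable is an assumption introduced here so that the ceph binary can be swapped for a stub during a dry run.

```shell
# Hypothetical wrapper: show the details of one crash, then archive it.
# CEPH defaults to the real ceph CLI; set CEPH=echo for a dry run.
triage_crash() {
    crash_id=$1
    if [ -z "$crash_id" ]; then
        echo "usage: triage_crash CRASH_ID" >&2
        return 1
    fi
    # Stop before archiving if the crash details cannot be displayed.
    "${CEPH:-ceph}" crash info "$crash_id" || return 1
    "${CEPH:-ceph}" crash archive "$crash_id"
}

# Dry run: echo the arguments of both commands instead of contacting the cluster.
CEPH=echo
triage_crash "2022-05-24T19:58:42.549073Z_b2382865-ea89-4be2-b46f-9a59af7b7a2d"
```

Unset CEPH, or set it to ceph, to run the same function against a live cluster from within the cephadm shell.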