Flapping OSDs

OSDs are flapping when they repeatedly change between the down and up states within a short period of time. Understand and troubleshoot flapping OSDs.

OSDs are flapping when the ceph -w | grep osds command repeatedly shows them as down and then up again within a short period of time, as in the following example.
ceph -w | grep osds
2022-05-05 06:27:20.810535 mon.0 [INF] osdmap e609: 9 osds: 8 up, 9 in
2022-05-05 06:27:24.120611 mon.0 [INF] osdmap e611: 9 osds: 7 up, 9 in
2022-05-05 06:27:25.975622 mon.0 [INF] HEALTH_WARN; 118 pgs stale; 2/9 in osds are down
2022-05-05 06:27:27.489790 mon.0 [INF] osdmap e614: 9 osds: 6 up, 9 in
2022-05-05 06:27:36.540000 mon.0 [INF] osdmap e616: 9 osds: 7 up, 9 in
2022-05-05 06:27:39.681913 mon.0 [INF] osdmap e618: 9 osds: 8 up, 9 in
2022-05-05 06:27:43.269401 mon.0 [INF] osdmap e620: 9 osds: 9 up, 9 in
2022-05-05 06:27:54.884426 mon.0 [INF] osdmap e622: 9 osds: 8 up, 9 in
2022-05-05 06:27:57.398706 mon.0 [INF] osdmap e624: 9 osds: 7 up, 9 in
2022-05-05 06:27:59.669841 mon.0 [INF] osdmap e625: 9 osds: 6 up, 9 in
2022-05-05 06:28:07.043677 mon.0 [INF] osdmap e628: 9 osds: 7 up, 9 in
2022-05-05 06:28:10.512331 mon.0 [INF] osdmap e630: 9 osds: 8 up, 9 in
2022-05-05 06:28:12.670923 mon.0 [INF] osdmap e631: 9 osds: 9 up, 9 in
In addition, the Ceph log contains error messages similar to the following example:
2022-05-25 03:44:06.510583 osd.50 127.0.0.1:6801/149046 18992 : cluster [WRN] map e600547 wrongly marked me down

2022-05-25 19:00:08.906864 7fa2a0033700 -1 osd.254 609110 heartbeat_check: no reply from osd.2 since back 2021-07-25 19:00:07.444113 front 2021-07-25 18:59:48.311935 (cutoff 2021-07-25 18:59:48.906862)
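To find these messages on your own cluster, you can search the cluster log on a Ceph Monitor node. The log path in the following example is only the common default; in containerized deployments the logs might be available only through journald.
# Search the cluster log for flapping-related messages (adjust the path for your deployment).
grep -E 'wrongly marked me down|heartbeat_check: no reply' /var/log/ceph/ceph.log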

What this means

The main causes of flapping OSDs are:
  • Certain storage cluster operations, such as scrubbing or recovery, take an abnormally long time, for example when you run them on objects with large indexes or on large placement groups. Usually, the flapping stops after these operations finish. A way to check for ongoing scrub or recovery activity is shown after this list.
  • Problems with the underlying physical hardware. In this case, the ceph health detail command also returns the slow requests error message.
  • Problems with the network.
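One way to check whether such operations are currently running is to look at the cluster status; the placement group states and the I/O summary in the output reflect ongoing scrubbing, recovery, or backfill activity:
# Look for placement group states such as scrubbing, recovering, or backfilling.
ceph status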

Ceph OSDs cannot manage situations where the private network for the storage cluster fails, or where there is significant latency on the public client-facing network.

Ceph OSDs use the private network for sending heartbeat packets to each other to indicate that they are up and in. If the private storage cluster network does not work properly, OSDs are unable to send and receive the heartbeat packets. As a consequence, they report each other as being down to the Ceph Monitors, while marking themselves as up.
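As a basic check of the heartbeat paths, you can verify which networks the OSDs use and test reachability between OSD nodes over the private network. The following is a minimal sketch: depending on how the cluster was deployed, these options may instead be defined in /etc/ceph/ceph.conf, and <peer-node-private-address> is a placeholder for the cluster-network address of another OSD node.
# Show the public and cluster (private) networks configured for the OSDs.
ceph config get osd public_network
ceph config get osd cluster_network

# From one OSD node, test reachability of another OSD node over the private network.
ping -c 3 <peer-node-private-address>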

Table 1 lists parameters in the Ceph configuration file that can influence this behavior.
Table 1. Ceph configuration parameters that can influence the OSD status
Parameter                  | Description                                                                                                   | Default value
osd_heartbeat_grace_time   | How long OSDs wait for the heartbeat packets to return before reporting an OSD as down to the Ceph Monitors. | 20 seconds
mon_osd_min_down_reporters | How many OSDs must report another OSD as down before the Ceph Monitors mark the OSD as down.                 | 2
Table 1 shows that, in the default configuration, the Ceph Monitors mark an OSD as down after at least two OSDs report it as down. In some cases, if one single host encounters network issues, the entire cluster can experience flapping OSDs. This is because the OSDs that reside on the host report other OSDs in the cluster as down.
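To see the values that your cluster currently uses for these parameters, you can query the Monitor configuration database or a running daemon. This is only a sketch: depending on the Ceph release, the grace period may be exposed as osd_heartbeat_grace, and osd.0 is an example daemon ID.
# Query the values from the Monitor configuration database (returns the defaults if unchanged).
ceph config get osd osd_heartbeat_grace
ceph config get mon mon_osd_min_down_reporters

# Or, on the node that hosts the daemon, ask a running OSD directly.
ceph daemon osd.0 config show | grep -E 'heartbeat_grace|min_down_reporters'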
Note: The flapping OSDs scenario does not include the situation when the OSD processes are started and then immediately stopped.

Troubleshooting this problem

Important: Flapping OSDs can be caused by MTU misconfiguration on the Ceph OSD nodes, at the network switch level, or both. To resolve the issue, set the MTU to a uniform size on all storage cluster nodes, including on the core and access network switches, during a planned downtime. Do not tune osd_heartbeat_min_size, because changing this setting can hide issues within the network, and it does not solve actual network inconsistency.
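To verify MTU consistency, you can compare the configured MTU on every node and confirm that packets of that size pass between nodes without fragmentation. The following is a sketch under assumptions: eth0 stands for the storage network interface, <peer-node-address> is a placeholder for another node, and the 8972-byte payload assumes a 9000-byte jumbo-frame MTU (use 1472 bytes for a standard 1500-byte MTU).
# Check the MTU of the storage network interface on each node.
ip link show dev eth0 | grep mtu

# Send non-fragmentable packets sized to the expected MTU to another node.
# 8972 = 9000 minus 28 bytes of IP and ICMP headers.
ping -M do -s 8972 -c 3 <peer-node-address>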
  1. Check the output of the ceph health detail command again.
    Note: If the output includes the slow requests error message, see Slow requests or requests are blocked for information on how to troubleshoot this issue.
    ceph health detail
    HEALTH_WARN 30 requests are blocked > 32 sec; 3 osds have slow requests
    30 ops are blocked > 268435 sec
    1 ops are blocked > 268435 sec on osd.11
    1 ops are blocked > 268435 sec on osd.18
    28 ops are blocked > 268435 sec on osd.39
    3 osds have slow requests
  2. Determine which OSDs are marked as down and on what nodes they reside.
    ceph osd tree | grep down
  3. On the nodes containing the flapping OSDs, troubleshoot and fix any networking problems. For more information, see Troubleshooting networking issues.
  4. Alternatively, you can temporarily force Monitors to stop marking the OSDs as down and up by setting the noup and nodown flags.
    Important: Using the noup and nodown flags does not fix the root cause of the problem but only prevents OSDs from flapping.
    ceph osd set noup
    ceph osd set nodown
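    After you fix the underlying problem, unset the flags so that the Ceph Monitors can track the OSD state again:
    ceph osd unset noup
    ceph osd unset nodown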