Flapping OSDs
Flapping OSDs repeatedly change between down and up within a short period of time. Understand and troubleshoot flapping OSDs.
OSDs are flapping when the ceph -w | grep osds command repeatedly shows OSDs as down and then up again within a short period of time, as in the following example:

```
ceph -w | grep osds
2022-05-05 06:27:20.810535 mon.0 [INF] osdmap e609: 9 osds: 8 up, 9 in
2022-05-05 06:27:24.120611 mon.0 [INF] osdmap e611: 9 osds: 7 up, 9 in
2022-05-05 06:27:25.975622 mon.0 [INF] HEALTH_WARN; 118 pgs stale; 2/9 in osds are down
2022-05-05 06:27:27.489790 mon.0 [INF] osdmap e614: 9 osds: 6 up, 9 in
2022-05-05 06:27:36.540000 mon.0 [INF] osdmap e616: 9 osds: 7 up, 9 in
2022-05-05 06:27:39.681913 mon.0 [INF] osdmap e618: 9 osds: 8 up, 9 in
2022-05-05 06:27:43.269401 mon.0 [INF] osdmap e620: 9 osds: 9 up, 9 in
2022-05-05 06:27:54.884426 mon.0 [INF] osdmap e622: 9 osds: 8 up, 9 in
2022-05-05 06:27:57.398706 mon.0 [INF] osdmap e624: 9 osds: 7 up, 9 in
2022-05-05 06:27:59.669841 mon.0 [INF] osdmap e625: 9 osds: 6 up, 9 in
2022-05-05 06:28:07.043677 mon.0 [INF] osdmap e628: 9 osds: 7 up, 9 in
2022-05-05 06:28:10.512331 mon.0 [INF] osdmap e630: 9 osds: 8 up, 9 in
2022-05-05 06:28:12.670923 mon.0 [INF] osdmap e631: 9 osds: 9 up, 9 in
```
In addition, the Ceph log contains error messages similar to the following example:

```
2022-05-25 03:44:06.510583 osd.50 127.0.0.1:6801/149046 18992 : cluster [WRN] map e600547 wrongly marked me down
2022-05-25 19:00:08.906864 7fa2a0033700 -1 osd.254 609110 heartbeat_check: no reply from osd.2 since back 2021-07-25 19:00:07.444113 front 2021-07-25 18:59:48.311935 (cutoff 2021-07-25 18:59:48.906862)
```
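To confirm whether OSDs in your cluster are currently logging these messages, you can search the cluster and OSD logs. The following is a minimal sketch: the log paths assume a default /var/log/ceph installation and the OSD ID (50) is only an example; containerized or cephadm deployments keep logs elsewhere or in journald.

```
# Search the cluster log for OSDs that were wrongly marked down.
grep "wrongly marked me down" /var/log/ceph/ceph.log

# Search an individual OSD log for failed heartbeats (osd.50 is an example ID).
grep "heartbeat_check: no reply" /var/log/ceph/ceph-osd.50.log

# On systemd-based hosts, the OSD unit journal may hold the messages instead.
journalctl -u ceph-osd@50 | grep heartbeat_check
```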
What this means
The main causes of flapping OSDs are:
- Certain storage cluster operations, such as scrubbing or recovery, take an abnormally long time, for example when they run on objects with a large index or on large placement groups. Usually, after these operations finish, the flapping OSD problem is resolved.
- Problems with the underlying physical hardware. In this case, the ceph health detail command also returns the slow requests error message.
- Problems with the network.
Ceph OSDs cannot manage situations where the private network for the storage cluster fails, or where there is significant latency on the public client-facing network. Ceph OSDs use the private network to send heartbeat packets to each other to indicate that they are up and in. If the private storage cluster network does not work properly, OSDs cannot send or receive heartbeat packets. As a consequence, they report each other as down to the Ceph Monitors, while marking themselves as up.
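If you suspect the private storage cluster network, it can help to confirm which networks the OSDs are configured to use and that OSD nodes can reach each other over them. The following is a brief sketch; the address shown is a placeholder for one of your own OSD nodes.

```
# Show which networks are configured for client traffic and for internal
# heartbeat and replication traffic. Empty output means the option is unset
# and a single network carries all traffic.
ceph config get osd public_network
ceph config get osd cluster_network

# From one OSD node, check basic reachability of another OSD node over the
# cluster network (the address is a placeholder).
ping -c 3 192.168.122.12
```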
Table 1 lists parameters in the Ceph configuration
file that can influence this behavior.
| Parameter | Description | Default value |
|---|---|---|
| osd_heartbeat_grace_time | How long OSDs wait for heartbeat packets to return before reporting an OSD as down to the Ceph Monitors. | 20 seconds |
| mon_osd_min_down_reporters | How many OSDs must report another OSD as down before the Ceph Monitors mark the OSD as down. | 2 |
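You can check the values currently in effect with the ceph config interface. This is a short sketch; note that recent Ceph releases expose the grace period under the option name osd_heartbeat_grace, so adjust the option name if your release spells it differently.

```
# Query the cluster-wide values currently in effect.
ceph config get osd osd_heartbeat_grace
ceph config get mon mon_osd_min_down_reporters

# On the node that hosts a given OSD, the admin socket shows the value the
# running daemon is actually using (osd.0 is an example ID).
ceph daemon osd.0 config get osd_heartbeat_grace
```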
Table 1 shows that, in the default configuration, the Ceph Monitors mark an OSD as down after at least two OSDs report it as down. In some cases, if a single host encounters network issues, the entire cluster can experience flapping OSDs, because the OSDs on that host report other OSDs in the cluster as down.

Note: The flapping OSD scenario does not include the situation where the OSD processes are started and then immediately stopped.
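To see which OSDs are generating the down reports, you can search the cluster log on a Monitor node for failure reports. This is a hedged sketch; the exact message wording and the log location vary between releases and deployment types.

```
# Show recent failure reports recorded by the Monitors. The log path assumes
# a default installation and the message wording can differ between releases.
grep "reported failed" /var/log/ceph/ceph.log | tail -n 20
```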
Troubleshooting this problem
Important: Flapping OSDs can be caused by MTU misconfiguration on the Ceph OSD nodes, at the network switch level, or both. To resolve the issue, set the MTU to a uniform size on all storage cluster nodes, including the core and access network switches, during a planned downtime. Do not tune osd_heartbeat_min_size, because changing this setting can hide issues within the network, and it does not solve actual network inconsistency.
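Before changing anything, you can verify whether the MTU is consistent along the path between OSD nodes. The following is a minimal sketch; the interface name, target address, and the payload size for a 9000-byte jumbo frame MTU are assumptions to adapt to your environment.

```
# Show the configured MTU of the interface that carries cluster traffic
# (eth0 is an example interface name).
ip link show eth0

# Send a full-size, non-fragmenting packet to another OSD node. 8972 bytes of
# ICMP payload plus 28 bytes of headers corresponds to a 9000-byte MTU; use
# 1472 for a standard 1500-byte MTU. The address is a placeholder.
ping -c 3 -M do -s 8972 192.168.122.12
```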
- Check the output of the ceph health detail command again.

  Note: If the output includes the slow requests error message, see Slow requests or requests are blocked for information on how to troubleshoot this issue.

  ```
  ceph health detail
  HEALTH_WARN 30 requests are blocked > 32 sec; 3 osds have slow requests
  30 ops are blocked > 268435 sec
  1 ops are blocked > 268435 sec on osd.11
  1 ops are blocked > 268435 sec on osd.18
  28 ops are blocked > 268435 sec on osd.39
  3 osds have slow requests
  ```
- Determine which OSDs are marked as down and on what nodes they reside.

  ```
  ceph osd tree | grep down
  ```
- On the nodes containing the flapping OSDs, troubleshoot and fix any networking problems. For more information, see Troubleshooting networking issues.
- Alternatively, you can temporarily force the Monitors to stop marking the OSDs as down and up by setting the noup and nodown flags, as shown in the example after this list.

  Important: Using the noup and nodown flags does not fix the root cause of the problem but only prevents OSDs from flapping.

  ```
  ceph osd set noup
  ceph osd set nodown
  ```
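The flags remain set until you clear them, so remember to unset them once the underlying problem is fixed. The following is a short sketch of the full workflow; the flags apply cluster-wide.

```
# Prevent the Monitors from marking OSDs up or down while you repair the network.
ceph osd set noup
ceph osd set nodown

# Verify that the flags are set.
ceph osd dump | grep flags

# After the root cause is fixed, clear the flags so that OSD state changes are
# tracked normally again.
ceph osd unset noup
ceph osd unset nodown
```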