Down OSDs
Understand and troubleshoot OSDs that are down.
The ceph health detail command returns an error similar to the following example:
HEALTH_WARN 1/3 in osds are down
What this means
One of the ceph-osd processes is unavailable due to a possible service failure
or problems with communication with other OSDs. As a consequence, the surviving
ceph-osd daemons reported this failure to the Monitors.
If the ceph-osd daemon is not running, either the underlying OSD drive or file system
is corrupted, or some other error, such as a missing keyring, is preventing the daemon from
starting.
If the ceph-osd daemon is running but still marked as down, the cause is
usually a networking issue.
For more information, see Stale placement groups. To enable logging files, see Ceph daemon logs.
Troubleshooting this problem
-
Determine which OSD is down, by using the ceph health detail command.
[ceph: root@host01 /]# ceph health detail
HEALTH_WARN 1/3 in osds are down
osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080
-
Restart the ceph-osd daemon. Replace OSD_NUMBER with the ID of the OSD that is in a down state.
[root@host01 ~]# systemctl restart ceph-osd@OSD_NUMBER
For example,
[root@host01 ~]# systemctl restart ceph-osd@0
- If you are not able to start ceph-osd, go to The ceph-osd daemon cannot start.
- If you are able to start the ceph-osd daemon but it is marked in a down state, go to The ceph-osd is running but still marked as down.
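The two steps above can be sketched as a small shell helper. This is a minimal sketch, not an official tool: it assumes that ceph health detail prints lines of the form "osd.N is down since epoch ..." as in the example, and the function names down_osd_ids and print_restart_commands are hypothetical. The restart commands are only printed, so that they can be reviewed before anything is run.

```shell
#!/bin/sh
# Hypothetical helper: read `ceph health detail` output on stdin and print
# the numeric ID of every OSD reported as down, one per line.
# Assumes lines of the form "osd.N is down since epoch ..." as in the example.
down_osd_ids() {
    sed -n 's/.*osd\.\([0-9][0-9]*\) is down.*/\1/p'
}

# Hypothetical helper: turn OSD IDs on stdin into restart commands.
# The commands are printed, not executed.
print_restart_commands() {
    while read -r id; do
        printf 'systemctl restart ceph-osd@%s\n' "$id"
    done
}

# Usage on a cluster node (not executed here):
#   ceph health detail | down_osd_ids | print_restart_commands
```

Printing the commands instead of running them keeps the sketch safe to try on a production node; pipe the output to sh only after checking it.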
The ceph-osd daemon cannot start
-
If you have a node containing several OSDs (generally, more than twelve), verify that the default maximum number of threads (PID count) is sufficient. For more information, see Increasing PID count.
-
Verify that the OSD data and journal partitions are mounted properly. You can use the ceph-volume lvm list command to list all devices and volumes that are associated with the Ceph Storage Cluster and then manually inspect whether they are mounted properly. See the mount(8) manual page for details.
-
If you got the ERROR: missing keyring, cannot use cephx for authentication error message, the OSD is missing a keyring.
-
If you got the ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-1 error message, the ceph-osd daemon cannot read the underlying file system.
- Check the corresponding log file to determine the cause of the failure. By default, Ceph stores log files in the /var/log/ceph/ directory.
- An EIO error message indicates a failure of the underlying disk. To fix this problem, replace the underlying OSD disk. For more information, see Replacing an OSD drive.
- If the log includes any other FAILED assert errors, such as the one in the following example, open a support ticket. For more information, contact IBM Support.
FAILED assert(0 == "hit suicide timeout")
-
Check the dmesg output for errors with the underlying file system or disk:
- An error -5 message, similar to the one in the following example, indicates corruption of the underlying XFS file system.
xfs_log_force: error -5 returned
To resolve this issue, unmount the volume and then recover the file system by using the xfs_repair command. For more information and help using the xfs_repair command, contact IBM Support, referring to this error and documentation.
- If the dmesg output includes any SCSI error messages, see the SCSI Error Codes Solution Finder solution to determine the best way to fix the problem.
- If you are still unable to fix the underlying file system, replace the OSD drive. For more information, see Replacing an OSD drive.
-
If the OSD failed with a segmentation fault, such as the one in the following example, gather the required information and open a support ticket. For more information, contact IBM Support.
Caught signal (Segmentation fault)
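The checks in this section can be sketched as a simple pattern lookup. This is a minimal sketch under stated assumptions: the function name suggest_osd_action and its output labels are hypothetical, the patterns are just the literal message fragments quoted in this section, and matching the bare string EIO is an assumption about how such errors appear in the log. Only the first matching pattern is reported.

```shell
#!/bin/sh
# Hypothetical helper: read OSD log or dmesg text on stdin and print the
# follow-up action that this section suggests for the first pattern found.
# Patterns are the literal fragments quoted in the text; "EIO" is assumed
# to appear verbatim in the log when the disk returns an I/O error.
suggest_osd_action() {
    text=$(cat)
    case "$text" in
        *'missing keyring'*)
            echo 'keyring: the OSD is missing a keyring' ;;
        *'unable to open OSD superblock'*)
            echo 'superblock: check the OSD log under /var/log/ceph/' ;;
        *'error -5'*)
            echo 'xfs: unmount the volume and recover with xfs_repair' ;;
        *EIO*)
            echo 'disk: replace the underlying OSD disk' ;;
        *'FAILED assert'*)
            echo 'support: open a support ticket' ;;
        *'Caught signal (Segmentation fault)'*)
            echo 'support: gather information and open a support ticket' ;;
        *)
            echo 'unknown: no known pattern matched' ;;
    esac
}

# Usage on an OSD node (not executed here):
#   dmesg | suggest_osd_action
```

The lookup is deliberately coarse: it points to the relevant step in this section and does not replace reading the log itself.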
The ceph-osd is running but still marked as down
- If the log includes any error messages similar to the following example, see Flapping OSDs.
wrongly marked me down
heartbeat_check: no reply from osd.2 since back
- If there are any other error types, open a support ticket. For more information, contact IBM Support.
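Because a running OSD that is marked down usually points to a networking issue, it can help to see which peers stopped answering heartbeats. The following is a minimal sketch, assuming that the log lines look like the heartbeat_check example above (the peer appears directly as "osd.N" after "no reply from"); the function name count_silent_peers is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read OSD log text on stdin and print, for each peer
# OSD named in a "heartbeat_check: no reply from osd.N" line, how many such
# lines were seen. Output format: "osd.N COUNT", sorted by name.
count_silent_peers() {
    sed -n 's/.*heartbeat_check: no reply from osd\.\([0-9][0-9]*\).*/\1/p' |
        awk '{ count[$0]++ } END { for (id in count) print "osd." id, count[id] }' |
        sort
}

# Usage (not executed here), with the OSD log file supplied on stdin:
#   count_silent_peers < OSD_LOG_FILE
```

A single peer with a high count suggests a problem on the path to that host, while many peers with similar counts suggest a problem on the local node's network.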