Down OSDs

Understand and troubleshoot OSDs that are down.

If OSDs are considered down, the ceph health detail command returns an error similar to the following example:

HEALTH_WARN 1/3 in osds are down

What this means

One of the ceph-osd processes is unavailable because of a possible service failure or a problem communicating with other OSDs. As a consequence, the surviving ceph-osd daemons report this failure to the Monitors.

If the ceph-osd daemon is not running, either the underlying OSD drive or file system is corrupted, or some other error, such as a missing keyring, is preventing the daemon from starting.

When the ceph-osd daemon is running but is still marked as down, the cause is usually a networking issue.

For more information, see Stale placement groups. To enable logging to files, see Ceph daemon logs.

Troubleshooting this problem

Determine which OSD is down by using the ceph health detail command.

[ceph: root@host01 /]# ceph health detail
HEALTH_WARN 1/3 in osds are down
osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080
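
On a node with many OSDs it can help to extract only the names of the down OSDs. The following is a minimal sketch, not part of the product. The `list_down_osds` function name is hypothetical, and the sketch assumes that the output has the `osd.N is down ...` line format shown in the example above. Other releases might format these lines differently.

```shell
# Print the name of every OSD that is reported as down.
# Reads `ceph health detail` output on stdin.
# Assumption: down OSDs appear as lines that start with "osd.N is down".
list_down_osds() {
    awk '$1 ~ /^osd\.[0-9]+$/ && $2 == "is" && $3 == "down" { print $1 }'
}

# Usage on a live cluster (requires the ceph CLI and admin access):
#   ceph health detail | list_down_osds
```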

Restart the ceph-osd daemon.
```
systemctl restart ceph-osd@OSD_NUMBER
```
Replace OSD_NUMBER with the ID of the OSD that is in a down state.
For example:
```
[root@host01 ~]# systemctl restart ceph-osd@0
```
- If you are not able to start the ceph-osd daemon, go to The ceph-osd daemon cannot start.
- If you are able to start the ceph-osd daemon but it is marked in a down state, go to The ceph-osd is running but still marked as down.

The `ceph-osd` daemon cannot start

If you have a node containing several OSDs (generally, more than twelve), verify that the default maximum number of threads (PID count) is sufficient. For more information, see Increasing PID count.
Verify that the OSD data and journal partitions are mounted properly. You can use the ceph-volume lvm list command to list all devices and volumes that are associated with the Ceph Storage Cluster and then manually inspect whether they are mounted properly. See the mount(8) manual page for details.
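
The manual mount inspection can be sketched as a small check against the mount table. This is an illustration only. The `osd_data_mounted` function name is hypothetical, and the sketch assumes that the OSD data directory is `/var/lib/ceph/osd/ceph-ID`, as in the superblock error message in this topic. Your deployment might use a different path.

```shell
# Succeed if the data directory of OSD $1 is listed as a mount point.
# $2 is the mount table to read; it defaults to /proc/mounts.
# Assumption: the data directory is /var/lib/ceph/osd/ceph-<ID>.
osd_data_mounted() {
    awk -v dir="/var/lib/ceph/osd/ceph-$1" \
        '$2 == dir { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Usage:
#   osd_data_mounted 0 && echo "osd.0 data directory is mounted"
```
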
If you get the ERROR: missing keyring, cannot use cephx for authentication error message, the OSD is missing its keyring.
If you get the ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-1 error message, the ceph-osd daemon cannot read the underlying file system.
- Check the corresponding log file to determine the cause of the failure. By default, Ceph stores log files in the /var/log/ceph/ directory.
- An EIO error message indicates a failure of the underlying disk. To fix this problem, replace the underlying OSD disk. For more information, see Replacing an OSD drive.
- If the log includes any other FAILED assert errors, open a support ticket. For more information, contact IBM Support.
  The following is an example of a FAILED assert message:
```
FAILED assert(0 == "hit suicide timeout")
```
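
The log checks in the preceding steps can be sketched as one pass over the OSD log. This is a rough illustration rather than a supported tool. The `classify_osd_log` function name and its output strings are hypothetical. The sketch matches the message fragments quoted in this topic, and it also treats the standard `Input/output error` text as an EIO, which is an assumption about how the error is logged.

```shell
# Read an OSD log on stdin and print one hint per known failure pattern.
classify_osd_log() {
    awk '
        /missing keyring/               { hit["keyring"] = 1 }
        /unable to open OSD superblock/ { hit["superblock"] = 1 }
        /EIO|Input\/output error/       { hit["eio"] = 1 }
        /FAILED assert/                 { hit["assert"] = 1 }
        END {
            if (hit["keyring"])    print "missing keyring"
            if (hit["superblock"]) print "unreadable superblock"
            if (hit["eio"])        print "EIO: replace the underlying OSD disk"
            if (hit["assert"])     print "FAILED assert: open a support ticket"
        }
    '
}

# Usage (the log file name is an assumption; check /var/log/ceph/ on your node):
#   classify_osd_log < /var/log/ceph/ceph-osd.0.log
```
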
Check the dmesg output for errors with the underlying file system or disk:
- An error -5 message, similar to the one in the following example, indicates corruption of the underlying XFS file system.
```
xfs_log_force: error -5 returned
```
  To resolve this issue, unmount the volume and then repair the file system by using the xfs_repair utility. For more information and help with using xfs_repair, contact IBM Support and refer to this error and this documentation.
- If the dmesg output includes any SCSI error messages, see the SCSI Error Codes Solution Finder to determine the best way to fix the problem.
- If you are still unable to fix the underlying file system, replace the OSD drive. For more information, see Replacing an OSD drive.
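
The XFS check in the dmesg step can be sketched as a simple filter. The `has_xfs_io_error` function name is hypothetical, and the pattern is modeled only on the example message in this topic, so it might miss other forms of the error. SCSI errors are not covered here.

```shell
# Succeed if the kernel log text on stdin contains an XFS "error -5" message.
# Assumption: the message looks like "xfs_log_force: error -5 returned".
has_xfs_io_error() {
    grep -Eqi 'xfs[^ ]*: error -5'
}

# Usage (reading the kernel log might require root privileges):
#   dmesg | has_xfs_io_error && echo "possible XFS corruption"
```
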
If the OSD failed with a segmentation fault, such as the one in the following example, gather the required information and open a support ticket. For more information, contact IBM Support.
```
Caught signal (Segmentation fault)
```

The `ceph-osd` is running but still marked as `down`

Check the corresponding log file to determine the cause of the failure. By default, Ceph stores log files in the /var/log/ceph/ directory.

If the log includes any error messages similar to the following example, see Flapping OSDs.
```
wrongly marked me down
    heartbeat_check: no reply from osd.2 since back
```
If there are any other error types, open a support ticket. For more information, contact IBM Support.
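
To judge how often an OSD log shows the flapping-related messages from the preceding example, you can count the matching lines. This is a minimal sketch. The `count_flapping_hints` function name is hypothetical, and the patterns are taken only from the example in this topic.

```shell
# Print the number of log lines on stdin that match the flapping-related
# messages quoted in this topic. Prints 0 when nothing matches.
count_flapping_hints() {
    grep -Ec 'wrongly marked me down|heartbeat_check: no reply from' || true
}

# Usage (the log file name is an assumption; check /var/log/ceph/ on your node):
#   count_flapping_hints < /var/log/ceph/ceph-osd.0.log
```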
