Replacing an OSD drive
Ceph is designed for fault tolerance, which means that it can operate in a degraded state without losing data. Therefore, Ceph can continue operating even if a data storage drive fails. In the context of a failed drive, the degraded state means that the extra copies of the data stored on other OSDs backfill automatically to other OSDs in the cluster. When this occurs, replace the failed OSD drive and re-create the OSD manually.
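For example, you can watch the degraded state and the backfill progress with the standard status commands; the exact output varies by cluster:
ceph -s                # overall cluster health, including degraded or misplaced objects
ceph health detail     # expands each health warning, including which OSDs are down or out
ceph -w                # watch recovery and backfill events as they happen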
Before you begin
- A running IBM Storage Ceph cluster.
- Root-level access to the Ceph Monitor node.
- At least one OSD is down.
About this task
When a drive fails, Ceph reports the OSD as down.
For example:
HEALTH_WARN 1/3 in osds are down osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080
Note: Ceph can also mark an OSD as down as a consequence of networking or permission problems.
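To confirm which OSD is down and to rule out a networking or permission problem before you replace any hardware, you can query the cluster directly, for example:
ceph health detail            # lists the OSDs that are reported down and the related health checks
ceph osd tree | grep -i down  # shows where the down OSDs sit in the CRUSH hierarchy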
Modern servers typically deploy with hot-swappable drives, so you can pull a failed drive and replace it with a new one without bringing down the node. The whole procedure includes removing the failed OSD from the cluster, replacing the physical drive, and re-creating the OSD.
For more information about removing the OSD, see Removing an OSD from the Ceph cluster.
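As a rough sketch of the removal step, assuming a cephadm-managed cluster and that the failed OSD is osd.0 (both assumptions for illustration), the commands typically look like the following; follow Removing an OSD from the Ceph cluster for the authoritative steps:
ceph osd out osd.0             # mark the failed OSD out so that its data backfills to other OSDs
ceph orch osd rm 0 --replace   # remove the OSD daemon while preserving its ID for the replacement drive
ceph orch osd rm status        # check the progress of the removal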
Procedure
Replacing the physical drive
See the documentation for the hardware node for details on replacing the physical
drive.
- If the drive is hot-swappable, replace the failed drive with a new one.
- If the drive is not hot-swappable and the node contains multiple OSDs, you might have to shut down the whole node and replace the physical drive. Consider preventing the cluster from backfilling while the node is down, as shown in the flag example after this list. For more information, see Stopping and starting rebalancing.
- When the drive appears under the /dev/ directory, make a note of the drive path; see the lsblk example after this list.
- If you want to add the OSD manually, find the OSD drive and format the disk, as shown in the sketch after this list.
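If you have to shut down a whole node, one common way to keep the cluster from backfilling in the meantime is to set the relevant OSD flags before the shutdown and clear them afterwards. This is a minimal sketch; whether you use noout alone or also norebalance depends on your maintenance policy:
ceph osd set noout           # do not mark stopped OSDs out while the node is down
ceph osd set norebalance     # pause rebalancing during the maintenance window
After the node and its OSDs are back up, clear the flags:
ceph osd unset norebalance
ceph osd unset noout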
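To note the path of the new drive after you insert it, list the block devices on the OSD node; /dev/sdb below is only an example path:
lsblk                       # the new, empty drive shows up without partitions or LVM children, for example as sdb
ls -l /dev/disk/by-path/    # optional: map /dev/sdb to its physical slot on the controller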
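To format the disk before you re-create the OSD manually, wipe any previous signatures from it. A minimal sketch, assuming the new drive appears as /dev/sdb and the host is named host01 (both hypothetical values):
ceph-volume lvm zap /dev/sdb              # run on the OSD node to wipe old partition, LVM, and filesystem signatures
ceph orch daemon add osd host01:/dev/sdb  # on a cephadm-managed cluster, re-create the OSD on the new drive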