Replacing an OSD drive

Ceph is designed for fault tolerance, which means that it can operate in a degraded state without losing data. Therefore, Ceph can continue to operate even if a data storage drive fails. In the context of a failed drive, the degraded state means that the copies of the data that were stored on the failed OSD are backfilled automatically from the remaining replicas to other OSDs in the cluster. If this occurs, replace the failed OSD drive and re-create the OSD manually.
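
For example, you can watch the degraded state and the recovery progress at any time. The following commands are a minimal check and assume that you are logged in to the Cephadm shell; the output depends on the state of your cluster.
[ceph: root@host01 /]# ceph health detail
[ceph: root@host01 /]# ceph -s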

Before you begin

Before you begin, make sure that you have the following prerequisites in place:
  • A running IBM Storage Ceph cluster.
  • Root-level access to the Ceph Monitor node.
  • At least one OSD is down.

About this task

When a drive fails, Ceph reports the OSD as down.

For example,
HEALTH_WARN 1/3 in osds are down
osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080
Note: Ceph can also mark an OSD as down as a consequence of networking or permission problems.
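Before you replace hardware, you can rule out a transient cause for the down state. The following checks are a sketch; they assume the Cephadm shell, root access to the OSD host, and the example daemon name osd.0.
[ceph: root@host01 /]# ceph orch ps
[root@host01 ~]# cephadm logs --name osd.0
[root@host01 ~]# dmesg | grep -i error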
Modern servers typically deploy with hot-swappable drives, so you can pull a failed drive and replace it with a new one without bringing down the node. The procedure includes the following steps:
  1. Removing an OSD from the Ceph cluster
  2. Replacing the physical drive
  3. Adding an OSD to the Ceph cluster

Removing an OSD from the Ceph cluster

Procedure

  1. Log in to the Cephadm shell.
    [root@host01 ~]# cephadm shell
  2. Determine which OSD is down.
    ceph osd tree | grep -i down
    For example,
    [ceph: root@host01 /]# ceph osd tree | grep -i down
    ID  CLASS  WEIGHT   TYPE NAME  STATUS  REWEIGHT  PRI-AFF
    0   hdd    0.00999  osd.0      down    1.00000   1.00000

  3. Mark the OSD as out so that the cluster rebalances and copies its data to other OSDs.
    ceph osd out osd.OSD_ID
    For example,
    [ceph: root@host01 /]# ceph osd out osd.0
    marked out osd.0.
    Note: If the OSD is down and Ceph does not receive any heartbeat packets from it, Ceph automatically marks the OSD as out after 600 seconds, as defined by the mon_osd_down_out_interval parameter (see the example that follows this procedure for how to check this setting). When this happens, other OSDs that hold copies of the failed OSD's data begin backfilling to ensure that the required number of copies exists within the cluster. While the cluster is backfilling, it is in a degraded state.
  4. Ensure that the failed OSD is backfilling.
    ceph -w | grep backfill
    For example,
    [ceph: root@host01 /]# ceph -w | grep backfill
    2022-05-02 04:48:03.403872 mon.0 [INF] pgmap v10293282: 431 pgs: 1 active+undersized+degraded+remapped+backfilling, 28 active+undersized+degraded, 49 active+undersized+degraded+remapped+wait_backfill, 59 stale+active+clean, 294 active+clean; 72347 MB data, 101302 MB used, 1624 GB / 1722 GB avail; 227 kB/s rd, 1358 B/s wr, 12 op/s; 10626/35917 objects degraded (29.585%); 6757/35917 objects misplaced (18.813%); 63500 kB/s, 15 objects/s recovering
    2022-05-02 04:48:04.414397 mon.0 [INF] pgmap v10293283: 431 pgs: 2 active+undersized+degraded+remapped+backfilling, 75 active+undersized+degraded+remapped+wait_backfill, 59 stale+active+clean, 295 active+clean; 72347 MB data, 101398 MB used, 1623 GB / 1722 GB avail; 969 kB/s rd, 6778 B/s wr, 32 op/s; 10626/35917 objects degraded (29.585%); 10580/35917 objects misplaced (29.457%); 125 MB/s, 31 objects/s recovering
    2022-05-02 04:48:00.380063 osd.1 [INF] 0.6f starting backfill to osd.0 from (0'0,0'0] MAX to 2521'166639
    2022-05-02 04:48:00.380139 osd.1 [INF] 0.48 starting backfill to osd.0 from (0'0,0'0] MAX to 2513'43079
    2022-05-02 04:48:00.380260 osd.1 [INF] 0.d starting backfill to osd.0 from (0'0,0'0] MAX to 2513'136847
    2022-05-02 04:48:00.380849 osd.1 [INF] 0.71 starting backfill to osd.0 from (0'0,0'0] MAX to 2331'28496
    2022-05-02 04:48:00.381027 osd.1 [INF] 0.51 starting backfill to osd.0 from (0'0,0'0] MAX to 2513'87544
    The placement group states change from active+clean to active with some degraded objects, and then return to active+clean when the migration completes.
  5. Stop the OSD.
    ceph orch daemon stop osd.OSD_ID
    For example,
    [ceph: root@host01 /]# ceph orch daemon stop osd.0
  6. Remove the OSD from the storage cluster.
    Note: With the --replace option, the OSD_ID is preserved so that the replacement OSD can reuse it.
    ceph orch osd rm OSD_ID --replace
    For example,
    [ceph: root@host01 /]# ceph orch osd rm 0 --replace
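
You can monitor the removal from the Cephadm shell. The following commands are a sketch; with the --replace option, the removed OSD typically remains in the CRUSH tree with a destroyed status until a replacement OSD is created on the new drive.
[ceph: root@host01 /]# ceph orch osd rm status
[ceph: root@host01 /]# ceph osd tree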
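
If you want to check or tune how long Ceph waits before it automatically marks a down OSD as out, as described in the note in step 3, you can query the mon_osd_down_out_interval setting. This is a sketch from the Cephadm shell; the value is in seconds, and 600 is the default mentioned in the note.
[ceph: root@host01 /]# ceph config get mon mon_osd_down_out_interval
[ceph: root@host01 /]# ceph config set mon mon_osd_down_out_interval 600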

Replacing the physical drive

Procedure

See the documentation for the hardware node for details on replacing the physical drive.
  • If the drive is hot-swappable, replace the failed drive with a new one.
  • If the drive is not hot-swappable and the node contains multiple OSDs, you might have to shut down the whole node and replace the physical drive. Consider preventing the cluster from backfilling while the node is down, as shown in the example after this list. For more information, see Stopping and starting rebalancing.
  • When the drive appears under the /dev/ directory, make a note of the drive path.
  • If you want to add the OSD manually, find the OSD drive and format the disk.
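
For example, if you must shut down the node, you can temporarily prevent the cluster from marking OSDs out and from rebalancing while the node is offline. This is a minimal sketch from the Cephadm shell; remember to unset the flags after the node is back online.
[ceph: root@host01 /]# ceph osd set noout
[ceph: root@host01 /]# ceph osd set norebalance
After the drive is replaced and the node is back online, unset the flags.
[ceph: root@host01 /]# ceph osd unset noout
[ceph: root@host01 /]# ceph osd unset norebalance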

Adding an OSD to the Ceph cluster

Procedure

  1. After the new drive is inserted, deploy the OSDs by using one of the following options. To list the devices that are available for deployment, see the device listing example after this procedure.
    • Let the Ceph Orchestrator deploy the OSDs automatically on all available devices. This is the default behavior when the unmanaged parameter is not set.
      ceph orch apply osd --all-available-devices
    • Apply the OSD specification for all available devices with the unmanaged parameter set to true, which disables the automatic creation of OSDs on available devices. You can then deploy OSDs on specific devices, as shown in the next option.
      ceph orch apply osd --all-available-devices --unmanaged=true
    • Deploy the OSDs on specific devices and hosts.
      ceph orch daemon add osd HOST:PATH_TO_DRIVE
      For example,
      [ceph: root@host01 /]# ceph orch daemon add osd host02:/dev/sdb
  2. Ensure that the CRUSH hierarchy is accurate.
    ceph osd tree
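
Before you deploy, you can list the storage devices that the orchestrator considers available, as referenced in step 1. This is a sketch from the Cephadm shell; host and device names depend on your cluster.
[ceph: root@host01 /]# ceph orch device ls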
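
To confirm that the replacement OSD is up and in and that the data migration back to it completes, you can also check the overall cluster status and the OSD utilization. This is a minimal sketch from the Cephadm shell.
[ceph: root@host01 /]# ceph -s
[ceph: root@host01 /]# ceph osd df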