IBM Support

How to replace the failed Ceph OSD and use the same backend disk for OSD creation in ODF/FDF

How To


Summary

This article includes the steps to clean up a disk that was previously used by a Ceph OSD and create a new OSD on it.

Objective

Steps to remove an unwanted or failed Ceph OSD in Red Hat OpenShift Data Foundation or IBM Fusion Data Foundation and use the same backend disk to create a new OSD

Environment

Red Hat OpenShift Data Foundation 4.x
IBM Fusion Data Foundation 4.x

Steps

Ceph Health Check (Guidance and cautions to avoid disrupting access to your organization's data)
Warnings: 
  1. Make sure that the cluster is healthy and all PGs are in an active+clean state. Removing an OSD when the Ceph cluster is not in a healthy state and PGs are not active can result in Data Loss.
  2. If the goal is to replace 2 or more OSDs, remove one OSD at a time. Check the Ceph health and ensure that it is HEALTH_OK between two OSD removals. Failure to do so can result in Data Loss.
  3. Removing 2 or more OSDs back to back will result in Data Loss.
  4. Removing 2 or more OSDs in a single command line will result in Data Loss.
See the article How to configure the toolbox pod to run the Ceph commands
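For reference, once the toolbox is enabled as described in that article, it is typically accessed in a way similar to the following (a minimal sketch; the label selector assumes the standard rook-ceph-tools deployment in the openshift-storage namespace):

# TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name | head -n 1)
# oc rsh -n openshift-storage ${TOOLS_POD}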

Then, run the ceph -s command from the toolbox pod to check the cluster health:

# ceph -s
  cluster:
    id:     0782432c-e1aa-11ec-aed8-fa163ed0b37b
    health: HEALTH_OK                                                    <--- Overall health must be HEALTH_OK

  services:
    mon: 5 daemons, quorum mgmt-0.icemanny01.lab.pnq2.cee.redhat.com,osds-2,osds-1,osds-0,mons-1 (age 2h)
    mgr: mgmt-0.icemanny01.lab.pnq2.cee.redhat.com.fycvbc(active, since 4w), standbys: osds-2.gmcizc
    osd: 3 osds: 3 up (since 2h), 3 in (since 2h)

  data:
    pools:   1 pools, 1 pgs
    objects: 3 objects, 0 B
    usage:   58 MiB used, 30 GiB / 30 GiB avail
    pgs:     1 active+clean                                              <--- Must be only Active+Clean
  • If the overall health of the system is anything other than HEALTH_OK, Do Not Proceed
  • If the state of the Placement Groups (PGs) reports anything other than Active+Clean, Do Not Proceed
    Examples of PG states we do NOT want to see: Creating, Peering, Degraded, Recovering, Migrating, Backfilling, Remapped.
  • If any OSDs are full or near full, Do Not Proceed
# ceph health detail 
HEALTH_ERR 1 full osd(s); 6 near full osd(s)
osd.60 is full at 95%
osd.0 is near full at 86%
osd.4 is near full at 91%
osd.8 is near full at 92%
  • If one is certain the Ceph cluster is healthy and not too full, proceed with the OSD removal steps (a minimal pre-flight check is sketched after this list).
  • Take care to read this entire Steps section and follow the steps for your organization's type and version of ODF/FDF.
  • Allow at least one day for the Ceph cluster to rebalance the data.
  • Do not proceed with removing another OSD unless the Ceph cluster is healthy.
  • Again, removing OSDs back to back or removing 2 or more OSDs in one command line will result in Data Loss.
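The following is a minimal pre-flight sketch run from the toolbox pod; it only surfaces the checks described above and does not replace reviewing the output yourself:

# ceph health detail              <--- Must report HEALTH_OK
# ceph pg stat                    <--- Every PG must be active+clean
# ceph osd df                     <--- No OSD should be full or near full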
Steps to Remove the OSD: 
  1. Scale down the operators as below:
     
       For Fusion Data Foundation, scale down the isf-cns-operator-controller-manager deployment in the ibm-spectrum-fusion-ns namespace.   [This step can be skipped for ODF clusters.]
# oc scale deployment isf-cns-operator-controller-manager --replicas=0 -n ibm-spectrum-fusion-ns
     
       For ODF, scale down the ocs-operator and rook-ceph-operator deployments to pause the automatic recreation of the OSD.
# oc scale deployment ocs-operator rook-ceph-operator --replicas=0 -n openshift-storage
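       To confirm that the operators are paused before continuing, the deployments can be checked; READY should report 0/0 (for Fusion Data Foundation, also check the isf-cns-operator-controller-manager deployment in ibm-spectrum-fusion-ns):
# oc get deployment ocs-operator rook-ceph-operator -n openshift-storage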
  2. Identify the backend disk for the OSD that needs to be replaced
         i. Identify the pvc map with the OSD pod
# oc get -n openshift-storage -o yaml deployment rook-ceph-osd-{osd-id} | grep "ceph.rook.io/pvc:"

Example output:
   ceph.rook.io/pvc: ocs-deviceset-lso-volumeset-0-data-0w75fl
         ii.  Identify the PV associated with the PVC.
# oc get -n openshift-storage pvc ocs-deviceset-lso-volumeset-0-data-0w75fl

Example output:
NAME                                            STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS   AGE
ocs-deviceset-lso-volumeset-0-data-0w75fl       Bound    local-pv-e37b8c4a   512Gi      RWO            lso-volumeset  2d
       iii. Identify the failed device ID.
# oc get pv local-pv-e37b8c4a -oyaml | grep path
        f: path: {}
   path: /mnt/local-storage/lso-volumeset/scsi-36000c2914f732fdde713794b8fb5714a
      iv. Go inside the failed OSD node and identify the disk name for the ID obtained from the previous command output.
# oc debug node/<node_with_failed_osd>
# chroot /host

# ls -l /mnt/local-storage/lso-volumeset/
total 0
lrwxrwxrwx. 1 root root 43 Jan  3 10:56 scsi-36000c2914f732fdde713794b8fb5714a -> /dev/disk/by-id/scsi-36000c2914f732fdde713794b8fb5714a

# ls -l /dev/disk/by-id/
total 0
...
lrwxrwxrwx. 1 root root 9 Jan  3 10:48 scsi-36000c2914f732fdde713794b8fb5714a -> ../../sdb  <<<<<
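For convenience, the lookups in this step can be chained into a short illustrative sketch; the variable names are placeholders and the grep patterns assume the same label and path layout shown in the outputs above:

osd_id=0    # ID of the OSD being replaced (illustrative)
pvc=$(oc get -n openshift-storage deployment rook-ceph-osd-${osd_id} -o yaml | grep "ceph.rook.io/pvc:" | head -n 1 | awk '{print $2}')
pv=$(oc get -n openshift-storage pvc ${pvc} -o jsonpath='{.spec.volumeName}')
oc get pv ${pv} -o yaml | grep "path:"    # prints the /mnt/local-storage/... path of the backend disk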
  3. Scale down the OSD deployment.
# osd_id_to_remove=0
# oc scale -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove} --replicas=0

Example output:
deployment.extensions/rook-ceph-osd-0 scaled
Here, osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-0.

  4. Verify that the rook-ceph-osd pod is terminated.

# oc get -n openshift-storage pods -l ceph-osd-id=${osd_id_to_remove}

Example output:
No resources found in openshift-storage namespace.
  5. Remove the old OSD from the cluster so that you can add a new one.

       i. Delete any old ocs-osd-removal jobs.

# oc delete -n openshift-storage job ocs-osd-removal-job

Example output:
job.batch "ocs-osd-removal-job" deleted
       ii. Navigate to the openshift-storage project.
# oc project openshift-storage
       iii. Remove the OSD from the cluster.
# oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} FORCE_OSD_REMOVAL=false | oc create -n openshift-storage -f -

Example output: 
job.batch/ocs-osd-removal-job created

NOTE: In OCP 4.10, you might see the following error from the "ocs-osd-removal-job" pod, which runs the commands that delete the OSD device on the Ceph cluster:

# oc logs -n openshift-storage ocs-osd-removal-job-cj522 
2022-04-26 13:35:54.741879 I | rookcmd: starting Rook v4.10.0-0.ed62be54b2371ca23ae9b81137a2d301d032f164 with arguments '/usr/local/bin/rook ceph osd remove --osd-ids=0 --force-osd-removal false'

2022-04-26 13:35:56.047744 I | cephosd: validating status of osd.0
2022-04-26 13:35:56.047762 I | cephosd: osd.0 is marked 'DOWN'
2022-04-26 13:35:56.047772 D | exec: Running command: ceph osd safe-to-destroy 0 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-04-26 13:35:56.313675 W | cephosd: osd.0 is NOT be ok to destroy, retrying in 1m until success

The solution is to run the OSD removal job with FORCE_OSD_REMOVAL=true, as per Bug 2059027 - Device Replacement with FORCE_OSD_REMOVAL, OSD moved to the "destroyed" state. Run this command only when all PGs are in an active+<something> state; if they are not, the PGs must complete backfilling or be investigated to ensure they are active.
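If the job stays in that retry loop and the cluster state has been verified as described, delete the stuck job and reprocess the removal template with the force flag; this mirrors the command in step 5.iii with FORCE_OSD_REMOVAL set to true:

# oc delete -n openshift-storage job ocs-osd-removal-job
# oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} FORCE_OSD_REMOVAL=true | oc create -n openshift-storage -f -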

  6. Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod.

       A status of Completed confirms that the OSD removal job succeeded.

# oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
  7. Ensure that the OSD removal is completed.
# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'completed removal'

Example output:
2022-05-10 06:50:04.501511 I | cephosd: completed removal of OSD 0
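Optionally, the removal can also be cross-checked from the toolbox pod; the removed OSD should no longer be listed (or, if it was force-removed, it may appear as destroyed until a new OSD reuses the ID):

# ceph osd tree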

  8. If encryption was enabled at the time of installation, remove the dm-crypt managed device-mapper mapping from the OSD devices that were removed, on the respective OpenShift Data Foundation nodes. If encryption is not enabled, skip step 8 and jump to step 9.

         i. Get the Persistent Volume Claim (PVC) name of the replaced OSD from the logs of ocs-osd-removal-job pod.

# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'pvc|deviceset'

Example output:
2021-05-12 14:31:34.666000 I | cephosd: removing the OSD PVC "ocs-deviceset-xxx
         ii. Go inside the failed OSD node, and perform the following steps:
# oc debug node/<node_with_failed_osd>
# chroot /host
             Find the relevant device name based on the PVC names identified in the previous step.
# dmsetup ls | grep <pvc name>

Example output:
ocs-deviceset-xxx-xxx-xxx-xxx-block-dmcrypt (253:0)
         iii. Remove the mapped device. (<ocs-deviceset-name> is the name of the relevant device based on the PVC name identified in the previous step.)
# cryptsetup luksClose --debug --verbose <ocs-deviceset-name>
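             Once the command returns, rerunning the lookup from the previous step should produce no output, which confirms the mapping was removed:
# dmsetup ls | grep <ocs-deviceset-name>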
  9. Go inside the failed OSD node. 
# oc debug node/<node_with_failed_osd>
# chroot /host
# lsblk
# sgdisk -Z /dev/sdb    
# wipefs -af /dev/sdb             <Use the disk name for the failed OSD identified in step 2.iv>
        For ODF 4.18.x and later versions, the disk must be cleaned up with additional commands to remove any metadata still present on the disk.
  • The disk can be cleaned with the following steps:
DISK="/dev/sdX"

# Zap the disk to a fresh, usable state (zap-all is important, b/c MBR has to be clean)
sgdisk --zap-all $DISK

# Wipe portions of the disk to remove more LVM metadata that may be present
for gb in 0 1 10 100 1000; do dd if=/dev/zero of="$DISK" bs=1K count=200 oflag=direct,dsync seek=$((gb * 1024**2)); done

# SSDs may be better cleaned with blkdiscard instead of dd
# This might not be supported on all devices
blkdiscard $DISK

# Inform the OS of partition table changes
partprobe $DISK
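As an optional sanity check (illustrative; substitute the disk identified in step 2.iv), the device should report no remaining signatures or partitions after cleaning:

# wipefs /dev/sdX                 <Without -a, wipefs only lists signatures; the output should be empty>
# lsblk /dev/sdX                  <Should show no child partitions or device-mapper devices>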
  10. Remove the PV that is in the Released state (refer to step 2.ii).
  
# oc delete pv local-pv-e37b8c4a

Example output:
persistentvolume "local-pv-e37b8c4a" deleted
  11.  Delete the ocs-osd-removal jobs.
# oc delete -n openshift-storage job ocs-osd-removal-job
Verification steps: 
  1. Verify that a new PV is created and is in the Available state.
# oc get pv | grep Available
  2. Scale up the ocs-operator and rook-ceph-operator deployments to create the new OSD.
# oc scale deployment ocs-operator rook-ceph-operator --replicas=1 -n openshift-storage
       For Fusion Data Foundation, scale up the deployment in the ibm-spectrum-fusion-ns namespace.
   
# oc scale deployment isf-cns-operator-controller-manager --replicas=1 -n ibm-spectrum-fusion-ns
  3. Verify that a new PVC is created.
# oc get -n openshift-storage pvc 
  4. Verify that a new OSD is running.
# oc get -n openshift-storage pods -l app=rook-ceph-osd
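Optionally, the new OSD can also be confirmed from the toolbox pod; the OSD should be reported as up and in, and the total OSD count should match the count from before the replacement:

# ceph osd tree
# ceph -s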

Additional Information

Replacing multiple OSDs:
After replacing an OSD, allow the cluster to rebalance the data onto the newly added OSD. Once the PGs are in an active+clean state and the cluster health is HEALTH_OK, proceed with the next OSD using the same steps, substituting the OSD ID.
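Rebalancing progress can be monitored from the toolbox pod, for example:

# ceph -s                         <--- Wait until all pgs report active+clean and health is HEALTH_OK
# ceph osd df                     <--- Confirm data is being redistributed onto the new OSD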

Document Location

Worldwide


Document Information

Modified date:
19 August 2025

UID

ibm17105098