Removing failed or unwanted Ceph OSDs provisioned using local storage devices

About this task

You can remove failed or unwanted Ceph OSDs that were provisioned using local storage devices by following the steps in this procedure.
Important: Scaling down a cluster is supported only with the help of the Red Hat support team.
Warning:
  • Removing an OSD when the Ceph component is not in a healthy state can result in data loss.
  • Removing two or more OSDs at the same time results in data loss.

Procedure

  1. Forcibly mark the OSD down by scaling the replicas on the OSD deployment to 0. You can skip this step if the OSD is already down due to failure.
    oc scale -n openshift-storage deployment rook-ceph-osd-<osd-id> --replicas=0
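    For example, if the failed OSD ID is 0 (a hypothetical value used only for illustration), scale its deployment down and confirm that the corresponding OSD pod terminates:
      oc scale -n openshift-storage deployment rook-ceph-osd-0 --replicas=0
      oc get pods -n openshift-storage | grep rook-ceph-osd-0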
  2. Remove the failed OSD from the cluster.
    # failed_osd_id=<osd_id>
    
    # oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${failed_osd_id} | oc create -f -
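    A worked example of the two commands above, again assuming a hypothetical failed OSD ID of 0:
      # failed_osd_id=0
      # oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${failed_osd_id} | oc create -f -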
  3. Verify that the OSD is removed successfully by checking the logs.
    # oc logs -n openshift-storage ocs-osd-removal-${failed_osd_id}-<pod-suffix>
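    For example, with a failed OSD ID of 0 and a hypothetical pod suffix of x2b8n (list the pods in the openshift-storage namespace to find the actual name of the removal pod):
      # oc logs -n openshift-storage ocs-osd-removal-0-x2b8n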
  4. Optional: If you get the error cephosd:osd.0 is NOT ok to destroy from the ocs-osd-removal-job pod in OpenShift Container Platform, see Troubleshooting the error cephosd:osd.0 is NOT ok to destroy while removing failed or unwanted Ceph OSDs.
  5. Delete the persistent volume claim (PVC) resources associated with the failed OSD. A worked example with sample names follows these substeps.
    1. Get the PVC associated with the failed OSD.
      oc get -n openshift-storage -o yaml deployment rook-ceph-osd-<osd-id> | grep ceph.rook.io/pvc
    2. Get the persistent volume (PV) associated with the PVC.
      oc get -n openshift-storage pvc <pvc-name>
    3. Get the failed device name.
      oc get pv <pv-name-from-above-command> -oyaml | grep path
    4. Get the osd-prepare pod associated with the failed OSD.
      oc describe -n openshift-storage pvc <pvc-name-from-step-a> | grep Mounted
    5. Delete the osd-prepare pod before removing the associated PVC.
      oc delete -n openshift-storage pod <osd-prepare-pod-from-above-command>
    6. Delete the PVC associated with the failed OSD.
      oc delete -n openshift-storage pvc <pvc-name-from-step-a>
  6. Remove the failed device entry from the LocalVolume custom resource (CR).
    1. Log in to node with the failed device.
      oc debug node/<node_with_failed_osd>
    2. Record the /dev/disk/by-id/<id> for the failed device name.
      ls -alh /mnt/local-storage/localblock/
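    Depending on how the debug pod mounts the host filesystem, you may first need to switch into the host root before the /mnt/local-storage path is visible; a minimal sketch, assuming the standard oc debug host mount:
      chroot /host
      ls -alh /mnt/local-storage/localblock/
    To complete this step, remove the recorded /dev/disk/by-id/<id> entry from the devicePaths list in the LocalVolume CR and save the change. A sketch of that edit, assuming the CR is named local-block and lives in the openshift-local-storage namespace (both are assumptions; substitute the names used in your cluster):
      oc edit -n openshift-local-storage localvolume local-block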
  7. Optional: If the Local Storage Operator was used to provision the OSD, log in to the node that hosted the failed OSD and remove the device symlink.
    oc debug node/<node_with_failed_osd>
    1. Get the OSD symlink for the failed device name.
      ls -alh /mnt/local-storage/localblock
    2. Remove the symlink.
      rm /mnt/local-storage/localblock/<failed-device-name>
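    For example, if the ls output shows that the failed device symlink is named sdb (a hypothetical device name), remove it and confirm that it is gone:
      rm /mnt/local-storage/localblock/sdb
      ls -alh /mnt/local-storage/localblock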
  8. Delete the PV associated with the OSD.
    # oc delete pv <pv-name>
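    For example, using the hypothetical PV name from the earlier worked example, delete the PV and confirm that it is no longer listed:
      # oc delete pv local-pv-8176b2c1
      # oc get pv | grep local-pv-8176b2c1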

What to do next

  • To verify that the OSD was deleted successfully, run:
    # oc get pod -n openshift-storage ocs-osd-removal-${failed_osd_id}-<pod-suffix>
    This command must return the pod status as Completed.
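    For example, with a failed OSD ID of 0 and a hypothetical pod suffix, the output could look similar to the following; only the Completed status is significant:
      # oc get pod -n openshift-storage ocs-osd-removal-0-x2b8n
      NAME                      READY   STATUS      RESTARTS   AGE
      ocs-osd-removal-0-x2b8n   0/1     Completed   0          5m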