How To
Summary
This article describes the steps to identify and delete stale or orphan RBD images. Stale images can consume unwanted space in the ODF cluster.
Environment
IBM Storage Fusion Data Foundation (FDF) 4.x
Red Hat OpenShift Data Foundation (ODF) 4.x
Steps
Important
- Please investigate how the cluster reached the current state before attempting to delete any orphaned images. An IBM support ticket must be logged if further investigation is required.
- Please perform the orphan image removal with extreme caution. Wrongly deleting an RBD image that is in use will result in unrecoverable data loss.
- Please open an IBM Support Case to get technical guidance before you perform this on a production cluster.
Pre-requisite: Configure the ceph toolbox pod
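If the toolbox is not already running, one common way to enable it in ODF is to patch the OCSInitialization resource and then rsh into the resulting pod (verify the procedure against the documentation for your ODF/FDF version):
$ oc patch ocsinitialization ocsinit -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/enableCephTools", "value": true }]'
$ oc rsh -n openshift-storage $(oc get pod -n openshift-storage -l app=rook-ceph-tools -o name | head -n 1)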
- Check the number of RBD images available in a specific pool and storageclass combination, and keep a note of the count.
- For this article, the default ocs-storagecluster-cephblockpool pool is used. List the images in this pool and check the total count:
[1] $ rbd -p ocs-storagecluster-cephblockpool ls
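For example, to capture just the count for comparison with the PV list in the next step, the same listing can be piped through wc -l:
$ rbd -p ocs-storagecluster-cephblockpool ls | wc -l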
- List the number of PVs in OpenShift. If you have more than one storageclass for RBD volumes, check the available volumes for all RBD storageclasses:
$ oc get pv -o 'custom-columns=IMAGENAME:.spec.csi.volumeAttributes.imageName' | grep -v none
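Similarly, a count of the RBD-backed PVs can be taken for comparison (the --no-headers flag drops the column header from the count):
$ oc get pv -o 'custom-columns=IMAGENAME:.spec.csi.volumeAttributes.imageName' --no-headers | grep -v none | wc -l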
- If there is a difference between the PV count and the RBD image count, check for orphan RBD images.
- Pick the images listed in [1] one by one and search for the corresponding PV. If no PV references the RBD image, that RBD image is an orphan:
$ oc get pv -o yaml | grep <image-name>
Note: To understand how to map a PV to its respective backend RBD image, check the Red Hat article: access.redhat.com/articles/4718301
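For a single PV, the backing image name can also be read directly from the CSI volume attributes, for example:
$ oc get pv <pv-name> -o jsonpath='{.spec.csi.volumeAttributes.imageName}'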
- Once the orphan RBD images are identified, they can be deleted, subject to the exceptions below:
Exceptions:
1. Once you have the list of RBD orphan volumes, confirm that the volume is not in use from the Ceph side. There should be no watchers for the volume.
$ rbd -p ocs-storagecluster-cephblockpool status csi-vol-e097a3e7-1098-11ec-ae7f-0a580a83002e
Watchers:
watcher=10.129.2.1:0/1849604752 client.4740 cookie=18446462598732840961 <--- There are active watchers for this volume and may be in use by an application.
- If there are active watchers, trace the IP and the client application. Often these watcher entries are stale, but they will prevent the RBD image from being deleted with the error "rbd: error: image still has watchers". Refer to the KCS article "Ceph - rbd image cannot be deleted with "rbd: error: image still has watchers"" to learn how to unmap the image and clear the watcher list before proceeding with the deletion.
- In an ODF environment, run the 'showmapped' command from the csi-rbdplugin pod on the node where the PV is attached:
Example:
$ oc rsh -n openshift-storage -c csi-rbdplugin pod/csi-rbdplugin-<pod-name>
$ rbd showmapped
id pool namespace image snap device
0 ocs-storagecluster-cephblockpool csi-vol-09477143-85e0-4c2f-bed1-f236d39aa9f3 - /dev/rbd0
- If you cannot figure out on which worker node the image is mapped and mounted, you may have to run this from all of the csi-rbdplugin pods.
You can use a bash loop like the following:
$ for rbdpluginpods in $(oc get pod -n openshift-storage -l app=csi-rbdplugin -o name); do echo $rbdpluginpods; oc exec -n openshift-storage $rbdpluginpods -c csi-rbdplugin -- rbd showmapped; done
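Because csi-rbdplugin runs as a daemonset with one pod per node, listing the pods with -o wide shows which worker node each pod (and therefore each mapping) belongs to:
$ oc get pod -n openshift-storage -l app=csi-rbdplugin -o wide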
- This will list all the mapped RBD images from all the worker nodes. Identify in which csi-rbdplugin pod the orphaned RBD image is mapped, then get inside the corresponding csi-rbdplugin pod and run:
$ oc rsh -n openshift-storage -c csi-rbdplugin pod/csi-rbdplugin-<pod-name>
$ rbd unmap <imagename> -p ocs-storagecluster-cephblockpool
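After unmapping, re-run the status check from the toolbox pod and confirm that the watcher list is empty before attempting any deletion:
$ rbd -p ocs-storagecluster-cephblockpool status <image-name>
Watchers: none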
2. Check if that RBD image has any snapshots. If yes, kindly reach out to the IBM Support team to delete the orphan images
[2] $ rbd snap ls ocs-storagecluster-cephblockpool/<image-name> --all
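Once both checks pass for an image (no active watchers and no snapshots), the orphan image can be removed from the toolbox pod. A minimal sketch, assuming the image name has been double-checked against the PV list as described above:
$ rbd rm ocs-storagecluster-cephblockpool/<image-name>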
Root Cause
- The volume delete flow in ODF should prevent any orphaned volumes, but in some corner cases we have seen orphaned RBD volumes in the cluster. Some scenarios include:
- Manual deletion of the PV before the PVC.
- The PV reclaim policy is set to Retain instead of Delete. This prevents the underlying RBD image from being deleted.
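To review the reclaim policy of each RBD-backed PV alongside its image name, a quick query over standard PV fields is:
$ oc get pv -o custom-columns=NAME:.metadata.name,RECLAIMPOLICY:.spec.persistentVolumeReclaimPolicy,STORAGECLASS:.spec.storageClassName,IMAGE:.spec.csi.volumeAttributes.imageName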
Diagnostic Steps
- Follow the first few steps in the Steps section above to diagnose whether any orphaned RBD images exist in the ODF cluster.
- Another approach is to script the comparison of the image names that back PVs against the images present in the pool, and diff the two lists. For example, assuming the toolbox pod carries the default app=rook-ceph-tools label:
$ oc get pv -o 'custom-columns=IMAGENAME:.spec.csi.volumeAttributes.imageName' --no-headers | grep -v none | sort > /tmp/pv-images.txt
$ oc exec -n openshift-storage $(oc get pod -n openshift-storage -l app=rook-ceph-tools -o name) -- rbd -p ocs-storagecluster-cephblockpool ls | sort > /tmp/rbd-images.txt
$ comm -13 /tmp/pv-images.txt /tmp/rbd-images.txt
- The last command prints the images that exist in the pool but are not referenced by any PV.
Document Location
Worldwide
[{"Type":"MASTER","Line of Business":{"code":"LOB66","label":"Technology Lifecycle Services"},"Business Unit":{"code":"BU070","label":"IBM Infrastructure"},"Product":{"code":"SSSEWFV","label":"Storage Fusion Data Foundation"},"ARM Category":[{"code":"a8m3p000000UoIUAA0","label":"Documentation"}],"ARM Case Number":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":""}]
Was this topic helpful?
Document Information
Modified date:
01 February 2026
UID
ibm17235074