Recovering workload pods stuck in ContainerCreating state post zone recovery
Problem
After a complete zone failure and recovery, workload pods are sometimes stuck in the ContainerCreating state with any of the following errors:
- MountDevice failed to create newCsiDriverClient: driver name openshift-storage.rbd.csi.ceph.com not found in the list of registered CSI drivers
- MountDevice failed for volume <volume_name> : rpc error: code = Aborted desc = an operation with the given Volume ID <volume_id> already exists
- MountVolume.SetUp failed for volume <volume_name> : rpc error: code = Internal desc = staging path <path> for volume <volume_id> is not a mountpoint
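To tell which of these errors a stuck pod is reporting, the event message can be matched against the three messages listed above. The following is a minimal sketch, not part of the product: the names `ERROR_PATTERNS` and `classify_event` are hypothetical, and the placeholder parts of each message (volume name, volume ID, path) are matched loosely.

```python
import re

# Patterns derived from the three error messages listed above.
# The keys are hypothetical labels chosen for this sketch.
ERROR_PATTERNS = {
    "driver_not_registered": re.compile(
        r"MountDevice failed to create newCsiDriverClient: driver name \S+ "
        r"not found in the list of registered CSI drivers"
    ),
    "operation_already_exists": re.compile(
        r"MountDevice failed for volume .*code = Aborted desc = an operation "
        r"with the given Volume ID \S+ already exists"
    ),
    "staging_path_not_mountpoint": re.compile(
        r"MountVolume\.SetUp failed for volume .*code = Internal desc = "
        r"staging path \S+ for volume \S+ is not a mountpoint"
    ),
}


def classify_event(message):
    """Return the label of the first matching known error, or None."""
    for label, pattern in ERROR_PATTERNS.items():
        if pattern.search(message):
            return label
    return None
```

A message that matches none of the patterns returns `None`, which suggests the pod is stuck for a reason this article does not cover.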
Resolution
If the workload pods are stuck with any of the errors listed above, perform the following workarounds:
- For ceph-fs workloads stuck in ContainerCreating:
  - Restart the nodes where the stuck pods are scheduled
  - Delete these stuck pods
  - Verify that the new pods are running
- For ceph-rbd workloads stuck in ContainerCreating that do not self-recover after some time:
  - Restart the csi-rbd plugin pods on the nodes where the stuck pods are scheduled
  - Verify that the new pods are running
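Both workarounds start from the nodes where the stuck pods are scheduled. The sketch below shows one way to find those nodes from the parsed output of `oc get pods -o json`, and to build (without running) the commands that would delete the csi-rbd plugin pods on them. The function names are hypothetical, and the `openshift-storage` namespace and `app=csi-rbdplugin` label are assumptions that may not match every deployment; check the actual plugin pod labels in your cluster first.

```python
from collections import defaultdict


def stuck_pods_by_node(pod_list):
    """Group pods waiting in ContainerCreating by their scheduled node.

    pod_list is the dict parsed from `oc get pods -o json`.
    """
    by_node = defaultdict(list)
    for pod in pod_list.get("items", []):
        statuses = pod.get("status", {}).get("containerStatuses", [])
        reasons = [
            s.get("state", {}).get("waiting", {}).get("reason")
            for s in statuses
        ]
        node = pod.get("spec", {}).get("nodeName")
        if "ContainerCreating" in reasons and node:
            by_node[node].append(pod["metadata"]["name"])
    return dict(by_node)


def rbd_plugin_restart_commands(
    nodes,
    namespace="openshift-storage",   # assumed namespace
    selector="app=csi-rbdplugin",    # assumed label, verify before use
):
    """Build one `oc delete pod` argument list per node; nothing is executed."""
    return [
        [
            "oc", "delete", "pod",
            "-n", namespace,
            "-l", selector,
            "--field-selector", f"spec.nodeName={node}",
        ]
        for node in sorted(set(nodes))
    ]
```

Deleting a plugin pod is treated here as a restart on the assumption that the pods are recreated automatically by their controller. The commands are returned as lists so they can be reviewed before being passed to something like `subprocess.run`.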