Recovering workload pods stuck in ContainerCreating state post zone recovery

Problem

After a complete zone failure and recovery, workload pods are sometimes stuck in the ContainerCreating state with any of the following errors:
  • MountDevice failed to create newCsiDriverClient: driver name openshift-storage.rbd.csi.ceph.com not found in the list of registered CSI drivers
  • MountDevice failed for volume <volume_name> : rpc error: code = Aborted desc = an operation with the given Volume ID <volume_id> already exists
  • MountVolume.SetUp failed for volume <volume_name> : rpc error: code = Internal desc = staging path <path> for volume <volume_id> is not a mountpoint
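
When triaging many stuck pods, it can help to match each pod's event message against the three errors above. The following is a minimal sketch in Python, not part of any product tooling: the label names and the `classify_mount_error` helper are hypothetical, and the patterns are taken only from the message text listed above.

```python
import re

# Patterns derived from the three error messages listed in this article.
# The dictionary keys are hypothetical labels chosen for illustration.
KNOWN_ERRORS = {
    "driver_not_registered": re.compile(
        r"not found in the list of registered CSI drivers"
    ),
    "operation_already_exists": re.compile(
        r"an operation with the given Volume ID .+ already exists"
    ),
    "staging_path_not_mountpoint": re.compile(
        r"staging path .+ is not a mountpoint"
    ),
}


def classify_mount_error(message):
    """Return the label of the first known error found in message, else None."""
    for label, pattern in KNOWN_ERRORS.items():
        if pattern.search(message):
            return label
    return None
```

A message that matches none of the patterns returns None, which suggests the pod is stuck for a reason this article does not cover.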

Resolution

If the workload pods are stuck with any of the errors listed above, perform the following workarounds:
  • For ceph-fs workloads stuck in ContainerCreating:
    1. Restart the nodes where the stuck pods are scheduled
    2. Delete these stuck pods
    3. Verify that the new pods are running
  • For ceph-rbd workloads stuck in ContainerCreating that do not recover on their own after some time:
    1. Restart the csi-rbd plugin pods on the nodes where the stuck pods are scheduled
    2. Verify that the new pods are running
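
The ceph-rbd workaround can be sketched as a pair of `oc` command lines. This Python sketch only builds and prints the commands and rests on several assumptions that you should verify on your cluster: that the plugin pods run in the `openshift-storage` namespace, that they carry the label `app=csi-rbdplugin`, and that deleting them causes their controller to recreate them. The node name `worker-1` and the namespace `my-app` are hypothetical placeholders.

```python
import shlex

# Assumed defaults; confirm the namespace and label on your own cluster.
STORAGE_NAMESPACE = "openshift-storage"
RBD_PLUGIN_SELECTOR = "app=csi-rbdplugin"


def list_pods_on_node_cmd(node, namespace):
    """Build a command that lists pods in a namespace scheduled on one node."""
    return ["oc", "get", "pods", "-n", namespace,
            "--field-selector", f"spec.nodeName={node}", "-o", "wide"]


def restart_rbd_plugin_cmd(node, namespace=STORAGE_NAMESPACE,
                           selector=RBD_PLUGIN_SELECTOR):
    """Build a command that deletes the plugin pods on one node.

    Assumes the pods are managed by a controller that recreates them.
    """
    return ["oc", "delete", "pods", "-n", namespace, "-l", selector,
            "--field-selector", f"spec.nodeName={node}"]


if __name__ == "__main__":
    node = "worker-1"  # hypothetical node name
    # Print the commands for review instead of running them.
    for cmd in (restart_rbd_plugin_cmd(node),
                list_pods_on_node_cmd(node, "my-app")):
        print(shlex.join(cmd))
```

Printing rather than executing keeps the sketch safe to run anywhere; review the output and run the commands manually once the namespace and label are confirmed.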