CephClusterErrorState
This alert reflects that the storage cluster has been in ERROR state for an unacceptable amount of time, and this impacts the storage availability. Check for other alerts that would have triggered prior to this one and troubleshoot those alerts first.
Impact: Critical
Diagnosis
- pod status: pending
- Check for resource issues, pending Persistent Volume Claims (PVCs), node assignment, and kubelet
problems, using the following commands:
- oc project openshift-storage
- oc get pod | grep rook-ceph
- Examine the output for a rook-ceph pod that is in the pending state, not running, or not ready. Set MYPOD as the variable for the pod that is identified as the problem pod, specifying its name for <pod_name>:
- MYPOD=<pod_name>
- Look for resource limitations or pending PVCs. Otherwise, check the node assignment, using the oc get pod/${MYPOD} -o wide command.
- pod status: NOT pending, running, but NOT ready
- Check the readiness probe, using the oc describe pod/${MYPOD} command.
- pod status: NOT pending, but NOT running
- Check for application or image issues, using the oc logs pod/${MYPOD} command.
- Important: If a node was assigned, check the kubelet on the node. If the basic health of the running pods, node affinity, and resource availability on the nodes are verified, run the Ceph tools to get the status of the storage components.
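The three pod status branches above can be sketched as a small shell helper that reads pod listings and prints the suggested next step for each problem pod. This is a minimal sketch, not part of the product: classify_pods is a hypothetical name, it assumes the default column layout of oc get pod --no-headers (NAME, READY, STATUS, RESTARTS, AGE), and it assumes that pods in the Completed state are finished jobs that need no action.

```shell
#!/bin/sh
# classify_pods: hypothetical helper that maps each rook-ceph pod line read
# from stdin to one of the diagnosis branches in this section.
# Assumed input columns: NAME READY STATUS RESTARTS AGE
classify_pods() {
  awk '
    $1 !~ /rook-ceph/ { next }          # ignore pods that are not rook-ceph
    {
      split($2, r, "/")                 # READY column has the form ready/total
      if ($3 == "Pending")
        print $1 ": pending -> check resources, PVCs, node assignment, kubelet"
      else if ($3 == "Running" && r[1] != r[2])
        print $1 ": running but not ready -> oc describe pod/" $1
      else if ($3 != "Running" && $3 != "Completed")
        print $1 ": not running -> oc logs pod/" $1
      # Running and fully ready pods, and Completed pods, print nothing.
    }'
}

# Usage against a live cluster (requires oc and a logged-in session):
#   oc get pod -n openshift-storage --no-headers | classify_pods
```

Healthy pods produce no output, so an empty result suggests that the pod-level checks pass and the Ceph tools are the next place to look.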
Mitigation
- (Optional) Debugging log information
- Run the following command to gather the debugging information for the Ceph
cluster:
oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.6
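As a hedged sketch, the wrapper below runs the same must-gather command with the --dest-dir option of oc adm must-gather, so that the output lands in a named or timestamped directory. gather_ceph_debug and the OC override variable are hypothetical names introduced for illustration; setting OC=echo prints the command instead of running it.

```shell
#!/bin/sh
# OC is a hypothetical override: set OC=echo to print the command instead of running it.
OC="${OC:-oc}"

# gather_ceph_debug [DEST_DIR]: hypothetical wrapper around the must-gather command above.
gather_ceph_debug() {
  # Default to a timestamped directory so repeated runs do not mix their output.
  mg_dest="${1:-./must-gather-$(date +%Y%m%d-%H%M%S)}"
  "$OC" adm must-gather \
    --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.6 \
    --dest-dir="$mg_dest"
}

# Usage (requires oc and a logged-in session):
#   gather_ceph_debug /tmp/ceph-must-gather
```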