CephMonHighNumberOfLeaderChanges
A Ceph cluster runs a redundant set of monitor pods that store critical information about the storage cluster. The monitor pods synchronize periodically to keep that information current. The first monitor pod to obtain the most recent information becomes the leader, and the other monitor pods start their synchronization after asking the leader. A network connection problem, or another kind of problem in one or more monitor pods, causes unusually frequent leader changes. This situation can negatively affect storage cluster performance.
Impact: Medium
Important: Check for any network issues. If there is a network issue, escalate to the Fusion Data Foundation team before you proceed with any of the following troubleshooting steps.
Diagnosis
Use one of the following to help gather more information and diagnose the issue:
- Print the logs of the affected monitor pod, using the oc logs <rook-ceph-mon-X-yyyy> -n openshift-storage command, where <rook-ceph-mon-X-yyyy> specifies the name of the affected monitor pod.
- Use the OpenShift web console to open the logs of the affected monitor pod. The log contains more information about possible causes.
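When it is not yet clear which monitor pod is affected, the log command above can be applied to every monitor pod in turn. The following is a minimal sketch, not a supported procedure. It assumes that the monitor pods carry the app=rook-ceph-mon label (the label Rook usually applies; verify it on your cluster). The function names mon_pods and mon_logs and the NAMESPACE and TAIL_LINES variables are hypothetical names introduced only for this illustration.

```shell
# Sketch only: print recent log lines for every monitor pod.
# Assumption: monitor pods are labeled app=rook-ceph-mon (verify on your cluster).
NAMESPACE="${NAMESPACE:-openshift-storage}"

# List monitor pods as pod/<name>, one per line.
mon_pods() {
  oc get pods -n "$NAMESPACE" -l app=rook-ceph-mon -o name
}

# Print a header and the last TAIL_LINES log lines (default 200) for each monitor pod.
mon_logs() {
  for pod in $(mon_pods); do
    echo "=== ${pod} ==="
    oc logs "$pod" -n "$NAMESPACE" --tail="${TAIL_LINES:-200}"
  done
}
```

After defining the functions in a shell that is logged in to the cluster, run mon_logs, optionally with a larger window such as TAIL_LINES=500 mon_logs, and compare the output of the monitor pods side by side.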
Use the following steps for general pod troubleshooting:
- pod status: pending
  - Check for resource issues, pending Persistent Volume Claims (PVCs), node assignment, and kubelet problems, using the following commands:
    - oc project openshift-storage
    - oc get pod | grep {ceph-component}
  - Examine the output for a {ceph-component} pod that is in the pending state, not running, or not ready.
  - Set MYPOD as the variable for the pod that is identified as the problem pod, specifying the name of that pod for <pod_name>:
    - MYPOD=<pod_name>
  - Look for resource limitations or pending PVCs. Otherwise, check the node assignment, using the oc get pod/${MYPOD} -o wide command.
- pod status: NOT pending, running, but NOT ready
  - Check the readiness probe, using the oc describe pod/${MYPOD} command.
- pod status: NOT pending, but NOT running
  - Check for application or image issues, using the oc logs pod/${MYPOD} command.
  - Important: If a node was assigned, check the kubelet on the node.
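The three pod status cases above can be sketched as a small helper that reads the pod state and prints which command to run next. This is an illustrative sketch under stated assumptions, not part of the product: triage_pod and NAMESPACE are hypothetical names, and the helper uses the pod phase from .status.phase, which is coarser than the STATUS column of oc get pod (for example, a crash-looping pod can still report the Running phase).

```shell
# Sketch only: map a pod's state to the next troubleshooting command.
# Assumption: .status.phase is a good enough stand-in for the STATUS column.
NAMESPACE="${NAMESPACE:-openshift-storage}"

triage_pod() {
  pod="$1"
  # Pod phase: Pending, Running, Succeeded, Failed, or Unknown.
  phase="$(oc get pod "$pod" -n "$NAMESPACE" -o jsonpath='{.status.phase}')"
  # Space-separated ready flags, one per container; empty if none have started.
  ready="$(oc get pod "$pod" -n "$NAMESPACE" -o jsonpath='{.status.containerStatuses[*].ready}')"

  case "$phase" in
    Pending)
      echo "pending: oc get pod/${pod} -o wide" ;;
    Running)
      case "$ready" in
        *false*|"") echo "running, not ready: oc describe pod/${pod}" ;;
        *)          echo "running and ready: no pod-level action" ;;
      esac ;;
    *)
      echo "not running (${phase}): oc logs pod/${pod}" ;;
  esac
}
```

For example, triage_pod "${MYPOD}" prints one line that names the matching case and the command to run for it. The helper only suggests the command; it does not run it, so the output can be reviewed first.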
Mitigation
- (Optional) Debugging log information
- Run the following command to gather the debugging information for the Ceph cluster:
oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.6