CephNodeDown

A node running Ceph pods is down. Storage operations will continue to function because Ceph is designed to tolerate a node failure. However, it is recommended to resolve the issue to minimize the risk of another node going down and affecting storage functions.

Impact: Medium

Diagnosis

  1. List all the pods that are running and failing:
    oc -n openshift-storage get pods
    Important: Ensure that you meet the IBM Storage Fusion Data Foundation resource requirements so that the Object Storage Device (OSD) pods are scheduled on the new node. Recovery may take a few minutes while the Ceph cluster rebuilds data for the recovering OSD, so confirm that the OSD pods are correctly placed on the new worker node.
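To narrow the listing down to problem pods, the output can be filtered; a minimal sketch, assuming the default `oc get pods` table layout (NAME READY STATUS RESTARTS AGE) and an `oc` CLI that is logged in to the cluster — the `not_running_pods` helper is hypothetical, not part of the product:

```shell
# Hypothetical helper: print pods whose STATUS column is not Running or Completed.
# Assumes the default `oc get pods` table layout: NAME READY STATUS RESTARTS AGE.
not_running_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Only query the cluster if the oc CLI is actually available.
if command -v oc >/dev/null 2>&1; then
  oc -n openshift-storage get pods | not_running_pods
fi
```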
  2. Check if the OSD pods that were previously failing are now running:
    oc -n openshift-storage get pods
    If the previously failing OSD pods have not been scheduled, use the describe command and check the events for reasons the pods were not rescheduled.
  3. Find the one or more failing OSD pods:
    oc -n openshift-storage get pods | grep osd
  4. Describe the events for the failing OSD pod:
    oc -n openshift-storage describe pods/<osd_podname_from_the_previous_step>
    In the Events section, look for failure reasons, such as resource requirements not being met.
    • (Optional) Use the rook-ceph-toolbox to watch the recovery. This step is helpful for large Ceph clusters. To access the toolbox, run the following command:
      TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
      oc rsh -n openshift-storage $TOOLS_POD
    • From the rsh command prompt, run the ceph status command and watch for recovery under the I/O section.
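The optional toolbox check can be scripted end to end; a minimal sketch, assuming the rook-ceph-tools pod exists and that `ceph status` prints a `recovery:` line in its `io:` section while recovery is in progress (the `recovery_active` helper is hypothetical):

```shell
# Hypothetical helper: succeed if ceph status output mentions recovery I/O.
recovery_active() {
  grep -q 'recovery:'
}

# Only query the cluster if the oc CLI is actually available.
if command -v oc >/dev/null 2>&1; then
  TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
  if oc rsh -n openshift-storage "$TOOLS_POD" ceph status | recovery_active; then
    echo "Recovery in progress; re-run ceph status (or ceph -w) to watch it complete."
  fi
fi
```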
  5. Determine if there are failed nodes.
    1. Get the list of worker nodes, and check for the node status:
      oc get nodes --selector='node-role.kubernetes.io/worker','!node-role.kubernetes.io/infra'
    2. Describe the node with the NotReady status to get more information about the failure:
      oc describe node <node_name>
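The two sub-steps above can be combined into one pass; a minimal sketch, assuming the default `oc get nodes` table layout (NAME STATUS ROLES AGE VERSION) — the `not_ready_nodes` helper is an illustration, not part of the product:

```shell
# Hypothetical helper: print the names of nodes whose STATUS column is not Ready.
# Assumes the default `oc get nodes` table layout: NAME STATUS ROLES AGE VERSION.
not_ready_nodes() {
  awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# Only query the cluster if the oc CLI is actually available.
if command -v oc >/dev/null 2>&1; then
  oc get nodes --selector='node-role.kubernetes.io/worker','!node-role.kubernetes.io/infra' \
    | not_ready_nodes \
    | while read -r node; do
        oc describe node "$node"
      done
fi
```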

Mitigation

(Optional) Debugging log information
Run the following command to gather the debugging information for the Ceph cluster:
oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.6