Troubleshooting Rook Ceph cluster

Review frequently encountered Rook Ceph cluster issues.

Before you proceed with troubleshooting, ensure that your cluster meets the prerequisites and that you have adequate permissions to perform installation-related operations. For more information, see Prerequisites and limitations.

Note: You need to set up the kubectl CLI to run troubleshooting commands. For more information, see Accessing your cluster from the kubectl CLI.
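For example, you can confirm that kubectl is configured for the correct cluster before you continue; both commands are read-only checks:

kubectl config current-context
kubectl cluster-info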

Rook Operator chart installation by using the management console gets ESOCKETTIMEDOUT error

You might see an ESOCKETTIMEDOUT error when you install the Rook Operator chart by using the management console.

Resolve the issue

Check whether the rook-agent and operator pods are running. Run the following commands:

kubectl get nodes -o wide
kubectl -n default get po -o wide

If the number of rook-agent pods matches the number of worker nodes, and one operator pod is running, the installation was successful. You can ignore the error.
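To compare the counts quickly, you can count the nodes and the rook-agent pods. These convenience commands are optional; the first command counts all nodes, so subtract any master or management nodes that do not run agents:

kubectl get nodes --no-headers | wc -l
kubectl -n default get po --no-headers | grep -c rook-agent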

If the agent or operator pods are not running, check the Helm logs to identify the error:

kubectl -n kube-system get po | grep helm

Following is a sample output:

helm-api-66b98d88bc-6psq6                            2/2       Running    0          1d
helm-repo-5495f5c48c-k9mkl                           1/1       Running    0          1d

Check the logs of the rudder container in the helm-api pod:

kubectl -n kube-system logs helm-api-66b98d88bc-6psq6 rudder
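To narrow the rudder log to relevant lines, you might filter it. The pod name comes from the sample output and differs in your cluster, and the search pattern is only a suggestion:

kubectl -n kube-system logs helm-api-66b98d88bc-6psq6 rudder | grep -iE 'error|rook'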

Rook Operator Chart deletion does not delete the Rook daemonset

  1. Get the list of pods.

    kubectl -n default get po
    

    Following is a sample output:

    NAME               READY     STATUS    RESTARTS   AGE
    rook-agent-ckxht   1/1       Running   0          23h
    rook-agent-jnxh6   1/1       Running   0          23h
    rook-agent-wkt26   1/1       Running   0          23h
    
  2. Get the daemonset.

    kubectl -n default get ds
    

    Following is a sample output:

    NAME         DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE-SELECTOR   AGE
    rook-agent   3         3         3         3            3           <none>          23h
    

This is a known issue in the Rook alpha release. Manually delete the agent daemonset by running the following commands:

  1. Delete the daemonset.

    kubectl -n default delete ds rook-agent
    

    Following is a sample output:

    daemonset "rook-agent" deleted
    
  2. Get a list of pods.

    kubectl -n default get po
    

    Following is a sample output:

    No resources found.
    

Installing Rook cluster chart (ibm-rook-rbd-cluster) by using the management console gets ESOCKETTIMEDOUT error

You might see an ESOCKETTIMEDOUT error when you install the Rook cluster chart by using the management console.

  1. Check the pods that are running in the namespace where you are installing the chart. Look for the rook-cluster-precheck-job pod and its init container. You might see an Init:ErrImagePull error, as in the following example:

    kubectl -n default get po
    

    Following is a sample output:

    NAME                              READY     STATUS              RESTARTS   AGE
    rook-cluster-precheck-job-mqfv9   0/1       Init:ErrImagePull   0          28s
    
  2. Verify that the Docker repository that is specified for the Hyperkube image is correct.
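     For example, you might check which image the init container tries to pull. The pod name is taken from the sample output in step 1 and differs in your cluster:

    kubectl -n default describe po rook-cluster-precheck-job-mqfv9 | grep -i image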

Installing Rook cluster chart (ibm-rook-rbd-cluster) gets Job failed: BackoffLimitExceeded Error

You might see a BackoffLimitExceeded error while you install the Rook cluster chart.

The following two reasons might cause this error:

Rook Operator Helm chart is not installed in your cluster

Verify whether the Rook Operator Helm chart is installed in your cluster:

kubectl get po --all-namespaces | grep rook-operator

If the chart is not installed, install it first.

For more information about installing the Rook operator Helm chart, see Ceph Operator Helm Chart.
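The linked instructions use the upstream Helm repository. The following is a minimal sketch that assumes a current Rook release, Helm 3, and the public rook-release chart repository; the chart name and installation steps for the Rook version in your catalog might differ:

helm repo add rook-release https://charts.rook.io/release
helm install --create-namespace --namespace rook-ceph rook-ceph rook-release/rook-ceph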

Rook cluster chart is already installed in your namespace

Verify whether a Rook cluster chart is already installed in your namespace:

kubectl -n default get cluster

Following is a sample output:

NAME              KIND
default-cluster   Cluster.v1alpha1.rook.io

You cannot install multiple Rook cluster charts in one namespace.
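If you are not sure which namespaces already contain a Rook cluster, you might list the cluster resources across all namespaces:

kubectl get cluster --all-namespaces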

Rook cluster chart starts deploying but rook-ceph-mon gets a CrashLoopBackOff error

  1. Get a list of pods.

    kubectl -n default get po
    

    Following is a sample output:

    NAME                            READY     STATUS             RESTARTS   AGE
    rook-agent-8tlqf                1/1       Running            0          24m
    rook-agent-htjdl                1/1       Running            0          24m
    rook-agent-q46vw                1/1       Running            0          24m
    rook-ceph-mon0-f2bc6            0/1       CrashLoopBackOff   2          37s
    rook-operator-947bf78c6-8hjgj   1/1       Running            0          24m
    
  2. Check the rook-ceph-mon log.

    kubectl -n default logs rook-ceph-mon0-f2bc6
    

    Following is a sample output:

    2018-05-18 10:51:39.932606 I | rook: starting Rook v0.7.1 with arguments '/usr/local/bin/rook mon --config-dir=/var/lib/rook --name=rook-ceph-mon0 --port=6790 --fsid=a01d92fb-8191-4343-8ec1-676abd0de780'
    2018-05-18 10:51:39.932749 I | rook: flag values: --admin-secret=*****, --ceph-config-override=/etc/rook/config/override.conf, --cluster-name=default, --config-dir=/var/lib/rook, --fsid=a01d92fb-8191-4343-8ec1-676abd0de780, --help=false, --log-level=INFO, --mon-endpoints=rook-ceph-mon0=10.0.0.185:6790, --mon-secret=*****, --name=rook-ceph-mon0, --port=6790, --private-ipv4=10.1.19.21, --public-ipv4=10.0.0.185
    The keyring does not match the existing keyring in /var/lib/rook/rook-ceph-mon0/data/keyring. You may need to delete the contents of dataDirHostPath on the host from a previous deployment.
    

This error indicates that you had a previous Rook Ceph deployment.

To fix this issue, delete the failed chart, and then either delete the contents of the dataDirHostPath on the hosts that were used in the previous deployment or specify a different dataDirHostPath setting. Then, reinstall the ibm-rook-rbd-cluster chart.
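The following is a minimal cleanup sketch for each affected host. It assumes the default dataDirHostPath of /var/lib/rook, which matches the --config-dir value in the log output above; your chart setting might differ:

# Run on every host that ran Rook components in the previous deployment.
sudo rm -rf /var/lib/rook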

Deployment completes for the ibm-rook-rbd-cluster chart but none of the rook-ceph-mon, OSD, manager, or API pods come up

This issue might happen if you tried to reinstall ibm-rook-rbd-cluster without deleting the contents of the dataDirHostPath on the hosts or cleaning up the storage disks.

To resolve the issue, complete the following tasks:

  1. Delete the failed ibm-rook-rbd-cluster chart.
  2. Delete the Rook Operator chart.
  3. Delete the contents of dataDirHostPath, as described in the previous section.
  4. Clean up the disk that you used for storage (see the example after this list).
  5. Reinstall the Rook Operator chart.
  6. Reinstall the ibm-rook-rbd-cluster chart.
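Following is a minimal sketch of step 4 on one storage host. The device name /dev/sdX is a placeholder for the disk that you used for storage, and both commands are destructive, so double-check the device name before you run them:

# Wipe file system signatures from the storage disk.
sudo wipefs --all /dev/sdX
# Remove the GPT and MBR partition tables from the disk.
sudo sgdisk --zap-all /dev/sdX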

After a worker node is restarted, the Rook agent pod remains in an error status

  1. Get the pod information.

    kubectl get po -o wide
    

    Following is a sample output:

    NAME                             READY     STATUS                 RESTARTS   AGE       IP            NODE
    rook-agent-5rst5                 1/1       Running                0          2d        9.5.28.147    9.5.28.147
    rook-agent-bsrrx                 1/1       Running                0          2d        9.5.28.143    9.5.28.143
    rook-agent-zq4bm                 0/1       CreateContainerError   1          2d        9.5.28.146    9.5.28.146
    rook-api-86b5b8849c-fjqf8        1/1       Running                0          7m        10.1.68.153   9.5.28.147
    rook-ceph-mgr0-9c56544c8-2mxqr   1/1       Running                0          2d        10.1.19.31    9.5.28.143
    rook-ceph-mon0-g5t7m             1/1       Running                0          2d        10.1.19.30    9.5.28.143
    rook-ceph-mon1-zl5px             1/1       Running                5          7m        10.1.0.164    9.5.28.146
    rook-ceph-mon2-jjjht             1/1       Running                0          2d        10.1.68.151   9.5.28.147
    rook-ceph-osd-9.5.28.143-2bpl6   1/1       Running                0          2d        10.1.19.32    9.5.28.143
    rook-ceph-osd-9.5.28.146-8qwbx   1/1       Running                5          7m        10.1.0.165    9.5.28.146
    rook-ceph-osd-9.5.28.147-mcksg   1/1       Running                0          2d        10.1.68.152   9.5.28.147
    rook-operator-947bf78c6-58nng    1/1       Running                0          2d        10.1.19.22    9.5.28.143
    
  2. Get information about the Rook agent pod.

    kubectl describe po rook-agent-zq4bm
    

    Following is a sample output:

    Name:        rook-agent-zq4bm
    Namespace:    default
    Node:        9.5.28.146/9.5.28.146
    ...
    
    6m        6m        3    kubelet, 9.5.28.146    spec.containers{rook-agent}    Warning        Failed    Error: Error response from daemon:
    Conflict. The container name "/k8s_rook-agent_rook-agent-zq4bm_default_5b2c4423-5a8e-11e8-a2b0-005056a7db67_2" is already in
    use by container ac71dc3e805f470d44afe6660f668e71832753505532625a9f30905c30f2063a. You have to remove (or rename) that
    container to be able to reuse that name.
    
  3. Log in to the node on which the pod is failing. Kill the conflicting container as reported in the error.

    docker kill ac71dc3e805f470d44afe6660f668e71832753505532625a9f30905c30f2063a
    ac71dc3e805f470d44afe6660f668e71832753505532625a9f30905c30f2063a
    

    The agent pod starts running normally.

    kubectl get po -o wide
    

    Following is a sample output:

    NAME                             READY     STATUS    RESTARTS   AGE       IP            NODE
    rook-agent-5rst5                 1/1       Running   0          2d        9.5.28.147    9.5.28.147
    rook-agent-bsrrx                 1/1       Running   0          2d        9.5.28.143    9.5.28.143
    rook-agent-zq4bm                 1/1       Running   2          2d        9.5.28.146    9.5.28.146
    rook-api-86b5b8849c-fjqf8        1/1       Running   0          12m       10.1.68.153   9.5.28.147
    rook-ceph-mgr0-9c56544c8-2mxqr   1/1       Running   0          2d        10.1.19.31    9.5.28.143
    rook-ceph-mon0-g5t7m             1/1       Running   0          2d        10.1.19.30    9.5.28.143
    rook-ceph-mon1-zl5px             1/1       Running   5          12m       10.1.0.164    9.5.28.146
    rook-ceph-mon2-jjjht             1/1       Running   0          2d        10.1.68.151   9.5.28.147
    rook-ceph-osd-9.5.28.143-2bpl6   1/1       Running   0          2d        10.1.19.32    9.5.28.143
    rook-ceph-osd-9.5.28.146-8qwbx   1/1       Running   5          12m       10.1.0.165    9.5.28.146
    rook-ceph-osd-9.5.28.147-mcksg   1/1       Running   0          2d        10.1.68.152   9.5.28.147
    rook-operator-947bf78c6-58nng    1/1       Running   0          2d        10.1.19.22    9.5.28.143