Troubleshooting Rook Ceph cluster

Review frequently encountered Rook Ceph cluster issues.

Before you proceed with troubleshooting, ensure that your cluster meets the prerequisites and that you have adequate permissions to perform installation-related operations. For more information, see Prerequisites and limitations.

Note: You need to set up the kubectl CLI to run troubleshooting commands. For more information, see Accessing your cluster from the kubectl CLI.
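For example, you can confirm that kubectl is configured for the correct cluster before you continue; both commands are read-only checks:

kubectl config current-context
kubectl cluster-info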

Rook Operator chart installation by using the management console gets ESOCKETTIMEDOUT error

You might see an ESOCKETTIMEDOUT error when you install the Rook Operator chart by using the management console.

Resolve the issue

Check whether the rook-agent and operator pods are running. Run the following commands:

kubectl get nodes -o wide
kubectl -n default get po -o wide

If the number of rook-agent pods matches the number of worker nodes, and one operator pod is running, the installation was successful. You can ignore the error.
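To compare the counts quickly, you can count the nodes and the rook-agent pods. These convenience commands are optional; the first command counts all nodes, so subtract any master or management nodes that do not run agents:

kubectl get nodes --no-headers | wc -l
kubectl -n default get po --no-headers | grep -c rook-agent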

If the agent or operator pods are not running, check the Helm logs to identify the error:

kubectl -n kube-system get po | grep helm

Following is a sample output:

helm-api-66b98d88bc-6psq6                            2/2       Running    0          1d
helm-repo-5495f5c48c-k9mkl                           1/1       Running    0          1d

Check the logs of the rudder container in the helm-api pod:

kubectl -n kube-system logs helm-api-66b98d88bc-6psq6 rudder
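To narrow the rudder log to relevant lines, you might filter it. The pod name comes from the sample output and differs in your cluster, and the search pattern is only a suggestion:

kubectl -n kube-system logs helm-api-66b98d88bc-6psq6 rudder | grep -iE 'error|rook'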

Rook Operator Chart deletion does not delete the Rook daemonset

  1. Get the list of pods.

    kubectl -n default get po
    

    Following is a sample output:

    NAME               READY     STATUS    RESTARTS   AGE
    rook-agent-ckxht   1/1       Running   0          23h
    rook-agent-jnxh6   1/1       Running   0          23h
    rook-agent-wkt26   1/1       Running   0          23h
    
  2. Get the daemonset.

    kubectl -n default get ds
    

    Following is a sample output:

    NAME         DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE-SELECTOR   AGE
    rook-agent   3         3         3         3            3           <none>          23h
    

This is a known issue in the Rook alpha release. Manually delete the agent daemonset by running the following commands:

  1. Delete the daemonset.

    kubectl -n default delete ds rook-agent
    

    Following is a sample output:

    daemonset "rook-agent" deleted
    
  2. Get a list of pods.

    kubectl -n default get po
    

    Following is a sample output:

    No resources found.
    

Installing Rook cluster chart (ibm-rook-rbd-cluster) by using the management console gets ESOCKETTIMEDOUT error

You might see an ESOCKETTIMEDOUT error when you install the Rook cluster chart by using the management console.

  1. Check the pods that are running in the namespace where you are installing the chart. Look for the rook-cluster-precheck-job pod and its init container. You might see an Init:ErrImagePull error, as in the following example:

    kubectl -n default get po
    

    Following is a sample output:

    NAME                              READY     STATUS              RESTARTS   AGE
    rook-cluster-precheck-job-mqfv9   0/1       Init:ErrImagePull   0          28s
    
  2. Verify that the Docker repository that is specified for the Hyperkube image is correct.
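     For example, you might check which image the init container tries to pull. The pod name is taken from the sample output in step 1 and differs in your cluster:

    kubectl -n default describe po rook-cluster-precheck-job-mqfv9 | grep -i image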

Installing Rook cluster chart (ibm-rook-rbd-cluster) gets Job failed: BackoffLimitExceeded Error

You might see a BackoffLimitExceeded error while you install the Rook cluster chart.

The following two reasons might cause this error:

Rook Operator Helm chart is not installed in your cluster

Verify whether the Rook Operator Helm chart is installed in your cluster:

kubectl get po --all-namespaces | grep rook-operator

If the chart is not installed, install it first.

For more information about installing the Rook operator Helm chart, see Ceph Operator Helm Chart.
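The linked instructions use the upstream Helm repository. The following is a minimal sketch that assumes a current Rook release, Helm 3, and the public rook-release chart repository; the chart name and installation steps for the Rook version in your catalog might differ:

helm repo add rook-release https://charts.rook.io/release
helm install --create-namespace --namespace rook-ceph rook-ceph rook-release/rook-ceph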

Rook cluster chart is already installed in your namespace

Verify whether a Rook cluster chart is already installed in your namespace:

kubectl -n default get cluster

Following is a sample output:

NAME              KIND
default-cluster   Cluster.v1alpha1.rook.io

You cannot install multiple Rook cluster charts in one namespace.
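If you are not sure which namespaces already contain a Rook cluster, you might list the cluster resources across all namespaces:

kubectl get cluster --all-namespaces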

Rook cluster chart starts deploying but rook-ceph-mon gets a CrashLoopBackOff error

  1. Get a list of pods.

    kubectl -n default get po
    

    Following is a sample output:

    NAME                            READY     STATUS             RESTARTS   AGE
    rook-agent-8tlqf                1/1       Running            0          24m
    rook-agent-htjdl                1/1       Running            0          24m
    rook-agent-q46vw                1/1       Running            0          24m
    rook-ceph-mon0-f2bc6            0/1       CrashLoopBackOff   2          37s
    rook-operator-947bf78c6-8hjgj   1/1       Running            0          24m
    
  2. Check the rook-ceph-mon log.

    kubectl -n default logs rook-ceph-mon0-f2bc6
    

    Following is a sample output:

    2018-05-18 10:51:39.932606 I | rook: starting Rook v0.7.1 with arguments '/usr/local/bin/rook mon --config-dir=/var/lib/rook --name=rook-ceph-mon0 --port=6790 --fsid=a01d92fb-8191-4343-8ec1-676abd0de780'
    2018-05-18 10:51:39.932749 I | rook: flag values: --admin-secret=*****, --ceph-config-override=/etc/rook/config/override.conf, --cluster-name=default, --config-dir=/var/lib/rook, --fsid=a01d92fb-8191-4343-8ec1-676abd0de780, --help=false, --log-level=INFO, --mon-endpoints=rook-ceph-mon0=10.0.0.185:6790, --mon-secret=*****, --name=rook-ceph-mon0, --port=6790, --private-ipv4=10.1.19.21, --public-ipv4=10.0.0.185
    The keyring does not match the existing keyring in /var/lib/rook/rook-ceph-mon0/data/keyring. You may need to delete the contents of dataDirHostPath on the host from a previous deployment.
    

This error indicates that you had a previous Rook Ceph deployment.

To fix this issue, delete the failed chart, and then either delete the contents of the dataDirHostPath on the hosts that were used in the previous deployment or specify a different dataDirHostPath setting. Then, reinstall the ibm-rook-rbd-cluster chart.
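The following is a minimal cleanup sketch for each affected host. It assumes the default dataDirHostPath of /var/lib/rook, which matches the --config-dir value in the log output above; your chart setting might differ:

# Run on every host that ran Rook components in the previous deployment.
sudo rm -rf /var/lib/rook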

Deployment completes for the ibm-rook-rbd-cluster chart but none of the rook-ceph-mon, OSD, manager, or API pods come up

This issue might happen if you tried to reinstall ibm-rook-rbd-cluster without deleting the contents of the dataDirHostPath on the hosts or cleaning up the storage disks.

To resolve the issue, complete the following tasks:

  1. Delete the failed ibm-rook-rbd-cluster chart.
  2. Delete the Rook Operator chart.
  3. Delete the contents of dataDirHostPath, as described in the previous section.
  4. Clean up the disk that you used for storage (see the example after this list).
  5. Reinstall the Rook Operator chart.
  6. Reinstall the ibm-rook-rbd-cluster chart.
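Following is a minimal sketch of step 4 on one storage host. The device name /dev/sdX is a placeholder for the disk that you used for storage, and both commands are destructive, so double-check the device name before you run them:

# Wipe file system signatures from the storage disk.
sudo wipefs --all /dev/sdX
# Remove the GPT and MBR partition tables from the disk.
sudo sgdisk --zap-all /dev/sdX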

After a worker node is restarted, the Rook agent pod remains in an error status

  1. Get the pod information.

    kubectl get po -o wide
    

    Following is a sample output:

    NAME                             READY     STATUS                 RESTARTS   AGE       IP            NODE
    rook-agent-5rst5                 1/1       Running                0          2d        9.5.28.147    9.5.28.147
    rook-agent-bsrrx                 1/1       Running                0          2d        9.5.28.143    9.5.28.143
    rook-agent-zq4bm                 0/1       CreateContainerError   1          2d        9.5.28.146    9.5.28.146
    rook-api-86b5b8849c-fjqf8        1/1       Running                0          7m        10.1.68.153   9.5.28.147
    rook-ceph-mgr0-9c56544c8-2mxqr   1/1       Running                0          2d        10.1.19.31    9.5.28.143
    rook-ceph-mon0-g5t7m             1/1       Running                0          2d        10.1.19.30    9.5.28.143
    rook-ceph-mon1-zl5px             1/1       Running                5          7m        10.1.0.164    9.5.28.146
    rook-ceph-mon2-jjjht             1/1       Running                0          2d        10.1.68.151   9.5.28.147
    rook-ceph-osd-9.5.28.143-2bpl6   1/1       Running                0          2d        10.1.19.32    9.5.28.143
    rook-ceph-osd-9.5.28.146-8qwbx   1/1       Running                5          7m        10.1.0.165    9.5.28.146
    rook-ceph-osd-9.5.28.147-mcksg   1/1       Running                0          2d        10.1.68.152   9.5.28.147
    rook-operator-947bf78c6-58nng    1/1       Running                0          2d        10.1.19.22    9.5.28.143
    
  2. Get information about the Rook agent pod.

    kubectl describe po rook-agent-zq4bm
    

    Following is a sample output:

    Name:        rook-agent-zq4bm
    Namespace:    default
    Node:        9.5.28.146/9.5.28.146
    ...
    
    6m        6m        3    kubelet, 9.5.28.146    spec.containers{rook-agent}    Warning        Failed    Error: Error response from daemon:
    Conflict. The container name "/k8s_rook-agent_rook-agent-zq4bm_default_5b2c4423-5a8e-11e8-a2b0-005056a7db67_2" is already in
    use by container ac71dc3e805f470d44afe6660f668e71832753505532625a9f30905c30f2063a. You have to remove (or rename) that
    container to be able to reuse that name.
    
  3. Log in to the node on which the pod is failing. Kill the conflicting container as reported in the error.

    docker kill ac71dc3e805f470d44afe6660f668e71832753505532625a9f30905c30f2063a
    ac71dc3e805f470d44afe6660f668e71832753505532625a9f30905c30f2063a
    

    The agent pod starts running normally.

    kubectl get po -o wide
    

    Following is a sample output:

    NAME                             READY     STATUS    RESTARTS   AGE       IP            NODE
    rook-agent-5rst5                 1/1       Running   0          2d        9.5.28.147    9.5.28.147
    rook-agent-bsrrx                 1/1       Running   0          2d        9.5.28.143    9.5.28.143
    rook-agent-zq4bm                 1/1       Running   2          2d        9.5.28.146    9.5.28.146
    rook-api-86b5b8849c-fjqf8        1/1       Running   0          12m       10.1.68.153   9.5.28.147
    rook-ceph-mgr0-9c56544c8-2mxqr   1/1       Running   0          2d        10.1.19.31    9.5.28.143
    rook-ceph-mon0-g5t7m             1/1       Running   0          2d        10.1.19.30    9.5.28.143
    rook-ceph-mon1-zl5px             1/1       Running   5          12m       10.1.0.164    9.5.28.146
    rook-ceph-mon2-jjjht             1/1       Running   0          2d        10.1.68.151   9.5.28.147
    rook-ceph-osd-9.5.28.143-2bpl6   1/1       Running   0          2d        10.1.19.32    9.5.28.143
    rook-ceph-osd-9.5.28.146-8qwbx   1/1       Running   5          12m       10.1.0.165    9.5.28.146
    rook-ceph-osd-9.5.28.147-mcksg   1/1       Running   0          2d        10.1.68.152   9.5.28.147
    rook-operator-947bf78c6-58nng    1/1       Running   0          2d        10.1.19.22    9.5.28.143