Troubleshooting Rook Ceph cluster
Review frequently encountered Rook Ceph cluster issues.
- Rook Operator chart installation by using the management console gets ESOCKETTIMEDOUT error
- Rook Operator chart deletion does not delete the Rook daemonset
- Installing Rook cluster chart (ibm-rook-rbd-cluster) by using the management console gets ESOCKETTIMEDOUT error
- Installing Rook cluster chart (ibm-rook-rbd-cluster) gets Job failed: BackoffLimitExceeded Error
- Rook cluster chart starts deploying but rook-ceph-mon gets a CrashLoopBackOff error
- Deployment completes for ibm-rook-rbd-cluster chart but none of the rook-ceph-mon, rook-ceph-mgr, or rook-api pods come up
- After a worker node is restarted, Rook agent pod remains in error status
Before you proceed with troubleshooting, ensure that your cluster meets the prerequisites and that you have adequate permissions to perform installation-related operations. For more information, see Prerequisites and limitations.
Note: You need to set up the kubectl CLI to run the troubleshooting commands. For more information, see Accessing your cluster from the kubectl CLI.
Rook Operator chart installation by using the management console gets ESOCKETTIMEDOUT error
You might see an ESOCKETTIMEDOUT error while you install the Rook Operator chart by using the management console.
Resolve the issue
Check whether the rook-agent and operator pods are running. Run the following commands:
kubectl get nodes -o wide
kubectl -n default get po -o wide
If you have as many rook-agent pods as worker nodes, and one operator pod is running, the installation was successful. You can ignore the error.
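As a quick sanity check, you can compare the two counts directly. A minimal sketch, assuming the agents run in the default namespace as in the rest of this topic; if your cluster has dedicated master nodes that do not run agents, subtract them from the node count:
# Count the worker nodes and the rook-agent pods; the two numbers should match.
kubectl get nodes --no-headers | wc -l
kubectl -n default get po --no-headers | grep -c '^rook-agent'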
If the agent or operator pods are not running, check the Helm logs to identify the error:
kubectl -n kube-system get po | grep helm
Following is a sample output:
helm-api-66b98d88bc-6psq6 2/2 Running 0 1d
helm-repo-5495f5c48c-k9mkl 1/1 Running 0 1d
Then check the rudder container logs of the helm-api pod:
kubectl -n kube-system logs helm-api-66b98d88bc-6psq6 rudder
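A minimal sketch for narrowing down the failure, assuming the helm-api pod name from the sample output above; substitute the name from your own cluster:
# Search the rudder container logs for messages that relate to the failed installation.
kubectl -n kube-system logs helm-api-66b98d88bc-6psq6 rudder | grep -iE 'error|fail'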
Rook Operator chart deletion does not delete the Rook daemonset
After you delete the Rook Operator chart, the rook-agent daemonset and its pods might remain in the cluster.
- Get the list of pods.
kubectl -n default get po
Following is a sample output:
NAME               READY     STATUS    RESTARTS   AGE
rook-agent-ckxht   1/1       Running   0          23h
rook-agent-jnxh6   1/1       Running   0          23h
rook-agent-wkt26   1/1       Running   0          23h
- Get the daemonset.
kubectl -n default get ds
Following is a sample output:
NAME         DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE-SELECTOR   AGE
rook-agent   3         3         3         3            3           <none>          23h
This is a known issue in the Rook alpha release. Manually delete the agent daemonset by running the following commands:
- Delete the daemonset.
kubectl -n default delete ds rook-agent
Following is a sample output:
daemonset "rook-agent" deleted
- Get the list of pods to verify that the agent pods are deleted.
kubectl -n default get po
Following is a sample output:
No resources found.
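To confirm that the daemonset itself is gone, you can also query it directly:
kubectl -n default get ds rook-agent
# A NotFound error from the server confirms that the daemonset was deleted.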
Installing Rook cluster chart (ibm-rook-rbd-cluster) by using the management console gets ESOCKETTIMEDOUT error
You might see an ESOCKETTIMEDOUT error while you install the Rook cluster chart by using the management console.
- Check the pods that are running in the namespace where you are installing the chart. Look for the rook-cluster-precheck-job pod and its InitContainer. You might see the following error:
kubectl -n default get po
Following is a sample output:
NAME                              READY     STATUS              RESTARTS   AGE
rook-cluster-precheck-job-mqfv9   0/1       Init:ErrImagePull   0          28s
- Verify that the Docker repository that is specified for the Hyperkube image is correct. See the example check after this list.
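One way to check is to read the image references straight from the failing pod. A minimal sketch, assuming the pod name from the sample output above:
# Print the images that the precheck pod uses, including its init containers.
kubectl -n default get po rook-cluster-precheck-job-mqfv9 \
  -o jsonpath='{.spec.initContainers[*].image}{"\n"}{.spec.containers[*].image}{"\n"}'
# Compare the repository part of the image names with your Hyperkube repository setting.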
Installing Rook cluster chart (ibm-rook-rbd-cluster) gets Job failed: BackoffLimitExceeded Error
You might see a BackoffLimitExceeded error while you install the Rook cluster chart. This error usually has one of the following two causes:
- The Rook Operator Helm chart is not installed in your cluster.
- The Rook cluster chart is already installed in your namespace.
Rook Operator Helm chart is not installed in your cluster
Verify whether the Rook Operator Helm chart is installed in your cluster:
kubectl get po --all-namespaces | grep rook-operator
If the chart is not installed, install it first.
For more information about installing the Rook Operator Helm chart, see Ceph Operator Helm Chart.
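As a sketch of what that installation looks like with the Helm v2 CLI that these charts use; the repository URL and chart name here follow the Rook documentation and might differ for the Rook release that your cluster chart expects:
# Add the Rook chart repository and install the operator chart.
helm repo add rook-release https://charts.rook.io/release
helm install --namespace rook-ceph --name rook-ceph rook-release/rook-ceph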
Rook cluster chart is already installed in your namespace
Verify whether the Rook cluster chart is already installed in your namespace:
kubectl -n default get cluster
Following is a sample output:
NAME KIND
default-cluster Cluster.v1alpha1.rook.io
You cannot install multiple Rook cluster charts in one namespace.
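To see which namespaces already have a Rook cluster resource before you choose one for the new installation, you can query across all namespaces:
kubectl get cluster --all-namespaces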
Rook cluster chart starts deploying but rook-ceph-mon gets a CrashLoopBackOff error
- Get a list of pods.
kubectl -n default get po
Following is a sample output:
NAME                            READY     STATUS             RESTARTS   AGE
rook-agent-8tlqf                1/1       Running            0          24m
rook-agent-htjdl                1/1       Running            0          24m
rook-agent-q46vw                1/1       Running            0          24m
rook-ceph-mon0-f2bc6            0/1       CrashLoopBackOff   2          37s
rook-operator-947bf78c6-8hjgj   1/1       Running            0          24m
- Check the rook-ceph-mon log.
kubectl -n default logs rook-ceph-mon0-f2bc6
Following is a sample output:
2018-05-18 10:51:39.932606 I | rook: starting Rook v0.7.1 with arguments '/usr/local/bin/rook mon --config-dir=/var/lib/rook --name=rook-ceph-mon0 --port=6790 --fsid=a01d92fb-8191-4343-8ec1-676abd0de780'
2018-05-18 10:51:39.932749 I | rook: flag values: --admin-secret=*****, --ceph-config-override=/etc/rook/config/override.conf, --cluster-name=default, --config-dir=/var/lib/rook, --fsid=a01d92fb-8191-4343-8ec1-676abd0de780, --help=false, --log-level=INFO, --mon-endpoints=rook-ceph-mon0=10.0.0.185:6790, --mon-secret=*****, --name=rook-ceph-mon0, --port=6790, --private-ipv4=10.1.19.21, --public-ipv4=10.0.0.185
The keyring does not match the existing keyring in /var/lib/rook/rook-ceph-mon0/data/keyring. You may need to delete the contents of dataDirHostPath on the host from a previous deployment.
This error indicates that you had a previous Rook Ceph deployment. To fix the issue, delete the failed chart and then delete the contents of the dataDirHostPath directory on the hosts that were used in the previous deployment, or specify a different dataDirHostPath setting. Then, reinstall the ibm-rook-rbd-cluster chart.
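A minimal sketch of that cleanup, assuming the Helm v2 CLI, a hypothetical release name of rbd-cluster, and Rook's default dataDirHostPath of /var/lib/rook; substitute your own release name and path:
# Delete the failed chart release.
helm delete --purge rbd-cluster
# On each host from the previous deployment, remove the leftover monitor data.
ssh <host> 'sudo rm -rf /var/lib/rook'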
Deployment completes for ibm-rook-rbd-cluster chart but none of the rook-ceph-mon, rook-ceph-mgr, or rook-api pods come up
This issue might happen if you tried to reinstall the ibm-rook-rbd-cluster chart without deleting the contents of the dataDirHostPath on the hosts and without cleaning up the storage disks.
To resolve the issue, complete the following tasks. An example of the disk cleanup follows the list.
- Delete the failed ibm-rook-rbd-cluster chart.
- Delete the Rook Operator chart.
- Delete the contents of dataDirHostPath.
- Clean up the disk that you used for storage.
- Reinstall the Rook Operator chart.
- Reinstall the ibm-rook-rbd-cluster chart.
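A minimal sketch of the disk cleanup task, assuming the storage disk is /dev/sdb; the device name is a placeholder, so verify it before you run these destructive commands on a host:
# Remove leftover Ceph partition and filesystem signatures from the storage disk.
sudo wipefs --all /dev/sdb
# Zero the start of the disk to clear any remaining Ceph metadata.
sudo dd if=/dev/zero of=/dev/sdb bs=1M count=100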
After a worker node is restarted, Rook agent pod remains in error status
- Get the pod information.
kubectl get po -o wide
Following is a sample output:
NAME                             READY     STATUS                 RESTARTS   AGE       IP            NODE
rook-agent-5rst5                 1/1       Running                0          2d        9.5.28.147    9.5.28.147
rook-agent-bsrrx                 1/1       Running                0          2d        9.5.28.143    9.5.28.143
rook-agent-zq4bm                 0/1       CreateContainerError   1          2d        9.5.28.146    9.5.28.146
rook-api-86b5b8849c-fjqf8        1/1       Running                0          7m        10.1.68.153   9.5.28.147
rook-ceph-mgr0-9c56544c8-2mxqr   1/1       Running                0          2d        10.1.19.31    9.5.28.143
rook-ceph-mon0-g5t7m             1/1       Running                0          2d        10.1.19.30    9.5.28.143
rook-ceph-mon1-zl5px             1/1       Running                5          7m        10.1.0.164    9.5.28.146
rook-ceph-mon2-jjjht             1/1       Running                0          2d        10.1.68.151   9.5.28.147
rook-ceph-osd-9.5.28.143-2bpl6   1/1       Running                0          2d        10.1.19.32    9.5.28.143
rook-ceph-osd-9.5.28.146-8qwbx   1/1       Running                5          7m        10.1.0.165    9.5.28.146
rook-ceph-osd-9.5.28.147-mcksg   1/1       Running                0          2d        10.1.68.152   9.5.28.147
rook-operator-947bf78c6-58nng    1/1       Running                0          2d        10.1.19.22    9.5.28.143
- Get information about the Rook agent pod.
kubectl describe po rook-agent-zq4bm
Following is a sample output:
Name:        rook-agent-zq4bm
Namespace:   default
Node:        9.5.28.146/9.5.28.146
...
  6m  6m  3  kubelet, 9.5.28.146  spec.containers{rook-agent}  Warning  Failed  Error: Error response from daemon: Conflict. The container name "/k8s_rook-agent_rook-agent-zq4bm_default_5b2c4423-5a8e-11e8-a2b0-005056a7db67_2" is already in use by container ac71dc3e805f470d44afe6660f668e71832753505532625a9f30905c30f2063a. You have to remove (or rename) that container to be able to reuse that name.
- Log in to the node on which the pod is failing. Kill the conflicting container that is reported in the error.
docker kill ac71dc3e805f470d44afe6660f668e71832753505532625a9f30905c30f2063a
Following is a sample output:
ac71dc3e805f470d44afe6660f668e71832753505532625a9f30905c30f2063a
The agent pod starts running normally. Verify the pod status:
kubectl get po -o wide
Following is a sample output:
NAME                             READY     STATUS    RESTARTS   AGE       IP            NODE
rook-agent-5rst5                 1/1       Running   0          2d        9.5.28.147    9.5.28.147
rook-agent-bsrrx                 1/1       Running   0          2d        9.5.28.143    9.5.28.143
rook-agent-zq4bm                 1/1       Running   2          2d        9.5.28.146    9.5.28.146
rook-api-86b5b8849c-fjqf8        1/1       Running   0          12m       10.1.68.153   9.5.28.147
rook-ceph-mgr0-9c56544c8-2mxqr   1/1       Running   0          2d        10.1.19.31    9.5.28.143
rook-ceph-mon0-g5t7m             1/1       Running   0          2d        10.1.19.30    9.5.28.143
rook-ceph-mon1-zl5px             1/1       Running   5          12m       10.1.0.164    9.5.28.146
rook-ceph-mon2-jjjht             1/1       Running   0          2d        10.1.68.151   9.5.28.147
rook-ceph-osd-9.5.28.143-2bpl6   1/1       Running   0          2d        10.1.19.32    9.5.28.143
rook-ceph-osd-9.5.28.146-8qwbx   1/1       Running   5          12m       10.1.0.165    9.5.28.146
rook-ceph-osd-9.5.28.147-mcksg   1/1       Running   0          2d        10.1.68.152   9.5.28.147
rook-operator-947bf78c6-58nng    1/1       Running   0          2d        10.1.19.22    9.5.28.143
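If you prefer not to copy the container ID from the event message, you can locate the stale container on the node by its name prefix; the k8s_rook-agent prefix comes from the kubelet's container naming convention:
# List all containers, including exited ones, whose name starts with k8s_rook-agent.
docker ps -a --filter 'name=k8s_rook-agent' --format '{{.ID}} {{.Names}} {{.Status}}'
# Remove the stale container so that the kubelet can re-create it.
docker rm -f <container-id>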