Zookeeper pod hangs in a CrashLoopBackOff state
Upgrading the Events operator results in a Zookeeper pod hanging in a CrashLoopBackOff state.
Symptom
When you upgrade the Events operator, a Zookeeper pod enters a CrashLoopBackOff state and the following exception is visible in the Zookeeper logs:
2022-01-28 20:31:38,850 ERROR Unable to load database on disk (org.apache.zookeeper.server.quorum.QuorumPeer) [main]
java.io.IOException: No snapshot found, but there are log entries. Something is broken!
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:240)
at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:240)
at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:901)
at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:887)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:205)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:123)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:82)
2022-01-28 20:31:38,851 ERROR Unexpected exception, exiting abnormally (org.apache.zookeeper.server.quorum.QuorumPeerMain) [main]
java.lang.RuntimeException: Unable to run quorum server
at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:938)
at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:887)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:205)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:123)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:82)
Caused by: java.io.IOException: No snapshot found, but there are log entries. Something is broken!
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:240)
at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:240)
at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:901)
... 4 more
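You can confirm which pod is failing by listing the Zookeeper pods. The following check and its output are illustrative only; the pod names assume the instance name my-instance that is used in the sample output later in this article:
oc -n <events-operator-namespace> get pods | grep zookeeper
The failing pod is reported in the CrashLoopBackOff state, for example:
my-instance-zookeeper-0   1/1   Running            0    145m
my-instance-zookeeper-1   0/1   CrashLoopBackOff   12   145m
my-instance-zookeeper-2   1/1   Running            0    145m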
Cause
The Zookeeper data directory does not contain a snapshot file, but it does contain transaction log entries that were persisted by a previous run of Zookeeper.
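If you want to verify the cause before you change anything, you can list the contents of the Zookeeper data directory in the failing pod. The following command is a minimal sketch; the data path /var/lib/zookeeper/data/version-2 is an assumption and can differ in your deployment:
oc -n <events-operator-namespace> exec <failing-zookeeper-pod> -- ls /var/lib/zookeeper/data/version-2
If the directory contains log.* files but no snapshot.* files, the persisted data matches the exception that is shown in the Symptom section.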
Resolving the problem
Note: Use these instructions only when a single Zookeeper pod is in the CrashLoopBackOff state. If you attempt to resolve a problem where multiple Zookeeper pods are in a similar state, you can lose data.
Complete the following steps to resolve the problem.
-
Use the following command to scale down the Zookeeper StatefulSet:
oc -n <events-operator-namespace> scale sts/<events-operator-instance>-zookeeper --replicas=0
-
Use the following command to locate the Persistent Volume Claim for the failing Zookeeper:
oc -n <events-operator-namespace> get pvc | grep zookeeper
Following is the sample output:
NAME                           STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
data-my-instance-zookeeper-0   Bound    pvc-2ad9f783-c1ba-49ab-b48d-5d73cc303b26   2Gi        RWO            rook-ceph-block   145m
data-my-instance-zookeeper-1   Bound    pvc-80887966-883e-4a44-ab2e-c4e7cf9defec   2Gi        RWO            rook-ceph-block   145m
data-my-instance-zookeeper-2   Bound    pvc-f7670b6f-7ff3-4b91-93af-f6ae4fc9c50c   2Gi        RWO            rook-ceph-block   145m
-
Use the following command to delete the Persistent Volume Claim for the failing Zookeeper:
oc -n <events-operator-namespace> delete pvc <name-of-the-pvc>
-
Use the following command to verify that the Persistent Volume that was bound to the Persistent Volume Claim has been deleted:
oc -n <events-operator-namespace> get pv | grep <name-of-the-pvc>
If the command returns no output, the Persistent Volume has been deleted.
-
Scale the Zookeeper StatefulSet back up to the number of replicas that are required. A worked example of the complete procedure follows these steps.
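Following is a sketch of the complete procedure with sample values. It assumes a hypothetical namespace that is named events, the instance name my-instance from the sample output earlier in this procedure, that data-my-instance-zookeeper-1 is the Persistent Volume Claim of the failing pod, and that the StatefulSet originally ran three replicas. Substitute the values for your own deployment.
# Scale the Zookeeper StatefulSet down to zero replicas.
oc -n events scale sts/my-instance-zookeeper --replicas=0
# Delete the Persistent Volume Claim of the failing Zookeeper.
oc -n events delete pvc data-my-instance-zookeeper-1
# Confirm that the bound Persistent Volume is gone; no output means that it was deleted.
oc -n events get pv | grep data-my-instance-zookeeper-1
# Scale the StatefulSet back up to the required number of replicas.
oc -n events scale sts/my-instance-zookeeper --replicas=3
When the StatefulSet scales back up, a new Persistent Volume Claim and Persistent Volume are created for the replaced pod, and the Zookeeper pod can synchronize its data from the other members of the ensemble.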