Zookeeper pod hangs in a CrashLoopBackOff state

Upgrading the Events operator results in a Zookeeper pod hanging in a CrashLoopBackOff state.

Symptom

When you upgrade the Events operator, a Zookeeper pod enters a CrashLoopBackOff state and the following exception is visible in the Zookeeper logs:

2022-01-28 20:31:38,850 ERROR Unable to load database on disk (org.apache.zookeeper.server.quorum.QuorumPeer) [main]
java.io.IOException: No snapshot found, but there are log entries. Something is broken!
    at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:240)
    at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:240)
    at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:901)
    at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:887)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:205)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:123)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:82)
2022-01-28 20:31:38,851 ERROR Unexpected exception, exiting abnormally (org.apache.zookeeper.server.quorum.QuorumPeerMain) [main]
java.lang.RuntimeException: Unable to run quorum server
    at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:938)
    at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:887)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:205)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:123)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:82)
Caused by: java.io.IOException: No snapshot found, but there are log entries. Something is broken!
    at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:240)
    at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:240)
    at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:901)
    ... 4 more
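
If a pod reports the CrashLoopBackOff status in the oc get pods output, you can retrieve these logs with a command similar to the following. Because the container restarts repeatedly, the --previous flag might be needed to read the output of the crashed container; the pod name is a placeholder:

     oc -n <events-operator-namespace> logs <failing-zookeeper-pod> --previous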

Cause

The Zookeeper data directory does not contain a snapshot file, but it does contain transaction log entries that were persisted by a previous run of Zookeeper. Rather than risk restoring an inconsistent database, Zookeeper exits at startup.
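
In this state, the Zookeeper data directory on the persistent volume (commonly /var/lib/zookeeper/data/version-2, although the exact path depends on your image) contains transaction log files but no snapshot files. The file name in the following illustration is an example only:

     version-2/
       log.100000001        <- log entries from the previous run
       (no snapshot.* file is present)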

Resolving the problem

Note: Use these instructions only when a single Zookeeper pod is in the CrashLoopBackOff state. Applying them while multiple Zookeeper pods are in this state can result in data loss.

Complete the following steps to resolve the problem.

  1. Use the following command to scale the Zookeeper StatefulSet down to zero replicas:

     oc -n <events-operator-namespace> scale sts/<events-operator-instance>-zookeeper --replicas=0
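
    Wait until all the Zookeeper pods have terminated before you continue. For example:

     oc -n <events-operator-namespace> get pods | grep zookeeper

    The command returns no Zookeeper pods when the scale-down is complete.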
    
  2. Use the following command to locate the Persistent Volume Claim of the failing Zookeeper pod. Each claim name ends with the ordinal of the pod that uses it; for example, the pod my-instance-zookeeper-1 uses the claim data-my-instance-zookeeper-1:

     oc -n <events-operator-namespace> get pvc | grep zookeeper
    

    The following is sample output:

     NAME                           STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
     data-my-instance-zookeeper-0   Bound    pvc-2ad9f783-c1ba-49ab-b48d-5d73cc303b26   2Gi        RWO            rook-ceph-block   145m
     data-my-instance-zookeeper-1   Bound    pvc-80887966-883e-4a44-ab2e-c4e7cf9defec   2Gi        RWO            rook-ceph-block   145m
     data-my-instance-zookeeper-2   Bound    pvc-f7670b6f-7ff3-4b91-93af-f6ae4fc9c50c   2Gi        RWO            rook-ceph-block   145m
    
  3. Use the following command to delete the Persistent Volume Claim of the failing Zookeeper pod:

     oc -n <events-operator-namespace> delete pvc <name-of-the-pvc>
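
    For example, if the failing pod is my-instance-zookeeper-1 from the sample output in step 2, the command is:

     oc -n <events-operator-namespace> delete pvc data-my-instance-zookeeper-1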
    
  4. Use the following command to verify that the Persistent Volume that was bound to the deleted Persistent Volume Claim has also been removed:

     oc -n <events-operator-namespace> get pv | grep <name-of-the-pvc>
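
    Whether the Persistent Volume is removed automatically depends on the reclaim policy of your storage class. Dynamically provisioned volumes typically use the Delete policy and are removed together with their claim; a volume with the Retain policy remains in the Released state and must be deleted manually:

     oc delete pv <name-of-the-persistent-volume>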
    
  5. Scale the Zookeeper StatefulSet back up to the required number of replicas, as shown in the following example.
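
    For example, to return to the three replicas shown in the sample output in step 2, substitute your own replica count as needed:

     oc -n <events-operator-namespace> scale sts/<events-operator-instance>-zookeeper --replicas=3

    The StatefulSet creates a new Persistent Volume Claim for the replaced pod, and the empty Zookeeper replica synchronizes its data from the remaining members of the quorum.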