Configuring failure detection in ZooKeeper

Configure failure detection in ZooKeeper to reduce the temporary unavailability that can follow an unexpected node failure, a network error, or the election of a new leader node after the current leader node fails.

When a ZooKeeper node fails unexpectedly or a networking error causes a particular ZooKeeper node to become unreachable, you might observe a temporary period of unavailability. This period of unavailability is a result of the following issues:
  • ZooKeeper attempting to contact a node that it does not yet know has failed or is unreachable.
  • ZooKeeper electing a new leader node after the failure is detected.

The zoo.cfg file contains the following properties that are relevant to failure detection:
  • tickTime
  • initLimit

Retain the default value for tickTime, and tune initLimit for your deployment and environment to minimize the impact of node failures.

The initLimit property defines how many ticks to wait during connection attempts to the ZooKeeper leader node. For example, if initLimit is set to 5 and tickTime is 2000 milliseconds, then a ZooKeeper node waits for 5 * 2000 = 10000 milliseconds (10 seconds) before giving up on the connection and deciding to elect a new leader.
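The arithmetic above can be sketched as a small Python helper for comparing candidate settings. The function name connection_timeout_ms is hypothetical and used only for illustration; it is not part of ZooKeeper. The values passed in are the ones from the example above.

```python
def connection_timeout_ms(init_limit: int, tick_time_ms: int) -> int:
    """Return the wait time in milliseconds: initLimit ticks of tickTime each.

    Hypothetical helper for illustration; ZooKeeper does this internally.
    """
    if init_limit <= 0 or tick_time_ms <= 0:
        raise ValueError("initLimit and tickTime must be positive")
    return init_limit * tick_time_ms


if __name__ == "__main__":
    # tickTime of 2000 ms is taken from the example in the text.
    tick_time_ms = 2000
    for init_limit in (5, 10, 20):
        timeout = connection_timeout_ms(init_limit, tick_time_ms)
        print(f"initLimit={init_limit}: {timeout} ms ({timeout / 1000:.0f} s)")
```

Running the loop with a few candidate initLimit values shows how quickly the wait grows, which may help when weighing faster failure detection against tolerance for slow nodes.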

Setting a lower initLimit decreases the amount of time ZooKeeper waits before deciding that a node has failed. However, do not set initLimit too low: ZooKeeper then treats slower nodes as failed, which ultimately reduces overall availability.

To configure failure detection in ZooKeeper:

  1. Log in to the server where the ZooKeeper node is installed.
  2. To stop the ZooKeeper node, go to <install_dir>/MailboxUtilities/bin, and type ./
  3. Go to the <install_dir>/zookeeper/conf directory.
  4. Open the zoo.cfg file.
  5. Change the value of initLimit as required.
  6. Restart the ZooKeeper node for the change to take effect. Go to <install_dir>/MailboxUtilities/bin, and type ./
  7. Repeat the steps on all ZooKeeper nodes.
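
Because the same edit is repeated on every node, steps 4 and 5 could be scripted. The following is a minimal Python sketch, assuming zoo.cfg uses plain key=value lines. The function set_init_limit and the sample file contents are hypothetical and for illustration only. The sketch works on the file text and does not stop or restart the node, so steps 2 and 6 still apply.

```python
import re

# Matches an active (uncommented) initLimit line and captures "initLimit=".
_INIT_LIMIT = re.compile(r"^([ \t]*initLimit[ \t]*=[ \t]*)\S+[ \t]*$", re.MULTILINE)


def set_init_limit(cfg_text: str, new_value: int) -> str:
    """Return cfg_text with initLimit set to new_value.

    Replaces an existing initLimit line, or appends one if none is present.
    Comments and all other properties are left unchanged.
    """
    if new_value <= 0:
        raise ValueError("initLimit must be a positive number of ticks")
    new_text, count = _INIT_LIMIT.subn(rf"\g<1>{new_value}", cfg_text)
    if count == 0:
        # No active initLimit line found: append one on its own line.
        sep = "" if cfg_text.endswith("\n") or not cfg_text else "\n"
        new_text = f"{cfg_text}{sep}initLimit={new_value}\n"
    return new_text


if __name__ == "__main__":
    # Illustrative contents only; a real zoo.cfg has more properties.
    sample = "tickTime=2000\ninitLimit=10\nsyncLimit=5\n"
    print(set_init_limit(sample, 5))
```

To apply it to a real file, read zoo.cfg into a string, pass it through set_init_limit, and write the result back after keeping a backup copy of the original.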