How Autorecovery Utilizes Consistent Hashing for High Availability



Docker, Kubelet, Kube-proxy, and the desired Kubernetes network plugin are critical components that must be functional for a Kubernetes worker node to be healthy. Over time, these components can break down, leaving the worker node in a nonfunctional state. Nonfunctional worker nodes decrease the total capacity of the cluster and can result in downtime for customer workloads. IBM Cloud Container’s Autorecovery system detects when worker nodes enter a nonfunctional state and recovers them back to a functioning state.

The Autorecovery system analyzes results from customer-configured checks, such as HTTP server checks on each worker node, or from the node’s recorded Kubelet health statuses. Autorecovery keeps a history of the check results and, after a configured number of consecutive failures, marks the node as failed, similar to how liveness probes work for pods in Kubernetes. Once a node is marked as failed, a corrective action is taken to bring it back to a functional state. An example of a corrective action is an OS reload, which brings up a fresh machine that is then reconfigured with all the necessary Kubernetes worker node services (Kubelet, Docker, etc.).
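The consecutive-failure logic described above can be sketched as follows. This is a minimal illustration under assumed names (`NodeHealthTracker`, a threshold of 3), not the actual Autorecovery code:

```python
# Sketch of tracking check results per node and marking a node as failed
# after a configured number of consecutive failures. Names and the threshold
# value are assumptions for illustration, not the real implementation.

FAILURE_THRESHOLD = 3  # assumed: configured number of consecutive failures


class NodeHealthTracker:
    """Keeps a per-node streak of failed checks; a success resets the streak."""

    def __init__(self, threshold=FAILURE_THRESHOLD):
        self.threshold = threshold
        self.consecutive_failures = {}  # node name -> current failure streak

    def record(self, node, check_passed):
        """Record one check result; return True if the node is now marked failed."""
        if check_passed:
            self.consecutive_failures[node] = 0  # any success resets the streak
            return False
        self.consecutive_failures[node] = self.consecutive_failures.get(node, 0) + 1
        return self.consecutive_failures[node] >= self.threshold


tracker = NodeHealthTracker()
tracker.record("worker-1", False)
tracker.record("worker-1", False)
tracker.record("worker-1", False)  # third consecutive failure: node marked failed
```

Once `record` returns `True` for a node, a corrective action such as an OS reload would be triggered for it.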

One problem faced when designing the Autorecovery system is that a component of the system may itself be running on a node that becomes nonfunctional. If an Autorecovery component is co-located on a node it is responsible for monitoring, an unhealthy node can cause that component to crash. In that case, the node would never be recovered, because the monitoring component went down along with the node’s health. To ensure Autorecovery is always operational in the cluster, the Autorecovery system runs as a multi-instance deployment spread across nodes in the cluster. Each instance has Kubernetes readiness checks to determine whether it is healthy. Healthy instances split the monitoring duties for all the nodes in the cluster among themselves using consistent hashing.

At a high level, consistent hashing splits the range of a keyspace across a set of members. Keys are mapped deterministically onto the keyspace by running the input through a hashing algorithm. Using consistent hashing, an Autorecovery instance can determine the subset of nodes it is responsible for monitoring by dynamically discovering all the nodes in the system and the list of healthy peers. These lists of nodes and healthy peers are updated periodically, allowing the application to tolerate individual instance and node churn. This design allows the IBM Container’s Autorecovery system to dynamically scale and split monitoring duties among instances. In addition, the system can continue to function as long as there is at least one healthy instance in the cluster.
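The splitting of monitoring duties can be illustrated with a small consistent-hash ring. This is a generic sketch of the technique, not the actual Autorecovery implementation; the class and instance names are assumptions. Each healthy instance owns many points ("virtual nodes") on a ring, and a worker node is monitored by the instance whose point follows the node's hash on the ring:

```python
# Generic consistent-hash ring sketch (assumed names, not the Autorecovery code).
# Each instance is hashed onto the ring many times ("virtual nodes") so that
# duties stay roughly balanced; a worker node belongs to the instance owning
# the next point clockwise from the node's hash.

import bisect
import hashlib


def _hash(key):
    """Deterministically map a string onto the ring's integer keyspace."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class ConsistentHashRing:
    def __init__(self, instances, vnodes=100):
        # Place `vnodes` points per instance on the ring, sorted by position.
        self._ring = sorted(
            (_hash(f"{inst}-{i}"), inst) for inst in instances for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def owner(self, node):
        """Return the instance responsible for monitoring this worker node."""
        # Find the first ring point at or after the node's hash, wrapping around.
        idx = bisect.bisect(self._keys, _hash(node)) % len(self._ring)
        return self._ring[idx][1]


# Each instance would periodically rebuild the ring from the current healthy
# peers, then monitor only the worker nodes it owns:
ring = ConsistentHashRing(["instance-a", "instance-b", "instance-c"])
workers = ["worker-1", "worker-2", "worker-3"]
my_nodes = [n for n in workers if ring.owner(n) == "instance-a"]
```

The key property is that when an instance disappears from the healthy-peer list, only the worker nodes it owned are reassigned; every other node keeps its current monitor, which is what makes the system tolerant of instance churn.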

You can read more about how to utilize IBM Container’s Autorecovery system in the IBM Cloud documentation.

Special thanks to Kodie Glosser, Lucas Copi, Mark Franceschini, and Pratik Mallya for making the Autorecovery system a reality.
