High availability IBM Cloud Private clusters

You can configure high availability (HA) for IBM® Cloud Private master and proxy nodes.

You can configure HA for only the master nodes, only the proxy nodes, or for both types of nodes. To reduce the infrastructure requirements of your cluster, you can assign both master and proxy roles to the HA nodes. For the master nodes, the virtual IP manager must be in the same subnet.

Note: To ensure availability, configure more than one proxy node and 3 or 5 master nodes..

You must set up shared storage across your master nodes. IBM Cloud Private requires shared storage for Docker registry. The storage must be a POSIX-compliant shared file system that is located outside of your IBM Cloud Private cluster. Your master nodes must have read/write access to the file system. The file system must be mounted as local hostPath on your master nodes. The following directory must be mounted on your shared storage:

/var/lib/registry - The /var/lib/registry directory stores images in the private image registry and keeps the images that are synchronized on all master nodes.
/var/lib/icp/audit - The /var/lib/icp/audit directory stores pod usage for license check.

Note: You must set the file parameter to 0755 for the directory.

Requirements for master HA

These requirements are for master HA only. Proxy HA is not affected by the number of master nodes.

For N number of masters in a cluster, the cluster can tolerate up to (N-1)/2 permanent failures. For example, in a cluster that has three masters, if one master fails, then the fault tolerance is as (3-1)/2=1. You must aim for a fault tolerance of one or more.

You must have an odd number of masters in your cluster. Adding extra master nodes provides a higher tolerance for failure. You can review how the fault tolerance in a cluster is affected by the number of the master nodes in Table 1:Fault tolerance in HA clusters.

Table 1. Fault tolerance in HA clusters
Number of the master nodes	Failure tolerance
1	0
3	1
5	2
7	3

For HA, master and proxy nodes must be redundantly deployed (at different physical locations, if possible) to tolerate hardware failures and network partitions. The following options can be considered for network HA in IBM Cloud Private:

Node assignment and communication in HA clusters

If an external load balancer is not available, HA of the master and proxy nodes can be achieved by using a virtual IP address, which is in a subnet that is shared by the master and proxy nodes. In HA IBM Cloud Private clusters, the virtual IP manager controls master and proxy node assignment.

The virtual IP manager controls which nodes serve master and proxy roles by assigning virtual IP addresses to those nodes. The virtual IP manager in IBM Cloud Private facilitates communication between nodes through the network interface controller (NIC). The virtual IP manager assigns the cluster_vip IP address to an available master node and assigns the proxy_vip IP address to an available proxy node. These nodes act as the leading master and proxy node. The cluster_vip IP address must be on the NIC that you specify in the vip_iface parameter. Similarly, the proxy_vip IP addresses must be on the NIC that you specify in the proxy_vip_iface parameter.

The virtual IP manager monitors the health of the cluster's master and proxy nodes. If the leading master or proxy node is no longer available, the virtual IP manager selects an available node and assigns it to the correct virtual IP address.

Note: For a HA environment, you must set up at least one of the following parameters: cluster_vip, cluster_lb_address.

For more details on setting up HA during installation, see HA installation options.

VIP management options

There are three options to set up virtual IP management:

Etcd (default)
Ucarp
Keepalived

You can set up the option that you choose during the installation of your IBM Cloud Private cluster by using the vip_manager setting that is in the config.yaml file. For ucarp and keepalived, the advertisements happen on the management interface, and the virtual IP is held on the interface that is provided by the cluster_vip_iface and proxy_vip_iface parameters. In situations where the virtual IP accepts a high load of client traffic, the management network that advertises for the master election must be separate from the data network that accepts client traffic.

Note: Consider the following when you use a virtual IP address:

At any point in time, only one master or proxy node holds the lease for the virtual IP address.
When you use a virtual IP, traffic is not load balanced among all replicas. Using a virtual IP requires that all candidate nodes use a cluster_vip_iface or a proxy_vip_iface interface on the same subnet.
Any long-running and stateful TCP connection from clients is broken during a failover and must be reestablished.

Etcd

IBM Cloud Private uses etcd internally as a distributed key value store to store state information. Etcd uses a distributed census algorithm that is called raft. The etcd-based VIP manager uses the distributed key value store to control which master or proxy node is the instance that holds the virtual IP address. The virtual IP address is leased to the leader so that all traffic is routed to that master or proxy node.

The etcd virtual IP manager is implemented as an etcd client that uses a key-value pair. The current master or proxy node that holds the virtual IP address acquires a lease to this key-value pair with a TTL of 8 seconds. The other standby master or proxy nodes observe the lease key-value pair.

If the lease is not renewed, and it expires, the standby nodes assume that the first master failed and attempt to acquire their own lease to the key to be the new master node. The master node that is successful in writing the key, brings up the virtual IP address. The algorithm uses randomized election timeout to reduce the chance of any racing condition where one or more nodes try to become the leader of the cluster.

Note: Gratuitous ARP is not used with the etcd virtual IP manager when it fails over. Therefore, any existing client connections to the virtual IP address, after the virtual IP address fails, fail until the client’s ARP cache expires and the MAC address for the new holder of the virtual IP is acquired. However, the etcd virtual IP manager avoids the use of multicast as is required by ucarp and keepalived requires.

Ucarp

Ucarp is an implementation of the Common Address Redundancy Protocol (CARP) that is ported to Linux. ucarp allows the master node to advertise by using the multicast address 224.0.0.18 that it owns a particular IP address. Each node sends out an advertisement on its network interface that it can have a virtual IP address. The advertisement is sent every few seconds. This amount of time between advertisements is called the advbase (advertise base). Each master node sends a skew value with that CARP message. This value is similar to its priority of holding that IP address, which is the advskew (advertising skew). When two or more systems advertise at one-second intervals (advbase=1), the system with the lower advskew wins.

The node that has the lower IP address, breaks any ties. For HA, moving one address between several nodes in this manner enables to survive the outage of a host. However, this behavior enables more availability and not more scalability.

A master node becomes a master if any one of the following conditions occurs:

No one else advertises for a duration of three times its own advertisement interval (advbase). That is, 3 x advbase.
You specify the --preempt option, and the master node hears a master that has a longer advbase, or has the same advbase but, a higher advskew.

An existing master node becomes a backup if any one of the following conditions occurs:

Another master advertises a shorter advbase, or the same advbase but, a lower advskew.
Another master advertises the same advbase, and has a lower IP address.

After failover, ucarp sends a gratuitous ARP message to all of its neighbors so that they can update their ARP caches with the MAC address of the new master.

Keepalived

Keepalived provides simple and robust facilities for load balancing and HA. Keepalived uses Virtual Router Redundancy Protocol (VRRP) as an election protocol to determine the master or proxy node that holds the virtual IP address. The keepalived virtual IP manager implements a set of checkers to dynamically and adaptively maintain and manage load balanced server pool according to the health. VRRP is a fundamental brick for failover. The keepalived virtual IP manager implements a set of hooks to the VRRP finite state that provides low-level and high-speed protocol interactions.

To ensure stability, the keepalived daemon is split into the following parts:

A parent process called the watchdog that is in charge of the forked children process monitoring
A child process for VRRP
Another child process for health checking

The keepalived configuration in IBM Cloud Private uses the multicast address 224.0.0.18 and the IP protocol number 112. This address and protocol number must be allowed in the network segment where the master advertisements are made. Keepalived also generates a password for authentication between the master candidates, which is the MD5 sum of the virtual IP. Keepalived, by default, uses the final octet of the virtual IP address as the virtual router ID (VRID). For example, for a virtual IP address of 192.168.10.50, it uses VRID 50. If there are any other devices that use VRRP on the management Layer 2 segment, and these devices use the same VRID, it might be necessary to change the virtual IP address to avoid conflicts.