Understanding high availability and proxy nodes

Learn about the platform components that are required for setting up your cluster.

High Availability

For high availability, master and proxy nodes must be deployed redundantly. The following information discusses options for high availability in IBM® Cloud Private.

External load balancer

If possible, use a highly available external load balancer, such as an F5, to spread the traffic among the separate master or proxy node instances in the cluster. The external load balancer can be referenced by either a DNS name or an IP address, which is specified with the cluster_lb_address setting in config.yaml during installation. The cluster_CA_domain should be configured as a CNAME or A record that points at the external load balancer DNS name or IP address, and any TLS certificates should be issued for that domain. All nodes in the cluster must be able to resolve this name for internal communication. For more information, see Kubelet communication with API server.
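
As a rough sketch, and assuming the placeholder address and domain that are shown here, the relevant config.yaml entries might look like the following:

# config.yaml (fragment) - external load balancer settings; values are placeholders
# cluster_lb_address can be a DNS name or an IP address
cluster_lb_address: icp-lb.example.com
# cluster_CA_domain should resolve (CNAME or A record) to the load balancer
cluster_CA_domain: icp.example.com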

When you use an external load balancer, the master load balancer monitors the health of the Kubernetes API server on port 8001 on all master nodes.
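
HAProxy is not part of IBM Cloud Private, but as an illustrative sketch, a generic TCP load balancer configuration that spreads API traffic across three master nodes and health-checks port 8001 might look like the following; the addresses match the hosts file example later in this topic:

# haproxy.cfg (fragment) - illustrative only, not shipped with IBM Cloud Private
frontend icp-master
    bind *:8001
    mode tcp
    default_backend icp-masters

backend icp-masters
    mode tcp
    option tcp-check            # basic TCP connect check against port 8001
    server master1 192.168.30.10:8001 check
    server master2 192.168.30.11:8001 check
    server master3 192.168.30.12:8001 check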

Figure: Load balancing IBM Cloud Private master and proxy nodes

Virtual IP addresses

If a load balancer is not available, high availability of the master or proxy nodes can be achieved by using a virtual IP address in a subnet that is shared by the master and proxy nodes. IBM Cloud Private supports two types of virtual IP management solutions:

  1. Etcd (default)
  2. Keepalived

This setting is configured once, as part of the IBM Cloud Private installation, by using the vip_manager setting in config.yaml. For keepalived, the advertisements happen on the management interface, and the virtual IP address is held on the interface that is specified by cluster_vip_iface and proxy_vip_iface. If the virtual IP address accepts a high load of client traffic, keep the management network that carries the advertisements for master election separate from the data network that carries the client traffic.
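
As a sketch, and assuming placeholder addresses and interface names (and assuming that proxy_vip is the proxy counterpart of cluster_vip), the related config.yaml entries might look like the following:

# config.yaml (fragment) - virtual IP settings; values are placeholders
vip_manager: etcd            # or keepalived
cluster_vip: 192.168.30.100  # virtual IP address for the master nodes
cluster_vip_iface: eth0      # interface that holds the master virtual IP
proxy_vip: 192.168.30.101    # virtual IP address for the proxy nodes
proxy_vip_iface: eth0        # interface that holds the proxy virtual IP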

Note the limitation of using a virtual IP address: the virtual IP address must be in the same subnet as the master and proxy nodes.

Etcd (default)

Etcd is a distributed key-value store that is used internally by IBM Cloud Private to store state information. Etcd uses a distributed consensus algorithm that is called Raft. The etcd-based VIP manager uses the distributed key-value store to control which master or proxy node is the instance that holds the virtual IP address. The virtual IP address is leased to the leader, so all traffic is routed to that master or proxy node.

The etcd virtual IP manager is implemented as an etcd client that uses a key/value pair. The master or proxy node that currently holds the virtual IP address acquires a lease on this key/value pair with a TTL of 8 seconds. The other standby master or proxy nodes watch the lease key/value pair. If the lease expires without being renewed, the standby nodes assume that the first master failed and attempt to acquire their own lease on the key to become the new master node. The node that successfully writes the key brings up the virtual IP address. The algorithm uses a randomized election timeout to reduce the chance of a race condition where more than one node tries to become the leader of the cluster.
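
The following etcdctl commands are not part of the product and the key name vip/leader is a placeholder, but they sketch the lease pattern that is described here: an 8-second lease, a key that is bound to it, and a watch on the standby nodes.

# Illustrative only: the same lease pattern with the etcdctl v3 client (ETCDCTL_API=3)
etcdctl lease grant 8                              # returns a lease ID with an 8-second TTL
etcdctl put --lease=<lease-id> vip/leader master1  # leader writes the key under the lease
etcdctl lease keep-alive <lease-id>                # leader renews the lease while it is healthy
etcdctl watch vip/leader                           # standby nodes watch for the key to change or expire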

Gratuitous ARP is not used by the etcd virtual IP manager when it fails over, so any existing client connections to the virtual IP address fail after a failover until the client's ARP cache expires and the MAC address of the new holder of the virtual IP address is acquired. However, the etcd virtual IP manager avoids the use of multicast, which keepalived requires.
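
If clients hang after a failover, you can inspect and flush the stale entry on an affected Linux client with standard iproute2 commands; the address shown is a placeholder, and this is a troubleshooting aid rather than an IBM Cloud Private requirement.

# Check the cached MAC address for the virtual IP address (placeholder address)
ip neigh show to 192.168.30.100
# Flush the stale entry so that the next packet triggers a fresh ARP request
ip neigh flush to 192.168.30.100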

Keepalived

Keepalived provides simple and robust facilities for load balancing and high availability, and was originally used for high availability of virtual routers. Keepalived uses the Virtual Router Redundancy Protocol (VRRP) as an election protocol to determine which master or proxy node holds the virtual IP address. The keepalived virtual IP manager implements a set of checkers to dynamically and adaptively maintain and manage a load-balanced server pool according to its health. VRRP is a fundamental building block for router failover. The keepalived virtual IP manager implements a set of hooks into the VRRP finite state machine that provide low-level, high-speed protocol interactions.

To ensure stability, the keepalived daemon is split into three processes:

  1. A parent process, called the watchdog, that monitors the forked child processes.
  2. A child process for VRRP.
  3. Another child process for health checking.

The keepalived configuration that is shipped with IBM Cloud Private uses multicast address 224.0.0.18 and IP protocol number 112. This traffic must be allowed on the network segment where the master advertisements are made. Keepalived also generates a password for authentication between the master candidates, which is the MD5 sum of the virtual IP address.
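
For example, on nodes or devices that filter traffic with iptables, a rule along the following lines would permit the VRRP advertisements; this is an illustrative rule, not a setting that the installer applies:

# Illustrative firewall rule: allow VRRP (IP protocol 112) to the multicast address 224.0.0.18
iptables -A INPUT -p 112 -d 224.0.0.18/32 -j ACCEPT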

By default, keepalived uses the final octet of the virtual IP address as the virtual router ID (VRID). For example, for a virtual IP address of 192.168.10.50, it uses VRID 50. If any other devices on the management layer 2 segment use VRRP with the same VRID, it might be necessary to change the virtual IP address to avoid conflicts.
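
The following generic keepalived vrrp_instance stanza is not the configuration that IBM Cloud Private generates, but it shows where the VRID, the authentication password, and the virtual IP address appear:

# Generic keepalived example; not the generated IBM Cloud Private configuration
vrrp_instance VI_1 {
    state BACKUP
    interface eth0               # management interface that carries the advertisements
    virtual_router_id 50         # derived from the final octet of 192.168.10.50
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass s3cr3tpw       # IBM Cloud Private derives this from the MD5 sum of the virtual IP
    }
    virtual_ipaddress {
        192.168.10.50
    }
}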

kubelet communication with API server

Every node in the cluster runs a kubelet node agent that manages communication from the Kubernetes control plane. The kubelet communicates with the cluster_lb_address or cluster_vip on port 8001 (the Kubernetes API port), the same port as the API clients. As such, if this address is a DNS entry, it must be resolvable to the virtual IP or load balancer address by every node in the cluster.
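
A quick way to verify this from any node is to resolve the name and probe the API port; the host name is a placeholder, and the request might return an authentication error depending on the API server settings, but any TLS response confirms that the name resolves and the port is reachable.

# Verify that the cluster_CA_domain resolves on this node (placeholder host name)
nslookup icp.example.com
# Probe the Kubernetes API port through the load balancer or virtual IP
curl -k https://icp.example.com:8001/healthz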

Dedicated proxy nodes and shared ingress controller

IBM Cloud Private installation defines a node role, called proxy nodes, that is dedicated to running the shared IBM Cloud Private ingress controller. These nodes serve as a layer 7 reverse proxy for the workloads that are running in the cluster. In situations where an external load balancer, such as an F5, can be used, that is the recommended configuration. Proxy nodes can be difficult to secure and scale, and using a load balancer avoids additional network hops through the proxy nodes to the pods that are running the actual application.

If you plan to use an external load balancer, set up the cluster to label the master nodes as proxy nodes by using the hosts file before installation, as shown in the following example. This marks the master nodes with the additional proxy label, and the shared ingress controller starts on the master nodes. This ingress controller can generally be ignored for northbound traffic, or used for lightweight applications that are exposed southbound, such as additional administrative consoles for some applications that are running in the cluster.

[master]
192.168.30.10
192.168.30.11
192.168.30.12

[proxy]
192.168.30.10
192.168.30.11
192.168.30.12

If an ingress controller and ingress resources are required to aggregate several services by using the built-in ingress resources, install additional isolated ingress controllers for the namespace by using the included Helm chart. Then expose these resources individually through the external load balancer.
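
As a sketch, an ingress resource that aggregates two services under one host might look like the following; the names, namespace, host, and paths are placeholders, the ingress class annotation depends on how the isolated ingress controller is deployed, and the API version matches the Kubernetes releases that IBM Cloud Private ships with:

# Illustrative ingress resource; names, host, and paths are placeholders
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: app-aggregate
  namespace: my-namespace
  annotations:
    kubernetes.io/ingress.class: "my-isolated-ingress"   # class of the isolated controller
spec:
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /web
        backend:
          serviceName: web-service
          servicePort: 80
      - path: /api
        backend:
          serviceName: api-service
          servicePort: 8080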