Db2 Warehouse node failure behavior and expectations

When stateful applications such as Db2 Warehouse are deployed on Kubernetes (or OpenShift® by extension), managing the application lifecycle during cluster-wide events such as worker node failures is the responsibility of cluster administrators and not the application itself. In most cases, Kubernetes cannot automatically mitigate such events, and hence applications running on Kubernetes cannot automatically manage these events either.

The Db2 Warehouse service maintains high availability at the pod level. This includes automatic intra-container recovery if any component within the container fails, and automatic container recovery if the container itself fails. However, Db2 Warehouse cannot manage node failures; these must be handled by the cluster administrator.

When an OpenShift node enters an Unreachable state, Kubernetes evicts the pods on that node, but it does not immediately evict pods that are controlled by StatefulSets. For these pods, Kubernetes waits for up to five minutes by default (defined by the pod-eviction-timeout option of the kube-controller-manager) before evicting them, so that the node has a chance to recover.
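
For example, you can confirm the node status and identify the affected Db2 Warehouse pods before you act. In these commands, NODE-NAME and PROJECT are placeholders for your environment:

kubectl get nodes
kubectl describe node NODE-NAME
kubectl get pods -n PROJECT -o wide

An unreachable node typically reports a NotReady status, and the -o wide output shows which pods are scheduled on it.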

This behavior affects Db2 Warehouse because the main pod is controlled by a StatefulSet. If you want to speed up the eviction process when the node is unreachable, you can force delete the pod by using this command:

kubectl delete pod --grace-period=0 --force -n PROJECT DB2U-STATEFULSET-POD
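
If you are not sure of the pod name, you can first list the pods that are stuck in the Terminating or Unknown state. PROJECT is a placeholder for the project (namespace) that contains the Db2 Warehouse deployment:

kubectl get pods -n PROJECT -o wide | grep -E 'Terminating|Unknown'

The -o wide output also confirms that the stuck pod is scheduled on the unreachable node.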

Force deletion of the pod should be used with caution. According to the Kubernetes documentation, "Manual force deletion should be undertaken with caution, as it has the potential to violate the at most one semantics inherent to StatefulSet. StatefulSets may be used to run distributed and clustered applications which have a need for a stable network identity and stable storage. These applications often have configuration which relies on an ensemble of a fixed number of members with fixed identities. Having multiple members with the same identity can be disastrous and may lead to data loss (e.g. split brain scenario in quorum-based systems)."

Also, --force --grace-period=0 forces an ungraceful shutdown, which can leave open Db2 Warehouse file handles in the worker node's kernel and disrupt subsequent pod restarts.

For more background, review Force Delete StatefulSet Pods in the Kubernetes documentation before deciding on a course of action. That topic explains why failover on a node failure is not automatic and outlines some of the considerations.

Stateful applications need deterministic pod names and stable storage to preserve the integrity of their state. The StatefulSet controller tries to ensure that the specified number of pods, from ordinal 0 through n-1, are alive and ready, and that at any time at most one pod with a given identity is running in the cluster. This is referred to as at most one semantics.
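
As a quick check of these semantics, you can compare the desired and ready replica counts of the StatefulSet and list its pods. StatefulSet pod names always follow the STATEFULSET-NAME-ORDINAL pattern, and PROJECT is again a placeholder:

kubectl get statefulset -n PROJECT
kubectl get pods -n PROJECT -o wide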

After a node failure, a StatefulSet pod can fail over to an available node in one of three ways. A pod is not deleted automatically when a node is unreachable. The pods running on an unreachable node enter the Terminating or Unknown state after a timeout, and pods can also enter these states when you attempt graceful deletion of a pod on an unreachable node. The only ways in which a pod in such a state can be removed from the API server are as follows:

  • The node object is deleted (either by you, or by the Node Controller).
  • The kubelet on the unresponsive node starts responding, terminates the pod, and removes the entry from the API server.
  • You force delete the pod.

The best practice is to use the first or second approach. If a node is confirmed to be dead (for example, permanently disconnected from the network or powered down), you can delete the node. If the node is experiencing a network partition, try to resolve the issue or wait for it to resolve. When the partition heals, the kubelet completes the deletion of the pod and frees up its name in the API server.
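
For example, if you have confirmed that the node is permanently gone, you can delete the node object so that the stuck pod is removed and the StatefulSet can recreate it on another node. NODE-NAME and PROJECT are placeholders:

kubectl delete node NODE-NAME
kubectl get pods -n PROJECT -o wide

The second command lets you verify that the Db2 Warehouse pod is rescheduled and eventually becomes ready.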

Whether you use a controller to directly manage the pod object lifecycle or use a StatefulSet to manage pods, the unique pod identity requirement remains.