Node placement considerations

Background

When a workload—such as Cloud Pak for Integration—is deployed onto the Red Hat OpenShift Container Platform, the OpenShift scheduler decides which node in the cluster is optimal for each pod to run on. The scheduler is designed to make the best scheduling decisions possible, so that cluster administrators don't have to manage this themselves.

In summary, the scheduler places pods on nodes where:

  • the node has enough free resources to run the workload.
  • the pod's node selector, affinity, and anti-affinity rules (which are defined in the pod's specification) are satisfied. These rules can, for example, spread the members of replica sets and stateful sets evenly across nodes and availability zones.

For more detail about the scheduler's algorithm and the many factors it takes into consideration, see https://kubernetes.io/docs/concepts/scheduling-eviction/kube-scheduler/

Resource requests and limits

Containers in a Kubernetes system such as OpenShift can have resource requests and limits applied to them, as described at https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/. In summary:

  • A pod's resource request values (resources.requests) describe the minimum amount of CPU and memory that must be available on a node—in other words, not requested by the pods already scheduled to that node—in order for the pod to be scheduled to the node. When the pod executes on the node at run time, it is guaranteed to be able to consume at least as much as it requested if it is busy and needs those resources. If the pod is less busy, it may not need all of its requested resources, and in this case the "spare" resources are available for use by other pods on the node. Conversely, if the pod is very busy, it is possible, but not guaranteed, that it can consume more than it requested. This is possible if there are unallocated resources on its node, or if other pods on that node are not using all of the resources they requested.
  • A pod's resource limits (resources.limits) describe the maximum resources that the pod is permitted to consume. These are equal to or greater than its resource request values. It is common practice for containers to be deployed with limit values set higher than request values; the rationale is that this flexibility allows resources to be used efficiently.
  • When the sum of the resource limits for all containers scheduled to a node exceeds the amount of CPU and memory physically available on that node, the node is said to be overcommitted. In this case, if multiple pods on the node are busy, some may not be able to consume as much as their resources.limits, although all are guaranteed to be able to consume their resources.requests.

This condition is handled differently for CPU and memory. If Red Hat OpenShift needs to restrict the amount of CPU a pod is using, it can do so without killing the pod process. However, if a pod is using more memory than it requested and there is contention for memory, the pod may be killed and evicted from the node. OpenShift may then attempt to schedule the pod to another node.
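For illustration, this is how requests and limits appear in a container specification. The following is a minimal sketch; the pod name, image, and values are illustrative rather than taken from any Cloud Pak deployment:

    apiVersion: v1
    kind: Pod
    metadata:
      name: requests-limits-example           # illustrative name
    spec:
      containers:
      - name: app
        image: registry.example.com/app:1.0   # illustrative image
        resources:
          requests:
            cpu: "1"        # the scheduler reserves 1 vCPU for this container on the chosen node
            memory: 512Mi   # guaranteed memory
          limits:
            cpu: "3"        # the container may burst to 3 vCPUs if the node has spare capacity
            memory: 1Gi     # exceeding this limit causes the container to be killed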

When to control the placement of workloads on nodes

Under most circumstances, it's best to allow the OpenShift scheduler to make its own decisions: the scheduler was designed using best practices and will, more often than not, make the best possible decisions for your use case. However, there are times when you may need to control the scheduling of Cloud Pak for Integration pods to particular nodes, or to constrain the scheduler's algorithm. These scenarios include:

Allowing a group of pods to contend for a fixed amount of CPU.

Overcommitting the resources available on a node (as described above) has the potential to cause issues. For this reason, you may need to control which workloads can compete with each other for resources by placing a limit on how much CPU a group of pods can consume as a whole.

To achieve this, you must control which node the pods are scheduled to. For example, if 10 pods, each of which has requests.cpu=1 and limits.cpu=3, are scheduled to a node that has 20 CPUs, the 10 pods in aggregate cannot use more than 20 CPUs, even though the sum of their limits is 30.
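A minimal sketch of this pattern, assuming the 20-CPU node carries a hypothetical nodeuse=capped label (the deployment name, image, and label are illustrative):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: capped-workload                   # illustrative name
    spec:
      replicas: 10
      selector:
        matchLabels:
          app: capped-workload
      template:
        metadata:
          labels:
            app: capped-workload
        spec:
          nodeSelector:
            nodeuse: capped                   # assumed label on the 20-CPU node
          containers:
          - name: app
            image: registry.example.com/app:1.0   # illustrative image
            resources:
              requests:
                cpu: "1"    # 10 pods x 1 vCPU = 10 vCPUs requested in aggregate
              limits:
                cpu: "3"    # aggregate limits are 30 vCPUs, but the node caps actual use at 20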

Note: Why doesn't Kubernetes provide better facilities for managing resource contention? Vertical scaling is generally considered an anti-pattern in Kubernetes. A pattern more in keeping with the spirit of Kubernetes is to run small containers with limits.cpu equal to requests.cpu and to scale horizontally as load increases, as sketched below.
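A sketch of this horizontal-scaling pattern, using the Kubernetes HorizontalPodAutoscaler with illustrative names and thresholds:

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: app-hpa                       # illustrative name
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: capped-workload             # the hypothetical deployment from the previous example
      minReplicas: 2
      maxReplicas: 10
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 70        # add replicas when average CPU utilization exceeds 70% of requests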

Optimizing IBM licensing scenarios

Background: IBM's licensing approach is described in detail on the IBM Container Licenses page. The vCPU Capacity Counting Methodology described there is based on the resources.limits.cpu values of the containers making up a pod. While the OpenShift scheduler ensures that the sum of the resources.requests.cpu for the pods deployed to a node never exceeds the actual number of vCPUs on that node, in a typical deployment the sum of the resources.limits.cpu values for the pods deployed to a node may be much higher than the actual number of vCPUs on that node.

The resources.limits.cpu values for Cloud Pak for Integration pods are typically set higher than their resources.requests.cpu values, so that the containers can use more CPU than the absolute minimum they require, to account for varying load. The vCPU Capacity Counting Methodology caps the total vCPU capacity calculated for each "IBM Program"—for each node in the cluster—at the actual vCPU capacity of that worker node. In the context of the licensing methodology, Cloud Pak for Integration is one IBM Program; other Cloud Paks are considered to be different IBM Programs.

Implications:

  • Pods are counted based on their theoretical maximum CPU consumption. Given the background explained above, if a workload is deployed across all of the nodes in a cluster, it is likely that each node in the cluster will contain some Cloud Pak for Integration pods. These pods may be idle much of the time, consuming only a fraction of the CPU value set in resources.limits.cpu. Even if the workload is busy, other workloads running on the nodes may be contending for CPU, meaning the workload cannot actually obtain all of the CPU specified by resources.limits.cpu. Because the pods are distributed across the cluster, the per-node vCPU capacity cap described above has no effect, so the license count is the sum of resources.limits.cpu for all of the pods in Cloud Pak for Integration. However, if the Cloud Pak for Integration workload is gathered together on a small set of nodes, the capping described above does apply, and the license count is bounded by the vCPU capacity of those nodes.
  • Pods from more than one IBM Program on the same node. One implication of the Container Licensing approach is that if there are pods from more than one IBM Program on the same node, the vCPU capping applies individually to each IBM Program. Thus, if a particular worker node has 20 vCPUs and is running pods from Cloud Pak for Integration whose total resources.limits.cpu is 20 or more, at the same time as pods from a different 'IBM Program' whose total resources.limits.cpu is also 20 vCPUs or more, then the vCPU Capacity Counting Methodology counts this as 20 vCPUs of Cloud Pak for Integration AND 20 vCPUs of the other 'IBM Program'. Such a situation is clearly inefficient in terms of license use. A more efficient use of license resources is to deploy these workloads on separate nodes—each with 10 vCPUs—because this results in half the license count (10 vCPUs of Cloud Pak for Integration plus 10 vCPUs of the 'other program') despite the same amount of CPU resource being available.

OpenShift Container Platform licensing scenarios

As part of the purchase of IBM Cloud Paks, customers are provided with a fixed number of OpenShift Container Platform (OCP) license entitlements, as defined in the License Information document for Cloud Pak for Integration.

Red Hat licensing for OCP is calculated according to the worker node vCPU count (worker node capacity), and licenses are associated with clusters and worker nodes. When multiple IBM software products run on the same worker node, each of them consumes the OCP license entitlement for the entire worker node, which is inefficient from the licensing perspective. It is more efficient to run multiple IBM software products on separate worker nodes, so that each product consumes only the OCP licenses for its own nodes. Separating workloads by node also makes Red Hat license compliance easier to demonstrate.

Avoiding a situation where workloads compete for resources or otherwise conflict

Sometimes you deploy multiple workloads onto a cluster and don't want them to compete for resources. One example of this situation would be when you deploy two different Cloud Paks—such as Cloud Pak for Integration and Cloud Pak for Data—onto the same cluster.

Separating workloads that are known to use large amounts of ephemeral storage

Kubernetes emptyDir volumes, the writable layers of containers, and some log files are written to the filesystem of a node during the pod's lifetime, as described in the OpenShift documentation. There is a technology preview feature in OCP, enabled by the LocalStorageCapacityIsolation feature gate, that isolates this ephemeral storage, but it is disabled by default. Therefore, if the pods scheduled to a node use more ephemeral storage than that node provides, the node is tainted, which causes the eviction of all pods that don't tolerate the taint (see Taints and tolerations for more information).

For this reason, if particular workloads are known to require large amounts of ephemeral storage, it may be necessary to associate them with specific nodes so that they are not scheduled to the same nodes as other workloads.
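Where ephemeral storage accounting is enabled, containers can also declare ephemeral-storage requests and limits in the standard resources block, which gives the scheduler and kubelet something to enforce. A minimal sketch with illustrative names and values:

    apiVersion: v1
    kind: Pod
    metadata:
      name: scratch-heavy-example             # illustrative name
    spec:
      containers:
      - name: app
        image: registry.example.com/app:1.0   # illustrative image
        resources:
          requests:
            ephemeral-storage: 2Gi    # schedule only to nodes with 2Gi of local storage available
          limits:
            ephemeral-storage: 8Gi    # exceeding this can cause the pod to be evicted
        volumeMounts:
        - name: scratch
          mountPath: /tmp/scratch
      volumes:
      - name: scratch
        emptyDir: {}                  # node-local scratch space, counted as ephemeral storage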

Tools for controlling how a workload is deployed to nodes

Kubernetes provides three tools for workload placement: affinity and anti-affinity, taints and tolerations, and nodeSelectors. If you would like to learn more about these options, see the Reference section below.

However, Cloud Pak for Integration recommends that you use the tools provided by the OpenShift Container Platform (OCP): project node selectors and a default cluster-wide nodeSelector.

OCP project node selectors

A project node selector applies a nodeSelector to any pod that is deployed within the project.

If Cloud Pak for Integration workloads are deployed into projects that have project node selectors, the node selectors constrain the Cloud Pak for Integration workloads to only those nodes with labels that match the nodeSelector. This does not stop other workloads from being scheduled onto those nodes if they are deployed into projects that do not have a project node selector.

This approach works no matter how the pods are created and managed, but it functions best with a workload that is managed by an operator.
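For example, a project node selector is expressed as the openshift.io/node-selector annotation on the project's namespace; the project name and label in this sketch are illustrative:

    apiVersion: v1
    kind: Namespace
    metadata:
      name: cp4i-apps                              # illustrative project name
      annotations:
        openshift.io/node-selector: nodeuse=cp4i   # every pod in this project inherits this nodeSelector

The selector can also be set when the project is created, for example with oc adm new-project cp4i-apps --node-selector="nodeuse=cp4i".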

For additional information, see the OpenShift documentation.

Default cluster-wide nodeSelector

The default cluster-wide node selector applies a default node selector to any pod that is deployed into a project that does not have a project node selector. This can be used to prevent non-Cloud Pak for Integration workloads from being placed onto the nodes that are reserved for Cloud Pak for Integration use.
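On OpenShift 4, the default cluster-wide node selector is set in the cluster-scoped Scheduler resource; a minimal sketch, assuming the illustrative nodeuse=general label:

    apiVersion: config.openshift.io/v1
    kind: Scheduler
    metadata:
      name: cluster                           # the single cluster-scoped Scheduler resource
    spec:
      defaultNodeSelector: nodeuse=general    # applied to pods in projects that have no project node selector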

Note: The default cluster-wide nodeSelector is not supported on IBM's managed service, Red Hat OpenShift on IBM Cloud.

For additional information, see the OpenShift documentation.

Recommended approach for controlling how a workload is divided among nodes

This procedure describes how to deploy Cloud Pak for Integration workloads across the nodes in a cluster.

  1. Define two sets of nodes:
    • One set of nodes where Cloud Pak for Integration workloads in the cluster will run ("CP4I nodes").
    • Another set of nodes where the non-Cloud Pak for Integration workloads will run ("non-CP4I nodes").
    Ensure that each of these sets of nodes is balanced across the physical infrastructure so that availability constraints are satisfied. If the cluster is a multi-zone cluster, ensure that there are the same number, size, and type of nodes in each availability zone for both the Cloud Pak for Integration nodes and the non-Cloud Pak for Integration nodes.
  2. Apply a label—such as nodeuse=cp4i—to the Cloud Pak for Integration nodes, and a different label—such as nodeuse=general—to the non-Cloud Pak for Integration nodes (see the sketch after this procedure).
  3. Create one or more projects for the Cloud Pak for Integration workload, and set a project node selector—such as nodeuse=cp4i—for these projects.
  4. If you are using OpenShift Container Platform in an environment where the default cluster-wide node selector:
    • is supported, define a default cluster-wide node selector, such as nodeuse=general. Defining this node selector causes all workloads deployed in projects other than the Cloud Pak for Integration projects to be placed onto the general nodes, leaving the Cloud Pak for Integration nodes dedicated exclusively to Cloud Pak for Integration workloads.
    • is not supported—such as Red Hat OpenShift on IBM Cloud—set project node selectors, such as nodeuse=general, on the non-Cloud Pak for Integration projects in the cluster. Setting these node selectors causes all workloads deployed in non-Cloud Pak for Integration projects to be placed onto the general nodes, leaving the Cloud Pak for Integration nodes exclusively for Cloud Pak for Integration workloads. Ensure that any new projects that users create also have this project node selector applied to them.
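As a sketch of step 2, this is how the label appears on a Node object once applied (the node name is illustrative; the label is typically applied with the oc CLI, for example oc label node <node-name> nodeuse=cp4i):

    apiVersion: v1
    kind: Node
    metadata:
      name: worker-0.example.com    # illustrative node name
      labels:
        nodeuse: cp4i               # marks this node for Cloud Pak for Integration workloads

Steps 3 and 4 are then expressed with the Namespace annotation and Scheduler resource shown in the previous sections.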

Reference

Kubernetes node placement tools

Kubernetes provides multiple ways to influence node placement:

  • Affinity and anti-affinity: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity
    • In short, nodeAffinity provides a way to require or prefer that pods are scheduled to nodes that have particular labels (see the sketch after this list).
  • Taints and tolerations: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
    • In short, nodes can be tainted so that only pods that are explicitly marked as tolerating the taint can run there.
    • At first sight, this might appear to be a good way to place workloads onto dedicated nodes. It might seem possible to taint the nodes reserved for Cloud Pak for Integration and then add a toleration to the Cloud Pak for Integration pods so that they, and only they, run there. Theoretically this works; however, the workloads are managed by operators that define the PodSpec for their workloads. Some of these operators provide ways to override the PodSpecs, but some do not. Furthermore, even for the operators that do provide override mechanisms, this approach is far more onerous for the user than the project node selectors described above.
  • nodeSelector
    • The capabilities provided by nodeAffinity are a superset of those provided by nodeSelectors, and the Kubernetes project has said that nodeSelectors will eventually be deprecated (https://github.com/kubernetes/kubernetes/issues/82331); therefore, we recommend using nodeAffinity rather than nodeSelectors.
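As a brief illustration of the first two mechanisms, the following sketch shows a pod that tolerates a hypothetical dedicated=cp4i:NoSchedule node taint and also requires, via nodeAffinity, nodes carrying an illustrative nodeuse=cp4i label:

    # The taint would first be applied to the node, for example:
    #   oc adm taint nodes worker-0.example.com dedicated=cp4i:NoSchedule
    apiVersion: v1
    kind: Pod
    metadata:
      name: placement-example             # illustrative name
    spec:
      tolerations:
      - key: dedicated                    # tolerates the hypothetical dedicated=cp4i taint
        operator: Equal
        value: cp4i
        effect: NoSchedule
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: nodeuse              # illustrative node label
                operator: In
                values:
                - cp4i
      containers:
      - name: app
        image: registry.example.com/app:1.0   # illustrative image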