Node placement considerations
Background
When a workload—such as Cloud Pak for Integration—is deployed onto the Red Hat OpenShift Container Platform, the OpenShift scheduler makes decisions about the optimal node in the cluster for each pod to run on. The scheduler is designed to make the best scheduling decisions possible so that cluster administrators don't have to manage this themselves.
In summary, the scheduler places pods on nodes where:
- The node has enough free resources to run the workload.
- The node selector, affinity, and anti-affinity rules (which are defined in the pod's specification) are satisfied. These rules attempt to spread the members of replica sets and stateful sets evenly across nodes and availability zones.
For more detail about the scheduler's algorithm and the many factors it considers, see https://kubernetes.io/docs/concepts/scheduling-eviction/kube-scheduler/.
Resource requests and limits
Containers in a Kubernetes system such as OpenShift can have resource requests and limits applied to them, as described here: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/. In summary:
- The resource request values for a pod describe the minimum amount of CPU and memory that must be available on a node (in other words, not requested by the pods that are already scheduled to that node) in order for the pod to be scheduled to the node. When the pod executes on the node at run time, it is guaranteed to be able to consume at least as much as it requested if the pod is busy and needs those resources. If the pod is less busy, it may not need to use all of its requested resources, and in this case the "spare" resources are available for use by other pods on the node. Conversely, if the pod is very busy it is possible, but not guaranteed, that it may be able to consume more than its resource requests. This is possible if there are unallocated resources on its node, or if other pods on that node are not using all of the resources they requested.
- A pod's resource limit values describe the maximum resources that the pod will be permitted to consume. These are equal to or greater than its resource request values. It is common practice for containers to be deployed with the limit values set higher than the request values; the rationale is that this allows some flexibility so that resources can be used efficiently.
- When the sum of the resource limits for all containers scheduled to a node exceeds the amount of CPU and memory physically available on that node, the node is said to be overcommitted. In this case, if multiple pods on the node are busy (although all are guaranteed to be able to consume their resources.requests), some may not be able to consume as much as their resources.limits.
This condition is handled differently for CPU and memory. If Red Hat OpenShift needs to control the amount of CPU a pod is using, it can do so without killing the pod process. However, if the pod is using more memory than it requested, and there is contention for memory, the pod may be killed and evicted from the node. OpenShift may then attempt to schedule the pod to another node.
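The following is a minimal sketch of a pod specification using these concepts; the pod name, container name, and image are illustrative, and the request and limit numbers are example values only.

```yaml
# A minimal sketch of resource requests and limits (example names and values).
apiVersion: v1
kind: Pod
metadata:
  name: requests-limits-example
spec:
  containers:
  - name: app                                  # example container name
    image: registry.example.com/app:latest     # example image
    resources:
      requests:
        cpu: "1"        # minimum CPU that must be free on a node for scheduling
        memory: 1Gi     # minimum memory that must be free on a node for scheduling
      limits:
        cpu: "3"        # maximum CPU the container may consume
        memory: 2Gi     # exceeding this can cause the container to be killed
```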
When to control the placement of workloads on nodes
Under most circumstances, it's best to allow the OpenShift scheduler to make its own decisions: the scheduler was designed using best practices and will, more often than not, make the best possible decisions for your use case. However, there are times when you may need to control the scheduling of Cloud Pak for Integration pods to particular nodes, or need to constrain the scheduler algorithm. These scenarios include:
Allowing a group of pods to contend for a fixed amount of CPU
Over-committing the resources available on a node (as described earlier) has the potential to cause issues. For this reason, users may need to control which workloads have the potential to compete with each other for resources, by placing a limit on how much CPU a group of pods is able to consume as a whole. To achieve this, you must control which node they are scheduled to. For example, if 10 pods, each of which has request.cpu=1 and limit.cpu=3, are scheduled to a node that has 20 CPUs, the 10 pods in aggregate will not use more than 20 CPUs of resources.
An alternative approach is to set limit.cpu equal to request.cpu and to scale horizontally with increased load.
Optimizing IBM licensing scenarios
Background: IBM's licensing approach is described in detail on the IBM Container Licenses page. The vCPU Capacity Counting Methodology described there is based on the resources.limits.cpu values of the containers making up a pod. While the OpenShift scheduler ensures that the sum of the resources.requests.cpu values for the pods deployed to a node never exceeds the actual number of vCPUs on that node, in a typical deployment the sum of the resources.limits.cpu values for the pods deployed to a node may be much higher than the actual number of vCPUs on that node.
The resources.limits.cpu values for Cloud Pak for Integration pods are typically set higher than their resources.requests.cpu values. This is so that the containers can use more CPU resources than the absolute minimum that they require, to account for varying load. The vCPU Capacity Counting Methodology limits the total vCPU capacity that is calculated for each "IBM Program", for each node in the cluster, to the actual vCPU capacity of that worker node. In the context of the licensing methodology, Cloud Pak for Integration is one IBM Program; other Cloud Paks are considered to be different IBM Programs.
Implications:
- Pods are counted based on their theoretical maximum CPU consumption. Given the background explained earlier, if a workload is deployed across all of the nodes in a cluster, it is likely that each node in the cluster will contain some Cloud Pak for Integration pods. It is possible that these pods are idle much of the time, consuming only a fraction of the cpu value set in resources.limits.cpu. Even if the workload is busy, it may be that other workloads running on the nodes are contending for CPU, meaning the workload cannot obtain all of the CPU specified by resources.limits.cpu. Because the pods are distributed across the cluster, the per-node vCPU capacity limit described earlier doesn't apply, so the license count is the sum of resources.limits.cpu for all of the Cloud Pak for Integration pods. However, if the Cloud Pak for Integration workload is gathered together on a small set of nodes, then the capping described earlier does apply.
- Pods from more than one IBM Program on the same node. One implication of the Container Licensing approach is that if there are pods from more than one IBM Program on the same node, the vCPU limit approach applies individually to each IBM Program. Thus, if a particular worker node has 20 vCPUs and is running pods from Cloud Pak for Integration whose total resources.limits.cpu is 20 or more, at the same time as running pods from a different IBM Program whose total resources.limits.cpu is also 20 vCPUs or more, then the vCPU Capacity Counting Methodology counts this as 20 vCPUs of Cloud Pak for Integration AND 20 vCPUs of the other IBM Program. Such a situation is clearly inefficient in terms of license use. A more efficient use of license entitlements is to deploy these workloads on separate nodes, each with 10 vCPUs, because this results in half the license count (10 for Cloud Pak for Integration plus 10 for the other program) despite the same amount of CPU resource being available.
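The following is a hypothetical worked illustration of the per-node capping described above; the node size, pod count, and limit values are example numbers, not recommendations.

```yaml
# Hypothetical illustration of the per-node capping rule (all numbers are examples):
# 10 Cloud Pak for Integration pods, each with resources.limits.cpu: "3",
# so the uncapped sum of limits is 30 vCPUs.
#
# Spread across 5 worker nodes (2 pods per node, 20 vCPUs per node):
#   per-node sum of limits = 2 x 3 = 6    -> below the 20 vCPU node cap, no capping
#   licensed vCPUs         = 5 x 6 = 30
#
# Gathered onto 1 worker node with 20 vCPUs (all 10 pods on that node):
#   per-node sum of limits = 10 x 3 = 30  -> capped at the node capacity of 20
#   licensed vCPUs         = 20
```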
OpenShift Container Platform licensing scenarios
As part of the purchase of IBM Cloud Paks, customers are provided with a fixed number of OpenShift Container Platform (OCP) license entitlements, as defined in the License Information document for Cloud Pak for Integration.
Red Hat licensing for OCP is calculated according to the worker node vCPU count (worker node capacity), and licenses are associated with clusters and worker nodes. When multiple IBM software products run on the same worker node, each of them consumes the OCP license entitlement that covers the entire worker node. This is inefficient from the licensing perspective. It is a more efficient use of OCP licenses to run the different IBM software products on separate worker nodes. It is also easier, from the Red Hat license compliance perspective, when the workloads are separated by node.
Avoiding a situation where workloads compete for resources or otherwise conflict
Sometimes you deploy multiple workloads onto a cluster and don't want them to compete for resources. One example of this situation would be when you deploy two different Cloud Paks—such as Cloud Pak for Integration and Cloud Pak for Data—onto the same cluster.
Separating workloads that are known to use large amounts of ephemeral storage
Kubernetes emptyDir volumes, as well as the writable layers of containers and some log files, are written to the filesystem on a node during pod lifetime, as described in the OpenShift documentation. Although there is a technology preview feature in OCP enabled by the LocalStorageCapacityIsolation feature gate, this feature is disabled by default. Therefore, if the pods scheduled to a node use more ephemeral storage than is provided on that node, the node is tainted, which causes an eviction of all pods that don't tolerate this taint (see Taints and tolerations for more information).
For this reason, if particular workloads are known to require large amounts of ephemeral storage, it may be necessary to associate them with specific nodes so that they are not scheduled to the same node.
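For illustration, the following sketch shows a pod that writes scratch data to an emptyDir volume and uses a nodeSelector to keep it on nodes set aside for storage-heavy workloads. The label nodeuse=scratch, the names, and the image are example values.

```yaml
# A sketch of a pod that writes heavily to an emptyDir volume and is pinned,
# via a nodeSelector, to nodes set aside for storage-heavy workloads.
# The label "nodeuse=scratch", the names, and the image are example values.
apiVersion: v1
kind: Pod
metadata:
  name: scratch-heavy-example
spec:
  nodeSelector:
    nodeuse: scratch            # example label applied to the dedicated nodes
  containers:
  - name: worker                # example container name
    image: registry.example.com/worker:latest   # example image
    volumeMounts:
    - name: scratch
      mountPath: /tmp/scratch
  volumes:
  - name: scratch
    emptyDir: {}                # backed by the node's local filesystem
```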
Tools for controlling how a workload is deployed to nodes
Kubernetes does provide three tools for workload placement: affinity and anti-affinity, taints and tolerations, and nodeSelectors. If you would like to learn more about these options, see the Reference section.
However, Cloud Pak for Integration recommends that you use the tools provided by the OpenShift Container Platform (OCP): project node selectors and a default cluster-wide nodeSelector.
OCP project node selectors
A Project node selector applies a nodeSelector to any pod that is deployed within the project.
If Cloud Pak for Integration workloads are deployed into projects which have project node selectors, nodeSelectors can be used to constrain the Cloud Pak for Integration workloads to only those nodes with labels that match the nodeSelector. This does not stop other workloads from being scheduled onto those nodes if they are deployed into projects that do not have a project nodeSelector.
This approach works no matter how the pods are created and managed, but it functions best with a workload that is managed by an operator.
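For illustration, a project node selector is expressed as the openshift.io/node-selector annotation on the project's namespace. In this sketch the project name cp4i and the label nodeuse=cp4i are example values.

```yaml
# A sketch of a project (namespace) with a project node selector.
# The project name "cp4i" and the label "nodeuse=cp4i" are example values;
# the annotation can also be applied with "oc annotate namespace ...".
apiVersion: v1
kind: Namespace
metadata:
  name: cp4i
  annotations:
    openshift.io/node-selector: "nodeuse=cp4i"
```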
For additional information, see the OpenShift documentation.
Default cluster-wide nodeSelector
The default cluster-wide node selector applies a default node selector to any pod that is deployed into a project that does not have a project node selector. This can be used to prevent non-Cloud Pak for Integration workloads from being placed onto the nodes that are reserved for Cloud Pak for Integration use.
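For illustration, on a recent OpenShift 4.x cluster the default cluster-wide node selector can be set on the Scheduler resource named cluster; the label nodeuse=general in this sketch is an example value.

```yaml
# A sketch of the cluster-wide default node selector, assuming OpenShift 4.x,
# where it is configured on the cluster-scoped Scheduler resource.
# The label "nodeuse=general" is an example value.
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  name: cluster
spec:
  defaultNodeSelector: nodeuse=general
```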
For additional information, see the OpenShift documentation.
Recommended approach for controlling how a workload is divided among nodes
This procedure describes how to deploy Cloud Pak for Integration workloads across the nodes in a cluster.
- Define two sets of nodes:
- One set of nodes where Cloud Pak for Integration workloads in the cluster will run ("CP4I nodes").
- Another set of nodes where the non-Cloud Pak for Integration workloads will run ("non-CP4I nodes").
- Apply a label, such as nodeuse=cp4i, to the Cloud Pak for Integration nodes, and a different label, such as nodeuse=general, to the non-Cloud Pak for Integration nodes (node labels are illustrated in the sketch after this procedure).
- Create one or more projects for the Cloud Pak for Integration workload, and set a project node selector, such as nodeuse=cp4i, for these projects.
- If you are using OpenShift Container Platform in an environment where the default cluster-wide node selector:
- Is supported, define a default cluster-wide node selector, such as nodeuse=general. Defining this node selector causes all workloads deployed in projects other than the Cloud Pak for Integration projects to be placed onto the general nodes, leaving the Cloud Pak for Integration nodes dedicated exclusively to Cloud Pak for Integration workloads.
- Is not supported, such as Red Hat OpenShift on IBM Cloud, set project node selectors, such as nodeuse=general, on the non-Cloud Pak for Integration projects in the cluster. Setting these node selectors causes all workloads deployed in non-Cloud Pak for Integration projects to be placed onto the general nodes, leaving the Cloud Pak for Integration nodes exclusively for Cloud Pak for Integration workloads. Ensure that any new projects that users create also have this project-wide node selector applied to them.
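The node labels used in this procedure are normally applied with the oc CLI; the following sketch shows what the resulting labels look like on the Node objects. The node names and the nodeuse key are example values.

```yaml
# A sketch of the node labels assumed by this procedure. In practice they are
# usually applied with commands such as "oc label node worker-0 nodeuse=cp4i"
# rather than by editing Node objects directly. Names and labels are examples.
apiVersion: v1
kind: Node
metadata:
  name: worker-0            # example node reserved for Cloud Pak for Integration
  labels:
    nodeuse: cp4i
---
apiVersion: v1
kind: Node
metadata:
  name: worker-1            # example node for all other workloads
  labels:
    nodeuse: general
```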
Reference
Kubernetes node placement tools
Kubernetes provides multiple different ways to influence node placement:
- Affinity and anti-affinity: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity
- In short, nodeAffinity provides a way to require or prefer that pods are scheduled to nodes that have particular labels (see the sketch after this list).
- Taints and tolerations: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
- In short, nodes can be tainted so that only pods that are explicitly marked as tolerating the taint can run there.
- At first sight, this might appear to be a good way to place workloads onto dedicated nodes. It might seem that it would be possible to taint the nodes for use by Cloud Pak for Integration and then add a toleration to the Cloud Pak for Integration pods so that they, and only they, will run there. Theoretically this will work; however, workloads are managed by operators that define the PodSpec for their workloads. Some of these operators provide ways to override the podSpecs but some do not. Furthermore, even for the operators that provide override mechanisms, this approach is far more onerous for the user than the use of project node selectors.
- nodeSelector
- The capabilities that are provided by nodeAffinity are a superset of those provided by nodeSelectors, and the Kubernetes project has said that nodeSelectors will eventually be deprecated; therefore, we recommend the use of nodeAffinity rather than nodeSelectors. (https://github.com/kubernetes/kubernetes/issues/82331)
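For illustration, the following sketch shows the required form of nodeAffinity for an example label; the pod name, container image, and label are illustrative only.

```yaml
# A minimal sketch of required nodeAffinity for an example label (nodeuse=cp4i).
apiVersion: v1
kind: Pod
metadata:
  name: node-affinity-example
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nodeuse
            operator: In
            values:
            - cp4i
  containers:
  - name: app                                  # example container name
    image: registry.example.com/app:latest     # example image
```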