Workload placement

Learn optimal strategies for placing pods on nodes in IBM Cloud Pak® for Integration when you need to override the default Red Hat OpenShift scheduler.

Tip: The term workload is used here to denote the total amount of resources that are used by all the containers in a pod.

Key concepts

OpenShift node scheduler

When you deploy Cloud Pak for Integration on the Red Hat® OpenShift® Container Platform, the default OpenShift Container Platform pod scheduler determines the optimal node in the cluster for each pod to run on. The scheduler is designed to make the best possible decisions so that OpenShift cluster administrators don't need to manage scheduling tasks.

In summary, the scheduler places pods on nodes where the following statements apply:

  • The node has enough available resources to run the workload.
  • The node selector, affinity, and anti-affinity rules (which are defined in the pod's specification) are satisfied. These rules are designed to evenly distribute the pods that are managed by replica sets and stateful sets across nodes and availability zones.

For more detail about the scheduler's algorithm and the many factors it considers, see https://kubernetes.io/docs/concepts/scheduling-eviction/kube-scheduler/ in the Kubernetes documentation.

Workload management in containers

Users can specify resource requests and resource limits for containers on OpenShift Container Platform.
  • The resource request values for a container are the minimum amounts of CPU and memory that must be available on a node for the pod that contains the container to be scheduled to that node. This restriction applies only at scheduling time; it does not apply to pods that are already scheduled to that node. At run time, the container can use at least as much as it requested if it is busy and needs those resources. If the container is less busy, it might not need all of its requested resources, and the unused resources are available to containers in other pods on the node. Conversely, a busy container might be able to use more than its resource requests, provided that its pod's node has available resources.
  • A container's resource limits describe the maximum resources that the container is allowed to use. These resource limits must be equal to or greater than the container's resource request values. In short, while a container can use more resources than its resource requests, it can never use more than its resource limits.
  • A node is over-committed when the sum of the resource limits for all containers in pods that are scheduled on the node exceeds the amount of CPU and memory that is physically available on that node. In this case, if multiple containers are busy, some of them might not be able to use all of their resources.limits. However, containers are guaranteed to be able to use all of their resources.requests.

This condition is handled differently for CPU versus memory. If OpenShift needs to limit the amount of CPU that a pod is using, it can do so without terminating the pod's processes. However, if the pod is using more memory than it requested (by way of its containers), and it is competing for memory with other pods, OpenShift terminates the pod and evicts it from the node. OpenShift can then attempt to schedule the pod to another node.
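
As an illustration of these settings, the following minimal pod specification sets both requests and limits for a single container. The pod name, container name, image, and values are placeholders, not prescribed settings.

  apiVersion: v1
  kind: Pod
  metadata:
    name: sample-integration-pod          # placeholder name
  spec:
    containers:
    - name: app
      image: registry.example.com/sample-app:latest   # placeholder image
      resources:
        requests:
          cpu: "1"        # at least 1 CPU must be unreserved on a node for scheduling
          memory: 2Gi     # minimum memory that is reserved for this container
        limits:
          cpu: "3"        # the container can never use more than 3 CPUs
          memory: 4Gi     # exceeding this limit can cause the pod to be evicted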

For more information about these concepts, see Resource Management for Pods and Containers in the Kubernetes documentation.

Controlling the placement of pods on nodes

Under most circumstances, you should not interfere with the OpenShift scheduler. In general, the scheduler makes the best possible decisions for your use case. However, in the following cases you might need to control the scheduling of Cloud Pak for Integration pods to particular nodes, or to constrain the scheduler's algorithm.

Allowing a group of pods to contend for a fixed amount of CPU

Over-committing the resources available on a node (as described earlier) has the potential to cause issues. Therefore, you need to limit the number of workloads that have the potential to compete with each other for resources. To do so, place a limit on the total amount of CPU that the containers in a group of pods can use as a whole.

You exert this control by determining which node the pods (and their associated workloads) are scheduled to. For example, if you have 10 containers, each of which has resources.requests.cpu=1 and resources.limits.cpu=3, schedule the associated pods to a node that has 20 CPUs. The 10 containers together then cannot use more than 20 CPUs of resources.
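
The following sketch shows one way to express this scenario as a deployment. It assumes a node that has 20 CPUs and carries a hypothetical nodeuse=capped label; the deployment name, image, and label are illustrative only.

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: capped-workload                  # placeholder name
  spec:
    replicas: 10                           # 10 pods, one container each
    selector:
      matchLabels:
        app: capped-workload
    template:
      metadata:
        labels:
          app: capped-workload
      spec:
        nodeSelector:
          nodeuse: capped                  # hypothetical label on the 20-CPU node
        containers:
        - name: worker
          image: registry.example.com/worker:latest   # placeholder image
          resources:
            requests:
              cpu: "1"                     # 10 x 1 = 10 CPUs are guaranteed
            limits:
              cpu: "3"                     # 10 x 3 = 30 CPUs in theory, but the node caps actual use at 20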

Tip: Why doesn't Kubernetes provide better facilities for managing resource contention? Vertical scaling is generally considered an anti-pattern in Kubernetes. A pattern more in keeping with the spirit of Kubernetes is to use small containers whose limits.cpu is equal to their requests.cpu, and to scale horizontally as load increases.

Optimizing IBM licensing scenarios

IBM's licensing approach is described in detail on the IBM Container Licenses page. The vCPU Capacity Counting Methodology described there is based on the resources.limits.cpu values of the containers that make up a pod. The OpenShift scheduler helps ensure that the sum of the resources.requests.cpu values for the pods that are deployed to a node never exceeds the actual number of vCPUs on that node. However, in a typical deployment, the sum of the resources.limits.cpu values for the pods that are deployed to a node is higher than the actual number of vCPUs on that node.

The resources.limits.cpu values for Cloud Pak for Integration pods are typically set to higher values than their resources.requests.cpu values. The vCPU Capacity Counting Methodology limits the total vCPU capacity that is calculated for each "IBM Program", for each node on the OpenShift cluster, to the actual vCPU capacity of that worker node. In the context of the licensing methodology, Cloud Pak for Integration is one IBM Program; other Cloud Paks are considered to be different IBM Programs.
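
To get a rough sense of the limit values that feed into this calculation, you can list the CPU limits of the containers in a namespace, as in the following sketch. The namespace name is a placeholder, and this kind of ad hoc query is an approximation only; it is not a substitute for the IBM License Service or the official counting methodology.

  # List each pod and the CPU limits of its containers (illustrative only).
  oc get pods -n cp4i-namespace \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].resources.limits.cpu}{"\n"}{end}'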

Be aware of these implications:

  • Pods are counted based on their theoretical maximum CPU consumption. If a workload is deployed across all nodes on an OpenShift cluster, it is likely that each node in the cluster contains some Cloud Pak for Integration pods. It is possible that these pods are idle much of the time and use only a fraction of the CPU value that is set in resources.limits.cpu. Even if the workload is busy, other workloads that are running on the nodes might contend for CPU, meaning that the workload cannot use all the CPU specified by resources.limits.cpu. Because the pods are distributed across the cluster, the per-node vCPU capacity limit that was described earlier doesn't apply. Therefore, the license count is the sum of resources.limits.cpu for all Cloud Pak for Integration pods. However, if the entire Cloud Pak for Integration workload is on a small set of nodes, the capacity limit does apply.
  • Avoid applying the vCPU limit to pods from more than one IBM Program (software product) on the same node. One implication of the Container Licensing approach is that if pods from more than one IBM Program are on the same node, the vCPU limit approach applies individually to each IBM Program. As an example, you have a worker node with 20 vCPUs. This node is running pods from Cloud Pak for Integration with a total resources.limits.cpu of 20 or more. At the same time, the worker node is running pods from another IBM Program with a total resources.limits.cpu value of 20 vCPUs or more. The vCPU Capacity Counting Methodology counts this value as 20 vCPUs of Cloud Pak for Integration plus 20 vCPUs of the other IBM Program. In other words, the total CPU is counted twice, which is an inefficient use of the license terms.
    Tip: A more efficient use of licensed resources is to deploy each IBM Program on a separate node. For example, if each Program runs on its own node and uses 10 vCPUs there, you get half the license count for each Program (10 vCPUs for Cloud Pak for Integration plus 10 vCPUs for the other Program), even though the same amount of vCPU resources is available.

Optimizing OpenShift Container Platform licensing scenarios

As part of the purchase of IBM Cloud Paks, customers are provided with a fixed number of OpenShift Container Platform license entitlements, as defined in the License Information document for Cloud Pak for Integration.

Red Hat licensing for OpenShift is calculated according to the worker node vCPU count (the worker node capacity), and licenses are associated with clusters and worker nodes. When multiple IBM software products run on the same worker node, each of them uses resources that are covered by the OpenShift license for the entire worker node. This deployment architecture is inefficient from the licensing perspective. It is more efficient to run the IBM software products on separate worker nodes, so that the OpenShift licenses that each product uses can be kept separate. Separating the workloads by node also helps ensure compliance with Red Hat licensing.

Avoiding a situation where workloads compete for resources or otherwise conflict

You might have a scenario where you deploy multiple workloads on an OpenShift cluster (for example, two different Cloud Paks, such as Cloud Pak for Integration and Cloud Pak for Data) and you do not want them to compete for resources.

Tools for controlling workload placement

Kubernetes does provide its own tools for pod placement. However, the tools that are provided by the OpenShift Container Platform (namespace node selectors and a default cluster-wide node selector) support the best system performance in Cloud Pak for Integration. For more information about the Kubernetes options, see the section, "Kubernetes workload placement tools".

OpenShift namespace node selectors

A namespace node selector applies a nodeSelector to any pod that is deployed within the namespace.

You can deploy Cloud Pak for Integration pods in namespaces that have namespace node selectors. This practice restricts workloads to only those nodes with labels that match a specific nodeSelector. Pods located in namespaces that do not have node selectors are not affected by this restriction; they can also be scheduled on nodes that have node selector labels.

This approach works no matter how the pods are created and managed, but it functions best with a workload that is managed by an operator.
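
For illustration, a namespace node selector is expressed as an annotation on the namespace, as in the following sketch. The namespace name and label are placeholders that match the examples later in this topic.

  apiVersion: v1
  kind: Namespace
  metadata:
    name: cp4i                                         # placeholder namespace name
    annotations:
      openshift.io/node-selector: nodeuse=integration  # pods in this namespace are scheduled only to nodes with this label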

For more information, see Creating project-wide node selectors in the Red Hat OpenShift documentation.

Default cluster-wide nodeSelector

The default cluster-wide node selector applies a default node selector to any pod that is deployed into a namespace that does not have a namespace node selector. This node selector prevents Red Hat OpenShift from placing non-Cloud Pak for Integration workloads on the nodes that are reserved for Cloud Pak for Integration use.
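
As a sketch, one way to express the default is in the cluster-scoped Scheduler resource; the label value shown here is a placeholder that matches the procedure later in this topic.

  apiVersion: config.openshift.io/v1
  kind: Scheduler
  metadata:
    name: cluster                          # the cluster-scoped Scheduler resource is named "cluster"
  spec:
    defaultNodeSelector: nodeuse=general   # applied to pods in namespaces that have no node selector of their own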

Important: The default cluster-wide nodeSelector is not supported on IBM's managed service, Red Hat OpenShift on IBM Cloud.

For more information, see Creating default cluster-wide node selectors in the Red Hat OpenShift documentation.

Procedure for controlling workload placement

This procedure describes an optimal method for distributing Cloud Pak for Integration workloads (pods) across the nodes on an OpenShift cluster.

  1. Define two sets of nodes in the cluster:
    • Nodes where Cloud Pak for Integration workloads run ("Cloud Pak for Integration nodes").
    • Nodes where non-Cloud Pak for Integration workloads run ("non-Cloud Pak for Integration nodes").
    Balance each set of nodes across the physical infrastructure so that availability constraints are satisfied. If the cluster is a multi-zone cluster, confirm that the same number, size, and type of nodes are present in each availability zone. This arrangement needs to be true for both the Cloud Pak for Integration nodes and non-Cloud Pak for Integration nodes.
  2. Apply a label (such as nodeuse=integration) to the Cloud Pak for Integration nodes, and a different label (such as nodeuse=general) to the non-Cloud Pak for Integration nodes. (Example commands for steps 2 through 4 are shown after this procedure.)
  3. Create one or more namespaces for the Cloud Pak for Integration workload, and set a namespace node selector (such as nodeuse=integration) for these namespaces.
  4. Set node selectors.
    • If you are using OpenShift Container Platform in an environment where the default cluster-wide node selector is supported, define a default cluster-wide node selector, such as nodeuse=general. Defining a cluster-wide node selector causes all workloads that are deployed in namespaces other than the Cloud Pak for Integration namespaces to be placed onto the general nodes. The effect of this placement is that the Cloud Pak for Integration nodes are dedicated exclusively to Cloud Pak for Integration workloads.
    • If you are using OpenShift Container Platform in an environment where the default cluster-wide node selector is not supported (such as Red Hat OpenShift on IBM Cloud), set namespace node selectors (such as nodeuse=general) on the non-Cloud Pak for Integration namespaces in the cluster. Setting these node selectors causes all workloads that are deployed in non-Cloud Pak for Integration namespaces to be placed onto the general nodes. The effect of this placement is that the Cloud Pak for Integration nodes are dedicated exclusively to Cloud Pak for Integration workloads. Apply this namespace-wide node selector to any new namespaces that users create.
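
The following commands sketch steps 2 through 4. The node names and namespace name are placeholders; your cluster's node names, node counts, and namespaces will differ.

  # Step 2: label the two sets of nodes (node names are placeholders).
  oc label node worker-int-1 worker-int-2 worker-int-3 nodeuse=integration
  oc label node worker-gen-1 worker-gen-2 worker-gen-3 nodeuse=general

  # Step 3: create a Cloud Pak for Integration namespace with a namespace node selector.
  oc adm new-project cp4i --node-selector='nodeuse=integration'

  # Step 4 (where supported): set the default cluster-wide node selector.
  oc patch scheduler cluster --type merge -p '{"spec":{"defaultNodeSelector":"nodeuse=general"}}'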

Kubernetes workload placement tools

For reference, this section describes the Kubernetes options for pod placement. For the tools that are provided by the OpenShift Container Platform, which support the best system performance, see the section, "Tools for controlling workload placement".

Kubernetes provides the following methods for pod placement:

Affinity and anti-affinity

You can use the nodeAffinity field to require or prefer that pods are scheduled to nodes that have particular labels. For more information, see Affinity and anti-affinity in the Kubernetes documentation.
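
As a brief sketch, the following pod specification uses required node affinity to restrict scheduling to nodes that carry a hypothetical nodeuse=integration label; the pod name and image are placeholders.

  apiVersion: v1
  kind: Pod
  metadata:
    name: affinity-example                 # placeholder name
  spec:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: nodeuse
              operator: In
              values:
              - integration                # schedule only to nodes labeled nodeuse=integration
    containers:
    - name: app
      image: registry.example.com/sample-app:latest   # placeholder image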

Taints and tolerations

You can add a taint to a node so that only pods that are explicitly marked as tolerating the taint can run there. For more information, see Taints and Tolerations in the Kubernetes documentation.

This option initially seems like a good method for placing workloads onto dedicated nodes. You can add taints to the nodes for use by Cloud Pak for Integration, then add a toleration to the Cloud Pak for Integration pods so that only those pods run there. Theoretically, this arrangement works well. However, the operators that manage workloads define the pod specification for those workloads, so the toleration must be set through the operator. Some of these operators provide ways to override the pod specification, but other operators do not. Furthermore, this approach is far more onerous for the user than the use of namespace node selectors, even for the operators that provide override mechanisms.
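
For reference, a taint and a matching toleration might look like the following sketch. The node name and the taint key and value are placeholders, and whether the toleration can be applied to a given workload depends on the operator that manages it.

  # Taint a dedicated node so that ordinary pods are not scheduled there:
  #   oc adm taint nodes worker-int-1 dedicated=integration:NoSchedule
  #
  # A pod that should run on that node must carry a matching toleration
  # in its specification:
  tolerations:
  - key: dedicated
    operator: Equal
    value: integration
    effect: NoSchedule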

nodeSelector

The nodeSelector field is the simplest way to influence pod placement. However, the nodeAffinity field provides the same abilities (and more) and the nodeSelector field might eventually be deprecated. For more information, see the Deprecate nodeSelector issue in the Kubernetes GitHub project.
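
For completeness, the nodeSelector field is a simple map of labels in the pod specification, as in this fragment; the label is a placeholder.

  spec:
    nodeSelector:
      nodeuse: integration     # schedule only to nodes that carry this label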