Workload placement
Learn optimal strategies for placing pods on nodes in IBM Cloud Pak® for Integration when you need to override the default Red Hat OpenShift scheduler.
Key concepts
OpenShift node scheduler
When you deploy Cloud Pak for Integration on the Red Hat® OpenShift® Container Platform, the default OpenShift Container Platform pod scheduler determines the optimal node in the cluster for each pod to run on. The scheduler is designed to make the best possible decisions so OpenShift cluster administrators don't need to manage scheduling tasks.
In summary, the scheduler places pods on nodes where the following statements apply:
- The node has enough available resources to run the workload.
- The node selector, affinity, and anti-affinity rules (which are defined in the pod's specification) are satisfied. These rules are designed to evenly distribute the pods that are managed by replica sets and stateful sets across nodes and availability zones.
For more detail about the scheduler's algorithm and the many factors it considers, see https://kubernetes.io/docs/concepts/scheduling-eviction/kube-scheduler/ in the Kubernetes documentation.
Workload management in containers
- A container's resource request values are the minimum amounts of CPU and memory that must be available on a node for the pod that contains the container to be scheduled to that node. This restriction applies only at scheduling time; it does not affect pods that are already scheduled to that node. At run time, a busy container is guaranteed to be able to use at least the resources it requested. If the container is less busy, the requested resources that it does not use are available to containers in other pods on the node. Conversely, a busy container might be able to use more than its resource requests, provided that its node has resources available.
- A container's resource limits describe the maximum resources that the container is allowed to use, and must be equal to or greater than its resource request values. In short, while a container can use more resources than its resource request, it can never use more than its resource limit.
- A node is over-committed when the sum of the resource limits for all containers in pods that are scheduled on that node exceeds the amount of CPU and memory physically available on that node. In this case, if multiple containers are busy, some of them might not be able to use all of their allowed resources.limits. However, containers are guaranteed to be able to use all of their resources.requests.
This condition is handled differently for CPU versus memory. If OpenShift needs to control the amount of CPU a pod is using, it can do so without terminating the pod process. However, if the pod is using more memory than it requested (by way of its containers), and it is competing for memory with other pods, OpenShift deletes the pod and evicts it from the node. OpenShift can then attempt to schedule the pod to another node.
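To illustrate these concepts, the following sketch shows a hypothetical pod specification with resource requests and limits; the names, image, and values are arbitrary examples rather than Cloud Pak for Integration defaults.

```yaml
# Hypothetical pod specification that illustrates resource requests and limits.
# The scheduler places this pod only on a node that has at least 1 CPU and
# 1 GiB of memory available; at run time the container can use more than it
# requested, up to 2 CPUs and 2 GiB, if the node has spare capacity.
apiVersion: v1
kind: Pod
metadata:
  name: example-integration-pod
spec:
  containers:
    - name: example-container
      image: registry.example.com/example/app:latest
      resources:
        requests:
          cpu: "1"        # minimum CPU that must be available for scheduling
          memory: 1Gi     # minimum memory that must be available for scheduling
        limits:
          cpu: "2"        # maximum CPU the container is allowed to use
          memory: 2Gi     # maximum memory the container is allowed to use
```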
For more information about these concepts, see Resource Management for Pods and Containers in the Kubernetes documentation.
Controlling the placement of pods on nodes
- Allowing a group of pods to contend for a fixed amount of CPU
Over-committing the resources available on a node (as described earlier) has the potential to cause issues. Therefore, you need to limit the number of workloads that can compete with each other for resources. To do so, place a limit on the total amount of CPU that the combined containers in a group of pods are able to use.
You exert this control by determining which node the pods (and their associated workloads) are scheduled to. For example, if you have 10 containers, each of which has request.cpu=1 and limit.cpu=3, schedule the associated pods for these containers to a node that has 20 CPUs. The 10 containers together then cannot use more than 20 CPUs of resources (see the Deployment sketch after this list).
Tip: Why doesn't Kubernetes provide better facilities for managing resource contention? Vertical scaling in Kubernetes is considered an anti-pattern. A pattern more in keeping with the spirit of Kubernetes is to have small containers with limit.cpu equal to request.cpu and to scale horizontally with increased load.
- Optimizing IBM licensing scenarios
-
IBM's licensing approach is described in detail on the IBM Container Licenses page. The vCPU Capacity Counting Methodology described there is based on the resources.limits.cpu values of the containers that make up a pod. The OpenShift scheduler helps ensure that the sum of the resources.requests.cpu values for the pods that are deployed to a node never exceeds the actual number of vCPUs on that node. However, in a typical deployment, the sum of the resources.limits.cpu values for the pods that are deployed to a node is higher than the actual number of vCPUs on that node.
The resources.limits.cpu values for Cloud Pak for Integration pods are typically set to higher values than their resources.requests.cpu values. The vCPU Capacity Counting Methodology limits the total vCPU capacity that is calculated for each "IBM Program", for each node on the OpenShift cluster, to the actual vCPU capacity of that worker node. In the context of the licensing methodology, Cloud Pak for Integration is one IBM Program; other Cloud Paks are considered to be different IBM Programs.
Be aware of these implications:
- Pods are counted based on their theoretical maximum CPU consumption. If a workload is deployed across all nodes on an OpenShift cluster, it is likely that each node in the cluster contains some Cloud Pak for Integration pods. It is possible that these pods are idle much of the time and use only a fraction of the value set in resources.limits.cpu. Even if the workload is busy, other workloads that are running on the nodes might contend for CPU, meaning that the workload cannot use all the CPU specified by resources.limits.cpu. Because the pods are distributed across the cluster, the per-node vCPU capacity limit that was described earlier doesn't apply. Therefore, the license count is the sum of resources.limits.cpu for all pods in the Cloud Pak for Integration. However, if the entire Cloud Pak for Integration workload is on a small set of nodes, the capacity limit does apply.
- Avoid applying the vCPU limit to pods from more than one IBM Program (software product) on the same node. One implication of the Container Licensing approach is that if pods from more than one IBM Program are on the same node, the vCPU limit approach applies individually for each IBM Program.
As an example, you have a worker node with 20 vCPUs. This node is running pods from Cloud Pak for Integration with a total resources.limits.cpu of 20 or more. At the same time, the worker node is running pods from another IBM Program with a total resources.limits.cpu value of 20 vCPUs or more. The vCPU Capacity Counting Methodology counts this value as 20 vCPUs of Cloud Pak for Integration plus 20 vCPUs of the other IBM Program. In other words, the total CPU is counted twice, which is an inefficient usage of the license terms.
Tip: A more efficient use of licensed resources is to deploy each IBM Program on a separate node. For example, if you deploy 10 vCPUs on each node, you get half the license count for each Program (10 vCPUs for Cloud Pak for Integration plus 10 vCPUs for the other Program), even though the same amount of vCPU resources is available.
- Optimizing OpenShift Container Platform licensing scenarios
-
As part of the purchase of IBM Cloud Paks, customers are provided with a fixed number of OpenShift Container Platform license entitlements, as defined in the License Information document for Cloud Pak for Integration.
Red Hat licensing for OpenShift is calculated according to the worker node vCPU count (the worker node capacity), and licenses are associated with clusters and worker nodes. When multiple IBM software products run on the same worker node, they each use the resources that are covered by the OpenShift license for the entire worker node. This deployment architecture is inefficient from the licensing perspective; OpenShift licenses are used more efficiently when the IBM software products run on separate worker nodes. Separating the workloads by nodes also helps ensure compliance with Red Hat licensing.
- Avoiding a situation where workloads compete for resources or otherwise conflict
-
You might have a scenario where you deploy multiple workloads on an OpenShift cluster (for example, two different Cloud Paks, such as Cloud Pak for Integration and Cloud Pak for Data) and you do not want them to compete for resources.
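The following hypothetical Deployment sketches the CPU-contention control described in the first item of this list: 10 replicas, each with request.cpu=1 and limit.cpu=3, pinned by a nodeSelector to nodes labeled nodeuse=integration (an assumed label; the names and image are also assumptions). Scheduled onto a 20-vCPU node, the containers can individually burst above their requests, but as a group they can never consume more than that node's 20 CPUs.

```yaml
# Hypothetical Deployment: 10 pods, each with request.cpu=1 and limit.cpu=3,
# constrained to nodes that carry the nodeuse=integration label so that the
# group contends only for the CPU of those nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-busy-workers
spec:
  replicas: 10
  selector:
    matchLabels:
      app: example-busy-workers
  template:
    metadata:
      labels:
        app: example-busy-workers
    spec:
      nodeSelector:
        nodeuse: integration   # example label applied to the target nodes
      containers:
        - name: worker
          image: registry.example.com/example/worker:latest
          resources:
            requests:
              cpu: "1"         # sum of requests (10) fits on a 20-vCPU node
            limits:
              cpu: "3"         # sum of limits (30) over-commits the node
```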
Tools for controlling workload placement
Kubernetes does provide its own tools for pod placement. However, the tools that are provided by the OpenShift Container Platform (namespace node selectors and a default cluster-wide node selector) support the best system performance in Cloud Pak for Integration. For more information about the Kubernetes options, see the section, "Kubernetes workload placement tools".
- OpenShift namespace node selectors
-
A namespace node selector applies a nodeSelector to any pod that is deployed within the namespace.
You can deploy Cloud Pak for Integration pods in namespaces that have namespace node selectors. This practice restricts workloads to only those nodes with labels that match a specific nodeSelector. Pods located in namespaces that do not have node selectors are not affected by this restriction; they can also be scheduled on nodes that have node selector labels. A sketch of such a namespace appears after this list.
This approach works no matter how the pods are created and managed, but it functions best with a workload that is managed by an operator.
For more information, see Creating project-wide node selectors in the Red Hat OpenShift documentation.
- Default cluster-wide nodeSelector
-
The default cluster-wide node selector applies a default node selector to any pod that is deployed into a namespace that does not have a namespace node selector. This node selector prevents Red Hat OpenShift from placing non-Cloud Pak for Integration workloads on the nodes that are reserved for Cloud Pak for Integration use.
Important: The default cluster-wide nodeSelector is not supported on IBM's managed service, Red Hat OpenShift on IBM Cloud.
For more information, see Creating default cluster-wide node selectors in the Red Hat OpenShift documentation.
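As a sketch of the namespace node selector approach, the following hypothetical namespace definition uses the openshift.io/node-selector annotation; the namespace name and label are assumptions for illustration.

```yaml
# Hypothetical namespace whose pods are all constrained to nodes that carry
# the nodeuse=integration label, regardless of how the pods are created.
apiVersion: v1
kind: Namespace
metadata:
  name: cp4i-example
  annotations:
    openshift.io/node-selector: nodeuse=integration
```

Pods deployed into this namespace receive the node selector automatically, which is why the approach works well with operator-managed workloads.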
Procedure for controlling workload placement
This procedure describes an optimal method for distributing Cloud Pak for Integration workloads (pods) across the nodes on an OpenShift cluster.
- Define two sets of nodes in the cluster:
- Nodes where Cloud Pak for Integration workloads run ("Cloud Pak for Integration nodes").
- Nodes where non-Cloud Pak for Integration workloads run ("non-Cloud Pak for Integration nodes").
- Apply a label (such as nodeuse=integration) to the Cloud Pak for Integration nodes, and a different label (such as nodeuse=general) to the non-Cloud Pak for Integration nodes.
- Create one or more namespaces for the Cloud Pak for Integration workload, and set a namespace node selector (such as nodeuse=integration) for these namespaces.
- Set node selectors.
- If you are using OpenShift Container Platform in an environment where the default cluster-wide node selector is supported, define a default cluster-wide node selector, such as nodeuse=general. Defining a cluster-wide node selector causes all workloads that are deployed in namespaces other than the Cloud Pak for Integration namespaces to be placed onto the general nodes. The effect of this placement is that the Cloud Pak for Integration nodes are dedicated exclusively to Cloud Pak for Integration workloads.
- If you are using OpenShift Container Platform in an environment where the default cluster-wide node selector is not supported (such as Red Hat OpenShift on IBM Cloud), set namespace node selectors (such as nodeuse=general) on the non-Cloud Pak for Integration namespaces in the cluster. Setting these node selectors causes all workloads that are deployed in non-Cloud Pak for Integration namespaces to be placed onto the general nodes. The effect of this placement is that the Cloud Pak for Integration nodes are dedicated exclusively to Cloud Pak for Integration workloads. Apply this namespace-wide node selector to any new namespaces that users create. A sketch that illustrates these steps follows the procedure.
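The following sketch pulls the procedure together under assumed names and labels: nodes are labeled first (for example, with oc label node), the Cloud Pak for Integration namespace gets an integration node selector, and the remaining workloads are steered to the general nodes either by the cluster-wide default or by per-namespace selectors.

```yaml
# Assumes the nodes have already been labeled, for example:
#   oc label node <integration-node-name> nodeuse=integration
#   oc label node <general-node-name> nodeuse=general

# Namespace reserved for Cloud Pak for Integration workloads (name is an example).
apiVersion: v1
kind: Namespace
metadata:
  name: cp4i-example
  annotations:
    openshift.io/node-selector: nodeuse=integration
---
# Option 1: default cluster-wide node selector for all other namespaces
# (not supported on Red Hat OpenShift on IBM Cloud).
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  name: cluster
spec:
  defaultNodeSelector: nodeuse=general
---
# Option 2: where the cluster-wide default is not supported, give each
# non-Cloud Pak for Integration namespace its own node selector instead.
apiVersion: v1
kind: Namespace
metadata:
  name: general-example
  annotations:
    openshift.io/node-selector: nodeuse=general
```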
Kubernetes workload placement tools
For reference, this section describes the Kubernetes options for pod placement. For the tools that are provided by the OpenShift Container Platform, which support the best system performance, see the section, "Tools for controlling workload placement".
- Affinity and anti-affinity
-
You can use the nodeAffinity field to require or prefer that pods are scheduled to nodes that have particular labels. For more information, see Affinity and anti-affinity in the Kubernetes documentation.
- Taints and tolerations
-
You can add a taint to a node so that only pods that are explicitly marked as tolerating the taint can run there. For more information, see Taints and Tolerations in the Kubernetes documentation.
This option initially seems like a good method for placing workloads onto dedicated nodes. You can add taints to the nodes for use by Cloud Pak for Integration, then add a toleration to the Cloud Pak for Integration pods so that only those pods run there. Theoretically, this arrangement works well. However, the operators that manage workloads define the pod specifications for those workloads, and while some of these operators provide ways to override the pod specification, others do not. Furthermore, even for the operators that provide override mechanisms, this approach is far more onerous for the user than the use of namespace node selectors. A sketch that combines nodeAffinity and a toleration appears after this list.
- nodeSelector
-
The nodeSelector field is the simplest way to influence pod placement. However, the nodeAffinity field provides the same abilities (and more) and the nodeSelector field might eventually be deprecated. For more information, see the Deprecate nodeSelector issue in the Kubernetes GitHub project.
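For comparison, the following hypothetical pod specification combines the Kubernetes-native mechanisms described above: a required nodeAffinity rule that matches an assumed nodeuse=integration node label, and a toleration for an assumed dedicated=integration:NoSchedule taint (which could be applied with, for example, oc adm taint nodes <node-name> dedicated=integration:NoSchedule).

```yaml
# Hypothetical pod that uses Kubernetes-native placement controls instead of
# the OpenShift namespace or cluster-wide node selectors.
apiVersion: v1
kind: Pod
metadata:
  name: example-affinity-pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nodeuse            # node label key (assumed)
                operator: In
                values:
                  - integration
  tolerations:
    - key: dedicated                    # taint key (assumed)
      operator: Equal
      value: integration
      effect: NoSchedule
  containers:
    - name: example-container
      image: registry.example.com/example/app:latest
```

A plain nodeSelector entry (nodeuse: integration) would achieve the same required match as the nodeAffinity rule shown here.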
Learn more
For information about how deployed instances in Cloud Pak for Integration handle workload placement, see Workload placement for instances.