Node placement considerations
Background
When a workload—such as Cloud Pak for Integration—is deployed onto the Red Hat OpenShift Container Platform, the OpenShift scheduler makes decisions about the optimal node in the cluster for each pod to run on. The scheduler is designed to make the best scheduling decisions possible so that cluster administrators don't have to manage this themselves.
In summary, the scheduler places pods on nodes where:
- The node has enough free resources to run the workload.
- The node selector, affinity, and anti-affinity rules (which are defined in the pod's specification) are satisfied. These rules attempt to spread the members of replica sets and stateful sets evenly across nodes and availability zones.
For more detail about the scheduler's algorithm and the many factors it considers, see https://kubernetes.io/docs/concepts/scheduling-eviction/kube-scheduler/.
Resource requests and limits
Containers in a Kubernetes system such as OpenShift can have resource requests and limits applied to them, as described here: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/. In summary:
- The resource request values for a pod describe the minimum amount of CPU and memory that must be available on a node (in other words, not requested by the pods that are already scheduled to that node) in order for the pod to be scheduled to the node. When the pod executes on the node at run time, it is guaranteed to be able to consume at least as much as it requested if the pod is busy and needs those resources. If the pod is less busy, it may not need to use all of its requested resources, and in this case the "spare" resources are available for use by other pods on the node. Conversely, if the pod is very busy it is possible, but not guaranteed, that it may be able to consume more than its resource requests. This is possible if there are unallocated resources on its node, or if other pods on that node are not using all of the resources they requested.
- A pod's resource limit values describe the maximum resources that the pod will be permitted to consume. These are equal to or greater than its resource request values. It is common practice for containers to be deployed with the limit values set higher than the request values; the rationale is that this allows some flexibility so that resources can be used efficiently.
- When the sum of the resource limits for all containers scheduled to a node exceeds the amount of CPU and memory physically available on that node, the node is said to be overcommitted. In this case, if multiple pods on the node are busy (although all are guaranteed to be able to consume their resources.requests), some may not be able to consume as much as their resources.limits.
This condition is handled differently for CPU and memory. If Red Hat OpenShift needs to control the amount of CPU a pod is using, it can do so without killing the pod process. However, if the pod is using more memory than it requested, and there is contention for memory, the pod may be killed and evicted from the node. OpenShift may then attempt to schedule the pod to another node.
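The following is a minimal sketch of a pod specification using these concepts; the pod name, container name, and image are illustrative, and the request and limit numbers are example values only.

```yaml
# A minimal sketch of resource requests and limits (example names and values).
apiVersion: v1
kind: Pod
metadata:
  name: requests-limits-example
spec:
  containers:
  - name: app                                  # example container name
    image: registry.example.com/app:latest     # example image
    resources:
      requests:
        cpu: "1"        # minimum CPU that must be free on a node for scheduling
        memory: 1Gi     # minimum memory that must be free on a node for scheduling
      limits:
        cpu: "3"        # maximum CPU the container may consume
        memory: 2Gi     # exceeding this can cause the container to be killed
```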
When to control the placement of workloads on nodes
Under most circumstances, it's best to allow the OpenShift scheduler to make its own decisions: the scheduler was designed using best practices and will, more often than not, make the best possible decisions for your use case. However, there are times when you may need to control the scheduling of Cloud Pak for Integration pods to particular nodes, or need to constrain the scheduler algorithm. These scenarios include:
Allowing a group of pods to contend for a fixed amount of CPU
Over-committing the resources available on a node (as described earlier) has the potential to cause issues. For this reason, users may need to control which workloads have the potential to compete with each other for resources, by placing a limit on how much CPU a group of pods is able to consume as a whole. To achieve this, you must control which node they are scheduled to. For example, if 10 pods, each of which has request.cpu=1 and limit.cpu=3, are scheduled to a node that has 20 CPUs, the 10 pods in aggregate will not use more than 20 CPUs of resources.
An alternative approach is to set limit.cpu equal to request.cpu and to scale horizontally with increased load.
Optimizing IBM licensing scenarios
Background: IBM's licensing approach is described in detail on the IBM Container Licenses page. The vCPU Capacity Counting Methodology described there is based on the resources.limits.cpu values of the containers making up a pod. While the OpenShift scheduler ensures that the sum of the resources.requests.cpu values for the pods deployed to a node never exceeds the actual number of vCPUs on that node, in a typical deployment the sum of the resources.limits.cpu values for the pods deployed to a node may be much higher than the actual number of vCPUs on that node.
The resources.limits.cpu values for Cloud Pak for Integration pods are typically set higher than their resources.requests.cpu values. This is so that the containers can use more CPU resources than the absolute minimum that they require, to account for varying load. The vCPU Capacity Counting Methodology limits the total vCPU capacity that is calculated for each "IBM Program", for each node in the cluster, to the actual vCPU capacity of that worker node. In the context of the licensing methodology, Cloud Pak for Integration is one IBM Program; other Cloud Paks are considered to be different IBM Programs.
Implications:
- Pods are counted based on their theoretical maximum CPU consumption. Given the background explained earlier, if a workload is deployed across all of the nodes in a cluster, it is likely that each node in the cluster will contain some Cloud Pak for Integration pods. It is possible that these pods are idle much of the time, consuming only a fraction of the cpu value set in resources.limits.cpu. Even if the workload is busy, it may be that other workloads running on the nodes are contending for CPU, meaning the workload cannot obtain all of the CPU specified by resources.limits.cpu. Because the pods are distributed across the cluster, the per-node vCPU capacity limit described earlier doesn't apply, so the license count is the sum of resources.limits.cpu for all of the Cloud Pak for Integration pods. However, if the Cloud Pak for Integration workload is gathered together on a small set of nodes, then the capping described earlier does apply.
- Pods from more than one IBM Program on the same node. One implication of the Container Licensing approach is that if there are pods from more than one IBM Program on the same node, the vCPU limit approach applies individually to each IBM Program. Thus, if a particular worker node has 20 vCPUs and is running pods from Cloud Pak for Integration whose total resources.limits.cpu is 20 or more, at the same time as running pods from a different IBM Program whose total resources.limits.cpu is also 20 vCPUs or more, then the vCPU Capacity Counting Methodology counts this as 20 vCPUs of Cloud Pak for Integration AND 20 vCPUs of the other IBM Program. Such a situation is clearly inefficient in terms of license use. A more efficient use of license entitlements is to deploy these workloads on separate nodes, each with 10 vCPUs, because this results in half the license count (10 for Cloud Pak for Integration plus 10 for the other program) despite the same amount of CPU resource being available.
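The following is a hypothetical worked illustration of the per-node capping described above; the node size, pod count, and limit values are example numbers, not recommendations.

```yaml
# Hypothetical illustration of the per-node capping rule (all numbers are examples):
# 10 Cloud Pak for Integration pods, each with resources.limits.cpu: "3",
# so the uncapped sum of limits is 30 vCPUs.
#
# Spread across 5 worker nodes (2 pods per node, 20 vCPUs per node):
#   per-node sum of limits = 2 x 3 = 6    -> below the 20 vCPU node cap, no capping
#   licensed vCPUs         = 5 x 6 = 30
#
# Gathered onto 1 worker node with 20 vCPUs (all 10 pods on that node):
#   per-node sum of limits = 10 x 3 = 30  -> capped at the node capacity of 20
#   licensed vCPUs         = 20
```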
OpenShift Container Platform licensing scenarios
As part of the purchase of IBM Cloud Paks, customers are provided with a fixed number of OpenShift Container Platform (OCP) license entitlements, as defined in the License Information document for Cloud Pak for Integration.
Red Hat licensing for OCP is calculated according to the worker node vCPU count (worker node capacity), and licenses are associated with clusters and worker nodes. When multiple IBM software products run on the same worker node, each of them consumes the OCP license entitlement that covers the entire worker node. This is inefficient from the licensing perspective. It is a more efficient use of OCP licenses to run the different IBM software products on separate worker nodes. It is also easier, from the Red Hat license compliance perspective, when the workloads are separated by node.
Avoiding a situation where workloads compete for resources or otherwise conflict
Sometimes you deploy multiple workloads onto a cluster and don't want them to compete for resources. One example of this situation would be when you deploy two different Cloud Paks—such as Cloud Pak for Integration and Cloud Pak for Data—onto the same cluster.
Separating workloads that are known to use large amounts of ephemeral storage
Kubernetes emptyDir volumes, as well as the writable layers of containers and some log files, are written to the filesystem on a node during pod lifetime, as described in the OpenShift documentation. Although there is a technology preview feature in OCP enabled by the LocalStorageCapacityIsolation feature gate, this feature is disabled by default. Therefore, if the pods scheduled to a node use more ephemeral storage than is provided on that node, the node is tainted, which causes an eviction of all pods that don't tolerate this taint (see Taints and tolerations for more information).
For this reason, if particular workloads are known to require large amounts of ephemeral storage, it may be necessary to associate them with specific nodes so that they are not scheduled to the same node.
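For illustration, the following sketch shows a pod that writes scratch data to an emptyDir volume and uses a nodeSelector to keep it on nodes set aside for storage-heavy workloads. The label nodeuse=scratch, the names, and the image are example values.

```yaml
# A sketch of a pod that writes heavily to an emptyDir volume and is pinned,
# via a nodeSelector, to nodes set aside for storage-heavy workloads.
# The label "nodeuse=scratch", the names, and the image are example values.
apiVersion: v1
kind: Pod
metadata:
  name: scratch-heavy-example
spec:
  nodeSelector:
    nodeuse: scratch            # example label applied to the dedicated nodes
  containers:
  - name: worker                # example container name
    image: registry.example.com/worker:latest   # example image
    volumeMounts:
    - name: scratch
      mountPath: /tmp/scratch
  volumes:
  - name: scratch
    emptyDir: {}                # backed by the node's local filesystem
```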
Tools for controlling how a workload is deployed to nodes
Kubernetes does provide three tools for workload placement: affinity and anti-affinity, taints and tolerations, and nodeSelectors. If you would like to learn more about these options, see the Reference section.
However, Cloud Pak for Integration recommends that you use the tools provided by the OpenShift Container Platform (OCP): project node selectors and a default cluster-wide nodeSelector.
OCP project node selectors
A Project node selector applies a nodeSelector to any pod that is deployed within the project.
If Cloud Pak for Integration workloads are deployed into projects which have project node selectors, nodeSelectors can be used to constrain the Cloud Pak for Integration workloads to only those nodes with labels that match the nodeSelector. This does not stop other workloads from being scheduled onto those nodes if they are deployed into projects that do not have a project nodeSelector.
This approach works no matter how the pods are created and managed, but it functions best with a workload that is managed by an operator.
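For illustration, a project node selector is expressed as the openshift.io/node-selector annotation on the project's namespace. In this sketch the project name cp4i and the label nodeuse=cp4i are example values.

```yaml
# A sketch of a project (namespace) with a project node selector.
# The project name "cp4i" and the label "nodeuse=cp4i" are example values;
# the annotation can also be applied with "oc annotate namespace ...".
apiVersion: v1
kind: Namespace
metadata:
  name: cp4i
  annotations:
    openshift.io/node-selector: "nodeuse=cp4i"
```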
For additional information, see the OpenShift documentation.
Default cluster-wide nodeSelector
The default cluster-wide node selector applies a default node selector to any pod that is deployed into a project that does not have a project node selector. This can be used to prevent non-Cloud Pak for Integration workloads from being placed onto the nodes that are reserved for Cloud Pak for Integration use.
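For illustration, on a recent OpenShift 4.x cluster the default cluster-wide node selector can be set on the Scheduler resource named cluster; the label nodeuse=general in this sketch is an example value.

```yaml
# A sketch of the cluster-wide default node selector, assuming OpenShift 4.x,
# where it is configured on the cluster-scoped Scheduler resource.
# The label "nodeuse=general" is an example value.
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  name: cluster
spec:
  defaultNodeSelector: nodeuse=general
```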
For additional information, see the OpenShift documentation.
Recommended approach for controlling how a workload is divided among nodes
This procedure describes how to deploy Cloud Pak for Integration workloads across the nodes in a cluster.
- Define two sets of nodes:
- One set of nodes where Cloud Pak for Integration workloads in the cluster will run ("CP4I nodes").
- Another set of nodes where the non-Cloud Pak for Integration workloads will run ("non-CP4I nodes").
- Apply a label, such as nodeuse=cp4i, to the Cloud Pak for Integration nodes, and a different label, such as nodeuse=general, to the non-Cloud Pak for Integration nodes (node labels are illustrated in the sketch after this procedure).
- Create one or more projects for the Cloud Pak for Integration workload, and set a project node selector, such as nodeuse=cp4i, for these projects.
- If you are using OpenShift Container Platform in an environment where the default cluster-wide node selector:
- Is supported, define a default cluster-wide node selector, such as nodeuse=general. Defining this node selector causes all workloads deployed in projects other than the Cloud Pak for Integration projects to be placed onto the general nodes, leaving the Cloud Pak for Integration nodes dedicated exclusively to Cloud Pak for Integration workloads.
- Is not supported, such as Red Hat OpenShift on IBM Cloud, set project node selectors, such as nodeuse=general, on the non-Cloud Pak for Integration projects in the cluster. Setting these node selectors causes all workloads deployed in non-Cloud Pak for Integration projects to be placed onto the general nodes, leaving the Cloud Pak for Integration nodes exclusively for Cloud Pak for Integration workloads. Ensure that any new projects that users create also have this project-wide node selector applied to them.
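The node labels used in this procedure are normally applied with the oc CLI; the following sketch shows what the resulting labels look like on the Node objects. The node names and the nodeuse key are example values.

```yaml
# A sketch of the node labels assumed by this procedure. In practice they are
# usually applied with commands such as "oc label node worker-0 nodeuse=cp4i"
# rather than by editing Node objects directly. Names and labels are examples.
apiVersion: v1
kind: Node
metadata:
  name: worker-0            # example node reserved for Cloud Pak for Integration
  labels:
    nodeuse: cp4i
---
apiVersion: v1
kind: Node
metadata:
  name: worker-1            # example node for all other workloads
  labels:
    nodeuse: general
```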
Reference
Kubernetes node placement tools
Kubernetes provides multiple different ways to influence node placement:
- Affinity and anti-affinity: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity
- In short, nodeAffinity provides a way to require or prefer that pods are scheduled to nodes that have particular labels (see the sketch after this list).
- Taints and tolerations: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
- In short, nodes can be tainted so that only pods that are explicitly marked as tolerating the taint can run there.
- At first sight, this might appear to be a good way to place workloads onto dedicated nodes. It might seem that it would be possible to taint the nodes for use by Cloud Pak for Integration and then add a toleration to the Cloud Pak for Integration pods so that they, and only they, will run there. Theoretically this will work; however, workloads are managed by operators that define the PodSpec for their workloads. Some of these operators provide ways to override the podSpecs but some do not. Furthermore, even for the operators that provide override mechanisms, this approach is far more onerous for the user than the use of project node selectors.
- nodeSelector
- The capabilities that are provided by nodeAffinity are a superset of those provided by nodeSelectors, and the Kubernetes project has said that nodeSelectors will eventually be deprecated; therefore, we recommend the use of nodeAffinity rather than nodeSelectors. (https://github.com/kubernetes/kubernetes/issues/82331)
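For illustration, the following sketch shows the required form of nodeAffinity for an example label; the pod name, container image, and label are illustrative only.

```yaml
# A minimal sketch of required nodeAffinity for an example label (nodeuse=cp4i).
apiVersion: v1
kind: Pod
metadata:
  name: node-affinity-example
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nodeuse
            operator: In
            values:
            - cp4i
  containers:
  - name: app                                  # example container name
    image: registry.example.com/app:latest     # example image
```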