Node licensing

Using a node licensing model can reduce the number of cores that are counted by License Service. It can therefore be a cost-effective way to maximize CPU usage across the containers on your cluster.

The role of the License Service tool

Container usage for components within IBM Cloud Pak® for Integration is tracked through the IBM License Service tool.

License Service collects and measures the license usage of your IBM products at the cluster level. You can retrieve this data for monitoring and compliance purposes. You can also retrieve an audit snapshot, which serves as audit evidence.

You must install License Service to monitor and measure license usage of IBM Cloud Paks and IBM stand-alone containerized software to comply with the pricing rule for containerized environments. Manual license measurements are not allowed.

According to the conventional licensing model, the License Service monitors the number of containers that are deployed in Cloud Pak for Integration and the amount of CPU that is allocated to them. Specifically, the License Service uses the spec.template.spec.containers[].resources.limits.cpu pod resource values, which represent the maximum CPU that is available to the workload (not the minimum you might select). This approach is similar to how licensing works for virtual machines (VMs). You pay based on the total number of CPUs that are assigned to a VM, regardless of whether a Cloud Pak for Integration component fully uses them or not.
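
To make the conventional calculation concrete, the following Python sketch sums the `limits.cpu` values across the containers of a few workloads. It only illustrates the arithmetic described above and is not how License Service itself is implemented. The function names (`cpu_to_millicores`, `licensed_millicores`, `make_workload`), the sample workloads, and the multiplication by `replicas` are assumptions made for this example.

```python
def cpu_to_millicores(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity ("500m", "1", "0.5") to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def licensed_millicores(workloads: list[dict]) -> int:
    """Sum replicas x limits.cpu over every container of every workload."""
    total = 0
    for workload in workloads:
        replicas = workload["spec"].get("replicas", 1)
        containers = workload["spec"]["template"]["spec"]["containers"]
        for container in containers:
            limit = container["resources"]["limits"]["cpu"]
            total += replicas * cpu_to_millicores(limit)
    return total


def make_workload(replicas: int, cpu_limit: str) -> dict:
    """Build a minimal, hypothetical workload with a single container."""
    return {
        "spec": {
            "replicas": replicas,
            "template": {
                "spec": {
                    "containers": [
                        {
                            "resources": {
                                "requests": {"cpu": "100m"},
                                "limits": {"cpu": cpu_limit},
                            }
                        }
                    ]
                }
            },
        }
    }


# Three replicas limited to 500m plus two replicas limited to 1 core.
workloads = [make_workload(3, "500m"), make_workload(2, "1")]
total = licensed_millicores(workloads)
print(f"{total / 1000} cores counted from CPU limits")  # 3.5 cores
```

Note that the CPU requests (100m in this sample) play no part in the result; only the limits are counted.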

By default, when you deploy containers to Kubernetes clusters (including Red Hat OpenShift), the containers are distributed across all worker nodes. This distribution helps to balance the load and maximizes CPU utilization across the cluster. For example, if you have a cluster with 3 worker nodes and deploy a Cloud Pak for Integration component with 3 replicas, each replica is typically scheduled on a different worker node. This ensures efficient use of resources and better performance.

Node licensing overview

According to the node licensing model, instead of licensing based on the spec.template.spec.containers[].resources.limits.cpu values of the chargeable workload, the workload is licensed based on the CPU capacity of the worker node.

When the License Service analyzes the data it has gathered, it checks the value of the CPU limits that are allocated to pods on each worker node. If the spec.template.spec.containers[].resources.limits.cpu total that is assigned to pods on a worker exceeds that worker's actual size (number of cores), the License Service assigns VPC licenses only up to the size of that worker.

For example, suppose you have 100 pods running on worker 1, each with the following CPU resource allocations:

spec.template.spec.containers[].resources.requests.cpu = 100m (0.1 core)
spec.template.spec.containers[].resources.limits.cpu = 1000m (1 core)

In the node licensing model, if worker 1 has 24 cores, instead of paying for 100 cores' worth of VPC licenses, you would pay only for 24 cores. The "Example scenario" in the next section explains this in more detail.
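
The capping rule can be sketched in a few lines of Python. This is a simplified illustration of the rule as described in this section, not the License Service implementation; the function name `node_licensed_millicores` and the second, lightly loaded worker are hypothetical.

```python
def node_licensed_millicores(pod_limits_m: list[int], node_capacity_m: int) -> int:
    """Licensed CPU for one worker: the sum of pod limits, capped at node size."""
    return min(sum(pod_limits_m), node_capacity_m)


# Worker 1 from the example: 100 pods, each with a 1000m limit, on 24 cores.
worker_1 = node_licensed_millicores([1000] * 100, 24_000)
print(worker_1 // 1000)  # 24

# A lightly loaded (hypothetical) worker stays below the cap,
# so it is counted by its pod limits instead.
worker_2 = node_licensed_millicores([500] * 10, 24_000)
print(worker_2 // 1000)  # 5
```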

Node licensing is best suited to workloads that have high CPU usage during initialization, or to flexible workloads in which the flows are not all exercised at a constant rate all the time.

Example scenario: Conventional licensing model

In this example, your organization uses Cloud Pak for Integration on a Kubernetes cluster with the following configuration.

Cluster details

Number of worker nodes: 36
Resources for each worker: 24 CPU cores/48 GB RAM
Total cluster capacity: 864 CPU cores (36 x 24)

Workload characteristics

Number of pods: approximately 700
Workload behavior:
- CPU usage is variable throughout the day.
- Initial container startup requires a high CPU limit (approximately 500m).
- After initialization, steady-state CPU usage is low (approximately 100m).

Current (conventional) licensing situation

Default resource limits cause workloads to be spread across many nodes.
Total licensed cores based on limits is approximately 700 x 0.5 = 350 cores.

This situation leads to higher licensing costs.

Proposed solution: move to node-based licensing

Instead of licensing based on the total CPU limits of the pods, node-based licensing counts at most the CPU capacity of each worker node that runs chargeable workloads.

Your organization can potentially gain the following benefits:
- Simplified cost calculation.
- A possible reduction in overall license cost, by leveraging the full node capacity rather than the sum of pod limits.
- Pods are free to have virtually unlimited CPU limits to use if needed, which allows for better workload performance and startup speeds.

Let's assume that you want to optimize worker configuration while allowing a tolerance for up to n failures. In this scenario, we assume that only one node at a time will be removed (for example, during maintenance in a rolling update sequence).

Optimal configuration: 7 workers

Runtime distribution: 700 runtimes distributed evenly across 7 workers, with each worker running 100 containers
Total CPU usage per container (normal operations): 0.1 cores
Total CPU usage per worker: 100 containers × 0.1 cores = 10 cores
Available CPU remaining per worker (assuming 24 cores per node):
24 (cores per node) - 10 (CPU usage per worker) = 14 available cores

The example configuration has 7 workers. Each worker uses 10 cores of CPU on a node with a total of 24 cores.

Licensing impact: Using this configuration, the License Service counts only 7 (workers) × 24 (cores per node) = 168 cores for licensing purposes. This is less than half the number of cores that are counted when the workloads are distributed across the entire cluster and licensed by their pod limits (350 cores).
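
The comparison between the two models can be reproduced with the figures from this scenario. The following Python sketch is only a worked version of that arithmetic; the constant and variable names are chosen for this example, and it assumes that pods are spread evenly across the 7 workers.

```python
PODS = 700
STARTUP_LIMIT_M = 500     # CPU limit per pod, in millicores
STEADY_USAGE_M = 100      # steady-state CPU usage per pod, in millicores
WORKERS = 7
CORES_PER_WORKER = 24

# Conventional model: every pod is counted at its CPU limit.
conventional_cores = PODS * STARTUP_LIMIT_M // 1000

# Node model: the limits on each worker are capped at the worker's size.
pods_per_worker = PODS // WORKERS
limits_per_worker_m = pods_per_worker * STARTUP_LIMIT_M
licensed_per_worker_m = min(limits_per_worker_m, CORES_PER_WORKER * 1000)
node_cores = WORKERS * licensed_per_worker_m // 1000

# Capacity that is left on each worker during normal operations.
steady_cores_per_worker = pods_per_worker * STEADY_USAGE_M // 1000
free_cores_per_worker = CORES_PER_WORKER - steady_cores_per_worker

print(f"conventional licensing: {conventional_cores} cores")   # 350
print(f"node licensing:         {node_cores} cores")           # 168
print(f"free cores per worker:  {free_cores_per_worker}")      # 14
```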

Worker failure scenario

If a worker fails, 100 container workloads (each requiring 0.5 cores to start) are rescheduled across the remaining workers.

To start all 100 instances (quickly, and at the same time), a total of 50 cores is needed (100 × 0.5 cores).
- These 50 cores are distributed evenly among the remaining 6 workers.
  - Each worker handles approximately 8.3 cores (50 ÷ 6), equivalent to 16 or 17 instances.
  - Each worker has 14 available cores by default, so it can accommodate the additional 8.3 cores with headroom. This additional capacity accounts for slight variations in resource requirements for a small percentage of containers. It enables all pods to start at the same time. However, it is not required; without the additional capacity, the startup of all flows would just be delayed by a few minutes.
After startup, each node runs approximately 117 pods (original plus rescheduled workload), using 117 pods × 0.1 cores ≈ 11.7 cores.

Following Worker 1’s failure, the workload is redistributed across the remaining 6 worker nodes, with approximately 11.7 cores in use on each.
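
The failure arithmetic can be checked with a short Python sketch. It assumes that the pods from the failed worker are spread evenly over the 6 surviving workers (about 16.7 pods each) and that every rescheduled pod needs its full 0.5-core startup allowance at the same time. The function name `after_worker_failure` and its result keys are hypothetical.

```python
def after_worker_failure(pods, workers, cores_per_worker,
                         startup_cores, steady_cores, failed=1):
    """Estimate the per-worker load after `failed` workers are lost."""
    pods_per_worker = pods / workers
    survivors = workers - failed
    # Pods from the failed workers, spread evenly over the survivors.
    moved = pods_per_worker * failed / survivors
    return {
        "moved_pods_per_worker": moved,
        "startup_cores_needed": moved * startup_cores,
        "free_cores_before": cores_per_worker - pods_per_worker * steady_cores,
        "steady_cores_after": (pods_per_worker + moved) * steady_cores,
    }


result = after_worker_failure(pods=700, workers=7, cores_per_worker=24,
                              startup_cores=0.5, steady_cores=0.1)
for name, value in result.items():
    print(f"{name}: {value:.1f}")

# The rescheduled pods can all start at once only if the startup demand
# fits into the capacity that was free before the failure.
fits = result["startup_cores_needed"] <= result["free_cores_before"]
print("simultaneous restart fits:", fits)
```

With the figures from this scenario, the startup demand per surviving worker stays below the 14 cores that are free, so all rescheduled pods can start at the same time.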

Additional considerations for node licensing

Keep these factors in mind when considering this architecture:

Cluster auto-scaling:
- If auto-scaling is enabled, ensure that auto-scaling aligns with node-based licensing limits.
Monitoring and optimization:
- Continuously monitor actual versus requested CPU usage.
- Adjust scheduling policies as needed to improve node utilization.
Cost analysis:
- Compare current core-based license fees with projected node-based license fees.
- Consider the savings from fewer licensed cores versus fixed per-node costs.
Simultaneous deployment:
- For initial deployment, you need to stage your flows. There is not sufficient capacity across all workers for every flow to start simultaneously. Simultaneous deployment would likely result in delays due to CPU bottlenecks and slow container startups.
Cost-benefit for increased failure tolerance:
- A failure that involves more than one worker might impact startup times. To fully tolerate multiple failures, add extra workers. This ensures sufficient headroom across the remaining workers to host the workload from the failed nodes. However, if you try to tolerate 3 or 4 worker failures, this could result in excessive scaling up that negates the benefit of node licensing over conventional workload placement.
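
Two of these considerations can be quantified with the figures from the example scenario. The following Python sketch is a rough estimate only: it assumes that pods are spread evenly across the dedicated workers, that every pod keeps a 500m limit, and that each worker is licensed at the lower of its pod limits and its 24 cores. The names are chosen for this example.

```python
PODS = 700
STARTUP_LIMIT_M = 500      # CPU limit per pod, in millicores
CORES_PER_WORKER = 24

# Licensed cores under the conventional model (sum of pod limits).
conventional_cores = PODS * STARTUP_LIMIT_M // 1000


def node_licensed_cores(workers: int) -> int:
    """Licensed cores when pods are spread evenly over `workers` nodes."""
    total_limits_m = PODS * STARTUP_LIMIT_M
    capacity_m = workers * CORES_PER_WORKER * 1000
    return min(total_limits_m, capacity_m) // 1000


# Simultaneous deployment: starting every pod at once needs the full
# limit for each pod, which exceeds the capacity of 7 workers.
startup_demand_cores = conventional_cores
must_stage = startup_demand_cores > 7 * CORES_PER_WORKER
print("staged deployment needed:", must_stage)  # True

# Cost-benefit: each extra worker added for failure tolerance reduces
# the saving compared with the conventional model.
savings = {w: conventional_cores - node_licensed_cores(w)
           for w in (7, 10, 14, 15)}
for workers, saved in savings.items():
    print(f"{workers} workers: {node_licensed_cores(workers)} licensed cores, "
          f"{saved} fewer than conventional")
```

Under these assumptions the saving shrinks from 182 cores with 7 workers to 14 cores with 14 workers, and disappears at 15 workers.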

Additional potential benefit: You can gain faster startup times. With this configuration, you have the option to use even larger limits. If you keep your cluster sized appropriately, you can set limits of 1 core on each container. The implication is that during normal running, where containers are created individually, they start up even more quickly than they would with the minimum required limits. Only in a failure scenario would startup time be closer to the maximum allowed time.

Scheduling Cloud Pak for Integration workloads on specific worker nodes

To ensure that all your Cloud Pak for Integration workloads run only on a specific subset of worker nodes, use a combination of nodeSelectors, taints, and tolerations.

Label and taint your worker nodes

Identify the nodes that you want to dedicate to Cloud Pak for Integration workloads. For each node, complete these steps:
- Add a label, to enable integration runtimes to be scheduled onto that node.
- Add a taint, to prevent other workloads from being scheduled on that node.
You can use any label and taint key/value pairs. In this example, we use:
- Label: workloadType=integration
- Taint: workloadType=integration:NoSchedule
1. Apply the label. For <node_name>, enter your actual node name.
```
kubectl label nodes <node_name> workloadType=integration
```
2. Add the taint:
```
kubectl taint nodes <node_name> workloadType=integration:NoSchedule
```
Update nodeSelector and tolerations for the instances in your installation. The example below shows how to update an integration runtime.

For each integration runtime resource, update the Kubernetes resource to include the matching nodeSelector and the corresponding toleration. Updating these values allows the integration runtime to be scheduled on the tainted nodes from the previous step.
```
spec:
  template:
    spec:
      nodeSelector:
        workloadType: integration
      tolerations:
        - key: workloadType
          operator: Equal
          value: integration
          effect: NoSchedule
```
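
To see why both settings are needed, the following Python sketch models the two checks in a simplified way: the nodeSelector must match the node labels, and every NoSchedule (or NoExecute) taint on the node must be matched by a toleration. It is an illustration of the documented Kubernetes matching rules, not the scheduler itself, and it ignores affinity, resource requests, and other scheduling factors. The functions `tolerates` and `can_schedule` and the sample nodes and pods are hypothetical.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a toleration matches a taint (simplified rules)."""
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False              # an empty effect matches every effect
    key = toleration.get("key", "")
    if key and key != taint["key"]:
        return False              # an empty key matches every key
    if toleration.get("operator", "Equal") == "Exists":
        return True               # Exists ignores the value
    return toleration.get("value", "") == taint.get("value", "")


def can_schedule(pod_spec: dict, node: dict) -> bool:
    """Check the nodeSelector and the taints for one pod and one node."""
    labels = node.get("labels", {})
    selector = pod_spec.get("nodeSelector", {})
    if any(labels.get(key) != value for key, value in selector.items()):
        return False
    tolerations = pod_spec.get("tolerations", [])
    for taint in node.get("taints", []):
        if taint["effect"] not in ("NoSchedule", "NoExecute"):
            continue              # PreferNoSchedule does not block scheduling
        if not any(tolerates(t, taint) for t in tolerations):
            return False
    return True


dedicated_node = {
    "labels": {"workloadType": "integration"},
    "taints": [{"key": "workloadType", "value": "integration",
                "effect": "NoSchedule"}],
}
general_node = {"labels": {}, "taints": []}

integration_pod = {
    "nodeSelector": {"workloadType": "integration"},
    "tolerations": [{"key": "workloadType", "operator": "Equal",
                     "value": "integration", "effect": "NoSchedule"}],
}
other_pod = {}  # no nodeSelector and no tolerations

print(can_schedule(integration_pod, dedicated_node))  # True
print(can_schedule(integration_pod, general_node))    # False
print(can_schedule(other_pod, dedicated_node))        # False
print(can_schedule(other_pod, general_node))          # True
```

The nodeSelector keeps the integration runtime off the general-purpose nodes, and the taint keeps other pods off the dedicated nodes; neither setting achieves both effects on its own.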

After you apply these changes, the following result is expected:

The integration runtime pods are scheduled only on the designated Cloud Pak for Integration worker nodes.
No other pods (except for essential Kubernetes system pods, like DNS or ingress) are running on these dedicated workers.

Tip: You must run chargeable workloads on Cloud Pak for Integration-labeled workers because they require VPC licensing. However, non-chargeable components, such as the Platform UI, Automation assets, App Connect Designer, App Connect Dashboard, and certain deployments of MQ, are license-free. These components can be scheduled on other worker nodes to avoid additional cost. For more information, see "Chargeable and non-chargeable components" in Licensing.

The table provides links to available documentation on adding node selectors and tolerations for the respective products.

Component product	Node selectors	Tolerations
API Connect	Additional Kubernetes settings	Additional Kubernetes settings
App Connect	Custom resource values (`spec.affinity` row)
DataPower	nodeSelector	tolerations
Event Streams		adminApi

Additional information

Taints and Tolerations in the Kubernetes documentation
nodeSelector in the Kubernetes documentation
License Service in the Cloud Pak foundational services documentation