November 24, 2020 | Written by: Chen Wang
Categorized: Hybrid Cloud
Allocating resources for containerized microservices in Kubernetes is not a task to be taken lightly. When too little CPU, memory or other resources are requested, developers risk application performance issues or having their under-provisioned containers evicted from the nodes on which they are deployed. Grossly over-provisioning resources, the default for many developers, creates a different set of problems, such as low utilization in some container groupings while others lack access to much-needed resources.
Unfortunately, none of the default scheduler plugins in Kubernetes considers the actual load in a cluster when scheduling. To close that gap, we developed a way to optimize resource allocation through load-aware scheduling and submitted our “Trimaran: Real Load Aware Scheduling” Kubernetes enhancement proposal, with the hope of soon merging this feature into the Kubernetes scheduler plugins.
The core of our proposal is to account for the real load when making scheduling decisions. Kubernetes developers can specify the amount of resources a container needs by setting the request and the limit of resources. The limit guarantees that a container’s resource usage never exceeds a specific value. If the node on which a Pod (the smallest deployable unit of computing that you can create and manage in Kubernetes) is running has enough available resources, the node allows a container in the Pod to use more resources than its request, but never more than its limit. Because the request is reserved for a container, setting it accurately is critical for efficient resource allocation in a cluster.
Using fewer nodes more efficiently
Cluster admins often point out that overall cluster resource utilization is very low, yet they are reluctant to resize containers that were sized to handle peak usage. Such peak-usage-based resource allocation leads to an ever-growing cluster, extremely low utilization of computing resources most of the time and high machine costs.
For a cluster that runs stable production services, the main objective is to minimize machine costs by efficiently utilizing all nodes within that cluster. Our approach makes the Kubernetes scheduler aware of the gap between resource allocation and actual resource utilization. A scheduler that takes advantage of this gap may help pack Pods more efficiently, whereas the default scheduler, which considers only Pod requests and allocatable resources on nodes, cannot.
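The gap between allocation and utilization can be made concrete with a small sketch. The `NodeSnapshot` type, its field names and the sample numbers are assumptions made for illustration; they are not taken from the proposal or from any Kubernetes API.

```python
from dataclasses import dataclass


@dataclass
class NodeSnapshot:
    """Hypothetical per-node view; the field names are illustrative only."""
    name: str
    allocatable_cpu: float  # CPUs the node can hand out to Pods
    requested_cpu: float    # sum of the CPU requests of Pods on the node
    used_cpu: float         # measured CPU usage (assumed to come from a metrics provider)


def requested_percent(node: NodeSnapshot) -> float:
    """Share of the node that is reserved by requests."""
    return 100.0 * node.requested_cpu / node.allocatable_cpu


def used_percent(node: NodeSnapshot) -> float:
    """Share of the node that is actually in use."""
    return 100.0 * node.used_cpu / node.allocatable_cpu


def allocation_gap_percent(node: NodeSnapshot) -> float:
    """Reserved-but-unused share: what a request-only scheduler cannot see."""
    return requested_percent(node) - used_percent(node)


# Made-up numbers: 6 of 8 CPUs are requested, but only 2 are in use.
node = NodeSnapshot("node-1", allocatable_cpu=8.0, requested_cpu=6.0, used_cpu=2.0)
print(requested_percent(node), used_percent(node), allocation_gap_percent(node))
```

A scheduler that looks only at requests sees this node as 75% full, while the measured load is 25%; the difference is the gap a load-aware scheduler can take into account.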
Maintaining node utilization at a certain level and balancing peak usage risk
Increasing resource utilization as much as possible may not be the right solution for all clusters. As it always takes some time to scale up a cluster to handle sudden spikes of load, cluster admins typically prefer to leave adequate room for bursty loads. This gives them enough time to add more nodes to the cluster when needed.
For example, a cluster admin might find that the load is seasonal and increases periodically, and that resource utilization always rises by x% before new nodes can be added to the cluster. In such a scenario, the cluster admin would want to keep the average utilization of all nodes at or below (100 – x)%.
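The headroom arithmetic is simple enough to write down. This is a minimal sketch; the function names and the 30% spike are hypothetical values chosen for illustration.

```python
def target_utilization(spike_percent: float) -> float:
    """Highest average utilization (in %) that still leaves room for a spike
    of `spike_percent` percentage points before new nodes arrive."""
    if not 0.0 <= spike_percent <= 100.0:
        raise ValueError("spike_percent must be between 0 and 100")
    return 100.0 - spike_percent


def has_headroom(current_percent: float, spike_percent: float) -> bool:
    """True if the node can absorb the expected spike without saturating."""
    return current_percent <= target_utilization(spike_percent)


# Assumed example: utilization rises by 30 points before scale-up completes.
print(target_utilization(30.0))   # 70.0
print(has_headroom(65.0, 30.0))   # True
print(has_headroom(80.0, 30.0))   # False
```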
In some circumstances, scheduling pods to maintain the average utilization on all nodes is also risky because we do not know how the utilization of different nodes varies.
For example, suppose two nodes each have a capacity of eight CPUs, but only five are requested on each. In that case, the default scheduler will deem the two nodes equivalent (assuming everything else is identical). However, our approach to optimizing resource allocation through load-aware scheduling can extend the node scoring algorithm to favor the node with the lower average actual utilization over a given period (e.g., the last six hours).
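A sketch of such a score, assuming usage samples in CPUs are already available for each node. The scoring rule (100 minus the average utilization in percent) and the sample values are assumptions for illustration, not the formula used by the proposal.

```python
from statistics import mean
from typing import Sequence


def average_utilization_score(cpu_samples: Sequence[float], capacity_cpu: float) -> float:
    """Score in [0, 100]; a lower average measured utilization scores higher."""
    avg_percent = 100.0 * mean(cpu_samples) / capacity_cpu
    return max(0.0, 100.0 - avg_percent)


# Both nodes have 8 CPUs and 5 requested, so a request-only view ties them.
# The measured usage below is made up for the example.
node1_usage = [2.0, 2.0, 2.0, 2.0]  # averages 2 CPUs
node2_usage = [4.0, 4.0, 4.0, 4.0]  # averages 4 CPUs

print(average_utilization_score(node1_usage, 8.0))  # 75.0
print(average_utilization_score(node2_usage, 8.0))  # 50.0
```

With these numbers, the node that is less busy on average receives the higher score even though both nodes have identical requests.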
If Node 1 and Node 2 are equally favorable according to their average actual utilization over a given period, a scoring algorithm that considers only the average utilization cannot differentiate the two nodes and may select one of them at random, as shown in the above figure.
However, by looking at historical data and the actual CPU utilization on each node, we can clearly see that resource utilization on Node 2 varies more than on Node 1. At peak hours, therefore, utilization on Node 2 is more likely to exceed the total capacity or the targeted utilization level. We should then favor Node 1 for placing the new Pod, to reduce the risk of under-provisioning in peak hours and to guarantee better performance for the new Pod.
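One way to capture this is to penalize variation as well as the mean. The sketch below treats "mean plus a multiple of the standard deviation" as the risk measure; that formula, the `margin` parameter and the sample data are assumptions for illustration and should not be read as the plugin's actual algorithm.

```python
from statistics import mean, pstdev
from typing import Sequence


def risk_aware_score(cpu_samples: Sequence[float], capacity_cpu: float,
                     margin: float = 1.0) -> float:
    """Score in [0, 100]; nodes with a high or highly variable load score lower.

    `margin` (hypothetical) sets how many standard deviations of variation
    are added to the mean when estimating the load at peak.
    """
    risk_cpu = mean(cpu_samples) + margin * pstdev(cpu_samples)
    risk_percent = 100.0 * risk_cpu / capacity_cpu
    return max(0.0, min(100.0, 100.0 - risk_percent))


# Same average (4 of 8 CPUs), very different variation; values are made up.
node1_usage = [4.0, 4.0, 4.0, 4.0]
node2_usage = [1.0, 7.0, 1.0, 7.0]

score1 = risk_aware_score(node1_usage, 8.0)
score2 = risk_aware_score(node2_usage, 8.0)
print(score1, score2)
print("prefer Node 1" if score1 > score2 else "prefer Node 2")
```

An average-only score ties these two nodes, while the variation term breaks the tie in favor of the steadier Node 1.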
In addition to efficient scheduling according to the actual usage, we also need an advanced scheduler that can balance the risks of resource contention during peak usage. IBM scientist Asser Tantawi, who previously contributed an open-source safe agent scheduler for use in Kubernetes, will help solve this issue by adapting his earlier plugin to a new plugin. The new “LoadVariationRiskBalancing” plugin is part of the Trimaran scheduling framework, a proposed enhancement for the Kubernetes scheduling plugin framework.
Load aware scheduling: system design
The graph below shows the design of the Trimaran load-aware scheduling framework. We added to the default Kubernetes scheduler a load watcher that periodically retrieves, aggregates and analyzes resource usage metrics from metric providers such as Prometheus. The load watcher also caches the analysis results and exposes them to scheduler plugins, which use them to filter and score nodes.
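The retrieve, aggregate and cache cycle can be sketched in a few lines. The `LoadWatcher` class below, its `fetch` callable, the `ttl_seconds` parameter and the shape of the returned statistics are all hypothetical; the real component and its interfaces are described in the KEP, and a real provider client (for example one querying Prometheus) would replace the fake fetch function.

```python
import time
from statistics import mean, pstdev
from typing import Callable, Dict, List, Optional

# A fetch function returns recent CPU usage samples per node name.
FetchFn = Callable[[], Dict[str, List[float]]]


class LoadWatcher:
    """Illustrative cache of aggregated node metrics (not the real component)."""

    def __init__(self, fetch: FetchFn, ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Dict[str, float]] = {}
        self._fetched_at: Optional[float] = None

    def _refresh(self) -> None:
        # Retrieve raw samples, then aggregate them into per-node statistics.
        raw = self._fetch()
        self._cache = {
            node: {"mean": mean(samples), "stdev": pstdev(samples)}
            for node, samples in raw.items() if samples
        }
        self._fetched_at = self._clock()

    def node_stats(self, node: str) -> Optional[Dict[str, float]]:
        """Return cached statistics, refreshing them once the cache is stale."""
        stale = (self._fetched_at is None
                 or self._clock() - self._fetched_at > self._ttl)
        if stale:
            self._refresh()
        return self._cache.get(node)


def fake_provider() -> Dict[str, List[float]]:
    # Stand-in for a metrics provider; the numbers are made up.
    return {"node-1": [2.0, 4.0], "node-2": [1.0, 7.0]}


watcher = LoadWatcher(fake_provider, ttl_seconds=60.0)
print(watcher.node_stats("node-1"))
```

Serving plugins from a cache keeps slow metric queries off the scheduling path, at the cost of acting on slightly stale data.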
We also propose two new plugins, “TargetLoadPacking” and “LoadVariationRiskBalancing”, to further exploit the load watcher’s potential. The TargetLoadPacking plugin aims to score nodes according to their actual usage. The new LoadVariationRiskBalancing plugin attempts to equalize the risks of resource contention on all nodes at peak usage.
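To give a feel for scoring by actual usage against a target, here is one possible scoring curve: the score rises as a node's predicted utilization approaches a target level and falls once the target is exceeded. The curve, the function name and the 70% target are assumptions chosen for illustration; the scoring rules actually proposed are specified in the KEP.

```python
def target_packing_score(predicted_percent: float, target_percent: float) -> float:
    """Hypothetical score in [0, 100] that peaks when utilization hits the target.

    `predicted_percent` is the node's measured utilization plus the share the
    new Pod is expected to add, expressed as a percentage of node capacity.
    """
    if not 0.0 < target_percent < 100.0:
        raise ValueError("target_percent must be strictly between 0 and 100")
    u = max(0.0, min(100.0, predicted_percent))
    if u <= target_percent:
        # Below the target: fuller nodes score higher, which encourages packing.
        return 100.0 * u / target_percent
    # Above the target: the score drops linearly to 0 at full utilization.
    return 100.0 * (100.0 - u) / (100.0 - target_percent)


for utilization in (35.0, 70.0, 85.0, 100.0):
    print(utilization, target_packing_score(utilization, 70.0))
```

A curve of this shape packs Pods onto already-used nodes until the target is reached and then steers new Pods elsewhere, which matches the goal of maintaining node utilization at a chosen level.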
More details of our load aware scheduling framework can be found in the KEP under scheduler-plugins repo: https://github.com/kubernetes-sigs/scheduler-plugins/pull/61.