Improving resource efficiency for Kubernetes clusters via load-aware scheduling

Allocating resources for containerized microservices in Kubernetes is not a task to be taken lightly. When too little CPU, memory, or other resources are requested, developers risk application performance problems, or their under-provisioned containers may be evicted from the nodes on which they are deployed. Grossly over-provisioning resources, the default for many developers, creates a different set of problems: some container groups sit at low utilization while others are starved of much-needed resources.

Unfortunately, Kubernetes ships no default scheduler plugin that considers the actual load on a cluster when scheduling. To close that gap, we developed a way to optimize resource allocation through load-aware scheduling and submitted our “Trimaran: Real Load Aware Scheduling” Kubernetes enhancement proposal (KEP), with the hope of soon merging this feature into the Kubernetes scheduler-plugins project.

The core of our proposal is to account for the real load when making scheduling decisions. Kubernetes developers specify the amount of resources a container needs by setting a request and a limit for each resource. The limit guarantees that a container’s resource usage never exceeds a specific value. If the node on which a Pod (the smallest deployable unit of computing that you can create and manage in Kubernetes) is running has enough available resources, the node allows a container in the Pod to use more resources than its request, up to its limit. Because the request is the amount reserved for a container, setting it accurately is critical for efficient resource allocation in a cluster.
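For reference, the request and limit are set per resource on each container in the Pod spec. A minimal sketch (the names and values here are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-app           # hypothetical name
spec:
  containers:
  - name: web
    image: nginx:1.21
    resources:
      requests:
        cpu: "500m"        # reserved: the scheduler places the Pod only on a node with 0.5 CPU unrequested
        memory: "256Mi"
      limits:
        cpu: "1"           # the container may burst up to 1 CPU if the node has spare capacity
        memory: "512Mi"    # exceeding the memory limit gets the container OOM-killed
```

The default scheduler reasons only about the `requests` figures and each node's allocatable capacity; actual consumption between request and limit is invisible to it.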

Using fewer nodes more efficiently

Cluster admins often point out that overall cluster resource utilization is very low, yet they are reluctant to shrink containers that are sized for peak usage. Such peak-usage-based resource allocation leads to an ever-growing cluster, extremely low utilization of computing resources most of the time, and enormous machine costs.

For a cluster that runs stable production services, the main objective is to minimize machine costs by efficiently utilizing all nodes in the cluster. Our approach makes the Kubernetes scheduler aware of the gap between resource allocation and actual resource utilization. A scheduler that exploits this gap can pack Pods more efficiently, whereas the default scheduler, which considers only Pod requests and the allocatable resources on nodes, cannot.

Maintaining node utilization at a target level while balancing peak-usage risk

Increasing resource utilization as much as possible may not be the right solution for all clusters. As it always takes some time to scale up a cluster to handle sudden spikes of load, cluster admins typically prefer to leave adequate room for bursty loads. This gives them enough time to add more nodes to the cluster when needed.

For example, a cluster admin might find that the load is seasonal and increases periodically. Before new nodes can be added to the cluster, however, resource utilization always rises by x%. In such a scenario, the cluster admin would want to keep the average utilization of all nodes around or below (100 - x)%. For instance, if bursts add 20 percentage points of utilization before the cluster can scale up, the target would be an average utilization of at most 80%.

In some circumstances, scheduling Pods to maintain the average utilization on all nodes is also risky, because the average says nothing about how the utilization of different nodes varies over time.

Figure: cluster Nodes 1 and 2

For example, suppose two nodes each have a capacity of eight CPUs, with five requested on each. In that case, the default scheduler will deem the two nodes equivalent (assuming everything else is identical). Our load-aware approach, however, can extend the node-scoring algorithm to favor the node with the lower average actual utilization over a given period (e.g., the last six hours).
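The tie-breaking idea can be sketched as a scoring function over measured utilization. This is a hypothetical formula in Python, not the plugin's actual implementation (which is written in Go):

```python
def score_node(requested_cpus, capacity_cpus, avg_utilization):
    """Score a node in [0, 100]; higher is better.

    avg_utilization is the node's mean actual CPU utilization (0.0-1.0)
    over a trailing window, e.g. the last six hours.
    Illustrative sketch only, not the Trimaran plugins' exact formula.
    """
    if requested_cpus > capacity_cpus:
        return 0  # the request cannot fit at all
    # Favor nodes whose measured load is lower, breaking the tie that a
    # request-only scheduler sees between identically-requested nodes.
    return round(100 * (1.0 - avg_utilization))

# Both nodes: 8 CPUs capacity, 5 requested -- identical to the default scheduler.
# Node 1 averaged 30% actual utilization, Node 2 averaged 55%.
print(score_node(5, 8, 0.30))  # 70 -> preferred
print(score_node(5, 8, 0.55))  # 45
```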

If Node 1 and Node 2 are equally favorable according to the average actual utilization over a given period, a scoring algorithm that considers only the average cannot differentiate the two nodes and may select one of them at random, as shown in the figure above.

However, by looking at historical data on actual CPU utilization, we can clearly see that the resource utilization on Node 2 varies more than on Node 1. At peak hours, its utilization is therefore more likely to exceed the total capacity or the targeted utilization level. We should then favor Node 1 for the new Pod, to avoid the risk of under-provisioning at peak hours and to guarantee better performance for the new Pod.
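One way to capture this preference is to fold the variability of the load into the score, for example by penalizing the mean plus the standard deviation of utilization. This is a hypothetical risk metric sketched in Python; the actual plugin's formula may differ:

```python
from statistics import mean, pstdev

def risk_score(utilization_samples):
    """Lower risk -> higher score in [0, 100].

    Risk combines the average load with its variability: a node whose
    load swings widely is more likely to exceed capacity at peak.
    Illustrative sketch only; the real plugin's formula may differ.
    """
    mu = mean(utilization_samples)
    sigma = pstdev(utilization_samples)
    risk = min(mu + sigma, 1.0)  # cap at 100% utilization
    return round(100 * (1.0 - risk))

# Same 40% average on both nodes, but Node 2 is far more bursty.
node1 = [0.38, 0.40, 0.42, 0.40]   # steady
node2 = [0.10, 0.70, 0.15, 0.65]   # spiky

print(risk_score(node1))  # higher score -> safer placement
print(risk_score(node2))
```

With equal averages, the steadier node wins, matching the intuition above.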

In addition to scheduling efficiently according to actual usage, we also need an advanced scheduler that can balance the risk of resource contention during peak usage. IBM scientist Asser Tantawi, who previously contributed an open-source safe agent scheduler for use in Kubernetes, is addressing this by adapting his earlier work into a new plugin. The new “LoadVariationRiskBalancing” plugin is part of the Trimaran scheduling framework, a proposed enhancement to the Kubernetes scheduling plugin framework.

Load-aware scheduling: system design

The figure below shows the design of the Trimaran load-aware scheduling framework. We add a load watcher to the default Kubernetes scheduler that periodically retrieves, aggregates, and analyzes resource-usage metrics from metric providers such as Prometheus. The load watcher also caches the analysis results and exposes them to scheduler plugins for filtering and scoring nodes.

Figure: the design of the Trimaran load-aware scheduling framework
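The watcher's role can be sketched as a small polling cache. This is illustrative Python with a stubbed metrics provider; the real load watcher is a Go component, and the names here are hypothetical:

```python
from statistics import mean, pstdev

class LoadWatcher:
    """Periodically pulls per-node utilization from a metrics provider
    (e.g. Prometheus), keeps a rolling window of samples, and serves
    aggregated statistics to scheduler plugins. Illustrative sketch only."""

    def __init__(self, fetch_metrics, window_size=72):
        self.fetch_metrics = fetch_metrics  # callable: () -> {node: utilization}
        self.window_size = window_size      # e.g. 72 samples at 5-min intervals ~ 6 h
        self.samples = {}                   # node -> list of recent utilization values

    def poll_once(self):
        for node, utilization in self.fetch_metrics().items():
            window = self.samples.setdefault(node, [])
            window.append(utilization)
            del window[:-self.window_size]  # keep only the trailing window

    def stats(self, node):
        """Aggregates exposed to filter/score plugins."""
        window = self.samples.get(node, [])
        if not window:
            return None
        return {"avg": mean(window), "stdev": pstdev(window)}

# Usage with a stubbed metrics provider standing in for Prometheus:
fake_metrics = iter([{"node-1": 0.30, "node-2": 0.55},
                     {"node-1": 0.32, "node-2": 0.50}])
watcher = LoadWatcher(lambda: next(fake_metrics))
watcher.poll_once()
watcher.poll_once()
print(watcher.stats("node-1"))
```

Plugins such as the scoring sketches above would read these cached aggregates rather than querying the metrics provider on every scheduling cycle.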

We also propose two new plugins, “TargetLoadPacking” and “LoadVariationRiskBalancing,” to further exploit the load watcher’s potential. The TargetLoadPacking plugin scores nodes according to their actual usage, while the LoadVariationRiskBalancing plugin attempts to equalize the risk of resource contention across all nodes at peak usage.

More details of our load-aware scheduling framework can be found in the KEP in the scheduler-plugins repo: https://github.com/kubernetes-sigs/scheduler-plugins/pull/61.

Research Staff Member, Container Cloud Platform
