Manage resources on overcommitted KVM hosts

Consolidating workloads by overcommitting resources

A key benefit of virtualization is the ability to consolidate multiple workloads onto a single computer system. This consolidation yields savings in power consumption, capital expense, and administration costs. The degree of savings depends on the ability to overcommit hardware resources such as memory, CPU cycles, I/O, and network bandwidth. Technologies such as memory ballooning and Kernel Same-page Merging (KSM) can improve memory overcommitment with proper manual tuning. Autonomic reconfiguration of these controls in response to host and VM conditions can result in even greater savings. In this article, learn how to apply these techniques to increase your savings.


Adam Litke (agl@us.ibm.com), Software Engineer, IBM

Adam Litke began his work on Linux by assisting with the bring-up of the PPC64 architecture at IBM in 2001. Since then he has worked on a variety of projects including kexec-based crash dumping, huge pages, libhugetlbfs, and an automation harness used by test.kernel.org. Currently Adam is focused on virtualization and contributes to qemu, libvirt, the Linux kernel, and Memory Overcommitment Manager.



08 February 2011


Virtualization promises to increase efficiency by enabling workload consolidation. Maximizing virtual machine density while maintaining good performance can be a real challenge:

  • A workload's utilization of resources such as CPU, memory, and bandwidth for network and storage access varies over time; if you can dynamically allocate these resources according to demand, then you can achieve greater density with overcommitment.
  • Optimal resource management also depends on factors such as the system configuration and hardware setup.

One way to develop a management strategy that incorporates both factors is to write a set of rules in a policy language. You can tailor your policy to your unique conditions and effectively balance your consolidation and performance goals.

This article explores the challenges associated with aggressive overcommitment and proposes a policy-driven management tool that can help you address the challenges. The tool is deployed in a typical KVM environment and manages two distinct workloads. This article evaluates the results of this exercise and concludes with suggestions for additional work items and potential research topics.

Overcommitment challenges

A thorough investigation of memory overcommitment in KVM should begin with the Linux memory manager itself. Because it is built on the concepts of virtual memory, Linux employs memory overcommitment techniques by design.

Memory pages requested by a process are not allocated until they are actually used. Using the Linux page cache, multiple processes can save memory by accessing files through shared pages; as memory is exhausted, the system can free up memory by swapping less frequently used pages to disk. While not perfect, these techniques can result in a substantial difference between the amount of memory that is allocated and the amount actually used.

Because KVM virtual machines are regular processes, the standard memory conservation techniques apply. But unlike regular processes, KVM guests contain a nested operating system, which affects memory overcommitment in two key ways. First, KVM guests can have greater memory overcommitment potential than regular processes, because swings in utilization create a large gap between a guest's minimum and maximum memory requirements. Capitalizing on this variability is central to the appeal of virtualization, but it is not always easy.

Second, while the host is managing the memory allocated to a KVM guest, the guest kernel is simultaneously managing the same memory. Lacking any form of collaboration between host and guest, neither memory manager can make optimal decisions about caching and swapping, which can lead to less efficient use of memory and degraded performance.

Linux provides additional mechanisms to address memory overcommitment specific to virtualization.

  • Memory ballooning is a technique in which the host instructs a cooperative guest to release some of its assigned memory so that it can be used for another purpose. This technique can help refocus memory pressure from the host onto a guest.
  • Kernel Same-page Merging (KSM) uses a kernel thread that scans previously identified memory ranges for identical pages, merges them together, and frees the duplicates. Systems that run a large number of homogeneous virtual machines benefit most from this form of memory sharing.
  • Other resource management features, such as cgroups, can dynamically shuffle resources among virtual machines and therefore also have applications in memory overcommitment.

These mechanisms provide an effective framework for managing overcommitted memory, but they all share an important limitation: They must be configured and tuned by an external entity. Automatic configuration is not available because the individual components of the virtualization stack do not have enough information to make tuning decisions.
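
The control points themselves are low-level and easy to reach: KSM is tuned through files under /sys/kernel/mm/ksm, and a guest's memory balloon is driven through libvirt. Listing 1 is a minimal Python sketch of these knobs as an external entity would have to drive them by hand; the domain name and target values are placeholders, not recommendations.

Listing 1. Driving the KSM and ballooning control points by hand

import libvirt  # libvirt Python bindings

# KSM is controlled through sysfs on KSM-enabled kernels.
def set_ksm(run, pages_to_scan, sleep_millisecs):
    for name, value in (("run", run),
                        ("pages_to_scan", pages_to_scan),
                        ("sleep_millisecs", sleep_millisecs)):
        with open("/sys/kernel/mm/ksm/%s" % name, "w") as f:
            f.write(str(value))

# The balloon is controlled per guest through libvirt; setMemory() asks the
# guest's balloon driver to move toward the target size, given in KiB.
def set_balloon(domain_name, target_kib):
    conn = libvirt.open("qemu:///system")
    try:
        dom = conn.lookupByName(domain_name)
        dom.setMemory(target_kib)
    finally:
        conn.close()

# Example: scan 100 pages every 20 ms, and balloon "guest01" (a placeholder
# domain name) down to 1 GiB.
set_ksm(run=1, pages_to_scan=100, sleep_millisecs=20)
set_balloon("guest01", 1024 * 1024)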

For example, it is beyond the scope of a single qemu process to know whether the host is running out of memory and how much memory can be ballooned from its guest without adversely impacting performance. Those decisions can be made only when administrative policy is combined with real-time information about the host and its guests.

To address this challenge for KSM, a daemon called ksmtuned was written to dynamically manage the tunable parameters according to sharing potential and the need for memory. Because ksmtuned manages a single mechanism in isolation, it cannot by itself provide a comprehensive solution.

The use of any overcommitment mechanism fundamentally affects the operation of the host and guest memory management systems. As such, the simultaneous use of multiple mechanisms is likely to cause secondary effects. Figure 1 illustrates an interaction between memory ballooning and KSM.

Figure 1. The effects of memory ballooning on KSM

In this example, increased ballooning pressure undermines KSM's ability to share pages. Guest balloon drivers select pages to balloon without considering whether the host page might be shared. Ballooning a shared page is a mistake because it deprives the guest of resources without actually saving any memory. To maximize efficiency, these types of interactions must be anticipated and managed.

You can deploy KVM-based virtualization in a myriad of configurations using architectures designed to meet different goals. Aggressive virtual machine consolidation comes with a performance trade-off, and the proper balance between density and performance depends on the situation.

Memory overcommitment can increase the demand for all other system resources. Pressure on memory caches will cause increased disk or network I/O, and increased page reclaim activity will cause a guest to consume more CPU cycles.

You can address these challenges with a new daemon designed for the task: Memory Overcommitment Manager (MOM).


Dynamic management with MOM

The MOM (Memory Overcommitment Manager) daemon is shown in Figure 2.

Figure 2. Memory Overcommitment Manager

MOM provides a flexible framework for host and guest statistics collection, a policy engine, and control points for ballooning and KSM. With MOM, an administrator can construct an overcommitment policy that responds to real-time conditions with dynamic adjustments to system tunables to achieve a greater level of memory overcommitment.

MOM uses libvirt to maintain a list of virtual machines running on the host. At a regular collection interval, data is gathered about the host and guests. Data can come from multiple sources (the host /proc interface, the libvirt API, virtio-serial, or a network connection to the guest). Once collected, the data is organized for use by the policy engine. (Virtio is a standard for network and disk device drivers in which just the guest's device driver "knows" it is running in a virtual environment and therefore cooperates with the hypervisor, enabling guests to get high performance network and disk operations. In sum, Virtio delivers most of the performance benefits of paravirtualization.)
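
Listing 2 sketches the collection step in Python: host statistics come from /proc/meminfo, and the list of running guests, with their current balloon size and maximum memory, comes from the libvirt bindings. It is a simplified stand-in for MOM's collectors, not MOM's actual code.

Listing 2. Gathering host and guest statistics

import libvirt

def host_meminfo():
    """Parse /proc/meminfo into a dict of values in KiB."""
    stats = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            stats[key] = int(value.split()[0])  # drop the trailing "kB"
    return stats

def guest_stats(uri="qemu:///system"):
    """Return (name, current balloon KiB, maximum KiB) for each running guest."""
    conn = libvirt.openReadOnly(uri)
    try:
        guests = []
        for dom_id in conn.listDomainsID():        # IDs of running domains
            dom = conn.lookupByID(dom_id)
            state, max_kib, cur_kib, vcpus, cpu_time = dom.info()
            guests.append((dom.name(), cur_kib, max_kib))
        return guests
    finally:
        conn.close()

host = host_meminfo()
print("Host MemFree: %d KiB" % host["MemFree"])
for name, cur, maxmem in guest_stats():
    print("%s: balloon %d / %d KiB" % (name, cur, maxmem))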

The policy engine is essentially a small interpreter that understands programs written in a simple policy language. Periodically, MOM will evaluate the user-supplied policy using the collected data. The policy may trigger configuration changes such as inflation of a guest's memory balloon or a change in the KSM scanning rate.

Over time, the MOM framework can be expanded to collect from additional data sources and to control new memory overcommitment mechanisms as development progresses in this area.


Evaluating MOM

To evaluate MOM and its effectiveness at managing overcommitted memory, let's look at two virtualization workloads:

  • In the first workload, the virtual machines are configured to consume varying amounts of anonymous memory according to a prescribed access pattern.
  • The second workload uses a LAMP benchmark where each virtual machine functions as an independent MediaWiki instance.

For each workload, a MOM policy that controls memory ballooning and KSM is evaluated. To avoid host swapping, each guest's memory balloon is dynamically adjusted in accordance with host and guest memory pressure. KSM is adjusted based on memory pressure and the amount of shareable memory, using the same algorithm as ksmtuned.

The two workload scenarios

The Memknobs workload scenario uses a simple program called Memknobs to create memory pressure by allocating and touching anonymous memory pages in a pattern that challenges the kernel's memory reclaim algorithms. Memknobs allocates a fixed-size buffer and loops through that buffer's pages, writing to each one. By invoking Memknobs repeatedly with a gradually changing buffer size for each iteration, a guest can simulate a memory-bound workload with no I/O component. To overcommit memory on a host, we deployed Memknobs onto 32 virtual machines where each instance used memory in a unique pattern.

For the Cloudy workload scenario, we generated a realistic workload with a disk I/O component using an open suite called Cloudy. Cloudy is an easy-to-set-up LAMP benchmark that measures virtualization scalability and shows the effects of resource overcommitment. Multiple virtual machines are configured as MediaWiki servers. Each wiki is loaded with pages from Wikipedia and randomly generated image data.

An included JMeter test plan exercises all of the instances and measures throughput and response times. The test plan can be configured to produce a steady-state workload or to introduce variability by alternating the load between multiple virtual machine groups. The amount and type of load can be varied by changing the request rate, the number of concurrent users, and the size of the randomly generated wiki image files. A PHP accelerator reduces CPU consumption to a negligible amount. This workload can generate a significant amount of I/O and is sensitive to the amount of bandwidth available when accessing the virtual machine image storage device.

The policy

Our memory ballooning strategy is designed to prevent host swapping, instead directing memory pressure onto the guests. The host, lacking guest page access information, is unable to properly apply its LRU algorithms when selecting pages to swap. Further, the guest operating system has the most information about the workload's memory use and should be able to make the best page replacement decisions.

The policy relies on memory ballooning to maintain a pool of readily freeable host memory. We define freeable memory as the sum of MemFree, Buffers, and Cached as reported in /proc/meminfo. When this pool shrinks below 20 percent of total memory, memory ballooning is initiated. Pressure is applied to guests according to the level of host memory pressure. Free memory is reclaimed first, so guests with more unused memory are ballooned the most. With enough balloon pressure, guests will evict cached pages and start swapping. Once memory pressure subsides, the balloons are deflated to return memory to the guests.
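
Listing 3 restates this ballooning rule as a Python sketch. The 20 percent threshold comes from the policy above; the 5 percent adjustment step and the floor of one quarter of a guest's maximum memory are illustrative simplifications, and the sketch applies the same step to every guest rather than weighting pressure toward those with the most unused memory. The inputs are in the form returned by Listing 2.

Listing 3. A simplified ballooning rule

FREE_POOL_MIN = 0.20   # keep at least 20% of host memory readily freeable

def freeable_kib(meminfo):
    # Freeable host memory: MemFree + Buffers + Cached from /proc/meminfo
    return meminfo["MemFree"] + meminfo["Buffers"] + meminfo["Cached"]

def balloon_targets(meminfo, guests, step=0.05):
    """Return {guest name: new balloon target in KiB}.

    guests is a list of (name, current_kib, max_kib) tuples.
    """
    under_pressure = freeable_kib(meminfo) < FREE_POOL_MIN * meminfo["MemTotal"]
    targets = {}
    for name, cur, maxmem in guests:
        if under_pressure:
            # Inflate the balloon: shrink the guest, but never below an
            # arbitrary floor of one quarter of its maximum memory.
            targets[name] = max(int(cur * (1 - step)), maxmem // 4)
        else:
            # Pressure has subsided: deflate toward the guest's maximum.
            targets[name] = min(int(cur * (1 + step)), maxmem)
    return targets

Each resulting target could then be applied with a call like the set_balloon() helper in Listing 1.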

Our KSM tuning algorithm is designed to match that of ksmtuned. First, a decision whether to enable the kernel thread is made according to a free memory threshold that we set to 20 percent of total host memory. KSM is allowed to run unless the host meets both of the following two conditions:

  • Free memory exceeds the threshold, and
  • Total memory less the free memory threshold exceeds the total amount of memory assigned to all virtual machines.

These conditions test, respectively, whether the host is free of memory pressure and whether the memory assigned to the virtual machines fits comfortably below that threshold; when both hold, page sharing is not needed. Switching ksmd off when it is not needed saves CPU cycles.

When ksmd is enabled, its operation is tuned according to total memory size and memory pressure. The sleep time between scans is adjusted according to host memory size: for a 16GB host, 10 seconds is the default; larger machines sleep less, and smaller ones sleep more. The number of pages to scan per interval is increased or decreased depending on the amount of free memory: if free memory is less than the free memory threshold, the scanning rate is increased by 300 pages; otherwise it is decreased by 50. The number of pages to scan is kept within the range of 64 to 1,250.
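
Listing 4 restates these KSM tuning rules as a Python sketch over the sysfs knobs from Listing 1. The threshold, step sizes, and bounds are the numbers quoted above; the inverse scaling of the sleep interval with host memory size is an assumption modeled on the description rather than a verbatim copy of ksmtuned.

Listing 4. A simplified KSM tuning rule

KSM_DIR = "/sys/kernel/mm/ksm"
THRESHOLD = 0.20                   # free memory threshold: 20% of host memory
SCAN_BOOST, SCAN_DECAY = 300, 50   # pages added under pressure / removed otherwise
SCAN_MIN, SCAN_MAX = 64, 1250      # bounds on pages_to_scan
BASELINE_KIB = 16 * 1024 * 1024    # 16GB reference host
BASELINE_SLEEP_MS = 10000          # 10 seconds on the 16GB reference host

def write_knob(name, value):
    with open("%s/%s" % (KSM_DIR, name), "w") as f:
        f.write(str(value))

def read_knob(name):
    with open("%s/%s" % (KSM_DIR, name)) as f:
        return int(f.read())

def tune_ksm(meminfo, committed_kib):
    """meminfo: /proc/meminfo values in KiB; committed_kib: memory assigned to all VMs."""
    total, free = meminfo["MemTotal"], meminfo["MemFree"]
    threshold = THRESHOLD * total

    # Switch ksmd off only when memory is plentiful and the guests fit comfortably.
    if free > threshold and (total - threshold) > committed_kib:
        write_knob("run", 0)
        return
    write_knob("run", 1)

    # Sleep less on larger hosts, more on smaller ones (assumed inverse scaling).
    write_knob("sleep_millisecs", int(BASELINE_SLEEP_MS * BASELINE_KIB / total))

    # Scan more aggressively under pressure, back off otherwise, within bounds.
    pages = read_knob("pages_to_scan")
    pages += SCAN_BOOST if free < threshold else -SCAN_DECAY
    write_knob("pages_to_scan", max(SCAN_MIN, min(SCAN_MAX, pages)))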


The results

Our experiments were conducted on an IBM BladeCenter® HS22. The system has 16 logical CPUs with EPT support and 48GB of RAM. For increased storage capacity and bandwidth, virtual machine images were hosted on an external NFS appliance connected via a private 10 gigabit Ethernet LAN. We evaluated the effectiveness of the MOM policy using the Memknobs and Cloudy workload scenarios. To measure the performance impact within each scenario, one-hour tests were completed both with and without the MOM policy active.

Memknobs workload scenario

For this experiment we provisioned 32 virtual machines with 2GB of RAM and 1 VCPU each. We calibrated the Memknobs memory load, shown in Figure 3, to place the host under enough memory pressure to cause regular swapping activity. Through experimentation, we were able to produce a pattern that gave the desired host behavior.

Figure 3. Memknobs memory access pattern

The Memknobs program tracks the number of pages it was able to touch per iteration. This throughput value is reported in units of megabytes of memory touched per second. To facilitate comparisons between different experiments, we derive a total average throughput score by computing the average throughput for each guest and summing those 32 values. Table 1 compares the scores achieved when different Memory Overcommitment Manager policies were used.

Table 1. MOM policy results: Memknobs total average throughput
MOM policy            Total average throughput (MB/s)
None                  4331.1
KSM only              4322.6
KSM and Ballooning    5399.5

These results show that the use of memory ballooning contributed to nearly a 25 percent increase in throughput. Even though KSM was highly effective at sharing pages for this workload, it was not responsible for the increase in performance. Figure 4 compares swap activity between a Memknobs run with memory ballooning and a run without.

Figure 4. Memknobs swap comparison

With the MOM policy active, swap activity was effectively concentrated in the guests, and the total number of swap operations was cut in half.

Cloudy workload scenario

To properly size Cloudy for our environment, we configured 32 virtual machine MediaWiki instances. This workload uses less memory than Memknobs. To ensure a memory-bound guest, only 1GB of RAM was assigned to each VM. Compensating for the reduced footprint of the virtual machines, we reserved 15,714 huge pages—effectively reducing the host's available memory to 16GB.

The JMeter test plan was configured to deliver requests at an average rate of 16 requests per second per virtual machine. This produced a moderate load with no resource bottlenecks when the system was not overcommitted. JMeter records statistics about each request it makes. We calculate a quality of service (QOS) metric as the 95th percentile request duration in milliseconds. Average instance throughput is the total size of all completed requests in kilobytes divided by the number of participating guests.
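
Listing 5 shows one way to derive these two metrics from a JMeter results file, assuming the standard CSV output with elapsed (milliseconds) and bytes columns; the file name and the simple percentile calculation are illustrative.

Listing 5. Computing the Cloudy QOS and throughput metrics

import csv

def cloudy_metrics(results_csv, num_guests):
    """Return (qos, throughput): 95th-percentile latency in ms and KB per guest."""
    elapsed, total_bytes = [], 0
    with open(results_csv, newline="") as f:
        for row in csv.DictReader(f):
            elapsed.append(int(row["elapsed"]))     # request duration in ms
            total_bytes += int(row["bytes"])        # response size in bytes
    elapsed.sort()
    qos = elapsed[max(0, int(0.95 * len(elapsed)) - 1)]   # simple 95th percentile
    throughput = (total_bytes / 1024) / num_guests        # KB per participating guest
    return qos, throughput

# Placeholder file name; 32 guests participated in the overcommitted runs.
qos, throughput = cloudy_metrics("results.jtl", 32)
print("QOS: %d ms  Throughput: %.0f KB/guest" % (qos, throughput))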

Table 2 shows the QOS and throughput achieved with and without our KSM and memory ballooning MOM policy enabled.

Table 2. Cloudy QOS and throughput
VMs    MOM policy    QoS (ms, 95th percentile)    Throughput (KB per instance)
1      No            1669                         710007
32     No            3240                         774555
32     Yes           3231                         764762

Results from a run involving a single VM are shown to illustrate performance on an unconstrained host. The key observation from this data is that the same MOM policy that dramatically improved the performance of Memknobs had no effect on Cloudy. Memory usage in this workload is primarily attributable to file I/O, not anonymous memory, and very little swap activity is observed. Instead, memory overcommitment puts pressure on the host and guest page caches, causing increased I/O as the system tries to keep the working set loaded in a smaller amount of memory (see Figure 5).

Figure 5. The impact of memory ballooning on I/O for a single guest

Although effective at limiting memory, using a huge page reservation to reduce host available memory may interfere with the host's memory management algorithms in subtle ways. In the future, this experiment should be repeated on a machine with 16GB of physical memory to evaluate the system in the most realistic scenario possible.

Analyzing the results

The contrasting results found when studying these two different workloads lead to one obvious conclusion: When overcommitting memory, all aspects of the system and its workload must be considered. There is no "one size fits all" management policy.


Planned improvements

While still in its early stages, this work promises to advance the state of resource overcommitment with KVM, and many improvements are planned and in progress.

Today, no standardized method of communication between the host and guests is available. The current approaches (like host-to-guest network communication) depend on manual configuration and setup on the host and in each guest, and the details vary with operating system types and versions, data center network configuration, and virtual machine device models. Our goal is to simplify this problem by integrating a communication mechanism into qemu that can support multiple data transports, including virtio-serial and emulated serial. The guest side of the channel would be supported by an open source, multiplatform qemu-guest-tools package. Such a mechanism would improve MOM's ability to gather guest statistics and would be widely useful for features such as copy/paste and administration tasks.

As demonstrated, an overcommitment policy can help in some situations and cause harm in others. In order to be safely deployable, a policy must not hinder performance. MOM policies should be improved by adding safeguards that would roll back management operations when they do not produce expected results.

Today, KVM has a very effective cooperative CPU-sharing mechanism. When a guest CPU executes the hlt instruction, it voluntarily yields CPU time to other guests. The performance impacts of memory overcommitment could be reduced if a similar guest-driven protocol existed for yielding memory resources.

When the community develops new features that can improve KVM overcommitment, support for those features will be integrated into the MOM infrastructure. For example, we plan to enable support for cgroups-based RSS limits to enforce ballooning directives and protect the system from non-cooperative or malicious guests.
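
As a sketch of what such enforcement could look like with the cgroup v1 memory controller, Listing 6 caps a guest's memory consumption by writing to memory.limit_in_bytes. The per-guest cgroup path is illustrative and depends on how libvirt lays out its cgroup hierarchy on a given host; this is not MOM's implementation.

Listing 6. Capping guest memory with the cgroup memory controller

def enforce_memory_limit(cgroup_path, limit_bytes):
    """Cap the memory charged to a guest's cgroup (cgroup v1 memory controller)."""
    with open("%s/memory.limit_in_bytes" % cgroup_path, "w") as f:
        f.write(str(limit_bytes))

# Illustrative path and limit; the actual per-guest cgroup layout is
# distribution- and libvirt-version-specific.
enforce_memory_limit("/sys/fs/cgroup/memory/libvirt/qemu/guest01", 1024 ** 3)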


Conclusion

Resource overcommitment is crucial to maximizing the benefits of virtualization and the subject of much research and development. As Linux virtualization evolves, so too will overcommitment best practices. A holistic, policy-driven approach to managing overcommitment is the best way to maximize efficiency and drive incremental improvements.

Acknowledgments

From a long list of deserving colleagues, I would like to specifically thank Anthony Liguori and Karl Rister. Their advice and technical expertise were instrumental in the conception and development of the Memory Overcommitment Manager and this research.
