Virtualization promises to increase efficiency by enabling workload consolidation. Maximizing virtual machine density while maintaining good performance can be a real challenge:
- A workload's utilization of resources such as CPU, memory, and bandwidth for network and storage access varies over time; if you can dynamically allocate these resources according to demand, then you can achieve greater density with overcommitment.
- Optimal resource management also depends on factors such as the system configuration and hardware setup.
One way to develop a management strategy that incorporates both factors is to write a set of rules in a policy language. You can tailor your policy to your unique conditions and effectively balance your consolidation and performance goals.
This article explores the challenges associated with aggressive overcommitment and proposes a policy-driven management tool that can help you address the challenges. The tool is deployed in a typical KVM environment and manages two distinct workloads. This article evaluates the results of this exercise and concludes with suggestions for additional work items and potential research topics.
A thorough investigation of memory overcommitment in KVM should begin with the Linux memory manager itself. Because it is built on the concepts of virtual memory, Linux employs memory overcommitment techniques by design.
Memory pages requested by a process are not allocated until they are actually used. Using the Linux page cache, multiple processes can save memory by accessing files through shared pages; as memory is exhausted, the system can free up memory by swapping less frequently used pages to disk. While not perfect, these techniques can result in a substantial difference between the amount of memory that is allocated and the amount actually used.
Because KVM virtual machines are regular processes, the standard memory conservation techniques apply. But unlike regular processes, KVM guests contain a nested operating system, which impacts memory overcommitment in two key ways. First, KVM guests can have greater memory overcommitment potential than regular processes. This is due to a large difference between minimum and maximum guest memory requirements caused by swings in utilization.
Capitalizing on this variability is central to the appeal of virtualization, but it is not always easy. Second, while the host is managing the memory allocated to a KVM guest, the guest kernel is simultaneously managing the same memory. Lacking any form of collaboration between the host and guest, neither the host nor the guest memory manager is able to make optimal decisions regarding caching and swapping, which can lead to less efficient use of memory and degraded performance.
Linux provides additional mechanisms to address memory overcommitment specific to virtualization.
- Memory ballooning is a technique in which the host instructs a cooperative guest to release some of its assigned memory so that it can be used for another purpose. This technique can help refocus memory pressure from the host onto a guest.
- Kernel Same-page Merging (KSM) uses a kernel thread that scans previously identified memory ranges for identical pages, merges them together, and frees the duplicates. Systems that run a large number of homogeneous virtual machines benefit most from this form of memory sharing.
- Other resource management features, such as cgroups, can also be applied to memory overcommitment by dynamically shifting resources among virtual machines.
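To make the page-sharing idea behind KSM concrete, here is a toy Python model. It is not how the kernel thread is implemented: it simply treats each guest as a list of page contents and counts how many pages could be freed if identical pages were merged. The function name and the page representation are assumptions made for illustration.

```python
from collections import Counter

def mergeable_pages(guests):
    """Return how many pages a KSM-like scan could free.

    `guests` is a list of guests, each a list of page contents (bytes).
    Every group of identical pages needs only one backing copy, so the
    other copies in the group can be freed.
    """
    counts = Counter(page for guest in guests for page in guest)
    return sum(n - 1 for n in counts.values())

# Two homogeneous guests share their "kernel" and "libc" pages.
guest_a = [b"kernel", b"libc", b"app-a"]
guest_b = [b"kernel", b"libc", b"app-b"]
freed = mergeable_pages([guest_a, guest_b])
print(freed)
```

The more alike the guests are, the larger the count, which is why homogeneous virtual machines benefit most.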
These mechanisms provide an effective framework for managing overcommitted memory, but they all share an important limitation: They must be configured and tuned by an external entity. Automatic configuration is not available because the individual components of the virtualization stack do not have enough information to make tuning decisions.
For example, it is beyond the scope of a single qemu process to know whether the host is running out of memory and how much memory can be ballooned from its guest without adversely impacting performance. Those decisions can be made only when administrative policy is combined with real-time information about the host and its guests.
To solve this challenge with KSM, a daemon called ksmtuned has been written to dynamically manage the tunable parameters according to sharing potential and the need for memory. Because ksmtuned manages a single mechanism in isolation, it cannot be part of a comprehensive solution.
The use of any overcommitment mechanism fundamentally affects the operation of the host and guest memory management systems. As such, the simultaneous use of multiple mechanisms is likely to cause secondary effects. Figure 1 illustrates an interaction between memory ballooning and KSM.
Figure 1. The effects of memory ballooning on KSM
In this example, increased ballooning pressure undermines KSM's ability to share pages. Guest balloon drivers select pages to balloon without considering whether the host page might be shared. Ballooning a shared page is a mistake because it deprives the guest of resources without actually saving any memory. To maximize efficiency, these types of interactions must be anticipated and managed.
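The interaction can be illustrated with a small Python sketch under a strong simplifying assumption: identical guest pages are always fully merged on the host. The helper names are hypothetical. Ballooning a page that another guest still shares leaves host usage unchanged, while ballooning a private page frees a host page.

```python
def host_pages_in_use(guests):
    """Count host pages, assuming identical guest pages are fully merged."""
    return len({page for guest in guests for page in guest})

def balloon(guest, page):
    """Return a copy of `guest` with one page surrendered to the balloon."""
    remaining = list(guest)
    remaining.remove(page)
    return remaining

guest_a = [b"shared", b"private-a"]
guest_b = [b"shared", b"private-b"]
before = host_pages_in_use([guest_a, guest_b])

# Ballooning the shared page: guest_b still needs it, so nothing is saved.
saved_shared = before - host_pages_in_use([balloon(guest_a, b"shared"), guest_b])
# Ballooning a private page: the host really gets one page back.
saved_private = before - host_pages_in_use([balloon(guest_a, b"private-a"), guest_b])
print(saved_shared, saved_private)
```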
You can deploy KVM-based virtualization in a myriad of configurations using architectures designed to meet different goals. Aggressive virtual machine consolidation requires a performance trade-off. The proper balance between the two depends on the situation.
Memory overcommitment can increase the demand for all other system resources. Pressure on memory caches will cause increased disk or network I/O, and increased page reclaim activity will cause a guest to consume more CPU cycles.
You can address these challenges with a new daemon designed for the task: Memory Overcommitment Manager (MOM).
Dynamic management with MOM
The MOM (Memory Overcommitment Manager) daemon is shown in Figure 2.
Figure 2. Memory Overcommitment Manager
MOM provides a flexible framework for host and guest statistics collection, a policy engine, and control points for ballooning and KSM. With MOM, an administrator can construct an overcommitment policy that responds to real-time conditions with dynamic adjustments to system tunables to achieve a greater level of memory overcommitment.
MOM uses libvirt to maintain a list of virtual machines running on the host. At a regular collection interval, data is gathered about the host and guests. Data can come from multiple sources (the host /proc interface, the libvirt API, virtio-serial, or a network connection to the guest). Once collected, the data is organized for use by the policy engine. (Virtio is a standard for network and disk device drivers in which just the guest's device driver "knows" it is running in a virtual environment and therefore cooperates with the hypervisor, enabling guests to get high performance network and disk operations. In sum, Virtio delivers most of the performance benefits of paravirtualization.)
The policy engine is essentially a small interpreter that understands programs written in a simple policy language. Periodically, MOM will evaluate the user-supplied policy using the collected data. The policy may trigger configuration changes such as inflation of a guest's memory balloon or a change in the KSM scanning rate.
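The collect-then-evaluate cycle can be sketched in Python as follows. This is not MOM's policy language or its API: the rule function, the statistic names (`mem_total`, `mem_free`, `balloon_target`), and the action tuples are all hypothetical. The sketch only shows the general shape of a policy pass that turns collected data into configuration changes.

```python
def evaluate_policy(rules, host_stats, guest_stats):
    """Run one policy pass and return the list of requested actions."""
    actions = []
    for rule in rules:
        actions.extend(rule(host_stats, guest_stats))
    return actions

def low_memory_rule(host, guests):
    """Hypothetical rule: when host free memory is below 20 percent of
    total, lower every guest's memory target by 5 percent."""
    if host["mem_free"] * 100 >= 20 * host["mem_total"]:
        return []
    return [
        ("set_balloon_target", name, stats["balloon_target"] * 95 // 100)
        for name, stats in sorted(guests.items())
    ]

host = {"mem_total": 1000, "mem_free": 100}
guests = {"vm1": {"balloon_target": 200}, "vm2": {"balloon_target": 400}}
actions = evaluate_policy([low_memory_rule], host, guests)
print(actions)
```

Keeping rules separate from the code that collects data and applies actions is what lets an administrator swap policies without touching the rest of the daemon.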
Over time, the MOM framework can be expanded to collect from additional data sources and to control new memory overcommitment mechanisms as development progresses in this area.
To evaluate MOM and its effectiveness at managing overcommitted memory, let's look at two virtualization workloads:
- In the first workload, the virtual machines are configured to consume varying amounts of anonymous memory according to a prescribed access pattern.
- The second workload uses a LAMP benchmark where each virtual machine functions as an independent MediaWiki instance.
For each workload, a MOM policy that controls memory ballooning and KSM is evaluated. To avoid host swapping, each guest's memory balloon is dynamically adjusted in accordance with host and guest memory pressure. Using the same algorithm as ksmtuned, KSM is adjusted based on memory pressure and the amount of shareable memory.
The two workload scenarios
The Memknobs workload scenario uses a simple program called Memknobs to create memory pressure by allocating and touching anonymous memory pages in a pattern that challenges the kernel's memory reclaim algorithms. Memknobs allocates a fixed-sized buffer and loops through that buffer's pages, writing to each one. By invoking Memknobs repeatedly with a gradually changing buffer size for each iteration, a guest can simulate a memory-bound workload with no I/O component. To overcommit memory on a host, we deployed Memknobs onto 32 virtual machines where each instance used memory in a unique pattern.
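The core loop described above can be approximated in a few lines of Python. This is a sketch of the idea, not the Memknobs source: the 4096-byte page size, the function names, and the way throughput is timed are assumptions.

```python
import time

PAGE_SIZE = 4096  # assumed page size in bytes

def touch_pages(buffer_size):
    """Allocate `buffer_size` bytes and write one byte to every page.

    Returns the number of pages written to.
    """
    buf = bytearray(buffer_size)
    touched = 0
    for offset in range(0, buffer_size, PAGE_SIZE):
        buf[offset] = 1  # one write per page, as the text describes
        touched += 1
    return touched

def run(buffer_sizes):
    """Touch a buffer of each size in turn; return MB touched per second."""
    start = time.perf_counter()
    pages = sum(touch_pages(size) for size in buffer_sizes)
    elapsed = max(time.perf_counter() - start, 1e-9)  # avoid dividing by zero
    return pages * PAGE_SIZE / (1024 * 1024) / elapsed

# A gradually changing buffer size, one entry per iteration.
print(run([PAGE_SIZE * n for n in (256, 512, 1024, 512, 256)]))
```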
For the Cloudy workload scenario, we generated a realistic workload with a disk I/O component using an open suite called Cloudy. Cloudy is an easy-to-set-up LAMP benchmark that measures virtualization scalability and shows the effects of resource overcommitment. Multiple virtual machines are configured as MediaWiki servers. Each wiki is loaded with pages from Wikipedia and randomly generated image data.
An included JMeter test plan exercises all of the instances and measures throughput and response times. The test plan can be configured to produce a steady-state workload or to introduce variability by alternating the load between multiple virtual machine groups. The amount and type of load can be varied by changing the request rate, the number of concurrent users, and the size of the randomly generated wiki image files. A PHP accelerator reduces CPU consumption to a negligible amount. This workload can generate a significant amount of I/O and is sensitive to the amount of bandwidth available when accessing the virtual machine image storage device.
Our memory ballooning strategy is designed to prevent host swapping, instead directing memory pressure onto the guests. The host, lacking guest page access information, is unable to properly apply its LRU algorithms when selecting pages to swap. Further, the guest operating system has the most information about the workload's memory use and should be able to make the best page replacement decisions.
The policy relies on memory ballooning to maintain a pool of readily freeable host memory. We define freeable memory as the sum of Cache as reported in /proc/meminfo. When this pool shrinks below 20 percent of total memory, memory ballooning is initiated. Pressure is applied to guests according to the level of host memory pressure. Free memory is reclaimed first so guests with more unused memory are ballooned the most. With enough balloon pressure, guests will evict cached pages and start swapping. Once memory pressure subsides, the balloons are deflated to return memory to the guests.
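A minimal Python sketch of the 20 percent trigger follows. The set of /proc/meminfo fields that make up the freeable pool is passed in as a parameter; the choice of `MemFree`, `Buffers`, and `Cached` in the example is an assumption, as are the function names and the sample numbers.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> kB."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if rest.strip():
            values[name.strip()] = int(rest.split()[0])
    return values

def ballooning_needed(meminfo, freeable_fields, threshold_percent=20):
    """True when the freeable pool is below the threshold share of MemTotal."""
    freeable = sum(meminfo[field] for field in freeable_fields)
    return freeable * 100 < threshold_percent * meminfo["MemTotal"]

# Sample data; on a real host the text would come from /proc/meminfo.
sample = """MemTotal:       1000000 kB
MemFree:          50000 kB
Buffers:          20000 kB
Cached:          100000 kB
"""
ASSUMED_FIELDS = ("MemFree", "Buffers", "Cached")  # assumption, see lead-in
info = parse_meminfo(sample)
print(ballooning_needed(info, ASSUMED_FIELDS))
```

Here the pool is 170,000 kB, or 17 percent of total memory, so the check reports that ballooning should begin.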
Our KSM tuning algorithm is designed to match that of ksmtuned. First, a decision whether to enable the kernel thread is made according to a free memory threshold that we set to 20 percent of total host memory. KSM is allowed to run unless the host meets both of the following two conditions:
- Free memory exceeds the threshold, and
- Total memory less the free memory threshold exceeds the total amount of memory assigned to all virtual machines.
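The two conditions translate directly into a small predicate. The sketch below is an illustration of the rule as stated, not code taken from MOM or ksmtuned; the function and parameter names are hypothetical, and all quantities are assumed to be in the same unit (megabytes in the example).

```python
def ksm_should_run(total, free, committed, threshold_percent=20):
    """Decide whether the KSM kernel thread should be enabled.

    total     -- total host memory
    free      -- free host memory
    committed -- total memory assigned to all virtual machines
    """
    threshold = total * threshold_percent // 100
    has_free_memory = free > threshold             # host not under pressure
    guests_fit = (total - threshold) > committed   # guests not responsible
    # KSM runs unless BOTH conditions hold.
    return not (has_free_memory and guests_fit)

total_mb = 48 * 1024
print(ksm_should_run(total_mb, free=20000, committed=32 * 2048))  # overcommitted
print(ksm_should_run(total_mb, free=20000, committed=8 * 2048))   # plenty of room
```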
These conditions test, respectively, whether the host is under memory pressure and whether virtualization is responsible. Switching ksmd off when it is not needed will save CPU cycles. When ksmd is enabled, its operation is tuned according to total memory size and memory pressure. The sleep time between scans is adjusted according to host memory size. For a 16GB host, 10 seconds is the default. Larger machines will sleep less, and smaller ones will sleep more. The number of pages to scan per interval is either increased or decreased depending on the amount of free memory. If free memory is less than the free memory threshold, the scanning rate is increased by 300 pages. Otherwise it is decreased by 50. The number of pages to scan is kept within a range of 64 to 1,250.
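The tuning step can be sketched in Python. The increments (300 and 50) and the limits (64 and 1,250) come from the text; the assumption that sleep time scales inversely with host memory relative to the 16GB baseline is mine, chosen only because it matches "larger machines sleep less." Function names are hypothetical.

```python
def sleep_time(host_memory_gb, baseline=10, baseline_memory_gb=16):
    """Sleep between scans, in the same unit as `baseline`.

    Assumes simple inverse scaling with host memory size.
    """
    return baseline * baseline_memory_gb / host_memory_gb

def next_pages_to_scan(current, free, threshold):
    """Raise the scan rate under memory pressure, otherwise lower it,
    and keep the result within the allowed range."""
    step = 300 if free < threshold else -50
    return max(64, min(1250, current + step))

print(sleep_time(32))                                     # larger host, shorter sleep
print(next_pages_to_scan(100, free=1000, threshold=2000))   # under pressure
print(next_pages_to_scan(100, free=3000, threshold=2000))   # no pressure
```

Raising the rate quickly and lowering it slowly lets the scanner react fast to pressure without oscillating once pressure eases.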
Our experiments were conducted on an IBM BladeCenter® HS22. The system has 16 logical CPUs with EPT support and 48GB of RAM. For increased storage capacity and bandwidth, virtual machine images were hosted on an external NFS appliance connected via a private 10 gigabit Ethernet LAN. We evaluated the effectiveness of the MOM policy using the Memknobs and Cloudy workload scenarios. To measure the performance impact within each scenario, one-hour tests were completed both with and without the MOM policy active.
Memknobs workload scenario
For this experiment we provisioned 32 virtual machines with 2GB of RAM and 1 VCPU each. We calibrated the Memknobs memory load, shown in Figure 3, to place the host under enough memory pressure to cause regular swapping activity. Through experimentation, we were able to produce a pattern that gave the desired host behavior.
Figure 3. Memknobs memory access pattern
The Memknobs program tracks the number of pages it was able to touch per iteration. This throughput value is reported in units of megabytes of memory touched per second. To facilitate comparisons between different experiments, we derive a total average throughput score by computing the average throughput for each guest and summing those 32 values. Table 1 compares the scores achieved when different Memory Overcommitment Manager policies were used.
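The score calculation is simple enough to state in code. In this sketch the function name and the sample numbers are invented for illustration; they are not measurements from the experiment.

```python
def total_average_throughput(samples_by_guest):
    """Average each guest's throughput samples, then sum the averages.

    `samples_by_guest` maps a guest name to its list of per-iteration
    throughput values (MB touched per second).
    """
    return sum(
        sum(samples) / len(samples)
        for samples in samples_by_guest.values()
    )

score = total_average_throughput({
    "vm1": [100, 200],      # average 150
    "vm2": [50, 150, 100],  # average 100
})
print(score)
```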
Table 1. MOM policy results: Memknobs total average throughput
|KSM and Ballooning||5399.5|
These results show that the use of memory ballooning contributed to nearly a 20 percent increase in throughput. Even though KSM was highly effective at sharing pages for this workload, it was not responsible for the increase in performance. Figure 4 compares swap activity between a Memknobs run with memory ballooning and a run without.
Figure 4. Memknobs swap comparison
With the MOM policy active, swap activity was effectively concentrated in the guests, and the total number of swap operations was cut in half.
Cloudy workload scenario
To properly size Cloudy for our environment, we configured 32 virtual machine MediaWiki instances. This workload uses less memory than Memknobs. To ensure a memory-bound guest, only 1GB of RAM was assigned to each VM. Compensating for the reduced footprint of the virtual machines, we reserved 15,714 huge pages—effectively reducing the host's available memory to 16GB.
The JMeter test plan was configured to deliver requests at an average rate of 16 requests per second per virtual machine. This produced a moderate load with no resource bottlenecks when the system was not overcommitted. JMeter records statistics about each request it makes. We calculate a quality of service (QOS) metric as the 95th percentile request duration in milliseconds. Average instance throughput is the total size of all completed requests in kilobytes divided by the number of participating guests.
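Both metrics can be computed from the recorded request data as sketched below. Percentiles have several common definitions; the nearest-rank method used here is an assumption, since the text does not say which one was applied. The function names and sample values are also hypothetical.

```python
def percentile_nearest_rank(values, pct):
    """Return the pct-th percentile (pct is an integer from 1 to 100)
    using the nearest-rank method."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # ceiling without floats
    return ordered[max(rank, 1) - 1]

def qos(durations_ms):
    """QOS metric: 95th percentile request duration in milliseconds."""
    return percentile_nearest_rank(durations_ms, 95)

def average_instance_throughput(request_sizes_kb, guest_count):
    """Total size of completed requests (KB) divided by guest count."""
    return sum(request_sizes_kb) / guest_count

durations = list(range(1, 21))  # 1 ms .. 20 ms
print(qos(durations))
print(average_instance_throughput([10, 20, 30], guest_count=2))
```

A percentile is used rather than a mean so that a small number of very slow requests shows up in the score instead of being averaged away.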
Table 2 shows the QOS and throughput achieved with and without our KSM and memory ballooning MOM policy enabled.
Table 2. Cloudy QOS and throughput
Results from a run involving a single VM are shown to illustrate performance on an unconstrained host. The key observation from this data is that the same MOM policy that dramatically improved the performance of Memknobs had no effect on Cloudy. Memory usage in this workload is primarily attributable to file I/O, not anonymous memory, and very little swap activity is observed. Instead, memory overcommitment puts pressure on the host and guest page caches, causing increased I/O as the system tries to keep the working set loaded in a smaller amount of memory (see Figure 5).
Figure 5. The impact of memory ballooning on I/O for a single guest
Although effective at limiting memory, using a huge page reservation to reduce host available memory may interfere with the host's memory management algorithms in subtle ways. In the future, this experiment should be repeated on a machine with 16GB of physical memory to evaluate the system in the most realistic scenario possible.
Analyzing the results
The contrasting results found when studying these two different workloads lead to one obvious conclusion: When overcommitting memory, all aspects of the system and its workload must be considered. There is no "one size fits all" management policy.
While still in its early stages, this work promises to advance the state of resource overcommitment with KVM, and many improvements are planned and in progress.
Today, no standardized method of communication between the host and guests is available. The current approaches (like host-to-guest network communication) depend on manual configuration and setup in the host and each guest, the manner of which is dictated by operating system types and versions, data center networking configuration, and virtual machine device models. Our goal is to simplify this problem by integrating a communication mechanism into qemu that can support multiple data transports including virtio-serial and emulated serial. The guest side of the channel would be supported by an open-source, multiplatform qemu-guest-tools package. Such a mechanism would improve the ability for MOM to gather guest statistics and would be widely useful for features such as copy/paste and administration tasks.
As demonstrated, an overcommitment policy can help in some situations and cause harm in others. In order to be safely deployable, a policy must not hinder performance. MOM policies should be improved by adding safeguards that would roll back management operations when they do not produce expected results.
Today, KVM has a very effective cooperative CPU-sharing mechanism. When a guest CPU executes the hlt instruction, it voluntarily yields CPU time to other guests. The performance impacts of memory overcommitment could be reduced if a similar guest-driven protocol existed for yielding memory resources.
When the community develops new features that can improve KVM overcommitment, support for those features will be integrated into the MOM infrastructure. For example, we plan to enable support for cgroups-based RSS limits to enforce ballooning directives and protect the system from non-cooperative or malicious guests.
Resource overcommitment is crucial to maximizing the benefits of virtualization and the subject of much research and development. As Linux virtualization evolves, so too will overcommitment best practices. A holistic, policy-driven approach to managing overcommitment is the best way to maximize efficiency and drive incremental improvements.
From a long list of deserving colleagues, I would like to specifically thank Anthony Liguori and Karl Rister. Their advice and technical expertise were instrumental in the conception and development of the Memory Overcommitment Manager and this research.
- For more background, read the paper "Increasing memory density by using KSM" (PDF) by Andrea Arcangeli, Izik Eidus, and Chris Wright.
- The Fedora/KSM page provides more information on KSM and ksmtuned.
- Find details on Cgroups in the Linux Kernel Documentation.
Get products and technologies
- Get the Cloudy scenario benchmark.
- Get JMeter from the Apache Jakarta Project.
- Get the Memknobs micro-benchmark.
- Get Memory Overcommitment Manager (MOM) source code and documentation.
- Read this discussion on limitations of the EPT processor feature.