Anatomy of Linux Kernel Shared Memory

Memory de-duplication in the Linux kernel

Linux® as a hypervisor includes a number of innovations, and one of the more interesting changes in the 2.6.32 kernel is Kernel Shared Memory (KSM). KSM allows the hypervisor to increase the number of concurrent virtual machines by consolidating identical memory pages. Explore the ideas behind KSM (such as storage de-duplication), its implementation, and how you manage it.

Share:

M. Tim Jones, Independent author

M. Tim JonesM. Tim Jones is an embedded firmware architect and the author of Artificial Intelligence: A Systems Approach, GNU/Linux Application Programming (now in its second edition), AI Application Programming (in its second edition), and BSD Sockets Programming from a Multilanguage Perspective. His engineering background ranges from the development of kernels for geosynchronous spacecraft to embedded systems architecture and networking protocols development. Tim is a Consultant Engineer for Emulex Corp. in Longmont, Colorado.


developerWorks Contributing author
        level

07 April 2010

Also available in Japanese Portuguese

Connect with Tim

Tim is one of our most popular and prolific authors. Browse all of Tim's articles on developerWorks. Check out Tim's profile and connect with him, other authors, and fellow readers in My developerWorks.

Software engineering tends to be an evolutionary process. Problems are addressed with solutions that can create new problems and subsequently new solutions. Ideally, the new problems that are created justify the original solution. The technology discussed here is one of the secondary solutions to a problem created by server virtualization. Before delving into KSM, however, let's take a quick look at the original solution and how KSM is applied here and elsewhere.

Server virtualization

Extreme server consolidation

Although enterprise servers can consolidate 10 or more servers using virtualization, a single IBM System z® server can support for thousands of Linux® guests on a single logical partition.

Virtualization as a technology has been around since the 1960s, popularized by the IBM® System/360® mainframe. Five decades later, virtualization has exploded, making it possible to share a single server with multiple operating systems and applications. This particular use, called server virtualization, is transforming data centers, as a single physical machine can be used to host typically 10 or more virtual machines (VMs), as shown in Figure 1. This virtualization enables the infrastructure to be much more dynamic, power efficient, and (therefore) cost-efficient. See Resources for more information on server virtualization and its benefits.

Figure 1. Server consolidation through virtualization
Diagram of server consolidation through virtualization, showing physical machines consolidated onto a single physical server

With only 10-15% of a typical server in use at any given time, consolidating virtual servers onto a single physical server makes perfect sense. But what about resources? The major resources available include CPU, memory, and networking bandwidth. CPU and networking are typically underutilized, so the real issue becomes available memory. Each operating system makes its own unique demands on available memory resources. But the question that arises is, how unique are those demands?


Memory sharing

It turns out that if you virtualize many of the same operating system and application sets, many memory pages are identical. This makes sense, given that operating system and application code and constant data are identical among VMs. And when pages are unique, they can be consolidated, which frees memory for use in other applications. Figure 2 illustrates memory sharing and shows the benefit of more free memory available when pages are shared among VMs whose contents are the same.

Figure 2. Memory sharing across VMs
Diagram of memory sharing across VMs, showing more free memory when pages are shared among VMs whose contents are the same

Feature naming

The feature described here is very new; therefore, the name has undergone some changes. You'll find this Linux kernel feature called Kernel Shared Memory as well as Kernel Samepage Merging.

What you'll soon discover is that although memory sharing in Linux is advantageous in virtualized environments (KSM was originally designed for use with the Kernel-based Virtual Machine [KVM]), it's also useful in non-virtualized environments. In fact, KSM was found to be beneficial even in embedded Linux systems, indicating the flexibility of the approach. Now, let's explore the Linux approach to memory sharing and how you can use it to increase the memory density of a server and therefore increase its ability to host additional applications or VMs.


Parallels to other technologies

A recent advancement in storage technologies called de-duplication is a precursor to memory sharing in Linux and other hypervisors. De-duplication is a technology in which stored data is reduced by removing redundant data (either on a block basis or on larger segments of data such as files). Common segments are merged (in a copy-on-write [CoW] fashion), freeing space for other uses. Storage de-duplication is an important technology moving forward, as it optimizes the capacity of storage by removing duplicate data. In this way, storage is cheaper, because less is ultimately required. Given the rate of data growth, this functionality is very relevant.


KSM operation

KSM exists as a daemon in the kernel (called ksmd) that periodically performs page scans to identify duplicate pages and merges duplicates to free pages for other uses. It does this in a way that's transparent to the user. For example, duplicate pages are merged (and subsequently marked read only), but if one of the users of the page changes it for any reason, that user will receive his or her own copy (in a CoW fashion). You can find the full implementation of the KSM kernel module within the kernel source at ./mm/ksm.c.

KSM relies on a higher-level application to provide instruction on which memory regions are candidates for merging. Although it would be possible for KSM to simply scan all anonymous pages in the system, it would be wasteful of CPU and memory (given the space necessary to manage the page-merging process). Therefore, applications can register the virtual areas that are likely to contain duplicate pages.

The KSM application programming interface (API) is implemented through the madvise system call (see Listing 1) and a new advice parameter called MADV_MERGEABLE (which indicates that the defined region is mergeable). A region can be removed from the mergeable state through the MADV_UNMERGEABLE parameter (which immediately un-merges any merged pages from that region). Note that removing a region of pages through madvise can result in an EAGAIN error, as the operation could run out of memory during the un-merge process, potentially leading to even greater trouble (out-of-memory conditions).

Listing 1. The madvise system call
#include <sys/mman.h>

int madvise( void *start, size_t length, int advice );

Once a region is defined as mergeable, KSM adds this region to its list of working memory. When KSM is enabled, it searches for identical pages, keeping one page in a write-protected CoW fashion and freeing one for other uses.

KSM uses a approach that's different from the approach used in storage de-duplication. In traditional de-duplication, objects are hashed, and the hash value is used as the initial check for similarity. When hashes are identical, the next step is a comparison of the actual objects (in this case, a memory comparison) to formally determine whether the objects are identical. KSM used this approach in its first implementation, but a more straightforward approach was developed to simplify it.

In the current KSM, pages are managed by two red-black trees, one of which is ephemeral. The first tree, call the unstable tree, is used to store new pages that are not yet understood to be stable. In other words, pages that are candidates for merging (unchanged for some period of time) are stored in the unstable tree. Pages in the unstable tree are not write protected. The second tree, called the stable tree, stores those pages that have been found to be stable and merged by KSM. To identify whether a page is volatile or non-volatile, KSM uses a simple 32-bit checksum. When a page is scanned, its checksum is calculated and stored with the page. On a subsequent scan, if a newly computed checksum is different from the previously generated checksum, the page is changing and is therefore not a good candidate for merging.

The first step in handling a single page using KSM's process is to see whether the page can be found in the stable tree. The process of searching the tree is interesting in that each page is treated as a very large number (the contents of the page). A memcmp (memory compare) operation is performed on the page and node's page of consideration. If the memcmp returns 0, the pages are equal and a match is found. Otherwise, memcmp can return -1 (which means that the candidate page is less than the current node's page) or 1 (the candidate page is greater than the current node's page). In each case, the operation leads the search down the red-black tree (to the left if less than and to the right if greater than). Although comparing 4KB pages seems fairly heavyweight, in most cases, the memcmp will end prematurely once a difference is found. Therefore, the process itself is fast and efficient. See Figure 3 for a visualization of this process.

Figure 3. Search process for a page in the tree
Search process for a page in the red-black tree, showing a match for the candidate page

If the candidate page is found in the stable tree, then it is merged and the candidate page freed. This code can be found in ksm.c/stable_tree_search() (which is called by ksm.c/cmp_and_merge_page()). Otherwise, if the candidate page is not found, you move on to the unstable tree (see ksm.c/unstable_tree_search()).

The first step in the unstable tree search is to recalculate the checksum over the page. If it's found to be different than the original checksum, the page is discarded from further search in this scan (because if it changes, it's not worthwhile to track). If the checksum has not changed, the unstable tree is searched for the candidate page. The unstable tree processing is a bit different than the stable tree. First, if the search does not find the page in the unstable tree, you add a new node in the unstable tree for this page. But if the page is found in the unstable tree, you merge the pages, and then migrate the node to the stable tree.

When the scan is complete (performed through ksm.c/ksm_do_scan()), the stable tree is kept, but the unstable tree is removed and rebuilt at the time of the next scan. This process simplifies things quite a bit, as the organization of the unstable tree can change based upon changes to the pages (recall that pages in the unstable tree are not write protected). Because all pages in the stable tree are write protected, a page fault is generated when a page tries to be written, allowing the CoW process to un-merge the page for the writer (see ksm.c/break_cow()). Orphaned pages in the stable tree are subsequently removed (unless two or more users of the page exist, indicating that the page is still shared).

As discussed, KSM uses red-black trees for managing pages for fast lookup. Linux actually includes red-black trees as a reusable data structure, and they can be used generically. Red-black trees are also used by the Completely Fair Scheduler (CFS) for storage of tasks in time order. You can find the implementation of red-black trees in the kernel at ./lib/rbtree.c.


KSM configuring and monitoring

Managing and monitoring KSM occurs through sysfs (at the root /sys/kernel/mm/ksm). In this sysfs subdirectory, you'll find a collection of files, some used for control and others used for monitoring.

The first file, run, is used to enable or disable paging merging with KSM. By default, KSM is disabled (0) but can be enabled by writing a 1 to the file to enable the KSM daemon (for example, echo 1 > sys/kernel/mm/ksm/run). From the run state, you can disable the daemon (but keep the current set of merged pages) by writing a 0. Also from the run (1) state, you can stop KSM and request that all merged pages become un-merged by writing a 2.

While KSM runs, you can control it with three parameters (files within sysfs). The sleep_millisecs file defines how many milliseconds ksmd should sleep before performing another page scan. The max_kernel_pages file defines the maximum number of pages that ksmd can use (the default is 25% of the available memory, but you can write a 0 to specify no limit). Finally, the pages_to_scan file defines how many pages can be scanned in a given scan. Any user can view these files, but the user must have root privileges to change them.

There are also five monitorable files exported through sysfs that indicate ksmd's operation and effectiveness (these are read only). The full_scans file indicates the number of full area scans that have been performed. The remaining four files indicate the page-level statistics of KSM:

  • pages_shared: The number of unswappable kernel pages that KSM is using
  • pages_sharing: An indication of memory savings
  • pages_unshared: The number of unique pages repeatedly checked for merging
  • pages_volatile: The number of pages that are changing too often

The KSM authors define that a high ratio of pages_sharing to pages_shared indicates efficient sharing of pages (whereas the reverse is indicative of wasted effort).


Going further

Linux is not alone in using page sharing to improve memory efficiency, but it is unique in its implementation as an operating system feature. VMware's ESX server hypervisor provides this feature under the name Transparent Page Sharing (TPS), while XEN calls it Memory CoW. But whatever the name or implementation, the feature provides better memory utilization, allowing the operating system (or hypervisor, in the case of KVM) to over-commit memory to support greater numbers of applications or VMs. You can find KSM—and many other interesting features—in the latest 2.6.32 Linux kernel.

Resources

Learn

  • The authors of KSM at Red Hat wrote a great paper on KSM entitled "Increasing memory density by using KSM" (PDF). This paper introduces KSM and delves into the details of its implementation.
  • Virtualization is growing dramatically as a way to consolidate for power and cost savings but also as a key element of cloud computing architectures. Learn about virtualization and its role in green IT in "Growing green with virtualization: Virtualization as a backbone of green IT" (developerWorks, August 2009).
  • This article demonstrates an advantage of using Linux as a hypervisor. You can learn more about this and other topics in "Anatomy of a Linux Hypervisor" (developerWorks, May 2009), "Discover the Linux Kernel Virtual Machine" (developerWorks, April 2007), and "Virtual Linux" (developerWorks, December 2006). Finally, read about another kernel subsystem that uses red-black trees in "Inside the 2.6 Completely Fair Scheduler" (developerWorks, December 2009).
  • You can read more about memory de-duplication from the 2.6.32 kernel release notes. These notes provide useful materials on KSM (including a link to the kernel documentation) in addition to the various other changes that made their way into 2.6.32. In the KSM section, you'll find an interesting reference to running 52 Windows® XP VMs on a single server with 16GB of RAM (where each was allocated its own 1GB RAM segment but shared with KSM).
  • Learn more about de-duplication of storage systems and the approaches in the IT Analysis article, "De-dupe for big storage savings?".
  • Wikipedia provides a great introduction to de-duplication, including the various implementation strategies (such as source and target).
  • Although KSM is an automatic mechanism for reducing memory footprint in virtualized servers, application developers have had a similar manual capability for some time. Linux dynamic (or shared) libraries permit a binary to use a common library object instead of one statically compiled with the application. Read more about shared libraries in "Anatomy of Linux dynamic libraries" (developerWorks, August 2008).
  • A red-black tree (or symmetric binary B-tree) is a self-balancing binary tree invented by Rudolf Bayer. It's a useful tree representation that has good worst-case time for operations such as insert, search, and delete. You can find red-black trees used in a variety of applications, including the construction of associative arrays.
  • In the developerWorks Linux zone, find hundreds of how-to articles and tutorials, as well as downloads, discussion forums, and a wealth other resources for Linux developers and administrators.
  • Stay current with developerWorks technical events and webcasts focused on a variety of IBM products and IT industry topics.
  • Attend a free developerWorks Live! briefing to get up-to-speed quickly on IBM products and tools as well as IT industry trends.
  • Watch developerWorks on-demand demos ranging from product installation and setup demos for beginners, to advanced functionality for experienced developers.
  • Follow developerWorks on Twitter, or subscribe to a feed of Linux tweets on developerWorks.

Get products and technologies

  • Evaluate IBM products in the way that suits you best: Download a product trial, try a product online, use a product in a cloud environment, or spend a few hours in the SOA Sandbox learning how to implement Service Oriented Architecture efficiently.

Discuss

  • Get involved in the My developerWorks community. Connect with other developerWorks users while exploring the developer-driven blogs, forums, groups, and wikis.

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Linux on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Linux
ArticleID=480989
ArticleTitle=Anatomy of Linux Kernel Shared Memory
publish-date=04072010