Anatomy of Linux Kernel Shared Memory
Memory de-duplication in the Linux kernel
Software engineering tends to be an evolutionary process. Problems are addressed with solutions that can create new problems and subsequently new solutions. Ideally, the new problems that are created justify the original solution. The technology discussed here is one of the secondary solutions to a problem created by server virtualization. Before delving into KSM, however, let's take a quick look at the original solution and how KSM is applied here and elsewhere.
Virtualization as a technology has been around since the 1960s, popularized by the IBM® System/360® mainframe. Five decades later, virtualization has exploded, making it possible to share a single server with multiple operating systems and applications. This particular use, called server virtualization, is transforming data centers, as a single physical machine can be used to host typically 10 or more virtual machines (VMs), as shown in Figure 1. This virtualization enables the infrastructure to be much more dynamic, power efficient, and (therefore) cost-efficient. See Related topics for more information on server virtualization and its benefits.
Figure 1. Server consolidation through virtualization
With only 10-15% of a typical server in use at any given time, consolidating virtual servers onto a single physical server makes perfect sense. But what about resources? The major resources available include CPU, memory, and networking bandwidth. CPU and networking are typically underutilized, so the real issue becomes available memory. Each operating system makes its own unique demands on available memory resources. But the question that arises is, how unique are those demands?
It turns out that if you virtualize many of the same operating system and application sets, many memory pages are identical. This makes sense, given that operating system and application code and constant data are identical among VMs. And when pages are unique, they can be consolidated, which frees memory for use in other applications. Figure 2 illustrates memory sharing and shows the benefit of more free memory available when pages are shared among VMs whose contents are the same.
Figure 2. Memory sharing across VMs
What you'll soon discover is that although memory sharing in Linux is advantageous in virtualized environments (KSM was originally designed for use with the Kernel-based Virtual Machine [KVM]), it's also useful in non-virtualized environments. In fact, KSM was found to be beneficial even in embedded Linux systems, indicating the flexibility of the approach. Now, let's explore the Linux approach to memory sharing and how you can use it to increase the memory density of a server and therefore increase its ability to host additional applications or VMs.
Parallels to other technologies
A recent advancement in storage technologies called de-duplication is a precursor to memory sharing in Linux and other hypervisors. De-duplication is a technology in which stored data is reduced by removing redundant data (either on a block basis or on larger segments of data such as files). Common segments are merged (in a copy-on-write [CoW] fashion), freeing space for other uses. Storage de-duplication is an important technology moving forward, as it optimizes the capacity of storage by removing duplicate data. In this way, storage is cheaper, because less is ultimately required. Given the rate of data growth, this functionality is very relevant.
KSM exists as a daemon in the kernel (called
periodically performs page scans to identify duplicate pages and merges duplicates
to free pages for other uses. It does this in a way that's transparent to the user.
For example, duplicate pages are merged (and subsequently marked read only), but
if one of the users of the page changes it for any reason, that user will receive
his or her own copy (in a CoW fashion). You can find the full implementation of the
KSM kernel module within the kernel source at ./mm/ksm.c.
KSM relies on a higher-level application to provide instruction on which memory regions are candidates for merging. Although it would be possible for KSM to simply scan all anonymous pages in the system, it would be wasteful of CPU and memory (given the space necessary to manage the page-merging process). Therefore, applications can register the virtual areas that are likely to contain duplicate pages.
The KSM application programming interface (API) is implemented through the
madvise system call (see Listing 1)
and a new advice parameter called
(which indicates that the defined region is mergeable). A region can be removed
from the mergeable state through the
parameter (which immediately un-merges any merged pages from that region). Note
that removing a region of pages through
result in an
EAGAIN error, as the operation could run out
of memory during the un-merge process, potentially leading to even greater trouble
Listing 1. The madvise system call
#include <sys/mman.h> int madvise( void *start, size_t length, int advice );
Once a region is defined as mergeable, KSM adds this region to its list of working memory. When KSM is enabled, it searches for identical pages, keeping one page in a write-protected CoW fashion and freeing one for other uses.
KSM uses a approach that's different from the approach used in storage de-duplication. In traditional de-duplication, objects are hashed, and the hash value is used as the initial check for similarity. When hashes are identical, the next step is a comparison of the actual objects (in this case, a memory comparison) to formally determine whether the objects are identical. KSM used this approach in its first implementation, but a more straightforward approach was developed to simplify it.
In the current KSM, pages are managed by two red-black trees, one of which is ephemeral. The first tree, call the unstable tree, is used to store new pages that are not yet understood to be stable. In other words, pages that are candidates for merging (unchanged for some period of time) are stored in the unstable tree. Pages in the unstable tree are not write protected. The second tree, called the stable tree, stores those pages that have been found to be stable and merged by KSM. To identify whether a page is volatile or non-volatile, KSM uses a simple 32-bit checksum. When a page is scanned, its checksum is calculated and stored with the page. On a subsequent scan, if a newly computed checksum is different from the previously generated checksum, the page is changing and is therefore not a good candidate for merging.
The first step in handling a single page using KSM's process is to see whether the page
can be found in the stable tree. The process of searching the tree is interesting in
that each page is treated as a very large number (the contents of the page). A
memcmp (memory compare) operation is performed
on the page and node's page of consideration. If the
returns 0, the pages are equal and a match is found. Otherwise,
memcmp can return -1 (which means that the
candidate page is less than the current node's page) or 1 (the candidate page
is greater than the current node's page). In each case, the operation leads the search
down the red-black tree (to the left if less than and to the right if greater than).
Although comparing 4KB pages seems fairly heavyweight, in most cases, the
memcmp will end prematurely once a difference is found.
Therefore, the process itself is fast and efficient. See Figure 3
for a visualization of this process.
Figure 3. Search process for a page in the tree
If the candidate page is found in the stable tree, then it is merged and the candidate
page freed. This code can be found in ksm.c/
(which is called by ksm.c/
Otherwise, if the candidate page is not found, you move on to the unstable tree
The first step in the unstable tree search is to recalculate the checksum over the page. If it's found to be different than the original checksum, the page is discarded from further search in this scan (because if it changes, it's not worthwhile to track). If the checksum has not changed, the unstable tree is searched for the candidate page. The unstable tree processing is a bit different than the stable tree. First, if the search does not find the page in the unstable tree, you add a new node in the unstable tree for this page. But if the page is found in the unstable tree, you merge the pages, and then migrate the node to the stable tree.
When the scan is complete (performed through ksm.c/
the stable tree is kept, but the unstable tree is removed and rebuilt at the time of
the next scan. This process simplifies things quite a bit, as the organization of the
unstable tree can change based upon changes to the pages (recall that pages in
the unstable tree are not write protected). Because all pages in the stable tree
are write protected, a page fault is generated when a page tries to be written,
allowing the CoW process to un-merge the page for the writer (see
break_cow()). Orphaned pages in the stable
tree are subsequently removed (unless two or more users of the page exist,
indicating that the page is still shared).
As discussed, KSM uses red-black trees for managing pages for fast lookup. Linux actually includes red-black trees as a reusable data structure, and they can be used generically. Red-black trees are also used by the Completely Fair Scheduler (CFS) for storage of tasks in time order. You can find the implementation of red-black trees in the kernel at ./lib/rbtree.c.
KSM configuring and monitoring
Managing and monitoring KSM occurs through sysfs (at the root /sys/kernel/mm/ksm). In this sysfs subdirectory, you'll find a collection of files, some used for control and others used for monitoring.
The first file, run, is used to enable or disable paging merging with KSM. By
default, KSM is disabled (
0) but can be enabled by
1 to the file to enable the KSM daemon (for
echo 1 > sys/kernel/mm/ksm/run). From the
run state, you can disable the daemon (but keep the current set of merged pages) by
0. Also from the run (
state, you can stop KSM and request that all merged pages become un-merged by
While KSM runs, you can control it with three parameters (files within sysfs). The
sleep_millisecs file defines how many milliseconds
should sleep before performing another page scan. The max_kernel_pages file defines
the maximum number of pages that
ksmd can use (the
default is 25% of the available memory, but you can write a
to specify no limit). Finally, the pages_to_scan file defines how many pages can be
scanned in a given scan. Any user can view these files, but the user must have root
privileges to change them.
There are also five monitorable files exported through sysfs that indicate
ksmd's operation and effectiveness (these are read only).
The full_scans file indicates the number of full area scans that have been performed. The
remaining four files indicate the page-level statistics of KSM:
- pages_shared: The number of unswappable kernel pages that KSM is using
- pages_sharing: An indication of memory savings
- pages_unshared: The number of unique pages repeatedly checked for merging
- pages_volatile: The number of pages that are changing too often
The KSM authors define that a high ratio of pages_sharing to pages_shared indicates efficient sharing of pages (whereas the reverse is indicative of wasted effort).
Linux is not alone in using page sharing to improve memory efficiency, but it is unique in its implementation as an operating system feature. VMware's ESX server hypervisor provides this feature under the name Transparent Page Sharing (TPS), while XEN calls it Memory CoW. But whatever the name or implementation, the feature provides better memory utilization, allowing the operating system (or hypervisor, in the case of KVM) to over-commit memory to support greater numbers of applications or VMs. You can find KSM—and many other interesting features—in the latest 2.6.32 Linux kernel.
- The authors of KSM at Red Hat wrote a great paper on KSM entitled "Increasing memory density by using KSM" (PDF). This paper introduces KSM and delves into the details of its implementation.
- This article demonstrates an advantage of using Linux as a hypervisor. You can learn more about this and other topics in "Anatomy of a Linux Hypervisor" (developerWorks, May 2009). Finally, read about another kernel subsystem that uses red-black trees in "Inside the 2.6 Completely Fair Scheduler" (developerWorks, December 2009).
- You can read more about memory de-duplication from the 2.6.32 kernel release notes. These notes provide useful materials on KSM (including a link to the kernel documentation) in addition to the various other changes that made their way into 2.6.32. In the KSM section, you'll find an interesting reference to running 52 Windows® XP VMs on a single server with 16GB of RAM (where each was allocated its own 1GB RAM segment but shared with KSM).
- Learn more about de-duplication of storage systems and the approaches in the IT Analysis article, "De-dupe for big storage savings?".
- Wikipedia provides a great introduction to de-duplication, including the various implementation strategies (such as source and target).
- Although KSM is an automatic mechanism for reducing memory footprint in virtualized servers, application developers have had a similar manual capability for some time. Linux dynamic (or shared) libraries permit a binary to use a common library object instead of one statically compiled with the application. Read more about shared libraries in "Anatomy of Linux dynamic libraries" (developerWorks, August 2008).
- A red-black tree (or symmetric binary B-tree) is a self-balancing binary tree invented by Rudolf Bayer. It's a useful tree representation that has good worst-case time for operations such as insert, search, and delete. You can find red-black trees used in a variety of applications, including the construction of associative arrays.
- In the developerWorks Linux zone, find hundreds of how-to articles and tutorials, as well as downloads, discussion forums, and a wealth other resources for Linux developers and administrators.
- Evaluate IBM products in the way that suits you best: Download a product trial, try a product online, use a product in a cloud environment.
- Follow developerWorks on Twitter, or subscribe to a feed of Linux tweets on developerWorks.