zswap" is discussed, with some initial performance data provided to demonstrate the potential benefits for a system (partition or guest) which has constrained memory and is beginning to swap memory pages to disk. The technique improves the throughput of a system, while significantly reducing the disk I/O activity normally associated with page swapping. We also explore how zswap works in conjunction with the new compression accelerator feature of the POWER7+ processor to potentially improve the system throughput even more than software compression alone.
This article is a good example of the ongoing collaboration that occurs in the Linux open-source community. New implementations are proposed, discussed, debated, refined and updated across developers, community members, interested customers, and performance teams. Here on the PowerLinux technical community, we are working to highlight more of these examples of work-in-progress from the broader Linux community. These proposals are applicable to both x86 systems and Power systems, so examples shown below cover both realms.
What is zswap?
Zswap is a new lightweight backend framework that takes pages that are in the process of being swapped out and attempts to compress them and store them in a RAM-based memory pool. Aside from a small reserved portion intended for very low-memory situations, this zswap pool is not pre-allocated; it grows on demand, and its maximum size is user-configurable. Zswap leverages an existing frontend already in mainline called frontswap. The zswap/frontswap process intercepts the normal swap path before the page is actually swapped out, so the existing swap page selection algorithms are unchanged. Zswap also introduces key functionality that automatically evicts pages from the zswap pool to a swap device when the zswap pool is full. This prevents stale pages from filling up the pool.
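To make the store, load, and eviction flow described above more concrete, here is a minimal user-space sketch. It is not the kernel implementation: the `ToyZswapPool` class, its method names, the use of `zlib` as the compressor, and the oldest-first eviction order are all assumptions made for illustration, and the small reserved portion of the pool is ignored.

```python
import zlib
from collections import OrderedDict

PAGE_SIZE = 4096  # bytes; assumed page size for this toy model


class ToyZswapPool:
    """Toy model of a compressed, RAM-based swap cache (not kernel code)."""

    def __init__(self, max_pool_bytes):
        self.max_pool_bytes = max_pool_bytes
        self.pool = OrderedDict()  # swap offset -> compressed bytes, oldest first
        self.pool_bytes = 0        # grows on demand, nothing is pre-allocated
        self.swap_device = {}      # stand-in for the real swap device

    def invalidate(self, offset):
        """Forget any copy of the page at this swap offset."""
        data = self.pool.pop(offset, None)
        if data is not None:
            self.pool_bytes -= len(data)
        self.swap_device.pop(offset, None)

    def store(self, offset, page):
        """Intercept a page on its way to swap and try to keep it in RAM."""
        self.invalidate(offset)
        compressed = zlib.compress(page)
        if len(compressed) > self.max_pool_bytes:
            # Can never fit in the pool: fall through to the swap device.
            self.swap_device[offset] = page
            return
        # Pool full: evict the oldest entries to the swap device to make room.
        while self.pool_bytes + len(compressed) > self.max_pool_bytes:
            old_offset, old_data = self.pool.popitem(last=False)
            self.pool_bytes -= len(old_data)
            self.swap_device[old_offset] = zlib.decompress(old_data)
        self.pool[offset] = compressed
        self.pool_bytes += len(compressed)

    def load(self, offset):
        """Return the page from the pool if present, else from the swap device."""
        if offset in self.pool:
            return zlib.decompress(self.pool[offset])
        return self.swap_device[offset]


if __name__ == "__main__":
    pool = ToyZswapPool(max_pool_bytes=100)  # deliberately tiny to force eviction
    for i in range(64):
        pool.store(i, bytes([i]) * PAGE_SIZE)
    print("pages in pool:", len(pool.pool))
    print("pages on swap device:", len(pool.swap_device))
```

The point of the sketch is that a load is served from RAM whenever the page is still in the pool, and that eviction only moves the oldest pages to the swap device once the configured limit is reached.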
The zswap patches have been submitted to the Linux Kernel Mailing List (lkml) for review; you can view them in this post.
Instructions for building a zswap-enabled kernel on a system installed with Fedora 17 can be found on this wiki.
What are the benefits?
When a page is compressed and stored in a RAM-based memory pool instead of actually being swapped out to a swap device, this results in a significant I/O reduction and in some cases can significantly improve workload performance. The same is true when a page is "swapped back in" - retrieving the desired page from the in-memory zswap pool and decompressing it can result in performance improvements and I/O reductions compared to actually retrieving the page from a swap device.
Using the SPECjbb2005 workload for our engineering tests, we gathered some performance data to show the benefits of zswap. SPECjbb2005 is a Java™ benchmark that evaluates server performance and calculates a throughput metric called "bops" (business operations per second). To find out more about this benchmark or see the latest official results, see the SPEC web site. Note that the following results are not tuned for optimal performance and should not be considered official benchmark results for the system, but rather results obtained for research purposes. We liked this benchmark for this use case because we could carefully control the amount of active memory being used, in increments.
The SPECjbb2005 workload ramps up a specified number of "warehouses", or units of stored data, during the run. The number of warehouses is a user-controlled setting that is configured depending on the number of threads available to the JVM. As the benchmark increases the number of warehouses throughout the run, the system utilization level increases. A bops score is reported for each warehouse run. For this work, we focused on the bops score from the warehouse that keeps the system about 50% utilized. We also increased the default runtime for each warehouse to 5 minutes since swapping can be bursty and a longer runtime helps to achieve more consistent results.
For these results, the system was assigned 2 cores, 10 GB of memory, and a 20 GB swap device. A single JVM was created for the SPECjbb2005 runs, using IBM Java. First, a baseline measurement was taken where normal swapping activity occurred, then a run with zswap enabled was measured to show the benefits of zswap. We gathered results on both a POWER7+ system and an x86 system to observe the performance impacts on different architecture types. The mpstat and iostat profilers from the sysstat package, along with vmstat, were used to record CPU utilization, memory usage, and I/O statistics. We recommend taking advantage of the lpcpu package to gather these data points.
To demonstrate the performance effects of swapping and compression, we started with a JVM heap size that could be covered by available memory, and then increased the JVM heap size in increments until we were well beyond the amount of free memory, which forced swapping and/or compression to occur. We recorded the throughput metric and swap rate at each data point to measure the impacts as the workload demanded more and more pages.
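As a rough illustration of how a swap rate can be recorded at each data point, the sketch below derives pages swapped in and out per second from two snapshots of `/proc/vmstat`, using its `pswpin` and `pswpout` counters. This is not the tooling used for the measurements in this article; the function names and the sampling interval are hypothetical.

```python
import time


def parse_vmstat(text):
    """Parse /proc/vmstat-style 'name value' lines into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def swap_rate(before, after, interval_s):
    """Return (pages swapped in per second, pages swapped out per second)."""
    swap_in = (after["pswpin"] - before["pswpin"]) / interval_s
    swap_out = (after["pswpout"] - before["pswpout"]) / interval_s
    return swap_in, swap_out


def sample_swap_rate(interval_s=5.0, path="/proc/vmstat"):
    """Take two snapshots of the live counters, interval_s seconds apart."""
    with open(path) as f:
        before = parse_vmstat(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_vmstat(f.read())
    return swap_rate(before, after, interval_s)


if __name__ == "__main__":
    # Synthetic snapshots so the example runs anywhere.
    first = parse_vmstat("pswpin 100\npswpout 400\n")
    second = parse_vmstat("pswpin 350\npswpout 1400\n")
    print(swap_rate(first, second, interval_s=5.0))  # (50.0, 200.0)
```

On a Linux system, `sample_swap_rate()` could be called once per heap-size data point and logged next to the bops score for that run.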
Setting up zswap
With the current implementation, zswap is enabled by this kernel boot parameter:
We looked at several new in-kernel stats to determine the characteristics of compression during the run. The metrics used were as follows:
pool_pages - number of pages backing the compressed memory pool
reject_compress_poor - rejected pages due to poor compression policy (cumulative) (see max_compressed_page_size sysfs attribute)
reject_zsmalloc_fail - rejected pages due to zsmalloc failure (cumulative)
reject_kmemcache_fail - rejected pages due to kmem failure (cumulative)
reject_tmppage_fail - rejected pages due to tmppage failure (cumulative)
reject_flush_attempted - reject flush attempted (cumulative)
reject_flush_fail - reject flush failed (cumulative)
stored_pages - number of compressed pages stored in zswap
outstanding_flushes - the number of pages queued to be written back
flushed_pages - the number of pages written back from zswap to the swap device (cumulative)
saved_by_flush - the number of stores that succeeded after an initial failure due to reclaim by flushing pages to the swap device
pool_limit_hit - the zswap pool limit has been reached
There are two user-configurable zswap attributes:
max_pool_percent - the maximum percentage of memory that the compressed pool can occupy
max_compressed_page_size - the maximum size of an acceptable compressed page. Any pages that do not compress to be less than or equal to this size will be rejected (i.e. sent to the actual swap device)
We also looked at these frontswap statistics:
failed_stores - how many store attempts have failed (cumulative)
loads - how many loads were attempted (all should succeed) (cumulative)
succ_stores - how many store attempts have succeeded (cumulative)
invalidates - how many invalidates were attempted (cumulative)
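The counters listed above can be combined into a few derived figures. The sketch below assumes each counter is exposed as a file containing a single integer, for example under debugfs directories such as `/sys/kernel/debug/zswap` and `/sys/kernel/debug/frontswap`; those paths, the helper names, and the definition of compression ratio as `stored_pages / pool_pages` are assumptions for illustration, not something confirmed by the patches.

```python
from pathlib import Path


def read_stats(directory):
    """Read every single-integer file in a stats directory into a dict."""
    stats = {}
    for entry in Path(directory).iterdir():
        if entry.is_file():
            text = entry.read_text().strip()
            if text.isdigit():
                stats[entry.name] = int(text)
    return stats


def summarize(stats):
    """Derive a compression ratio and store success rate from raw counters."""
    stored = stats.get("stored_pages", 0)
    pool = stats.get("pool_pages", 0)
    ok = stats.get("succ_stores", 0)
    failed = stats.get("failed_stores", 0)
    return {
        # Assumed definition: uncompressed pages held per page of pool memory.
        "compression_ratio": stored / pool if pool else None,
        # Fraction of store attempts that were accepted.
        "store_success_rate": ok / (ok + failed) if (ok + failed) else None,
    }


if __name__ == "__main__":
    # Hypothetical counter values; on a real system one might instead call
    # read_stats() on the (assumed) debugfs directories and merge the dicts.
    example = {"stored_pages": 430, "pool_pages": 100,
               "succ_stores": 900, "failed_stores": 100}
    print(summarize(example))
```

Reading debugfs normally requires root privileges and a mounted debugfs, so the example prints results for a hard-coded dictionary instead of touching the live system.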
To observe performance and swapping behavior once the zswap pool becomes full, we set the max_pool_percent parameter to 20 - this means that zswap can use up to 20% of the 10GB of total memory.
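To put that limit in concrete terms, the following sketch converts a max_pool_percent value into bytes and pages. It assumes that 10GB means 10 GiB and uses the 64K and 4K base page sizes that this article cites for Power and x86; the function name is hypothetical.

```python
GIB = 1024 ** 3


def pool_limit_pages(total_bytes, max_pool_percent, page_size):
    """Number of whole pages the compressed pool may occupy at its limit."""
    limit_bytes = total_bytes * max_pool_percent // 100
    return limit_bytes // page_size


if __name__ == "__main__":
    total = 10 * GIB  # assumption: the 10GB of system memory is 10 GiB
    print(total * 20 // 100 / GIB)                    # 2.0 (GiB)
    print(pool_limit_pages(total, 20, 64 * 1024))     # 32768 pages of 64K
    print(pool_limit_pages(total, 20, 4 * 1024))      # 524288 pages of 4K
```

The same 2 GiB limit therefore corresponds to far fewer pool pages on a 64K-page system than on a 4K-page system.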
The following graphs represent the SPECjbb2005 performance and swap rate for a run using the normal swapping mechanism.
Note that as "available" memory is used up around 10GB, the performance (the blue line) falls off very quickly and normal page swapping to disk (the red line) increases. The behavior is consistent on both POWER7+ and x86 systems.
Power7+ baseline results:
x86 baseline results:
As you can see, performance decreased dramatically once the system started swapping, then leveled off as the JVM heap was increased further.
The following graphs represent the SPECjbb2005 performance and swap rate for a run when zswap is enabled. In these cases, memory is now being compressed, which significantly reduces the need to go to disk for swapped pages. Performance of the workload (the blue line) still drops off, though not as sharply. More importantly, the system load on I/O drops dramatically.
Power7+ with zswap compression:
x86 with zswap compression:
As you can see, the swap (I/O) rate was dramatically reduced. This is because most pages were compressed and stored in the zswap pool instead of swapped to disk, and taken from the zswap pool and decompressed instead of swapped in from disk when the page was requested again. The small amount of "real" swapping that occurred is due to the fact that some pages compressed poorly - which means they did not meet a user-defined max compressed page size - and were therefore swapped out to the disk, and/or stale pages were evicted from the zswap pool.
Looking at the zswap metrics for each run, we can calculate some interesting statistics. Keep in mind that the base page size differs between Power (64K pages) and x86 (4K pages), which accounts for some of the difference in behavior. Also note that we set the maximum zswap pool size to 20% of total memory for these runs, as mentioned above; this setting can be adjusted as needed. The average zswap compression ratio was 4.3 on Power and 3.6 on x86. On Power, entries for "pool_limit_hit" first appeared at the 17 GB data point; on x86, the pool limit was hit earlier, starting at the 15.5 GB data point. At its peak, the zswap pool stored 139,759 pages on Power and 1,914,720 pages on x86. All of those pages were compressed and stored in the zswap pool rather than being swapped out to disk, which results in the performance improvements seen here.
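As a sanity check on these figures, the peak stored-page counts and average compression ratios can be turned into an estimated pool footprint and compared against the 20% limit. The sketch assumes the compression ratio is the number of stored (uncompressed) pages per pool page and that 10 GB means 10 GiB; because the ratios are averages over the run, the result is only approximate.

```python
GIB = 1024 ** 3


def estimated_pool_gib(stored_pages, compression_ratio, page_size):
    """Approximate pool memory needed to hold stored_pages after compression."""
    return stored_pages / compression_ratio * page_size / GIB


if __name__ == "__main__":
    power = estimated_pool_gib(139_759, 4.3, 64 * 1024)
    x86 = estimated_pool_gib(1_914_720, 3.6, 4 * 1024)
    limit = 10 * 0.20  # 20% of an assumed 10 GiB
    print(f"Power: {power:.2f} GiB, x86: {x86:.2f} GiB, limit: {limit:.2f} GiB")
    # Prints: Power: 1.98 GiB, x86: 2.03 GiB, limit: 2.00 GiB
```

Both estimates land close to 2 GiB, which is consistent with the pool limit being reached on both systems.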
POWER7+ hardware acceleration
The POWER7+ processor introduces new onboard hardware assist accelerators that offer memory compression and decompression capabilities, which can provide significant performance advantages over software compression. As an example, the system specifications for the IBM Flex System p260 and p460 Compute Nodes mention the "Memory Expansion acceleration" feature of the processor.
The current zswap implementation is designed to work with these hardware accelerators when they are available, allowing for either software compression or hardware compression. When a user enables zswap and the hardware accelerator, zswap simply passes the pages to be compressed or decompressed off to the accelerator instead of performing the work in software. Here we demonstrate the performance advantages that can result from leveraging the POWER7+ on-chip memory compression accelerator.
POWER7+ hardware compression results
Because the hardware accelerator speeds up compression, the zswap metrics showed more store and load requests in a given amount of time, which filled up the zswap pool faster than in a software compression run. For this reason, we set the max_pool_percent parameter to 30 for the hardware compression runs - this means that zswap can use up to 30% of the 10GB of total memory.
The following graph represents the SPECjbb2005 performance and swap rate for a run when zswap and the POWER7+ hardware accelerator are enabled. In this case, memory is now being compressed in hardware instead of software, and this results in a significant performance improvement. Performance of the workload (the blue line) still drops off, but even less sharply than the zswap software compression case, and the system load on I/O still remains very low.
Power7+ hardware compression:
As you can see, the swap (I/O) rate was dramatically reduced. This is because most pages were compressed using the hardware accelerator and stored in the zswap pool instead of swapped to disk, and taken from the zswap pool and decompressed in the hardware accelerator instead of swapped in from disk when the page was requested again. The small amount of "real" swapping that occurred is due to the fact that some pages compressed poorly - which means they did not meet a user-defined max compressed page size - and were therefore swapped out to the disk, and/or stale pages were evicted from the zswap pool.
The following graphs show the performance comparison between normal swapping and zswap compression, and the POWER7+ graph also includes the hardware compression results, showing that the hardware accelerator provides even more performance advantages over software compression alone:
Power7+ performance comparison:
x86 performance comparison:
As you can see, this workload shows up to a 40% performance improvement in some cases after the heap size exceeds available memory when zswap is enabled, and the POWER7+ results show that the hardware accelerator can improve the performance by up to 60% in some cases compared to the baseline performance.
Swap (I/O) comparison
The following graphs show the swap rate comparison between normal swapping and zswap compression; the POWER7+ graph also includes the hardware compression results. Swap rates are dramatically reduced on both architectures when zswap is enabled, and the hardware accelerator on POWER7+ reduces the swap rate dramatically as well.
Power7+ swap I/O comparison:
x86 swap I/O comparison:
The new zswap implementation can improve performance while reducing swap I/O, which can also have positive effects on other partitions that share the same I/O bus. The new POWER7+ on-chip memory compression accelerator can be leveraged to provide further performance improvements while still keeping swap I/O very low.