64KB pages on Linux for Power systems
For discussions or questions...
To start a discussion or get a question answered, consider posting on the Linux for Power Architecture forum.
Additional Linux on Power performance information is available on the Performance page.
The IBM POWER5+ and POWER6 processors add support for two new page sizes in their memory management hardware - 64kB and 16GB. This document outlines the support for the 64kB page size in Linux for Power.
RHEL5 on Power uses a 64kB base page size in order to exploit this new hardware feature and gain additional performance. Using a 64kB base page size has significant performance advantages but also some disadvantages, as discussed below.
SLES 11, which is now available, also uses a 64kB base page size.
Linux for Power has historically used 4kB as the base page size - the basic unit in which the Linux virtual memory subsystem manages memory. The base page size is the granularity at which memory can be allocated, files can be mapped into memory, and memory protections (such as read-only or no-execute) can be applied.
Using a 4kB base page size was convenient because it is supported in hardware by all Power architecture processors, and is a suitable page size for machines with relatively small amounts of memory (a few hundred megabytes or less). However, the PowerPC ELF ABI specification allows the operating system to use any power-of-2 page size between 4kB and 64kB, and it is quite possible to use a larger base page size even on a machine that supports only 4kB pages in hardware. Doing so requires the architecture-specific part of the Linux virtual memory subsystem to generate multiple page table entries (PTEs) for each page mapped into a process's address space.
Note: For more details on the PowerPC ELF ABI 64-bit specification and the defined rules of program loading, see this link.
The Linux virtual memory subsystem currently supports only two page sizes - the base page size and a larger "huge" page size.
Huge pages are accessible only through the "hugetlbfs" file system interface and have several limitations compared to normal pages: they can't be demand-paged from a file on disk, and they can't be swapped out to swap storage.
Furthermore, the amount of memory available for use as huge pages has to be set by the system administrator, and that memory is then not available for use as normal pages. Linux for Power supports 16MB huge pages on POWER4 and subsequent IBM Power processors, including IBM's latest POWER6 processor.
A related Linux community project on SourceForge called libhugetlbfs provides transparent access to huge pages for compiled executables. This approach is still limited by the availability of 16MB pages and by the administrative burden of allocating a sufficient number of huge pages for the application.
There are two choices for exploiting 64kB hardware pages in Linux for Power: change the base page size to 64kB, or change the huge page size to 64kB. Changing the huge page size offers no compelling advantage: programs that have been adapted to use huge pages will often see a minor performance regression, and those that have not been adapted will see no change.
On the other hand, changing the base page size to 64kB means that the performance advantages of 64kB pages, compared to 4kB pages, are available to all programs without modification. For this reason, the option of using 64kB as the base page size was added to Linux for Power. The 64kB base page size is a configuration option in the Linux kernel source code. This option is enabled in the RHEL 5 kernel for 64-bit Power, but not in the SLES 10 kernel. Kernels with the option enabled (i.e., using a 64kB base page size) will still run on machines that don't have hardware support for 64kB pages. On those machines, the kernel will create up to 16 hardware PTEs (HPTEs), one per 4kB hardware page, on demand, for each 64kB software page.
- The SLES 10 kernel source code does not have all of the kernel patches required to fully support 64kB pages. For a discussion on this, see re-building the SLES 10 kernel.
- SLES 11 is built with 64kB pages.
- The RHEL 5 kernel can be re-built with 4kB pages for direct before/after performance comparisons. See re-building a RHEL 5 kernel for Power for details.
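The figure of "up to 16 hardware PTEs" for each 64kB software page, mentioned above for machines without hardware 64kB support, follows directly from the ratio of the two page sizes. The following is a minimal sketch of that arithmetic only; it does not model the kernel's actual data structures, and the function names are hypothetical, chosen purely for illustration.

```python
# Sketch only: models the page-size arithmetic, not real kernel code.
SOFTWARE_PAGE_SIZE = 64 * 1024  # base page size used by the Linux VM subsystem
HARDWARE_PAGE_SIZE = 4 * 1024   # page size supported by the hardware


def hptes_per_software_page(software_page=SOFTWARE_PAGE_SIZE,
                            hardware_page=HARDWARE_PAGE_SIZE):
    """Maximum number of hardware PTEs needed to back one software page."""
    return software_page // hardware_page


def locate(address, software_page=SOFTWARE_PAGE_SIZE,
           hardware_page=HARDWARE_PAGE_SIZE):
    """Return (software page number, index of the hardware subpage within it)."""
    page_number, offset = divmod(address, software_page)
    return page_number, offset // hardware_page


print(hptes_per_software_page())  # number of 4kB subpages in one 64kB page
print(locate(0x12345))            # which subpage a sample address falls in
```

Because the hardware entries are created on demand, a process that touches only one 4kB region of a 64kB software page would need only one of the 16 possible entries for that page.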
There are two main advantages of a 64kB base page size, compared to 4kB:
- The amount of memory that can be accessed without causing a TLB miss (the "TLB reach") is expanded by a factor of 16, from 4MB to 64MB on POWER5+. This reduces the TLB miss rate and improves performance, particularly for programs with a working set between 4MB and 64MB. This advantage applies only to machines that have hardware support for 64kB pages.
- The per-byte software overheads of the kernel memory management code are reduced because it is working with larger units of data. For example, one page fault makes 64kB accessible to the process rather than just 4kB. This applies whether or not the hardware supports 64kB pages.
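Both advantages come down to simple arithmetic, sketched below. The TLB entry count of 1024 is an assumption, inferred from the 4MB reach with 4kB pages quoted above rather than taken from processor documentation, and the function names are hypothetical.

```python
KB = 1024
MB = 1024 * KB

# Assumption: inferred from the quoted 4MB reach at a 4kB page size.
TLB_ENTRIES = 1024


def tlb_reach(entries, page_size):
    """Bytes of memory addressable without taking a TLB miss."""
    return entries * page_size


def faults_to_touch(region_bytes, page_size):
    """Page faults needed to make a contiguous, page-aligned region accessible."""
    return -(-region_bytes // page_size)  # ceiling division


for page_size in (4 * KB, 64 * KB):
    reach_mb = tlb_reach(TLB_ENTRIES, page_size) // MB
    faults = faults_to_touch(1 * MB, page_size)
    print(f"{page_size // KB}kB pages: TLB reach {reach_mb}MB, "
          f"{faults} faults to touch 1MB")
```

The second function illustrates the reduced per-byte overhead: touching the same 1MB region takes one sixteenth as many page faults with 64kB pages as with 4kB pages.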
There are also two main disadvantages:
- Files are cached in kernel memory in units of the base page size. That is, a 1-byte file occupies one full page of kernel memory while it is held in the kernel page cache, which is the case whenever the file contents are being accessed. Thus, if the workload involves access to a large number of small files (tens of kB or smaller), the 64kB page size will result in more memory being wasted in the kernel page cache than a 4kB page size would. This wastage is sometimes referred to as "internal fragmentation".
- Buggy programs that assume that the page size is 4kB may behave incorrectly. Sometimes the assumption is implicit rather than explicit. For example, a program may request a 12kB stack for a newly-created thread and then fail when it finds that the thread has been given a 64kB stack rather than 12kB. Fortunately this has not proved to be a major problem in practice.
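A program can avoid the 4kB assumption by asking the system for its page size at run time instead of hard-coding 4096. The sketch below uses Python's standard `os.sysconf` call; the helper names `page_size` and `round_up_to_page` are hypothetical. It also shows why a 12kB request ends up as 64kB when the allocation is rounded up to a whole number of pages.

```python
import os


def page_size():
    """Base page size reported by the running kernel, in bytes."""
    return os.sysconf("SC_PAGE_SIZE")


def round_up_to_page(nbytes, size=None):
    """Round a byte count up to a whole number of pages."""
    if size is None:
        size = page_size()
    return -(-nbytes // size) * size  # ceiling division, then scale back


requested = 12 * 1024  # the 12kB thread stack from the example above
print("page size on this system:", page_size())
print("12kB rounded for 4kB pages: ", round_up_to_page(requested, 4 * 1024))
print("12kB rounded for 64kB pages:", round_up_to_page(requested, 64 * 1024))
print("12kB rounded on this system:", round_up_to_page(requested))
```

Code that sizes or aligns buffers from the reported value, rather than from a constant, behaves the same way on 4kB and 64kB kernels.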
On Power machines with hardware support for 64kB pages, we see performance improvements ranging from 10% to 30% across a wide range of applications, provided that the machine has enough memory. On machines without hardware 64kB page support, most applications see a performance improvement of around 1% or 2% from the reduced kernel overheads.
The degree to which internal fragmentation increases the amount of memory used by the page cache depends on the distribution of file sizes in use, which is very workload-dependent. For compute-intensive HPC applications that deal with very large files (megabytes or more), the effect is very small. For a performance workload like SPECsfs, which can have many small files, the performance effects can be larger.
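To get a rough feel for how file size distribution drives internal fragmentation, the following sketch estimates page cache usage for a list of file sizes. It assumes every file is fully cached and ignores all other kernel overheads; the two sample workloads are invented for illustration and are not measurements.

```python
def cached_bytes(file_size, page_size):
    """Page cache memory used by one fully cached file."""
    pages = -(-file_size // page_size)  # ceiling division
    return pages * page_size


def wasted_bytes(file_sizes, page_size):
    """Cache memory that holds no file data (internal fragmentation)."""
    used = sum(cached_bytes(size, page_size) for size in file_sizes)
    return used - sum(file_sizes)


# Hypothetical workloads, chosen only to contrast the two cases.
many_small_files = [2 * 1024] * 1000  # 1000 files of 2kB each
one_large_file = [10_000_000]         # a single file of about 10MB

for name, sizes in (("small files", many_small_files),
                    ("large file", one_large_file)):
    for page_size in (4 * 1024, 64 * 1024):
        waste = wasted_bytes(sizes, page_size)
        print(f"{name}, {page_size // 1024}kB pages: {waste} bytes wasted")
```

With many small files the waste grows roughly in proportion to the page size, whereas for a large file it is bounded by less than one page regardless of the file's size.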