A Performance Evaluation of 64KB Pages on Linux for Power Systems

For discussions or questions...

To start a discussion or get a question answered, consider posting on the Linux for Power Architecture forum.

Additional Linux on Power performance information is available on the Performance page


Introduction

Main memory sizes have increased significantly on systems from PCs to supercomputers, to accommodate increasingly complex software with larger working sets. However, the caches and buffers used for address translation grow at a much slower pace. For example, the translation-lookaside buffer (TLB) inside a POWER5+ processor has 2048 entries to map effective addresses to physical addresses. If each entry corresponds to a 4KB page, the TLB on a 16GB system can map only 8MB, or 0.05%, of the total main memory space. Researchers at Rice University call this ratio the TLB coverage: the amount of memory accessible without incurring TLB misses [1]. Except for programs with very high spatial locality, small pages can cause frequent TLB misses, which incur a significant performance penalty.

The use of larger page sizes is a well-known method to facilitate more efficient address translation. Larger hardware pages, such as 16MB on Power systems and 2MB on Intel systems, have been utilized by Linux for some time. Today, some processor vendors allow a wide range of hardware page sizes to address the same problem. For instance, Itanium 2 supports page sizes ranging from 4KB to 4GB. One advantage of using 64KB pages over 16MB pages on Power systems is that system administrators do not have to explicitly pre-allocate memory for 16MB pages (also called hugepages). In other words, most applications can run as-is without any special system setup.

Beyond address translation, larger pages can also improve the efficiency of page fault handling, page prefaulting and data prefetching from main memory. These aspects are elaborated on later in the article.

The 64KB hardware page size is the first major change to the base page size of Power systems since the architecture's inception. Support for it first appeared in the GS chip DD2.1 revision of the POWER5 processor, also known as the POWER5+. The current Linux implementation does not allow 4KB pages and 64KB pages to be used simultaneously. Red Hat has adopted 64KB as the base page size on Power systems since the release of RHEL 5. On older Power systems that do not support 64KB pages, the operating system runs in an emulated mode, which can help performance under some circumstances.
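
Because the base page size is a property of the kernel rather than of the application, a program can discover it at run time through the standard sysconf(3) interface. A minimal sketch (our illustration, not part of the original study):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Prints 65536 on a 64KB-page kernel and 4096 on a 4KB-page kernel. */
    printf("base page size: %ld bytes\n", sysconf(_SC_PAGESIZE));
    return 0;
}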

Information about microarchitectures of POWER5 and POWER6 can be found in [2, 3, 4].

In this article, we use three workloads, namely STREAM, Linpack and SPECjbb2005, to quantify the performance impact of 64KB pages over 4KB pages on two POWER6 systems.

Address mapping

One necessary step for memory access is to translate the effective address (EA) used by the software to the real address (RA) used by the hardware for instructions and data in memory and storage. The PowerPC Architecture specifies a translation-lookaside buffer (TLB) and a segment-lookaside buffer (SLB) for the translation task. Once translated, the (EA, RA) pair is stored in one of the two first-level translation tables called Effective-to-Real Address Translation (ERAT) tables: one for instructions and one for data. The SLB and TLB are only used if the ERATs fail to find the needed mapping. If the SLB and TLB miss as well, the page tables are walked to obtain the mapping, which is expensive and can drive up the application's cycles per instruction (CPI).

In the POWER5+ processor, the TLB contains 2048 entries. With 4KB pages, the TLB can map only 8MB of main memory; on a 1GB system, 8MB corresponds to 0.8% of the physical memory space. With 64KB pages, the same TLB covers 16 times more memory. If the cost of address translation is significant to a workload's overall performance, increasing the page size can be very beneficial.
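
The coverage arithmetic is easy to verify directly. The short sketch below reproduces the numbers quoted above for a 2048-entry TLB:

#include <stdio.h>

int main(void)
{
    const long entries = 2048;                      /* POWER5+ TLB entries */
    const long page_sizes[] = { 4096, 65536 };      /* 4KB and 64KB pages */

    for (int i = 0; i < 2; i++) {
        long coverage = entries * page_sizes[i];    /* bytes mapped without TLB misses */
        printf("%2ldKB pages: %3ld MB of TLB coverage\n",
               page_sizes[i] / 1024, coverage / (1024 * 1024));
    }
    return 0;                                       /* prints 8 MB and 128 MB */
}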

Note that the POWER6 processor does not have a TLB, so the ERATs play a larger role in address mapping. We examine the data ERAT in the performance evaluation below.

Page Fault Handling

Besides address translation, the efficiency of page fault handling improves with large page sizes. With 4KB pages, faulting in 64KB of data takes 16 page faults; with 64KB pages, it takes only one, and the time spent processing page faults drops accordingly. Page prefaulting is a secondary benefit of larger pages. In the 64KB emulation mode, one 64KB page fault implicitly prefaults 16 contiguous 4KB pages, so emulated 64KB pages can sometimes perform better than true 4KB pages. However, emulation can also cause degradation, since prefaulted but unwanted pages take time to process and consume extra memory.
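
The reduction in fault count is easy to observe with getrusage(2), which reports the minor page faults a process has taken. The sketch below (our illustration, not part of the original study) touches 1MB of freshly mapped anonymous memory one page at a time; expect roughly 256 minor faults with 4KB pages but only 16 with 64KB pages:

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/resource.h>

int main(void)
{
    const size_t len = 1 << 20;                     /* 1MB of anonymous memory */
    long page = sysconf(_SC_PAGESIZE);
    struct rusage before, after;

    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    getrusage(RUSAGE_SELF, &before);
    for (size_t off = 0; off < len; off += (size_t)page)
        buf[off] = 1;                               /* first touch faults each page in */
    getrusage(RUSAGE_SELF, &after);

    printf("%ld minor faults to touch %zu bytes with %ldB pages\n",
           after.ru_minflt - before.ru_minflt, len, page);
    munmap(buf, len);
    return 0;
}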

Memory Bandwidth

For workloads sensitive to the memory transfer rate, a larger page size can provide a gain as well. POWER chips have a built-in hardware prefetching capability: upon detecting sequential access, the hardware prefetches from main memory up to the page boundary. That is, with a 64KB page size, it prefetches until the whole 64KB page has been read. Prefetching has the beneficial effect of hiding memory latency for applications that access memory sequentially. With 64KB pages, 16 times fewer prefetch streams are initiated, so memory latency is hidden more effectively and overall memory throughput increases.
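
The access pattern that benefits is a long unit-stride sweep, as in the STREAM kernels measured below. A minimal sketch (ours; the array size is an arbitrary choice, large enough to defeat the caches of the systems discussed here):

#include <stdio.h>
#include <stdlib.h>

#define N (16 * 1024 * 1024)                /* 16M doubles = 128MB per array */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    if (!a || !b) return 1;

    for (long i = 0; i < N; i++)
        b[i] = 1.0;

    /* Unit-stride reads and writes: the hardware detects the sequential
     * pattern and prefetches ahead of the loop, up to each page boundary.
     * With 64KB pages, each prefetch stream runs 16 times longer before
     * it must be re-established. */
    for (long i = 0; i < N; i++)
        a[i] = 2.0 * b[i];

    printf("%f\n", a[N - 1]);               /* defeat dead-code elimination */
    free(a); free(b);
    return 0;
}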

Internal Fragmentation

There are disadvantages to using 64KB pages, the biggest being increased internal fragmentation. When a 3KB file is placed in a 4KB page, 25% of the page is unused; when the same file is placed in a 64KB page, 95% of the page is unused. A large number of under-utilized pages takes away space from database buffers, page caches, and so on, and system performance can suffer. The degree of impact depends on how much memory pressure the system is under.
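
The figures quoted above follow directly from the arithmetic; this small sketch reproduces them:

#include <stdio.h>

int main(void)
{
    const double file_bytes = 3 * 1024;             /* the 3KB file from the example */
    const double page_bytes[] = { 4096, 65536 };

    for (int i = 0; i < 2; i++)
        printf("%2.0fKB page: %.1f%% unused\n", page_bytes[i] / 1024,
               100.0 * (page_bytes[i] - file_bytes) / page_bytes[i]);
    return 0;                                       /* prints 25.0% and 95.3% */
}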

Building of the 4KB Kernel

To quantify the performance impact of 64KB pages over 4KB pages, we need two kernels for comparison. For the kernel with 64KB pages, we use the stock RHEL 5.2 kernel for POWER. For the kernel with 4KB pages, we build one from the RHEL 5.2 source code with the 64KB page feature disabled.

The build instructions can be found at http://www.ibm.com/developerworks/wikis/display/LinuxP/Re-building+a+RHEL+5+kernel+for+Power.

Measurement Systems

Data were collected and analyzed on two POWER6 systems:

System                Abbreviation  Clock speed (GHz)  Number of cores  Memory (GB)  L3 per chip (MB)
IBM Power 575         Power 575     4.7                32               128          32
IBM BladeCenter JS22  JS22          4.0                4                32           0

RHEL 5.2 is used across these systems.

Performance Tool

Oprofile [5], a profiling tool hosted as a SourceForge project, is used to examine the cycles used, instructions completed and data ERAT misses for the two base page sizes. The following performance counters are examined, wherever appropriate.

POWER6 performance counter  Sampling frequency  Description
PM_CYC_GRP1                 10,000,000          Cycles
PM_INST_CMPL_GRP1           10,000,000          Instructions completed
PM_LSU_DERAT_MISS_GRP2      >= 10,000           Data ERAT misses

The sampling frequency for each counter is set so that, based on our profiling experience, profiling overhead is minimal and the counts are highly accurate.

Performance Results

To quantify the performance impact of 64KB pages relative to 4KB pages, STREAM, Linpack and SPECjbb2005 were measured with the 4KB and 64KB kernels based on RHEL 5.2, for each metric of interest.

For the percentage improvement on a metric, where the metric could be memory bandwidth or instructions per cycle, we use

((metric with 64KB pages - metric with 4KB pages) / metric with 4KB pages) * 100%

For the percentage reduction on a metric, where the metric could be data ERAT misses, we use

((metric with 4KB pages - metric with 64KB pages) / metric with 4KB pages) * 100%

(1) STREAM

STREAM [6] is one of the de facto industry-standard benchmarks for measuring computer memory bandwidth. It was created by John McCalpin while he was at the University of Delaware. It is specifically designed to process a data set much larger than the available caches on a given system, so that data must be read from main memory; the resulting transfer rate is indicative of the performance of very large, vector-style applications.

In this study, we use the OpenMP support that comes with the IBM XL Fortran compiler to invoke parallel computation. Each OpenMP thread is affinitized to a particular logical processor by setting the following environment variable before executing the STREAM workload:

export XLSMPOPTS=startproc=0:stride=s

where s is 1 in SMT mode and 2 in ST mode.

Key parameters used:

System  SMT       Total L3 caches (MB)  Array size (N)  Memory for three arrays (MB)
p 575   disabled  1,024                 1,000,000,284   22,888
JS22    enabled   64                    125,000,000     2,861

The memory required for the arrays is much larger than the available L3 cache, which guarantees that array elements must be transferred continuously from main memory to the caches during the runs. The tunings are based on the respective published results.
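
The last column of the table follows from STREAM's layout of three arrays of N double-precision (eight-byte) elements each; a quick check of the tabulated figures:

#include <stdio.h>

int main(void)
{
    const long n[] = { 1000000284L, 125000000L };   /* p 575, JS22 */

    for (int i = 0; i < 2; i++)
        printf("N = %10ld: %5.0f MB for three arrays\n",
               n[i], 3.0 * 8 * n[i] / (1024 * 1024));
    return 0;                   /* prints 22888 MB and 2861 MB, as tabulated */
}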

Performance data are as follows:

Performance impact of 64KB pages over 4KB pages  Power 575  JS22
Improvement on GB/sec                            56.9 %     6.3 %

Observations:

(1) The performance gain from 64KB pages should come mainly from the more efficient hardware prefetching mechanism.

(2) The significant performance gain on the IBM Power 575, 56.9%, is due to a special hardware configuration called the chip pump mode, in which each POWER6 chip is affinitized to particular memory DIMMs. On other Power systems, an improvement closer to the 6% seen on the JS22 is probably a reasonable expectation.

(2) Linpack

Linpack (High-Performance Linpack, or HPL) [7] is a collection of Fortran subroutines that solves a dense system of linear equations. It is widely used in determining the TOP500 list, and its performance results are available for a wide range of systems. In this study, we use the IBM Engineering and Scientific Subroutine Library (ESSL) 4.3.1 as the math library and gcc 4.1.2, which comes with RHEL 5.2, to compile the Linpack code.

Key parameters used:

System  SMT       N        NB   P  Q
p 575   enabled   105,400  120  4  16
JS22    disabled  59,000   480  2  2

These parameters are based on their respective published results.
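
The problem size N is chosen so that the dense N x N matrix nearly fills memory, while P x Q presumably matches the number of hardware threads (4 x 16 = 64 on the p 575 with SMT enabled, 2 x 2 = 4 on the JS22 with SMT disabled). A quick check of the matrix footprint, assuming HPL's usual eight-byte (double-precision) elements:

#include <stdio.h>

int main(void)
{
    const double n[] = { 105400, 59000 };           /* p 575, JS22 */

    for (int i = 0; i < 2; i++)
        printf("N = %6.0f: %4.1f GB matrix\n",
               n[i], 8 * n[i] * n[i] / (1024.0 * 1024 * 1024));
    return 0;       /* 82.8 GB on the 128GB p 575; 25.9 GB on the 32GB JS22 */
}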

Performance data are as follows:

Performance impact of 64KB pages over 4KB pages  Power 575  JS22
Improvement on Gflops                            14.4 %     11.3 %
Improvement on Instructions Per Cycle (IPC)      12.6 %     10.1 %
Reduction on Data ERAT Misses                    98.4 %     96.1 %

Observations:

(1) 64KB pages provide a performance boost over 4KB pages of 11-14%. Instructions per cycle improves in a similar range.

(2) There is a significant reduction in data ERAT misses, which means address mapping is mostly resolved in the data ERAT. The increased memory bandwidth and more efficient page fault handling should contribute to the improved score as well, although it is not easy to separate their contributions quantitatively.

(3) SPECjbb2005

SPECjbb2005 [8] is SPEC's benchmark for evaluating the performance of server-side Java. The benchmark exercises implementations of the Java Virtual Machine (JVM), the Just-In-Time (JIT) compiler, garbage collection, threads and some aspects of the operating system.

In this study, we use IBM's 32-bit Java 6 SR1 and run the workload with multiple JVMs, each affinitized to a particular POWER6 processor chip: two JVMs on the JS22 blade and 16 JVMs on the IBM Power 575, with SMT enabled. Each JVM runs up to 8 warehouses, with peak performance typically reached at 4 warehouses. In compliance with SPEC's run rules, we note that these are engineering runs, not published results.

Key Java parameters used:

-Xms2560m -Xmx2560m -Xmn1843m -Xgcpolicy:gencon -Xgcthreads4 -Xcompactgc -XlockReservation -Xnoloa
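
The 2560MB heap (-Xms/-Xmx) is close to the practical ceiling for a 32-bit JVM, which is one reason the workload is spread across multiple JVMs. A quick footprint check, assuming the Java heap dominates each JVM's memory use:

#include <stdio.h>

int main(void)
{
    const int jvms[] = { 16, 2 };           /* Power 575, JS22 */
    const double xmx_mb = 2560;             /* -Xmx2560m from the flags above */

    for (int i = 0; i < 2; i++)
        printf("%2d JVMs x %.0f MB = %2.0f GB of heap\n",
               jvms[i], xmx_mb, jvms[i] * xmx_mb / 1024);
    return 0;       /* 40 GB on the 128GB Power 575; 5 GB on the 32GB JS22 */
}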

Performance data are as follows:

Performance impact of 64KB pages over 4KB pages  Power 575  JS22
Improvement on ops/sec                           14.7 %     21.4 %
Improvement on Instructions Per Cycle (IPC)      14.7 %     18.5 %
Reduction on Data ERAT Misses                    36.3 %     34.7 %

Observations:

(1) 64KB pages provide a performance boost over 4KB pages of 15-21%. Instructions per cycle improves in a similar range.

(2) There is a good reduction in data ERAT misses. As mentioned above, the increased memory bandwidth and more efficient page fault handling should contribute to the improved score as well, although it is not easy to separate their contributions quantitatively.

Conclusions

The performance data for Linpack and SPECjbb2005 indicate that 64KB pages outperform 4KB pages by 11-21%. The gain in memory bandwidth can be as high as 57% on the IBM Power 575 with the chip pump mode. Naturally, the gain will vary for different applications under different system configurations.

However, these gains assume the system is not under memory pressure. If an application uses a large number of small files, say 1KB files, the higher internal fragmentation that comes with 64KB pages can increase memory pressure on the system and might cause performance degradation.

Overall, 64KB pages are intended to provide a good performance gain across the majority of environments.

References

[1] J. Navarro, S. Iyer, P. Druschel, A. Cox, "Practical, Transparent Operating System Support for Superpages," Proceedings of the Fifth Symposium on Operating Systems Design and Implementation, Boston, Mass., December 2002.

[2] P. Mackerras, T. S. Matthews, R. C. Swanberg, "Operating System Exploitation of the POWER5 System," IBM J. Res. & Dev. 49, No. 4/5, 533-539 (July/September 2005).

[3] B. Sinharoy, R. N. Kalla, J. M. Tendler, R. J. Eickemeyer, J. B. Joyner, "POWER5 System Microarchitecture," IBM J. Res. & Dev. 49, No. 4/5, 505-521 (July/September 2005).

[4] H. Q. Le, W. J. Starke, J. S. Fields, F. P. O'Connell, D. Q. Nguyen, B. J. Ronchetti, W. M. Sauer, E. M. Schwarz, M. T. Vaden, "POWER6 System Microarchitecture," IBM J. Res. & Dev. 51, No. 6, 639-662 (November 2007).

[5] Oprofile, http://oprofile.sourceforge.net/news/.

[6] STREAM benchmark, http://www.cs.virginia.edu/stream/.

[7] Linpack benchmark, http://www.top500.org/project/linpack.

[8] SPECjbb2005 benchmark, http://www.spec.org/jbb2005/.

Acknowledgment

The author would like to thank Bill Buros and Sonny Rao for their support to this project, Paul MacKerras for sharing his knowledge on the implementation of the 64KB pages on Power systems, and Arthur Ban for sharing his knowledge on the IBM Power 575.

Author

Peter W. Wong is a member of the Linux Performance Team in IBM. He has been a performance analyst for thirteen years in the areas of Java graphics, graphical user interface, data warehousing and high performance computing. He holds a Ph.D. degree in computer science from Ohio State University.