As a reminder, the postings on this Wiki site
solely reflect the personal views of the authors and do not necessarily represent the views, positions, strategies or opinions of IBM or IBM management. Be sure to read the Terms of Use for the wiki. Comments, corrections and clarifications to the contents of this page are welcome on the libhugetlbfs mailing list
or (wburos@us.ibm.com
}.
 | Related reading
For those common questions, see the libhugetlbfs FAQ page.
For a short, sweet, and simple usage guide, try libhuge short and simple .
For usage characteristics, another paper has been made available on IBM's developerWorks web site |
This page introduces a working example of tuning and analyzing a Linux on Power memory intensive workload (stream) with incremental steps for improving the performance and consistency of the runs. This page assumes your Power system is defined with "using all resources" available in the partition. Partitions on a Power system can be easily defined with subsets of the CPU and memory available, but for the sake of performance measurements we use the system as a whole. This approach is recommended for the first time through in "proofing" your runs. In the examples below, we are running on a 4-core system, so we see 8 logical processors when running with SMT on. If your system has more cores, you should leverage them all, and not artificially limit the runs to 8 threads as specified in the examples.
 | Updated for SLES 10, RHEL 5 and libhugetlbfs 1.0.1
libhugetlbfs is available on SourceForge and is supported on SLES 10 and RHEL 5 for several hardware platforms. In this example, we focus on using libhugetlbfs on a Power system, but we've also used libhugetlbfs for performance improvements with Linux on AMD and Intel systems. |
We have found it important to remind users that libhugetlbfs is a niche performance enhancement approach. In some cases, using transparent huge pages can provide up to a 10% performance gain on selected work loads.
Contents
Introduction
Stream is an easy and simple benchmark workload developed by John Mccalpin which measures the sustainable memory bandwidth in high performance computers. The workload is available on the web at the stream web site
. The version downloaded and used was stream.c
version 5.6.
In this easy exercise, several emerging Linux Community features (libhugetlbfs and the 2.6.16 kernel) and existing IBM Compiler products were leveraged for improved performance on the Linux on Power base. Two methods of using the sourceforge project libhugetlbfs (for malloc'ed memory and .bss segments) will show how these can improve performance of a workload like stream.
- For the best performance, and OpenMP support, the IBM XL C/C++ Compilers available for Linux on POWER are used. This compiler provides built-in support for OpenMP directives which allows threads to be automatically created on the system, one for each processor seen by the operating system. Trial versions of the IBM compiler can be downloaded from the IBM web site
.
- libhugetlbfs
is a sourceforge community project emerging for applications to more easily exploit larger page sizes on a system. In this case, two approaches are leveraged on Linux on Power systems. The first approach backs malloc calls with 16MB large pages, and the second approach loads the .bss segment (in this case, the un-initialized arrays) into 16MB large pages.
- The libhugetlbfs project depends on a 2.6.16 or later kernel. In our example, we just use SLES 10. RHEL 5 also will work fine, but the differences between using 16MB huge pages and normal mode will be much smaller on Power 5+ since RHEL 5 was built to use 64KB pages as the default.
- The tests were run on a new IBM p5 55A system (see the IBM RedBook
for details on the system - note the RedBook is for the original p5 550 - a newer and faster system was used in this example). The system was a 4-core 2.10 Power 5+ system with 32GB balanced memory. Tests were run with SLES 10 booted with the defaults (SMT turned on), so 8 logical CPUs were running. The IBM C/C++ Version 8 compilers were downloaded from the web and installed. Values will obviously vary depending on your system configuration.
libhugetlbfs continues to be actively tested.
 | libhugetlbfs is ready for customers to use in production!
libhugetlbfs on sourceforge has been released as Version 1 and can be supported for customers using SLES 10 and RHEL 5. Various improvements have been made to the library and steps have been simplified. At the time of this writing, libhugetlbfs 1.0.1 is the recommended level for both SLES 10 and RHEL 5. Each distribution provides rpm files for loading. Be sure you use the 1.0.1 versions.
Continued testing and feedback on libhugetlfs is definitely welcome, consider joining the mailing list on the sourceforge project and contribute! Continued usability and manageability improvements are being worked on. |
A quick note on terminology.
- The Linux community refers to these 16MB pages as "Huge Pages", while the Power community refers to the 16MB pages as "Large Pages". For consistency in this exercise, the phrase "16MB large pages" will be used.
- Also, NUMA means different things to different people. For this particular Power 5 system, as explained in the Redbook above, there is memory local to each DCM, and there is memory remote to each DCM. There are two DCMs in this system. Remote in this case is not that far away, but the Linux kernel works to recognize local and remote memory and will try to optimize memory placement with respect to the executing threads.
Stream
Let's begin.
Building stream "normally"
Building and using stream is easy. First, go to http://www.cs.virginia.edu/stream/FTP/Code/
and download stream.c
You'll need the IBM compilers installed to take advantage of the OpenMP features. The IBM compilers get installed in /opt/ibmcmp/vac/8.0/bin by default, and is assumed to be in the PATH environment variable. So to compile and run the first test...
The output should be something like the following as an initial "out of the box" first test.
Reading the results
In the above results, the Rate (MB/s) value is used. This is the memory transfer rate as measured by the program for each of the samples. When discussing results, people often want "just one number" to focus on, so the Triad measurement can be used. In this case, the program "out of the box" was able to drive access to memory at a rate of a little over 2GB per second, somewhat less than the 19GB per second expected from this specific hardware system (as explained in the IBM Redbook above). By the end of the exercise, the 19GB/sec range for Triad will be demonstrated.
What does stream do??
Each of the functions in stream are pretty basic. Three arrays are declared...
With 2,000,000 default entries, each a double (8 bytes), the stream program by default will use 45.78 MB of memory. The stream program outputs a message which says how much memory is being used.
With each array being 15.26MB, each array "could" fit into its own 16MB large page, and using the 16MB large pages is the piece to be tested. So next in the exercise, a key step will be making "N" bigger to be sure more 16MB large pages are used.
Each of the arrays is initialized before timing measurements are made, so all of the pages are touched at least once before measuring.
Copy simply copies elements from one array to another array.
Scale uses a "scalar" (a multiplier) to take an element from one array, multiply by the scalar (in this case 3.0), and save the result in the corresponding element in another array.
Add adds corresponding elements from two arrays and stores the result in the third array.
And finally, Triad combines the three operations...
In stream.c, there's timing and checking code, and there's a special "TUNED" mode where you're allowed to do all sorts of tricks to get the best possible timings, but this level of detail is not covered here. The focus is on what users tend to see "out of the box" on Linux on POWER systems.
Make the array size bigger
The array size (declared as "N" in the program) is a value which can be changed to be sure the memory being used is big enough to keep the work out of the processor caches.
Based on experience, this array size is too small (as shown earlier, the default only uses about 45MB of memory), so add an additional zero to the #define N line, making "N" 20,000,000 entries. That'll make the memory usage about 457MB.
That's seven zeros. Rebuild. Rerun.
Notice the Total memory required (in the output) changed to almost half a gig of memory (457.8 MB), and the results were about the same.
Use a better compiler optimization level
The next step is to try different optimization levels to improve the code being generated... Compiling with the -O5 level, the results are improved substantially:
Leverage OpenMP support
Next, by default, only running one thread on one processor is used to process the arrays. The stream code is already instrumented with OMP controls. For example, in stream.c, do a quick search for "pragma"...
Telling the IBM compilers to leverage this, add the -qsmp=omp compiler directive, which gives the warning message that -qthreaded should be used as well.
So adding the -qthreaded option, compile and run. Notice that when run, a "stream program" informational message will be displayed for each thread created, one for each processor seen on the system. In this exercise, running on a 4-core system with SMT turned on, so the 8 logical processors will be used by 8 threads automatically generated.
When run again, fairly different results are seen.
With SMT on, results can vary with this type of workload as the threads are dispatched to the various processors. OpenMP and the IBM compilers provide controls to bind the threads to the processors and improve the repeatability of the results.
Binding threads to processors
To bind the threads to the processors, simply set an environment variable and re-run.
With the downloaded trial compilers, the compiler-runtime controlled binding feature currently does not work when SMT is turned off. When SMT is off, STRIDE must be set to 2 (for processors 0,2,4,6), but the processor numbering convention isn't handled correctly. The runtime controller gets mixed up by processor numbers (ie: 6) being higher than the number of processors available (ie: 4), and stops trying to attach threads to processors. This is already fixed in the supported compilers, but isn't yet available in the trial download package.
Re-running stream with the binding controls set, repeat'able, consistent, and fairly nice results are acheived.
With a 19GB/sec result for Triad, this is the target range hoped for. The exercise will continue on to see if using 16MB large pages can improve this number even more.
So let's look at large page exploitations available with libhugetlbfs from sourceforge.
The next steps will first try modifying the stream program to use malloc since leveraging malloc is pretty simple with libhugetlbfs.
After the malloc test, go back and re-link the existing stream code (which has the statically declared arrays) to have libhugetlbfs put the .bss segment into the 16MB large pages. A .bss segment is essentially the un-initialized static data arrays declared in the program. See this page
for a quick definition from the web.
Modify stream to use malloc
It's simple to modify the stream.c program to take advantage of malloc calls instead of predefining the arrays. Copy stream.c to a new program stream-malloc.c.
Then change
to
Then add the mallocs before the a,b,c arrays are initialized.
Re-compile and run.
What is very interesting is that malloc performance costs about an 8% - 12% penalty from the statically defined arrays. We'll dig into this in a separate Wiki Page in the near future. This is believed to be a known Linux malloc problem (which can be tuned around) for malloc's which are bigger than 128KB each. For example, an [] from last year hints at the problem.
Using libhugetlbfs
To enable and use libhugetlbfs, there are a series of simple steps needed.
- Use SLES 10 or RHEL 5 (or for the more adventerous, download and build a recent mainline kernel...)
- Get and build libhugetlbfs
- Allocate enough large pages
- Try libhugetlbfs with malloc
- Try libhugetlbfs with bss segment
Need 2.6.16 kernel
The SLES 10 and the RHEL 5 GA'ed kernels have the correct support with no additional changes required. We recommend using the latest service packs and updates.
Get and build libhugetlbfs
Get the libhugetlbfs library from sourceforge.net
into a working directory, for example /usr/local/src. There's a HOWTO included in the tar ball, but here's the quick and simple steps to install the Version 1 level of libhugetlbfs.
Allocate some large pages
First, reserve some large pages to be used by malloc. Since the program is using about 457.8 for the arrays, 457.8MB / 16MB is about 28.6. For the sake of easier round numbers, 30 large pages were allocated.
When using libhugetlbfs 1.0, more huge pages than needed were allocated. Be sure to use the latest libhugetlbfs (at least 1.0.1).
 | Clear hugepages before allocating more
Based on experience, we now recommend that you set the number of hugepages to Zero (0) before changing the hugepage count. By freeing all of the hugepages, this allows the newly reserved hugepages to be laid out better for NUMA placement with respect to the program about to be executed. |
Setup and re-run with libhugetlbfs
Then re-run the malloc'ed version of stream. But first mount the virtual hugetlbfs filesystem, and set some environment control variables for libhugetlbfs.
or, you can combine setting the environment variables on the command line....
Note that this improves the previous malloc result about 5%. For Triad, (18547 - 17581) / 17581 => 5.5%
But malloc is still slower than the .bss segment in this experiment. More on this in a later technical paper.
Use stream as-is with libhugetlbfs .bss segment
With the basic stream program, with statically declared arrays, libhugetlbfs can be used to load the .bss segment into 16MB large pages. When using the IBM Compilers, you must re-define which loader to use. It's easy... just add three options
Where -tl (dash tee ell) tells the compiler to use the path specified by the -B option for the ld command, and -Wl, (dash W ell comma) specifies the link options to use.
Then re-build and re-link.... (actually, just a re-link of the object files is needed... the libhugetlbfs library does not require a re-compile).
Then re-run... Note that LD_PRELOAD and HUGETLB_MORECORE environment variables are not needed.
Interesting. Not much improvement from the original best-case stream.
To confirm, execute the original stream which was
not linked for large pages.
The hypothesis would be that large pages may not help with smaller memory sizes. To test, let's see what happens if 10 times the memory is used for the arrays.
Try much larger array size to exploit large pages
Since relatively small array sizes do not seem to benefit much from large pages for this program, change N to be 200,000,000 (eight zeros) in stream.c, rebuild it to create stream and stream-lp, and then re-run them. The memory used now will be 4577MB (4.5 GB)
Note that stream must be compiled in 64-bit mode because of the size of the arrays being used. For the xlc compiler, that's the "-q64" option.
And repeat with the large page enabled executable. Be sure to reserve more large pages in the system. 4577MB / 16MB = 286+ large pages needed. After echo'ing, confirm that the large pages were allocated, Linux gives no indication on the success or failure of setting
nr_hugepages when echo'ing values into the variable.
Using 16 MB large pages in this case demonstrates about a 12.7% "(21123-18747)/18747" performance gain when pushing the .bss segment into 16MB large pages. This 21GB/sec is the best result achieved in this exercise, so we'll wrap up here.
Summary so far
A working example of a workload using 16MB large pages with the executable and libhugetlbfs is demonstrated. It's easy to get quick gains with performance with some simple steps:
Obviously, compile with better optimizations. "xlc -O5" creates nicely tuned executables for ppc64 platforms.
For OpenMP threaded applications, take advantage of the xlc provided "binding capability" where threads are automatically bound to the available processors. This improves both consistency and performance.
For some applications and in some cases, leveraging the 16MB large pages available by loading the .bss segment into them can provide gains. The sourceforge libhugetlbfs project provides an easy way to do this for 2.6.16 kernels and above, and supported on SLES 10 and RHEL 5.
Here's a quick summary of the performance gains and hits that the various options provided:

Challenges still left
In the course of this exercise, several known challenges were highlighted which still need to be addressed...
In this example, using three large malloc's instead of static arrays cost about 8%-12%. This is likely related to a known problem with Linux when using malloc's greater than 128KB. More on this later.
Binding threads to processors doesn't work w/OpenMP and the IBM compiler Trial Version when SMT is off. This is a known and fixed IBM compiler problem. Customers of course can get an updated compiler from IBM when needed. In the exercise shown here, testing was focused with SMT on.
For smaller arrays (450MB) in this program, loading the .bss segment into 16MB large pages is slightly slower than just normal pages. It was demonstrated that larger arrays (across 4.5GB memory) did make a nice positive difference. This serves to highlight that not all workloads and not all cases will benefit from using large 16MB pages with the .bss segment.