IBM®
Skip to main content
    Country/region [select]      Terms of use
 
 
    
     Home      Products      Services & solutions      Support & downloads      My account     
 
developerworks > My developerWorks >  Dashboard > Linux for Power Architecture > ... > Performance Tuning > Tuning stream with libhugetlbfs
developerWorks
Log In   View a printable version of the current page.
Overview Connect Spaces Forums Wikis
Tuning stream with libhugetlbfs
Added by wburos, last edited by wburos on Mar 11, 2008  (view change)
Labels: 
(None)

As a reminder, the postings on this Wiki sitesolely reflect the personal views of the authors and do not necessarily represent the views, positions, strategies or opinions of IBM or IBM management. Be sure to read the Terms of Use for the wiki. Comments, corrections and clarifications to the contents of this page are welcome on the libhugetlbfs mailing list or (wburos@us.ibm.com}.

Related reading

For those common questions, see the libhugetlbfs FAQ page.

For a short, sweet, and simple usage guide, try libhuge short and simple.

For usage characteristics, another paper has been made available on IBM's developerWorks web site

This page introduces a working example of tuning and analyzing a Linux on Power memory intensive workload (stream) with incremental steps for improving the performance and consistency of the runs. This page assumes your Power system is defined with "using all resources" available in the partition. Partitions on a Power system can be easily defined with subsets of the CPU and memory available, but for the sake of performance measurements we use the system as a whole. This approach is recommended for the first time through in "proofing" your runs. In the examples below, we are running on a 4-core system, so we see 8 logical processors when running with SMT on. If your system has more cores, you should leverage them all, and not artificially limit the runs to 8 threads as specified in the examples.

Updated for SLES 10, RHEL 5 and libhugetlbfs 1.0.1

libhugetlbfs is available on SourceForge and is supported on SLES 10 and RHEL 5 for several hardware platforms. In this example, we focus on using libhugetlbfs on a Power system, but we've also used libhugetlbfs for performance improvements with Linux on AMD and Intel systems.

We have found it important to remind users that libhugetlbfs is a niche performance enhancement approach. In some cases, using transparent huge pages can provide up to a 10% performance gain on selected work loads.

Contents



Introduction

Stream is an easy and simple benchmark workload developed by John Mccalpin which measures the sustainable memory bandwidth in high performance computers. The workload is available on the web at the stream web site. The version downloaded and used was stream.c version 5.6.

/*-----------------------------------------------------------------------*/
/* Program: Stream                                                       */
/* Revision: $Id: stream.c,v 5.6 2005/10/04 00:19:59 mccalpin Exp mccalpin $ */
/* Original code developed by John D. McCalpin                           */
/* Programmers: John D. McCalpin                                         */
/*              Joe R. Zagar                                             */
/*                                                                       */
/* This program measures memory transfer rates in MB/s for simple        */
/* computational kernels coded in C.                                     */
/*-----------------------------------------------------------------------*/
/* Copyright 1991-2005: John D. McCalpin                                 */
/*-----------------------------------------------------------------------*/

In this easy exercise, several emerging Linux Community features (libhugetlbfs and the 2.6.16 kernel) and existing IBM Compiler products were leveraged for improved performance on the Linux on Power base. Two methods of using the sourceforge project libhugetlbfs (for malloc'ed memory and .bss segments) will show how these can improve performance of a workload like stream.

  • For the best performance, and OpenMP support, the IBM XL C/C++ Compilers available for Linux on POWER are used. This compiler provides built-in support for OpenMP directives which allows threads to be automatically created on the system, one for each processor seen by the operating system. Trial versions of the IBM compiler can be downloaded from the IBM web site.
  • libhugetlbfs is a sourceforge community project emerging for applications to more easily exploit larger page sizes on a system. In this case, two approaches are leveraged on Linux on Power systems. The first approach backs malloc calls with 16MB large pages, and the second approach loads the .bss segment (in this case, the un-initialized arrays) into 16MB large pages.
  • The libhugetlbfs project depends on a 2.6.16 or later kernel. In our example, we just use SLES 10. RHEL 5 also will work fine, but the differences between using 16MB huge pages and normal mode will be much smaller on Power 5+ since RHEL 5 was built to use 64KB pages as the default.
  • The tests were run on a new IBM p5 55A system (see the IBM RedBook for details on the system - note the RedBook is for the original p5 550 - a newer and faster system was used in this example). The system was a 4-core 2.10 Power 5+ system with 32GB balanced memory. Tests were run with SLES 10 booted with the defaults (SMT turned on), so 8 logical CPUs were running. The IBM C/C++ Version 8 compilers were downloaded from the web and installed. Values will obviously vary depending on your system configuration.

libhugetlbfs continues to be actively tested.

libhugetlbfs is ready for customers to use in production!

libhugetlbfs on sourceforge has been released as Version 1 and can be supported for customers using SLES 10 and RHEL 5. Various improvements have been made to the library and steps have been simplified. At the time of this writing, libhugetlbfs 1.0.1 is the recommended level for both SLES 10 and RHEL 5. Each distribution provides rpm files for loading. Be sure you use the 1.0.1 versions.

Continued testing and feedback on libhugetlfs is definitely welcome, consider joining the mailing list on the sourceforge project and contribute! Continued usability and manageability improvements are being worked on.

A quick note on terminology.

  • The Linux community refers to these 16MB pages as "Huge Pages", while the Power community refers to the 16MB pages as "Large Pages". For consistency in this exercise, the phrase "16MB large pages" will be used.
  • Also, NUMA means different things to different people. For this particular Power 5 system, as explained in the Redbook above, there is memory local to each DCM, and there is memory remote to each DCM. There are two DCMs in this system. Remote in this case is not that far away, but the Linux kernel works to recognize local and remote memory and will try to optimize memory placement with respect to the executing threads.

Stream

Let's begin.

Building stream "normally"

Building and using stream is easy. First, go to http://www.cs.virginia.edu/stream/FTP/Code/ and download stream.c

You'll need the IBM compilers installed to take advantage of the OpenMP features. The IBM compilers get installed in /opt/ibmcmp/vac/8.0/bin by default, and is assumed to be in the PATH environment variable. So to compile and run the first test...

# xlc stream.c -o stream
# ./stream

The output should be something like the following as an initial "out of the box" first test.

-------------------------------------------------------------
STREAM version $Revision: 5.6 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 2000000, Offset = 0
Total memory required = 45.8 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity appears to be less than one microsecond.
Each test below will take on the order of 16435 microseconds.
   (= 2147483647 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        2213.9372       0.0145       0.0145       0.0145
Scale:       2093.1619       0.0153       0.0153       0.0154
Add:         2719.7108       0.0177       0.0176       0.0177
Triad:       2540.9116       0.0189       0.0189       0.0189
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------

Reading the results
In the above results, the Rate (MB/s) value is used. This is the memory transfer rate as measured by the program for each of the samples. When discussing results, people often want "just one number" to focus on, so the Triad measurement can be used. In this case, the program "out of the box" was able to drive access to memory at a rate of a little over 2GB per second, somewhat less than the 19GB per second expected from this specific hardware system (as explained in the IBM Redbook above). By the end of the exercise, the 19GB/sec range for Triad will be demonstrated.

What does stream do??

Each of the functions in stream are pretty basic. Three arrays are declared...

# define N      2000000
# define OFFSET 0
static double   a[N+OFFSET],
                b[N+OFFSET],
                c[N+OFFSET];

With 2,000,000 default entries, each a double (8 bytes), the stream program by default will use 45.78 MB of memory. The stream program outputs a message which says how much memory is being used.

 (2,000,000 * 8 ) / (1024 * 1024) = 15.26MB for each array

 15.26 MB * 3 = 45.78 MB for the three arrays

With each array being 15.26MB, each array "could" fit into its own 16MB large page, and using the 16MB large pages is the piece to be tested. So next in the exercise, a key step will be making "N" bigger to be sure more 16MB large pages are used.
Each of the arrays is initialized before timing measurements are made, so all of the pages are touched at least once before measuring.

   for (j=0; j<N; j++) {
      a[j] = 1.0;
      b[j] = 2.0;
      c[j] = 0.0;
      }

Copy simply copies elements from one array to another array.

   for (j=0; j<N; j++)
      c[j] = a[j];

Scale uses a "scalar" (a multiplier) to take an element from one array, multiply by the scalar (in this case 3.0), and save the result in the corresponding element in another array.

   double scalar;
   scalar = 3.0;
   ...

   for (j=0; j<N; j++)
      b[j] = scalar*c[j];

Add adds corresponding elements from two arrays and stores the result in the third array.

   for (j=0; j<N; j++)
      c[j] = a[j]+b[j];

And finally, Triad combines the three operations...

   for (j=0; j<N; j++)
      a[j] = b[j]+scalar*c[j];

In stream.c, there's timing and checking code, and there's a special "TUNED" mode where you're allowed to do all sorts of tricks to get the best possible timings, but this level of detail is not covered here. The focus is on what users tend to see "out of the box" on Linux on POWER systems.

Make the array size bigger

The array size (declared as "N" in the program) is a value which can be changed to be sure the memory being used is big enough to keep the work out of the processor caches.

Based on experience, this array size is too small (as shown earlier, the default only uses about 45MB of memory), so add an additional zero to the #define N line, making "N" 20,000,000 entries. That'll make the memory usage about 457MB.

# define N      20000000

That's seven zeros. Rebuild. Rerun.

# xlc stream.c -o stream
# ./stream
-------------------------------------------------------------
STREAM version $Revision: 5.6 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 20000000, Offset = 0
Total memory required = 457.8 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity appears to be less than one microsecond.
Each test below will take on the order of 171136 microseconds.
   (= 2147483647 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        2161.2569       0.1493       0.1481       0.1497
Scale:       1981.0208       0.1616       0.1615       0.1617
Add:         2581.1598       0.1860       0.1860       0.1861
Triad:       2465.2829       0.1948       0.1947       0.1950
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------

Notice the Total memory required (in the output) changed to almost half a gig of memory (457.8 MB), and the results were about the same.

Use a better compiler optimization level

The next step is to try different optimization levels to improve the code being generated... Compiling with the -O5 level, the results are improved substantially:

# xlc -O5 stream.c -o stream
# ./stream
... <extra output text deleted from now on> ...
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        5748.9689       0.0557       0.0557       0.0557
Scale:       5531.2390       0.0579       0.0579       0.0580
Add:         6435.2640       0.0747       0.0746       0.0748
Triad:       5851.3095       0.0821       0.0820       0.0821
-------------------------------------------------------------

Leverage OpenMP support

Next, by default, only running one thread on one processor is used to process the arrays. The stream code is already instrumented with OMP controls. For example, in stream.c, do a quick search for "pragma"...

#pragma omp parallel for
    for (j=0; j<N; j++) {
        a[j] = 1.0;
        b[j] = 2.0;
        c[j] = 0.0;
        }

Telling the IBM compilers to leverage this, add the -qsmp=omp compiler directive, which gives the warning message that -qthreaded should be used as well.

# xlc -O5 -qsmp=omp  stream.c -o stream
1506-1354 (W) Option -qsmp should be used with option -qthreaded.

So adding the -qthreaded option, compile and run. Notice that when run, a "stream program" informational message will be displayed for each thread created, one for each processor seen on the system. In this exercise, running on a 4-core system with SMT turned on, so the 8 logical processors will be used by 8 threads automatically generated.

# xlc -O5 -qsmp=omp -qthreaded stream.c -o stream
# ./stream
-------------------------------------------------------------
STREAM version $Revision: 5.6 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 20000000, Offset = 0
Total memory required = 457.8 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Number of Threads requested = 8
-------------------------------------------------------------
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity appears to be less than one microsecond.
Each test below will take on the order of 19936 microseconds.
   (= 2147483647 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       14406.6086       0.0223       0.0222       0.0225
Scale:      13649.5844       0.0236       0.0234       0.0237
Add:        15961.7058       0.0301       0.0301       0.0302
Triad:      16836.1425       0.0286       0.0285       0.0287
-------------------------------------------------------------
Solution Validates
------------------------------------------------------------

When run again, fairly different results are seen.

# ./stream
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       12945.5076       0.0249       0.0247       0.0251
Scale:      12207.7155       0.0263       0.0262       0.0264
Add:        14397.1304       0.0335       0.0333       0.0337
Triad:      15229.8620       0.0316       0.0315       0.0318
-------------------------------------------------------------

With SMT on, results can vary with this type of workload as the threads are dispatched to the various processors. OpenMP and the IBM compilers provide controls to bind the threads to the processors and improve the repeatability of the results.

Binding threads to processors

To bind the threads to the processors, simply set an environment variable and re-run.

# export XLSMPOPTS="STARTPROC=0:STRIDE=1"

With the downloaded trial compilers, the compiler-runtime controlled binding feature currently does not work when SMT is turned off. When SMT is off, STRIDE must be set to 2 (for processors 0,2,4,6), but the processor numbering convention isn't handled correctly. The runtime controller gets mixed up by processor numbers (ie: 6) being higher than the number of processors available (ie: 4), and stops trying to attach threads to processors. This is already fixed in the supported compilers, but isn't yet available in the trial download package.
Re-running stream with the binding controls set, repeat'able, consistent, and fairly nice results are acheived.

# export XLSMPOPTS="STARTPROC=0:STRIDE=1"
# ./stream
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       17111.3144       0.0187       0.0187       0.0188
Scale:      16069.9379       0.0200       0.0199       0.0200
Add:        18550.6590       0.0259       0.0259       0.0260
Triad:      19761.1496       0.0244       0.0243       0.0247
-------------------------------------------------------------

With a 19GB/sec result for Triad, this is the target range hoped for. The exercise will continue on to see if using 16MB large pages can improve this number even more.


So let's look at large page exploitations available with libhugetlbfs from sourceforge.

The next steps will first try modifying the stream program to use malloc since leveraging malloc is pretty simple with libhugetlbfs.

After the malloc test, go back and re-link the existing stream code (which has the statically declared arrays) to have libhugetlbfs put the .bss segment into the 16MB large pages. A .bss segment is essentially the un-initialized static data arrays declared in the program. See this page for a quick definition from the web.


Modify stream to use malloc

It's simple to modify the stream.c program to take advantage of malloc calls instead of predefining the arrays. Copy stream.c to a new program stream-malloc.c.

Then change

static double   a[N+OFFSET],
                b[N+OFFSET],
                c[N+OFFSET];

to

static double *a, *b, *c;

Then add the mallocs before the a,b,c arrays are initialized.

    a = (double *)malloc(sizeof(double) * (N + OFFSET));
    b = (double *)malloc(sizeof(double) * (N + OFFSET));
    c = (double *)malloc(sizeof(double) * (N + OFFSET));

Re-compile and run.

# xlc -O5 -qsmp=omp -qthreaded stream-malloc.c -o stream-malloc

# export XLSMPOPTS="STARTPROC=0:STRIDE=1"
# ./stream-malloc
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       15768.9864       0.0203       0.0203       0.0203
Scale:      15093.5325       0.0212       0.0212       0.0212
Add:        17051.6047       0.0282       0.0281       0.0282
Triad:      17581.1124       0.0274       0.0273       0.0274
-------------------------------------------------------------

What is very interesting is that malloc performance costs about an 8% - 12% penalty from the statically defined arrays. We'll dig into this in a separate Wiki Page in the near future. This is believed to be a known Linux malloc problem (which can be tuned around) for malloc's which are bigger than 128KB each. For example, an [] from last year hints at the problem.

Using libhugetlbfs

To enable and use libhugetlbfs, there are a series of simple steps needed.

  1. Use SLES 10 or RHEL 5 (or for the more adventerous, download and build a recent mainline kernel...)
  2. Get and build libhugetlbfs
  3. Allocate enough large pages
  4. Try libhugetlbfs with malloc
  5. Try libhugetlbfs with bss segment

Need 2.6.16 kernel

The SLES 10 and the RHEL 5 GA'ed kernels have the correct support with no additional changes required. We recommend using the latest service packs and updates.

Get and build libhugetlbfs

Get the libhugetlbfs library from sourceforge.net into a working directory, for example /usr/local/src. There's a HOWTO included in the tar ball, but here's the quick and simple steps to install the Version 1 level of libhugetlbfs.

# tar -zxf libhugetlbfs-1.0.1.tar.gz
# ls
.  ..  libhugetlbfs-1.0.1  libhugetlbfs-1.0..1tar.gz
# cd libhugetlbfs-1.0.1
# make
# make install PREFIX=/usr

Allocate some large pages

First, reserve some large pages to be used by malloc. Since the program is using about 457.8 for the arrays, 457.8MB / 16MB is about 28.6. For the sake of easier round numbers, 30 large pages were allocated.

When using libhugetlbfs 1.0, more huge pages than needed were allocated. Be sure to use the latest libhugetlbfs (at least 1.0.1).

# cat /proc/meminfo | grep HugePage
HugePages_Total:   0
HugePages_Free:    0


# echo  0 > /proc/sys/vm/nr_hugepages
# echo 30 > /proc/sys/vm/nr_hugepages

# cat /proc/meminfo | grep HugePage
HugePages_Total:   30
HugePages_Free:    30
Clear hugepages before allocating more

Based on experience, we now recommend that you set the number of hugepages to Zero (0) before changing the hugepage count. By freeing all of the hugepages, this allows the newly reserved hugepages to be laid out better for NUMA placement with respect to the program about to be executed.

Setup and re-run with libhugetlbfs

Then re-run the malloc'ed version of stream. But first mount the virtual hugetlbfs filesystem, and set some environment control variables for libhugetlbfs.

# mkdir /libhugetlbfs
# mount -t hugetlbfs hugetlbfs /libhugetlbfs

# export LD_PRELOAD=libhugetlbfs.so
# export HUGETLB_MORECORE=yes
# ./stream_malloc

or, you can combine setting the environment variables on the command line....
# xlc -O5 -qsmp=omp -qthreaded stream-malloc.c -o stream-malloc
# export XLSMPOPTS="STARTPROC=0:STRIDE=1
# LD_PRELOAD=libhugetlbfs.so HUGETLB_MORECORE=yes ./stream-malloc
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       15892.7826       0.0203       0.0201       0.0208
Scale:      15697.7963       0.0206       0.0204       0.0209
Add:        17721.3193       0.0272       0.0271       0.0275
Triad:      18547.0702       0.0260       0.0259       0.0265
-------------------------------------------------------------

Note that this improves the previous malloc result about 5%. For Triad, (18547 - 17581) / 17581 => 5.5%
But malloc is still slower than the .bss segment in this experiment. More on this in a later technical paper.

Use stream as-is with libhugetlbfs .bss segment

With the basic stream program, with statically declared arrays, libhugetlbfs can be used to load the .bss segment into 16MB large pages. When using the IBM Compilers, you must re-define which loader to use. It's easy... just add three options

# xlc -O5 -qsmp=omp -qthreaded -B/usr/share/libhugetlbfs/ -tl -Wl,--hugetlbfs-link=B stream.c -o stream-lp

Where -tl (dash tee ell) tells the compiler to use the path specified by the -B option for the ld command, and -Wl, (dash W ell comma) specifies the link options to use.
Then re-build and re-link.... (actually, just a re-link of the object files is needed... the libhugetlbfs library does not require a re-compile).

Then re-run... Note that LD_PRELOAD and HUGETLB_MORECORE environment variables are not needed.

# export XLSMPOPTS="STARTPROC=0:STRIDE=1"
# ./stream-lp
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       17150.8910       0.0188       0.0187       0.0190
Scale:      14457.3530       0.0226       0.0221       0.0230
Add:        18395.7340       0.0263       0.0261       0.0267
Triad:      19522.5786       0.0248       0.0246       0.0250
-------------------------------------------------------------

Interesting. Not much improvement from the original best-case stream.
To confirm, execute the original stream which was not linked for large pages.
# export XLSMPOPTS="STARTPROC=0:STRIDE=1"
./stream
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       17126.1615       0.0187       0.0187       0.0188
Scale:      16045.7312       0.0201       0.0199       0.0206
Add:        18577.8767       0.0260       0.0258       0.0260
Triad:      19762.8954       0.0244       0.0243       0.0245
-------------------------------------------------------------

The hypothesis would be that large pages may not help with smaller memory sizes. To test, let's see what happens if 10 times the memory is used for the arrays.

Try much larger array size to exploit large pages

Since relatively small array sizes do not seem to benefit much from large pages for this program, change N to be 200,000,000 (eight zeros) in stream.c, rebuild it to create stream and stream-lp, and then re-run them. The memory used now will be 4577MB (4.5 GB)

Note that stream must be compiled in 64-bit mode because of the size of the arrays being used. For the xlc compiler, that's the "-q64" option.

first, change N to 200000000 in stream.c
(that's 8 zeros)

# xlc -q64 -O5 -qsmp=omp -qthreaded stream.c -o stream
# export XLSMPOPTS="STARTPROC=0:STRIDE=1"
# ./stream
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       16104.7617       0.1989       0.1987       0.1992
Scale:      15943.3205       0.2010       0.2007       0.2013
Add:        18817.8556       0.2559       0.2551       0.2570
Triad:      18747.4303       0.2567       0.2560       0.2588
-------------------------------------------------------------

And repeat with the large page enabled executable. Be sure to reserve more large pages in the system. 4577MB / 16MB = 286+ large pages needed. After echo'ing, confirm that the large pages were allocated, Linux gives no indication on the success or failure of setting nr_hugepages when echo'ing values into the variable.
# echo   0 > /proc/sys/vm/nr_hugepages
# echo 300 > /proc/sys/vm/nr_hugepages

# cat /proc/meminfo | grep HugePage
HugePages_Total:   300
HugePages_Free:    300

# xlc -q64 -O5 -qsmp=omp -qthreaded -B/usr/share/libhugetlbfs/ -tl -Wl,--hugetlbfs-link=B stream.c -o stream-lp
# export XLSMPOPTS="STARTPROC=0:STRIDE=1"
# ./stream-lp
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       16629.6899       0.1936       0.1924       0.1944
Scale:      16577.0481       0.1938       0.1930       0.1949
Add:        20940.0445       0.2297       0.2292       0.2301
Triad:      21123.6765       0.2276       0.2272       0.2285
-------------------------------------------------------------

Using 16 MB large pages in this case demonstrates about a 12.7% "(21123-18747)/18747" performance gain when pushing the .bss segment into 16MB large pages. This 21GB/sec is the best result achieved in this exercise, so we'll wrap up here.

Summary so far

A working example of a workload using 16MB large pages with the executable and libhugetlbfs is demonstrated. It's easy to get quick gains with performance with some simple steps:

  1. Obviously, compile with better optimizations. "xlc -O5" creates nicely tuned executables for ppc64 platforms.
  2. For OpenMP threaded applications, take advantage of the xlc provided "binding capability" where threads are automatically bound to the available processors. This improves both consistency and performance.
  3. For some applications and in some cases, leveraging the 16MB large pages available by loading the .bss segment into them can provide gains. The sourceforge libhugetlbfs project provides an easy way to do this for 2.6.16 kernels and above, and supported on SLES 10 and RHEL 5.

Here's a quick summary of the performance gains and hits that the various options provided:

Challenges still left

In the course of this exercise, several known challenges were highlighted which still need to be addressed...

  1. In this example, using three large malloc's instead of static arrays cost about 8%-12%. This is likely related to a known problem with Linux when using malloc's greater than 128KB. More on this later.
  2. Binding threads to processors doesn't work w/OpenMP and the IBM compiler Trial Version when SMT is off. This is a known and fixed IBM compiler problem. Customers of course can get an updated compiler from IBM when needed. In the exercise shown here, testing was focused with SMT on.
  3. For smaller arrays (450MB) in this program, loading the .bss segment into 16MB large pages is slightly slower than just normal pages. It was demonstrated that larger arrays (across 4.5GB memory) did make a nice positive difference. This serves to highlight that not all workloads and not all cases will benefit from using large 16MB pages with the .bss segment.


 
    About IBM Privacy Contact