IBM®
Skip to main content
    Country/region [select]      Terms of use
 
 
    
     Home      Products      Services & solutions      Support & downloads      My account     
 
developerworks > My developerWorks >  Dashboard > Linux for Power Architecture > ... > Performance Tuning > Tuning stream with libhugetlbfs > Information > Page Comparison
developerWorks
Log In   View a printable version of the current page.
Overview Connect Spaces Forums Wikis
Tuning stream with libhugetlbfs
Version 23 by wburos
on Mar 11, 2008 14:23.


compared with
Current by wburos
on Mar 11, 2008 14:35.

(show comment)
 
Key
These lines were removed. This word was removed.
These lines were added. This word was added.

View page history


There are 70 changes. View first change.

 As a reminder, the postings on this [Wiki site|http://www-941.ibm.com/collaboration/wiki/display/LinuxP/Home]solely reflect the personal views of the authors and do not necessarily represent the views, positions, strategies or opinions of IBM or IBM management. Be sure to read the Terms of Use for the wiki. Comments, corrections and clarifications to the contents of this page are welcome on the [libhugetlbfs mailing list|http://sourceforge.net/mail/?group_id=156936] or ([mailto:wburos@us.ibm.com]}.
 {info:title=Related reading}
 For those common questions, see the [libhugetlbfs FAQ|http://www-941.ibm.com/collaboration/wiki/display/LinuxP/libhugetlbfs+FAQs] page.
  
 For a short, sweet, and simple usage guide, try [libhuge short and simple|http://www-941.ibm.com/collaboration/wiki/display/LinuxP/libhuge+short+and+simple].
  
 For usage characteristics, another paper has been made available on IBM's developerWorks [web site|http://www-128.ibm.com/developerworks/systems/library/es-lop-leveragepages/]
 {info}
 This page introduces a working example of tuning and analyzing a Linux on Power memory intensive workload (stream) with incremental steps for improving the performance and consistency of the runs. This page assumes your Power system is defined with "using all resources" available in the partition. Partitions on a Power system can be easily defined with subsets of the CPU and memory available, but for the sake of performance measurements we use the system as a whole. This approach is recommended for the first time through in "proofing" your runs. In the examples below, we are running on a 4-core system, so we see 8 logical processors when running with SMT on. If your system has more cores, you should leverage them all, and not artificially limit the runs to 8 threads as specified in the examples.
  
  
 {tip:title=Updated for SLES 10, RHEL 5 and libhugetlbfs 1.0.1}
 libhugetlbfs is available on SourceForge and is supported on SLES 10 and RHEL 5 for several hardware platforms. In this example, we focus on using libhugetlbfs on a Power system, but we've also used libhugetlbfs for performance improvements with Linux on AMD and Intel systems.
 {tip}
 We have found it important to remind users that libhugetlbfs is a niche performance enhancement approach. In some cases, using transparent huge pages can provide up to a 10% performance gain on selected work loads.
  
 *Contents*
 
 {toc}\\
  
 ----
 h1. Introduction
  
 Stream is an easy and simple benchmark workload developed by John Mccalpin which measures the sustainable memory bandwidth in high performance computers. The workload is available on the web at the [stream web site|http://www.cs.virginia.edu/stream]. The version downloaded and used was [stream.c|http://www.cs.virginia.edu/stream/FTP/Code/stream.c] version 5.6.
 
 {noformat}
 /*-----------------------------------------------------------------------*/
 /* Program: Stream */
 /* Revision: $Id: stream.c,v 5.6 2005/10/04 00:19:59 mccalpin Exp mccalpin $ */
 /* Original code developed by John D. McCalpin */
 /* Programmers: John D. McCalpin */
 /* Joe R. Zagar */
 /* */
 /* This program measures memory transfer rates in MB/s for simple */
 /* computational kernels coded in C. */
 /*-----------------------------------------------------------------------*/
 /* Copyright 1991-2005: John D. McCalpin */
 /*-----------------------------------------------------------------------*/
{noformat}\\ In this easy exercise, several emerging Linux Community features (libhugetlbfs and the 2.6.16 kernel) and existing IBM Compiler products were leveraged for improved performance on the Linux on Power base. Two methods of using the sourceforge project libhugetlbfs (for malloc'ed memory and .bss segments) will show how these can improve performance of a workload like stream.
  {noformat}\\
 In this easy exercise, several emerging Linux Community features (libhugetlbfs and the 2.6.16 kernel) and existing IBM Compiler products were leveraged for improved performance on the Linux on Power base. Two methods of using the sourceforge project libhugetlbfs (for malloc'ed memory and .bss segments) will show how these can improve performance of a workload like stream.
 * For the best performance, and OpenMP support, the IBM XL C/C+\+ Compilers available for Linux on POWER are used. This compiler provides built-in support for OpenMP directives which allows threads to be automatically created on the system, one for each processor seen by the operating system. Trial versions of the IBM compiler can be downloaded from the [IBM web site|http://www-306.ibm.com/software/awdtools/xlcpp/features/linux/xlcpp-linux.html].
  
 * [libhugetlbfs|http://www.sourceforge.net/projects/libhugetlbfs] is a sourceforge community project emerging for applications to more easily exploit larger page sizes on a system. In this case, two approaches are leveraged on Linux on Power systems. The first approach backs malloc calls with 16MB large pages, and the second approach loads the .bss segment (in this case, the un-initialized arrays) into 16MB large pages.
  
 * The libhugetlbfs project depends on a 2.6.16 or later kernel. In our example, we just use SLES 10. RHEL 5 also will work fine, but the differences between using 16MB huge pages and normal mode will be much smaller on Power 5\+ since RHEL 5 was built to use 64KB pages as the default.
  
 * The tests were run on a new IBM p5 55A system (see the IBM [RedBook|http://www.redbooks.ibm.com/redpapers/abstracts/redp9113.html] for details on the system - note the RedBook is for the original p5 550 - a newer and faster system was used in this example). The system was a 4-core 2.10 Power 5\+ system with 32GB balanced memory. Tests were run with SLES 10 booted with the defaults (SMT turned on), so 8 logical CPUs were running. The IBM C/C+\+ Version 8 compilers were downloaded from the web and installed. Values will obviously vary depending on your system configuration.
  
 libhugetlbfs continues to be actively tested.
 {tip:title=libhugetlbfs is ready for customers to use in production!}
 libhugetlbfs on sourceforge has been released as Version 1 and can be supported for customers using SLES 10 and RHEL 5. Various improvements have been made to the library and steps have been simplified. At the time of this writing, libhugetlbfs 1.0.1 is the recommended level for both SLES 10 and RHEL 5. Each distribution provides rpm files for loading. Be sure you use the 1.0.1 versions.
  
 Continued testing and feedback on libhugetlfs is definitely welcome, consider joining the [mailing list|http://sourceforge.net/mail/?group_id=156936] on the sourceforge project and contribute\! Continued usability and manageability improvements are being worked on.
 {tip}
\\
  
 A quick note on terminology.
 * The Linux community refers to these 16MB pages as "Huge Pages", while the Power community refers to the 16MB pages as "Large Pages". For consistency in this exercise, the phrase "16MB large pages" will be used.
  
 * Also, NUMA means different things to different people. For this particular Power 5 system, as explained in the Redbook above, there is memory local to each DCM, and there is memory remote to each DCM. There are two DCMs in this system. Remote in this case is not that far away, but the Linux kernel works to recognize local and remote memory and will try to optimize memory placement with respect to the executing threads.
  
 h1. Stream
  
 Let's begin.
  
 h2. Building stream "normally"
  
 Building and using stream is easy. First, go to [http://www.cs.virginia.edu/stream/FTP/Code/] and download stream.c
  
 You'll need the IBM compilers installed to take advantage of the OpenMP features. The IBM compilers get installed in {{/opt/ibmcmp/vac/8.0/bin}} by default, and is assumed to be in the PATH environment variable. So to compile and run the first test...
 
 {noformat}
 # xlc stream.c -o stream
 # ./stream
{noformat}\\ The output should be something like the following as an initial "out of the box" first test.
  
 {noformat}
The output should be something like the following as an initial "out of the box" first test.
 {noformat}
 -------------------------------------------------------------
 STREAM version $Revision: 5.6 $
 -------------------------------------------------------------
 This system uses 8 bytes per DOUBLE PRECISION word.
 -------------------------------------------------------------
 Array size = 2000000, Offset = 0
 Total memory required = 45.8 MB.
 Each test is run 10 times, but only
 the *best* time for each is used.
 -------------------------------------------------------------
 Printing one line per active thread....
 -------------------------------------------------------------
 Your clock granularity appears to be less than one microsecond.
 Each test below will take on the order of 16435 microseconds.
  (= 2147483647 clock ticks)
 Increase the size of the arrays if this shows that
 you are not getting at least 20 clock ticks per test.
 -------------------------------------------------------------
 WARNING -- The above is only a rough guideline.
 For best results, please be sure you know the
 precision of your system timer.
 -------------------------------------------------------------
 Function Rate (MB/s) Avg time Min time Max time
 Copy: 2213.9372 0.0145 0.0145 0.0145
 Scale: 2093.1619 0.0153 0.0153 0.0154
 Add: 2719.7108 0.0177 0.0176 0.0177
 Triad: 2540.9116 0.0189 0.0189 0.0189
 -------------------------------------------------------------
 Solution Validates
 -------------------------------------------------------------
{noformat}\\ *Reading the results*
  {noformat}
  
*Reading the results*
 In the above results, the *Rate (MB/s)* value is used. This is the memory transfer rate as measured by the program for each of the samples. When discussing results, people often want "just one number" to focus on, so the Triad measurement can be used. In this case, the program "out of the box" was able to drive access to memory at a rate of a little over 2GB per second, somewhat less than the 19GB per second expected from this specific hardware system (as explained in the IBM Redbook above). By the end of the exercise, the 19GB/sec range for Triad will be demonstrated.
  
 h2. What does stream do??
  
 Each of the functions in stream are pretty basic. Three arrays are declared...
 
 {noformat}
 # define N 2000000
 # define OFFSET 0
 static double a[N+OFFSET],
  b[N+OFFSET],
  c[N+OFFSET];
{noformat}\\ With 2,000,000 default entries, each a double (8 bytes), the stream program by default will use 45.78 MB of memory. The stream program outputs a message which says how much memory is being used.
  
 {noformat}
With 2,000,000 default entries, each a double (8 bytes), the stream program by default will use 45.78 MB of memory. The stream program outputs a message which says how much memory is being used.
 {noformat}
  (2,000,000 * 8 ) / (1024 * 1024) = 15.26MB for each array
  
  15.26 MB * 3 = 45.78 MB for the three arrays
{noformat}\\ With each array being 15.26MB, each array "could" fit into its own 16MB large page, and using the 16MB large pages is the piece to be tested. So next in the exercise, a key step will be making "N" bigger to be sure more 16MB large pages are used.
  
  {noformat}
 With each array being 15.26MB, each array "could" fit into its own 16MB large page, and using the 16MB large pages is the piece to be tested. So next in the exercise, a key step will be making "N" bigger to be sure more 16MB large pages are used.
 Each of the arrays is initialized before timing measurements are made, so all of the pages are touched at least once before measuring.
 
 {noformat}
  for (j=0; j<N; j++) {
  a[j] = 1.0;
  b[j] = 2.0;
  c[j] = 0.0;
  }
{noformat}\\ *Copy* simply copies elements from one array to another array.
  {noformat}
  
*Copy* simply copies elements from one array to another array.
 {noformat}
  for (j=0; j<N; j++)
  c[j] = a[j];
{noformat}\\ *Scale* uses a "scalar" (a multiplier) to take an element from one array, multiply by the scalar (in this case 3.0), and save the result in the corresponding element in another array.
  {noformat}
  
*Scale* uses a "scalar" (a multiplier) to take an element from one array, multiply by the scalar (in this case 3.0), and save the result in the corresponding element in another array.
 {noformat}
  double scalar;
  scalar = 3.0;
  ...
  
  for (j=0; j<N; j++)
  b[j] = scalar*c[j];
{noformat}\\ *Add* adds corresponding elements from two arrays and stores the result in the third array.
  {noformat}
  
*Add* adds corresponding elements from two arrays and stores the result in the third array.
 {noformat}
  for (j=0; j<N; j++)
  c[j] = a[j]+b[j];
{noformat}\\ And finally, *Triad* combines the three operations...
  
 {noformat}
And finally, *Triad* combines the three operations...
 {noformat}
  for (j=0; j<N; j++)
  a[j] = b[j]+scalar*c[j];
{noformat}\\ In stream.c, there's timing and checking code, and there's a special "TUNED" mode where you're allowed to do all sorts of tricks to get the best possible timings, but this level of detail is not covered here. The focus is on what users tend to see "out of the box" on Linux on POWER systems.
  {noformat}
 In stream.c, there's timing and checking code, and there's a special "TUNED" mode where you're allowed to do all sorts of tricks to get the best possible timings, but this level of detail is not covered here. The focus is on what users tend to see "out of the box" on Linux on POWER systems.
  
 h2. Make the array size bigger
  
 The array size (declared as "N" in the program) is a value which can be changed to be sure the memory being used is big enough to keep the work out of the processor caches.
  
 Based on experience, this array size is too small (as shown earlier, the default only uses about 45MB of memory), so add an additional zero to the #define N line, making "N" 20,000,000 entries. That'll make the memory usage about 457MB.
 
 {noformat}
 # define N 20000000
{noformat}\\ That's seven zeros. Rebuild. Rerun.
  
 {noformat}
That's seven zeros. Rebuild. Rerun.
 {noformat}
 # xlc stream.c -o stream
 # ./stream
 -------------------------------------------------------------
 STREAM version $Revision: 5.6 $
 -------------------------------------------------------------
 This system uses 8 bytes per DOUBLE PRECISION word.
 -------------------------------------------------------------
 Array size = 20000000, Offset = 0
 Total memory required = 457.8 MB.
 Each test is run 10 times, but only
 the *best* time for each is used.
 -------------------------------------------------------------
 Printing one line per active thread....
 -------------------------------------------------------------
 Your clock granularity appears to be less than one microsecond.
 Each test below will take on the order of 171136 microseconds.
  (= 2147483647 clock ticks)
 Increase the size of the arrays if this shows that
 you are not getting at least 20 clock ticks per test.
 -------------------------------------------------------------
 WARNING -- The above is only a rough guideline.
 For best results, please be sure you know the
 precision of your system timer.
 -------------------------------------------------------------
 Function Rate (MB/s) Avg time Min time Max time
 Copy: 2161.2569 0.1493 0.1481 0.1497
 Scale: 1981.0208 0.1616 0.1615 0.1617
 Add: 2581.1598 0.1860 0.1860 0.1861
 Triad: 2465.2829 0.1948 0.1947 0.1950
 -------------------------------------------------------------
 Solution Validates
 -------------------------------------------------------------
{noformat}\\ Notice the Total memory required (in the output) changed to almost half a gig of memory (457.8 MB), and the results were about the same.
  {noformat}
 Notice the Total memory required (in the output) changed to almost half a gig of memory (457.8 MB), and the results were about the same.
  
 h2. Use a better compiler optimization level
  
 The next step is to try different optimization levels to improve the code being generated... Compiling with the \-O5 level, the results are improved substantially:
 
 {noformat}
 # xlc -O5 stream.c -o stream
 # ./stream
 ... <extra output text deleted from now on> ...
 -------------------------------------------------------------
 Function Rate (MB/s) Avg time Min time Max time
 Copy: 5748.9689 0.0557 0.0557 0.0557
 Scale: 5531.2390 0.0579 0.0579 0.0580
 Add: 6435.2640 0.0747 0.0746 0.0748
 Triad: 5851.3095 0.0821 0.0820 0.0821
 -------------------------------------------------------------
{noformat}\\
  {noformat}
 h2. Leverage OpenMP support
  
 Next, by default, only running one thread on one processor is used to process the arrays. The stream code is already instrumented with OMP controls. For example, in stream.c, do a quick search for "pragma"...
 
 {noformat}
 #pragma omp parallel for
  for (j=0; j<N; j++) {
  a[j] = 1.0;
  b[j] = 2.0;
  c[j] = 0.0;
  }
{noformat}\\ Telling the IBM compilers to leverage this, add the \-qsmp=omp compiler directive, which gives the warning message that \-qthreaded should be used as well.
  
 {noformat}
Telling the IBM compilers to leverage this, add the \-qsmp=omp compiler directive, which gives the warning message that \-qthreaded should be used as well.
 {noformat}
 # xlc -O5 -qsmp=omp stream.c -o stream
 1506-1354 (W) Option -qsmp should be used with option -qthreaded.
{noformat}\\ So adding the \-qthreaded option, compile and run. Notice that when run, a "stream program" informational message will be displayed for each thread created, one for each processor seen on the system. In this exercise, running on a 4-core system with SMT turned on, so the 8 logical processors will be used by 8 threads automatically generated.
  
 {noformat}
So adding the \-qthreaded option, compile and run. Notice that when run, a "stream program" informational message will be displayed for each thread created, one for each processor seen on the system. In this exercise, running on a 4-core system with SMT turned on, so the 8 logical processors will be used by 8 threads automatically generated.
 {noformat}
 # xlc -O5 -qsmp=omp -qthreaded stream.c -o stream
 # ./stream
 -------------------------------------------------------------
 STREAM version $Revision: 5.6 $
 -------------------------------------------------------------
 This system uses 8 bytes per DOUBLE PRECISION word.
 -------------------------------------------------------------
 Array size = 20000000, Offset = 0
 Total memory required = 457.8 MB.
 Each test is run 10 times, but only
 the *best* time for each is used.
 -------------------------------------------------------------
 Number of Threads requested = 8
 -------------------------------------------------------------
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 -------------------------------------------------------------
 Your clock granularity appears to be less than one microsecond.
 Each test below will take on the order of 19936 microseconds.
  (= 2147483647 clock ticks)
 Increase the size of the arrays if this shows that
 you are not getting at least 20 clock ticks per test.
 -------------------------------------------------------------
 WARNING -- The above is only a rough guideline.
 For best results, please be sure you know the
 precision of your system timer.
 -------------------------------------------------------------
 Function Rate (MB/s) Avg time Min time Max time
 Copy: 14406.6086 0.0223 0.0222 0.0225
 Scale: 13649.5844 0.0236 0.0234 0.0237
 Add: 15961.7058 0.0301 0.0301 0.0302
 Triad: 16836.1425 0.0286 0.0285 0.0287
 -------------------------------------------------------------
 Solution Validates
 ------------------------------------------------------------
{noformat}\\ When run again, fairly different results are seen.
  
 {noformat}
When run again, fairly different results are seen.
 {noformat}
 # ./stream
 -------------------------------------------------------------
 Function Rate (MB/s) Avg time Min time Max time
 Copy: 12945.5076 0.0249 0.0247 0.0251
 Scale: 12207.7155 0.0263 0.0262 0.0264
 Add: 14397.1304 0.0335 0.0333 0.0337
 Triad: 15229.8620 0.0316 0.0315 0.0318
 -------------------------------------------------------------
{noformat}\\ With SMT on, results can vary with this type of workload as the threads are dispatched to the various processors. OpenMP and the IBM compilers provide controls to bind the threads to the processors and improve the repeatability of the results.
  {noformat}
 With SMT on, results can vary with this type of workload as the threads are dispatched to the various processors. OpenMP and the IBM compilers provide controls to bind the threads to the processors and improve the repeatability of the results.
  
 h2. Binding threads to processors
  
 To bind the threads to the processors, simply set an environment variable and re-run.
 
 {noformat}
 # export XLSMPOPTS="STARTPROC=0:STRIDE=1"
{noformat}\\ (!) With the downloaded trial compilers, the compiler-runtime controlled binding feature currently does not work when SMT is turned off. When SMT is off, STRIDE must be set to 2 (for processors 0,2,4,6), but the processor numbering convention isn't handled correctly. The runtime controller gets mixed up by processor numbers (ie: 6) being higher than the number of processors available (ie: 4), and stops trying to attach threads to processors. This is already fixed in the supported compilers, but isn't yet available in the trial download package.
  
  {noformat}
 (!) With the downloaded trial compilers, the compiler-runtime controlled binding feature currently does not work when SMT is turned off. When SMT is off, STRIDE must be set to 2 (for processors 0,2,4,6), but the processor numbering convention isn't handled correctly. The runtime controller gets mixed up by processor numbers (ie: 6) being higher than the number of processors available (ie: 4), and stops trying to attach threads to processors. This is already fixed in the supported compilers, but isn't yet available in the trial download package.
 Re-running stream with the binding controls set, repeat'able, consistent, and fairly nice results are acheived.
 
 {noformat}
 # export XLSMPOPTS="STARTPROC=0:STRIDE=1"
 # ./stream
 -------------------------------------------------------------
 Function Rate (MB/s) Avg time Min time Max time
 Copy: 17111.3144 0.0187 0.0187 0.0188
 Scale: 16069.9379 0.0200 0.0199 0.0200
 Add: 18550.6590 0.0259 0.0259 0.0260
 Triad: 19761.1496 0.0244 0.0243 0.0247
 -------------------------------------------------------------
{noformat}\\ (*g) (*g) With a 19GB/sec result for Triad, this is the target range hoped for. The exercise will continue on to see if using 16MB large pages can improve this number even more.
  {noformat}
 (*g) (*g) With a 19GB/sec result for Triad, this is the target range hoped for. The exercise will continue on to see if using 16MB large pages can improve this number even more.
  
 ----
 So let's look at large page exploitations available with libhugetlbfs from sourceforge.
  
 The next steps will first try modifying the stream program to use malloc since leveraging malloc is pretty simple with libhugetlbfs.
  
 After the malloc test, go back and re-link the existing stream code (which has the statically declared arrays) to have libhugetlbfs put the .bss segment into the 16MB large pages. A .bss segment is essentially the un-initialized static data arrays declared in the program. See [this page|http://www.elook.org/computing/block-started-by-symbol.htm] for a quick definition from the web.
  
 ----
 h2. Modify stream to use malloc
  
 It's simple to modify the stream.c program to take advantage of malloc calls instead of predefining the arrays. Copy stream.c to a new program stream-malloc.c.
  
 Then change
 
 {noformat}
 static double a[N+OFFSET],
  b[N+OFFSET],
  c[N+OFFSET];
{noformat}\\ to
  
 {noformat}
to
 {noformat}
 static double *a, *b, *c;
{noformat}\\ Then add the mallocs before the a,b,c arrays are initialized.
  
 {noformat}
Then add the mallocs before the a,b,c arrays are initialized.
 {noformat}
  a = (double *)malloc(sizeof(double) * (N + OFFSET));
  b = (double *)malloc(sizeof(double) * (N + OFFSET));
  c = (double *)malloc(sizeof(double) * (N + OFFSET));
{noformat}\\ Re-compile and run.
  
 {noformat}
Re-compile and run.
 {noformat}
 # xlc -O5 -qsmp=omp -qthreaded stream-malloc.c -o stream-malloc
  
 # export XLSMPOPTS="STARTPROC=0:STRIDE=1"
 # ./stream-malloc
 -------------------------------------------------------------
 Function Rate (MB/s) Avg time Min time Max time
 Copy: 15768.9864 0.0203 0.0203 0.0203
 Scale: 15093.5325 0.0212 0.0212 0.0212
 Add: 17051.6047 0.0282 0.0281 0.0282
 Triad: 17581.1124 0.0274 0.0273 0.0274
 -------------------------------------------------------------
{noformat}\\ (x) What is very interesting is that malloc performance costs about an 8% - 12% penalty from the statically defined arrays. We'll dig into this in a separate Wiki Page in the near future. This is believed to be a known Linux malloc problem (which can be tuned around) for malloc's which are bigger than 128KB each. For example, an [] from last year hints at the problem.
  {noformat}
  
(x) What is very interesting is that malloc performance costs about an 8% - 12% penalty from the statically defined arrays. We'll dig into this in a separate Wiki Page in the near future. This is believed to be a known Linux malloc problem (which can be tuned around) for malloc's which are bigger than 128KB each. For example, an \[\] from last year hints at the problem.
  
 h1. Using libhugetlbfs
  
 To enable and use libhugetlbfs, there are a series of simple steps needed.
 # Use SLES 10 or RHEL 5 (or for the more adventerous, download and build a recent mainline kernel...)
 # Get and build libhugetlbfs
 # Allocate enough large pages
 # Try libhugetlbfs with malloc
 # Try libhugetlbfs with bss segment
  
 h2. Need 2.6.16 kernel
  
The SLES 10 and the RHEL 5 GA'ed kernels have the correct support with no additional changes required.
  The SLES 10 and the RHEL 5 GA'ed kernels have the correct support with no additional changes required. We recommend using the latest service packs and updates.
  
 h2. Get and build libhugetlbfs
  
 Get the libhugetlbfs library from [sourceforge.net|http://sourceforge.net/projects/libhugetlbfs] into a working directory, for example {{/usr/local/src}}. There's a HOWTO included in the tar ball, but here's the quick and simple steps to install the Version 1 level of libhugetlbfs.
 
 {noformat}
 # tar -zxf libhugetlbfs-1.0.1.tar.gz
 # ls
 . .. libhugetlbfs-1.0.1 libhugetlbfs-1.0..1tar.gz
 # cd libhugetlbfs-1.0.1
 # make
 # make install PREFIX=/usr
{noformat}\\
  {noformat}
 h2. Allocate some large pages
  
 First, reserve some large pages to be used by malloc. Since the program is using about 457.8 for the arrays, 457.8MB / 16MB is about 28.6. For the sake of easier round numbers, 30 large pages were allocated.
  
 When using libhugetlbfs 1.0, more huge pages than needed were allocated. Be sure to use the latest libhugetlbfs (at least 1.0.1).
 
 {noformat}
 # cat /proc/meminfo | grep HugePage
 HugePages_Total: 0
 HugePages_Free: 0
  
 
 # echo 0 > /proc/sys/vm/nr_hugepages
 # echo 30 > /proc/sys/vm/nr_hugepages
  
 # cat /proc/meminfo | grep HugePage
 HugePages_Total: 30
 HugePages_Free: 30
{noformat}\\
  {noformat}
  
 {info:title=Clear hugepages before allocating more}
 Based on experience, we now recommend that you set the number of hugepages to Zero (0) before changing the hugepage count. By freeing all of the hugepages, this allows the newly reserved hugepages to be laid out better for NUMA placement with respect to the program about to be executed.
 {info}
  
  
  
 h2. Setup and re-run with libhugetlbfs
  
 Then re-run the malloc'ed version of stream. But first mount the virtual hugetlbfs filesystem, and set some environment control variables for libhugetlbfs.
 
 {noformat}
 # mkdir /libhugetlbfs
 # mount -t hugetlbfs hugetlbfs /libhugetlbfs
  
 # export LD_PRELOAD=libhugetlbfs.so
 # export HUGETLB_MORECORE=yes
 # ./stream_malloc
{noformat}\\ or, you can combine setting the environment variables on the command line....
  
  {noformat}\\
 or, you can combine setting the environment variables on the command line....
 {noformat}
 # xlc -O5 -qsmp=omp -qthreaded stream-malloc.c -o stream-malloc
 # export XLSMPOPTS="STARTPROC=0:STRIDE=1
 # LD_PRELOAD=libhugetlbfs.so HUGETLB_MORECORE=yes ./stream-malloc
 -------------------------------------------------------------
 Function Rate (MB/s) Avg time Min time Max time
 Copy: 15892.7826 0.0203 0.0201 0.0208
 Scale: 15697.7963 0.0206 0.0204 0.0209
 Add: 17721.3193 0.0272 0.0271 0.0275
 Triad: 18547.0702 0.0260 0.0259 0.0265
 -------------------------------------------------------------
{noformat}\\ Note that this improves the previous malloc result about 5%. For Triad, (18547 - 17581) / 17581 => 5.5%
  
  {noformat}\\
 Note that this improves the previous malloc result about 5%. For Triad, (18547 - 17581) / 17581 => 5.5%
 But malloc is still slower than the .bss segment in this experiment. More on this in a later technical paper.
  
 h2. Use stream as-is with libhugetlbfs .bss segment
  
 With the basic stream program, with statically declared arrays, libhugetlbfs can be used to load the .bss segment into 16MB large pages. When using the IBM Compilers, you must re-define which loader to use. It's easy... just add three options
 
 {noformat}
 # xlc -O5 -qsmp=omp -qthreaded -B/usr/share/libhugetlbfs/ -tl -Wl,--hugetlbfs-link=B stream.c -o stream-lp
{noformat}\\ Where \-tl (dash tee ell) tells the compiler to use the path specified by the \-B option for the ld command, and \-Wl, (dash W ell comma) specifies the link options to use.
  
  {noformat}\\
 Where \-tl (dash tee ell) tells the compiler to use the path specified by the \-B option for the ld command, and \-Wl, (dash W ell comma) specifies the link options to use.
 Then re-build and re-link.... (actually, just a re-link of the object files is needed... the libhugetlbfs library does not require a re-compile).
  
 Then re-run... Note that LD_PRELOAD and HUGETLB_MORECORE environment variables are not needed.
 
 {noformat}
 # export XLSMPOPTS="STARTPROC=0:STRIDE=1"
 # ./stream-lp
 -------------------------------------------------------------
 Function Rate (MB/s) Avg time Min time Max time
 Copy: 17150.8910 0.0188 0.0187 0.0190
 Scale: 14457.3530 0.0226 0.0221 0.0230
 Add: 18395.7340 0.0263 0.0261 0.0267
 Triad: 19522.5786 0.0248 0.0246 0.0250
 -------------------------------------------------------------
{noformat}\\ Interesting. Not much improvement from the original best-case stream.
  
  {noformat}\\
 Interesting. Not much improvement from the original best-case stream.
 To confirm, execute the original stream which was *not* linked for large pages.
 
 {noformat}
 # export XLSMPOPTS="STARTPROC=0:STRIDE=1"
 ./stream
 -------------------------------------------------------------
 Function Rate (MB/s) Avg time Min time Max time
 Copy: 17126.1615 0.0187 0.0187 0.0188
 Scale: 16045.7312 0.0201 0.0199 0.0206
 Add: 18577.8767 0.0260 0.0258 0.0260
 Triad: 19762.8954 0.0244 0.0243 0.0245
 -------------------------------------------------------------
{noformat}\\ The hypothesis would be that large pages may not help with smaller memory sizes. To test, let's see what happens if 10 times the memory is used for the arrays.
  {noformat}\\
 The hypothesis would be that large pages may not help with smaller memory sizes. To test, let's see what happens if 10 times the memory is used for the arrays.
  
 h2. Try much larger array size to exploit large pages
  
 Since relatively small array sizes do not seem to benefit much from large pages for this program, change N to be 200,000,000 (eight zeros) in stream.c, rebuild it to create stream and stream-lp, and then re-run them. The memory used now will be 4577MB (4.5 GB)
  
 Note that stream must be compiled in 64-bit mode because of the size of the arrays being used. For the xlc compiler, that's the "{{\-q64}}" option.
 
 {noformat}
 first, change N to 200000000 in stream.c
 (that's 8 zeros)
  
 # xlc -q64 -O5 -qsmp=omp -qthreaded stream.c -o stream
 # export XLSMPOPTS="STARTPROC=0:STRIDE=1"
 # ./stream
 -------------------------------------------------------------
 Function Rate (MB/s) Avg time Min time Max time
 Copy: 16104.7617 0.1989 0.1987 0.1992
 Scale: 15943.3205 0.2010 0.2007 0.2013
 Add: 18817.8556 0.2559 0.2551 0.2570
 Triad: 18747.4303 0.2567 0.2560 0.2588
 -------------------------------------------------------------
{noformat}\\ And repeat with the large page enabled executable. Be sure to reserve more large pages in the system. 4577MB / 16MB = 286\+ large pages needed. After echo'ing, confirm that the large pages were allocated, Linux gives no indication on the success or failure of setting {{nr_hugepages}} when echo'ing values into the variable.
  
  {noformat}\\
 And repeat with the large page enabled executable. Be sure to reserve more large pages in the system. 4577MB / 16MB = 286\+ large pages needed. After echo'ing, confirm that the large pages were allocated, Linux gives no indication on the success or failure of setting {{nr_hugepages}} when echo'ing values into the variable.
 {noformat}
# echo 0 > /proc/sys/vm/nr_hugepages
 # echo 300 > /proc/sys/vm/nr_hugepages
  
 # cat /proc/meminfo | grep HugePage
 HugePages_Total: 300
 HugePages_Free: 300
  
 # xlc -q64 -O5 -qsmp=omp -qthreaded -B/usr/share/libhugetlbfs/ -tl -Wl,--hugetlbfs-link=B stream.c -o stream-lp
 # export XLSMPOPTS="STARTPROC=0:STRIDE=1"
 # ./stream-lp
 -------------------------------------------------------------
 Function Rate (MB/s) Avg time Min time Max time
 Copy: 16629.6899 0.1936 0.1924 0.1944
 Scale: 16577.0481 0.1938 0.1930 0.1949
 Add: 20940.0445 0.2297 0.2292 0.2301
 Triad: 21123.6765 0.2276 0.2272 0.2285
 -------------------------------------------------------------
 {noformat}\\ Using 16 MB large pages in this case demonstrates about a 12.7% "(21123-18747)/18747" performance gain when pushing the .bss segment into 16MB large pages. This 21GB/sec is the best result achieved in this exercise, so we'll wrap up here.
  {noformat}\\
 Using 16 MB large pages in this case demonstrates about a 12.7% "(21123-18747)/18747" performance gain when pushing the .bss segment into 16MB large pages. This 21GB/sec is the best result achieved in this exercise, so we'll wrap up here.
  
 h1. Summary so far
  
 A working example of a workload using 16MB large pages with the executable and libhugetlbfs is demonstrated. It's easy to get quick gains with performance with some simple steps:
 # (*g) Obviously, compile with better optimizations. "xlc \-O5" creates nicely tuned executables for ppc64 platforms.
 # (*g) For OpenMP threaded applications, take advantage of the xlc provided "binding capability" where threads are automatically bound to the available processors. This improves both consistency and performance.
 # (*g) For some applications and in some cases, leveraging the 16MB large pages available by loading the .bss segment into them can provide gains. The sourceforge *libhugetlbfs* project provides an easy way to do this for 2.6.16 kernels and above, and supported on SLES 10 and RHEL 5.
  
 Here's a quick summary of the performance gains and hits that the various options provided:
  
 !StreamTriad.jpg!
  
 h2. Challenges still left
  
 In the course of this exercise, several known challenges were highlighted which still need to be addressed...
 # (x) In this example, using three large malloc's instead of static arrays cost about 8%-12%. This is likely related to a known problem with Linux when using malloc's greater than 128KB. More on this later.
 # (!) Binding threads to processors doesn't work w/OpenMP and the IBM compiler Trial Version when SMT is off. This is a known and fixed IBM compiler problem. Customers of course can get an updated compiler from IBM when needed. In the exercise shown here, testing was focused with SMT on.
 # (!) For smaller arrays (450MB) in this program, loading the .bss segment into 16MB large pages is slightly slower than just normal pages. It was demonstrated that larger arrays (across 4.5GB memory) did make a nice positive difference. This serves to highlight that not all workloads and not all cases will benefit from using large 16MB pages with the .bss segment.

 
    About IBM Privacy Contact