Using HPC Challenge with SLES10 sp2 on Power
This paper shows how to run HPC Challenge, a standard clustered HPC workload, on SUSE's latest operating system level (SLES10 sp2). It highlights HPC software components such as OpenMPI, which comes as a free download with SLES10 sp2, and shows how to build with either the standard gcc that comes with Linux or the IBM commercial compilers available for Power systems.
The paper covers installing, building, running, and basic tuning, as well as how to analyze and compare the results. We highlight the easy and significant performance gains that are possible when leveraging IBM's compilers and ESSL.
For these engineering examples, we used an IBM Power 575 system with 32 POWER6 cores running at 4.70 GHz and 128 GB of memory.
Introduction
The HPC Challenge benchmark consists of seven separate workloads that build on the success of the top500.org Linpack (HPL) based workload. The benchmark was designed to better measure the overall performance of high end HPC (High Performance Computing) systems running kernels with complex memory access patterns. For example, HPC Challenge measures the performance of the system's processors, memory, network bandwidth, and network latency.
HPC Challenge was written by Jack Dongarra, who is part of the Innovative Computing Laboratory (ICL) at the University of Tennessee, along with the following contributors: David Bailey, Jeremy Kepner, David Koester, Bob Lucas, John McCalpin, Antoine Petitet, Rolf Rabenseifner, Daisuke Takahashi, and R. Clint Whaley. For more information, including biographies and webpage links, see the HPCC Collaborators page.
The seven workloads include:
- HPL, or High-Performance Linpack Benchmark - measures a system's floating point rate of execution for solving a linear system of equations
- DGEMM - measures a system's floating point rate of execution for double precision real matrix-matrix multiplication
- STREAM - measures a system's sustainable memory bandwidth and computation rate for simple vector kernels
- PTRANS, or Parallel Matrix Transpose - measures the communications capacity of the network
- RandomAccess - measures a system's rate of integer random updates of memory
- FFT, or Fast Fourier Transform - measures a system's floating point rate of execution for computing a double precision complex one-dimensional Discrete Fourier Transform
- Communication bandwidth and latency tests - measure a system's network latency and bandwidth
To find out more information about the HPC Challenge benchmark, see http://icl.cs.utk.edu/hpcc/
We found the benchmark quick and easy to build and run. Basic tuning and interpreting the basic results were straightforward, but detailed tuning and optimization take more time to learn.
Installing
Download the latest version of HPC Challenge from http://icl.cs.utk.edu/hpcc/software/index.html
Un-tar the source code tar ball; this creates an hpcc-<ver> directory. In our case, we downloaded Version 1.2 from the hpcc web site into /usr/local.
Inside this directory there are directories for the DGEMM, FFT, PTRANS, RandomAccess, and STREAM workloads which contain source code and header files. Also under this top level directory is a README file and a file called _hpccinf.txt. This file is a sample input file for the workload, which is very similar to the HPL.dat file used for tuning Linpack. Tuning will be discussed in depth after we build and run the basic benchmark workloads.
In the hpl directory, there are README, INSTALL, and TUNING files. There is also a www directory under hpl, which contains many helpful files that provide information about links, references, results, scalability, software, tuning, etc. The setup directory under hpl contains all of the example makefiles provided with the workload. Using these makefiles will be discussed in more detail below.
Dependencies
In order to run the HPC Challenge benchmark, a few things must be installed on your system. First, you must have an implementation of either BLAS (Basic Linear Algebra Subprograms) or VSIPL (Vector Signal Image Processing Library). You are allowed to use optimized, architecture-dependent versions of BLAS, such as IBM's ESSL (Engineering Scientific Subroutine Library). While ESSL must be obtained from IBM, an open-source version of BLAS is available from netlib (see References).
Quick guide for building BLAS
To build and use BLAS for a Power system, first obtain the tar ball blas.tgz from netlib (see References). Un-tar blas.tgz and it will create a BLAS directory with various Fortran programs, a Makefile, and a file called make.inc.
You will need to edit make.inc to create the needed 32-bit and 64-bit BLAS libraries. This example uses gfortran, which is provided by SLES10 sp2. Running 'make' creates an archive file; you then need to create the libblas shared object file as follows:
For more information on gcc/gfortran tuning options, see Tuning options to consider with gcc.
If you were able to acquire ESSL, we assume here that you've installed the latest version on your system.
You will also need an implementation of MPI (Message Passing Interface) installed. We used OpenMPI, an open source, high performance message passing library. OpenMPI 1.2.5 is provided as a free download with SLES10 sp2.
When installing OpenMPI, make sure to install both the openmpi and openmpi-devel packages so you get the needed header files. Both 32-bit and 64-bit builds are available. Be sure to install all of the OpenMPI dependencies as well. For example, on our test system, the following dependencies were installed with OpenMPI:
In addition to the software mentioned above, you will need a compiler installed. This paper uses the standard gcc that comes with SLES10 sp2 and the IBM XL compilers.
A third option, provided specifically for Power users, is the Advance Toolchain, a set of compilers based on gcc. The Advance Toolchain provides a newer version of the gcc compiler and the tool chain libraries.
Our examples are on a single node, but we show you one way to extend this testing onto multiple machines (small clusters) later in this article.
In the following text, we now assume both compilers were installed, BLAS and ESSL were installed, and OpenMPI was installed.
Setup
Before you can run HPC Challenge, you must first provide the input file which passes the needed parameters to the workload. The input file for HPC Challenge is called hpccinf.txt and is very similar to the input file used for the Linpack benchmark. The HPC Challenge tar ball provides a sample input file named _hpccinf.txt in the hpcc-<ver> directory. For testing purposes, it is sufficient to just copy this sample input file to a file named "hpccinf.txt" without any changes. Tuning the workload through this file will be discussed in the Tuning section.
An additional step is to create a hostfile for MPI, even if you are only using one system. This file can reside anywhere on the system, since you specify its location in the run command. Below is an example hostfile (which we arbitrarily put in the /etc directory). In this hostfile you can use IP addresses or long hostnames, but do not use 'localhost'.
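A hostfile can also be generated with a short script. The following is a minimal sketch in Python, assuming the OpenMPI hostfile convention of one "hostname slots=N" line per machine. The function name make_hostfile and the host name node1.example.com are hypothetical and used only for illustration.

```python
def make_hostfile(hosts, slots_per_host):
    """Return OpenMPI hostfile text: one 'host slots=N' line per host.

    hosts: IP addresses or long hostnames (hypothetical values below).
    slots_per_host: number of MPI processes to allow on each host.
    """
    lines = []
    for host in hosts:
        # The text above advises against 'localhost' in the hostfile.
        if host.strip().lower() == "localhost":
            raise ValueError("use an IP address or long hostname, not 'localhost'")
        lines.append(f"{host} slots={slots_per_host}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Single 32-core machine; the host name is a placeholder.
    print(make_hostfile(["node1.example.com"], 32), end="")
```

The returned text can be written to any path you choose, since the hostfile location is passed on the run command.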
Building
To build the HPC Challenge workload, you first need to create a makefile specific to your architecture and environment. There are some sample makefiles under hpl/setup. If your setup is similar to one of the makefiles provided, use that makefile as a starting point. The sections of the makefile most commonly changed are those that set up MPI, BLAS, and the compilers. Below is an example of each of these sections of the makefile:
In our case, we'll be setting up MPI with OpenMPI. In the examples that follow, we'll be modifying this section.
We'll be using BLAS and ESSL. Again, this is the section that will be modified when specifying the math libraries.
And for the compilers, we'll provide examples for gcc and the IBM Compilers. The following flags are specified when modifying the compiler designation.
For example, for an IBM Power 575 we used Make.PWRPC_FBLAS as our starting point, changing these sections as necessary. Below are some example configurations for 32-bit builds, with comments on what to change for 64-bit builds, provided you have the corresponding 64-bit installations of the dependencies.
We provide four common, easy examples, in this order:
- BLAS with gcc.
- BLAS with xlc.
- ESSL with gcc.
- ESSL with xlc.
The following sections show what needs to be changed in each makefile. The makefiles reside in the hpl sub-directory, so that is where you will edit them.
Edit these four makefiles first, then execute the "make" command from the hpcc-1.2.0 directory.
Once the build has finished, it will create an executable called hpcc. The hpcc executable cannot be run directly by itself; the wrapper scripts and variables needed to run it are covered in each of the example configuration sections below.
Example Configurations
BLAS, OpenMPI, gcc
Below is an example of the changes to the makefile (Make.ppc64.blas.gcc) for the Power 575 machine using a generic BLAS library, OpenMPI, and the gcc compilers. This build changes the "F2CDEFS" variable in the sample makefile as well as the normal sections.
Search for the following sections and edit the file for the variables being set.
Using this makefile, issue the following commands to clean/setup, build and run. HPC Challenge requires at least 4 processes, specified with the -np option.
BLAS, OpenMPI, IBM Compilers
Below is an example of the changes to the makefile (Make.ppc64.blas.xlc) for the Power 575 machine using a generic BLAS library, OpenMPI, and the IBM XL compilers. This build changes the "F2CDEFS" variable in the sample makefile as well as the normal sections.
Search for the following sections and edit the file for the variables being set.
Using this makefile, issue the following commands to clean/setup, build and run. HPC Challenge requires at least 4 processes, specified with the -np option. For this configuration, you need to define a few flags for the OpenMPI wrapper compiler mpicc as well as the run command. These flags override the wrapper compiler's default, which is assumed to be gcc in this case. See the OpenMPI FAQs on this topic for more information:
IBM's ESSL, OpenMPI, gcc
Below is an example of the changes to the makefile (Make.ppc64.essl.gcc) for the IBM Power 575 machine using IBM's ESSL, OpenMPI, and the gcc compilers. For this configuration, you will need to link in some IBM XL compiler pieces because ESSL uses them; however, you can still compile the workload with the gcc compilers as described below.
Search for the following sections and edit the file for the variables being set.
Using this makefile, issue the following commands to clean/setup, build and run. HPC Challenge requires at least 4 processes, specified with the -np option.
IBM's ESSL, OpenMPI, IBM Compilers
Below is an example of the changes to the makefile (Make.ppc64.essl.xlc) for the IBM Power 575 machine using IBM's ESSL, OpenMPI, and the IBM XL compilers.
Search for the following sections and edit the file for the variables being set.
Using this makefile, issue the following commands to clean/setup, build and run. HPC Challenge requires at least 4 processes, specified with the -np option. For this configuration, you need to define a few flags for the OpenMPI wrapper compiler mpicc as well as the run command. These flags override the wrapper compiler's default, which is assumed to be gcc in this case. See the OpenMPI FAQs on this topic for more information:
After the workload is finished, there will be an output file named hpccoutf.txt under the hpcc-<ver> directory on the control node (normally the machine listed first in the hostfile). This is the results file generated by the benchmark and will be discussed in more detail later.
Warnings
You may see these warnings:
To fix the libibverbs warning, load the following ib modules: ib_mthca and rdma_ucm. Alternatively, make sure you have the following packages installed, all of which are provided as a free download with SLES10 sp2: openib, libmthca, libibverbs, libibverbs-devel, librdmacm, librdmacm-devel.
The next two warnings, about OpenIB not being able to find any HCAs and uDAPL not being able to find any NICs, are simply OpenMPI letting you know that you are not using the high speed network that it assumes is present. To suppress these warnings, insert the MCA parameter "--mca btl ^openib,udapl" in your mpirun command. This tells OpenMPI you are not using uDAPL or IB. For example:
To learn more about these warnings, see the related thread in the OpenMPI User's Mailing List Archives.
You may also see warnings of this type:
These warn you that there are entries in the uDAPL registry file (/etc/dat.conf) that are not being used. For the setup described in this paper, uDAPL is not used. To suppress these warnings, insert the MCA parameter "--mca btl ^udapl" in your mpirun command. This tells OpenMPI that you are not using uDAPL. Alternatively, you can list only the communication components you are using. For example:
To learn more about MCA parameters, see the OpenMPI FAQs.
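To keep the run command consistent across experiments, the mpirun invocation can be assembled by a small helper. The sketch below is in Python and is only an illustration. It uses the mpirun options -np, --hostfile and --mca, and the "^openib,udapl" value quoted above. The helper name mpirun_command and the hostfile path /etc/hostfile are hypothetical, and the alternative component list "self,sm,tcp" assumes a single node or plain TCP network.

```python
import shlex


def mpirun_command(np, hostfile, exe="./hpcc", btl=None):
    """Build an mpirun argument list for an HPC Challenge run.

    np: number of MPI processes (this paper uses at least 4).
    hostfile: path to the MPI hostfile.
    btl: optional value for '--mca btl', for example '^openib,udapl'
         to exclude those components, or 'self,sm,tcp' to list only
         the ones in use (an assumption for a TCP-only setup).
    """
    if np < 4:
        raise ValueError("HPC Challenge needs at least 4 processes")
    cmd = ["mpirun", "-np", str(np), "--hostfile", hostfile]
    if btl:
        cmd += ["--mca", "btl", btl]
    cmd.append(exe)
    return cmd


if __name__ == "__main__":
    # /etc/hostfile is a placeholder path.
    cmd = mpirun_command(4, "/etc/hostfile", btl="^openib,udapl")
    # shlex.join quotes the '^' so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

The list form can be passed directly to subprocess.run, which avoids shell quoting issues with the "^" character.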
Tuning
There are two types of runs for HPC Challenge. In baseline runs, the user can tune various parameters. In optimized runs, the user can tune parameters and can also make certain code modifications to the benchmark. This paper discusses a little about baseline tuning. To learn more, or to learn about optimized run tuning, see the HPCC rules.
There are several things you can change to tune the benchmark to work towards achieving a better score. First there are the software pieces. Obviously getting an optimized version of your chosen compiler, MPI, linker, and BLAS or VSIPL libraries will improve performance.
Optimized Math Libraries
We saw a very large gain when building with IBM's ESSL instead of a general BLAS library. In an out-of-the-box run on an IBM Power 575 with no other tuning, the Linpack component score increased significantly when moving from the general BLAS to IBM's carefully tuned ESSL product.
Transparent Large Pages
You are also allowed to use compiler and load options to improve your result. Another software piece you may consider is libhugetlbfs, which makes 16MB large pages available to the workload and can boost the score in some cases.
Tuning HPC-Challenge
Another big piece of improving your score is tuning the input file (hpccinf.txt). The input file is described in depth in the file under the hpl directory called TUNING and also in hpl/www/tuning.html. These files explain line by line what each of the input file parameters is and also provide some guidelines at the bottom.
Another good source of guidelines for these parameters is the HPCC FAQs, or the file hpl/www/faqs.html.
There are a few main parameters that you can start with.
- First there is N (line 6) which is your problem size, or in other words your matrix dimension for HPL (Linpack).
- Next, there is your NB (line 8) which is your block size, or in other words your sub matrix size.
- Then there are P (line 11) and Q (line 12), which are the number of process rows and columns you want to run (PxQ). The product PxQ corresponds to the number of processors you choose to run with.
The input file makes it easy to try out multiple configurations during one run by allowing you to specify how many N's to try (line 5), how many NB's to try (line 7), how many process grids to try (line 10), how many additional N's to try for PTRANS (line 33), and how many additional NB's to try for PTRANS (line 35), etc. Refer to the above mentioned files/links to learn about the other parameters and to read some guidelines for setting these values.
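Starting values for N, P and Q can be estimated with a few lines of arithmetic. The sketch below is in Python and rests on assumptions that are not part of this paper's measurements: the common HPL rule of thumb that the N x N matrix of 8-byte doubles should fill roughly 80% of memory, N rounded down to a multiple of NB, and a process grid that is as close to square as possible with P no larger than Q. The function names and the NB value of 128 are hypothetical.

```python
import math


def problem_size(mem_bytes, nb, percent=80):
    """Estimate N so an N x N matrix of doubles uses ~percent% of memory.

    Rule of thumb only; the result is rounded down to a multiple of NB.
    """
    elements = mem_bytes * percent // 100 // 8  # 8 bytes per double
    n = math.isqrt(elements)
    return n - n % nb


def process_grid(nprocs):
    """Return (P, Q) with P * Q == nprocs, P <= Q, and P as large as possible."""
    p = math.isqrt(nprocs)
    while nprocs % p:
        p -= 1
    return p, nprocs // p


if __name__ == "__main__":
    mem = 128 * 2**30  # 128 GB read as 128 GiB (an assumption)
    print("N =", problem_size(mem, nb=128))
    print("P x Q =", process_grid(32))
```

For the 128 GB, 32-core system used here, with the hypothetical NB of 128, this sketch gives N = 117120 and a 4 x 8 grid. Treat these only as starting points and refine them using the TUNING file and FAQs mentioned above.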
Infiniband Performance
One other thing to consider is your network and I/O performance. You can try using InfiniBand, a very high speed I/O technology that can improve the performance of high performance computing systems. To configure InfiniBand, first install the following packages, which are provided as an extra download with SLES10 sp2: ofed, libibcm, libibcommon, libibmad, libibumad, libibverbs, dapl, opensm, librdmacm, ibutils, libehca, libmthca, and libsdp. Then issue the following commands:
This starts the openib daemon and loads the openib kernel modules. At this point, the IB kernel module and device driver are loaded and the IB connection is established. The following commands make use of the IPoIB protocol and are useful for verifying the IB connection, although they are not necessary to establish and use the IB connection with OpenMPI.
Analyzing and Comparing Results
The results file (hpccoutf.txt) contains a lot of information. First, the input file is printed for reference along with some information about the workload, the hostname, etc. Then there are the full results for all the workloads.
HPC Challenge consists of three different types of tests.
Local Runs
First there are Single runs, which are also referred to as "Local". These tests are run on a single processor.
Star Runs
Next, there are Star runs, also referred to as "EP" or "Embarrassingly Parallel". For these runs, each processor is doing computation in parallel, but the processors are not communicating with each other explicitly.
MPI Runs
Lastly there are MPI runs, also referred to as "Global". For these runs, each processor is doing computation in parallel and the processors are explicitly communicating with each other.
The workloads are run in the following order: PTRANS, HPL, StarDGEMM, SingleDGEMM, StarSTREAM, SingleSTREAM, MPIRandomAccess, StarRandomAccess, SingleRandomAccess, MPIFFT, StarFFT, SingleFFT, and finally LatencyBandwidth. Some of the workloads will provide a short summary after the results stating how many tests were measured, how many failed, and how many were skipped.
After the full results reports, there is a summary section that consists of a long list of variables containing system information, workload scores, workload times, etc. If you upload and submit your results to the HPCC website, the site takes your hpccoutf.txt file and some system information and creates a much easier to read report. See the HPCC upload link for more information.
To see an example of the "results" file that is created after submission, click on any of the links in the System Information field. The HPCC website also has a very nice feature for comparing results under the Kiviat Diagram link. Simply check the runs you want to compare, scroll down, and hit the Graph button; it creates a radar graph comparing each component of all the selected runs.
For more information on how to compare the published results to your own, or why your results are different, see the HPCC FAQs.
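For a quick local comparison of two runs, for example a BLAS build against an ESSL build, the summary section of each hpccoutf.txt can be read into a dictionary. The following Python sketch assumes that the summary is a block of key=value lines between the marker lines "Begin of Summary section." and "End of Summary section.", and that keys such as HPL_Tflops and StarSTREAM_Triad appear in it. Check both assumptions against your own output file. The function names are hypothetical and the numbers in the example are made-up placeholders, not measured results.

```python
def parse_summary(text):
    """Extract key=value pairs from the summary section of hpccoutf.txt text."""
    results = {}
    in_summary = False
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Begin of Summary section"):
            in_summary = True
        elif line.startswith("End of Summary section"):
            break
        elif in_summary and "=" in line:
            key, _, value = line.partition("=")
            try:
                results[key] = float(value)
            except ValueError:
                results[key] = value  # keep non-numeric values as text
    return results


def compare(base, other, keys):
    """Return other/base for each key that is numeric and nonzero in both runs."""
    ratios = {}
    for key in keys:
        b, o = base.get(key), other.get(key)
        if isinstance(b, float) and isinstance(o, float) and b != 0.0:
            ratios[key] = o / b
    return ratios


if __name__ == "__main__":
    # Placeholder text standing in for two output files; values are invented.
    run_a = "Begin of Summary section.\nHPL_Tflops=0.5\nEnd of Summary section.\n"
    run_b = "Begin of Summary section.\nHPL_Tflops=1.0\nEnd of Summary section.\n"
    print(compare(parse_summary(run_a), parse_summary(run_b), ["HPL_Tflops"]))
```

A ratio above 1 means the second run scored higher on that metric. Where higher is better depends on the metric, so times and latencies should be read the other way around.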
Multi-machine setup
For a multi-machine configuration, build the workload on each machine as described above and then include each machine's IP/hostname in the hostfile on the control machine. You only need to issue the run command on the control machine, and it is good practice to list the control machine first in the hostfile so that the hpcc output file is created there.
Also, for a multi-machine setup, you need to set up your ssh keys so that OpenMPI can remotely start processes without any password prompts. To do this, enter the following commands on one of the nodes (as root):
Note: This master node is often called the control node in HPC applications. Additional nodes are generally referred to as the compute nodes. In our example here, we keep things simple by using no passphrase, which is not recommended for production environments.
Then copy the .ssh directory to all other nodes under /root. Make sure to do a chmod 700 /root/.ssh/ on all nodes if those permissions are not already set. Verify that you can now ssh between any of the nodes without a password.
References
HPCC website, http://icl.cs.utk.edu/hpcc/
IBM's ESSL (Engineering Scientific Subroutine Library), http://www-304.ibm.com/jct03004c/systems/p/software/essl/index.html
IBM XL Fortran Advanced Edition for Linux, http://www-306.ibm.com/software/awdtools/fortran/xlfortran/features/linux/xlf-linux.html
IBM XL C/C++ Advanced Edition for Linux, http://www-306.ibm.com/software/awdtools/xlcpp/features/linux/xlcpp-linux.html
InfiniBand, http://www.infinibandta.org/home
Libhugetlbfs, http://sourceforge.net/projects/libhugetlbfs
BLAS (Basic Linear Algebra Subprograms), http://www.netlib.org/blas/index.html
VSIPL (Vector Signal Image Processing Library), http://www.vsipl.org/
MPI (Message Passing Interface), http://www-unix.mcs.anl.gov/mpi/
OpenMPI, http://www.open-mpi.org/
GCC, the GNU Compiler Collection, http://gcc.gnu.org/