High performance applications can be developed with Open MPI [1], InfiniBand [2] and Libhugetlbfs [3] under SLES 10 SP2 in a single system or a cluster environment. There is a growing trend of developing parallel and distributed applications as parallel computers, clusters and heterogeneous networks are increasingly used in business, science and engineering to increase productivity and aid discovery. In this article, we will use the compilation and running of Linpack [4] on a cluster of four IBM Power 575 systems as an example to show how these components can be used together.
The rest of the article is organized as follows. We describe different ways of compiling and linking to generate executable code for Linpack. We then show how to back the bss/data/text segments and dynamic memory with 16MB pages via Libhugetlbfs. We present some key Open MPI options for binding MPI processes to available processors and for using the InfiniBand interconnect. Finally, we report the scalability of a cluster of four IBM Power 575 systems and the performance gain of backing the Linpack matrix with 16MB pages rather than 4KB pages, and close with some concluding remarks.
Generating Executable Code for Linpack without Libhugetlbfs
In this section, we provide methods to generate executable code for Linpack using the IBM XLC/XLF compilers [5, 6] and the gcc/gfortran compilers. The mathematical functions are provided by the IBM Engineering and Scientific Subroutine Library (ESSL) [7]. Linpack mainly uses one of them, Double-precision General Matrix Multiply (DGEMM), which consumes most of the computing resources in the whole Linpack run. Note that the gcc/gfortran compilers are not officially supported by IBM ESSL.
Message Passing Interface (MPI) [8] is a popular parallel programming model in the high performance computing community; we say more about it later in the article. We choose Open MPI for process communication. Open MPI provides a few compiler wrappers, namely mpicc, mpif77 and mpif90, to build MPI applications. They are not real compilers; they invoke the real ones for compiling and linking, and they know where to find the Open MPI libraries at link time. In the following explanation, two sets of instructions are given, one for the IBM XLC/XLF compilers and one for the gcc/gfortran compilers.
Note that the 64-bit compiler wrappers, namely mpicc, mpif77 and mpif90, were not provided with the Open MPI 1.2.5 package made available by SLES 10 SP2. We can still use the 32-bit compiler wrappers to build 64-bit applications, although special link options would be needed.
To build Linpack, we can run make -e arch=ppc64 so that the Make.ppc64 file that comes with the Linpack package is selected. The various modifications to Make.ppc64 are given below.
If we use IBM XLC/XLF compilers, we set the following environment variables.
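A minimal sketch of such settings follows. OMPI_CC, OMPI_CFLAGS, OMPI_FC and OMPI_FCFLAGS are environment variables read by the Open MPI wrappers; the compiler names xlc and xlf90 are assumptions about what is installed and on the PATH.

```shell
# Tell the Open MPI wrappers which real compilers to invoke (assumed names).
export OMPI_CC=xlc          # C compiler behind mpicc
export OMPI_CFLAGS=-q64     # generate 64-bit code
export OMPI_FC=xlf90        # Fortran 90 compiler behind mpif90 (assumption)
export OMPI_FCFLAGS=-q64    # generate 64-bit code

# Show the effective settings.
env | grep '^OMPI_' | sort
```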
Note:
- FC corresponds to Fortran 90. If Fortran 77 is desired, set OMPI_F77 and OMPI_F77FLAGS.
- IBM XLC 10.1 [5] and XLF 12.1 [6] are used here to generate 64-bit code with -q64.
In the Make.ppc64 file, we set the following parameters.
Note:
- CC ties to environment variables OMPI_CC and OMPI_CFLAGS which we just set above, and thus mpicc actually means "xlc -q64" during compilation. The same is true for FC.
- -qarch and -qtune are used to optimize code for POWER6 architecture. -O4 or -O5 can be used for a higher degree of optimization.
- At the link step, -lessl is used for resolving mathematical function calls against IBM ESSL.
- Since mpif90 is a 32-bit compiler wrapper, it knows about the 32-bit Open MPI libraries, but not the 64-bit libraries. The use of "-L/usr/lib64/mpi/gcc/openmpi/lib64" helps the linker to find the right libraries and resolve symbols there. The use of "-R/usr/lib64/mpi/gcc/openmpi/lib64" helps the executable to find the right libraries at run-time.
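One way to confirm what a wrapper will actually run is to ask it to print its command line instead of executing it. The sketch below uses the wrappers' --showme option; the helper name show_wrapper is hypothetical, and the guard keeps the snippet harmless on a machine without Open MPI.

```shell
# Print the command line a wrapper would run, without compiling anything.
show_wrapper() {
    wrapper=$1
    if command -v "$wrapper" >/dev/null 2>&1; then
        "$wrapper" --showme
    else
        echo "$wrapper: not found on PATH"
    fi
}

show_wrapper mpicc
show_wrapper mpif90
```

With the environment variables set as described above, the printed line should start with the chosen compiler and flags, followed by the include and library paths that the wrapper adds on its own.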
If we use the gcc and gfortran compilers, set the following parameters in Make.ppc64. Note that if IBM ESSL is used with the gcc/gfortran compilers, the IBM XL Fortran Runtime Environment for Linux must be downloaded and installed.
Note:
- "-m64" is specified for all three FLAGS to generate 64-bit code and link 64-bit libraries.
- "CCNOOPT=-m64" is specifically set for HPL_dlamch.c to build 64-bit binary code.
- The use of "-lxlf90_r -lxlomp_ser -lxl -lxlfmath" lists all the IBM XL Fortran Runtime libraries needed by IBM ESSL. The options "-L/opt/ibmcmp/xlsmp/1.8/lib64 -L/opt/ibmcmp/xlf/12.1/lib64" help the linker to find those libraries and resolve the symbols there.
Backing Bss, Data and Text Segments by 16MB Pages via Libhugetlbfs
By default, the base page size of SLES 10 SP2 is 4KB. However, we can back the bss/data/text segments and dynamic memory by 16MB pages through the use of Libhugetlbfs. (Libhugetlbfs 1.0.1 is included with SLES 10 SP2.) Performance of your applications might improve with larger pages due to lower overhead of address translation, higher memory bandwidth, and probably more efficient handling of page faults. For the performance benefits of large pages, see the author's article A Performance Evaluation of 64KB Pages on Linux for Power Systems [9]. For detailed information on installing and configuring Libhugetlbfs, see Bill Buros' article Leverage transparent huge pages on Linux on POWER [10].
If we use the IBM XLC/XLF compilers, we need to add the following link options to the LINKER parameter specified earlier.
If we use the gcc/gfortran compilers, we need to add the following link options to the LINKER parameter.
Note:
- The directory /usr/share/libhugetlbfs contains a special linker script from Libhugetlbfs. That script, called ld, is selected to generate the final code.
- --hugetlbfs-link=BDT indicates that we want to back bss, data and text segments by 16MB pages.
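For the gcc/gfortran case, the options can be sketched as below. The -B form is the usual way to make gcc pick up a replacement ld from a given directory; treat the exact spelling as an assumption to check against the Libhugetlbfs documentation for your version. The XL compilers use a different option to redirect the linker, which is not shown here. The variable names are illustrative.

```shell
# Extra link options for the gcc/gfortran case (sketch under the assumptions above).
HTLB_DIR=/usr/share/libhugetlbfs     # directory holding the special 'ld' script
HTLB_LINK_OPTS="-B ${HTLB_DIR} -Wl,--hugetlbfs-link=BDT"

# Printed only; in practice these options are appended to LINKER in Make.ppc64.
echo "append to LINKER: ${HTLB_LINK_OPTS}"
```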
Backing Dynamic Memory by 16MB Pages via Libhugetlbfs
In general, once an application is linked to Libhugetlbfs, we can run export HUGETLB_MORECORE=yes to back dynamic memory (allocated via malloc) by 16MB pages as well. However, there is a requirement: the malloc call must be resolved against the GNU C library (commonly called glibc) made available by the distro. The reason is that the malloc implementation of glibc provides a morecore hook through which it obtains more heap memory. When Libhugetlbfs sees HUGETLB_MORECORE=yes, it installs its own routine in that hook, so that glibc's malloc grows the heap from 16MB pages. However, if the linker resolves malloc to an implementation that does not support this hook, dynamic memory is still backed by the default pages of the system, e.g., 4KB or 64KB pages.
Note that, at the link step, the linker resolves symbols by matching against libraries one after another, based on its default list of libraries and those specified with -l. (-L only specifies library paths that the linker would not otherwise search.) Different linkers might have different default lists of libraries. In other words, the order of libraries for symbol resolution may not be the same for different linkers, and as a result the symbols could be resolved differently.
Since we use the compiler wrapper mpicc, mpif77 or mpif90 for compiling the source code, we tend to use them for linking as well. Open MPI has its own malloc implementation, and mpicc, mpif77 and mpif90 put the Open MPI libraries ahead of glibc in symbol resolution, so the malloc implementation of Open MPI is selected for the malloc call. But the malloc implementation of Open MPI does not support HUGETLB_MORECORE. As a result, dynamic memory is still backed by 4KB pages, not 16MB pages, on SLES 10 SP2, and this has a significant performance impact. We provide one method to tackle this problem.
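As a small sketch, the setting and a quick look at the huge page pool could be done as follows. The /proc/meminfo check is an addition for illustration and assumes a Linux kernel with huge page support.

```shell
# Ask Libhugetlbfs to back malloc'd memory with huge pages for programs
# started from this shell. It only takes effect for programs that use
# Libhugetlbfs and glibc's malloc.
export HUGETLB_MORECORE=yes

# Report the huge page pool, if the kernel exposes it.
if [ -r /proc/meminfo ]; then
    grep -E '^(HugePages_|Hugepagesize)' /proc/meminfo || echo "no huge page counters found"
fi
```

If HugePages_Free does not drop while the application runs, the heap is probably not being served from the huge page pool.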
The following change is made in Make.ppc64, regardless of which compiler is being used.
By placing glibc ahead of all the other Open MPI libraries explicitly in the link command, we can guarantee that the malloc implementation of glibc is resolved for the malloc calls.
Some Key Open MPI Parameters in Parallel and Distributed Environments
Message Passing Interface (MPI in short) is the dominant parallel programming model widely used in the high performance computing community. Its goals include high performance, scalability, and portability. The basic idea is to partition a large and complex problem into multiple subproblems and have them run in parallel by a number of sequential processes which communicate by passing messages to one another, i.e., sending and receiving messages. This programming model can be used on a single system or on a large cluster. Open MPI is one of the many MPI implementations available in the industry.
We will only touch on two performance areas here: process affinity and low latency in clusters using InfiniBand.
Binding a process to a physical processor is a common and important method to avoid cache line bouncing and potential remote memory accesses. We want to use process and memory affinity as much as possible in any computing environment. We can achieve this with a machine file and an Open MPI parameter.
The following machine file can be used for four processes running on two machines NodeA and NodeB.
The following Open MPI parameter is for process affinity.
MCA stands for Modular Component Architecture in Open MPI. The machine file indicates that two processes will run on NodeA and two on NodeB. If there are more processes than node names in your machine file, say eight processes, the 5th and 7th processes would wrap around to the top of the file and run on NodeA and NodeB, respectively. In addition, they are going to be bound (or affinitized) to particular processors of the systems. Note that if we use a job scheduler, e.g., Maui Cluster Scheduler [11], to select an appropriate set of processors for the run, the machine file is no longer needed.
In the MPI programming model, processes communicate by passing messages to one another, either through point-to-point (two-party) message passing or through collective (multi-party) communication. Processes may run on the same node or far away from each other in a large cluster. Among many interconnect technologies, InfiniBand is widely selected to improve bandwidth and lower inter-node latency, two key performance metrics for distributed computing. For uniformity, we may use the InfiniBand interface for communication among all processes in a cluster. However, the shared memory interface is definitely less costly for processes on a single node. The following provides one Open MPI parameter to achieve that.
BTL stands for Byte Transfer Layer in Open MPI. The option means we use shared memory (sm) for processes running on the same node and InfiniBand (openib) for processes running on different nodes. The parameter self is for loopback communication, allowing a process to send messages to itself. Because the TCP component is not listed, these parameters exclude IP over IB at the same time. Note that running TCP/IP over InfiniBand would incur a significant performance penalty.
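The sketch below writes a four-line machine file and models the wrap-around as the text describes it. The layout NodeA, NodeA, NodeB, NodeB is inferred from the statement that the 5th and 7th processes land on NodeA and NodeB. The helper node_for_process is hypothetical, mpi_paffinity_alone is the Open MPI 1.2 parameter assumed here for process affinity, and xhpl is the usual name of the HPL executable.

```shell
# Machine file: one line per process, two on NodeA and two on NodeB.
MACHINEFILE=$(mktemp)
printf '%s\n' NodeA NodeA NodeB NodeB > "$MACHINEFILE"

# Model of the wrap-around: process k (1-based) uses
# line ((k-1) mod number_of_lines) + 1 of the machine file.
node_for_process() {
    k=$1
    lines=$(wc -l < "$MACHINEFILE")
    idx=$(( (k - 1) % lines + 1 ))
    sed -n "${idx}p" "$MACHINEFILE"
}

for k in 1 2 3 4 5 6 7 8; do
    echo "process $k -> $(node_for_process "$k")"
done

# Command line sketch, printed rather than executed.
echo "mpirun -np 4 -machinefile $MACHINEFILE --mca mpi_paffinity_alone 1 ./xhpl"
```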
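As a sketch, the transport selection described above can be passed to mpirun as an MCA parameter. The process count, the machine file name and the executable path are placeholders.

```shell
# InfiniBand between nodes, shared memory within a node, plus loopback.
# TCP is deliberately left out of the list.
BTL_OPTS="--mca btl openib,sm,self"

# Printed rather than executed; -np, ./machines and ./xhpl are placeholders.
echo "mpirun -np 4 -machinefile ./machines ${BTL_OPTS} ./xhpl"
```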
Performance Results
We measure Linpack on a cluster of four IBM Power 575 systems running SLES 10 SP2. Each system has 32 POWER6 cores running at 4.7 GHz and 128 GB of DDR2 memory at 667 MHz. We use Open MPI 1.2.5, OFED 1.3 and Libhugetlbfs 1.0.1 that come with the distro.
In this evaluation, we make use of IBM ESSL 4.4 and compile the Linpack code with the gcc/gfortran compilers.
Each IBM Power 575 system is installed with an IBM Dual 2 Port 4x Host Channel Adapter, which is an IBM InfiniBand adapter. Two InfiniBand links out of four are used to connect to a QLogic Silverstorm 9024 switch. Thus, there are a total of eight InfiniBand links used for four IBM Power 575 systems.
The key Linpack parameters are as follows:
| Parameter | 1 node | 4 nodes |
|---|---|---|
| N | 100000 | 200000 |
| NB | 120 | 120 |
| P | 4 | 8 |
| Q | 16 | 32 |
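To relate these problem sizes to memory and page counts, the sketch below computes the footprint of the N x N double-precision matrix that Linpack factors (8 bytes per element). The helper name matrix_footprint is hypothetical, and the estimate ignores workspace and communication buffers.

```shell
# Memory needed by an N x N matrix of doubles, and how many pages back it.
matrix_footprint() {
    awk -v n="$1" 'BEGIN {
        bytes = n * n * 8
        printf "N=%d: %.1f GiB, %d pages of 4KB, %d pages of 16MB\n",
               n, bytes / 2^30, bytes / 4096, int((bytes + 2^24 - 1) / 2^24)
    }'
}

matrix_footprint 100000   # 1 node
matrix_footprint 200000   # 4 nodes
```

Assuming the matrix is spread evenly, the four-node problem uses about the same memory per node as the single-node problem, which fits within the 128 GB of each system. It also shows why page size matters: the same matrix needs millions of 4KB pages but only a few thousand 16MB pages.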
The following command is used to run Linpack on a single node:
Sixty-four processes (-np 64) run on the same local node, each affinitized to a particular processor. Since only one node is involved in the run, shared memory is used by default.
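A single-node command of this kind might look like the sketch below. The process count is derived from P x Q in the parameter table, mpi_paffinity_alone is the affinity parameter assumed for Open MPI 1.2, and ./xhpl is an assumed path to the HPL executable.

```shell
# The process count must equal P x Q from the HPL input (4 x 16 for one node).
P=4
Q=16
NP=$(( P * Q ))

# Printed rather than executed.
SINGLE_NODE_CMD="mpirun -np ${NP} --mca mpi_paffinity_alone 1 ./xhpl"
echo "$SINGLE_NODE_CMD"
```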
The following command is used to run Linpack on four nodes:
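A four-node command of this kind might look like the sketch below, combining the machine file, the affinity parameter and the transport selection discussed earlier. The paths ./machines and ./xhpl are placeholders, and the exact options of the original run are not known from the text.

```shell
# The process count must equal P x Q from the HPL input (8 x 32 for four nodes).
P=8
Q=32
NP=$(( P * Q ))    # 256 processes, 64 per node

# Printed rather than executed.
FOUR_NODE_CMD="mpirun -np ${NP} -machinefile ./machines --mca btl openib,sm,self --mca mpi_paffinity_alone 1 ./xhpl"
echo "$FOUR_NODE_CMD"
```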
We have the following results with bss/data/text segments and dynamic memory backed by 16MB pages via Libhugetlbfs:
| Number of IBM Power 575 systems | Linpack Score (Gflops) |
|---|---|
| 1 | 488 |
| 4 | 1975 |
The scaling factor is about 4, demonstrating a near linear scaling.
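The scaling factor can be checked with a short calculation from the two scores above. The helper name scaling is hypothetical.

```shell
# Speedup and parallel efficiency from a one-node and a multi-node score.
scaling() {
    awk -v one="$1" -v many="$2" -v nodes="$3" 'BEGIN {
        printf "speedup %.2f, parallel efficiency %.1f%%\n",
               many / one, 100 * many / (nodes * one)
    }'
}

scaling 488 1975 4
```

The efficiency comes out slightly above 100%. One plausible reason is that the four-node run uses a larger problem size (N = 200000 rather than 100000), so the two runs are not solving the same problem.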
The following table shows the performance gain on four nodes from backing the matrix with 16MB pages instead of 4KB pages, while the bss/data/text segments remain on 16MB pages in both cases.
| Page Size | Linpack Score (Gflops) |
|---|---|
| 4KB | 1640 |
| 16MB | 1975 |
The performance improvement of using 16MB pages is about 20%. Therefore, the use of gcc to resolve malloc against glibc is an important method to achieve high performance with SLES 10 SP2.
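The improvement follows directly from the two scores in the table. The helper name gain_percent is hypothetical.

```shell
# Relative gain, in percent, of a new score over a baseline score.
gain_percent() {
    awk -v base="$1" -v new="$2" 'BEGIN { printf "%.1f\n", 100 * (new - base) / base }'
}

gain_percent 1640 1975
```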
Conclusion
High performance applications can be developed with Open MPI, InfiniBand and Libhugetlbfs under SLES 10 SP2 in a single system or a cluster environment. In this initial study, our data indicate near linear scaling from one IBM Power 575 system to four such systems for running Linpack. We also have a linking method that allows dynamic memory to be backed by 16MB pages via Libhugetlbfs, which shows a 20% performance gain over 4KB pages.
We recognize that Linpack does not put a lot of stress on communication subsystems. The near linear scaling may not be realized for workloads with intense communication. We will continue examining these components on a larger cluster and other workloads to quantify cluster scalability.
References
[1] Open MPI, http://www.open-mpi.org/
[2] OpenFabrics Alliance, http://www.openfabrics.org/
[3] Libhugetlbfs, http://sourceforge.net/projects/libhugetlbfs
[4] Linpack benchmark, http://www.top500.org/project/linpack
[5] IBM XL C and C++ Compilers, http://www-01.ibm.com/software/awdtools/xlcpp/
[6] IBM XL Fortran Compiler, http://www-01.ibm.com/software/awdtools/fortran/
[7] IBM Engineering and Scientific Subroutine Library, http://www-03.ibm.com/systems/p/software/essl/index.html
[8] Pacheco, Peter S., Parallel Programming with MPI, Morgan Kaufmann, 1997.
[9] Wong, Peter, A Performance Evaluation of 64KB Pages on Linux for Power Systems, http://www.ibm.com/developerworks/wikis/display/hpccentral/A+Performance+Evaluation+of+64KB+Pages+on+Linux+for+Power+Systems
[10] Buros, B., Leverage transparent huge pages on Linux on POWER, http://www.ibm.com/developerworks/systems/library/es-lop-leveragepages/
[11] Maui Cluster Scheduler, http://www.clusterresources.com/pages/products/maui-cluster-scheduler.php
Acknowledgement
The author would like to thank Brad Benton for his advice on various aspects of Open MPI; Shirley Ma, George Chochia, Stefan Roscher, John Lewars and Pradeep Satyanarayan for their help with configuring and monitoring InfiniBand; and Farid Parpia for many discussions on process affinity in clusters. The comments from Bill Buros, Richard Treumann, Jeroen van Hoof and Dan Jones are appreciated for enhancing the quality of the article.
Author
Peter W. Wong is a member of the Linux Performance Team in IBM. He has been a performance analyst for fourteen years in the areas of Java graphics, graphical user interface, data warehousing and high performance computing. He holds a Ph.D. degree in computer science from The Ohio State University.