An example of using OpenMPI with SPECmpi2007
- Software Components
- The Hardware System
- OpenMPI configuration (public key)
- Configure IB & hostfile
- Example config file
- File System Node
- Compiling and Running SPEC MPI2007
- Using SLURM for remote process execution
This page provides an easy introduction to OpenMPI and the steps needed to use it with an example set of MPI applications, in this case SPEC.org's SPEC MPI2007 benchmark suite.
We demonstrate how to compile and execute the SPEC MPI2007 benchmark using IBM's eHCA InfiniBand interconnect, IBM XL compilers for C/C++ and Fortran, and OpenMPI on IBM POWER6 systems. This example utilizes one shared file system node and two execution nodes. However, these steps can be applied to any number of systems to extend the cluster across more execution nodes. In this example, OpenMPI and InfiniBand are used across the execution nodes, and NFS is used to access the shared file system node.
Additionally, example instructions for using SLURM to allocate system resources are provided as an alternative remote resource manager.
This page does not describe the SPEC MPI2007 benchmark suite itself; the suite serves only as an example set of MPI applications. See the SPEC.org site (listed in the reference section) for more details about SPEC MPI2007.
In our example, the execution nodes as well as the shared file system node are assumed to be running Red Hat Enterprise Linux 5.2 including the following software components and their dependencies:
- NFS server
- SPEC MPI2007 Benchmark
- IBM XL C/C++ Advanced Edition for Linux, V9.0
- IBM XL Fortran Advanced Edition for Linux, V11.1
- IBM XL C/C++ and Fortran Runtime Environments
- openmpi including devel packages (both 32bit and 64bit versions)
- libehca (both 32bit and 64bit versions)
NOTE: The complete compiler product is required only on the control node, which is the primary execution node: the node that issues the "runspec" command to build and run the SPEC MPI2007 benchmark. The other execution (compute) nodes need only the appropriate IBM Compiler Linux Runtime Environment. All of the execution nodes access the shared file system hosted by the file system node.
To install only the runtime components, you will simply need the five "rte" rpm files:
New users often look for the mpirun and mpicc commands. After you install the 32-bit and 64-bit openmpi and openmpi-devel rpm files, you will find them in the following locations. These paths are also specified in the example SPEC MPI2007 config file later on this page.
# find /usr -name mpirun
/usr/lib64/openmpi/1.2.5-gcc/bin/mpirun
/usr/lib/openmpi/1.2.5-gcc/bin/mpirun
# find /usr -name mpicc
/usr/lib64/openmpi/1.2.5-gcc/bin/mpicc
/usr/lib/openmpi/1.2.5-gcc/bin/mpic
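Then, to tie the IBM compilers to these wrapper scripts, set the following environment variables. In the SPEC MPI2007 config file the same variables appear with an ENV_ prefix, which tells runspec to export them; in a shell they are set without the prefix and without spaces around the equals sign:
export OMPI_CC=/opt/ibmcmp/vac/9.0/bin/xlc
export OMPI_CXX=/opt/ibmcmp/vacpp/9.0/bin/xlC
export OMPI_F77=/opt/ibmcmp/xlf/11.1/bin/xlf
export OMPI_FC=/opt/ibmcmp/xlf/11.1/bin/xlf90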
The steps documented below were performed on IBM POWER6 systems described here.
In this example, the shared file system was hosted on an IBM System p 550 connected to the execution nodes using gigabit Ethernet adapters. Each of the execution nodes has a gigabit Ethernet adapter for access to the shared NFS file system.
Two IBM Power 575 systems were used for the control node and the compute node. The systems were interconnected using the IBM InfiniBand adapters for the execution mode.
OpenMPI uses ssh, by default, for remote startup of processes. Normally a password is required for authentication to the remote host. To run the SPEC MPI2007 benchmark without interactive password prompting, it is desirable to configure DSA authentication.
This step may be omitted if SLURM is used to manage remote execution.
Caveat emptor: This example uses root and DSA keys that are not passphrase protected. Both of these practices are highly discouraged in production environments.
On the control node, execute the following as root:
- ssh-keygen -t dsa
- select defaults for key file and empty passphrase
- cd /root/.ssh
- cat id_dsa.pub >> authorized_keys
- copy the .ssh directory to the /root directory on each compute node in the cluster.
- Ensure the .ssh directory permissions are 700.
- Verify you are able to ssh to each host without providing a password.
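The key-authorization steps above can be sketched as a small shell helper. This is a minimal sketch, not the procedure's official form: the function name authorize_own_key and the host name compute1 are hypothetical, and the commands that touch the real root account are left as comments so the block is safe to run as written.

```shell
# authorize_own_key SSH_DIR
# Append SSH_DIR/id_dsa.pub to SSH_DIR/authorized_keys and restrict the
# directory to mode 700, as the steps above require.
authorize_own_key() {
    ssh_dir=$1
    if [ ! -f "$ssh_dir/id_dsa.pub" ]; then
        echo "no public key found in $ssh_dir" >&2
        return 1
    fi
    cat "$ssh_dir/id_dsa.pub" >> "$ssh_dir/authorized_keys"
    chmod 700 "$ssh_dir"
}

# Typical sequence on the control node, as root (commented out because it
# changes real account state; compute1 is a placeholder host name):
#   ssh-keygen -t dsa                     # accept defaults, empty passphrase
#   authorize_own_key /root/.ssh
#   scp -rp /root/.ssh compute1:/root/    # repeat for each compute node
#   ssh compute1 true                     # should not prompt for a password
```

Repeating the scp and ssh lines for every compute node completes the setup; the same caveat about unprotected root keys applies.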
In these steps, host A is the control node, while host B is the first compute node. The steps apply to each of the execution nodes.
- Install the openib package included with the Red Hat Enterprise Linux 5.2 distribution.
Start the openib daemon
service openibd start
- The OpenIB kernel modules will be loaded when openibd is started.
Configure the InfiniBand network
- on host A execute ifconfig ib0 192.168.2.1
- on host B execute ifconfig ib0 192.168.2.2
After adding the new interface you may need to disable the firewall.
- execute iptables -F on both systems
Check your connection (you'll need the libibverbs-utils package for this)
- on host A ibv_rc_pingpong
- on host B ibv_rc_pingpong 192.168.2.1
On host A create the /etc/hostfile with the following contents:
NOTE 1: The exact name and location of this hostfile is unimportant, but it must be consistent with the ENV_MP_HOSTFILE parameter specified in the example SPEC MPI2007 config file. OpenMPI processing will assign processes (ranks) round-robin through the hostfile until the ranks specified have all been allocated.
NOTE 2: The hostfile may be omitted if using SLURM to manage remote execution. This is explained in more detail later on this page.
NOTE 3: If you have trouble configuring the ib0 devices, we have found an alternative approach to defining the devices that is more tolerant of the hardware configuration. For example, if the adapter is not set up with physical lines for all of its ports, it may not establish the connections. The following steps remove the driver and reload it with nr_ports=-1, which directs the driver to be more tolerant.
# rmmod ib_ehca
# modprobe ib_ehca nr_ports=-1
# ibv_devinfo
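To illustrate the round-robin assignment described in NOTE 1, the following sketch cycles ranks through a hostfile that lists one host per line. The host names hostA and hostB and the function name assign_ranks are hypothetical, and the sketch ignores slot counts and mapping options, which also influence OpenMPI's actual placement.

```shell
# assign_ranks HOSTFILE NRANKS
# Print "rank N -> host" for each rank, cycling through the hosts named in
# HOSTFILE (first field of each line; blank lines and # comments skipped).
assign_ranks() {
    awk -v n="$2" '
        NF && $1 !~ /^#/ { hosts[count++] = $1 }
        END {
            if (count == 0) exit 1
            for (r = 0; r < n; r++)
                printf "rank %d -> %s\n", r, hosts[r % count]
        }' "$1"
}

# Demonstration with a temporary two-host file (hypothetical host names).
demo_hostfile=$(mktemp)
printf 'hostA\nhostB\n' > "$demo_hostfile"
assign_ranks "$demo_hostfile" 4
rm -f "$demo_hostfile"
```

With two hosts and four ranks, ranks 0 and 2 land on hostA and ranks 1 and 3 on hostB.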
# Example SPEC MPI2007 config file
#
output_format = all
tune = base
env_vars = 1
ext = LoP64
allow_extension_override = yes
ENV_MP_HOSTFILE = /etc/hostfile # host file path
# submit (formatted in multiple lines.. should be one line)
submit = /usr/lib/openmpi/1.2.5-gcc/bin/mpirun --mca btl openib,self
  --mca mpi_yield_when_idle 1 --mca btl_openib_warn_default_gid_prefix 0
  --mca mpi_paffinity_alone 1 --hostfile $ENV_MP_HOSTFILE -np $ranks $command
#
# Compiler invocations.
#
ENV_OMPI_CC = /opt/ibmcmp/vac/9.0/bin/xlc
ENV_OMPI_CXX = /opt/ibmcmp/vacpp/9.0/bin/xlC
ENV_OMPI_F77 = /opt/ibmcmp/xlf/11.1/bin/xlf
ENV_OMPI_FC = /opt/ibmcmp/xlf/11.1/bin/xlf90
ENV_OMPI_CPPFLAGS = -I/usr/lib/openmpi/1.2.5-gcc/include -I/usr/lib/openmpi/1.2.5-gcc/include/openmpi
ENV_OMPI_CFLAGS =
ENV_OMPI_CXXFLAGS =
ENV_OMPI_FFLAGS =
ENV_OMPI_FCFLAGS =
ENV_OMPI_LDFLAGS =
ENV_OMPI_LIBS = -L/usr/lib/openmpi/1.2.5-gcc/lib -lmpi -lopen-rte -lopen-pal -lmpi_f77 -lmpi_f90 -R/usr/lib/openmpi/1.2.5-gcc/lib
CC = /usr/lib/openmpi/1.2.5-gcc/bin/mpicc
CXX = /usr/lib/openmpi/1.2.5-gcc/bin/mpicxx
FC = /usr/lib/openmpi/1.2.5-gcc/bin/mpif90
#
# Base Level Optimizations.
#
default=base=default=default:
FOPTIMIZE = -O4 -q32 -qarch=pwr6 -qipa=noobject -qipa=threads -qalias=nostd
COPTIMIZE = -O4 -q32 -qarch=pwr6 -qipa=noobject -qipa=threads -qipa=level=1
CXXOPTIMIZE = -O4 -q32 -qarch=pwr6 -qipa=noobject -qipa=threads -qstrict
default=default=default=default:
#
#
# Portability Flags.
#
# Only language-level flags, data-type selection,
# and data-space sizing are allowed here.
#
104.milc=default=default=default:
107.leslie3d=default=default=default:
FPORTABILITY = -qfixed
113.GemsFDTD=default=default=default:
115.fds4=default=default=default:
FPORTABILITY = -qfixed
CPORTABILITY = -DSPEC_MPI_LC_NO_TRAILING_UNDERSCORE
121.pop2=default=default=default:
122.tachyon=default=default=default:
126.lammps=default=default=default:
127.wrf2=default=default=default:
CPORTABILITY = -DNOUNDERSCORE
128.GAPgeofem=default=default=default:
129.tera_tf=default=default=default:
130.socorro=default=default=default:
FPORTABILITY = -qzerosize
CPORTABILITY = -DSPEC_NO_UNDERSCORE -qcpluscmt
132.zeusmp2=default=default=default:
FPPPORTABILITY = -DSPEC_SINGLE_UNDERSCORE
FPORTABILITY = -qfixed
137.lu=default=default=default:
FPORTABILITY = -qfixed
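As an illustration of what the submit entry in the config file expands to, the following sketch assembles the same mpirun command line for a given hostfile, rank count, and program. The function name build_submit and the program name ./benchmark_binary are hypothetical, and the option descriptions in the comments are brief paraphrases; the ompi_info command reports the parameters your OpenMPI build actually supports.

```shell
# build_submit HOSTFILE RANKS COMMAND...
# Print the single-line mpirun invocation used by the example config file.
build_submit() {
    hostfile=$1
    ranks=$2
    shift 2
    # btl openib,self: use only the InfiniBand and loopback transports
    # mpi_yield_when_idle 1: waiting ranks yield the processor
    # btl_openib_warn_default_gid_prefix 0: silence the default GID prefix warning
    # mpi_paffinity_alone 1: bind each rank to a processor
    echo "/usr/lib/openmpi/1.2.5-gcc/bin/mpirun" \
         "--mca btl openib,self" \
         "--mca mpi_yield_when_idle 1" \
         "--mca btl_openib_warn_default_gid_prefix 0" \
         "--mca mpi_paffinity_alone 1" \
         "--hostfile $hostfile -np $ranks $*"
}

# Only prints the command; nothing is launched.
build_submit /etc/hostfile 128 ./benchmark_binary
```

Printing the command before running it is a convenient way to confirm that the line is complete when the config file shows it wrapped across several lines.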
The SPEC MPI2007 benchmark requires all systems in the cluster to share a single file system containing the installed directory tree.
The SPEC MPI2007 benchmark must be installed on the File Server node. The directory containing the benchmark should be exported such that it is read/write-able to the nodes in the cluster that will be executing the benchmark. Assuming the benchmark is installed in /specmpi2007, the directory may be exported using the following entry in /etc/exports.
/specmpi2007 HOST_A(async,rw,no_root_squash) HOST_B(async,rw,no_root_squash)
Replace HOST_A and HOST_B with appropriate hostnames and start the NFS server.
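For clusters with more than two execution nodes, the /etc/exports entry follows the same pattern for every host. A minimal sketch that builds the entry from a list of host names is shown here; the function name exports_entry and the host names node1 and node2 are hypothetical.

```shell
# exports_entry DIR HOST...
# Print one /etc/exports line granting each HOST the options used above.
exports_entry() {
    dir=$1
    shift
    line=$dir
    for host in "$@"; do
        line="$line ${host}(async,rw,no_root_squash)"
    done
    echo "$line"
}

# Prints the entry; append it to /etc/exports by hand after reviewing it.
exports_entry /specmpi2007 node1 node2
```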
You can verify the directory is exported using the command.
Copy the example config file to /specmpi2007/config/my-mpi-test.cfg.
From each node in the cluster, mount the directory exported from the File Server node using the same mount point on each node. The rest of this example assumes the exported directory has been mounted at /specmpi2007. The following command sequence, executed on host A, compiles the SPEC MPI2007 benchmark and executes 128 ranks (processes) on your two-node cluster.
- Extract the SPEC MPI2007 kit to NFS exported directory /specmpi2007
- change directory to /specmpi2007
- run: ./install.sh
- source in the shrc file: . ./shrc
- change directory to: /specmpi2007/config
- run: runspec --config my-mpi-test.cfg --action validate --tune base --iterations 3 --ranks 128 all
The results will be located in /specmpi2007/result
The ranks are distributed across all of the nodes in the cluster (as specified earlier in /etc/hostfile). In this example, each execution node has 32 cores, and with SMT on, each execution node presents 64 CPUs to the Linux scheduler. Requesting 128 ranks therefore causes OpenMPI to distribute the 128 processes across the two execution nodes, 64 per node. With SMT turned off (32 CPUs for 32 cores on each node), the corresponding number of ranks for the two-node cluster is 64. This makes it easy to compare SMT-on and SMT-off processing for the components.
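The rank counts in this example follow from simple arithmetic: nodes times cores per node times hardware threads per core. The sketch below assumes two threads per core with SMT on, which matches the 32 cores appearing as 64 CPUs; the function name total_ranks is hypothetical.

```shell
# total_ranks NODES CORES_PER_NODE THREADS_PER_CORE
# Print the number of ranks that places one process on every CPU.
total_ranks() {
    echo $(( $1 * $2 * $3 ))
}

echo "SMT on:  $(total_ranks 2 32 2) ranks"   # prints "SMT on:  128 ranks"
echo "SMT off: $(total_ranks 2 32 1) ranks"   # prints "SMT off: 64 ranks"
```

The result is the value to pass to the --ranks option of runspec for each configuration.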
NOTE: The stack limit may need to be increased on each execution node. This may be done by adding the following line to /etc/security/limits.conf.
root soft stack 524288
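To check whether the limit is in effect on a node, you can compare the current soft stack limit with the value above. This is a minimal sketch: the function name stack_limit_ok is hypothetical, and it assumes that ulimit -s reports the limit in kilobytes (the same unit limits.conf uses for stack) or the word "unlimited". Changes to limits.conf normally apply only to new login sessions.

```shell
# stack_limit_ok CURRENT REQUIRED_KB
# Succeed if CURRENT (a number of KB, or "unlimited") is at least REQUIRED_KB.
stack_limit_ok() {
    current=$1
    required=$2
    if [ "$current" = "unlimited" ]; then
        return 0
    fi
    [ "$current" -ge "$required" ]
}

# Report on the shell this script is running in.
current_limit=$(ulimit -s)
if stack_limit_ok "$current_limit" 524288; then
    echo "stack limit OK ($current_limit)"
else
    echo "stack limit too low ($current_limit KB); expected at least 524288 KB"
fi
```

Running the check over ssh on each execution node shows whether remotely started processes inherit the raised limit.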
- For our example, we ran 64 processes on each of two IBM Power 575 systems configured with 32 cores using simultaneous multi-threading (SMT) turned on. For comparison, we also obtained results running 32 processes on each node with simultaneous multi-threading disabled. While most SPEC MPI2007 components achieved a higher score using simultaneous multi-threading, two components (121.pop2 and 130.socorro) showed a performance degradation with SMT enabled.
- During execution we encountered an intermittent problem in which a random component of the SPEC MPI2007 benchmark suite would run without ever finishing. Analysis of the system revealed that most of the execution time for the component was being spent in the mca_btl_sm.so library. We therefore added the OpenMPI --mca btl openib,self parameter to the submit command in the SPEC MPI2007 config file so that the InfiniBand BTL is used explicitly, even when sending messages to processes on the same node.
The installation and configuration of SLURM is beyond the scope of this document.
- When used in conjunction with SLURM, OpenMPI uses SLURM-native mechanisms to launch remote processes, making it unnecessary to create SSH keys and set up DSA authentication.
- The hostfile is not used and can be omitted.
- The submit line in the sample config file should be changed as follows:
# Formatted in multiple lines - should be single line
submit = /usr/bin/salloc -n $ranks /usr/lib/openmpi/1.2.5-gcc/bin/mpirun
  --mca btl openib,self --mca mpi_yield_when_idle 1
  --mca btl_openib_warn_default_gid_prefix 0 --mca mpi_paffinity_alone 1 $command
- run: runspec --config my-mpi-test.cfg --action validate --tune base --iterations 3 --ranks 128 all