
How do I check Open MPI 1.1.2 (InfiniBand support) using the Intel MPI Benchmark?

Troubleshooting


Problem

How do I check Open MPI 1.1.2 (InfiniBand support) using the Intel MPI Benchmark?

Resolving The Problem

You can get the latest Intel MPI Benchmark here:
https://software.intel.com/en-us/articles/intel-mpi-benchmarks/
This is a tar.gz archive that expands into a directory called IMB_3.0.
Note that there is a Readme_first file in the root directory and a ReadMe_IMB.txt in the doc directory.
PDF documentation is also available in the doc directory.
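For example, assuming the downloaded archive is named IMB_3.0.tgz (the exact file name may differ depending on the version you download), expanding it and finding the documentation looks like this:

tar -xzf IMB_3.0.tgz       # expands into the IMB_3.0 directory
cd IMB_3.0
ls Readme_first doc src    # Readme_first in the root, ReadMe_IMB.txt and the PDF under doc/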

Here is an example of how to check the Open MPI installation using the Intel MPI Benchmark.
To recompile the benchmark, we did the following (a consolidated sketch follows this list):
- go to the src directory
- copy the MPICH makefile: cp -p make_mpich make_ompi
- set the correct value for MPI_HOME in make_ompi: MPI_HOME=/share/apps/openmpi-1.1.4
- build the binary: make -f make_ompi
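Put together, the rebuild looks like this (a sketch; adjust MPI_HOME to wherever your Open MPI is installed):

cd IMB_3.0/src
cp -p make_mpich make_ompi
# edit make_ompi so that MPI_HOME points at your Open MPI installation, e.g.:
#   MPI_HOME=/share/apps/openmpi-1.1.4
make -f make_ompi          # builds the IMB-MPI1 binary used in the runs below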

To run it, we did the following (from the master node):
- For TCP: mpirun --mca btl tcp -np 2 -machinefile hosts --prefix /share/apps/openmpi-1.1.4 ./IMB-MPI1 | tee bench_tcp.txt
- For InfiniBand: mpirun --mca btl mvapi -np 2 -machinefile hosts --prefix /share/apps/openmpi-1.1.4 ./IMB-MPI1 | tee bench_ib.txt
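The machinefile passed with -machinefile is just a plain list of node names, one per line; a minimal two-node example (the hostnames here are hypothetical) looks like this:

# hosts - one node name per line
node01
node02

To see which BTL components were actually built into your Open MPI installation, you can also run:

ompi_info | grep btl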

The hosts file listed the two nodes that were free at the time we ran the test.
By default (if you do not specify anything for the btl MCA parameter), the InfiniBand interconnect is used.
As the results below show, you can tell for sure which interconnect was used:
- For TCP:

#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------

 

    #bytes  #repetitions      t[usec]   Mbytes/sec
         0          1000        52.87         0.00
         1          1000        52.28         0.02
         2          1000        52.45         0.04
         4          1000        51.73         0.07
         8          1000        52.61         0.15
        16          1000        53.49         0.29
        32          1000        55.22         0.55
        64          1000        57.54         1.06
       128          1000        62.74         1.95
       256          1000        75.00         3.26
       512          1000        98.59         4.95
      1024          1000       146.96         6.65
      2048          1000       217.79         8.97
      4096          1000       310.34        12.59
      8192          1000       499.64        15.64
     16384          1000       876.98        17.82
     32768          1000      1633.44        19.13
     65536           640      3361.44        18.59
    131072           320      6354.76        19.67
    262144           160     12208.69        20.48
    524288            80     23755.64        21.05
   1048576            40     45732.45        21.87
   2097152            20     90069.25        22.21
   4194304            10    178885.54        22.36
- For InfiniBand:
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------

 

    #bytes  #repetitions      t[usec]   Mbytes/sec
         0          1000         5.40         0.00
         1          1000         5.51         0.17
         2          1000         5.54         0.34
         4          1000         5.79         0.66
         8          1000         5.67         1.35
        16          1000         5.79         2.63
        32          1000         5.98         5.10
        64          1000         7.06         8.65
       128          1000         7.43        16.43
       256          1000         7.90        30.92
       512          1000         9.01        54.16
      1024          1000        11.25        86.79
      2048          1000        13.98       139.69
      4096          1000        18.88       206.91
      8192          1000        28.49       274.21
     16384          1000        58.90       265.29
     32768          1000        97.15       321.67
     65536           640       173.16       360.95
    131072           320       305.61       409.01
    262144           160       583.30       428.59
    524288            80      1150.77       434.49
   1048576            40      2190.19       456.58
   2097152            20      4428.65       451.60
   4194304            10      8703.80       459.57

We then submitted four jobs through LSF/HPC:

[mbozzore@dr10 src]$ bsub -o%J.out -a openmpi -n 2 -R "span[ptile=1]" mpirun.lsf --mca btl tcp --prefix /share/apps/openmpi-1.1.4 ./IMB-MPI1
Job <4188> is submitted to default queue .
[mbozzore@dr10 src]$ bsub -o%J.out -a openmpi -n 2 -R "span[ptile=1]" mpirun.lsf --mca btl mvapi --prefix /share/apps/openmpi-1.1.4 ./IMB-MPI1
Job <4189> is submitted to default queue .
[mbozzore@dr10 src]$ bsub -o%J.out -a openmpi -n 2 mpirun.lsf --mca btl mvapi --prefix /share/apps/openmpi-1.1.4 ./IMB-MPI1
Job <4190> is submitted to default queue .
[mbozzore@dr10 src]$ bsub -o%J.out -a openmpi -n 2 mpirun.lsf --mca btl tcp --prefix /share/apps/openmpi-1.1.4 ./IMB-MPI1
Job <4191> is submitted to default queue .
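While the four jobs are pending or running, they can be followed with the usual LSF commands (a quick sketch using the job IDs from the transcript above):

bjobs 4188 4189 4190 4191    # show the state of the four benchmark jobs
bpeek 4190                   # peek at the stdout of a running job

Once they finish, the -o%J.out option leaves each job's results in 4188.out through 4191.out.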

Our compute nodes are dual-CPU (not dual-core), which is why we launched four jobs.
For the first two jobs, we asked for one CPU per node ( -R "span[ptile=1]" ), so the two ranks ran on different nodes.
We did not specify anything for the other two jobs, so both ranks were executed on the same node.
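A quick way to compare the four runs afterwards is to pull the PingPong table out of each output file, for example:

for j in 4188 4189 4190 4191; do
    echo "=== job $j ==="
    grep -A 30 "Benchmarking PingPong" $j.out    # print the PingPong table from each job's output
done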
Interestingly, if you look at the results, you will see that:
- intranode communication over IB is pretty bad,
- intranode communication over TCP is good,
- internode communication over IB is "normal",
- internode communication over TCP is, as usual, pretty bad.

The likely causes are:
- For case 1 (intranode IB): HCA loopback is used, so bandwidth is limited by the PCI bus.
- For case 2 (intranode TCP): we think Open MPI uses a shared-memory model instead of loopback.
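To check the shared-memory explanation, you can pin Open MPI's transport explicitly for a single-node run (a sketch; sm is Open MPI's shared-memory BTL and self handles a rank sending to itself):

mpirun --mca btl sm,self -np 2 --prefix /share/apps/openmpi-1.1.4 ./IMB-MPI1 PingPong | tee bench_sm.txt

Comparing bench_sm.txt with the intranode TCP numbers shows how much of that result comes from shared memory.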

The following files are attached:
bench_ib.txt : benchmark results using InfiniBand
bench_tcp.txt : benchmark results using TCP
make_ompi : the makefile used to recompile the benchmark with Open MPI 1.1.4
4188.out : output file for job 4188
4189.out : output file for job 4189
4190.out : output file for job 4190
4191.out : output file for job 4191

[{"Product":{"code":"SSDV85","label":"Platform Cluster Manager"},"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Component":"--","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"4.1.1","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}},{"Product":{"code":"SSZUCA","label":"IBM Spectrum Cluster Foundation"},"Business Unit":{"code":"BU053","label":"Cloud & Data Platform"},"Component":null,"Platform":[{"code":"","label":""}],"Version":"","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

Document Information

More support for:
Platform Cluster Manager

Software version:
4.1.1

Document number:
672815

Modified date:
09 September 2018

UID

isg3T1014151