Question & Answer
Question
You find that an MPI communication routine (MPI_Bcast) as implemented in Spectrum_MPI/10.1.0 performs poorly on Paragon (IBM Power 8, ppc64le) compared with the same routine in Intel_MPI on ScaffelPike (x86 system).
Steps to reproduce:
Load Spectrum_MPI and profile the following code:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(NULL, NULL);

    // Get the number of processes
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Get the rank of the process
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Get the name of the processor
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    // Broadcast a single int from rank 0, one million times
    int v1[1];
    int i;
    for (i = 0; i < 1000000; i++) {
        v1[0] = i;
        MPI_Bcast(v1, 1, MPI_INT, 0, MPI_COMM_WORLD);
    }

    // Print off a hello world message
    if (v1[0] == 99)
        printf("Hello world from processor %s, rank %d out of %d processors\n",
               processor_name, world_rank, world_size);

    // Finalize the MPI environment.
    MPI_Finalize();
}
For the benchmark provided, there is high variability in the execution time per rank. This variability should not be seen in general.
Furthermore, the results on your Intel platform, and on the Power8 platform when using the MXM communication library, are in line with expectations.
Only results on Power with the default communication layer (PAMI) are off, by a significant factor.
Data should be obtained over multiple nodes and with a high number of ranks to see the problem. You are testing with 96 ranks over 6 nodes (16 ranks per node).
Attached is the output of mpitrace, an MPI tracing library that was developed internally at IBM and is available in the latest version of Spectrum MPI. Any other tracing library will also suffice for this experiment.
A summary of the variance is highlighted below:
- Spectrum MPI with IBM's PAMI communication layer:
Histogram of times spent in MPI
time-bin #ranks
7.940 1
8.746 0
9.551 5
10.356 0
11.162 0
11.967 0
12.772 9
13.577 9
14.383 12
15.188 18
15.993 18
16.798 0
17.604 6
18.409 0
19.214 18
- Spectrum MPI with OpenMPI's MXM communication layer:
Histogram of times spent in MPI
time-bin #ranks
3.402 8
3.404 15
3.407 9
3.409 0
3.412 1
3.414 33
3.416 14
3.419 0
3.421 0
3.423 0
3.426 0
3.428 0
3.431 0
3.433 12
3.435 4
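To put a number on the difference between the two histograms, the following minimal sketch computes the relative spread between the first and last occupied time bins. It assumes the printed bin labels can be treated as approximate per-rank times, which the trace output does not state explicitly; the type and function names (HistBin, total_ranks, occupied_spread) are hypothetical and are not part of mpitrace.

```c
#include <assert.h>
#include <stddef.h>

/* One row of a "time-bin / #ranks" histogram (hypothetical type name). */
typedef struct {
    double time_bin; /* bin label exactly as printed in the trace output */
    int ranks;       /* number of ranks reported for this bin */
} HistBin;

/* Values copied from the PAMI histogram above. */
static const HistBin PAMI_HIST[15] = {
    { 7.940,  1}, { 8.746,  0}, { 9.551,  5}, {10.356,  0}, {11.162,  0},
    {11.967,  0}, {12.772,  9}, {13.577,  9}, {14.383, 12}, {15.188, 18},
    {15.993, 18}, {16.798,  0}, {17.604,  6}, {18.409,  0}, {19.214, 18}
};

/* Values copied from the MXM histogram above. */
static const HistBin MXM_HIST[15] = {
    {3.402,  8}, {3.404, 15}, {3.407,  9}, {3.409,  0}, {3.412,  1},
    {3.414, 33}, {3.416, 14}, {3.419,  0}, {3.421,  0}, {3.423,  0},
    {3.426,  0}, {3.428,  0}, {3.431,  0}, {3.433, 12}, {3.435,  4}
};

/* Sum the rank counts over all bins (sanity check against the job size). */
int total_ranks(const HistBin *h, size_t n)
{
    assert(h != NULL);
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += h[i].ranks;
    return sum;
}

/* Relative spread between the first and last occupied bins:
 * (last - first) / first. Returns -1.0 if no bin is occupied. */
double occupied_spread(const HistBin *h, size_t n)
{
    assert(h != NULL);
    size_t first = n, last = 0;
    for (size_t i = 0; i < n; i++) {
        if (h[i].ranks > 0) {
            if (first == n)
                first = i;
            last = i;
        }
    }
    if (first == n)
        return -1.0;
    return (h[last].time_bin - h[first].time_bin) / h[first].time_bin;
}
```

With these values, both histograms account for all 96 ranks. The spread is roughly 1.42 (142%) for PAMI and roughly 0.0097 (under 1%) for MXM.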
Answer
1) PAMI is the point-to-point and one-sided communication layer of Spectrum MPI (SMPI); it uses libColl for its collective algorithms. PAMI delivers better point-to-point and one-sided communication performance than alternatives such as openib. MXM is a special case: when your MPI job calls bcast with MXM, the call is actually routed into HCOLL/FCA, the optimized collectives from Mellanox. Most of the HCOLL/FCA optimization work is done on the InfiniBand switch, so it is difficult to beat for collectives at any message length.
2) MXM and HCOLL/FCA are not open source. They are Mellanox products released in the Mellanox OFED drivers.
3) This has been tested on a Power 9 platform with a newer version of Spectrum MPI, and the results appear to be the same as on the Power 8 platform.
Here are the test environment and results for your reference.
host: power 9
cpu version: 2.2
spectrum_mpi version: 10.2.0.09rtm2
application used: Intel IMB
You can download or clone Intel IMB from https://github.com/intel/mpi-benchmarks
Command line used:
for pami: mpirun -pami -hostlist f3n17:20,f3n18:20 -aff=on IMB-MPI1 -npmin 40 bcast
for mxm: mpirun -mxm -hostlist f3n17:20,f3n18:20 -aff=on IMB-MPI1 -npmin 40 bcast
Result:
The results show that, for message sizes below 2 KB, PAMI has a wider gap between t_min and t_max than MXM. From 2 KB upward, however, PAMI has lower t_max and t_avg than MXM.
PAMI
#----------------------------------------------------------------
# Benchmarking Bcast
# #processes = 40
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.07 0.16 0.10
1 1000 4.09 8.75 7.11
2 1000 2.29 8.63 5.72
4 1000 1.66 8.39 5.26
8 1000 1.71 8.41 5.23
16 1000 1.71 8.45 5.24
32 1000 1.77 8.66 5.37
64 1000 2.94 8.59 6.35
128 1000 2.89 8.50 6.33
256 1000 2.81 8.61 6.55
512 1000 2.80 8.64 6.69
1024 1000 2.86 9.35 7.10
2048 1000 2.97 10.27 7.96
4096 1000 6.66 14.38 11.68
8192 1000 8.02 18.21 14.80
16384 1000 10.80 26.97 21.03
32768 1000 15.74 40.48 32.35
65536 640 14.89 47.97 35.51
131072 320 16.87 61.10 49.32
262144 160 24.97 117.87 89.29
524288 80 37.60 202.70 166.48
1048576 40 60.34 500.64 328.01
2097152 20 108.74 769.06 615.53
4194304 10 279.38 1807.67 1231.41
MXM
#----------------------------------------------------------------
# Benchmarking Bcast
# #processes = 40
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.07 0.15 0.09
1 1000 4.25 6.55 5.14
2 1000 4.09 6.53 5.09
4 1000 4.22 6.59 5.16
8 1000 4.25 6.45 5.10
16 1000 4.02 6.35 4.93
32 1000 4.07 5.89 4.79
64 1000 4.01 5.21 4.41
128 1000 4.07 5.44 4.86
256 1000 4.37 5.74 5.09
512 1000 4.53 6.59 5.51
1024 1000 4.82 8.14 6.26
2048 1000 5.13 14.68 9.83
4096 1000 5.97 23.16 15.39
8192 1000 10.41 31.22 22.62
16384 1000 6.93 40.84 26.38
32768 1000 7.51 67.98 43.99
65536 640 8.43 119.95 79.02
131072 320 10.39 226.55 150.71
262144 160 14.64 445.97 295.92
524288 80 21.70 875.40 593.36
1048576 40 36.36 1734.54 1195.67
2097152 20 76.11 3442.37 2394.19
4194304 10 361.14 6851.70 4816.63
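The crossover point can be read off the two tables programmatically. The sketch below uses only the t_avg values for 256 to 8192 bytes, copied from the tables above, and finds the smallest message size from which PAMI's t_avg stays below MXM's for every remaining row. The type and function names (BcastRow, pami_crossover_bytes) are hypothetical, and the rows are assumed to be sorted by increasing message size.

```c
#include <assert.h>
#include <stddef.h>

/* t_avg for one message size under both layers (hypothetical type name). */
typedef struct {
    long bytes;
    double pami_avg_usec;
    double mxm_avg_usec;
} BcastRow;

/* Subset of the IMB Bcast tables above, sorted by increasing size. */
static const BcastRow BCAST_ROWS[] = {
    { 256,  6.55,  5.09},
    { 512,  6.69,  5.51},
    {1024,  7.10,  6.26},
    {2048,  7.96,  9.83},
    {4096, 11.68, 15.39},
    {8192, 14.80, 22.62}
};
static const size_t BCAST_ROW_COUNT =
    sizeof(BCAST_ROWS) / sizeof(BCAST_ROWS[0]);

/* Walk backwards from the largest size while PAMI is faster on average.
 * Returns the smallest size in that trailing run, or -1 if PAMI is not
 * faster even for the largest size. */
long pami_crossover_bytes(const BcastRow *rows, size_t n)
{
    assert(rows != NULL);
    long crossover = -1;
    for (size_t i = n; i-- > 0; ) {
        if (rows[i].pami_avg_usec < rows[i].mxm_avg_usec)
            crossover = rows[i].bytes;
        else
            break;
    }
    return crossover;
}
```

For the rows listed, pami_crossover_bytes(BCAST_ROWS, BCAST_ROW_COUNT) returns 2048. This agrees with the full tables, where PAMI's t_avg is lower than MXM's for every size from 2048 bytes upward.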
Document Information
Modified date:
30 December 2018
UID
ibm10792467