IBM Support

Performance issue with Spectrum_MPI

Question & Answer


Question

You find that an MPI communication routine (MPI_Bcast) implemented in Spectrum_MPI/10.1.0 gives poor performance on Paragon (an IBM POWER8 ppc64le system) compared to the same routine in Intel_MPI on ScaffelPike (an x86 system).

Steps to reproduce:
Load Spectrum_MPI and profile the following code:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(NULL, NULL);

    // Get the number of processes
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Get the rank of the process
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Get the name of the processor
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    // Broadcast a single int from rank 0 one million times
    int v1[1];
    int i;
    for (i = 0; i < 1000000; i++) {
        v1[0] = i;
        MPI_Bcast(v1, 1, MPI_INT, 0, MPI_COMM_WORLD);
    }

    // Print off a hello world message
    if (v1[0] == 99) printf("Hello world from processor %s, rank %d out of %d processors\n", processor_name, world_rank, world_size);

    // Finalize the MPI environment.
    MPI_Finalize();
    return 0;
}
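
To build the reproducer, compile it with the MPI compiler wrapper; the source file name bcast_test.c here is illustrative:

mpicc -o bcast_test bcast_test.c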

For the benchmark provided, there is high variability in the execution time per rank; this variability should not occur in general. Furthermore, the results on your Intel platform, and on the POWER8 platform when using the MXM communication library, are in line with expectations.

Only the results on Power with the default communication layer (PAMI) are off by a significant factor.

Data should be obtained over multiple nodes and with a high number of ranks to see the problem. You are testing with 96 ranks over 6 nodes (16 ranks per node).
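
A Spectrum MPI launch along the following lines would reproduce that configuration; the host names (node1 through node6) and the executable name are illustrative:

mpirun -np 96 -hostlist node1:16,node2:16,node3:16,node4:16,node5:16,node6:16 ./bcast_test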

Attached is the output of mpitrace, an MPI tracing library developed internally at IBM and available in the latest version of Spectrum MPI; however, any tracing library will suffice for this experiment.
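
mpitrace is typically enabled by preloading its shared library at run time. A minimal sketch, assuming the library ships as libmpitrace.so under $MPI_ROOT/lib and using mpirun's Open MPI-style -x flag to export the variable (the exact path may differ on your installation):

mpirun -np 96 -hostlist node1:16,node2:16,node3:16,node4:16,node5:16,node6:16 -x LD_PRELOAD=$MPI_ROOT/lib/libmpitrace.so ./bcast_test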

A summary of the variance is highlighted below:

- Spectrum MPI with IBM's PAMI communication layer:
Histogram of times spent in MPI
   time-bin   #ranks
      7.940        1
      8.746        0
      9.551        5
     10.356        0
     11.162        0
     11.967        0
     12.772        9
     13.577        9
     14.383       12
     15.188       18
     15.993       18
     16.798        0
     17.604        6
     18.409        0
     19.214       18

- Spectrum MPI with Mellanox's MXM communication layer:
Histogram of times spent in MPI
   time-bin   #ranks
      3.402        8
      3.404       15
      3.407        9
      3.409        0
      3.412        1
      3.414       33
      3.416       14
      3.419        0
      3.421        0
      3.423        0
      3.426        0
      3.428        0
      3.431        0
      3.433       12
      3.435        4

Answer

1) PAMI is the point-to-point and one-sided communication layer of Spectrum MPI; it uses libcoll for its collective algorithms. PAMI has better point-to-point and one-sided communication performance than alternatives such as openib. MXM, however, is a special case: if your MPI job calls bcast with MXM, the call is actually routed into HCOLL/FCA, the optimized collectives library from Mellanox. Most of the HCOLL/FCA optimization work is done on the InfiniBand switch itself, so it is difficult to beat at any message length for collectives.
 
2) MXM and HCOLL/FCA are not open source. They are Mellanox products released as part of the Mellanox OFED drivers.
 
3) This has also been tested on a POWER9 platform with a newer version of Spectrum MPI, and the results look the same as on the POWER8 platform.

Here is the test environment and result for your reference.
 
host: POWER9
cpu version: 2.2
spectrum_mpi version: 10.2.0.09rtm2
application used: Intel IMB
 
You can download or clone Intel IMB from https://github.com/intel/mpi-benchmarks
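
A build sketch, assuming the repository's top-level Makefile and overriding the compiler variables to the MPI wrappers (target and variable names may differ between IMB versions):

git clone https://github.com/intel/mpi-benchmarks.git
cd mpi-benchmarks
make CC=mpicc CXX=mpicxx IMB-MPI1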
 
Command lines used:
for pami: mpirun -pami -hostlist f3n17:20,f3n18:20 -aff=on IMB-MPI1 -npmin 40 bcast
for mxm: mpirun -mxm -hostlist f3n17:20,f3n18:20 -aff=on IMB-MPI1 -npmin 40 bcast
 
Result:

You can see from the test results that for message lengths below 2 KB the min/max values under PAMI show some variation, and this variation is more pronounced than under MXM; but once the message size exceeds 2 KB, PAMI does better than MXM.
 
PAMI
#----------------------------------------------------------------
# Benchmarking Bcast
# #processes = 40
#----------------------------------------------------------------
      #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
           0         1000         0.07         0.16         0.10
           1         1000         4.09         8.75         7.11
           2         1000         2.29         8.63         5.72
           4         1000         1.66         8.39         5.26
           8         1000         1.71         8.41         5.23
          16         1000         1.71         8.45         5.24
          32         1000         1.77         8.66         5.37
          64         1000         2.94         8.59         6.35
         128         1000         2.89         8.50         6.33
         256         1000         2.81         8.61         6.55
         512         1000         2.80         8.64         6.69
        1024         1000         2.86         9.35         7.10
        2048         1000         2.97        10.27         7.96
        4096         1000         6.66        14.38        11.68
        8192         1000         8.02        18.21        14.80
       16384         1000        10.80        26.97        21.03
       32768         1000        15.74        40.48        32.35
       65536          640        14.89        47.97        35.51
      131072          320        16.87        61.10        49.32
      262144          160        24.97       117.87        89.29
      524288           80        37.60       202.70       166.48
     1048576           40        60.34       500.64       328.01
     2097152           20       108.74       769.06       615.53
     4194304           10       279.38      1807.67      1231.41

MXM
#----------------------------------------------------------------
# Benchmarking Bcast
# #processes = 40
#----------------------------------------------------------------
      #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
           0         1000         0.07         0.15         0.09
           1         1000         4.25         6.55         5.14
           2         1000         4.09         6.53         5.09
           4         1000         4.22         6.59         5.16
           8         1000         4.25         6.45         5.10
          16         1000         4.02         6.35         4.93
          32         1000         4.07         5.89         4.79
          64         1000         4.01         5.21         4.41
         128         1000         4.07         5.44         4.86
         256         1000         4.37         5.74         5.09
         512         1000         4.53         6.59         5.51
        1024         1000         4.82         8.14         6.26
        2048         1000         5.13        14.68         9.83
        4096         1000         5.97        23.16        15.39
        8192         1000        10.41        31.22        22.62
       16384         1000         6.93        40.84        26.38
       32768         1000         7.51        67.98        43.99
       65536          640         8.43       119.95        79.02
      131072          320        10.39       226.55       150.71
      262144          160        14.64       445.97       295.92
      524288           80        21.70       875.40       593.36
     1048576           40        36.36      1734.54      1195.67
     2097152           20        76.11      3442.37      2394.19
     4194304           10       361.14      6851.70      4816.63

Document Information

Modified date:
30 December 2018

UID

ibm10792467