IBM Support

How do I run Open MPI jobs over Myrinet MX10g, Gig-E, and InfiniBand?

Troubleshooting


Problem

How do I run Open MPI jobs over Myrinet MX10g, Gig-E, and InfiniBand?

Resolving The Problem


1. Open MPI over Myrinet MX10g
When more MPI processes are requested than there are compute nodes, more than one process runs on each node, so the shared memory (sm) interconnect must be included as one of the BTLs. Here's a sample invocation and the resulting error when it is omitted:

$ mpirun -np 4 --prefix $MPIHOME --hostfile ~/hostfile --mca btl mx,self /opt/hpl/openmpi-hpl/bin/xhpl
[compute-0-2.local:05740] *** An error occurred in MPI_Send
[compute-0-2.local:05740] *** on communicator MPI_COMM_WORLD
[compute-0-2.local:05740] *** MPI_ERR_INTERN: internal error
[compute-0-2.local:05740] *** MPI_ERRORS_ARE_FATAL (goodbye)
mpirun noticed that job rank 0 with PID 5739 on node "compute-0-2" exited on signal 15.

Solution: add the sm BTL so that multiple processes on the same node can communicate with each other:
$ mpirun -np 4 --prefix $MPIHOME --hostfile ~/hostfile --mca btl mx,sm,self /opt/hpl/openmpi-hpl/bin/xhpl
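Both invocations above reference a hostfile (~/hostfile). Its exact contents are not shown in this document, but a minimal Open MPI hostfile for this scenario might look like the following (the hostnames are hypothetical; "slots" tells Open MPI how many processes each node may run, which is what makes the sm BTL necessary):

```
# ~/hostfile -- hypothetical example
compute-0-1 slots=2
compute-0-2 slots=2
```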

2. Open MPI over Ethernet, including Gig-E
When running Open MPI jobs over Ethernet on nodes that also have InfiniBand interfaces configured, Open MPI may try to establish TCP socket connections over the IPoIB interfaces instead of the desired eth0/eth1 interfaces. Here's a sample invocation and the resulting (truncated) error output:

$ mpirun -hostfile ~/testdir/hosts --mca btl tcp,self ~/testdir/hello
Hello from MPI test program
Process 0 on headnode out of 2
Hello from MPI test program
Process 1 on compute-0-0.local out of 2
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0xdebdf8
[0] func:/opt/openmpi/1.1.4/lib/libopal.so.0 [0x2a9587e0e5]
[1] func:/lib64/tls/libpthread.so.0 [0x3d1a00c430]
[2] func:/opt/openmpi/1.1.4/lib/libopal.so.0 [0x2a95880729]
[3] func:/opt/openmpi/1.1.4/lib/libopal.so.0(_int_free+0x24a) [0x2a95880d7a]
[4] func:/opt/openmpi/1.1.4/lib/libopal.so.0(free+0xbf) [0x2a9588303f]
[5] func:/opt/openmpi/1.1.4/lib/libmpi.so.0 [0x2a955949ca]
[6] func:/opt/openmpi/1.1.4/lib/openmpi/mca_btl_tcp.so(mca_btl_tcp_component_close+0x34f)
[0x2a988ee8ef]
[7] func:/opt/openmpi/1.1.4/lib/libopal.so.0(mca_base_components_close+0xde)
[0x2a95872e1e]
[8] func:/opt/openmpi/1.1.4/lib/libmpi.so.0(mca_btl_base_close+0xe9)
[0x2a955e5159]
[9] func:/opt/openmpi/1.1.4/lib/libmpi.so.0(mca_bml_base_close+0x9)
[0x2a955e5029]
[10] func:/opt/openmpi/1.1.4/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_component_close+0x25)
[0x2a97f4dc55]
[11] func:/opt/openmpi/1.1.4/lib/libopal.so.0(mca_base_components_close+0xde)
[0x2a95872e1e]
[12] func:/opt/openmpi/1.1.4/lib/libmpi.so.0(mca_pml_base_close+0x69)
[0x2a955ea3e9]
[13] func:/opt/openmpi/1.1.4/lib/libmpi.so.0(ompi_mpi_finalize+0xfe)
[0x2a955ab57e]
[14] func:/root/testdir/hello(main+0x7b) [0x4009d3]
[15] func:/lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x3d1951c3fb]
[16] func:/root/testdir/hello [0x4008ca]
*** End of error message ***

Solution: explicitly exclude the InfiniBand interfaces on the command line to ensure they are not used for TCP socket communication:
$ mpirun --prefix $MPIHOME -hostfile ~/testdir/hosts --mca btl tcp,self --mca btl_tcp_if_exclude ib0,ib1 ~/testdir/hello
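The traceback above comes from a simple MPI test program. The actual source of ~/testdir/hello is not shown in this document, but a minimal sketch consistent with its output ("Hello from MPI test program" / "Process N on host out of M") might look like the following; it uses only standard MPI calls and builds with mpicc:

```c
/* Hypothetical reconstruction of ~/testdir/hello.
 * Build:  mpicc -o hello hello.c
 * Run:    mpirun -np 2 -hostfile ~/testdir/hosts ./hello
 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, namelen;
    char hostname[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
    MPI_Get_processor_name(hostname, &namelen);

    printf("Hello from MPI test program\n");
    printf("Process %d on %s out of %d\n", rank, hostname, size);

    MPI_Finalize();
    return 0;
}
```

Note that in the failing run above, the program printed its output and then crashed inside MPI_Finalize (see ompi_mpi_finalize in the traceback), which is why excluding the IPoIB interfaces from the tcp BTL resolves the segfault.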


Document Information

Modified date:
30 August 2019

UID

isg3T1014278