Troubleshooting
Problem
How do I run Open MPI jobs over Myrinet MX10g, Gig-E, and InfiniBand?
Resolving The Problem
When the user requests more MPI processes than there are compute nodes, the shared-memory (sm) interconnect must be included as one of the BTLs so that processes placed on the same node can communicate. Here is a sample invocation and the resulting error when it is omitted:
$ mpirun -np 4 --prefix $MPIHOME --hostfile ~/hostfile --mca btl mx,self /opt/hpl/openmpi-hpl/bin/xhpl
[compute-0-2.local:05740] *** An error occurred in MPI_Send
[compute-0-2.local:05740] *** on communicator MPI_COMM_WORLD
[compute-0-2.local:05740] *** MPI_ERR_INTERN: internal error
[compute-0-2.local:05740] *** MPI_ERRORS_ARE_FATAL (goodbye)
mpirun noticed that job rank 0 with PID 5739 on node "compute-0-2" exited on signal 15.
Solution: add the sm BTL so that multiple processes on the same node can talk to each other:
$ mpirun -np 4 --prefix $MPIHOME --hostfile ~/hostfile --mca btl mx,sm,self /opt/hpl/openmpi-hpl/bin/xhpl
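If jobs are routinely launched with more processes than nodes, the BTL list can also be set persistently instead of on every command line. A minimal sketch using Open MPI's per-user MCA parameter file (the standard location is shown; adjust if your installation differs):

```
# $HOME/.openmpi/mca-params.conf
# Persistently select the Myrinet MX, shared-memory, and loopback BTLs
btl = mx,sm,self
```

With this file in place, plain `mpirun -np 4 ...` picks up the same BTL list as the `--mca btl mx,sm,self` command line above; an explicit `--mca` option on the command line still overrides the file.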
When the user runs Open MPI jobs over Ethernet on nodes that also have InfiniBand interfaces configured, Open MPI may try to establish TCP socket connections over the IPoIB interfaces instead of the desired eth0/eth1 interfaces. Here is a sample invocation and the resulting (truncated) error:
$ mpirun -hostfile ~/testdir/hosts --mca btl tcp,self ~/testdir/hello
Hello from MPI test program
Process 0 on headnode out of 2
Hello from MPI test program
Process 1 on compute-0-0.local out of 2
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0xdebdf8
[0] func:/opt/openmpi/1.1.4/lib/libopal.so.0 [0x2a9587e0e5]
[1] func:/lib64/tls/libpthread.so.0 [0x3d1a00c430]
[2] func:/opt/openmpi/1.1.4/lib/libopal.so.0 [0x2a95880729]
[3] func:/opt/openmpi/1.1.4/lib/libopal.so.0(_int_free+0x24a) [0x2a95880d7a]
[4] func:/opt/openmpi/1.1.4/lib/libopal.so.0(free+0xbf) [0x2a9588303f]
[5] func:/opt/openmpi/1.1.4/lib/libmpi.so.0 [0x2a955949ca]
[6] func:/opt/openmpi/1.1.4/lib/openmpi/mca_btl_tcp.so(mca_btl_tcp_component_close+0x34f)
[0x2a988ee8ef]
[7] func:/opt/openmpi/1.1.4/lib/libopal.so.0(mca_base_components_close+0xde)
[0x2a95872e1e]
[8] func:/opt/openmpi/1.1.4/lib/libmpi.so.0(mca_btl_base_close+0xe9)
[0x2a955e5159]
[9] func:/opt/openmpi/1.1.4/lib/libmpi.so.0(mca_bml_base_close+0x9)
[0x2a955e5029]
[10] func:/opt/openmpi/1.1.4/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_component_close+0x25)
[0x2a97f4dc55]
[11] func:/opt/openmpi/1.1.4/lib/libopal.so.0(mca_base_components_close+0xde)
[0x2a95872e1e]
[12] func:/opt/openmpi/1.1.4/lib/libmpi.so.0(mca_pml_base_close+0x69)
[0x2a955ea3e9]
[13] func:/opt/openmpi/1.1.4/lib/libmpi.so.0(ompi_mpi_finalize+0xfe)
[0x2a955ab57e]
[14] func:/root/testdir/hello(main+0x7b) [0x4009d3]
[15] func:/lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x3d1951c3fb]
[16] func:/root/testdir/hello [0x4008ca]
*** End of error message ***
Solution: explicitly exclude the InfiniBand interfaces on the command line to ensure they are not used for TCP socket communication:
$ mpirun --prefix $MPIHOME -hostfile ~/testdir/hosts --mca btl tcp,self --mca btl_tcp_if_exclude ib0,ib1 ~/testdir/hello
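Alternatively, Open MPI's btl_tcp_if_include parameter names only the interfaces that should be used, which can be simpler than enumerating every interface to exclude (the two parameters should not be combined). A sketch assuming the desired Ethernet interface is eth0; substitute your actual interface name:

```
$ mpirun --prefix $MPIHOME -hostfile ~/testdir/hosts \
    --mca btl tcp,self --mca btl_tcp_if_include eth0 ~/testdir/hello
```

Either setting can also be made persistent by adding a line such as `btl_tcp_if_exclude = ib0,ib1` to `$HOME/.openmpi/mca-params.conf`.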
Document Information
Modified date:
30 August 2019
UID
isg3T1014278