Troubleshooting
Problem
How do I recompile and install charm++ with Open-MPI?
Resolving The Problem
The following procedure builds charm++ successfully against Open-MPI.
Note that if you build the plain mpi-linux distribution, the build completes cleanly and the basic test passes, but the megatest fails with the following error:
-------------------------------------
[toor3@frontend megatest]$ mpirun -np 2 --machinefile hosts --prefix $MPIHOME ./pgm
Megatest is running on 2 processors.
test 0: initiated [bitvector (jbooth)]
test 0: completed (0.00 sec)
test 1: initiated [immediatering (gengbin)]
test 1: completed (0.16 sec)
test 2: initiated [callback (olawlor)]
[compute-1-5:15906] *** Process received signal ***
[compute-1-5:15906] Signal: Segmentation fault (11)
[compute-1-5:15906] Signal code: Address not mapped (1)
[compute-1-5:15906] Failing at address: 0xffffffffbfffef18
[compute-1-5:15906] [ 0] /lib64/tls/libpthread.so.0 [0x37b470c420]
[compute-1-5:15906] [ 1] ./pgm [0x52958c]
[compute-1-5:15906] [ 2] ./pgm(qt_args+0x8b) [0x529666]
[compute-1-5:15906] *** End of error message ***
mpirun noticed that job rank 0 with PID 15906 on node compute-1-5 exited on signal 11 (Segmentation fault).
1 additional process aborted (not shown)
------------------------------------------
Avoid the charmrun script. Instead, schedule jobs through Lava or LSF/HPC, which makes more efficient use of your cluster.
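As a rough illustration of running under a scheduler rather than charmrun, the sketch below turns a scheduler-provided host list into a machinefile for mpirun. It assumes LSF exports the allocated hosts in the LSB_HOSTS environment variable as a space-separated list with one entry per slot. The helper name write_machinefile and the job script name are hypothetical, and your site configuration may differ.

```shell
# write_machinefile is a hypothetical helper, not part of LSF or charm++.
# $1: space-separated host list (assumed to come from LSB_HOSTS)
# $2: path of the machinefile to write, one host per line
write_machinefile() {
    # $1 is left unquoted on purpose so the shell splits it on whitespace
    printf '%s\n' $1 > "$2"
}

# Usage inside a job script submitted with, for example, "bsub -n 2 ./job.sh":
# write_machinefile "$LSB_HOSTS" hosts
# mpirun -np 2 --machinefile hosts --prefix $MPIHOME ./pgm
```

A host that appears more than once in the list is written more than once, so each allocated slot keeps its own line in the machinefile.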
Detailed procedure:
1- Load the Open-MPI (GNU) module :
module load hpc/ompi12-gnu
2- Remove the -lmpich link flag from the mpi-linux configuration file :
The file is : include/conv-mach.sh
Here is the diff against the original file (the modified file is on the "<" side) :
------------------------------------------
[toor3@frontend charm-5.9]$ diff include/conv-mach.sh include/conv-mach.sh.orig
9c9
< CMK_LIBS='-lckqt '
---
> CMK_LIBS='-lckqt -lmpich '
------------------------------------------
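The edit shown in the diff can also be scripted. The following is a minimal sketch, assuming the CMK_LIBS line looks exactly like the one in the diff. The helper name strip_mpich is hypothetical and not part of charm++. It keeps the unmodified file as conv-mach.sh.orig, matching the file names used above.

```shell
# strip_mpich is a hypothetical helper, not part of charm++.
# $1: path to conv-mach.sh
# Saves the original as "$1.orig", then rewrites "$1" without -lmpich.
strip_mpich() {
    conf="$1"
    cp "$conf" "$conf.orig" || return 1
    # Only touch the CMK_LIBS line; drop the flag and the spaces after it
    sed '/^CMK_LIBS=/s/-lmpich *//' "$conf.orig" > "$conf"
}

# Usage, from the top of the charm++ source tree:
# strip_mpich include/conv-mach.sh
# diff include/conv-mach.sh include/conv-mach.sh.orig
```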
3- Build the correct tree for make (using pthreads):
./build charm++ mpi-linux pthreads
4- Build the distribution :
cd mpi-linux-pthreads ; make
5- Test the build :
cd mpi-linux-pthreads/tests/charm++/megatest
make
Then run the megatest, for example with a ../hosts file listing 2 free compute nodes :
mpirun -np 2 --machinefile ../hosts --prefix $MPIHOME ./pgm
-------------------------------------------
Megatest is running on 2 processors.
test 0: initiated [bitvector (jbooth)]
test 0: completed (0.00 sec)
test 1: initiated [immediatering (gengbin)]
test 1: completed (0.16 sec)
test 2: initiated [callback (olawlor)]
test 2: completed (0.00 sec)
test 3: initiated [reduction (olawlor)]
test 3: completed (0.00 sec)
test 4: initiated [inherit (olawlor)]
test 4: completed (0.04 sec)
test 5: initiated [templates (milind)]
test 5: completed (0.00 sec)
test 6: initiated [statistics (olawlor)]
test 6: completed (0.00 sec)
test 7: initiated [rotest (milind)]
test 7: completed (0.00 sec)
test 8: initiated [priotest (mlind)]
test 8: completed (0.00 sec)
test 9: initiated [priomsg (fang)]
test 9: completed (0.00 sec)
test 10: initiated [marshall (olawlor)]
test 10: completed (0.04 sec)
test 11: initiated [migration (jackie)]
test 11: completed (0.00 sec)
test 12: initiated [queens (jackie)]
test 12: completed (0.03 sec)
test 13: initiated [packtest (fang)]
test 13: completed (0.00 sec)
test 14: initiated [tempotest (fang)]
test 14: completed (0.00 sec)
test 15: initiated [arrayring (fang)]
test 15: completed (0.04 sec)
test 16: initiated [fib (jackie)]
test 16: completed (0.01 sec)
test 17: initiated [synctest (mjlang)]
test 17: completed (0.03 sec)
test 18: initiated [nodecast (milind)]
test 18: completed (0.00 sec)
test 19: initiated [groupcast (mjlang)]
test 19: completed (0.00 sec)
test 20: initiated [varraystest (milind)]
test 20: completed (0.00 sec)
test 21: initiated [varsizetest (mjlang)]
test 21: completed (0.00 sec)
test 22: initiated [nodering (milind)]
test 22: completed (0.02 sec)
test 23: initiated [groupring (milind)]
test 23: completed (0.02 sec)
test 24: initiated [multi immediatering (gengbin)]
test 24: completed (0.20 sec)
test 25: initiated [multi callback (olawlor)]
test 25: completed (0.00 sec)
test 26: initiated [multi reduction (olawlor)]
test 26: completed (0.01 sec)
test 27: initiated [multi statistics (olawlor)]
test 27: completed (0.01 sec)
test 28: initiated [multi priotest (mlind)]
test 28: completed (0.00 sec)
test 29: initiated [multi priomsg (fang)]
test 29: completed (0.00 sec)
test 30: initiated [multi marshall (olawlor)]
test 30: completed (0.13 sec)
test 31: initiated [multi migration (jackie)]
test 31: completed (0.01 sec)
test 32: initiated [multi packtest (fang)]
test 32: completed (0.00 sec)
test 33: initiated [multi tempotest (fang)]
test 33: completed (0.01 sec)
test 34: initiated [multi arrayring (fang)]
test 34: completed (0.08 sec)
test 35: initiated [multi fib (jackie)]
test 35: completed (0.07 sec)
test 36: initiated [multi synctest (mjlang)]
test 36: completed (0.09 sec)
test 37: initiated [multi nodecast (milind)]
test 37: completed (0.00 sec)
test 38: initiated [multi groupcast (mjlang)]
test 38: completed (0.00 sec)
test 39: initiated [multi varraystest (milind)]
test 39: completed (0.00 sec)
test 40: initiated [multi varsizetest (mjlang)]
test 40: completed (0.00 sec)
test 41: initiated [multi nodering (milind)]
test 41: completed (0.03 sec)
test 42: initiated [multi groupring (milind)]
test 42: completed (0.03 sec)
test 43: initiated [all-at-once]
test 43: completed (0.19 sec)
All tests completed, exiting
End of program
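When the megatest runs unattended, its output can be checked automatically. The sketch below treats a run as successful only if the log contains the final "All tests completed, exiting" line and no segmentation fault report, the two markers visible in the outputs above. The helper name megatest_passed and the log file name are hypothetical.

```shell
# megatest_passed is a hypothetical helper, not part of charm++.
# $1: file holding the captured megatest output
# Succeeds only if the success line is present and no segfault was reported.
megatest_passed() {
    log="$1"
    grep -q 'All tests completed, exiting' "$log" &&
        ! grep -q 'Segmentation fault' "$log"
}

# Usage:
# mpirun -np 2 --machinefile ../hosts --prefix $MPIHOME ./pgm > megatest.log 2>&1
# megatest_passed megatest.log && echo "megatest OK"
```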
Document Information
More support for:
IBM Spectrum Cluster Foundation
Software version:
4.1.1
Document number:
701989
Modified date:
11 September 2018
UID
isg3T1014102