Intel MPI jobs launched with blaunch can hang at random when the requested number of cores is larger than 100. When this happens, many processes show the following status: "[blaunch]
The MPI job hangs in the cluster and never reaches the done status.
The installed Intel MPI version is too old.
Resolving The Problem
Upgrade Intel MPI to the latest version, which must be later than 4.1.0.036.
The following is an example of submitting the MPI job.
#BSUB -n t #t is the number of slots required
#BSUB -e intelmpi_%J.err
#BSUB -o intelmpi_%J.out
#BSUB -R "span[ptile=n]" #n is the number of processes to run on each host
export I_MPI_HYDRA_BRANCH_COUNT=m #m is the number of hosts
The variable I_MPI_LSF_USE_COLLECTIVE_LAUNCH=1 is required. With I_MPI_LSF_USE_COLLECTIVE_LAUNCH=1, Intel MPI uses a single blaunch -z call, which is the recommended way to run Intel MPI under LSF.
With I_MPI_LSF_USE_COLLECTIVE_LAUNCH=0, Intel MPI uses multiple blaunch -n calls, which is not stable with the current LSF and Intel MPI combination. This setting is not recommended.
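As a rough sketch of how the two variables above might be set inside the job script without hard-coding the host count, the fragment below derives m from the LSF environment variable LSB_MCPU_HOSTS, which lists each allocated host followed by its slot count. The helper name count_hosts and the fallback string "hostA 8 hostB 8" are assumptions made for illustration only; they are not part of LSF or Intel MPI.

```shell
#!/bin/sh
# Count the hosts in an LSB_MCPU_HOSTS-style string such as "hostA 8 hostB 8".
# count_hosts is a hypothetical helper name, not an LSF or Intel MPI command.
count_hosts() {
    # Intentional word splitting: the string alternates host name and slot count.
    set -- $1
    echo $(( $# / 2 ))
}

# Inside a running job LSF sets LSB_MCPU_HOSTS; the fallback is example data only.
hosts=$(count_hosts "${LSB_MCPU_HOSTS:-hostA 8 hostB 8}")

# Use the host count as the Hydra branch count and force the single blaunch -z path.
export I_MPI_HYDRA_BRANCH_COUNT="$hosts"
export I_MPI_LSF_USE_COLLECTIVE_LAUNCH=1
```

These lines would sit after the #BSUB directives and before the MPI launch command in the job script, so that the launcher inherits both variables.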
17 June 2018