
How to ensure OpenMPI follows LSF's affinity decisions

Troubleshooting


Problem

When an OpenMPI job runs in LSF with affinity requirements specified through LSF options, LSF makes the affinity decision and passes it to OpenMPI. For OpenMPI applications to honor that decision, the procedure described below must be followed.

Resolving The Problem

If the procedure is not followed, each mpirun computes its process bindings without knowledge of LSF's allocation, so two different OpenMPI jobs can end up using the same cores on a host even though LSF scheduled them onto different cores. Below is an example.


bash-4.1$ cat job.sh
#!/bin/bash
#BSUB -n 4
#BSUB -R "affinity[core(1)]"
#BSUB -m "gordonc-3"
#BSUB -e %J.e
#BSUB -o %J.o
/usr/local/bin/mpirun --report-bindings ./a.out

bash-4.1$ bsub < job.sh ;bsub < job.sh
Job <4106> is submitted to default queue <normal>.
Job <4107> is submitted to default queue <normal>.

bash-4.1$ bjobs -aff -l   # bjobs shows the two jobs were allocated different cores
Job <4106>, User <gordonc>, Project <default>, Status <RUN>, Queue <normal>, Co
...
AFFINITY:
                    CPU BINDING                  MEMORY BINDING
                    ------------------------     --------------------
HOST                TYPE  LEVEL  EXCL  IDS       POL  NUMA  SIZE
gordonc-3.gss.platf core  -      -     /0/0/0    -    -     -
gordonc-3.gss.platf core  -      -     /0/0/1    -    -     -
gordonc-3.gss.platf core  -      -     /0/1/0    -    -     -
gordonc-3.gss.platf core  -      -     /0/1/1    -    -     -
------------------------------------------------------------------------------
Job <4107>, User <gordonc>, Project <default>, Status <RUN>, Queue <normal>, Co
...
AFFINITY:
                    CPU BINDING                  MEMORY BINDING
                    ------------------------     --------------------
HOST                TYPE  LEVEL  EXCL  IDS       POL  NUMA  SIZE
gordonc-3.gss.platf core  -      -     /0/2/0    -    -     -
gordonc-3.gss.platf core  -      -     /0/2/1    -    -     -
gordonc-3.gss.platf core  -      -     /0/3/0    -    -     -
gordonc-3.gss.platf core  -      -     /0/3/1    -    -     -

The IDS column confirms that LSF allocated different cores to the two jobs: /0/0/* and /0/1/* to job 4106, and /0/2/* and /0/3/* to job 4107.

bash-4.1$ cat 4106.e   # yet mpirun --report-bindings shows both jobs bound to the same cores
[gordonc-3:22446] MCW rank 2 bound to socket 1[core 2[hwt 0]]: [./.][B/.][./.][./.]
[gordonc-3:22446] MCW rank 3 bound to socket 1[core 3[hwt 0]]: [./.][./B][./.][./.]
[gordonc-3:22446] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/.][./.][./.][./.]
[gordonc-3:22446] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B][./.][./.][./.]
bash-4.1$ cat 4107.e
[gordonc-3:22499] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B][./.][./.][./.]
[gordonc-3:22499] MCW rank 2 bound to socket 1[core 2[hwt 0]]: [./.][B/.][./.][./.]
[gordonc-3:22499] MCW rank 3 bound to socket 1[core 3[hwt 0]]: [./.][./B][./.][./.]
[gordonc-3:22499] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/.][./.][./.][./.]



Following "Best practices Using Affinity Scheduling in IBM Platform LSF" (https://www.ibm.com/developerworks/community/wikis/form/anonymous/api/wiki/99245193-fced-40e5-90df-a0e9f50a0fb0/page/359ab0d9-7849-4c6a-8cb8-7a62050b5222/attachment/7ba985b5-006f-4f06-bcfb-dafa92c4a713/media/Platform_BPG_Affinity.pdf) section "Configuring LSF to load the affinity scheduling plugin" and "Example 5: Automatically binding OpenMPI tasks"), LSF can generate rank file for OpenMPI and thus enforce the affinity.
 
$ cat $LSF_ENVDIR/lsbatch/gc101/configdir/lsb.applications
...
Begin Application
NAME = openmpi
DESCRIPTION = OpenMPI 1.8.3
DJOB_ENV_SCRIPT = openmpi_rankfile.sh
End Application
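Note: after adding the application profile to lsb.applications, reload the batch configuration so the profile takes effect, for example:

$ badmin reconfig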
$ cat job.sh
#!/bin/bash
#BSUB -n 4
#BSUB -R "affinity[core(1)]"                                           
#BSUB -m "gordonc-3"
#BSUB -app openmpi
/usr/local/bin/mpirun -rf $LSB_RANK_HOSTFILE ./a.out
$ bsub < job.sh ;bsub < job.sh
Job <4091> is submitted to default queue <normal>.
Job <4092> is submitted to default queue <normal>.
$ bjobs

JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
4091    gordonc RUN   normal     gordonc-3.g gordonc-3.g *E ./a.out Sep 26 16:07
                                             gordonc-3.gss.platformlab.ibm.com
                                             gordonc-3.gss.platformlab.ibm.com
                                             gordonc-3.gss.platformlab.ibm.com
4092    gordonc RUN   normal     gordonc-3.g gordonc-3.g *E ./a.out Sep 26 16:07
                                             gordonc-3.gss.platformlab.ibm.com
                                             gordonc-3.gss.platformlab.ibm.com
                                             gordonc-3.gss.platformlab.ibm.com
$ ps -ef|grep a.out

gordonc   1164  1159  1 16:07 ?        00:00:00 /usr/local/bin/mpirun -rf /home/gordonc/.lsbatch/1506456435.4091.hostRankFile ./a.out
gordonc   1165  1158  0 16:07 ?        00:00:00 /usr/local/bin/mpirun -rf /home/gordonc/.lsbatch/1506456435.4092.hostRankFile ./a.out
gordonc   1169  1165 99 16:07 ?        00:00:05 ./a.out
gordonc   1170  1164 99 16:07 ?        00:00:05 ./a.out
gordonc   1171  1165  3 16:07 ?        00:00:00 ./a.out
gordonc   1172  1164  4 16:07 ?        00:00:00 ./a.out
gordonc   1173  1165 99 16:07 ?        00:00:05 ./a.out
gordonc   1176  1164 99 16:07 ?        00:00:05 ./a.out
gordonc   1177  1165 99 16:07 ?        00:00:05 ./a.out
gordonc   1184  1164 99 16:07 ?        00:00:05 ./a.out
gordonc   1209 32315  0 16:07 pts/0    00:00:00 grep a.out
$ for i in {1169,1170,1171,1172,1173,1176,1177,1184}; do taskset -p $i; done   # use taskset to check the actual CPU affinity of each rank
pid 1169's current affinity mask: 10
pid 1170's current affinity mask: 1
pid 1171's current affinity mask: 20
pid 1172's current affinity mask: 2
pid 1173's current affinity mask: 40
pid 1176's current affinity mask: 4
pid 1177's current affinity mask: 80
pid 1184's current affinity mask: 8   # masks 1,2,4,8 (CPUs 0-3) belong to job 4091; masks 10,20,40,80 (CPUs 4-7) belong to job 4092
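taskset prints the affinity mask in hexadecimal. To double-check which CPU a given mask covers, a small decode loop can help (a minimal sketch; 0x10 is the mask of one of job 4092's ranks above, and the CPU range should match the host's core count):

# decode a hex affinity mask, as printed by taskset, into CPU IDs
mask=0x10
for cpu in {0..7}; do
    if (( (mask >> cpu) & 1 )); then
        echo "CPU $cpu"
    fi
done
# prints "CPU 4", i.e. mask 0x10 binds the process to CPU 4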

[{"Product":{"code":"SSETD4","label":"Platform LSF"},"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Component":"--","Platform":[{"code":"PF016","label":"Linux"}],"Version":"Version Independent","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

Document Information

Modified date:
17 June 2018

UID

isg3T1025826