Submit jobs with affinity resource requirements

Submit jobs for CPU and memory affinity scheduling by specifying an affinity[] section in the resource requirement string, either on the bsub -R command line, or in the RES_REQ parameter of a queue defined in the lsb.queues file or of an application profile.
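
For example, a queue in the lsb.queues file could carry a default affinity requirement in its RES_REQ parameter (a minimal sketch; the queue name and description are placeholders):

Begin Queue
QUEUE_NAME   = affinity_q
RES_REQ      = affinity[core(1)]
DESCRIPTION  = Jobs submitted to this queue request one core per task
End Queue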

Tip: Starting in Fix Pack 14, you can set the LSF_CGROUP_CORE_AUTO_CREATE parameter to Y to enable LSF to automatically create Linux cgroups for a job, without the need to specify affinity[] requirements. With this setting, LSF automatically adds the "affinity[core(1)]" resource requirement string to the bsub -R command whenever jobs are submitted.
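
For example, the following line in the lsf.conf file enables this behavior (a minimal sketch, assuming a cluster at Fix Pack 14 or later):

LSF_CGROUP_CORE_AUTO_CREATE=Y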

The affinity[] resource requirement string controls job slot and processor unit allocation and distribution within a host.

See Affinity string for detailed syntax of the affinity[] resource requirement string.

If the JOB_INCLUDE_POSTPROC=Y parameter is set in the lsb.params file, or the LSB_JOB_INCLUDE_POSTPROC=Y environment variable is set in the job environment, LSF does not release affinity resources until post-execution processing has finished, because slots are still occupied by the job during post-execution processing. For interactive jobs, the interactive job finishes before post-execution processing completes.
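
For example, the following lsb.params setting keeps affinity resources held until post-execution processing completes (a minimal sketch; other parameters are omitted):

Begin Parameters
JOB_INCLUDE_POSTPROC = Y
End Parameters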

Examples: processor unit allocation requests

The following examples illustrate affinity jobs that request specific processor unit allocations and task distributions.

The following job asks for six slots and runs on a single host. Each slot maps to one core. LSF tries to pack the six cores as close together as possible on a single NUMA node or socket. If the task distribution cannot be satisfied, the job cannot be started.

bsub -n 6 -R "span[hosts=1] affinity[core(1):distribute=pack]" myjob

The following job asks for six slots and runs on a single host. Each slot maps to one core, but in this case the cores must be packed into a single socket; otherwise, the job remains pending:

bsub -n 6 -R "span[hosts=1] affinity[core(1):distribute=pack(socket=1)]" myjob

The following job asks for two slots on a single host. Each slot maps to two cores. The two cores for a single slot (task) must come from the same socket, and the two cores for the second slot (task) must be on a different socket:

bsub -n 2 -R "span[hosts=1] affinity[core(2, same=socket, exclusive=(socket, injob))]" myjob

The following job specifies that each task in the job requires two cores from the same socket. The allocated socket is marked exclusive for all other jobs, and each task is CPU-bound at the socket level. LSF attempts to distribute the tasks of the job so that they are balanced across all cores:

bsub -n 4 -R "affinity[core(2, same=socket, exclusive=(socket, alljobs)): cpubind=socket:distribute=balance]" myjob

Examples: CPU and memory binding requests

You can submit affinity jobs with various CPU binding and memory binding options. The following examples illustrate this.

In the following job, each of the two tasks requires five cores from the same NUMA node; LSF binds each task at the NUMA node level with mandatory memory binding:

bsub -n 2 -R "affinity[core(5,same=numa):cpubind=numa:membind=localonly]" myjob

The following job binds a multithreaded job on a single NUMA node:

bsub -n 2 -R "affinity[core(3,same=numa):cpubind=numa:membind=localprefer]" myjob

The following job distributes tasks across sockets. Each task needs two cores from the same socket and is bound at the socket level. The allocated socket is exclusive both within the job and across all jobs, so no other tasks can use it:

bsub -n 2 -R "affinity[core(2,same=socket,exclusive=(socket,injob|alljobs)): cpubind=socket]" myjob

The following job packs job tasks in one NUMA node:

bsub -n 2 -R "affinity[core(1,exclusive=(socket,injob)):distribute=pack(numa=1)]" myjob

Each task needs one core, and no other task from the same job allocates CPUs from the same socket. LSF attempts to pack all tasks of the job onto one NUMA node.

Job execution environment for affinity jobs

LSF sets several environment variables in the execution environment of each job and task. These are designed to integrate with IBM Parallel Environment and IBM Spectrum LSF MPI. However, these environment variables are available to all affinity jobs and can potentially be used by other applications. Because LSF provides the variables expected by both IBM Parallel Environment and LSF MPI, there is some redundancy: environment variables prefixed with RM_ are implemented for compatibility with IBM Parallel Environment, although LSF MPI uses them as well, while those prefixed with LSB_ are used only by LSF MPI. The two types of variables provide similar information, but in different formats.

The following variables are set in the job execution environment:
  • LSB_BIND_CPU_LIST
  • LSB_BIND_MEM_LIST
  • LSB_BIND_MEM_POLICY
  • RM_CPUTASKn
  • RM_MEM_AFFINITY
  • OMP_NUM_THREADS
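
For example, a job script can inspect these variables at run time (a minimal sketch; the values printed depend on the actual allocation):

#!/bin/sh
# Print the affinity information that LSF sets in the job environment
echo "CPU binding list:    $LSB_BIND_CPU_LIST"
echo "Memory binding list: $LSB_BIND_MEM_LIST"
echo "Memory policy:       $LSB_BIND_MEM_POLICY"
echo "Task 1 CPU list:     $RM_CPUTASK1"
echo "OpenMP threads:      $OMP_NUM_THREADS"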

Application integration

For single-host applications the application itself does not need to do anything, and only the OMP_NUM_THREADS variable is relevant.

On the first execution host of a multi-host parallel application, LSF MPI running under LSF selects CPU resources for each task and starts the LSF MPI agent (mpid), binding mpid to all allocated CPUs and memory policies. LSF sets the corresponding environment variables, including RM_CPUTASKn. LSF MPI reads RM_CPUTASKn on each host and performs the task-level binding, binding each task to the CPU list selected for that task. This is the default behavior when LSF MPI runs under LSF.

To support IBM Parallel Operating Environment jobs, LSF starts the PMD program, binds the PMD process to the allocated CPUs and memory nodes on the host, and sets RM_CPUTASKn, RM_MEM_AFFINITY, and OMP_NUM_THREADS. The IBM Parallel Operating Environment will then do the binding for individual tasks.

OpenMPI provides a rank file as the interface for users to define CPU binding information per task. The rank file includes the MPI rank, host, and CPU binding allocations per rank. LSF provides a simple script to generate an OpenMPI rank file based on LSB_AFFINITY_HOSTFILE. The following is an example of an OpenMPI rank file corresponding to the affinity host file in the description of LSB_AFFINITY_HOSTFILE:
Rank 0=Host1 slot=0,1,2,3
Rank 1=Host1 slot=4,5,6,7
Rank 2=Host2 slot=0,1,2,3
Rank 3=Host2 slot=4,5,6,7
Rank 4=Host3 slot=0,1,2,3
Rank 5=Host4 slot=0,1,2,3

The script (openmpi_rankfile.sh) is located in $LSF_BINDIR. Use the DJOB_ENV_SCRIPT parameter in an application profile in lsb.applications to configure the path to the script.
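
For example, an application profile in the lsb.applications file could reference the script as follows (a minimal sketch; the application name and description are placeholders):

Begin Application
NAME            = openmpi_affinity
DJOB_ENV_SCRIPT = openmpi_rankfile.sh
DESCRIPTION     = Generate an OpenMPI rank file from LSB_AFFINITY_HOSTFILE
End Application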

For distributed applications that use blaunch directly to launch tasks or an agent per slot (not per host), LSF by default binds each task to all allocated CPUs and memory nodes on the host. That is, the CPU and memory node lists are generated at the host level. Certain distributed applications may need the binding lists to be generated on a task-by-task basis. This behavior is configured with the LSB_DJOB_TASK_BIND=Y | N environment variable, set either in the job submission environment or in an application profile. The default is N. When this variable is set to Y, the binding list is generated for each task individually.
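
For example, you might enable task-level binding for a blaunch-based job by setting the variable in the submission environment (a minimal sketch; myjob.sh is a hypothetical script that launches its tasks with blaunch):

export LSB_DJOB_TASK_BIND=Y
bsub -n 4 -R "span[hosts=1] affinity[core(1)]" ./myjob.sh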

Examples

The following examples assume that the cluster comprises only hosts with the following topology:
Host[64.0G] HostN
    NUMA[0: 0M / 32.0G]          NUMA[1: 0M / 32.0G]
        Socket0                      Socket0
            core0(0 22)                  core0(1 23)
            core1(2 20)                  core1(3 21)
            core2(4 18)                  core2(5 19)
            core3(6 16)                  core3(7 17)
            core4(8 14)                  core4(9 15)
            core5(10 12)                 core5(11 13)
        Socket1                      Socket1
            core0(24 46)                 core0(25 47)
            core1(26 44)                 core1(27 45)
            core2(28 42)                 core2(29 43)
            core3(30 40)                 core3(31 41)
            core4(32 38)                 core4(33 39)
            core5(34 36)                 core5(35 37)
Each host has 64 GB of memory split over two NUMA nodes, each node containing two processor sockets with 6 cores each, and each core having two threads. Each example consists of the following:
  • A bsub command line with an affinity requirement
  • An allocation for the resulting job displayed as in bjobs
  • The same allocation displayed as in bhosts
  • The values of the job environment variables above once the job is dispatched
The examples cover some of the more common cases: serial and parallel jobs with simple CPU and memory requirements, as well as the effect of the exclusive clause of the affinity resource requirement string.
  1. bsub -R "affinity[core(1)]" is a serial job asking for a single core.

    The allocation shown in bjobs:
    ...
                         CPU BINDING                          MEMORY BINDING
                         ------------------------             --------------------
     HOST                TYPE   LEVEL  EXCL   IDS             POL   NUMA SIZE
     Host1               core   -      -      /0/0/0          -     -    -
    ...
    In bhosts (assuming no other jobs are on the host):
    ...
    Host[64.0G] Host1
        NUMA[0: 0M / 32.0G]          NUMA[1: 0M / 32.0G]
            Socket0                      Socket0
                core0(*0 *22)                core0(1 23)
                core1(2 20)                  core1(3 21)
                core2(4 18)                  core2(5 19)
                core3(6 16)                  core3(7 17)
                core4(8 14)                  core4(9 15)
                core5(10 12)                 core5(11 13)
            Socket1                      Socket1
                core0(24 46)                 core0(25 47)
                core1(26 44)                 core1(27 45)
                core2(28 42)                 core2(29 43)
                core3(30 40)                 core3(31 41)
                core4(32 38)                 core4(33 39)
                core5(34 36)                 core5(35 37)
    ...
    Contents of affinity host file:
    Host1 0,22
    Job environment variables:
    LSB_BIND_CPU_LIST=0,22
    RM_CPUTASK1=0,22
  2. bsub -R "affinity[socket(1)]" is a serial job asking for an entire socket.

    The allocation shown in bjobs:
    ...
                         CPU BINDING                          MEMORY BINDING
                         ------------------------             --------------------
     HOST                TYPE   LEVEL  EXCL   IDS             POL   NUMA SIZE
     Host1               socket -      -      /0/0            -     -    -
    ...
    In bhosts (assuming no other jobs are on the host):
    ...
    Host[64.0G] Host1
        NUMA[0: 0M / 32.0G]          NUMA[1: 0M / 32.0G]
            Socket0                      Socket0
                core0(*0 *22)                core0(1 23)
                core1(*2 *20)                core1(3 21)
                core2(*4 *18)                core2(5 19)
                core3(*6 *16)                core3(7 17)
                core4(*8 *14)                core4(9 15)
                core5(*10 *12)               core5(11 13)
            Socket1                      Socket1
                core0(24 46)                 core0(25 47)
                core1(26 44)                 core1(27 45)
                core2(28 42)                 core2(29 43)
                core3(30 40)                 core3(31 41)
                core4(32 38)                 core4(33 39)
                core5(34 36)                 core5(35 37)
    ...
    Contents of affinity host file:
    Host1 0,2,4,6,8,10,12,14,16,18,20,22
    Job environment variables:
    LSB_BIND_CPU_LIST=0,2,4,6,8,10,12,14,16,18,20,22
    RM_CPUTASK1=0,2,4,6,8,10,12,14,16,18,20,22
  3. bsub -R "affinity[core(4):membind=localonly] rusage[mem=2048]" is a multi-threaded single-task job requiring 4 cores and 2 GB of memory.

    The allocation shown in bjobs:
    ...
                        CPU BINDING                          MEMORY BINDING
                        ------------------------             --------------------
    HOST                TYPE   LEVEL  EXCL   IDS             POL   NUMA SIZE
    Host1               core   -      -      /0/0/0          local 0    2.0GB
                                             /0/0/1       
                                             /0/0/2        
                                             /0/0/3       
    ...
    In bhosts (assuming no other jobs are on the host):
    ...
    Host[64.0G] Host1
        NUMA[0: 2.0G / 32.0G]        NUMA[1: 0M / 32.0G]
            Socket0                      Socket0
                core0(*0 *22)                core0(1 23)
                core1(*2 *20)                core1(3 21)
                core2(*4 *18)                core2(5 19)
                core3(*6 *16)                core3(7 17)
                core4(8 14)                  core4(9 15)
                core5(10 12)                 core5(11 13)
            Socket1                      Socket1
                core0(24 46)                 core0(25 47)
                core1(26 44)                 core1(27 45)
                core2(28 42)                 core2(29 43)
                core3(30 40)                 core3(31 41)
                core4(32 38)                 core4(33 39)
                core5(34 36)                 core5(35 37)
    ...
    Contents of affinity host file:
    Host1 0,2,4,6,16,18,20,22 0 1
    Job environment variables:
    LSB_BIND_CPU_LIST=0,2,4,6,16,18,20,22
    LSB_BIND_MEM_LIST=0
    LSB_BIND_MEM_POLICY=localonly
    RM_MEM_AFFINITY=yes
    RM_CPUTASK1=0,2,4,6,16,18,20,22
    OMP_NUM_THREADS=4
    Note: OMP_NUM_THREADS is now present because the only task in the job asked for 4 cores.
  4. bsub -n 2 -R "affinity[core(2)] span[hosts=1]" is a multi-threaded parallel job asking for 2 tasks with 2 cores each running on the same host.

    The allocation shown in bjobs:
    ...
                        CPU BINDING                          MEMORY BINDING
                        ------------------------             --------------------
    HOST                TYPE   LEVEL  EXCL   IDS             POL   NUMA SIZE
    Host1               core   -      -      /0/0/0          -     -    -
                                             /0/0/1       
    Host1               core   -      -      /0/0/2          -     -    -
                                             /0/0/3       
    ...
    In bhosts (assuming no other jobs are on the host):
    ...
    Host[64.0G] Host1
        NUMA[0: 0M / 32.0G]          NUMA[1: 0M / 32.0G]
            Socket0                      Socket0
                core0(*0 *22)                core0(1 23)
                core1(*2 *20)                core1(3 21)
                core2(*4 *18)                core2(5 19)
                core3(*6 *16)                core3(7 17)
                core4(8 14)                  core4(9 15)
                core5(10 12)                 core5(11 13)
            Socket1                      Socket1
                core0(24 46)                 core0(25 47)
                core1(26 44)                 core1(27 45)
                core2(28 42)                 core2(29 43)
                core3(30 40)                 core3(31 41)
                core4(32 38)                 core4(33 39)
                core5(34 36)                 core5(35 37)
    ...
    Contents of affinity host file:
    Host1 0,2,4,6
    Host1 16,18,20,22
    Job environment variables set for each of the two tasks:
    LSB_BIND_CPU_LIST=0,2,4,6,16,18,20,22
    RM_CPUTASK1=0,2,4,6
    RM_CPUTASK2=16,18,20,22
    OMP_NUM_THREADS=2
    Note: Each task sees RM_CPUTASK1 and RM_CPUTASK2, and LSB_BIND_CPU_LIST is the combined list of all the CPUs allocated to the job on this host.
    If you run the job through the blaunch command and set the LSB_DJOB_TASK_BIND=Y environment variable, everything is the same except that the job environment variables differ for each task:
    • Task 1:
      LSB_BIND_CPU_LIST=0,2,20,22
      RM_CPUTASK1=0,2,20,22
      OMP_NUM_THREADS=2
    • Task 2:
      LSB_BIND_CPU_LIST=4,6,16,18
      RM_CPUTASK1=4,6,16,18
      OMP_NUM_THREADS=2
  5. bsub -n 2 -R "affinity[core(2)] span[ptile=1]" is a multi-threaded parallel job asking for 2 tasks with 2 cores each, running on different hosts. This is almost identical to the previous example except that the allocation is across two hosts.

    The allocation shown in bjobs:
    ...
                        CPU BINDING                          MEMORY BINDING
                        ------------------------             --------------------
    HOST                TYPE   LEVEL  EXCL   IDS             POL   NUMA SIZE
    Host1               core   -      -      /0/0/0          -     -    -
                                             /0/0/1       
    Host2               core   -      -      /0/0/0          -     -    -
                                             /0/0/1       
    ...
    In bhosts (assuming no other jobs are on the host), each of Host1 and Host2 would be allocated as:
    ...
    Host[64.0G] Host{1,2}
        NUMA[0: 0M / 32.0G]          NUMA[1: 0M / 32.0G]
            Socket0                      Socket0
                core0(*0 *22)                core0(1 23)
                core1(*2 *20)                core1(3 21)
                core2(4 18)                  core2(5 19)
                core3(6 16)                  core3(7 17)
                core4(8 14)                  core4(9 15)
                core5(10 12)                 core5(11 13)
            Socket1                      Socket1
                core0(24 46)                 core0(25 47)
                core1(26 44)                 core1(27 45)
                core2(28 42)                 core2(29 43)
                core3(30 40)                 core3(31 41)
                core4(32 38)                 core4(33 39)
                core5(34 36)                 core5(35 37)
    ...
    Contents of affinity host file:
    Host1 0,2,20,22
    Host2 0,2,20,22
    Job environment variables set for each of the two tasks:
    LSB_BIND_CPU_LIST=0,2,20,22
    RM_CPUTASK1=0,2,20,22
    OMP_NUM_THREADS=2
    Note: Each task only sees RM_CPUTASK1. This is the same as LSB_BIND_CPU_LIST because only one task is running on each host. Setting LSB_DJOB_TASK_BIND=Y would have no effect in this case.
  6. bsub -R "affinity[core(1,exclusive=(socket,alljobs))]" is an example of a single-threaded serial job asking for one core, with exclusive use of the socket across all jobs. Compare this with examples (1) and (2) above, where jobs simply ask for a core or a socket.

    The allocation shown in bjobs is the same as the job asking for a core except for the EXCL column:
    ...
                         CPU BINDING                          MEMORY BINDING
                         ------------------------             --------------------
     HOST                TYPE   LEVEL  EXCL   IDS             POL   NUMA SIZE
     Host1               core   -      socket /0/0/0          -     -    -
    ...
    In bhosts, however, the allocation is the same as for the job asking for a socket, because the entire socket must be reserved:
    ...
    Host[64.0G] Host1
        NUMA[0: 0M / 32.0G]          NUMA[1: 0M / 32.0G]
            Socket0                      Socket0
                core0(*0 *22)                core0(1 23)
                core1(*2 *20)                core1(3 21)
                core2(*4 *18)                core2(5 19)
                core3(*6 *16)                core3(7 17)
                core4(*8 *14)                core4(9 15)
                core5(*10 *12)               core5(11 13)
            Socket1                      Socket1
                core0(24 46)                 core0(25 47)
                core1(26 44)                 core1(27 45)
                core2(28 42)                 core2(29 43)
                core3(30 40)                 core3(31 41)
                core4(32 38)                 core4(33 39)
                core5(34 36)                 core5(35 37)
    ...
    The affinity host file, however, shows that the job is bound only to the allocated core when it runs:
    Host1 0,22
    This is also reflected in the job environment:
    LSB_BIND_CPU_LIST=0,22
    RM_CPUTASK1=0,22

    From the point of view of what is available to other jobs (that is, the allocation counted against the host), the job has used an entire socket. However, in all other respects the job binds only to a single core.

  7. bsub -R "affinity[core(1):cpubind=socket]" asks for a core but asks for the binding to be done at the socket level. Contrast this with the previous case, where the job wanted exclusive use of the socket.

    Again, the bjobs allocation is the same as example (1), but this time the LEVEL column is different:
    ...
                         CPU BINDING                          MEMORY BINDING
                         ------------------------             --------------------
     HOST                TYPE   LEVEL  EXCL   IDS             POL   NUMA SIZE
     Host1               core   socket -      /0/0/0          -     -    -
    ...
    In bhosts, the job just takes up a single core, rather than the whole socket like the exclusive job:
    ...
    Host[64.0G] Host1
        NUMA[0: 0M / 32.0G]          NUMA[1: 0M / 32.0G]
            Socket0                      Socket0
                core0(*0 *22)                core0(1 23)
                core1(2 20)                  core1(3 21)
                core2(4 18)                  core2(5 19)
                core3(6 16)                  core3(7 17)
                core4(8 14)                  core4(9 15)
                core5(10 12)                 core5(11 13)
            Socket1                      Socket1
                core0(24 46)                 core0(25 47)
                core1(26 44)                 core1(27 45)
                core2(28 42)                 core2(29 43)
                core3(30 40)                 core3(31 41)
                core4(32 38)                 core4(33 39)
                core5(34 36)                 core5(35 37)
    ...

    The view from the execution side, though, is quite different: the list of CPUs that populate the job's binding list on the host covers the entire socket.

    Contents of the affinity host file:
    Host1 0,2,4,6,8,10,12,14,16,18,20,22
    Job environment:
    LSB_BIND_CPU_LIST=0,2,4,6,8,10,12,14,16,18,20,22
    RM_CPUTASK1=0,2,4,6,8,10,12,14,16,18,20,22

    Compared to the previous example, from the point of view of what is available to other jobs (that is, the allocation counted against the host), the job has used only a single core. However, in terms of the binding list, the job process is free to use any CPU in the socket while it is running.