Order string

The order string allows the selected hosts to be sorted according to the values of resources. The values of r15s, r1m, and r15m used for sorting are the normalized load indices that are returned by lsload -N.
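For example, you can inspect these normalized values before choosing an order string, then sort on one of them (my_job is a placeholder command):

    # Display the normalized CPU load indices used for sorting
    lsload -N
    # Sort candidate hosts by the normalized 1-minute run queue length
    bsub -R "order[r1m]" my_job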

The order string is used for host sorting and selection. The ordering begins with the rightmost index in the order string and proceeds from right to left. The hosts are sorted into order based on each load index, and if more hosts are available than were requested, the LIM drops the least desirable hosts according to that index. The remaining hosts are then sorted by the next index.

After the hosts are sorted by the leftmost index in the order string, the final phase of sorting orders the hosts according to their status, with hosts that are currently not available for load sharing (that is, not in the ok state) listed at the end.

Because the hosts are sorted again for each load index, only the host status and the leftmost index in the order string actually affect the order in which hosts are listed. The other indices are only used to drop undesirable hosts from the list.
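As an illustrative sketch of this behavior (the index choice and command are examples only), consider a request for two hosts with a two-index order string:

    # order[r1m:pg] with a request for 2 hosts:
    # 1. Hosts are sorted by pg (the rightmost index) and, because more
    #    hosts are available than requested, the hosts with the worst pg
    #    values are dropped.
    # 2. The remaining hosts are re-sorted by r1m (the leftmost index),
    #    which, together with host status, determines the final order.
    bsub -n 2 -R "order[r1m:pg]" my_job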

When sorting is done on each index, the direction in which the hosts are sorted (increasing versus decreasing values) is determined by the default order returned by lsinfo for that index. This direction is chosen such that after sorting, by default, the hosts are ordered from best to worst on that index.
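To check the default direction for a particular index, you can run lsinfo; its ORDER column indicates whether increasing or decreasing values of the resource are considered better (the exact output layout may vary by version):

    lsinfo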

When used with a cu string, the preferred compute unit order takes precedence. Within each compute unit, hosts are ordered according to the order string requirements.
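For example, a submission such as the following (the compute unit type rack is an assumed configuration) first groups hosts by the preferred compute unit order, then sorts the hosts within each compute unit by fewest free slots first:

    bsub -n 16 -R "cu[type=rack] order[-slots]" my_parallel_job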

Syntax

[!] [-]resource_name [:[-]resource_name]...

You can specify any built-in or external load index or static resource.

The ! syntax forces a re-sort of the candidate hosts. It applies to the entire order [] section. After candidate hosts are selected and sorted initially, they are sorted again by all plug-ins before a job is scheduled. If you specify !, it becomes the first character of the merged order [] string.

! works only with consumable resources, because such resources can be specified in the order [] section and their values can change within a scheduling cycle (for example, slots or memory). For the scheduler, slots in the RUN, SSUSP, USUSP, and RSV states may become free in different scheduling phases, so the value of slots can differ from one scheduling cycle to the next.

Using slots to order candidate hosts does not always improve the utilization of the whole cluster; cluster utilization depends on many factors.

When an index name is preceded by a minus sign ‘-’, the sorting order is reversed so that hosts are ordered from worst to best on that index.
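For example, because the default direction for mem ranks hosts with more available memory as better, the minus sign inverts that preference (commands are illustrative):

    # Hosts with the most available memory first (best to worst)
    bsub -R "order[mem]" my_job
    # Hosts with the least available memory first (worst to best)
    bsub -R "order[-mem]" my_job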

In the following example, LSF first tries to pack jobs onto the hosts with the fewest free slots. Three serial jobs and one parallel job are submitted.

HOST_NAME  STATUS  JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV
hostA      ok      -     4    0      0    0      0      0
hostB      ok      -     4    0      0    0      0      0

The three serial jobs are submitted:

  • bsub -R "order[-slots]" job1

  • bsub -R "order[-slots]" job2

  • bsub -R "order[-slots]" job3

The parallel job is submitted:

  • bsub -n 4 -R "order[-slots] span[hosts=1]" sleep 1000

The serial jobs are dispatched to one host (hostA). The parallel job is dispatched to another host.
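You could confirm this placement by running bhosts again and checking the NJOBS and RUN columns for each host:

    bhosts hostA hostB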

Change the global LSF default sorting order

You can change the global LSF system default sorting order of resource requirements so that the scheduler can find the right candidate host. This makes it easier to maintain a single global default order instead of setting a default order in the lsb.queues file for every queue defined in the system. You can also specify a default order to replace the default sorting value of r15s:pg, which could impact performance in large-scale clusters.

To set the default order, use the DEFAULT_RESREQ_ORDER parameter in lsb.params. For example, you can pack jobs onto the hosts with the fewest free slots by setting DEFAULT_RESREQ_ORDER=-slots:-maxslots. This dispatches jobs first to the host with the fewest free slots and, secondarily, to hosts with the smallest number of job slots defined (MXJ). This leaves larger blocks of free slots on the hosts with larger MXJ (if the slot utilization in the cluster is not too high).
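A minimal sketch of the corresponding lsb.params entry (other parameters omitted):

    Begin Parameters
    DEFAULT_RESREQ_ORDER = -slots:-maxslots
    End Parameters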

Commands with the -R parameter (such as bhosts, bmod and bsub) use the default order defined in DEFAULT_RESREQ_ORDER for scheduling if no order is specified in the command.

To change the system default sorting order:

  1. Configure the DEFAULT_RESREQ_ORDER in lsb.params.

  2. Run badmin reconfig to have the changes take effect.

  3. Optional: Run bparams -a | grep ORDER to verify that the parameter was set. Output similar to that shown in the following example appears:

    DEFAULT_RESREQ_ORDER = r15m:it

  4. Submit your job.

  5. When you check the output, you can see the sort order for the resource requirements in the RESOURCE REQUIREMENT DETAILS section:

    bjobs -l 422
    Job <422>, User <lsfadmin>, Project <default>
    Status <DONE>, Queue <normal>, Command <sleep1>
    Fri Jan 18 13:29:35: Submitted from hostA, CWD
                         <home/admin/lsf/conf/lsbatch/LSF/configdir>;
    Fri Jan 18 13:29:37: Started on <hostA>, Execution Home </home/lsfadmin>,
    Execution CWD </home/admin/lsf/conf/lsbatch/LSF/configdir>;
    Fri Jan 18 13:29:44: Done successfully. The CPU time used is 0.0 seconds.
     
      MEMORY USAGE:
      MAX MEM: 3 Mbytes;  AVG MEM: 3 Mbytes
     
      SCHEDULING PARAMETERS:
              r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp   mem
    loadSched   -     -     -     -       -     -    -     -     -      -      -
    loadStop    -     -     -     -       -     -    -     -     -      -      -
     
    RESOURCE REQUIREMENT DETAILS:
    Combined: select[type == local] order[r15m:it]
    Effective: select[type == local] order[r15m:it]
    

When changing the value for DEFAULT_RESREQ_ORDER, note the following:

  • For job scheduling, there are three levels at which you can sort resources from the order section: job-level, application-level and queue-level. The sort order for resource requirements defined at the job level overwrites those defined at the application level or queue level. The sort order for resource requirements defined at the application level overwrites those defined at the queue level. If no sort order is defined at any level, mbschd uses the value of DEFAULT_RESREQ_ORDER when scheduling the job.

  • You should sort by only one or two resources, since sorting by more may take longer.

  • After the job is running, you cannot redefine the sort order. However, you can still change it while the job is in the PEND state (see the example after this list).

  • For MultiCluster forward and MultiCluster lease modes, the DEFAULT_RESREQ_ORDER value for each local cluster is used.

  • If you change DEFAULT_RESREQ_ORDER and then requeue a running job, the job uses the new DEFAULT_RESREQ_ORDER value for scheduling.
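For example, to change the sort order of a job that is still pending (the job ID is illustrative):

    bmod -R "order[mem]" 1234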

Specify multiple -R options

bsub accepts multiple -R options for the order section.

Restriction:

Compound resource requirements do not support multiple -R options.

You can specify multiple resource requirement strings instead of using the && operator. For example:

bsub -R "order[r15m]" -R "order[ut]"

LSF merges the multiple -R options into one string and dispatches the job if all of the resource requirements can be met. By allowing multiple resource requirement strings and automatically merging them into one string, LSF simplifies the use of multiple layers of wrapper scripts. The number of -R option sections is unlimited.
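You can verify how the order sections were merged by checking the Combined line in the RESOURCE REQUIREMENT DETAILS section of the bjobs output (the job ID is illustrative); both indices should appear in the merged order section:

    bjobs -l 422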

Default

The default sorting order is r15s:pg (except for lslogin(1): ls:r1m). For example:

swp:r1m:tmp:r15s

Resizable jobs

The order in which hosts are considered for resize allocation requests is determined by the order expression of the job. For example, to run an autoresizable job on 1-100 slots, preferring hosts with larger memory, the following job submission specifies this resource request:

bsub -ar -app <application_file> -n "1,100" -R "order[mem] rusage[swp=100,license=1]" myjob

When slots on multiple hosts become available simultaneously, hosts with larger available memory get preference when the job adds slots.

Note:

Resizable jobs cannot have compound or alternative resource requirements.

Reordering hosts

You can reorder hosts by using the order[!] syntax.

Suppose host h1 exists in a cluster and has 110 units of a consumable resource 'res', while host h2 has 20 units of this resource ('res' can be the new batch built-in resource slots, for example). Assume that these two jobs are pending and are considered by the scheduler in the same scheduling cycle, and that job1 is scheduled first:

Job1: bsub -R "maxmem>1000" -R "order[res] rusage[res=100]" -q q1 sleep 10000

Job2: bsub -R "mem<1000" -R "order[res] rusage[res=10]" -q q2 sleep 10000

Early in the scheduling cycle, a candidate host list is built by taking either all hosts in the cluster or the hosts listed in any asked host list (-m) and ordering them by the order section of the resource requirement string. Assume the ordered candidate host lists for the jobs look like this after the ordering:

Job1:{h1, h7, h4, h10}

Job2:{h1, h2}

This means that h1 ends up being the highest 'res' host in the candidate host lists of both jobs. Later in the scheduling cycle, each job in turn is allocated hosts to run on, together with resources from those hosts.

Suppose Job1 is scheduled to land on host h1, and thus will be allocated 100 'res'. Then when Job2 is considered, it too might be scheduled to land on host h1 because its candidate host list still looks the same. That is, it does not take into account the 100 'res' allocated to Job1 within this same scheduling cycle. To resolve this problem, use ! at the beginning of the order section to force the scheduler to re-order candidate host lists for jobs in the later scheduling phase:

Job1: bsub -R "maxmem>1000" -R "order[!res] rusage[res=100]" -q q1 sleep 10000

Job2: bsub -R "mem<1000" -R "order[!res] rusage[res=10]" -q q2 sleep 10000

The ! forces a reordering of Job2's candidate host list to {h2, h1}, since after Job1 is allocated 100 'res' on h1, h1 has only 10 'res' left (110-100) whereas h2 still has 20.

You can combine the new batch built-in resources slots and maxslots with both reverse ordering and reordering to better ensure that large parallel jobs have a chance to run later (improved packing). For example:

bsub -n 2 -R "order[!-slots:maxslots]" ...

bsub -n 1 -R "order[!-slots:maxslots]" ...