Using Compute Units to have Platform LSF consider cluster topology when scheduling

The network that underlies an HPC cluster can often be represented by a rooted tree. Such is the case for “fat-tree” networks, for example. In tree-like networks, leaves represent hosts, while internal nodes represent switches. Hosts with fewer edges between them in the tree have lower communication latency.

For a job that spans multiple hosts, it is often desirable to allocate hosts to the job that are close together according to the network topology. The purpose is to minimize communication latency between the various tasks of the job.
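The notion of closeness used here can be made concrete with a small sketch that counts the edges between two hosts in a rooted tree. The host and switch names below (hostA, sw1, core, and so on) are hypothetical and serve only to illustrate the idea; they are not part of any LSF configuration.

```python
# Hypothetical tree: each node maps to its parent; the root has no entry.
PARENT = {
    "hostA": "sw1", "hostB": "sw1",
    "hostC": "sw2",
    "sw1": "core", "sw2": "core",
}

def path_to_root(node, parent):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def edges_between(a, b, parent):
    """Count the edges on the tree path between nodes a and b."""
    path_a = path_to_root(a, parent)
    path_b = path_to_root(b, parent)
    depth_in_b = {node: i for i, node in enumerate(path_b)}
    # The first node on a's upward path that also lies on b's path
    # is their lowest common ancestor.
    for i, node in enumerate(path_a):
        if node in depth_in_b:
            return i + depth_in_b[node]
    raise ValueError("nodes are not in the same tree")

print(edges_between("hostA", "hostB", PARENT))  # 2 (same switch)
print(edges_between("hostA", "hostC", PARENT))  # 4 (through the core)
```

Hosts under the same switch are two edges apart, while hosts under different switches are four edges apart, which is why a job generally benefits from being packed under as few switches as possible.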

This article explains how to use the Platform LSF Compute Unit (CU) feature to have LSF consider a tree-like network topology when scheduling jobs.

Step 1: Define the set of allowable Compute Unit types

To use the Compute Unit feature, the LSF administrator must define the allowable set of Compute Unit types. This is a set of one or more user-defined strings, specifying the names of the levels of the network hierarchy. You configure these in the parameter COMPUTE_UNIT_TYPES in lsb.params.

For example, in lsb.params set the following parameter:

COMPUTE_UNIT_TYPES = switch! rack building

This example specifies three CU types. In this parameter, the order of the values corresponds to levels in the network topology, from finest to coarsest. CUs of type switch are contained in CUs of type rack; CUs of type rack are contained in CUs of type building. The individual hosts of the cluster are contained in CUs of the lowest level; that is, switch in this example.

The exclamation mark (!) following switch means that this is the default level to be used for jobs with CU topology requirements. If the exclamation mark is omitted, the first string listed is the default type.

In your own cluster, set a CU type for each level of granularity that is important for jobs running in the cluster. For example, if you want some jobs to have all their tasks within a single rack, then configure a type rack.
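To make the ordering and default-type rules concrete, here is a minimal Python sketch that interprets a COMPUTE_UNIT_TYPES value the way this step describes. The function name parse_cu_types is hypothetical; this is an illustration of the rules above, not LSF code.

```python
def parse_cu_types(value):
    """Split a COMPUTE_UNIT_TYPES value into (ordered types, default type).

    Types are listed from the finest level to the coarsest. A trailing
    exclamation mark flags the default type.
    """
    types = []
    default = None
    for token in value.split():
        name = token.rstrip("!")
        if token.endswith("!"):
            default = name
        types.append(name)
    if not types:
        raise ValueError("COMPUTE_UNIT_TYPES must list at least one type")
    if default is None:
        # Without an exclamation mark, the first type listed is the default.
        default = types[0]
    return types, default

types, default = parse_cu_types("switch! rack building")
print(types)    # ['switch', 'rack', 'building']
print(default)  # switch
```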

Step 2: Arrange hosts into Compute Units

A CU may be thought of as an LSF host group with an associated CU type. CUs are configured in the ComputeUnit section of lsb.hosts. The following is an example of hosts organized into a CU hierarchy.

Begin ComputeUnit
NAME     TYPE       MEMBER    
s1       switch     (host01 host02 host03 host04)
s2       switch     (host05 host06 host07 host08)
...
r1       rack       (s1 s2)
r2       rack       (s3 s4)
...
b1       building   (r1 r2)
b2       building   (r3 r4)
End ComputeUnit

Unlike standard LSF host groups, a CU hierarchy is strictly a forest. Each CU and host in the cluster can appear in the member list of at most one CU. Moreover, the type of the parent CU is strictly the next coarser granularity than the child, as defined in COMPUTE_UNIT_TYPES in lsb.params.
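The two structural rules above (each member belongs to at most one CU, and a parent is exactly one level coarser than its children) can be checked mechanically. The following Python sketch does so for the fully specified part of the example; the dictionary layout and the function name check_cu_hierarchy are assumptions made for illustration, and LSF performs its own validation when the configuration is loaded.

```python
CU_TYPES = ["switch", "rack", "building"]  # finest to coarsest

# name -> (type, members), mirroring the ComputeUnit example
CUS = {
    "s1": ("switch", ["host01", "host02", "host03", "host04"]),
    "s2": ("switch", ["host05", "host06", "host07", "host08"]),
    "r1": ("rack", ["s1", "s2"]),
}

def check_cu_hierarchy(cus, cu_types):
    """Return a list of problems; an empty list means both rules hold."""
    problems = []
    parent = {}
    for name, (cu_type, members) in cus.items():
        if cu_type not in cu_types:
            problems.append(f"{name}: unknown CU type {cu_type}")
            continue
        level = cu_types.index(cu_type)
        for member in members:
            # Rule 1: a host or CU may be a member of at most one CU.
            if member in parent:
                problems.append(f"{member} appears in both {parent[member]} and {name}")
            else:
                parent[member] = name
            # Rule 2: finest-level CUs hold hosts; every other CU holds
            # CUs of the next finer type.
            if level == 0:
                if member in cus:
                    problems.append(f"{name}: {cu_type} CUs hold hosts, but {member} is a CU")
            elif member not in cus:
                problems.append(f"{name}: {member} is not a defined CU")
            elif cus[member][0] != cu_types[level - 1]:
                problems.append(f"{name}: {member} is not of type {cu_types[level - 1]}")
    return problems

print(check_cu_hierarchy(CUS, CU_TYPES))  # []
```

Adding a second rack that also lists s2 as a member, for example, would be reported as a violation of the first rule.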

After setting the COMPUTE_UNIT_TYPES parameter in lsb.params and defining CUs in the ComputeUnit section of lsb.hosts, reconfigure mbatchd by running badmin reconfig.

You can view the CU configuration by running bmgroup -cu.

$ bmgroup -cu
NAME          TYPE          HOSTS
s1            switch        host01 host02 host03 host04
s2            switch        host05 host06 host07 host08
...
r1            rack          s1/ s2/
r2            rack          s3/ s4/
...
b1            building      r1/ r2/
b2            building      r3/ r4/
...

By default, bhosts displays status information for individual hosts. It may be easier in some cases to view status information at the level of CUs rather than individual hosts. In this case, choose the CU type you would like bhosts to display. Add a CONDENSE column to the ComputeUnit section. Set 'Y' in this column for CUs of the chosen type, and 'N' otherwise. Run badmin reconfig to have these changes take effect. Then, bhosts will show status information for CUs of the chosen type. You can run bhosts -X to view this output expanded to individual hosts.

Step 3: Submit jobs with compute unit requirements

Since a CU is essentially an LSF host group, all host group functionality in LSF applies to CUs. For example, once you define a CU named r1, you can submit a job to use this CU using the bsub -m option as follows.

$ bsub -m "r1" -n 64 ./a.out

This job asks for 64 slots, all of which must be on hosts in the CU r1.

Similarly, you can include CU names in the HOSTS parameter of Queue (lsb.queues) and Limit (lsb.resources) sections.

In addition to the basic host group functionality, LSF supports a cu section in resource requirement expressions. The cu section controls job placement across CUs. This is supported both in the submission resource requirement of a job (bsub -R) and in the RES_REQ parameter given in Queue (lsb.queues) and Application (lsb.applications) sections. For example, you can submit a job as follows.

$ bsub -R "cu[]" -n 64 ./a.out

LSF will place the job only on hosts belonging to CUs of the default type. It tries to place the job on CUs that appear as early as possible in the configuration order. That is, between two CUs of the default type, LSF prefers the one that appears first in the ComputeUnit section.

If you would like LSF to place the job on CUs of a type other than the default, use the type keyword to specify the CU level.

$ bsub -R "cu[type=rack]" -n 64 ./a.out

As an alternative to using the configuration order to decide preference among CUs, LSF also supports a preference for CUs with the fewest or most free slots. Preferring fewer free slots can be used for jobs that require few slots (for example, sequential jobs) to help avoid fragmentation of CUs. For parallel jobs that require several slots, a preference for CUs with the most free slots can ensure that the job spans the fewest CUs possible. Using a small number of CUs can be beneficial for a parallel job, to ensure low communications latencies between tasks of the job, and thus a lower run time for the job as a whole.

Use the pref keyword to specify CU preference as in the following examples.

The following job uses minavail to set a preference for CUs with the fewest free slots:

$ bsub -R "cu[pref=minavail]" ./a.out

The following job uses maxavail to set a preference for CUs with the most free slots:

$ bsub -R "cu[pref=maxavail]" -n 64 ./a.out
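The effect of the three preference modes can be pictured as three different orderings of the candidate CUs. The Python sketch below is a simplified model, not LSF's scheduling code: the function name order_cus and the free-slot counts are hypothetical, and breaking ties by configuration order is an assumption.

```python
def order_cus(free_slots, pref=None):
    """Order CU names by preference.

    free_slots maps CU name -> free slots and is given in configuration
    order. With no pref, configuration order is kept.
    """
    names = list(free_slots)
    if pref == "minavail":
        return sorted(names, key=lambda cu: free_slots[cu])
    if pref == "maxavail":
        # sorted() is stable even with reverse=True, so ties keep
        # configuration order (an assumption of this model).
        return sorted(names, key=lambda cu: free_slots[cu], reverse=True)
    return names

free = {"s1": 3, "s2": 8, "s3": 1, "s4": 8}  # hypothetical free-slot counts
print(order_cus(free))              # ['s1', 's2', 's3', 's4']
print(order_cus(free, "minavail"))  # ['s3', 's1', 's2', 's4']
print(order_cus(free, "maxavail"))  # ['s2', 's4', 's1', 's3']
```

In this model a sequential job with pref=minavail lands on s3 and leaves the emptier switches intact, while a parallel job with pref=maxavail starts from s2, where it is most likely to fit without spilling over.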

In general, a parallel job is able to span multiple CUs. You may want to limit the number of CUs that a job can span. For example, you may want to restrict a job to run within a single rack. It is also possible to limit a job to some arbitrary number of CUs of some type. This is done with the maxcus keyword.

$ bsub -R "cu[maxcus=1]" ./a.out
$ bsub -R "cu[maxcus=10]" ./a.out

When maxcus is used, if the job cannot be placed within the specified number of CUs, it remains pending.

You can combine pref and maxcus keywords to have LSF place a parallel job on as few CUs as possible, while limiting the maximum number of CUs spanned by the job. The following job must be placed on as few racks as possible, and cannot spread out to more than two racks.

$ bsub -R "cu[type=rack:pref=maxavail:maxcus=2]" -n 64 ./a.out
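A rough way to reason about how pref and maxcus interact is to model placement as a greedy fill over the preferred CU order. The Python sketch below makes that assumption explicit; the function name place and the rack free-slot counts are hypothetical, and the real scheduler considers many more factors (and may search differently), so treat this only as an aid to intuition.

```python
def place(free_slots, n, pref=None, maxcus=None):
    """Greedy model: fill CUs in preference order, using at most maxcus CUs.

    Returns {cu: slots} on success, or None if the job would stay pending.
    """
    names = list(free_slots)  # configuration order
    if pref in ("minavail", "maxavail"):
        names.sort(key=lambda cu: free_slots[cu], reverse=(pref == "maxavail"))
    allocation = {}
    remaining = n
    for cu in names:
        if remaining == 0:
            break
        if maxcus is not None and len(allocation) >= maxcus:
            break
        take = min(free_slots[cu], remaining)
        if take > 0:
            allocation[cu] = take
            remaining -= take
    return allocation if remaining == 0 else None

racks = {"r1": 16, "r2": 40, "r3": 32}  # hypothetical free slots per rack
print(place(racks, 64, pref="maxavail", maxcus=2))  # {'r2': 40, 'r3': 24}
print(place(racks, 64, pref="maxavail", maxcus=1))  # None (job pends)
print(place(racks, 64))  # {'r1': 16, 'r2': 40, 'r3': 8}
```

With these numbers, the 64-slot job fits on two racks when the fullest-free racks are tried first, cannot fit on a single rack, and spreads over all three racks when plain configuration order is used without a limit.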

In some cases a job must have exclusive access to CUs. This can be especially useful for benchmarking jobs, where in order to get consistent results across runs, it is important to avoid interference from other jobs. LSF supports exclusive use of CUs with the excl keyword.

In order to use this functionality, you must first enable it at the queue level. Set the maximum (coarsest) CU level that can be used exclusively in the EXCLUSIVE parameter in the Queue section (lsb.queues).

EXCLUSIVE = CU[rack]

This setting means that CUs of type rack are permitted to be used exclusively by jobs in the queue. Further, any finer-grained CU types as well as individual hosts can also be used exclusively by jobs in the queue.
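The rule can be read as a comparison of positions in the COMPUTE_UNIT_TYPES order: a CU type may be used exclusively if it is no coarser than the level named in EXCLUSIVE. The Python sketch below illustrates this with the three types from Step 1; the function name exclusive_allowed is hypothetical, and the sketch does not model what LSF does with a request that exceeds the queue's level.

```python
CU_TYPES = ["switch", "rack", "building"]  # finest to coarsest, as in Step 1

def exclusive_allowed(requested_type, queue_level, cu_types=CU_TYPES):
    """True if requested_type is the queue's level or a finer one."""
    return cu_types.index(requested_type) <= cu_types.index(queue_level)

# With EXCLUSIVE = CU[rack] on the queue:
print(exclusive_allowed("switch", "rack"))    # True
print(exclusive_allowed("rack", "rack"))      # True
print(exclusive_allowed("building", "rack"))  # False
```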

To submit a job that uses CUs exclusively, use the excl keyword in the cu resource requirement expression. Specify the granularity of CU that is to be used exclusively with the type keyword. For example, the following job requests exclusive use of the racks that it spans.

$ bsub -R "cu[type=rack:excl]" -n 64 ./a.out

The job remains pending until LSF is able to grant it exclusive use of one or more racks. Once the job runs, no other job is allowed to dispatch onto the racks.