Backfill scheduling

By default, a reserved job slot cannot be used by another job. To make better use of resources and improve the performance of LSF, you can configure backfill scheduling.

About backfill scheduling

Backfill scheduling allows other jobs to use the reserved job slots, as long as those jobs do not delay the start of the job that reserved the slots. Backfilling, together with processor reservation, allows large parallel jobs to run without underutilizing resources.

In a busy cluster, processor reservation helps to schedule large parallel jobs sooner. However, by default, reserved processors remain idle until the large job starts. This degrades the performance of LSF because the reserved resources are idle while jobs are waiting in the queue.

Backfill scheduling allows the reserved job slots to be used by small jobs that can run and finish before the large job starts. This improves the performance of LSF because it increases the utilization of resources.

How backfilling works

For backfill scheduling, LSF assumes that a job can run until its run limit expires. Backfill scheduling works most efficiently when all the jobs in the cluster have a run limit.

Since jobs with a shorter run limit have more chance of being scheduled as backfill jobs, users who specify appropriate run limits in a backfill queue are rewarded with improved turnaround time.

Once the big parallel job has reserved sufficient job slots, LSF calculates the start time of the big job, based on the run limits of the jobs currently running in the reserved slots. LSF cannot backfill if the big job is waiting for a job that has no run limit defined.

If LSF can backfill the idle job slots, only jobs with run limits that expire before the start time of the big job are allowed to use the reserved job slots. LSF cannot backfill with a job that has no run limit.
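The two rules above can be sketched in a few lines of Python. This is an illustrative model only, not LSF's scheduler code: the function names are hypothetical, times are plain hours measured from "now", and treating a job whose run limit expires exactly at the big job's start time as eligible is an assumption.

```python
from typing import Optional, Sequence


def reserved_job_start(remaining_run_limits: Sequence[Optional[float]]) -> Optional[float]:
    """Hours from now until the big (reserving) job can start.

    Each entry is the time left before a running job that the big job is
    waiting for reaches its run limit. None means the job has no run limit.
    """
    if any(limit is None for limit in remaining_run_limits):
        # The start time cannot be calculated, so backfilling is not possible.
        return None
    # The big job starts once the last of those jobs reaches its run limit.
    return max(remaining_run_limits, default=0.0)


def can_backfill(candidate_run_limit: Optional[float], start_in: Optional[float]) -> bool:
    """True if a candidate job may use the reserved slots."""
    if candidate_run_limit is None or start_in is None:
        # No run limit on the candidate, or no known start time for the big job.
        return False
    # Assumption: finishing exactly at the start time does not delay the big job.
    return candidate_run_limit <= start_in
```

For example, with one running job that has 1.5 hours left, `can_backfill(1.0, reserved_job_start([1.5]))` is true, while a candidate with a 2-hour run limit, or with no run limit at all, is rejected.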

Example

In this scenario, assume the cluster consists of a 4-CPU multiprocessor host.
  1. A sequential job (job1) with a run limit of 2 hours is submitted and gets started at 8:00 am (figure a).

  2. Shortly afterwards, a parallel job (job2) requiring all 4 CPUs is submitted. It cannot start right away because job1 is using one CPU, so it reserves the remaining 3 processors (figure b).

  3. At 8:30 am, another parallel job (job3) is submitted, requiring only two processors and with a run limit of 1 hour. Since job2 cannot start until 10:00 am (when job1 finishes), and job3 will finish by 9:30 am at the latest, job3 can backfill job2's reserved processors (figure c). Job3 therefore completes before job2's start time, making use of the otherwise idle processors.

  4. Job3 finishes at 9:30 am and job1 at 10:00 am, allowing job2 to start shortly after 10:00 am. In this example, if job3's run limit were 2 hours, it could not backfill job2's reserved slots and would have to run after job2 finishes.
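The arithmetic in this example can be checked with a short Python sketch. The calendar date is arbitrary, the variable and function names are hypothetical, and job3 is assumed to start as soon as it is submitted at 8:30 am.

```python
from datetime import datetime, timedelta

day = datetime(2024, 1, 1)  # arbitrary date; only the times of day matter

job1_start = day.replace(hour=8)              # job1 starts at 8:00 am
job1_end = job1_start + timedelta(hours=2)    # its run limit expires at 10:00 am
job2_start = job1_end                         # job2 needs all 4 CPUs, so it waits for job1

job3_start = day.replace(hour=8, minute=30)   # assumed to start on submission


def fits_before_job2(run_limit_hours: float) -> bool:
    """True if job3's run limit expires no later than job2's start time."""
    return job3_start + timedelta(hours=run_limit_hours) <= job2_start


print(fits_before_job2(1))  # True: job3 ends by 9:30 am, so it can backfill
print(fits_before_job2(2))  # False: job3 could run until 10:30 am
```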

Limitations

  • A job does not have an estimated start time immediately after mbatchd is reconfigured.

Backfilling and job slot limits

A backfill job borrows a job slot that is already taken by another job. The backfill job does not run at the same time as the job that reserved the job slot first. Backfilling can take place even if the job slot limits for a host or processor have been reached. Backfilling cannot take place if the job slot limits for users or queues have been reached.
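As a rough illustration of this rule, the following Python sketch models which job slot limits block backfilling. The class and function names are hypothetical and do not correspond to any LSF interface.

```python
from dataclasses import dataclass


@dataclass
class SlotLimitState:
    """Which job slot limits have currently been reached (hypothetical model)."""
    host_limit_reached: bool = False
    processor_limit_reached: bool = False
    user_limit_reached: bool = False
    queue_limit_reached: bool = False


def slot_limits_permit_backfill(state: SlotLimitState) -> bool:
    # Host and processor limits are deliberately ignored: the backfill job
    # borrows a slot that is already counted against them.
    return not (state.user_limit_reached or state.queue_limit_reached)
```

In this model, `SlotLimitState(host_limit_reached=True)` still permits backfilling, while `SlotLimitState(user_limit_reached=True)` does not.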

Job resize allocation requests

Pending job resize allocation requests are supported by backfill policies. However, the run time of a pending resize request is equal to the remaining run time of the running resizable job. For example, if the run limit of a resizable job is 20 hours and 4 hours have already passed, the run time of the pending resize request is 16 hours.
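The calculation is a simple subtraction, sketched here in Python. The function name is hypothetical, and clamping the result at zero for a job that has already used its whole run limit is an assumption.

```python
def pending_resize_run_time(run_limit_hours: float, elapsed_hours: float) -> float:
    """Run time assumed for a pending resize request, in hours.

    It equals the remaining run time of the running resizable job.
    """
    # Assumption: never report a negative remaining run time.
    return max(run_limit_hours - elapsed_hours, 0.0)


print(pending_resize_run_time(20, 4))  # 16, matching the example above
```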

Configure backfill scheduling

Backfill scheduling is enabled at the queue level. Only jobs in a backfill queue can backfill reserved job slots. If the backfill queue also allows processor reservation, then backfilling can occur among jobs within the same queue.