Use backfill on memory
If BACKFILL is configured in a queue, and a run limit is specified with -W on bsub or with RUNLIMIT in the queue, backfill jobs can use the accumulated memory reserved by the other jobs, as long as the backfill job can finish before the predicted start time of the jobs with the reservation.
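The eligibility rule above can be sketched as a small Python check. This is a hypothetical model, not LSF code: `Reservation` and `can_backfill` are illustrative names, memory is in MB, times are minutes from now, and treating a job that ends exactly at the predicted start time as eligible is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reservation:
    """Memory held by a pending job (hypothetical model, not an LSF API)."""
    reserved_mem_mb: int
    predicted_start_min: float  # minutes until the reserving job is expected to start


def can_backfill(mem_mb: int, run_limit_min: Optional[float], res: Reservation) -> bool:
    """Return True if a backfill job may borrow memory from `res`.

    The job must have a run limit (bsub -W or queue RUNLIMIT), fit inside the
    reserved memory, and finish no later than the reserving job's predicted
    start time (the boundary case is an assumption of this sketch).
    """
    if run_limit_min is None:
        # Without a run limit, the scheduler cannot tell when the job would end.
        return False
    return mem_mb <= res.reserved_mem_mb and run_limit_min <= res.predicted_start_min


# Usage: 200 MB reserved by a job predicted to start in 300 minutes.
res = Reservation(reserved_mem_mb=200, predicted_start_min=300)
print(can_backfill(50, 100, res))   # True: fits and ends in time
print(can_backfill(50, 400, res))   # False: would still be running at minute 300
print(can_backfill(300, 100, res))  # False: needs more than is reserved
```

The run-limit check comes first because, without it, there is no way to compare the job's end time with the reservation's start time.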
Unlike slot reservation, which applies only to parallel jobs, backfill on memory applies to both sequential and parallel jobs.
The following queue enables both memory reservation and backfill in the same queue:
Begin Queue
QUEUE_NAME = reservation_backfill
DESCRIPTION = For resource reservation and backfill
PRIORITY = 40
RESOURCE_RESERVE = MAX_RESERVE_TIME[20]
BACKFILL = Y
End Queue
Examples of memory reservation and backfill on memory
lsb.queues
The following queues are defined in lsb.queues:
Begin Queue
QUEUE_NAME = reservation
DESCRIPTION = For resource reservation
PRIORITY = 40
RESOURCE_RESERVE = MAX_RESERVE_TIME[20]
End Queue
Begin Queue
QUEUE_NAME = backfill
DESCRIPTION = For backfill scheduling
PRIORITY = 30
BACKFILL = y
End Queue
lsb.params
Per-slot memory reservation is enabled by setting RESOURCE_RESERVE_PER_TASK=y in lsb.params, so the memory in rusage[mem=...] is reserved for each task of a parallel job.
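With per-task reservation, a job's total memory is the rusage[mem=...] value multiplied by its number of tasks (the -n value). A minimal sketch of that arithmetic, using a hypothetical helper name; the parallel example later in this section (4 tasks at 100 MB each) follows it.

```python
def job_memory_mb(mem_per_task_mb: int, tasks: int = 1) -> int:
    """Memory reserved for one job under per-task reservation (hypothetical helper).

    With RESOURCE_RESERVE_PER_TASK=y, rusage[mem=...] applies to each task,
    so a job submitted with -n <tasks> gets mem_per_task_mb * tasks.
    """
    if tasks < 1:
        raise ValueError("a job has at least one task")
    return mem_per_task_mb * tasks


print(job_memory_mb(400))           # sequential job, rusage[mem=400]
print(job_memory_mb(100, tasks=4))  # bsub -n 4 -R "rusage[mem=100]"
```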
Assumptions
Assume one host in the cluster with 10 CPUs and 1 GB of free memory currently available.
Sequential jobs
The first three of the following sequential jobs each require 400 MB of memory and run for 300 minutes.
Job 1:
bsub -W 300 -R "rusage[mem=400]" -q reservation myjob1
The job starts running, using 400 MB of memory and one job slot.
Job 2:
Submitting a second job with the same requirements gets the same result.
Job 3:
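Submitting a third job with the same requirements reserves one job slot and all free memory, provided the amount of free memory is between 20 MB and 200 MB (some free memory may be used by the operating system or other software). Job 3 needs 400 MB, but only about 200 MB remains after jobs 1 and 2 start, so it pends until they finish, at most 300 minutes from now.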
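The first three sequential jobs can be traced with a small single-host model. This is a simplified sketch with hypothetical names: it treats 1 GB as 1000 MB and assumes no memory is used outside LSF, so exactly 200 MB is left for job 3 to reserve.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """Hypothetical single-host model: 10 slots, 1 GB treated as 1000 MB."""
    free_slots: int = 10
    free_mem_mb: int = 1000
    reserved_slots: int = 0
    reserved_mem_mb: int = 0

    def try_start(self, mem_mb: int) -> bool:
        """Start a one-slot job if unreserved slots and memory are sufficient."""
        if self.free_slots >= 1 and self.free_mem_mb >= mem_mb:
            self.free_slots -= 1
            self.free_mem_mb -= mem_mb
            return True
        return False

    def reserve_all_free(self) -> None:
        """Model a pending job in a RESOURCE_RESERVE queue holding one slot and all free memory."""
        self.free_slots -= 1
        self.reserved_slots += 1
        self.reserved_mem_mb += self.free_mem_mb
        self.free_mem_mb = 0


host = Host()
for job in ("myjob1", "myjob2", "myjob3"):
    if host.try_start(400):  # each job requests rusage[mem=400]
        print(job, "runs")
    else:
        # Not enough memory left: the job pends and reserves what is free.
        host.reserve_all_free()
        print(job, "pends, reserving", host.reserved_mem_mb, "MB")
print(host)
```

After the loop, 7 slots are free, no memory is unreserved, and job 3 holds 1 slot and 200 MB.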
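Job 4:
bsub -W 400 -q backfill -R "rusage[mem=50]" myjob4
The job keeps pending. The memory it needs is reserved by job 3, and its 400-minute run limit is longer than the remaining run time of jobs 1 and 2, so it could not finish before job 3 is predicted to start.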
Job 5:
bsub -W 100 -q backfill -R "rusage[mem=50]" myjob5
The job starts running. Its 100-minute run limit lets it finish before job 3 is predicted to start, so it uses one free slot and 50 MB of the memory reserved by job 3. If the job does not finish in 100 minutes, LSF kills it automatically.
Job 6:
bsub -W 100 -q backfill -R "rusage[mem=300]" myjob6
The job keeps pending, with no resource reservation of its own because the backfill queue does not set RESOURCE_RESERVE. It cannot backfill because the 300 MB it needs is more than the memory reserved by job 3.
Job 7:
bsub -W 100 -q backfill myjob7
The job starts running. Because it specifies no memory requirement, LSF assumes it needs no memory, and enough job slots are free.
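The outcomes for jobs 4 to 7 can be reproduced by checking each one against the state left by jobs 1 to 3. A minimal sketch under stated assumptions: the names are hypothetical, 1 GB is treated as 1000 MB, job 7's missing rusage is modeled as 0 MB, and every job is checked against the same snapshot (memory borrowed by job 5 is not deducted, which does not change the result here because job 6 needs more than 200 MB anyway).

```python
from typing import Optional

# Snapshot after sequential jobs 1-3 (hypothetical model; 1 GB treated as 1000 MB).
FREE_SLOTS = 7          # 10 slots - 2 running jobs - 1 slot reserved by job 3
FREE_MEM_MB = 0         # job 3 has reserved all remaining free memory
RESERVED_MEM_MB = 200   # memory held by pending job 3
JOB3_START_MIN = 300    # jobs 1 and 2 reach their 300-minute run limits


def decide(mem_mb: int, run_limit_min: Optional[int]) -> str:
    """Classify a one-slot job submitted to the backfill queue."""
    if FREE_SLOTS < 1:
        return "pending"
    if mem_mb <= FREE_MEM_MB:
        return "runs"  # fits without touching job 3's reservation
    if (run_limit_min is not None
            and mem_mb <= RESERVED_MEM_MB
            and run_limit_min <= JOB3_START_MIN):
        return "runs (backfill)"  # borrows memory reserved by job 3
    return "pending"


jobs = {
    "myjob4": (50, 400),
    "myjob5": (50, 100),
    "myjob6": (300, 100),
    "myjob7": (0, 100),
}
for name, (mem, limit) in jobs.items():
    print(name, decide(mem, limit))
```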
Parallel jobs
Jobs 1 to 3 below are parallel jobs that each need 4 CPUs, and each of their processes requires 100 MB of memory. The first two run for 300 minutes.
Job 1:
bsub -W 300 -n 4 -R "rusage[mem=100]" -q reservation myJob1
The job starts running, using 4 slots and 400 MB of memory (100 MB for each of its 4 processes).
Job 2:
Submitting a second job with the same requirements gets the same result.
Job 3:
Submitting a third job with the same requirements reserves the 2 remaining slots and all 200 MB of available memory, assuming no other applications are running outside of LSF. Job 3 needs 4 slots and 400 MB, so it pends until jobs 1 and 2 finish.
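The slot and memory arithmetic behind job 3's reservation can be checked directly. A minimal sketch with hypothetical names, assuming 1 GB is treated as 1000 MB and nothing runs outside LSF, as the example states:

```python
from typing import Tuple

TOTAL_SLOTS = 10
TOTAL_MEM_MB = 1000    # 1 GB treated as 1000 MB; nothing runs outside LSF
TASKS_PER_JOB = 4      # bsub -n 4
MEM_PER_TASK_MB = 100  # rusage[mem=100], reserved for each task


def remaining_after(running_jobs: int) -> Tuple[int, int]:
    """Free (slots, MB) after `running_jobs` identical parallel jobs have started."""
    used_slots = running_jobs * TASKS_PER_JOB
    used_mem = running_jobs * TASKS_PER_JOB * MEM_PER_TASK_MB
    return TOTAL_SLOTS - used_slots, TOTAL_MEM_MB - used_mem


free_slots, free_mem = remaining_after(2)  # jobs 1 and 2 are running
job3_fits = free_slots >= TASKS_PER_JOB and free_mem >= TASKS_PER_JOB * MEM_PER_TASK_MB
print(free_slots, free_mem, job3_fits)  # 2 200 False
```

Because job 3 does not fit, it pends in the reservation queue and holds those 2 slots and 200 MB.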
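Job 4:
bsub -W 400 -q backfill -R "rusage[mem=50]" myJob4
The job keeps pending because all available memory is already reserved by job 3. Its 400-minute run limit is longer than the remaining run time of jobs 1 and 2, so it cannot finish before job 3 is predicted to start, and no backfill happens.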
Job 5:
bsub -W 100 -q backfill -R "rusage[mem=50]" myJob5
The job starts running. It can backfill the slot and memory reserved by job 3 because its 100-minute run limit ends before job 3 is predicted to start. If the job does not finish in 100 minutes, LSF kills it automatically.
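In the parallel case no slots are left unreserved, so a backfill job must borrow both a reserved slot and reserved memory. A minimal sketch of that check with hypothetical names; since the bsub commands for jobs 4 and 5 have no -n, each is modeled as a one-slot job.

```python
from typing import Optional

# Snapshot after parallel jobs 1-3 (hypothetical model; 1 GB treated as 1000 MB).
FREE_SLOTS, FREE_MEM_MB = 0, 0            # everything left is reserved by job 3
RESERVED_SLOTS, RESERVED_MEM_MB = 2, 200
JOB3_START_MIN = 300                      # jobs 1 and 2 end at their run limits


def decide(slots: int, mem_mb: int, run_limit_min: Optional[int]) -> str:
    """Classify a job from the backfill queue against job 3's reservation."""
    if slots <= FREE_SLOTS and mem_mb <= FREE_MEM_MB:
        return "runs"
    fits_reserved = (slots <= FREE_SLOTS + RESERVED_SLOTS
                     and mem_mb <= FREE_MEM_MB + RESERVED_MEM_MB)
    if fits_reserved and run_limit_min is not None and run_limit_min <= JOB3_START_MIN:
        return "runs (backfill)"  # uses a reserved slot and reserved memory
    return "pending"


print("myJob4", decide(1, 50, 400))  # pending: would outlast job 3's start
print("myJob5", decide(1, 50, 100))  # runs (backfill)
```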