Time-based slot reservation

Existing LSF slot reservation works in simple environments, where the host-based MXJ limit is the only constraint on a job slot request. In complex environments, where more than one constraint exists (for example, job topology or a generic slot limit):
  • Estimated job start time becomes inaccurate
  • The scheduler makes a reservation decision that can postpone estimated job start time or decrease cluster utilization.

Current slot reservation by start time (RESERVE_BY_STARTTIME) resolves several reservation issues when there are multiple candidate host groups, but it does not help in other cases:

  • It does not handle special topology requests, such as span[ptile=n] and the cu[] keywords balance, maxcus, and excl.
  • It only calculates and displays a reservation if the host has free slots. Reservations may change or disappear if there are no free CPUs; for example, if a backfill job takes all reserved CPUs.
  • For HPC machines containing many internal nodes, the host-level number of reserved slots is not enough for administrators and end users to tell which CPUs the job is reserving and waiting for.

Time-based slot reservation versus greedy slot reservation

With time-based reservation, a set of pending jobs gets future allocation and an estimated start time so that the system can reserve a place for each job. Reservations use the estimated start time, which is based on future allocations.

Time-based resource reservation provides a more accurate predicted start time for pending jobs because LSF considers job scheduling constraints and requirements, such as job topology and resource limits.

Restriction: Time-based reservation does not work with job chunking.

Start time and future allocation

The estimated start time for a future allocation is the earliest start time when all considered job constraints are satisfied in the future. There may be a small delay of a few minutes between the job finish time on which the estimate was based and the actual start time of the allocated job.

For compound resource requirement strings, the predicted start time is based on the simple resource requirement term (contained in the compound resource requirement) with the latest predicted start time.

If a job cannot be placed in a future allocation, the scheduler uses greedy slot reservation to reserve slots. Existing LSF slot reservation is a simple greedy algorithm:

  • It considers only currently available resources and the minimal number of requested job slots, and reserves as many slots as it is allowed.

  • For multiple exclusive candidate host groups, the scheduler goes through those groups and makes the reservation on the group that has the largest number of available slots.

  • For the estimated start time, after making the reservation, the scheduler sorts all running jobs in ascending order of finish time and goes through this sorted list, adding up the slots used by each running job until the minimal job slot request is satisfied. The finish time of the last visited job becomes the estimated start time of the pending job.
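The three greedy steps above can be sketched in a few lines of Python. This is only an illustration of the description in this section, not LSF code: the names RunningJob, pick_host_group, and greedy_estimated_start are hypothetical, and counting already reserved slots toward the request is an assumption.

```python
from dataclasses import dataclass


@dataclass
class RunningJob:
    name: str
    slots: int          # slots the job currently holds
    finish_time: float  # expected finish time, in arbitrary time units


def pick_host_group(free_slots_by_group):
    """Pick the candidate host group with the largest number of free slots."""
    return max(free_slots_by_group, key=free_slots_by_group.get)


def greedy_estimated_start(running_jobs, min_slots, reserved_slots=0):
    """Walk running jobs in ascending finish-time order, adding up their
    slots until the minimal slot request is met. The finish time of the
    last job visited is the estimate. Returns None if the request can
    never be met by the listed jobs.

    Assumption: slots that are already reserved count toward the request.
    """
    available = reserved_slots
    if available >= min_slots:
        return 0.0
    for job in sorted(running_jobs, key=lambda j: j.finish_time):
        available += job.slots
        if available >= min_slots:
            return job.finish_time
    return None


if __name__ == "__main__":
    jobs = [
        RunningJob("Job1", slots=4, finish_time=30.0),
        RunningJob("Job2", slots=2, finish_time=10.0),
        RunningJob("Job3", slots=4, finish_time=60.0),
        RunningJob("Job4", slots=2, finish_time=45.0),
    ]
    print(pick_host_group({"groupA": 3, "groupB": 5}))
    print(greedy_estimated_start(jobs, min_slots=6))
```

Note that the sketch only counts slots. It knows nothing about which hosts the slots are on, which is why an estimate of this kind can be wrong when the job has topology requirements.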

Reservation decisions made by greedy slot reservation do not have an accurate estimated start time or information about future allocation. The calculated job start time used for backfill scheduling is uncertain, so bjobs displays:

Job will start no sooner than indicated time stamp

Time-based reservation and greedy reservation compared


Start time prediction                                           Time-based reservation                           Greedy reservation
--------------------------------------------------------------  -----------------------------------------------  ------------------
Backfill scheduling if free slots are available                 Yes                                              Yes
Correct with no job topology                                    Yes                                              Yes
Correct for job topology requests                               Yes                                              No
Correct based on resource allocation limits                     Yes (guaranteed if only two limits are defined)  No
Correct for memory requests                                     Yes                                              No
When no slots are free for reservation                          Yes                                              No
Future allocation and reservation based on earliest start time  Yes                                              No
bjobs displays best estimate                                    Yes                                              No
bjobs displays predicted future allocation                      Yes                                              No
Absolute predicted start time for all jobs                      No                                               No
Advance reservation considered                                  No                                               No


Greedy reservation example

A cluster has four hosts: A, B, C, and D, with 4 CPUs each. Four jobs are running in the cluster: Job1, Job2, Job3 and Job4. According to calculated job estimated start time, the job finish times (FT) have this order: FT(Job2) < FT(Job1) < FT(Job4) < FT(Job3).

Now, a user submits a high priority job. It pends because it requests -n 6 -R "span[ptile=2]". This resource requirement means this pending job needs three hosts with two CPUs on each host. The default greedy slot reservation calculates job start time as the job finish time of Job4 because after Job4 finishes, three hosts with a minimum of two slots are available.

Greedy reservation indicates that the pending job starts no sooner than when Job2 finishes.

In contrast, time-based reservation can determine that the pending job starts in 2 hours. It is a much more accurate reservation.
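To make the topology requirement in this example concrete, the following Python sketch checks when -n 6 -R "span[ptile=2]" first becomes satisfiable, that is, when three hosts each have at least two free slots. It is a simplified illustration, not the LSF scheduler. The function names are hypothetical, and the placement of Job1 to Job4 on hosts A to D and the numeric finish times are invented for the sketch, because this section does not state them; only the finish order FT(Job2) < FT(Job1) < FT(Job4) < FT(Job3) comes from the example.

```python
import math


def hosts_needed(num_slots, ptile):
    """Number of hosts required when each host contributes ptile slots."""
    return math.ceil(num_slots / ptile)


def earliest_fit(free_slots, running_jobs, num_slots, ptile):
    """Return the earliest time at which enough hosts have at least
    ptile free slots, releasing running jobs in finish-time order.

    free_slots:   {host: free slots now}
    running_jobs: list of (name, finish_time, {host: slots held})
    Returns 0.0 if the request fits now, or None if it never fits.
    """
    free = dict(free_slots)
    needed = hosts_needed(num_slots, ptile)

    def fits():
        return sum(1 for s in free.values() if s >= ptile) >= needed

    if fits():
        return 0.0
    for _name, finish_time, placement in sorted(running_jobs, key=lambda j: j[1]):
        for host, slots in placement.items():
            free[host] += slots  # the job's slots become free when it finishes
        if fits():
            return finish_time
    return None


if __name__ == "__main__":
    # Hypothetical state: four hosts with 4 CPUs each.
    free_now = {"A": 0, "B": 1, "C": 0, "D": 1}
    jobs = [
        ("Job1", 2.0, {"A": 4}),
        ("Job2", 1.0, {"B": 3}),
        ("Job3", 4.0, {"C": 4}),
        ("Job4", 3.0, {"D": 3}),
    ]
    print(hosts_needed(6, 2))
    print(earliest_fit(free_now, jobs, num_slots=6, ptile=2))
```

With this assumed placement, the request first fits when Job4 finishes, because only then do three hosts have two or more free slots. A calculation that adds up freed slots without looking at hosts would report an earlier time for the same data.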