Configuring an interruptible backfill queue

Procedure

Configure INTERRUPTIBLE_BACKFILL=seconds in the lowest priority queue in the cluster. There can only be one interruptible backfill queue in the cluster.

Specify the minimum number of seconds for the job to be considered for backfilling. This minimum time slice depends on the specific job properties; it must be longer than at least one useful iteration of the job. Multiple queues may be created if a site has jobs of distinctly different classes.

For example:
Begin Queue
QUEUE_NAME   = background
# REQUEUE_EXIT_VALUES (set to whatever needed)
DESCRIPTION  = Interruptible Backfill queue
BACKFILL = Y
INTERRUPTIBLE_BACKFILL = 1
RUNLIMIT = 10
PRIORITY = 1
End Queue

Interruptible backfill is disabled unless both BACKFILL and RUNLIMIT are configured in the queue.
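The prerequisite above can be checked before the configuration is deployed. The following is a minimal sketch in Python, not an LSF tool: the helper names parse_queue and missing_prerequisites are hypothetical, and the parser handles only simple NAME = value lines inside one Begin Queue ... End Queue stanza (no line continuations, and any "#" is treated as the start of a comment).

```python
EXAMPLE_STANZA = """\
Begin Queue
QUEUE_NAME   = background
# REQUEUE_EXIT_VALUES (set to whatever needed)
DESCRIPTION  = Interruptible Backfill queue
BACKFILL = Y
INTERRUPTIBLE_BACKFILL = 1
RUNLIMIT = 10
PRIORITY = 1
End Queue
"""


def parse_queue(stanza):
    """Parse one Begin Queue ... End Queue stanza into a dict of parameters."""
    params = {}
    inside = False
    for raw in stanza.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if line.lower() == "begin queue":
            inside = True
        elif line.lower() == "end queue":
            break
        elif inside and "=" in line:
            key, value = line.split("=", 1)
            params[key.strip().upper()] = value.strip()
    return params


def missing_prerequisites(params):
    """Return the parameters that an interruptible backfill queue still lacks."""
    if "INTERRUPTIBLE_BACKFILL" not in params:
        return []  # not an interruptible backfill queue, nothing to check
    return [name for name in ("BACKFILL", "RUNLIMIT") if name not in params]


if __name__ == "__main__":
    print(missing_prerequisites(parse_queue(EXAMPLE_STANZA)))
```

An empty list means both prerequisites are present; any names in the list would leave interruptible backfill disabled for that queue.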

The value of INTERRUPTIBLE_BACKFILL is the minimum time slice, in seconds, for a job to be considered for backfill. The value depends on the specific job properties; it must be longer than at least one useful iteration of the job. Multiple queues may be created for different classes of jobs.

BACKFILL and RUNLIMIT must be configured in the queue.

RUNLIMIT corresponds to the maximum time slice for backfill, and should be configured so that the wait period for new jobs submitted to the queue is acceptable to users. 10 minutes of runtime is a common value.
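To illustrate how the two limits bound a backfill time slice, here is a small Python sketch. It is only an arithmetic illustration of the description above, not the LSF scheduler's actual algorithm. The function name backfill_slice_seconds is hypothetical, and the sketch assumes INTERRUPTIBLE_BACKFILL is given in seconds and RUNLIMIT in minutes, as in the example queue.

```python
def backfill_slice_seconds(slot_seconds, min_slice_seconds, runlimit_minutes):
    """Return the usable slice in seconds, or None if the slot is too short.

    slot_seconds      -- length of the free backfill slot (hypothetical input)
    min_slice_seconds -- the INTERRUPTIBLE_BACKFILL value
    runlimit_minutes  -- the RUNLIMIT value
    """
    if slot_seconds < min_slice_seconds:
        return None  # slot is shorter than the minimum time slice
    # The job never runs longer than the queue run limit.
    return min(slot_seconds, runlimit_minutes * 60)


if __name__ == "__main__":
    # Values from the example queue: INTERRUPTIBLE_BACKFILL = 1, RUNLIMIT = 10
    for slot in (0, 300, 3600):
        print(slot, backfill_slice_seconds(slot, 1, 10))
```

With a larger INTERRUPTIBLE_BACKFILL value, short slots are skipped; with a smaller RUNLIMIT, new jobs submitted to the queue wait less before running jobs give up their slots.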

You should configure REQUEUE_EXIT_VALUES for the queue so that resubmission is automatic. In order to terminate completely, jobs must have specific exit values:
  • If jobs are checkpointable, use their checkpoint exit value.
  • If jobs periodically save data on their own, use the SIGTERM exit value.