Configuring job-level automatic re-queuing

Procedure

Use bsub -Q to submit a job that is automatically re-queued if it exits with one of the specified exit values.

Use spaces to separate multiple exit codes. The reserved keyword all specifies all exit codes. Exit codes are typically between 0 and 255. Use a tilde (~) to exclude specified exit codes from the list.

Job-level re-queue exit values override the application-level and queue-level settings of the REQUEUE_EXIT_VALUES parameter, if defined.

Jobs submitted with job-level re-queue exit values share the same application and queue as other jobs.

For example:
bsub -Q "all ~1 ~2 EXCLUDE(9)" myjob

Jobs that exit with any exit code except 1 and 2 are re-queued. Jobs that exit with exit code 9 are re-queued so that the failed job is not rerun on the same host (exclusive job re-queue).
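
The way a specification such as the one in this example is read can be modeled with a short shell sketch. It is an illustration only, not LSF code: the function name requeue_action is hypothetical, and the precedence (a tilde exclusion wins over EXCLUDE(), which wins over a plain listing or all) is an assumption made for the sketch. The sketch also treats every exit code alike, including 0, which may differ from how LSF handles successful exits.

```shell
#!/bin/sh
# Illustrative model only, not LSF code: classify an exit code against a
# bsub -Q style specification such as "all ~1 ~2 EXCLUDE(9)".
# Prints "none", "requeue", or "exclusive".
requeue_action() {
    spec=$1
    code=$2
    listed=0
    excluded=0
    exclusive=0
    for tok in $spec; do
        case $tok in
            all)              listed=1 ;;     # keyword: every exit code
            "~$code")         excluded=1 ;;   # tilde removes this code
            "EXCLUDE($code)") exclusive=1 ;;  # exclusive re-queue for this code
            "$code")          listed=1 ;;     # code listed explicitly
        esac
    done
    # Assumed precedence: tilde exclusion, then EXCLUDE(), then plain listing.
    if [ "$excluded" -eq 1 ]; then
        echo none
    elif [ "$exclusive" -eq 1 ]; then
        echo exclusive
    elif [ "$listed" -eq 1 ]; then
        echo requeue
    else
        echo none
    fi
}

requeue_action "all ~1 ~2 EXCLUDE(9)" 1   # prints: none
requeue_action "all ~1 ~2 EXCLUDE(9)" 9   # prints: exclusive
requeue_action "all ~1 ~2 EXCLUDE(9)" 5   # prints: requeue
```

The three calls correspond to the cases described above: codes 1 and 2 are not re-queued, code 9 is re-queued exclusively, and any other code is re-queued normally.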

Enabling exclusive job re-queuing

Procedure

Define an exit code as EXCLUDE(exit_code) to enable exclusive job re-queue.

Exclusive job re-queue does not work for parallel jobs.

Note: If mbatchd is restarted, it does not remember the hosts on which the job previously exited with an exclusive re-queue exit code. In this situation, a job can be dispatched to a host on which it has already exited with an exclusive exit code.
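
The effect described in the note can be pictured with a small shell model. It is a sketch under stated assumptions, not how mbatchd is implemented: the function name eligible_hosts and the host names hostA, hostB, and hostC are hypothetical, and the failure history is represented as a plain space-separated list.

```shell
#!/bin/sh
# Illustrative model only, not LSF code: list the hosts still eligible for a
# job, given the hosts on which it exited with an exclusive re-queue code.
eligible_hosts() {
    all_hosts=$1      # space-separated candidate hosts
    failed_hosts=$2   # space-separated hosts remembered as failed (may be empty)
    result=""
    for host in $all_hosts; do
        case " $failed_hosts " in
            *" $host "*) ;;                           # remembered failure: skip
            *) result="${result:+$result }$host" ;;   # otherwise keep the host
        esac
    done
    printf '%s\n' "$result"
}

# While the failure history is kept, hostB is avoided.
eligible_hosts "hostA hostB hostC" "hostB"   # prints: hostA hostC

# With an empty history (as after a restart), hostB is a candidate again.
eligible_hosts "hostA hostB hostC" ""        # prints: hostA hostB hostC
```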

Modifying re-queue exit values

Procedure

Use bmod -Q to modify or cancel job-level re-queue exit values.

bmod -Q does not affect running jobs. For re-runnable and re-queue jobs, bmod -Q affects the next run.
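
As a rough illustration, the following dry-run sketch prints the two kinds of bmod command instead of executing them, so it can be run on a machine without LSF. The job ID 1234 is a placeholder. The -Qn form for cancelling is an assumption based on the usual bmod convention of appending n to an option to cancel it; confirm it against the bmod reference for your LSF version.

```shell
#!/bin/sh
# Dry-run sketch: the commands are printed, not executed.
# The job ID is a placeholder, not a real job.
job_id=1234

# Replace the job-level re-queue exit values; applies to the next run.
modify_cmd="bmod -Q \"all ~1 ~2\" $job_id"

# Cancel the job-level values (assumes the -Qn form noted above).
cancel_cmd="bmod -Qn $job_id"

printf '%s\n' "$modify_cmd" "$cancel_cmd"
```

To apply a change for real, run the printed command from a shell on a host where the LSF commands are available.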

Multicluster job forwarding model

For jobs forwarded to a remote cluster, the arguments of bsub -Q take effect on the remote cluster.

Multicluster lease model

The arguments of bsub -Q apply to jobs running on remote leased hosts as if they were running on local hosts.