Configure a remote-only queue

About this task

To make a queue that only runs jobs in remote clusters, take the following steps:

Procedure

  1. Edit the lsb.queues queue definition for the send-jobs queue.
    1. Define SNDJOBS_TO. This specifies that the queue can forward jobs to specified remote execution queues.
    2. Set HOSTS to none. This specifies that the queue uses no local hosts.
  2. Edit the lsb.queues queue definition for each receive-jobs queue.
    1. Define RCVJOBS_FROM. This specifies that the receive-jobs queue accepts jobs from the specified submission cluster.

Example

In cluster1:

Begin Queue
QUEUE_NAME = queue1
HOSTS = none
SNDJOBS_TO = queue2@cluster2
MAX_RSCHED_TIME = infinit
DESCRIPTION = A remote-only queue that sends jobs to cluster2.
End Queue

In cluster2:

Begin Queue
QUEUE_NAME = queue2
RCVJOBS_FROM = cluster1
DESCRIPTION = A queue that receives jobs from cluster1.
End Queue

queue1 in cluster1 forwards all jobs to queue2 in cluster2.
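
The two conditions from the procedure (SNDJOBS_TO is defined and HOSTS is none) can be checked mechanically. The following Python sketch is illustrative only and is not part of LSF. The helper names parse_queues and is_remote_only are hypothetical, and the sketch assumes that each parameter sits on a single line and that # starts a comment.

```python
def parse_queues(text):
    """Return one dict per Begin Queue ... End Queue block (assumed layout)."""
    queues, current = [], None
    for raw in text.splitlines():
        # Assumption: '#' starts a comment; continuation lines are not handled.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        if line.lower() == "begin queue":
            current = {}
        elif line.lower() == "end queue":
            if current is not None:
                queues.append(current)
            current = None
        elif current is not None and "=" in line:
            key, value = line.split("=", 1)
            current[key.strip()] = value.strip()
    return queues


def is_remote_only(queue):
    """True when the queue forwards jobs and uses no local hosts."""
    forwards = bool(queue.get("SNDJOBS_TO"))
    no_local_hosts = queue.get("HOSTS", "").lower() == "none"
    return forwards and no_local_hosts


SAMPLE = """\
Begin Queue
QUEUE_NAME = queue1
HOSTS = none
SNDJOBS_TO = queue2@cluster2
MAX_RSCHED_TIME = infinit
DESCRIPTION = A remote-only queue that sends jobs to cluster2.
End Queue
"""

if __name__ == "__main__":
    for q in parse_queues(SAMPLE):
        print(q["QUEUE_NAME"], is_remote_only(q))
```

The check reads only the text of the queue definition. It does not confirm that the receive-jobs queue in the remote cluster defines a matching RCVJOBS_FROM.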

Disable timeout in remote-only queues

About this task

This task applies to a remote-only send-jobs queue that sends to only one receive-jobs queue.

Procedure

Set MAX_RSCHED_TIME=infinit to maintain FCFS job order of MultiCluster jobs in the execution queue.

Otherwise, jobs that time out are rescheduled to the same execution queue, but they lose priority and position because they are treated as a new job submission.

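In general, the timeout is helpful because it allows LSF to automatically shift a pending MultiCluster job to a better queue. When the send-jobs queue has only one receive-jobs queue, there is no other queue to shift to, so the timeout only costs the job its position.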
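
The effect of a timeout on job order can be pictured with a toy model. The following Python sketch is not LSF's scheduler. It only assumes, as described above, that a job that times out rejoins the same execution queue as if it were newly submitted. The function name and job names are hypothetical.

```python
def reschedule_on_timeout(pending, job):
    """Toy model: the timed-out job leaves its slot and rejoins at the tail."""
    order = list(pending)   # copy so the caller's list is untouched
    order.remove(job)
    order.append(job)       # treated as a new job submission
    return order


if __name__ == "__main__":
    pending = ["job101", "job102", "job103"]   # FCFS order in the execution queue
    print(reschedule_on_timeout(pending, "job101"))
```

In this model job101 ends up behind job102 and job103. With the timeout disabled, no job is rescheduled and the original order is kept.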

Submit a job to run in a remote cluster

About this task

Jobs can be submitted to run only in a remote cluster.

Procedure

Use bsub -q and specify a remote-only MultiCluster queue.

This is not compatible with bsub -m. When your job is forwarded to a remote queue, you cannot specify the execution host by name.

Example:

queue1 is a remote-only MultiCluster queue.

% bsub -q queue1 myjob
Job <101> is submitted to queue <queue1>.

This job will be dispatched to a remote cluster.
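
Scripts that submit jobs often need the job ID for later commands such as bjobs or bhist. The following Python sketch extracts the job ID and queue name from a confirmation line in the format shown above. The function name parse_bsub_output is hypothetical, and the pattern assumes the message wording matches the example exactly.

```python
import re

# Matches the confirmation line shown in the example above.
_SUBMITTED = re.compile(r"Job <(\d+)> is submitted to queue <([^>]+)>\.")


def parse_bsub_output(text):
    """Return (job_id, queue_name) from bsub's confirmation message."""
    match = _SUBMITTED.search(text)
    if match is None:
        raise ValueError("no submission message found in: %r" % text)
    return int(match.group(1)), match.group(2)


if __name__ == "__main__":
    print(parse_bsub_output("Job <101> is submitted to queue <queue1>."))
```

The text passed to the function could be the captured standard output of bsub -q queue1 myjob.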

Force a pending job to run

Use brun -m to force a pending or finished job to run or be forwarded to a specified cluster. The exact behavior of brun on a pending job depends on where the job is pending and on which hosts or clusters are specified in the brun command.

Important:

Only administrators can use the brun command. You can only run brun from the submission cluster.

You must specify one or more host names or a cluster name when you force a job to run.

If multiple hosts are specified, the first available host is selected and the remainder ignored. Specified hosts cannot belong to more than one cluster.

You can only specify one cluster name. The job is forced to be forwarded to the specified cluster.

You cannot specify host names and cluster names together in the same brun command.

If a job is pending in an execution cluster and you force it to run in a different cluster, the job is returned to the submission cluster and then forwarded again.

If a job is submitted with a cluster name and the job is forwarded to a remote cluster, you cannot use brun -m again to switch the job to another execution cluster. For example:

bsub -m cluster1 -q test1 sleep 1000

The job is pending on cluster1. Running brun again to forward the job to cluster2 is rejected:

brun -m cluster2 1803
Failed to run the job: Hosts requested do not belong to the cluster

For example, when multiple hosts are specified:

brun -m "host12 host27"

In this example, if host12 is available the job is sent to the cluster containing host12 and tries to run. If unsuccessful, the job pends in the cluster containing host12. If host12 is not available, the job is sent to the cluster containing host27 where it runs or pends.
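
The argument rules listed above can be enforced before brun is called. The following Python sketch builds the argument list for brun -m and rejects combinations that the rules forbid. The function build_brun_command is hypothetical and is not part of LSF. It does not check that the specified hosts all belong to one cluster, because that requires cluster membership data that the sketch does not have.

```python
def build_brun_command(job_id, hosts=(), cluster=None):
    """Return an argv list for brun -m, or raise ValueError on a rule violation."""
    hosts = list(hosts)
    if hosts and cluster:
        raise ValueError("specify host names or a cluster name, not both")
    if not hosts and not cluster:
        raise ValueError("specify one or more host names or a cluster name")
    if cluster and len(cluster.split()) != 1:
        raise ValueError("only one cluster name can be specified")
    # Multiple hosts are passed as one space-separated argument, matching
    # the quoted form brun -m "host12 host27".
    target = " ".join(hosts) if hosts else cluster
    return ["brun", "-m", target, str(job_id)]


if __name__ == "__main__":
    print(build_brun_command(246, hosts=["hostA"]))
    print(build_brun_command(244, cluster="cluster2"))
```

The returned list is suitable for a process-launching call that takes an argument vector, so no shell quoting is needed.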

Force a job to run on a specific host

Local host specified
Job runs locally. For example:
brun -m hostA 246
Job <246> is being forced to run or forwarded.
bjobs 246
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
246     user1   RUN   normal     hostD       hostA       *eep 10000 Jan  3 12:15
bhist -l 246
Job <246>, User <user1>, Project <default>, Command <sleep 10000>
Mon Jan  3 12:15:22: Submitted from host <hostD>, to Queue <normal>,
CWD <$HOME/envs>, Requested Resources <type == any>;
Mon Jan  3 12:16:13: Job is forced to run or forwarded by user or administrator <user1>;
Mon Jan  3 12:16:13: Dispatched to <hostA>;
Mon Jan  3 12:16:41: Starting (Pid 10467);
Mon Jan  3 12:16:59: Running with execution home </home/user1>,
Execution CWD </home/user1/envs>, Execution Pid <10467>;

Host in execution cluster specified
Job is forwarded to execution cluster containing specified host, and runs.

For example:

brun -m hostB 244
Job <244> is being forced to run or forwarded.
bjobs 244
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
244     user1   RUN   normal     hostD       hostB       *eep 10000 Jan  3 12:15
bhist -l 244
 
Job <244>, User <user1>, Project <default>, Command <sleep 10000>
Mon Jan  3 12:15:22: Submitted from host <hostD>, to Queue <normal>,
 CWD <$HOME/envs>, Requested Resources <type == any>;
Mon Jan  3 12:19:18: Job is forced to run or forwarded by user or administrator <user1>;
Mon Jan  3 12:19:18: Forwarded job to cluster cluster2;
Mon Jan  3 12:19:18: Remote job control initiated;
Mon Jan  3 12:19:18: Dispatched to <hostB>;
Mon Jan  3 12:19:18: Remote job control completed;
Mon Jan  3 12:19:19: Starting (Pid 28804);
Mon Jan  3 12:19:19: Running with execution home </home/user1>,
 Execution CWD </home/user1/envs>, Execution Pid <28804>;

Host in same execution cluster specified
Job runs on the specified host in the same execution cluster. For example:
brun -m hostB 237
Job <237> is being forced to run or forwarded.

bjobs 237
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
237     user1   RUN   normal     hostD       hostB       *eep 10000 Jan  3 12:14
 
bhist -l 237
 
Job <237>, User <user1>, Project <default>, Command <sleep 10000>
Mon Jan  3 12:14:48: Submitted from host <hostD>, to Queue <normal>,
 CWD <$HOME/envs>, Requested Resources <type == any>;
Mon Jan  3 12:14:53: Forwarded job to cluster cluster2;
Mon Jan  3 12:22:08: Job is forced to run or forwarded by user or administrator <user1>;
Mon Jan  3 12:22:08: Remote job control initiated;
Mon Jan  3 12:22:08: Dispatched to <hostB>;
Mon Jan  3 12:22:09: Remote job control completed;
Mon Jan  3 12:22:09: Starting (Pid 0);
Mon Jan  3 12:22:09: Starting (Pid 29073);
Mon Jan  3 12:22:09: Running with execution home </home/user1>,
 Execution CWD </home/user1/envs>, Execution Pid <29073>;

Host in submission cluster specified
Job runs on the specified host in the submission cluster. For example:
brun -m hostA 238
Job <238> is being forced to run or forwarded.

bjobs 238
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
238     user1   RUN   normal     hostB       hostA       *eep 10000 Oct  5 11:00

bhist -l 238

Job <238>, User <user1>, Project <default>, Command <sleep 10000>
Wed Oct  5 11:00:16: Submitted from host <hostB>, to Queue <normal>,
 CWD </usr/local/xl/conf>, 
                     Requested Resources <type == any>;
Wed Oct  5 11:00:18: Forwarded job to cluster ec1;
Wed Oct  5 11:00:46: Job is forced to run or forwarded by user or administrator <user1>;
Wed Oct  5 11:00:46: Pending: Job has returned from remote cluster;
Wed Oct  5 11:00:46: Dispatched to <hostA>;
Wed Oct  5 11:00:46: Starting (Pid 15686);
Wed Oct  5 11:00:47: Running with execution home </home/user1>,
 Execution CWD </usr/local/xl/conf>, 
                     Execution Pid <15686>;

Summary of time in seconds spent in various states by  Wed Oct  5 11:01:06
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  30       0        20       0        0        0        50
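
The bhist -l transcripts in this section differ mainly in three events: forwarding to a cluster, return from a remote cluster, and dispatch to a host. The following Python sketch pulls those events out of bhist -l text so that the cases can be told apart. The function summarize_bhist is hypothetical, and the patterns assume the event wording shown in these examples.

```python
import re

# Patterns assume the event wording shown in the transcripts above.
_FORWARDED = re.compile(r"Forwarded job to cluster ([^;\s]+);")
_DISPATCHED = re.compile(r"Dispatched to <([^>]+)>")
_RETURNED = "Job has returned from remote cluster"


def summarize_bhist(text):
    """Collect forwarding, return, and dispatch events from bhist -l text."""
    return {
        "forwarded_to": _FORWARDED.findall(text),
        "dispatched_to": _DISPATCHED.findall(text),
        "returned": _RETURNED in text,
    }


SAMPLE = """\
Wed Oct  5 11:00:18: Forwarded job to cluster ec1;
Wed Oct  5 11:00:46: Job is forced to run or forwarded by user or administrator <user1>;
Wed Oct  5 11:00:46: Pending: Job has returned from remote cluster;
Wed Oct  5 11:00:46: Dispatched to <hostA>;
"""

if __name__ == "__main__":
    print(summarize_bhist(SAMPLE))
```

For the sample, the summary shows one forward to ec1, a return to the submission cluster, and a dispatch to hostA, which matches the case where a host in the submission cluster is specified.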

Force a job to run in a specific cluster

Host in different execution cluster specified
Job returns to submission cluster, is forwarded to execution cluster containing specified host, and runs.
brun -m ec2-hostA 3111
Job <3111> is being forced to run or forwarded.

bjobs 3111
JOBID   USER    STAT  QUEUE      FROM_HOST       EXEC_HOST   JOB_NAME   SUBMIT_TIME
3111    user1   RUN   queue1     sub-management  ec2-hostA   sleep 1000 Feb 23 11:21
 
bhist -l 3111
 
Job <3111>, User <user1>, Project <default>, Command <sleep 1000>
Wed Feb 23 11:21:00: Submitted from host <sub-management>, to Queue <queue1>,
 CWD </usr/local/xl/conf>;
Wed Feb 23 11:21:03: Forwarded job to cluster cluster1;
Wed Feb 23 11:21:58: Job is forced to run or forwarded by user or administrator <user1>;
Wed Feb 23 11:21:58: Pending: Job has returned from remote cluster;
Wed Feb 23 11:21:58: Forwarded job to cluster cluster2;
Wed Feb 23 11:21:58: Remote job run control initiated;
Wed Feb 23 11:21:59: Dispatched to <ec2-hostA>;
Wed Feb 23 11:21:59: Remote job run control completed;
Wed Feb 23 11:21:59: Starting (Pid 3257);
Wed Feb 23 11:21:59: Running with execution home </home/user1>,
 Execution CWD </usr/local/xl/conf>, Execution Pid <3257>;
 
Summary of time in seconds spent in various states by  Wed Feb 23 11:24:59
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  59       0        180      0        0        0        239

Job already forwarded to execution cluster
Job has already been forwarded to an execution cluster, and you specify a different execution cluster. The job returns to submission cluster, and is forced to be forwarded to the specified execution cluster. The job is not forced to run in the new execution cluster. After the job is forwarded, the execution cluster schedules the job according to local policies.

For example:

brun -m cluster2 244
Job <244> is being forced to run or forwarded.
bjobs 244
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
244     user1   RUN   normal     hostD       hostB       *eep 10000 Jan  3 12:15
bhist -l 244
 
Job <244>, User <user1>, Project <default>, Command <sleep 10000>
Mon Jan  3 12:15:22: Submitted from host <hostD>, to Queue <normal>,
 CWD <$HOME/envs>, Requested Resources <type == any>;
Mon Jan  3 12:15:25: Forwarded job to cluster cluster1;
Mon Jan  3 12:19:18: Job is forced to run or forwarded by user or administrator <user1>;
Mon Jan  3 12:19:18: Pending: Job has returned from remote cluster;
Mon Jan  3 12:19:18: Forwarded job to cluster cluster2;
Mon Jan  3 12:19:18: Dispatched to <hostB>;
Mon Jan  3 12:19:19: Starting (Pid 28804);
Mon Jan  3 12:19:19: Running with execution home </home/user1>,
 Execution CWD </home/user1/envs>, Execution Pid <28804>;

Job pending in execution cluster
Job is forwarded to the specified execution cluster, but the job is not forced to run. After the job is forwarded, the execution cluster schedules the job according to local policies.

For example:

brun -m cluster2 244
Job <244> is being forced to run or forwarded.
bhist -l 244
 
Job <244>, User <user1>, Project <default>, Command <sleep 10000>
Mon Jan  3 12:15:22: Submitted from host <hostD>, to Queue <normal>,
 CWD <$HOME/envs>, Requested Resources <type == any>;
Mon Jan  3 12:19:18: Job is forced to run or forwarded by user or administrator <user1>;
Mon Jan  3 12:19:18: Forwarded job to cluster cluster2;
Mon Jan  3 12:19:18: Remote job control initiated;
Mon Jan  3 12:19:18: Dispatched to <hostB>;
Mon Jan  3 12:19:18: Remote job control completed;
Mon Jan  3 12:19:19: Starting (Pid 28804);
Mon Jan  3 12:19:19: Running with execution home </home/user1>,
 Execution CWD </home/user1/envs>, Execution Pid <28804>;