Configure a remote-only queue
About this task
To make a queue that only runs jobs in remote clusters, take the following steps:
Procedure
Example
In cluster1:
Begin Queue
QUEUE_NAME = queue1
HOSTS = none
SNDJOBS_TO = queue2@cluster2
MAX_RSCHED_TIME = infinit
DESCRIPTION = A remote-only queue that sends jobs to cluster2.
End Queue
In cluster2:
Begin Queue
QUEUE_NAME = queue2
RCVJOBS_FROM = cluster1
DESCRIPTION = A queue that receives jobs from cluster1.
End Queue
In cluster1, queue1 forwards all jobs to queue2 in cluster2.
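As a quick sanity check on a queue definition like the one above, the following minimal Python sketch parses Begin Queue / End Queue stanzas and tests whether a queue looks remote-only. It is not part of LSF: parse_queues and is_remote_only are hypothetical helper names, and the test for "remote-only" (HOSTS set to none and SNDJOBS_TO defined) is an assumption taken from the queue1 example.

```python
def parse_queues(text):
    """Parse 'Begin Queue' ... 'End Queue' stanzas into a list of dicts."""
    queues, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line == "Begin Queue":
            current = {}
        elif line == "End Queue":
            if current is not None:
                queues.append(current)
            current = None
        elif current is not None and "=" in line:
            # Split on the first "=" only, so values may contain "=".
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip()
    return queues


def is_remote_only(queue):
    # Assumption: a remote-only queue has HOSTS = none and a SNDJOBS_TO
    # target, as in the queue1 example. This is not an LSF API.
    return queue.get("HOSTS", "").lower() == "none" and bool(queue.get("SNDJOBS_TO"))


CLUSTER1 = """\
Begin Queue
QUEUE_NAME = queue1
HOSTS = none
SNDJOBS_TO = queue2@cluster2
MAX_RSCHED_TIME = infinit
DESCRIPTION = A remote-only queue that sends jobs to cluster2.
End Queue
"""

queue1 = parse_queues(CLUSTER1)[0]
print(queue1["QUEUE_NAME"], is_remote_only(queue1))
```

The receive-jobs queue in cluster2 has no HOSTS = none line and no SNDJOBS_TO parameter, so the same check reports it as not remote-only.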
Disable timeout in remote-only queues
About this task
Disable the timeout when a remote-only send-jobs queue sends to only one receive-jobs queue.
Procedure
Set MAX_RSCHED_TIME = infinit in the send-jobs queue, as in the queue1 example. Otherwise, jobs that time out are rescheduled to the same execution queue, but they lose priority and position because they are treated as new job submissions.
In general, the timeout is helpful because it allows LSF to automatically shift a pending MultiCluster job to a better queue.
Submit a job to run in a remote cluster
About this task
Jobs can be submitted to run only in a remote cluster.
Procedure
Use bsub -q to submit the job to a remote-only MultiCluster queue. This is not compatible with bsub -m: when your job is forwarded to a remote queue, you cannot specify the execution host by name.
Example:
queue1 is a remote-only MultiCluster queue.
% bsub -q queue1 myjob
Job <101> is submitted to queue <queue1>.
This job will be dispatched to a remote cluster.
Force a pending job to run
Use brun -m to force a pending or finished job to run or be forwarded to a specified cluster. The exact behavior of brun on a pending job depends on where the job is pending, and on which hosts or clusters are specified in the brun command.
Only administrators can use the brun command. You can only run brun from the submission cluster.
You must specify one or more host names or a cluster name when you force a job to run.
If multiple hosts are specified, the first available host is selected and the remainder ignored. Specified hosts cannot belong to more than one cluster.
You can only specify one cluster name. The job is forced to be forwarded to the specified cluster.
You cannot specify host names and cluster names together in the same brun command.
A job that is pending in an execution cluster and is forced to run in a different cluster is first returned to the submission cluster, and then forwarded again.
If a job is submitted with a cluster name and the job is forwarded to a remote cluster, you cannot use brun -m again to switch the job to another execution cluster. For example:
bsub -m cluster1 -q test1 sleep 1000
The job is pending on cluster1. Running brun again to forward the job to cluster2 is rejected:
brun -m cluster2 1803
Failed to run the job: Hosts requested do not belong to the cluster
For example:
brun -m "host12 host27"
In this example, if host12 is available the job is sent to the cluster containing host12 and tries to run. If unsuccessful, the job pends in the cluster containing host12. If host12 is not available, the job is sent to the cluster containing host27 where it runs or pends.
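The argument rules for brun -m described above can be sketched as a small validation routine. This is an illustrative model only, not how LSF implements brun: check_brun_targets and pick_host are hypothetical names, and the host and cluster lookup tables are assumptions supplied by the caller. What happens when none of the listed hosts is available is not modeled here.

```python
def check_brun_targets(names, host_to_cluster, cluster_names):
    """Classify the names given to `brun -m` and reject invalid combinations.

    host_to_cluster and cluster_names are hypothetical lookup tables;
    LSF resolves these names internally.
    Returns ("hosts", [host, ...]) or ("cluster", cluster_name).
    """
    if not names:
        raise ValueError("specify one or more host names or a cluster name")
    unknown = [n for n in names
               if n not in cluster_names and n not in host_to_cluster]
    if unknown:
        raise ValueError("unknown host or cluster: " + ", ".join(unknown))
    clusters = [n for n in names if n in cluster_names]
    hosts = [n for n in names if n in host_to_cluster]
    if clusters and hosts:
        raise ValueError("host names and cluster names cannot be combined")
    if len(clusters) > 1:
        raise ValueError("only one cluster name can be specified")
    if clusters:
        return ("cluster", clusters[0])
    return ("hosts", hosts)


def pick_host(hosts, available):
    """Return the first listed host that is available, or None if none is."""
    for host in hosts:
        if host in available:
            return host
    return None


# Example data (assumed): host12 and host27 live in different clusters.
HOSTS = {"host12": "cluster1", "host27": "cluster2"}
CLUSTERS = {"cluster1", "cluster2"}

kind, targets = check_brun_targets(["host12", "host27"], HOSTS, CLUSTERS)
print(kind, pick_host(targets, available={"host27"}))
```

Keeping the validation separate from the host selection mirrors the text: the first function covers which names may appear together, and the second covers "the first available host is selected and the remainder ignored".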
Force a job to run on a specific host
- Local host specified
- Job runs locally. For example:
brun -m hostA 246
Job <246> is being forced to run or forwarded.
bjobs 246
JOBID  USER   STAT  QUEUE   FROM_HOST  EXEC_HOST  JOB_NAME    SUBMIT_TIME
246    user1  RUN   normal  hostD      hostA      *eep 10000  Jan 3 12:15
bhist -l 246
Job <246>, User <user1>, Project <default>, Command <sleep 10000>
Mon Jan 3 12:15:22: Submitted from host <hostD>, to Queue <normal>, CWD <$HOME/envs>, Requested Resources <type == any>;
Mon Jan 3 12:16:13: Job is forced to run or forwarded by user or administrator <user1>;
Mon Jan 3 12:16:13: Dispatched to <hostA>;
Mon Jan 3 12:16:41: Starting (Pid 10467);
Mon Jan 3 12:16:59: Running with execution home </home/user1>, Execution CWD </home/user1/envs>, Execution Pid <10467>;
- Host in execution cluster specified
- Job is forwarded to the execution cluster containing the specified host, and runs. For example:
brun -m hostB 244
Job <244> is being forced to run or forwarded.
bjobs 244
JOBID  USER   STAT  QUEUE   FROM_HOST  EXEC_HOST  JOB_NAME    SUBMIT_TIME
244    user1  RUN   normal  hostD      hostB      *eep 10000  Jan 3 12:15
bhist -l 244
Job <244>, User <user1>, Project <default>, Command <sleep 10000>
Mon Jan 3 12:15:22: Submitted from host <hostD>, to Queue <normal>, CWD <$HOME/envs>, Requested Resources <type == any>;
Mon Jan 3 12:19:18: Job is forced to run or forwarded by user or administrator <user1>;
Mon Jan 3 12:19:18: Forwarded job to cluster cluster2;
Mon Jan 3 12:19:18: Remote job control initiated;
Mon Jan 3 12:19:18: Dispatched to <hostB>;
Mon Jan 3 12:19:18: Remote job control completed;
Mon Jan 3 12:19:19: Starting (Pid 28804);
Mon Jan 3 12:19:19: Running with execution home </home/user1>, Execution CWD </home/user1/envs>, Execution Pid <28804>;
- Host in same execution cluster specified
- Job runs on the specified host in the same execution cluster. For example:
brun -m hostB 237
Job <237> is being forced to run or forwarded.
bjobs 237
JOBID  USER   STAT  QUEUE   FROM_HOST  EXEC_HOST  JOB_NAME    SUBMIT_TIME
237    user1  RUN   normal  hostD      hostB      *eep 10000  Jan 3 12:14
bhist -l 237
Job <237>, User <user1>, Project <default>, Command <sleep 10000>
Mon Jan 3 12:14:48: Submitted from host <hostD>, to Queue <normal>, CWD <$HOME/envs>, Requested Resources <type == any>;
Mon Jan 3 12:14:53: Forwarded job to cluster cluster2;
Mon Jan 3 12:22:08: Job is forced to run or forwarded by user or administrator <user1>;
Mon Jan 3 12:22:08: Remote job control initiated;
Mon Jan 3 12:22:08: Dispatched to <hostB>;
Mon Jan 3 12:22:09: Remote job control completed;
Mon Jan 3 12:22:09: Starting (Pid 0);
Mon Jan 3 12:22:09: Starting (Pid 29073);
Mon Jan 3 12:22:09: Running with execution home </home/user1>, Execution CWD </home/user1/envs>, Execution Pid <29073>;
- Host in submission cluster specified
- Job runs on the specified host in the submission cluster. For example:
brun -m hostA 238
Job <238> is being forced to run or forwarded.
bjobs 238
JOBID  USER   STAT  QUEUE   FROM_HOST  EXEC_HOST  JOB_NAME    SUBMIT_TIME
238    user1  RUN   normal  hostB      hostA      *eep 10000  Oct 5 11:00
bhist -l 238
Job <238>, User <user1>, Project <default>, Command <sleep 10000>
Wed Oct 5 11:00:16: Submitted from host <hostB>, to Queue <normal>, CWD </usr/local/xl/conf>, Requested Resources <type == any>;
Wed Oct 5 11:00:18: Forwarded job to cluster ec1;
Wed Oct 5 11:00:46: Job is forced to run or forwarded by user or administrator <user1>;
Wed Oct 5 11:00:46: Pending: Job has returned from remote cluster;
Wed Oct 5 11:00:46: Dispatched to <hostA>;
Wed Oct 5 11:00:46: Starting (Pid 15686);
Wed Oct 5 11:00:47: Running with execution home </home/user1>, Execution CWD </usr/local/xl/conf>, Execution Pid <15686>;
Summary of time in seconds spent in various states by Wed Oct 5 11:01:06
PEND  PSUSP  RUN  USUSP  SSUSP  UNKWN  TOTAL
30    0      20   0      0      0      50
Force a job to run in a specific cluster
- Host in different execution cluster specified
- Job returns to the submission cluster, is forwarded to the execution cluster containing the specified host, and runs. For example:
brun -m ec2-hostA 3111
Job <3111> is being forced to run or forwarded.
bjobs 3111
JOBID  USER   STAT  QUEUE   FROM_HOST       EXEC_HOST  JOB_NAME    SUBMIT_TIME
3111   user1  RUN   queue1  sub-management  ec2-hostA  sleep 1000  Feb 23 11:21
bhist -l 3111
Job <3111>, User <user1>, Project <default>, Command <sleep 1000>
Wed Feb 23 11:21:00: Submitted from host <sub-management>, to Queue <queue1>, CWD </usr/local/xl/conf>;
Wed Feb 23 11:21:03: Forwarded job to cluster cluster1;
Wed Feb 23 11:21:58: Job is forced to run or forwarded by user or administrator <user1>;
Wed Feb 23 11:21:58: Pending: Job has returned from remote cluster;
Wed Feb 23 11:21:58: Forwarded job to cluster cluster2;
Wed Feb 23 11:21:58: Remote job run control initiated;
Wed Feb 23 11:21:59: Dispatched to <ec2-hostA>;
Wed Feb 23 11:21:59: Remote job run control completed;
Wed Feb 23 11:21:59: Starting (Pid 3257);
Wed Feb 23 11:21:59: Running with execution home </home/user1>, Execution CWD </usr/local/xl/conf>, Execution Pid <3257>;
Summary of time in seconds spent in various states by Wed Feb 23 11:24:59
PEND  PSUSP  RUN  USUSP  SSUSP  UNKWN  TOTAL
59    0      180  0      0      0      239
- Job already forwarded to execution cluster
- Job has already been forwarded to an execution cluster, and you specify a different execution cluster. The job returns to the submission cluster, and is forced to be forwarded to the specified execution cluster. The job is not forced to run in the new execution cluster. After the job is forwarded, the execution cluster schedules the job according to local policies. For example:
brun -m cluster2 244
Job <244> is being forced to run or forwarded.
bjobs 244
JOBID  USER   STAT  QUEUE   FROM_HOST  EXEC_HOST  JOB_NAME    SUBMIT_TIME
244    user1  RUN   normal  hostD      hostB      *eep 10000  Jan 3 12:15
bhist -l 244
Job <244>, User <user1>, Project <default>, Command <sleep 10000>
Mon Jan 3 12:15:22: Submitted from host <hostD>, to Queue <normal>, CWD <$HOME/envs>, Requested Resources <type == any>;
Mon Jan 3 12:15:25: Forwarded job to cluster cluster1;
Mon Jan 3 12:19:18: Job is forced to run or forwarded by user or administrator <user1>;
Mon Jan 3 12:19:18: Pending: Job has returned from remote cluster;
Mon Jan 3 12:19:18: Forwarded job to cluster cluster2;
Mon Jan 3 12:19:18: Dispatched to <hostB>;
Mon Jan 3 12:19:19: Starting (Pid 28804);
Mon Jan 3 12:19:19: Running with execution home </home/user1>, Execution CWD </home/user1/envs>, Execution Pid <28804>;
- Job pending in submission cluster
- Job is forwarded to the specified execution cluster, but the job is not forced to run. After the job is forwarded, the execution cluster schedules the job according to local policies. For example:
brun -m cluster2 244
Job <244> is being forced to run or forwarded.
bhist -l 244
Job <244>, User <user1>, Project <default>, Command <sleep 10000>
Mon Jan 3 12:15:22: Submitted from host <hostD>, to Queue <normal>, CWD <$HOME/envs>, Requested Resources <type == any>;
Mon Jan 3 12:19:18: Job is forced to run or forwarded by user or administrator <user1>;
Mon Jan 3 12:19:18: Forwarded job to cluster cluster2;
Mon Jan 3 12:19:18: Remote job control initiated;
Mon Jan 3 12:19:18: Dispatched to <hostB>;
Mon Jan 3 12:19:18: Remote job control completed;
Mon Jan 3 12:19:19: Starting (Pid 28804);
Mon Jan 3 12:19:19: Running with execution home </home/user1>, Execution CWD </home/user1/envs>, Execution Pid <28804>;
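The host and cluster cases above follow a common pattern, which the following Python sketch tries to summarize. It is one reading of those cases based on the bhist traces, not LSF code: brun_steps is a hypothetical function, and its arguments (the submission cluster, the cluster where the job currently pends, and the cluster that is named or that contains the named host) are assumptions introduced for illustration. Cases the text does not describe, such as naming the cluster where the job already pends, are not meaningful here.

```python
def brun_steps(submission, pending_in, target_cluster, target_is_host):
    """List the steps a forced job goes through, per the cases in the text.

    submission:     name of the submission cluster
    pending_in:     cluster where the job is currently pending
    target_cluster: cluster named with -m, or cluster containing the named host
    target_is_host: True if -m named a host, False if it named a cluster
    """
    steps = []
    # A job pending in an execution cluster must first come back if it is
    # being sent anywhere other than where it already is.
    if pending_in != submission and target_cluster != pending_in:
        steps.append("return to " + submission)
    # Forwarding is only needed when the target is a remote cluster that the
    # job is not already in.
    if target_cluster != submission and target_cluster != pending_in:
        steps.append("forward to " + target_cluster)
    # Naming a host forces the job to run; naming a cluster only forwards it.
    if target_is_host:
        steps.append("run on specified host")
    else:
        steps.append("schedule by local policies")
    return steps


# Job pending in cluster1, host in cluster2 specified (assumed names).
for step in brun_steps("sub", "cluster1", "cluster2", target_is_host=True):
    print(step)
```

The key distinction the sketch encodes is the last step: specifying a host forces the job to run there, while specifying a cluster only forces the forward and leaves scheduling to the execution cluster.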