The job migration feature requires that a job be made checkpointable or rerunnable at
the job, application, or queue level.
An LSF user can make a job:
- Checkpointable, by using bsub -k and specifying a checkpoint directory, a checkpoint
period, and an optional initial checkpoint period
- Rerunnable, by using bsub -r
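The two job-level options above can be sketched as a small Python helper that assembles the bsub argument list. This is only an illustration: the helper names and the `./my_job` command are hypothetical, and the `init=` keyword for the initial checkpoint period is an assumption about the bsub -k option string that should be checked against the bsub reference for your LSF version.

```python
from typing import List, Optional


def checkpointable_bsub_args(
    chkpnt_dir: str,
    chkpnt_period: Optional[int] = None,
    init_period: Optional[int] = None,
    job_command: str = "./my_job",  # hypothetical job command
) -> List[str]:
    """Build an argument list for submitting a checkpointable job with bsub -k."""
    # The -k value is a single string: directory first, then optional periods.
    parts = [chkpnt_dir]
    if init_period is not None:
        # Assumption: the initial checkpoint period is written as init=<minutes>.
        parts.append(f"init={init_period}")
    if chkpnt_period is not None:
        parts.append(str(chkpnt_period))
    return ["bsub", "-k", " ".join(parts), job_command]


def rerunnable_bsub_args(job_command: str = "./my_job") -> List[str]:
    """Build an argument list for submitting a rerunnable job with bsub -r."""
    return ["bsub", "-r", job_command]
```

The resulting list can be passed to `subprocess.run` on a host where the LSF commands are installed. Building a list rather than a shell string keeps the quoted -k value intact as one argument.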
Configuration file
|
Parameter and syntax
|
Behavior
|
lsb.queues
|
CHKPNT=chkpnt_dir [chkpnt_period]
|
All jobs submitted to the queue are checkpointable.
- The specified checkpoint directory must already exist. LSF does not create the
checkpoint directory.
- The user account that submits the job must have read and write permissions for
the checkpoint directory.
- For the job to restart on another execution host, both the original and new hosts
must have network connectivity to the checkpoint directory.
If the queue administrator specifies a checkpoint period, in minutes, LSF creates a
checkpoint file every chkpnt_period minutes during job execution.
If a user specifies a checkpoint directory and checkpoint period at the job level
with bsub -k, the job-level values override the queue-level values.
|
RERUNNABLE=Y
|
|
lsb.applications
|
CHKPNT_DIR=chkpnt_dir
|
Specifies the checkpoint directory for automatic checkpointing for the application.
To enable automatic checkpointing for the application profile, administrators must
specify a checkpoint directory in the configuration of the application profile.
If CHKPNT_PERIOD, CHKPNT_INITPERIOD, or CHKPNT_METHOD is set in an application
profile but CHKPNT_DIR is not set, LSF issues a warning message and ignores those
settings.
The checkpoint directory is the directory where the checkpoint files are created.
Specify an absolute path or a path relative to the current working directory for
the job. Do not use environment variables in the directory path.
If checkpoint-related configuration is specified in both the queue and an
application profile, the application profile setting overrides the queue-level
configuration.
|
|
CHKPNT_INITPERIOD=init_chkpnt_period
|
|
|
CHKPNT_PERIOD=chkpnt_period
|
|
|
CHKPNT_METHOD=chkpnt_method
|
|
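The override rules in the table above can be modelled in a few lines of Python. This is a rough sketch, not LSF's implementation: the class and function names are hypothetical, it assumes that job-level values also take precedence over the application profile (the text states only that each overrides the queue), and it assumes a level's settings are taken as a whole rather than merged field by field.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ChkpntSettings:
    """Checkpoint settings at one level (hypothetical model, not an LSF API)."""
    chkpnt_dir: Optional[str] = None
    chkpnt_period: Optional[int] = None  # minutes


def resolve_checkpoint(
    queue: ChkpntSettings,
    app: ChkpntSettings,
    job: ChkpntSettings,
) -> Tuple[Optional[ChkpntSettings], List[str]]:
    """Return the settings that apply to the job, plus any warnings."""
    warnings: List[str] = []
    # An application profile that sets a period without CHKPNT_DIR is ignored
    # with a warning, as described for lsb.applications.
    if app.chkpnt_dir is None and app.chkpnt_period is not None:
        warnings.append("application profile: CHKPNT_DIR not set, settings ignored")
        app = ChkpntSettings()
    # Assumed precedence: job level, then application profile, then queue.
    for level in (job, app, queue):
        if level.chkpnt_dir is not None:
            return level, warnings
    return None, warnings  # the job is not checkpointable at any level
```

For example, with a queue that sets `CHKPNT=/q 60` and an application profile that sets only `CHKPNT_PERIOD=15`, the function returns the queue settings together with one warning.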
Configuration to enable automatic job migration
Automatic job migration assumes that if a job is system-suspended (SSUSP) for an
extended period of time, the execution host is probably heavily loaded. Configuring
a queue-level or host-level migration threshold lets the job resume on another, less
loaded host and reduces the load on the original host. You can use bmig at any time
to override a configured migration threshold.
Configuration file
|
Parameter and syntax
|
Behavior
|
lsb.queues
lsb.applications
|
MIG=minutes
|
- LSF automatically migrates jobs that have been in the SSUSP state for more than
the specified number of minutes.
- Specify a value of 0 to migrate jobs immediately upon suspension.
- Applies to all jobs submitted to the queue.
- A job-level migration threshold specified on the command line (bsub -mig)
overrides the threshold configured in the application profile and the queue.
- Application profile configuration overrides queue-level configuration.
|
lsb.hosts
|
HOST_NAME MIG
host_name minutes
|
- LSF automatically migrates jobs that have been in the SSUSP state for more than
the specified number of minutes.
- Specify a value of 0 to migrate jobs immediately upon suspension.
- Applies to all jobs running on the host.
|
Note: When a host migration threshold is specified and is lower than the value for the
job, the queue, or the application, the host value is used. You cannot automatically
migrate a suspended chunk job member.
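The threshold rules above (job over application profile over queue, with a lower host value taking effect) can be sketched as follows. The function names are hypothetical and the code only models the rules as stated here. It assumes that a host threshold applies on its own when no other level is configured, and it does not model the chunk job exception.

```python
from typing import Optional


def effective_mig_threshold(
    job: Optional[int] = None,
    app: Optional[int] = None,
    queue: Optional[int] = None,
    host: Optional[int] = None,
) -> Optional[int]:
    """Return the migration threshold in minutes, or None if none is configured."""
    # Job level (bsub -mig) overrides the application profile, which overrides the queue.
    configured = next((v for v in (job, app, queue) if v is not None), None)
    if host is None:
        return configured
    if configured is None:
        return host  # assumption: the host value applies when it is the only one set
    # The host value is used only when it is lower than the configured value.
    return min(configured, host)


def should_migrate(minutes_in_ssusp: float, threshold: Optional[int]) -> bool:
    """Decide whether a job in the SSUSP state has passed its threshold."""
    if threshold is None:
        return False  # automatic migration is not configured
    if threshold == 0:
        return True  # a value of 0 migrates immediately upon suspension
    return minutes_in_ssusp > threshold
```

For a job submitted with no job-level threshold to a queue with `MIG=30` and running on a host with a MIG value of 10, `effective_mig_threshold(queue=30, host=10)` returns 10, so the job becomes a migration candidate after more than 10 minutes in SSUSP.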