Configuration to enable job migration

The job migration feature requires that a job be made checkpointable or rerunnable at the job, application, or queue level.

An LSF user can make a job:
  • Checkpointable, by using bsub -k to specify a checkpoint directory and checkpoint period, and optionally an initial checkpoint period
  • Rerunnable, by using bsub -r

Configuration file / Parameter and syntax / Behavior

lsb.queues

CHKPNT=chkpnt_dir [chkpnt_period]

  • All jobs submitted to the queue are checkpointable.
    • The specified checkpoint directory must already exist. LSF will not create the checkpoint directory.

    • The user account that submits the job must have read and write permissions for the checkpoint directory.

    • For the job to restart on another execution host, both the original and new hosts must have network connectivity to the checkpoint directory.

  • If the queue administrator specifies a checkpoint period, in minutes, LSF creates a checkpoint file every chkpnt_period minutes during job execution.

  • If a user specifies a checkpoint directory and checkpoint period at the job level with bsub -k, the job-level values override the queue-level values.

RERUNNABLE=Y

  • If the execution host becomes unavailable, LSF reruns the job from the beginning on a different host.
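
As a rough illustration of the queue-level parameters above, the following lsb.queues fragment is a minimal sketch. The queue names, the directory /share/lsf/chkpnt, and the 30-minute period are hypothetical values chosen for the example. The directory is assumed to exist already and to be reachable from every execution host.

```
# lsb.queues (fragment; names, path, and period are illustrative assumptions)
Begin Queue
QUEUE_NAME  = chkpnt_q
# Checkpoint directory (must already exist), then an optional period in minutes
CHKPNT      = /share/lsf/chkpnt 30
End Queue

Begin Queue
QUEUE_NAME  = rerun_q
# Jobs are rerun from the beginning on a different host
# if the execution host becomes unavailable
RERUNNABLE  = Y
End Queue
```

Changes to lsb.queues generally take effect only after the batch system is reconfigured, for example with badmin reconfig. A job submitted to chkpnt_q with its own bsub -k values would use those job-level values instead of the queue-level ones.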

lsb.applications

CHKPNT_DIR=chkpnt_dir

  • Specifies the checkpoint directory for automatic checkpointing for the application. To enable automatic checkpointing for the application profile, administrators must specify a checkpoint directory in the configuration of the application profile.

  • If CHKPNT_PERIOD, CHKPNT_INITPERIOD, or CHKPNT_METHOD is set in an application profile but CHKPNT_DIR is not, LSF issues a warning message and ignores those settings.

  • The checkpoint directory is the directory where the checkpoint files are created. Specify an absolute path or a path relative to the current working directory for the job. Do not use environment variables in the directory path.

  • If checkpoint-related configuration is specified in both the queue and an application profile, the application profile setting overrides the queue-level configuration.

CHKPNT_INITPERIOD=init_chkpnt_period

CHKPNT_PERIOD=chkpnt_period

CHKPNT_METHOD=chkpnt_method
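
The application-profile parameters can be combined as in the sketch below. The profile name sim_app, the directory, the period values, and the method name my_method are all hypothetical. The text above does not state the units for the two periods, so they are assumed here to be minutes, as for the queue-level checkpoint period.

```
# lsb.applications (fragment; all values are illustrative assumptions)
Begin Application
NAME              = sim_app
# Required for automatic checkpointing; without it the three
# settings below are ignored and a warning is issued
CHKPNT_DIR        = /share/lsf/chkpnt
# Assumed: time before the first checkpoint, in minutes
CHKPNT_INITPERIOD = 120
# Assumed: interval between later checkpoints, in minutes
CHKPNT_PERIOD     = 30
# Hypothetical name of a site-provided checkpoint method
CHKPNT_METHOD     = my_method
End Application
```

If a queue also sets CHKPNT, the values in this profile would take precedence for jobs that use the profile.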


Configuration to enable automatic job migration

Automatic job migration assumes that if a job is system-suspended (SSUSP) for an extended period of time, the execution host is probably heavily loaded. Configuring a queue-level or host-level migration threshold lets the job resume on another, less loaded host and reduces the load on the original host. You can use bmig at any time to override a configured migration threshold.

Configuration file / Parameter and syntax / Behavior

lsb.queues

lsb.applications

MIG=minutes

  • LSF automatically migrates jobs that have been in the SSUSP state for more than the specified number of minutes.

  • Specify a value of 0 to migrate jobs immediately upon suspension.

  • Applies to all jobs submitted to the queue.

  • The job-level migration threshold set on the command line (bsub -mig) overrides the threshold configured in the application profile and in the queue. Application profile configuration overrides queue-level configuration.
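
A minimal sketch of MIG at both levels follows. The queue name, profile name, and threshold values are hypothetical and only illustrate the precedence described above.

```
# lsb.queues (fragment; name and value are illustrative assumptions)
Begin Queue
QUEUE_NAME  = mig_q
# Migrate jobs that stay in SSUSP for more than 30 minutes
MIG         = 30
End Queue

# lsb.applications (fragment; name and value are illustrative assumptions)
Begin Application
NAME        = sim_app
# Overrides the queue-level threshold for jobs that use this profile
MIG         = 10
End Application
```

Under these assumed values, a job that uses sim_app in mig_q would be governed by the 10-minute threshold, and a job submitted with bsub -mig would use its own job-level value instead.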

lsb.hosts

HOST_NAME     MIG
host_name     minutes
  • LSF automatically migrates jobs that have been in the SSUSP state for more than the specified number of minutes.

  • Specify a value of 0 to migrate jobs immediately upon suspension.

  • Applies to all jobs running on the host.


Note: When a host migration threshold is specified and is lower than the threshold for the job, the queue, or the application, the host value is used. You cannot automatically migrate a suspended chunk job member.
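
The host-level threshold can be sketched as the lsb.hosts fragment below. The host name hostA and the value 10 are hypothetical, and only the two columns shown in the syntax above are included; a real Host section usually carries additional columns.

```
# lsb.hosts (fragment; host name and value are illustrative assumptions)
Begin Host
HOST_NAME     MIG
# Migrate jobs that stay in SSUSP on hostA for more than 10 minutes
hostA         10
End Host
```

With this assumed value, a job whose queue threshold is, say, 30 minutes would be migrated after 10 minutes if it is suspended on hostA, because the host threshold is the lower of the two.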