Job checkpoint and restart commands

Commands for submission


Command

Description

bsub -k "checkpoint_dir [checkpoint_period] [method=echkpnt_application]"

  • Specifies a relative or absolute path for the checkpoint directory and makes the job checkpointable.

  • If the specified checkpoint directory does not already exist, LSF creates the checkpoint directory.

  • If a user specifies a checkpoint period (in minutes), LSF creates a checkpoint file every checkpoint_period minutes during job execution.

  • The command-line values for the checkpoint directory and checkpoint period override the values specified for the queue.

  • If a user specifies an echkpnt_application, LSF runs the corresponding checkpoint executable when the job is checkpointed and the corresponding restart executable when the job restarts. For example, for bsub -k "my_dir method=fluent", LSF runs echkpnt.fluent at job checkpoint and erestart.fluent at job restart.

  • The command-line value for echkpnt_application overrides the value specified by LSB_ECHKPNT_METHOD in lsf.conf or as an environment variable. Users can override LSB_ECHKPNT_METHOD and use the default checkpoint and restart executables by defining method=default.
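The quoted argument to bsub -k described above has one required part (the directory) and two optional parts (the period and the method). The following is a minimal sketch of how that string could be assembled. The helper name chkpnt_arg, the 30-minute period, and the job script ./my_job are hypothetical and chosen only for illustration. The sketch prints the resulting commands rather than running them, so it does not require an LSF installation.

```shell
#!/bin/sh
# Compose the quoted argument for bsub -k from its three parts.
# Only the checkpoint directory is required; the period (in minutes)
# and the method are optional and are skipped when empty.
# chkpnt_arg is a hypothetical helper, not an LSF command.
chkpnt_arg() {
    dir=$1
    period=$2
    method=$3
    arg=$dir
    [ -n "$period" ] && arg="$arg $period"
    [ -n "$method" ] && arg="$arg method=$method"
    printf '%s\n' "$arg"
}

# Print the commands instead of executing them.
# ./my_job is a hypothetical job script.
printf 'bsub -k "%s" ./my_job\n' "$(chkpnt_arg my_dir 30 fluent)"
printf 'bsub -k "%s" ./my_job\n' "$(chkpnt_arg my_dir "" default)"
```

The second command shows the method=default form, which selects the default checkpoint and restart executables even when LSB_ECHKPNT_METHOD is set.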


Commands to monitor


Command

Description

bacct -l

  • Displays accounting statistics for finished jobs, including termination reasons. TERM_CHKPNT indicates that a job was checkpointed and killed.

  • If JOB_CONTROL is defined for a queue, LSF does not display the result of the action.

bhist -l

  • Displays the actions that LSF took on a completed job, including job checkpoint, restart, and migration to another host.

bjobs -l

  • Displays information about pending, running, and suspended jobs, including the checkpoint directory, the checkpoint period, and the checkpoint method (either application or default).


Commands to control


Command

Description

bmod -k "checkpoint_dir [checkpoint_period] [method=echkpnt_application]"

  • Resubmits a job and changes the checkpoint directory, checkpoint period, and the checkpoint and restart executables associated with the job.

bmod -kn

  • Dissociates the checkpoint directory from a job, which makes the job no longer checkpointable.

bchkpnt

  • Checkpoints the most recently submitted checkpointable job. Users can specify particular jobs to checkpoint by including various bchkpnt options.

bchkpnt -p checkpoint_period job_ID

  • Checkpoints a job immediately and changes the checkpoint period for the job.

bchkpnt -k job_ID

  • Checkpoints a job immediately and kills the job.

bchkpnt -p 0 job_ID

  • Checkpoints a job immediately and disables periodic checkpointing.

brestart

  • Restarts a checkpointed job on the first available host.

brestart -m

  • Restarts a checkpointed job on the specified host or host group.

bmig

  • Migrates one or more running jobs from one host to another. The jobs must be checkpointable or rerunnable.

  • Checkpoints, kills, and restarts one or more checkpointable jobs.
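As a rough illustration of how the control commands above might be combined, the sketch below walks through changing the checkpoint period, disabling periodic checkpointing, checkpointing and killing a job, and restarting it on a chosen host. The job ID 1234, the host name hostA, and the directory my_dir are hypothetical. The brestart argument order shown (checkpoint directory followed by job ID) is an assumption that is not stated in the table above and should be checked against the brestart reference for your LSF version. The run helper is also hypothetical: it prints each command by default, so the sketch works without LSF.

```shell
#!/bin/sh
# Hypothetical dry-run wrapper: prints the command unless LSF_DRY_RUN=0,
# in which case the command is executed as given.
run() {
    if [ "${LSF_DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

JOB_ID=1234   # hypothetical job ID

# Checkpoint now and set the checkpoint period to 10 minutes.
run bchkpnt -p 10 "$JOB_ID"

# Checkpoint now and disable periodic checkpointing.
run bchkpnt -p 0 "$JOB_ID"

# Checkpoint now and kill the job.
run bchkpnt -k "$JOB_ID"

# Restart the checkpointed job on a specific host.
# Assumed argument order: checkpoint directory, then job ID.
run brestart -m hostA my_dir "$JOB_ID"
```

Setting LSF_DRY_RUN=0 on a host where the LSF commands are available would execute the same sequence instead of printing it.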


Commands to display configuration


Command

Description

bqueues -l

  • Displays information about queues configured in lsb.queues, including the values defined for checkpoint directory and checkpoint period.
    Note: The bqueues command displays the checkpoint period in seconds; the lsb.queues CHKPNT parameter defines the checkpoint period in minutes.

badmin showconf

  • Displays all configured parameters and their values set in lsf.conf or ego.conf that affect mbatchd and sbatchd.

    Use a text editor to view other parameters in the lsf.conf or ego.conf configuration files.

  • In a MultiCluster environment, displays the parameters of daemons on the local cluster.
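Because bqueues -l reports the checkpoint period in seconds while the CHKPNT parameter in lsb.queues is defined in minutes, it can help to convert between the two when comparing configured and displayed values. The sketch below shows only the arithmetic. The helper name minutes_to_seconds and the 30-minute value are hypothetical.

```shell
#!/bin/sh
# Convert a checkpoint period in minutes (as configured in lsb.queues)
# to seconds (as displayed by bqueues -l).
# minutes_to_seconds is a hypothetical helper, not an LSF command.
minutes_to_seconds() {
    echo $(( $1 * 60 ))
}

chkpnt_minutes=30   # hypothetical configured period
echo "Configured: ${chkpnt_minutes} min; expected in bqueues -l: $(minutes_to_seconds "$chkpnt_minutes") s"
```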