Job checkpoint and restart behavior

LSF invokes the echkpnt interface when a job is:

  • Automatically checkpointed based on a configured checkpoint period
  • Manually checkpointed with bchkpnt
  • Migrated to a new host with bmig

After checkpointing, LSF invokes the erestart interface to restart the job. LSF also invokes the erestart interface when a user:

  • Manually restarts a job using brestart
  • Migrates the job to a new host using bmig

All checkpoint and restart executables run under the user account of the user who submits the job.

Note: By default, LSF redirects standard error and standard output to /dev/null and discards the data.

Checkpoint directory and files

LSF identifies checkpoint files by the checkpoint directory and job ID. For example:

bsub -k my_dir
Job <123> is submitted to default queue <default>

LSF writes the checkpoint file to my_dir/123.

LSF maintains all of the checkpoint files for a single job in one location. When a job restarts, LSF creates both a new subdirectory based on the new job ID and a symbolic link from the old to the new directory. For example, when job 123 restarts on a new host as job 456, LSF creates my_dir/456 and a symbolic link from my_dir/123 to my_dir/456.

The file path of the checkpoint directory can contain up to 4000 characters for UNIX and Linux, or up to 255 characters for Windows, including the directory and file name.
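
The naming rule and the length limits above can be sketched in a few lines of Python. This is an illustration only, not LSF code: checkpoint_path and within_limit are hypothetical helper names, and the check simply compares the length of whatever path string it is given against the limits quoted above.

```python
from pathlib import PurePosixPath

# Limits quoted in the text above; they cover the directory and file name.
MAX_PATH_UNIX = 4000
MAX_PATH_WINDOWS = 255


def checkpoint_path(checkpoint_dir: str, job_id: int) -> str:
    """Return the per-job checkpoint location, for example my_dir/123."""
    return str(PurePosixPath(checkpoint_dir) / str(job_id))


def within_limit(path: str, windows: bool = False) -> bool:
    """Compare the path length with the documented limit (hypothetical helper)."""
    limit = MAX_PATH_WINDOWS if windows else MAX_PATH_UNIX
    return len(path) <= limit
```

For the example job, checkpoint_path("my_dir", 123) gives my_dir/123, which is well inside both limits.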

Precedence of job, queue, application, and cluster-level checkpoint values

LSF handles checkpoint and restart values as follows:
  1. Checkpoint directory and checkpoint period: values specified at the job level override values for the queue. Values specified in an application profile override queue-level configuration.
    If checkpoint-related configuration is specified at the queue level, in the application profile, and at the job level:
    • Application-level and job-level parameters are merged. If the same parameter is defined both at the job level and in the application profile, the job-level value overrides the application profile value.
    • The merged result of job-level and application profile settings overrides queue-level configuration.
  2. Checkpoint and restart executables: the value for checkpoint_method specified at the job level overrides the application-level CHKPNT_METHOD and the cluster-level value for LSB_ECHKPNT_METHOD specified in lsf.conf or as an environment variable.
  3. Configuration parameters and environment variables: values specified as environment variables override the values specified in lsf.conf.
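
Items 2 and 3 can be read as a lookup chain, sketched below in Python. This is not LSF code: resolve_method and executables are hypothetical names, and plain dictionaries stand in for the environment and for lsf.conf. The text says only that the job-level value overrides both the application-level and cluster-level values; placing the application profile ahead of the cluster-level value is an assumption of this sketch. The echkpnt.method and erestart.method naming follows the fluent example in this section.

```python
from typing import Mapping, Optional, Tuple


def resolve_method(
    job_method: Optional[str],
    app_method: Optional[str],
    env: Mapping[str, str],
    lsf_conf: Mapping[str, str],
) -> Optional[str]:
    """Return the first checkpoint method that is set, highest priority first."""
    candidates = (
        job_method,                          # job level (bsub -k)
        app_method,                          # CHKPNT_METHOD (assumed above cluster level)
        env.get("LSB_ECHKPNT_METHOD"),       # environment variable
        lsf_conf.get("LSB_ECHKPNT_METHOD"),  # lsf.conf
    )
    for value in candidates:
        if value:
            return value
    return None


def executables(method: str) -> Tuple[str, str]:
    """Build the checkpoint and restart executable names for a method."""
    return (f"echkpnt.{method}", f"erestart.{method}")
```

With a job-level method of fluent and LSB_ECHKPNT_METHOD=myapp in lsf.conf, resolve_method returns fluent, so the executables are echkpnt.fluent and erestart.fluent.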

If the command line is…   And…                        Then…

bsub -k "my_dir 240"      In lsb.queues,              LSF saves the checkpoint file to
                          CHKPNT=other_dir 360        my_dir/job_ID every 240 minutes

bsub -k "my_dir fluent"   In lsf.conf,                LSF invokes echkpnt.fluent at job checkpoint
                          LSB_ECHKPNT_METHOD=myapp    and erestart.fluent at job restart

bsub -k "my_dir"          In lsb.applications,        LSF saves the checkpoint file to
                          CHKPNT_PERIOD=360           my_dir/job_ID every 360 minutes

bsub -k "240"             In lsb.applications,        LSF saves the checkpoint file to
                          CHKPNT_DIR=app_dir          app_dir/job_ID every 240 minutes
                          CHKPNT_PERIOD=360
                          In lsb.queues,
                          CHKPNT=other_dir
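
The directory and period cases above can be reproduced with a small merge function, sketched here in Python. It is an illustration under stated assumptions, not LSF code: resolve_checkpoint is a hypothetical name, and each level is modelled as a dictionary with optional "dir" and "period" keys. The text does not say whether queue values fill in parameters that the merged job and application settings leave unset; this sketch assumes they do not.

```python
from typing import Dict, Union

Value = Union[str, int]


def resolve_checkpoint(
    job: Dict[str, Value],
    app: Dict[str, Value],
    queue: Dict[str, Value],
) -> Dict[str, Value]:
    """Combine checkpoint settings from the job, application, and queue levels.

    Each argument maps "dir" and/or "period" to a value; a missing key
    means that level does not set the parameter.
    """
    merged = dict(app)   # start from the application profile
    merged.update(job)   # job-level values win on conflicts
    # Assumption: any job or application setting replaces the queue
    # configuration as a whole, not parameter by parameter.
    return merged if merged else dict(queue)
```

For the last case, a job period of 240 merged with an application profile of app_dir and 360 yields app_dir with a 240-minute period, and the queue value other_dir is ignored.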