Job checkpoint and restart behavior
LSF invokes the echkpnt interface when a job is:
- Automatically checkpointed based on a configured checkpoint period
- Manually checkpointed with bchkpnt
- Migrated to a new host with bmig
After checkpointing, LSF invokes the erestart interface to restart the job. LSF also invokes the erestart interface when a user:
- Manually restarts a job using brestart
- Migrates the job to a new host using bmig
All checkpoint and restart executables run under the user account of the user who submits the job.
Checkpoint directory and files
LSF identifies checkpoint files by the checkpoint directory and job ID. For example:
bsub -k my_dir
Job <123> is submitted to default queue <default>
LSF writes the checkpoint file to my_dir/123.
LSF maintains all of the checkpoint files for a single job in one location. When a job restarts, LSF creates both a new subdirectory based on the new job ID and a symbolic link from the old to the new directory. For example, when job 123 restarts on a new host as job 456, LSF creates my_dir/456 and a symbolic link from my_dir/123 to my_dir/456.
The file path of the checkpoint directory can contain up to 4000 characters for UNIX and Linux, or up to 255 characters for Windows, including the directory and file name.
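As a minimal sketch of the naming and length rules above, the following Python helper builds the expected checkpoint path for a job and checks it against the documented limits. The function name `checkpoint_path` and the `windows` flag are hypothetical and not part of LSF. The sketch only joins strings; it does not touch the file system or call LSF, and it treats the job subdirectory as the whole path, which is a simplification because file names inside it also count toward the limit.

```python
import posixpath

# Documented limits on the full checkpoint path (directory plus file name).
MAX_PATH_UNIX = 4000
MAX_PATH_WINDOWS = 255


def checkpoint_path(chkpnt_dir: str, job_id: int, windows: bool = False) -> str:
    """Return <chkpnt_dir>/<job_id>, raising ValueError if it is too long.

    Hypothetical helper for illustration only; LSF builds this path itself.
    """
    path = posixpath.join(chkpnt_dir, str(job_id))
    limit = MAX_PATH_WINDOWS if windows else MAX_PATH_UNIX
    if len(path) > limit:
        raise ValueError(
            f"checkpoint path is {len(path)} characters; limit is {limit}"
        )
    return path


# Job 123 submitted with: bsub -k my_dir
print(checkpoint_path("my_dir", 123))
```

A check like this could be run before submission to catch an overly long checkpoint directory early, rather than discovering the problem when the first checkpoint is written.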
Precedence of job, queue, application, and cluster-level checkpoint values
- Checkpoint directory and checkpoint period: values specified at the job level override values for the queue. Values specified in an application profile override queue-level configuration. If checkpoint-related configuration is specified in the queue, in an application profile, and at the job level:
  - Application-level and job-level parameters are merged. If the same parameter is defined at both the job level and in the application profile, the job-level value overrides the application profile value.
  - The merged result of job-level and application profile settings overrides queue-level configuration.
- Checkpoint and restart executables: the value for checkpoint_method specified at the job level overrides both the application-level CHKPNT_METHOD and the cluster-level value for LSB_ECHKPNT_METHOD specified in lsf.conf or as an environment variable.
- Configuration parameters and environment variables: values specified as environment variables override the values specified in lsf.conf.
If the command line is… | And… | Then… |
---|---|---|
bsub -k "my_dir 240" | In lsb.queues, | |
bsub -k "my_dir fluent" | In lsf.conf, | |
bsub -k "my_dir" | In lsb.applications, | |
bsub -k "240" | In lsb.applications, In lsb.queues, | |
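The precedence rules above can be sketched in Python as two small functions. This is an illustration under stated assumptions, not LSF's implementation: the function names, the dictionary keys `dir` and `period`, and all sample values are hypothetical. The sketch assumes overrides apply per parameter, so a queue value still applies when neither the job nor the application profile sets that parameter, and it assumes the application-level method takes precedence over the cluster-level one, which the text does not state explicitly.

```python
from typing import Dict, Mapping, Optional


def resolve_dir_and_period(job: Mapping[str, str],
                           app: Mapping[str, str],
                           queue: Mapping[str, str]) -> Dict[str, str]:
    """Merge checkpoint directory and period settings by precedence.

    Assumption: overrides are applied per parameter, lowest level first.
    """
    merged = dict(queue)   # queue level has the lowest precedence
    merged.update(app)     # application profile overrides the queue
    merged.update(job)     # job level overrides the application profile
    return merged


def resolve_method(job_method: Optional[str],
                   app_method: Optional[str],
                   lsf_conf: Mapping[str, str],
                   env: Mapping[str, str]) -> Optional[str]:
    """Pick the checkpoint method: job level first, then the lower levels."""
    # At cluster level, an environment variable overrides lsf.conf.
    cluster_method = (env.get("LSB_ECHKPNT_METHOD")
                      or lsf_conf.get("LSB_ECHKPNT_METHOD"))
    # Assumption: application level is consulted before cluster level.
    return job_method or app_method or cluster_method


# Hypothetical values: the job sets only a directory, the application
# profile sets only a period, and the queue sets both.
settings = resolve_dir_and_period(
    job={"dir": "my_dir"},
    app={"period": "360"},
    queue={"dir": "other_dir", "period": "120"},
)
print(settings)

# No job-level or application-level method, so the cluster value is used.
print(resolve_method(None, None, {"LSB_ECHKPNT_METHOD": "myapp"}, {}))
```

Passing the environment and the lsf.conf contents as plain mappings keeps the functions free of side effects; in practice `os.environ` could be supplied for `env`.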