Configuration to enable job checkpoint and restart
The job checkpoint and restart feature requires that a job be made checkpointable at the job or queue level. LSF users can make a job checkpointable by submitting it with bsub -k and specifying a checkpoint directory. Queue administrators can make all jobs in a queue checkpointable by specifying a checkpoint directory for the queue.
Configuration file | Parameter and syntax | Behavior |
---|---|---|
lsb.queues | CHKPNT=chkpnt_dir [chkpnt_period] | |
lsb.applications | | |
Configuration to enable kernel-level checkpoint and restart
Kernel-level checkpoint and restart is enabled by default. LSF users make a job checkpointable either by submitting it using bsub -k and specifying a checkpoint directory, or by submitting it to a queue that defines a checkpoint directory for the CHKPNT parameter.
Configuration to enable application-level checkpoint and restart
The erestart.application executable must:
- Have access to the command line used to submit or modify the job
- Exit with a return value without running an application; the erestart interface runs the application to restart the job
Executable file | UNIX naming convention | Windows naming convention |
---|---|---|
echkpnt | LSF_SERVERDIR/echkpnt.application | LSF_SERVERDIR\echkpnt.application.exe or LSF_SERVERDIR\echkpnt.application.bat |
erestart | LSF_SERVERDIR/erestart.application | LSF_SERVERDIR\erestart.application.exe or LSF_SERVERDIR\erestart.application.bat |
The names echkpnt.default and erestart.default are reserved. Do not use these names for application-level checkpoint and restart executables.
Valid file names contain only alphanumeric characters, underscores (_), and hyphens (-).
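The naming rules above can be checked mechanically before an executable is installed. The following is a minimal sketch in Python, not part of LSF. It assumes that the character rule applies to the application part of the name (the part after echkpnt. or erestart.), since the full file name necessarily contains a period, and that "alphanumeric" means ASCII letters and digits. The function names and the example directory are hypothetical.

```python
import re
from pathlib import PurePosixPath
from typing import Tuple

# "default" is reserved: echkpnt.default and erestart.default must not be used.
RESERVED_APPLICATIONS = {"default"}

# Assumption: ASCII letters, digits, underscores, and hyphens only.
_VALID_APPLICATION = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_application_name(application: str) -> bool:
    """Return True if `application` may follow 'echkpnt.' or 'erestart.'."""
    if application in RESERVED_APPLICATIONS:
        return False
    return _VALID_APPLICATION.fullmatch(application) is not None


def unix_executable_paths(serverdir: str, application: str) -> Tuple[str, str]:
    """Build the UNIX-style echkpnt/erestart paths for one application."""
    if not is_valid_application_name(application):
        raise ValueError(f"invalid application name: {application!r}")
    base = PurePosixPath(serverdir)
    return (
        str(base / f"echkpnt.{application}"),
        str(base / f"erestart.{application}"),
    )
```

For example, calling unix_executable_paths with a hypothetical LSF_SERVERDIR value and the application name fluent yields the pair of paths that the UNIX naming convention in the table expects.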
For application-level checkpoint and restart, once LSF_SERVERDIR contains one or more checkpoint and restart executables, users can specify the external checkpoint executable associated with each checkpointable job they submit. At restart, LSF invokes the corresponding external restart executable.
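As an illustration of how a user might select an executable at submission time, the sketch below builds a bsub -k argument list in Python. It relies on details that this section does not state and that should be checked against the bsub reference: that the -k value takes the form "checkpoint_dir [checkpoint_period] [method=method_name]", and that the period is given in minutes. The helper name, the directory, and the job command are hypothetical.

```python
import shlex
from typing import List, Optional, Sequence


def build_bsub_checkpoint_args(
    checkpoint_dir: str,
    job_command: Sequence[str],
    checkpoint_period: Optional[int] = None,
    method: Optional[str] = None,
) -> List[str]:
    """Return an argv list for submitting a checkpointable job.

    Assumption: the -k value is a single argument of the form
    "checkpoint_dir [checkpoint_period] [method=method_name]".
    """
    k_value = checkpoint_dir
    if checkpoint_period is not None:
        k_value += f" {checkpoint_period}"  # assumed to be minutes
    if method is not None:
        # Selects echkpnt.<method> / erestart.<method> (assumed semantics).
        k_value += f" method={method}"
    return ["bsub", "-k", k_value, *job_command]


if __name__ == "__main__":
    argv = build_bsub_checkpoint_args(
        "/share/ckpt", ["./my_job"], checkpoint_period=30, method="fluent"
    )
    # Show the command as it would be typed in a shell.
    print(shlex.join(argv))
```

Passing the arguments as a list keeps the whole -k value as one argument, so the spaces inside it do not need manual shell quoting.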
Requirements for application-level checkpoint and restart executables
- The executables must be written in C or Fortran.
- The directory/name combinations must be unique within the cluster. For example, you can write two different checkpoint executables with the name echkpnt.fluent and save them as LSF_SERVERDIR/echkpnt.fluent and my_execs/echkpnt.fluent. To run checkpoint and restart executables from a directory other than LSF_SERVERDIR, you must configure the parameter LSB_ECHKPNT_METHOD_DIR in lsf.conf.
- Your executables must return the following values.
  - An echkpnt.application must return a value of 0 when checkpointing succeeds and a non-zero value when checkpointing fails.
  - The erestart interface provided with LSF restarts the job using a restart command that erestart.application writes to a file. The return value indicates whether erestart.application successfully writes the parameter definition LSB_RESTART_CMD=restart_command to the file checkpoint_dir/job_ID/.restart_cmd.
    - A non-zero value indicates that erestart.application failed to write to the .restart_cmd file.
    - A return value of 0 indicates that erestart.application successfully wrote to the .restart_cmd file, or that the executable intentionally did not write to the file.
- Your executables must recognize the syntax used by the echkpnt and erestart interfaces, which communicate with your executables by means of a common syntax.
  - echkpnt.application syntax:
    echkpnt [-c] [-f] [-k | -s] [-d checkpoint_dir] [-x] process_group_ID
    Restriction: The -k and -s options are mutually exclusive.
  - erestart.application syntax:
    erestart [-c] [-f] checkpoint_dir
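The return-value contract for erestart.application described above can be modeled as follows. This is a sketch of the logic only, written in Python for brevity; a real executable must meet the C or Fortran requirement listed above. The function name and the way the job ID is obtained are hypothetical, and the source does not specify whether the restart command needs quoting inside the file, so the value is written verbatim.

```python
import os
from typing import Optional


def write_restart_cmd(
    checkpoint_dir: str, job_id: str, restart_command: Optional[str]
) -> int:
    """Model the erestart.application return-value contract.

    Returns 0 if LSB_RESTART_CMD was written, or if the caller
    intentionally supplies no command; returns non-zero if the write fails.
    """
    if restart_command is None:
        # Intentionally not writing the file still counts as success.
        return 0

    # Target file: checkpoint_dir/job_ID/.restart_cmd
    path = os.path.join(checkpoint_dir, job_id, ".restart_cmd")
    try:
        with open(path, "w") as handle:
            # Assumption: the value is written verbatim, without quoting.
            handle.write(f"LSB_RESTART_CMD={restart_command}\n")
    except OSError:
        # Any failure to write maps to a non-zero return value.
        return 1
    return 0
```

Note that the function only records the command and returns; it never launches the application itself, because the erestart interface is responsible for running the restart command.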
Option or variable | Description | Operating systems |
---|---|---|
-c | Copies all files in use by the checkpointed process to the checkpoint directory. | Some |
-f | Forces a job to be checkpointed even under non-checkpointable conditions, which are specific to the checkpoint implementation used. This option could create checkpoint files that do not provide for successful restart. | Some |
-k | Kills a job after successful checkpointing. If checkpoint fails, the job continues to run. | All operating systems that LSF supports |
-s | Stops a job after successful checkpointing. If checkpoint fails, the job continues to run. | Some |
-d checkpoint_dir | Specifies the checkpoint directory as a relative or absolute path. | All operating systems that LSF supports |
-x | Identifies the cpr (checkpoint and restart) process as type HID. This identifies the set of processes to checkpoint as a process hierarchy (tree) rooted at the current PID. | Some |
process_group_ID | ID of the process or process group to checkpoint. | All operating systems that LSF supports |
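To make the echkpnt.application syntax and the -k/-s restriction concrete, the sketch below models the option set with Python's argparse. It illustrates the parsing rules only; a real executable must be written in C or Fortran as required above. The destination names (copy_files, force, kill, stop, hierarchy) are hypothetical, and treating process_group_ID as an integer is an assumption.

```python
import argparse


def build_echkpnt_parser() -> argparse.ArgumentParser:
    """Model: echkpnt [-c] [-f] [-k | -s] [-d checkpoint_dir] [-x] process_group_ID"""
    parser = argparse.ArgumentParser(prog="echkpnt.application", add_help=False)
    parser.add_argument("-c", action="store_true", dest="copy_files")
    parser.add_argument("-f", action="store_true", dest="force")

    # -k (kill after checkpoint) and -s (stop after checkpoint)
    # are mutually exclusive.
    after = parser.add_mutually_exclusive_group()
    after.add_argument("-k", action="store_true", dest="kill")
    after.add_argument("-s", action="store_true", dest="stop")

    parser.add_argument("-d", dest="checkpoint_dir", metavar="checkpoint_dir")
    parser.add_argument("-x", action="store_true", dest="hierarchy")

    # Assumption: the process or process group ID is a decimal integer.
    parser.add_argument("process_group_id", type=int, metavar="process_group_ID")
    return parser


if __name__ == "__main__":
    args = build_echkpnt_parser().parse_args(["-k", "-d", "/share/ckpt", "4321"])
    print(args)
```

Supplying both -k and -s makes argparse report a usage error and exit with a non-zero status, which matches the restriction stated with the syntax.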