Configuration to enable job checkpoint and restart

The job checkpoint and restart feature requires that a job be made checkpointable at the job or queue level. LSF users can make jobs checkpointable by submitting jobs using bsub -k and specifying a checkpoint directory. Queue administrators can make all jobs in a queue checkpointable by specifying a checkpoint directory for the queue.

Configuration file: lsb.queues

Parameter and syntax: CHKPNT=chkpnt_dir [chkpnt_period]

Behavior:

  • All jobs submitted to the queue are checkpointable. LSF writes the checkpoint files, which contain job state information, to the checkpoint directory. The checkpoint directory can contain checkpoint files for multiple jobs.
    • The specified checkpoint directory must already exist. LSF does not create the checkpoint directory.

    • The user account that submits the job must have read and write permissions for the checkpoint directory.

    • For the job to restart on another execution host, both the original and new hosts must have network connectivity to the checkpoint directory.

  • If the queue administrator specifies a checkpoint period (chkpnt_period, in minutes), LSF creates a checkpoint file every chkpnt_period minutes during job execution.
    Note: There is no default value for the checkpoint period. You must specify a checkpoint period to enable periodic checkpointing.

  • If a user specifies a checkpoint directory and checkpoint period at the job level with bsub -k, the job-level values override the queue-level values.

  • The file path of the checkpoint directory can contain up to 4000 characters for UNIX and Linux, or up to 255 characters for Windows, including the directory and file name.

lsb.applications
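
The directory requirements listed for CHKPNT (the directory must already exist, and the submitting user needs read and write permission) can be checked before a queue is configured or a job is submitted. The following C sketch is illustrative only and is not part of LSF: the function name check_chkpnt_dir, the status codes, and the constant CHKPNT_PATH_MAX_UNIX are hypothetical. It assumes a POSIX host, tests only the local host for the calling user, and compares the directory path alone against the 4000-character limit, although that limit also covers the checkpoint file name.

```c
#include <assert.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Path limit stated for UNIX and Linux (hypothetical constant name). */
#define CHKPNT_PATH_MAX_UNIX 4000

/* Hypothetical status codes for this sketch. */
enum chkpnt_dir_status {
    CHKPNT_DIR_OK = 0,
    CHKPNT_DIR_TOO_LONG,   /* path exceeds the stated limit */
    CHKPNT_DIR_MISSING,    /* path does not exist or is not a directory */
    CHKPNT_DIR_NO_ACCESS   /* caller lacks read or write permission */
};

/* Checks a candidate checkpoint directory on the local host. */
enum chkpnt_dir_status check_chkpnt_dir(const char *dir)
{
    struct stat st;

    assert(dir != NULL);

    if (strlen(dir) > CHKPNT_PATH_MAX_UNIX)
        return CHKPNT_DIR_TOO_LONG;

    /* LSF does not create the directory, so it must already be there. */
    if (stat(dir, &st) != 0 || !S_ISDIR(st.st_mode))
        return CHKPNT_DIR_MISSING;

    /* access() tests the real user ID of the calling process. */
    if (access(dir, R_OK | W_OK) != 0)
        return CHKPNT_DIR_NO_ACCESS;

    return CHKPNT_DIR_OK;
}
```

Running the same check on each candidate execution host gives a rough indication of whether a restarted job could reach the directory from that host.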


Configuration to enable kernel-level checkpoint and restart

Kernel-level checkpoint and restart is enabled by default. LSF users make a job checkpointable either by submitting the job using bsub -k and specifying a checkpoint directory, or by submitting the job to a queue that defines a checkpoint directory for the CHKPNT parameter.

Configuration to enable application-level checkpoint and restart

Application-level checkpointing requires the presence of at least one echkpnt.application executable in the directory specified by the parameter LSF_SERVERDIR in lsf.conf. Each echkpnt.application must have a corresponding erestart.application.
Important:
The erestart.application executable must:
  • Have access to the command line used to submit or modify the job

  • Exit with a return value without running an application; the erestart interface runs the application to restart the job


| Executable file | UNIX naming convention | Windows naming convention |
|---|---|---|
| echkpnt | LSF_SERVERDIR/echkpnt.application | LSF_SERVERDIR\echkpnt.application.exe or LSF_SERVERDIR\echkpnt.application.bat |
| erestart | LSF_SERVERDIR/erestart.application | LSF_SERVERDIR\erestart.application.exe or LSF_SERVERDIR\erestart.application.bat |


Restriction:

The names echkpnt.default and erestart.default are reserved. Do not use these names for application-level checkpoint and restart executables.

Valid file names contain only alphanumeric characters, underscores (_), and hyphens (-).
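
The naming restrictions above can be expressed as a small check. The following C sketch is an illustration, not an LSF interface: the function name is_valid_method_name is hypothetical, and the sketch assumes that the character rule applies to the application part of the name (the text after echkpnt. or erestart.), because the full file name necessarily contains a period.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/*
 * Returns 1 if name can be used as the "application" part of
 * echkpnt.application and erestart.application, 0 otherwise.
 */
int is_valid_method_name(const char *name)
{
    const char *p;

    assert(name != NULL);

    /* Empty names are rejected; "default" is reserved. */
    if (name[0] == '\0' || strcmp(name, "default") == 0)
        return 0;

    /* Only alphanumeric characters, underscores, and hyphens are allowed. */
    for (p = name; *p != '\0'; p++) {
        if (!isalnum((unsigned char)*p) && *p != '_' && *p != '-')
            return 0;
    }
    return 1;
}
```

Under these assumptions, a name such as fluent or my_app-2 passes the check, while default and my.app do not.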

For application-level checkpoint and restart, once the LSF_SERVERDIR contains one or more checkpoint and restart executables, users can specify the external checkpoint executable associated with each checkpointable job they submit. At restart, LSF invokes the corresponding external restart executable.

Requirements for application-level checkpoint and restart executables

  • The executables must be written in C or Fortran.

  • The directory/name combinations must be unique within the cluster. For example, you can write two different checkpoint executables with the name echkpnt.fluent and save them as LSF_SERVERDIR/echkpnt.fluent and my_execs/echkpnt.fluent. To run checkpoint and restart executables from a directory other than LSF_SERVERDIR, you must configure the parameter LSB_ECHKPNT_METHOD_DIR in lsf.conf.

  • Your executables must return the following values.
    • An echkpnt.application must return a value of 0 when checkpointing succeeds and a non-zero value when checkpointing fails.

    • The erestart interface provided with LSF restarts the job using a restart command that erestart.application writes to a file. The return value indicates whether erestart.application successfully writes the parameter definition LSB_RESTART_CMD=restart_command to the file checkpoint_dir/job_ID/.restart_cmd.
      • A non-zero value indicates that erestart.application failed to write to the .restart_cmd file.

      • A return value of 0 indicates that erestart.application successfully wrote to the .restart_cmd file, or that the executable intentionally did not write to the file.

  • Your executables must recognize the syntax used by the echkpnt and erestart interfaces, which communicate with your executables by means of a common syntax.
    • echkpnt.application syntax:
      echkpnt [-c] [-f] [-k | -s] [-d checkpoint_dir] [-x] process_group_ID
      
      Restriction:

      The -k and -s options are mutually exclusive.

    • erestart.application syntax:
      erestart [-c] [-f] checkpoint_dir
      

    | Option or variable | Description | Operating systems |
    |---|---|---|
    | -c | Copies all files in use by the checkpointed process to the checkpoint directory. | Some |
    | -f | Forces a job to be checkpointed even under non-checkpointable conditions, which are specific to the checkpoint implementation used. This option could create checkpoint files that do not provide for successful restart. | Some |
    | -k | Kills a job after successful checkpointing. If checkpoint fails, the job continues to run. | All operating systems that LSF supports |
    | -s | Stops a job after successful checkpointing. If checkpoint fails, the job continues to run. | Some |
    | -d checkpoint_dir | Specifies the checkpoint directory as a relative or absolute path. | All operating systems that LSF supports |
    | -x | Identifies the cpr (checkpoint and restart) process as type HID. This identifies the set of processes to checkpoint as a process hierarchy (tree) rooted at the current PID. | Some |
    | process_group_ID | ID of the process or process group to checkpoint. | All operating systems that LSF supports |
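
The syntax and return-value requirements above can be sketched in C as two helper functions that an echkpnt.application and an erestart.application might build on. This is a minimal illustration under stated assumptions, not code supplied with LSF: the names echkpnt_opts, parse_echkpnt_args, and write_restart_cmd are hypothetical. The text does not describe how erestart.application obtains the job ID, so the sketch takes it as a parameter. It also assumes that the checkpoint_dir/job_ID directory already exists and that ending the line with a newline is acceptable.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* echkpnt [-c] [-f] [-k | -s] [-d checkpoint_dir] [-x] process_group_ID */
struct echkpnt_opts {
    int copy_files;          /* -c */
    int force;               /* -f */
    int kill_after;          /* -k */
    int stop_after;          /* -s */
    int hierarchy;           /* -x */
    const char *chkpnt_dir;  /* -d checkpoint_dir, or NULL if absent */
    long pgid;               /* process_group_ID */
};

/* Parses the echkpnt command line. Returns 0 on success, -1 if malformed. */
int parse_echkpnt_args(int argc, char **argv, struct echkpnt_opts *o)
{
    int i;
    char *end;

    assert(o != NULL);
    memset(o, 0, sizeof *o);

    for (i = 1; i < argc && argv[i][0] == '-' && argv[i][1] != '\0'; i++) {
        if (argv[i][2] != '\0')
            return -1;                   /* only single-letter options */
        switch (argv[i][1]) {
        case 'c': o->copy_files = 1; break;
        case 'f': o->force = 1; break;
        case 'k': o->kill_after = 1; break;
        case 's': o->stop_after = 1; break;
        case 'x': o->hierarchy = 1; break;
        case 'd':
            if (++i >= argc)
                return -1;               /* -d needs a directory */
            o->chkpnt_dir = argv[i];
            break;
        default:
            return -1;
        }
    }
    if (o->kill_after && o->stop_after)
        return -1;                       /* -k and -s are mutually exclusive */
    if (i != argc - 1)
        return -1;                       /* exactly one process_group_ID */

    errno = 0;
    o->pgid = strtol(argv[i], &end, 10);
    if (errno != 0 || end == argv[i] || *end != '\0' || o->pgid <= 0)
        return -1;
    return 0;
}

/*
 * Writes LSB_RESTART_CMD=cmd to dir/job_id/.restart_cmd.
 * Returns 0 on success and 1 on failure, matching the return values
 * required of erestart.application.
 */
int write_restart_cmd(const char *dir, const char *job_id, const char *cmd)
{
    char path[4096];
    FILE *fp;
    int n, ok;

    n = snprintf(path, sizeof path, "%s/%s/.restart_cmd", dir, job_id);
    if (n < 0 || (size_t)n >= sizeof path)
        return 1;

    fp = fopen(path, "w");
    if (fp == NULL)
        return 1;

    ok = fprintf(fp, "LSB_RESTART_CMD=%s\n", cmd) > 0;
    if (fclose(fp) != 0)
        ok = 0;
    return ok ? 0 : 1;
}
```

In this sketch, the main function of an erestart.application would exit with the value returned by write_restart_cmd and would not start the application itself, because the erestart interface runs the restart command.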