Checkpoint a multicluster job
Before you begin
Checkpointing of a multicluster job is supported only when the send-jobs queue is configured to forward jobs to a single remote receive-jobs queue and never uses local hosts.
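The prerequisite can be pictured as a small check over the send-jobs queue parameters. The following is a minimal sketch, not an LSF tool: the function name is hypothetical, and it assumes that the queue's parameters have already been read into a dictionary, that SNDJOBS_TO holds a space-separated list of remote queues, and that HOSTS=none marks a queue that uses no local hosts. Verify both assumptions against the lsb.queues reference for your LSF version.

```python
def supports_multicluster_checkpoint(queue):
    """Return True if a send-jobs queue meets the checkpointing prerequisite.

    `queue` is a dict of lsb.queues parameters for one queue (assumed to be
    parsed elsewhere; this helper is hypothetical).
    """
    # Assumption: SNDJOBS_TO is a space-separated list of remote queues.
    targets = queue.get("SNDJOBS_TO", "").split()
    # Assumption: HOSTS=none means the queue never uses local hosts.
    remote_only = queue.get("HOSTS", "").strip().lower() == "none"
    return len(targets) == 1 and remote_only


# Hypothetical queue definitions used only for illustration.
single_remote = {"SNDJOBS_TO": "recv_q@ClusterB", "HOSTS": "none"}
two_remotes = {"SNDJOBS_TO": "q1@ClusterB q2@ClusterC", "HOSTS": "none"}

print(supports_multicluster_checkpoint(single_remote))
print(supports_multicluster_checkpoint(two_remotes))
```

A queue that forwards to more than one receive-jobs queue, or that omits the remote-only host setting, fails the check.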
Procedure
Configuration
About this task
Checkpointing multicluster jobs
Procedure
Enable checkpointing and specify a checkpoint directory in one of the following places:
- In both the send-jobs and receive-jobs queues (CHKPNT in lsb.queues).
- In an application profile (CHKPNT_DIR, CHKPNT_PERIOD, CHKPNT_INITPERIOD, CHKPNT_METHOD in lsb.applications) in both the submission cluster and the execution cluster.
LSF uses the directory specified in the execution cluster and ignores the directory specified in the submission cluster.
LSF writes the checkpoint file in a subdirectory named with the submission cluster name and the submission cluster job ID. This allows LSF to checkpoint multiple jobs to the same checkpoint directory. For example, suppose the submission cluster is ClusterA, the submission job ID is 789, and the send-jobs queue enables checkpointing. The job is forwarded to ClusterB, where the execution job ID is 123 and the receive-jobs queue specifies a checkpoint directory called XYZ_dir. LSF saves the checkpoint file in the XYZ_dir/ClusterA/789/ directory.
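The naming rule can be sketched in a few lines of Python. This is only an illustration of the directory layout described above, not LSF code; the function name is hypothetical, and the values come from the example.

```python
from pathlib import PurePosixPath


def checkpoint_subdir(chkpnt_dir, submission_cluster, submission_job_id):
    """Build <checkpoint dir>/<submission cluster name>/<submission job ID>.

    The execution cluster's job ID is deliberately not a parameter: only the
    submission cluster name and submission job ID appear in the path.
    """
    return PurePosixPath(chkpnt_dir) / submission_cluster / str(submission_job_id)


# Checkpoint directory from the receive-jobs queue in the execution cluster,
# cluster name and job ID from the submission cluster.
print(checkpoint_subdir("XYZ_dir", "ClusterA", 789))
```

Because the path is keyed on the submission cluster and submission job ID, jobs forwarded from different clusters, or different jobs from the same cluster, land in distinct subdirectories of the same checkpoint directory.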
Checkpoint a job
Procedure
Force a checkpointed job
Procedure
Example
About this task
In this example, users in a remote cluster submit work to a data center through a send-jobs queue that is configured to forward jobs to only one receive-jobs queue. You are the administrator of the data center, and you need to shut down a host for maintenance. The host is busy running checkpointable multicluster jobs.