Configuration to modify pre- and post-execution processing
Configuration parameters modify various aspects of pre- and post-execution processing behavior by:
- Preventing a new job from starting until post-execution processing has finished
- Controlling the length of time post-execution processing can run
- Specifying a user account under which the pre- and post-execution commands run
- Controlling how many times pre-execution retries
- Determining whether email providing details of the post-execution output is sent to the user who submitted the job. For more details, see LSB_POSTEXEC_SEND_MAIL.
| Job- and host-based | Job-based only |
|---|---|
| JOB_INCLUDE_POSTPROC in lsb.applications and lsb.params | PREEXEC_EXCLUDE_HOST_EXIT_VALUES in lsb.params |
| MAX_PREEXEC_RETRY in lsb.applications and lsb.params | |
| LOCAL_MAX_PREEXEC_RETRY in lsb.applications and lsb.params | |
| LOCAL_MAX_PREEXEC_RETRY_ACTION in lsb.applications, lsb.queues, and lsb.params | |
| REMOTE_MAX_PREEXEC_RETRY in lsb.applications and lsb.params | |
| LSB_DISABLE_RERUN_POST_EXEC in lsf.conf | |
| JOB_PREPROC_TIMEOUT in lsb.applications and lsb.params | |
| JOB_POSTPROC_TIMEOUT in lsb.applications and lsb.params | |
| LSB_PRE_POST_EXEC_USER in lsf.sudoers | |
| LSB_POSTEXEC_SEND_MAIL in lsf.conf | |
For details on each parameter, see IBM Spectrum LSF configuration reference.
JOB_PREPROC_TIMEOUT is designed to protect the system from hanging during pre-execution processing. When LSF detects that pre-execution processing has been running longer than the JOB_PREPROC_TIMEOUT value (the default value is infinite), LSF terminates the execution. The LSF administrator should therefore set JOB_PREPROC_TIMEOUT to a value longer than any pre-execution processing requires. Likewise, JOB_POSTPROC_TIMEOUT should be set to a value that gives host-based post-execution processing enough time to run.
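As a rough illustration of sizing such a timeout, the following Python sketch picks a value that exceeds the longest observed run by a safety margin. The function name, the margin, and the sample durations are hypothetical, and the sketch assumes the timeout is expressed in whole minutes, as the JOB_POSTPROC_TIMEOUT=minutes syntax suggests.

```python
import math

def suggest_timeout_minutes(observed_seconds, margin=1.5):
    """Return a timeout in whole minutes above the longest observed run.

    observed_seconds: durations of past pre- or post-execution runs, in seconds.
    margin: hypothetical safety factor chosen for this sketch; not an LSF setting.
    """
    if not observed_seconds:
        raise ValueError("need at least one observed duration")
    longest = max(observed_seconds)
    # Round up so the timeout is never shorter than the padded duration.
    return max(1, math.ceil(longest * margin / 60))

# Hypothetical measurements: the longest pre-execution run took 200 seconds.
print(suggest_timeout_minutes([45, 120, 200]))
```

The result is only a starting point; the value still has to be reviewed against the real workload before it is placed in lsb.applications or lsb.params.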
Configuration to modify when new jobs can start
When a job finishes, sbatchd reports a job finish status of DONE or EXIT to mbatchd. This causes LSF to release resources associated with the job, allowing new jobs to start on the execution host before post-execution processing from a previous job has finished.
In some cases, you might want to prevent the overlap of a new job with post-execution processing. Preventing a new job from starting prior to completion of post-execution processing can be configured at the application level or at the job level.
At the job level, the bsub -w option allows you to specify job dependencies; the keywords post_done and post_err cause LSF to wait for completion of post-execution processing before starting another job.
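For illustration, the Python sketch below assembles a bsub command line that uses one of these keywords. The helper name, the job ID 1234, and the script ./next_step.sh are hypothetical; the sketch only builds and prints the command and does not submit anything.

```python
import shlex

def dependent_bsub_argv(job_id, command, on_error=False):
    """Build a bsub argument list that waits for post-execution of job_id."""
    keyword = "post_err" if on_error else "post_done"
    # -w takes a dependency expression such as post_done(1234).
    return ["bsub", "-w", f"{keyword}({job_id})", *command]

# Hypothetical job ID and script name.
argv = dependent_bsub_argv(1234, ["./next_step.sh"])
print(shlex.join(argv))
```

Passing the arguments as a list keeps the parentheses in the dependency expression away from shell interpretation if the list is later handed to a process launcher.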
| File | Parameter and syntax | Description |
|---|---|---|
| lsb.applications, lsb.params | JOB_INCLUDE_POSTPROC=Y | See the effects listed after this table |

- sbatchd sends both job finish status (DONE or EXIT) and post-execution processing status (POST_DONE or POST_ERR) to mbatchd at the same time
- The job remains in the RUN state and holds its job slot until post-execution processing has finished
- Job requeue happens (if required) after completion of post-execution processing, not when the job itself finishes
- For job history and job accounting, the job CPU and run times include the post-execution processing CPU and run times
- The job control commands bstop, bkill, and bresume have no effect during post-execution processing
- If a host becomes unavailable during post-execution processing for a rerunnable job, mbatchd sees the job as still in the RUN state and reruns the job
- LSF does not preempt jobs during post-execution processing
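The difference between the default behavior and JOB_INCLUDE_POSTPROC=Y can be modeled with a small Python sketch. This is a simplified model of the slot-release and accounting effects described above, not LSF code; the function names and time units are hypothetical.

```python
def slot_release_time(job_finish, postproc_duration, include_postproc):
    """Return the time at which the job slot becomes free in this model."""
    if include_postproc:
        # JOB_INCLUDE_POSTPROC=Y: the job stays in RUN and keeps its slot
        # until post-execution processing has finished.
        return job_finish + postproc_duration
    # Default: resources are released when the job itself finishes.
    return job_finish

def reported_run_time(job_run, postproc_duration, include_postproc):
    """Run time recorded for job history and accounting in this model."""
    if include_postproc:
        return job_run + postproc_duration
    return job_run

# A job that finishes at t=100 and has 20 time units of post-execution work.
print(slot_release_time(100, 20, include_postproc=False))
print(slot_release_time(100, 20, include_postproc=True))
```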
Configuration to modify the post-execution processing time
| File | Parameter and syntax | Description |
|---|---|---|
| lsb.applications, lsb.params | JOB_POSTPROC_TIMEOUT=minutes | |
Configuration to modify the pre- and post-execution processing user account
| File | Parameter and syntax | Description |
|---|---|---|
| lsf.sudoers | LSB_PRE_POST_EXEC_USER=user_name | |
Configuration to control how many times pre-execution retries
By default, if job pre-execution fails, LSF retries the job automatically. The job remains in the queue and pre-execution is retried up to 5 times, to minimize any impact on performance and throughput.
Limiting the number of times LSF retries job pre-execution is configured cluster-wide (lsb.params), at the queue level (lsb.queues), and at the application level (lsb.applications). Pre-execution retry in lsb.applications overrides lsb.queues, and lsb.queues overrides lsb.params configuration.
| Configuration file | Parameter and syntax | Behavior |
|---|---|---|
| lsb.params | LOCAL_MAX_PREEXEC_RETRY=integer | |
| lsb.params | MAX_PREEXEC_RETRY=integer | |
| lsb.params | REMOTE_MAX_PREEXEC_RETRY=integer | |
| lsb.queues | LOCAL_MAX_PREEXEC_RETRY=integer | |
| lsb.queues | MAX_PREEXEC_RETRY=integer | |
| lsb.queues | REMOTE_MAX_PREEXEC_RETRY=integer | |
| lsb.applications | LOCAL_MAX_PREEXEC_RETRY=integer | |
| lsb.applications | MAX_PREEXEC_RETRY=integer | |
| lsb.applications | REMOTE_MAX_PREEXEC_RETRY=integer | |
When pre-execution retry is configured, if a job's pre-execution fails and exits with a non-zero value, the number of pre-exec retries is set to 1. When the pre-exec retry limit is reached, the job is suspended with PSUSP status.
The number of times that pre-execution is retried includes queue-level, application-level, and job-level pre-execution command specifications. When pre-execution retry is configured, a job is suspended when either of the following sums is greater than the value of the pre-execution retry parameter:
- Queue-level pre-exec retry times + application-level pre-exec retry times
- Queue-level pre-exec retry times + job-level pre-exec retry times
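The suspension condition described above can be expressed as a short Python check. The function and argument names are hypothetical, and the sketch models only the comparison stated in the text.

```python
def exceeds_retry_limit(limit, queue_retries, app_retries=0, job_retries=0):
    """Return True when either retry sum is greater than the configured limit."""
    return (queue_retries + app_retries > limit
            or queue_retries + job_retries > limit)

# With a limit of 5: 3 queue-level + 2 application-level retries is not over the limit.
print(exceeds_retry_limit(5, queue_retries=3, app_retries=2, job_retries=1))
# One more application-level retry pushes the first sum above the limit.
print(exceeds_retry_limit(5, queue_retries=3, app_retries=3, job_retries=1))
```

Note that the comparison is strict: a sum equal to the limit does not trigger suspension in this model.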
The pre-execution retry limit is recovered when LSF is restarted and reconfigured. LSF replays the pre-execution retry limit in the PRE_EXEC_START or JOB_STATUS events in lsb.events.
Configuration to define default behavior of a job after it reaches the pre-execution retry limit
By default, if LSF retries the pre-execution command of a job on the local cluster and reaches the pre-execution retry threshold (LOCAL_MAX_PREEXEC_RETRY in lsb.params, lsb.queues, or lsb.applications), LSF suspends the job.
This default behavior of a job that has reached the pre-execution retry limit is configured cluster-wide (lsb.params), at the queue level (lsb.queues), and at the application level (lsb.applications). The behavior specified in lsb.applications overrides lsb.queues, and lsb.queues overrides the lsb.params configuration.
| Configuration file | Parameter and syntax | Behavior |
|---|---|---|
| lsb.params | LOCAL_MAX_PREEXEC_RETRY_ACTION = SUSPEND \| EXIT | |
| lsb.queues | LOCAL_MAX_PREEXEC_RETRY_ACTION = SUSPEND \| EXIT | |
| lsb.applications | LOCAL_MAX_PREEXEC_RETRY_ACTION = SUSPEND \| EXIT | |
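A tool that inspects these settings might normalize the configured action as in the Python sketch below. The function name is hypothetical, the fallback to SUSPEND mirrors the default described above, and the case-insensitive matching is a convenience of this sketch rather than documented LSF behavior.

```python
VALID_ACTIONS = ("SUSPEND", "EXIT")

def parse_retry_action(raw=None):
    """Normalize a LOCAL_MAX_PREEXEC_RETRY_ACTION value, defaulting to SUSPEND."""
    if raw is None or not raw.strip():
        # Nothing configured: by default the job is suspended.
        return "SUSPEND"
    action = raw.strip().upper()
    if action not in VALID_ACTIONS:
        raise ValueError(f"unsupported action: {raw!r}")
    return action

print(parse_retry_action())
print(parse_retry_action(" exit "))
```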