Configuration to modify pre- and post-execution processing

Configuration parameters modify various aspects of pre- and post-execution processing behavior by:

  • Preventing a new job from starting until post-execution processing has finished
  • Controlling the length of time post-execution processing can run
  • Specifying a user account under which the pre- and post-execution commands run
  • Controlling how many times pre-execution retries
  • Determining whether email with details of the post-execution output is sent to the user who submitted the job. For more details, see LSB_POSTEXEC_SEND_MAIL.

Some configuration parameters apply only to job-based pre- and post-execution processing, and some apply to both job- and host-based pre- and post-execution processing:

Job- and host-based

Job-based only

JOB_INCLUDE_POSTPROC in lsb.applications and lsb.params

MAX_PREEXEC_RETRY in lsb.applications and lsb.params

LOCAL_MAX_PREEXEC_RETRY in lsb.applications and lsb.params

LOCAL_MAX_PREEXEC_RETRY_ACTION in lsb.applications, lsb.queues, and lsb.params

REMOTE_MAX_PREEXEC_RETRY in lsb.applications and lsb.params

LSB_DISABLE_RERUN_POST_EXEC in lsf.conf

JOB_PREPROC_TIMEOUT in lsb.applications and lsb.params

JOB_POSTPROC_TIMEOUT in lsb.applications and lsb.params

LSB_PRE_POST_EXEC_USER in lsf.sudoers

LSB_POSTEXEC_SEND_MAIL in lsf.conf

PREEXEC_EXCLUDE_HOST_EXIT_VALUES in lsb.params


For details on each parameter, see IBM Spectrum LSF configuration reference.

JOB_PREPROC_TIMEOUT is designed to protect the system from hanging during pre-execution processing. When LSF detects that pre-execution processing has been running longer than the JOB_PREPROC_TIMEOUT value (the default value is infinite), LSF terminates the execution. The LSF administrator should therefore set JOB_PREPROC_TIMEOUT to a value longer than any pre-execution processing requires. JOB_POSTPROC_TIMEOUT should likewise be set to a value that gives host-based post-execution processing enough time to run.

Configuration to modify when new jobs can start

When a job finishes, sbatchd reports a job finish status of DONE or EXIT to mbatchd. This causes LSF to release resources associated with the job, allowing new jobs to start on the execution host before post-execution processing from a previous job has finished.

In some cases, you might want to prevent the overlap of a new job with post-execution processing. Preventing a new job from starting prior to completion of post-execution processing can be configured at the application level or at the job level.

At the job level, the bsub -w option allows you to specify job dependencies; the keywords post_done and post_err cause LSF to wait for completion of post-execution processing before starting another job.
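As a rough illustration of the job-level approach, the following Python sketch builds a bsub argument list whose -w dependency expression uses post_done or post_err. The helper name submit_after_post, the job ID 1234, and the script name cleanup.sh are hypothetical; the sketch only assembles the arguments and does not call LSF.

```python
def submit_after_post(job_id, command, keyword="post_done"):
    """Build a bsub argument list that waits for post-execution of job_id.

    keyword is post_done or post_err, the dependency keywords named above.
    """
    if keyword not in ("post_done", "post_err"):
        raise ValueError("keyword must be post_done or post_err")
    # bsub -w takes the whole dependency expression as a single argument.
    return ["bsub", "-w", f"{keyword}({job_id})", *command]


if __name__ == "__main__":
    # Hypothetical job ID and script name, for illustration only.
    print(submit_after_post(1234, ["./cleanup.sh"]))
```

Passing the result to a process launcher such as subprocess.run would submit the dependent job on a host where bsub is available.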

At the application level:

File

Parameter and syntax

Description

lsb.applications

lsb.params

JOB_INCLUDE_POSTPROC=Y

  • Enables completion of post-execution processing before LSF reports a job finish status of DONE or EXIT

  • Prevents a new job from starting on a host until post-execution processing is finished on that host


  • sbatchd sends both job finish status (DONE or EXIT) and post-execution processing status (POST_DONE or POST_ERR) to mbatchd at the same time
  • The job remains in the RUN state and holds its job slot until post-execution processing has finished
  • Job requeue happens (if required) after completion of post-execution processing, not when the job itself finishes
  • For job history and job accounting, the job CPU and run times include the post-execution processing CPU and run times
  • The job control commands bstop, bkill, and bresume have no effect during post-execution processing
  • If a host becomes unavailable during post-execution processing for a rerunnable job, mbatchd sees the job as still in the RUN state and reruns the job
  • LSF does not preempt jobs during post-execution processing

Configuration to modify the post-execution processing time

Controlling the length of time post-execution processing can run is configured at the application level.

File

Parameter and syntax

Description

lsb.applications

lsb.params

JOB_POSTPROC_TIMEOUT=minutes

  • Specifies the length of time, in minutes, that post-execution processing can run.

  • The specified value must be greater than zero.

  • If post-execution processing takes longer than the specified value, sbatchd reports post-execution failure (a status of POST_ERR). On UNIX and Linux, sbatchd kills the entire process group of the job's post-execution processes. On Windows, only the parent process of the post-execution command is killed when the timeout expires; the child processes of the post-execution command are not killed.

  • If JOB_INCLUDE_POSTPROC=Y and sbatchd kills the post-execution process group, post-execution processing CPU time is set to zero, and the job’s CPU time does not include post-execution CPU time.
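
The general technique of killing a whole process group when a timeout expires can be sketched in Python as follows. This is not how sbatchd is implemented; it is a minimal UNIX and Linux illustration under stated assumptions. The function name run_post_exec is hypothetical, the timeout is given in seconds rather than minutes to keep the example short, and treating a non-zero exit code as POST_ERR is an assumption.

```python
import os
import signal
import subprocess


def run_post_exec(command, timeout_seconds):
    """Run command in its own process group and kill the group on timeout.

    Returns "POST_DONE" on exit code 0 and "POST_ERR" otherwise (assumed
    mapping), including when the timeout expires. UNIX and Linux only.
    """
    # start_new_session=True makes the child the leader of a new process
    # group, so its process ID is also the process group ID.
    proc = subprocess.Popen(command, start_new_session=True)
    try:
        returncode = proc.wait(timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        try:
            # Signal the whole group so child processes do not survive.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group already exited on its own
        proc.wait()  # reap the killed process
        return "POST_ERR"
    return "POST_DONE" if returncode == 0 else "POST_ERR"
```

For example, run_post_exec(["sleep", "30"], 0.2) gives up after 0.2 seconds and kills the sleep process together with any children it started.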


Configuration to modify the pre- and post-execution processing user account

Specifying a user account under which the pre- and post-execution commands run is configured at the system level. By default, both the pre- and post-execution commands run under the account of the user who submits the job.

File

Parameter and syntax

Description

lsf.sudoers

LSB_PRE_POST_EXEC_USER=user_name

  • Specifies the user account under which pre- and post-execution commands run (UNIX only)

  • This parameter applies only to pre- and post-execution commands configured at the queue level; pre-execution commands defined at the application or job level run under the account of the user who submits the job

  • If the pre-execution or post-execution commands perform privileged operations that require root permissions on UNIX hosts, specify a value of root

  • You must edit the lsf.sudoers file on all UNIX hosts within the cluster and specify the same user account
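
Because every UNIX host must name the same account, a consistency check can be useful. The following Python sketch compares the LSB_PRE_POST_EXEC_USER value across the lsf.sudoers contents of several hosts. The function names and the host names are hypothetical, and the parsing assumes simple name=value lines with # comments; how the file contents are collected from each host is left out.

```python
def pre_post_exec_user(sudoers_text):
    """Return the LSB_PRE_POST_EXEC_USER value from lsf.sudoers text, or None."""
    for line in sudoers_text.splitlines():
        # Assumption: comments start with # and run to the end of the line.
        line = line.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip() == "LSB_PRE_POST_EXEC_USER":
            return value.strip()
    return None


def hosts_agree(sudoers_by_host):
    """Return True if every host's lsf.sudoers names the same user account."""
    users = {pre_post_exec_user(text) for text in sudoers_by_host.values()}
    return len(users) == 1 and None not in users


if __name__ == "__main__":
    # Hypothetical host names and file contents.
    files = {
        "hostA": "LSB_PRE_POST_EXEC_USER=root\n",
        "hostB": "# site settings\nLSB_PRE_POST_EXEC_USER=root\n",
    }
    print(hosts_agree(files))
```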


Configuration to control how many times pre-execution retries

If job pre-execution fails, LSF retries the job automatically. The job remains in the queue, and by default pre-execution is retried 5 times to minimize any impact on performance and throughput.

Limiting the number of times LSF retries job pre-execution is configured cluster-wide (lsb.params), at the queue level (lsb.queues), and at the application level (lsb.applications). Pre-execution retry in lsb.applications overrides lsb.queues, and lsb.queues overrides lsb.params configuration.
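
The override order described above (lsb.applications over lsb.queues over lsb.params) can be sketched as a small lookup. This is only an illustration of the precedence rule, not LSF code; the function name effective_retry_limit is hypothetical, and None stands for a level where the parameter is not defined.

```python
def effective_retry_limit(params=None, queue=None, application=None):
    """Return the retry limit from the most specific level that sets one.

    Each argument is the integer configured at that level, or None if the
    parameter is not defined there. Returns None when no level sets a value.
    """
    # Most specific first: application, then queue, then cluster-wide.
    for value in (application, queue, params):
        if value is not None:
            return value
    return None


if __name__ == "__main__":
    # The queue-level value overrides the cluster-wide value.
    print(effective_retry_limit(params=5, queue=3))
```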


Configuration file

Parameter and syntax

Behavior

lsb.params

LOCAL_MAX_PREEXEC_RETRY=integer

  • Controls the maximum number of times to attempt the pre-execution command of a job on the local cluster.

  • Specify an integer greater than 0

    By default, the number of retries is unlimited.

MAX_PREEXEC_RETRY=integer

  • Controls the maximum number of times to attempt the pre-execution command of a job on the remote cluster.

  • Specify an integer greater than 0

    By default, the number of retries is 5.

REMOTE_MAX_PREEXEC_RETRY=integer

  • Controls the maximum number of times to attempt the pre-execution command of a job on the remote cluster.

    Equivalent to MAX_PREEXEC_RETRY

  • Specify an integer greater than 0

    By default, the number of retries is 5.

lsb.queues

LOCAL_MAX_PREEXEC_RETRY=integer

  • Controls the maximum number of times to attempt the pre-execution command of a job on the local cluster.

  • Specify an integer greater than 0

    By default, the number of retries is unlimited.

MAX_PREEXEC_RETRY=integer

  • Controls the maximum number of times to attempt the pre-execution command of a job on the remote cluster.

  • Specify an integer greater than 0

    By default, the number of retries is 5.

REMOTE_MAX_PREEXEC_RETRY=integer

  • Controls the maximum number of times to attempt the pre-execution command of a job on the remote cluster.

    Equivalent to MAX_PREEXEC_RETRY

  • Specify an integer greater than 0

    By default, the number of retries is 5.

lsb.applications

LOCAL_MAX_PREEXEC_RETRY=integer

  • Controls the maximum number of times to attempt the pre-execution command of a job on the local cluster.

  • Specify an integer greater than 0

    By default, the number of retries is unlimited.

MAX_PREEXEC_RETRY=integer

  • Controls the maximum number of times to attempt the pre-execution command of a job on the remote cluster.

  • Specify an integer greater than 0

    By default, the number of retries is 5.

REMOTE_MAX_PREEXEC_RETRY=integer

  • Controls the maximum number of times to attempt the pre-execution command of a job on the remote cluster.

    Equivalent to MAX_PREEXEC_RETRY

  • Specify an integer greater than 0

    By default, the number of retries is 5.


When pre-execution retry is configured, if a job's pre-execution fails and exits with a non-zero value, the number of pre-exec retries is set to 1. When the pre-exec retry limit is reached, the job is suspended with PSUSP status.

The number of times that pre-execution is retried includes queue-level, application-level, and job-level pre-execution command specifications. When pre-execution retry is configured, a job is suspended when the sum of its queue-level and application-level pre-execution retries is greater than the value of the pre-execution retry parameter, or when the sum of its queue-level and job-level pre-execution retries is greater than that value.
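
The suspension rule above can be written as a short predicate. The following Python sketch is an illustration of the two sums being compared against the limit, not LSF's implementation; the function and argument names are hypothetical.

```python
def should_suspend(limit, queue_retries, app_retries=0, job_retries=0):
    """Return True if the retry counts exceed the pre-execution retry limit.

    The job is suspended when queue-level plus application-level retries,
    or queue-level plus job-level retries, is greater than the limit.
    """
    return (queue_retries + app_retries > limit
            or queue_retries + job_retries > limit)


if __name__ == "__main__":
    # 3 queue-level retries plus 3 application-level retries, limit of 5.
    print(should_suspend(5, queue_retries=3, app_retries=3))
```

Note that a sum equal to the limit does not trigger suspension, because the comparison is strictly greater than.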

The pre-execution retry limit is recovered when LSF is restarted and reconfigured. LSF replays the pre-execution retry limit in the PRE_EXEC_START or JOB_STATUS events in lsb.events.

Configuration to define default behavior of a job after it reaches the pre-execution retry limit

By default, if LSF retries the pre-execution command of a job on the local cluster and reaches the pre-execution retry threshold (LOCAL_MAX_PREEXEC_RETRY in lsb.params, lsb.queues, or lsb.applications), LSF suspends the job.

This default behavior of a job that has reached the pre-execution retry limit is configured cluster-wide (lsb.params), at the queue level (lsb.queues), and at the application level (lsb.applications). The behavior specified in lsb.applications overrides lsb.queues, and lsb.queues overrides the lsb.params configuration.


Configuration file

Parameter and syntax

Behavior

lsb.params

LOCAL_MAX_PREEXEC_RETRY_ACTION = SUSPEND | EXIT

  • Specifies the default behavior of a job (on the local cluster) that has reached the maximum pre-execution retry limit.
  • If set to SUSPEND, the job is suspended and its status is set to PSUSP.

    If set to EXIT, the job status is set to EXIT and the exit code is the same as the last pre-execution fail exit code.

    By default, the job is suspended.

lsb.queues

LOCAL_MAX_PREEXEC_RETRY_ACTION = SUSPEND | EXIT

  • Specifies the default behavior of a job (on the local cluster) that has reached the maximum pre-execution retry limit.
  • If set to SUSPEND, the job is suspended and its status is set to PSUSP.

    If set to EXIT, the job status is set to EXIT and the exit code is the same as the last pre-execution fail exit code.

    By default, this is not defined.

lsb.applications

LOCAL_MAX_PREEXEC_RETRY_ACTION = SUSPEND | EXIT

  • Specifies the default behavior of a job (on the local cluster) that has reached the maximum pre-execution retry limit.
  • If set to SUSPEND, the job is suspended and its status is set to PSUSP.

    If set to EXIT, the job status is set to EXIT and the exit code is the same as the last pre-execution fail exit code.

    By default, this is not defined.
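
Putting the precedence rule and the two actions together, the resulting job state can be sketched as follows. This Python example is only an illustration under stated assumptions: the function name final_state is hypothetical, None stands for a level where LOCAL_MAX_PREEXEC_RETRY_ACTION is not defined, and returning None as the exit code for a suspended job is a choice made for the sketch.

```python
def final_state(last_preexec_exit_code, params=None, queue=None, application=None):
    """Return (status, exit_code) for a job that reached the retry limit.

    Each level argument is "SUSPEND", "EXIT", or None if not defined there.
    """
    # Most specific level wins; with nothing defined, the job is suspended.
    action = next(
        (a for a in (application, queue, params) if a is not None),
        "SUSPEND",
    )
    if action == "SUSPEND":
        return ("PSUSP", None)
    if action == "EXIT":
        # The exit code matches the last failed pre-execution exit code.
        return ("EXIT", last_preexec_exit_code)
    raise ValueError("action must be SUSPEND or EXIT")


if __name__ == "__main__":
    # The queue-level EXIT overrides the cluster-wide SUSPEND.
    print(final_state(3, params="SUSPEND", queue="EXIT"))
```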