Handling exceptions in your flow

IBM Spectrum LSF Process Manager provides flexible ways to handle certain job processing failures so that you can define what to do when these failures occur. A failure of a job to process is indicated by an exception.

IBM Spectrum LSF Process Manager provides some built-in exception handlers you can use to automate the recovery process, and an alarm facility you can use to notify people of particular failures.

IBM Spectrum LSF Process Manager exceptions

IBM Spectrum LSF Process Manager monitors for the following exceptions:

  • Misschedule
  • Overrun
  • Underrun
  • Start Failed
  • Cannot Run

Misschedule

A Misschedule exception occurs when a job, job array, flow or subflow depends on a time event, but is unable to start while that event is active. There are many reasons why a job can miss its schedule. For example, you may have specified a dependency that was not satisfied while the time event was active.

Note:

When a job depends on a time event, and you want to monitor for a misschedule of the job, ensure that the time event either directly precedes the job in the flow diagram, or is separated from the job by no more than one link (AND or OR). IBM Spectrum LSF Process Manager is unable to process the Misschedule exception if multiple links are used between the time event and the job that depends on it.
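
The Misschedule condition described above can be pictured with a small sketch. This is only an illustration and not a Process Manager API: the function name `is_misscheduled` and its parameters are hypothetical, and treating the boundaries of the time event as inclusive is an assumption.

```python
from datetime import datetime
from typing import Optional


def is_misscheduled(event_start: datetime, event_end: datetime,
                    started_at: Optional[datetime]) -> bool:
    """Return True if the work item did not start while the time event was active.

    started_at is None when the work item never started.
    Inclusive window boundaries are an assumption of this sketch.
    """
    if started_at is None:
        return True
    return not (event_start <= started_at <= event_end)


# A time event that is active from 09:00 to 10:00.
window_start = datetime(2024, 1, 1, 9, 0)
window_end = datetime(2024, 1, 1, 10, 0)

# Started at 09:30, inside the window: no Misschedule.
on_time = is_misscheduled(window_start, window_end, datetime(2024, 1, 1, 9, 30))
# Never started, for example because a dependency was not satisfied: Misschedule.
never_started = is_misscheduled(window_start, window_end, None)
```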

Overrun

An Overrun exception occurs when a job, job array, flow or subflow exceeds its maximum allowable run time. You use this exception to detect runaway or hung jobs. The time is calculated using wall-clock time, from when the work item is first submitted to LSF® until its status changes from Running to Exit or Done, or until the Overrun time is reached, whichever comes first.

Underrun

An Underrun exception occurs when a job, job array, flow or subflow finishes sooner than its minimum expected run time. You use this exception to detect when a job finishes prematurely. This exception is not raised when a job is killed by IBM Spectrum LSF Process Manager. The time is calculated using wall-clock time, from when the work item is first submitted to LSF until its status changes from Running to Exit or Done.
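
As a rough illustration of the Overrun and Underrun rules, the following sketch compares elapsed wall-clock time against a minimum and a maximum run time. It is a model of the description only, not how Process Manager is implemented: the function `runtime_exception` and all of its parameter names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def runtime_exception(submitted_at: datetime,
                      finished_at: Optional[datetime],
                      now: datetime,
                      min_runtime: Optional[timedelta] = None,
                      max_runtime: Optional[timedelta] = None,
                      killed_by_process_manager: bool = False) -> Optional[str]:
    """Return "Overrun", "Underrun", or None for a single work item.

    finished_at is None while the work item is still running; in that
    case the elapsed time is measured up to `now`.
    """
    end = finished_at if finished_at is not None else now
    elapsed = end - submitted_at

    # Overrun: the maximum allowable run time has been exceeded,
    # whether or not the work item has finished.
    if max_runtime is not None and elapsed > max_runtime:
        return "Overrun"

    # Underrun: the work item finished sooner than expected and was
    # not killed by Process Manager.
    if (finished_at is not None and min_runtime is not None
            and elapsed < min_runtime and not killed_by_process_manager):
        return "Underrun"

    return None


submitted = datetime(2024, 1, 1, 12, 0)

# Finished after 2 minutes with a 5-minute minimum.
early = runtime_exception(submitted, submitted + timedelta(minutes=2),
                          now=submitted + timedelta(minutes=2),
                          min_runtime=timedelta(minutes=5))

# Still running after 90 minutes with a 60-minute maximum.
hung = runtime_exception(submitted, None,
                         now=submitted + timedelta(minutes=90),
                         max_runtime=timedelta(minutes=60))
```

In this sketch `early` evaluates to `"Underrun"` and `hung` to `"Overrun"`. Passing `killed_by_process_manager=True` to the first call suppresses the Underrun result, matching the rule that the exception is not raised for jobs that Process Manager kills.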

Start Failed

A Start Failed exception occurs when a job or job array is unable to run because its execution environment could not be set up properly. Typical reasons for this exception include a lack of system resources, such as a full process table on the execution host, or a file system that was not mounted properly.

Cannot Run

A Cannot Run exception occurs when a job or job array cannot proceed because of an error in submission. A typical reason for this exception might be an invalid job parameter.

Behavior when an exception occurs

The following table describes IBM Spectrum LSF Process Manager behavior when an exception occurs and no automatic exception handling is used:


| When a … | Experiences this exception … | This happens … |
|---|---|---|
| Flow definition | Misschedule | The flow is not triggered. |
| Flow | Overrun | The flow continues to run after the exception occurs. The run time is calculated from when the flow is first triggered until its status changes from Running to Exit or Done, or until the Overrun time is reached, whichever comes first. |
| Flow | Underrun | The time is calculated from when the flow is first triggered until its status changes from Running to Exit or Done. |
| Subflow | Misschedule | The subflow is not run. |
| Subflow | Overrun | The subflow continues to run after the exception occurs. The run time is calculated from when the subflow is first triggered until its status changes from Running to Exit or Done, or until the Overrun time is reached, whichever comes first. |
| Subflow | Underrun | The time is calculated from when the subflow first starts running until its status changes from Running to Exit or Done. |
| Job | Misschedule | The job is not run. |
| Job | Cannot Run | The job is not run. |
| Job | Start Failed | The job is still waiting. Submission of the job is retried up to the configured number of retries. If the job still cannot run, a Cannot Run exception is raised. |
| Job | Overrun | The job continues to run after the exception occurs. The run time is calculated from when the job is successfully submitted until it reaches Exit or Done state, or until the Overrun time is reached, whichever comes first. |
| Job | Underrun | The time is calculated from when the job is successfully submitted until it reaches Exit or Done state. |
| Job array | Misschedule | The job array is not run. |
| Job array | Cannot Run | The job array is not run. |
| Job array | Start Failed | The job array is still waiting. Submission of the job array is retried until it runs. |
| Job array | Overrun | The job array continues to run after the exception occurs. The run time is calculated from when the job array is successfully submitted until its status changes from Running to Exit or Done, or until the Overrun time is reached, whichever comes first. |
| Job array | Underrun | The time is calculated from when the job array is successfully submitted until all elements in the array reach Exit or Done state. |
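
The Start Failed behavior for a job (retry the submission a configured number of times, then raise Cannot Run) can be sketched as a simple retry loop. This is a simplified model rather than Process Manager code: the names `submit_job_with_retries`, `try_submit`, and `max_retries` are hypothetical, and counting one initial attempt in addition to the retries is an assumption.

```python
from typing import Callable


def submit_job_with_retries(try_submit: Callable[[], bool], max_retries: int) -> str:
    """Return "Running" if a submission attempt succeeds, else "Cannot Run".

    try_submit returns False when the attempt ends in a Start Failed exception.
    """
    # Assumption: one initial attempt plus up to max_retries retries.
    for _attempt in range(1 + max_retries):
        if try_submit():
            return "Running"
        # Start Failed: the job is still waiting and submission is retried.
    # The retries are exhausted, so a Cannot Run exception is raised.
    return "Cannot Run"


# Simulated host where the execution environment is ready on the third attempt.
outcomes = iter([False, False, True])
recovered = submit_job_with_retries(lambda: next(outcomes), max_retries=3)

# Simulated host where the execution environment can never be set up.
exhausted = submit_job_with_retries(lambda: False, max_retries=3)
```

A job array differs in that its submission is retried until it runs, so a model of that case would have no retry limit.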


User-specified conditions

In addition to the IBM Spectrum LSF Process Manager exceptions, you can specify and handle other conditions, depending on the type of work item you are defining. For example, when you are defining a job, you can monitor the job for a particular exit code, and automatically rerun the job if the exit code occurs. The behavior when one of these conditions occurs depends on what you specify in the flow definition.

You can monitor for the following conditions in addition to the IBM Spectrum LSF Process Manager exceptions:


| Work Item | Condition |
|---|---|
| Flow | An exit code of n (sum of all exit codes) |
| Flow | n unsuccessful jobs |
| Subflow | An exit code of n |
| Subflow | n unsuccessful jobs |
| Job | An exit code of n |
| Job array | An exit code of n |
| Job array | n unsuccessful jobs |
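
To make the two flow-level conditions concrete, the sketch below computes a flow exit code as the sum of its job exit codes and counts the unsuccessful jobs. The function names and the sample job names are hypothetical, and treating any nonzero exit code as "unsuccessful" is an assumption of this sketch, not something the text above specifies.

```python
from typing import Dict


def flow_exit_code(job_exit_codes: Dict[str, int]) -> int:
    """Sum of the exit codes of all jobs in the flow."""
    return sum(job_exit_codes.values())


def unsuccessful_job_count(job_exit_codes: Dict[str, int]) -> int:
    """Number of jobs considered unsuccessful.

    Assumption: a job is unsuccessful when its exit code is nonzero.
    """
    return sum(1 for code in job_exit_codes.values() if code != 0)


# Hypothetical exit codes for the jobs of one flow.
exit_codes = {"extract": 0, "transform": 2, "load": 1}

total = flow_exit_code(exit_codes)            # 0 + 2 + 1
failed = unsuccessful_job_count(exit_codes)   # "transform" and "load"

# Conditions of the form "an exit code of n" and "n unsuccessful jobs".
exit_code_condition_met = total == 3
unsuccessful_condition_met = failed == 2
```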