About exception handlers

IBM Spectrum LSF Process Manager provides built-in exception handlers that automatically take corrective action when certain exceptions occur, minimizing the human intervention required. You can also define your own exception handlers for certain conditions.

IBM Spectrum LSF Process Manager built-in exception handlers

The built-in exception handlers are:

  • Rerun
  • Kill

Rerun

The Rerun exception handler reruns the entire job, job array, subflow or flow. Use this exception handler in situations where rerunning the work item can fix the problem. The Rerun exception handler can be used with Underrun, Exit and Start Failed exceptions.

Kill

The Kill exception handler kills the job, job array, subflow or flow. Use this exception handler when a work item has overrun its time limits. The Kill exception handler can be used with the Overrun exception, and when you are monitoring for the number of jobs done or exited in a flow or subflow.
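
The pairings of built-in handlers and exceptions described above can be summarized as a small lookup table. The following Python sketch is illustrative only: the dictionary, the function name, and the exception labels for the job-count conditions are assumptions made for this example and are not part of any Process Manager API.

```python
# Hypothetical summary of which built-in handler applies to which exception.
# The labels "Number of jobs done" and "Number of jobs exited" are names
# chosen for this sketch; they stand for the job-count monitoring conditions.
BUILT_IN_HANDLERS = {
    "Rerun": {"Underrun", "Exit", "Start Failed"},
    "Kill": {"Overrun", "Number of jobs done", "Number of jobs exited"},
}


def handler_applies(handler: str, exception: str) -> bool:
    """Return True if the built-in handler can be paired with the exception."""
    return exception in BUILT_IN_HANDLERS.get(handler, set())


print(handler_applies("Rerun", "Start Failed"))
print(handler_applies("Kill", "Underrun"))
```

Keeping the pairings in one structure makes it easy to check a planned flow definition against the rules before you build it in the flow diagram.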

User-defined exception handlers

In addition to the built-in exception handlers, you can design your flow definitions to handle exceptions by:

  • Opening an alarm
  • Running a recovery job
  • Triggering another flow

Alarm

An alarm provides a visual cue that an exception has occurred, and either sends an email notification to one or more addresses or runs a script. You use an alarm to notify key personnel, such as database administrators, of problems that require attention. An alarm has no effect on the flow itself.

When you create your flow definition, you can add a predefined alarm to the flow diagram, as you would a job. You create a dependency from the work item to the alarm, and the alarm can be opened by any of the exceptions available in the dependency definition. An alarm cannot precede another work item in the diagram: you cannot draw a dependency from an alarm to another work item in the flow.
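
The rule that an alarm can only be the target of a dependency, never its source, can be expressed as a simple check over the edges of a flow diagram. This is a minimal sketch under stated assumptions: the function name, the representation of dependencies as (source, target) pairs, and the work item names are all hypothetical and are not taken from Process Manager.

```python
from typing import Iterable, List, Set, Tuple


def invalid_alarm_edges(
    alarms: Set[str],
    dependencies: Iterable[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Return every (source, target) dependency that starts at an alarm."""
    return [(src, dst) for src, dst in dependencies if src in alarms]


# Hypothetical flow: load_data opens db_alarm on an exception and also
# precedes the report job. No dependency starts at the alarm.
edges = [("load_data", "db_alarm"), ("load_data", "report")]
print(invalid_alarm_edges({"db_alarm"}, edges))
```

An empty result means every alarm in the diagram is a terminal node, which matches the constraint described above.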

An opened alarm appears in the list of open alarms in the Flow Manager until the history log file containing the alarm is deleted or archived.

Valid alarm names are configured by the IBM Spectrum LSF Process Manager administrator.

Recovery job

You can use a job dependency in a flow diagram to run a job that performs some recovery function when an exception occurs.

Recovery flow

You can create a flow that performs some recovery function for another flow. When you submit the recovery flow, specify the name of the flow and the exception as the event that triggers the recovery flow.

Behavior when exception handlers are used

Flow


| When a flow experiences this exception | and the handler used is | this happens |
| --- | --- | --- |
| Overrun | Kill | The flow is killed. All incomplete jobs in the flow are killed. The flow status is ‘Killed’. |
| Underrun | Rerun | Flows that have a dependency on this flow may not be triggered, depending on the type of dependency. The flow is recreated with the same flow ID. The flow is rerun from the first job, or from any rerun starting points, as many times as required until the execution time exceeds the underrun time specified. |
| An exit code of n | Rerun | Flows that have a dependency on this flow may not be triggered, depending on the type of dependency. The flow is recreated with the same flow ID. The flow is rerun from the first job, or from any rerun starting points, as many times as required until an exit code other than n is reached. |
| n unsuccessful jobs | Kill | The flow is killed. All incomplete jobs in the flow are killed. The flow status is ‘Killed’. |
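
The Rerun behavior for an Underrun exception amounts to a loop: the work item runs again until one run lasts longer than the underrun time. The following Python sketch simulates that loop. It is not how Process Manager is implemented; the function name, the `run_once` callback, and the `max_attempts` safety limit are assumptions added for this example (the behavior described above has no such limit).

```python
from typing import Callable


def rerun_until_long_enough(
    run_once: Callable[[], float],
    underrun_seconds: float,
    max_attempts: int = 10,
) -> int:
    """Rerun a work item until one run exceeds the underrun time.

    run_once runs the work item and returns its execution time in seconds.
    Returns the number of runs that were needed. max_attempts is a safety
    limit added for this sketch only.
    """
    for attempt in range(1, max_attempts + 1):
        elapsed = run_once()
        if elapsed > underrun_seconds:
            return attempt
    raise RuntimeError(f"still underrunning after {max_attempts} attempts")


# Simulated execution times: the first two runs finish too quickly.
durations = iter([2.0, 3.5, 42.0])
attempts = rerun_until_long_enough(lambda: next(durations), underrun_seconds=10.0)
print(attempts)  # 3
```

The same loop shape applies to the exit code case: replace the duration comparison with a check that the exit code differs from n.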


Subflow


| When a subflow experiences this exception | and the handler used is | this happens |
| --- | --- | --- |
| Misschedule | Alarm | The alarm is opened. The subflow is not run. The flow continues execution as designed. |
| Misschedule | Recovery job or flow | The subflow is not run. The flow continues execution as designed. The recovery job or flow is triggered. |
| Overrun | Alarm | The alarm is opened. Both the flow and subflow continue execution as designed. |
| Overrun | Recovery job or flow | Both the flow and subflow continue execution as designed. The recovery job or flow is triggered. |
| Overrun | Kill | The subflow is killed. The flow behaves as designed. |
| Underrun | Alarm | The alarm is opened. The flow continues execution as designed. |
| Underrun | Recovery job or flow | The subflow continues execution as designed. The recovery job or flow is triggered. |
| Underrun | Rerun | Work items that have a dependency on this subflow may not be triggered, depending on the type of dependency. The subflow is rerun from the first job as many times as required until the execution time exceeds the underrun time specified. |
| An exit code of n | Rerun | Work items that have a dependency on this subflow may not be triggered, depending on the type of dependency. The subflow is rerun from the first job as many times as required until an exit code other than n is reached. |
| n unsuccessful jobs | Kill | The subflow is killed. The flow behaves as designed. |


Job or job array


| When a job or job array experiences this exception | and the handler used is | this happens |
| --- | --- | --- |
| Misschedule | Alarm | The alarm is opened. The job or job array is not run. The flow continues execution as designed. |
| Misschedule | Recovery job or flow | The job or job array is not run. The flow continues execution as designed. The recovery job or flow is triggered. |
| Overrun | Alarm | The alarm is opened. Both the flow and job or job array continue to execute as designed. |
| Overrun | Recovery job or flow | Both the flow and job or job array continue to execute as designed. The recovery job or flow is triggered. |
| Overrun | Kill | The job or job array is killed. The flow behaves as designed. The job or job array status is determined by its exit value. |
| Underrun | Alarm | The alarm is opened. The flow continues execution as designed. |
| Underrun | Recovery job or flow | The flow continues execution as designed. The recovery job or flow is triggered. |
| Underrun | Rerun | Work items that have a dependency on this job or job array are not triggered. The job or job array is rerun as many times as required until the execution time exceeds the underrun time specified. |
| Start Failed | Alarm | The alarm is opened. The flow continues execution as designed. |
| Start Failed | Recovery job or flow | The recovery job or flow is triggered. |
| Start Failed | Rerun | The job or job array is rerun as many times as required until it starts successfully. |
| Cannot Run | Alarm | The alarm is opened. The flow continues execution as designed. |
| Cannot Run | Recovery job or flow | The recovery job or flow is triggered. |
| An exit code of n | Rerun | The job or job array is rerun as many times as required until an exit code other than n is reached. |
| n unsuccessful jobs | Kill | The job array is killed. The flow behaves as designed. The job array status is determined by its exit value. |