Process Manager exceptions
Process Manager provides flexible ways to handle certain job processing failures so that you can define what to do when these failures occur. A failure of a job to process is indicated by an exception. Process Manager provides some built-in exception handlers you can use to automate the recovery process, and an alarm facility you can use to notify people of particular failures.
- Misschedule
- Overrun
- Underrun
- Start Failed
- Cannot Run
Misschedule
Overrun
An Overrun exception occurs when a job, job array, flow or subflow exceeds its maximum allowable run time. You use this exception to detect runaway or hung jobs. The time is calculated using wall-clock time, from when the work item is first submitted to LSF® until its status changes from Running to Exit or Done, or until the Overrun time is reached, whichever comes first.
Underrun
An Underrun exception occurs when a job, job array, flow or subflow finishes sooner than its minimum expected run time. You use this exception to detect when a job finishes prematurely. This exception is not raised when a job is killed by Process Manager. The time is calculated using wall-clock time, from when the work item is first submitted to LSF until its status changes from Running to Exit or Done.
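The Overrun and Underrun definitions above amount to comparing elapsed wall-clock time against a maximum and a minimum. The following is a minimal sketch of that comparison, not Process Manager code: the function name `check_run_time` and all of its parameters are hypothetical and are used only for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional


def check_run_time(
    submitted: datetime,
    finished: Optional[datetime],
    now: datetime,
    max_run: Optional[timedelta] = None,
    min_run: Optional[timedelta] = None,
    killed_by_process_manager: bool = False,
) -> Optional[str]:
    """Return "Overrun", "Underrun", or None for one work item.

    Hypothetical helper: elapsed time is wall-clock time measured from
    submission until the work item reaches Exit or Done (or until now,
    if it has not finished yet).
    """
    end = finished if finished is not None else now
    elapsed = end - submitted

    # Overrun can be detected while the work item is still running.
    if max_run is not None and elapsed > max_run:
        return "Overrun"

    # Underrun only applies to a finished work item, and is not raised
    # when the work item was killed by Process Manager.
    if (
        finished is not None
        and min_run is not None
        and elapsed < min_run
        and not killed_by_process_manager
    ):
        return "Underrun"

    return None
```

For example, a job submitted at 12:00 that is still running at 13:30 with a one-hour maximum is reported as `"Overrun"`, while a job that finishes after two minutes with a ten-minute minimum is reported as `"Underrun"`.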
Start Failed
A Start Failed exception occurs when a job or job array is unable to run because its execution environment could not be set up properly. Typical reasons for this exception include a lack of system resources, such as a full process table on the execution host, or a file system that was not mounted properly.
Cannot Run
A Cannot Run exception occurs when a job or job array cannot proceed because of an error in submission. A typical reason for this exception might be an invalid job parameter.
Behavior when an exception occurs
The following table describes Process Manager behavior when an exception occurs and no automatic exception handling is used:
When a … | Experiences this exception … | This happens … |
---|---|---|
Flow definition | Misschedule | The flow is not triggered. |
Flow | Overrun | The flow continues to run after the exception occurs. The run time is calculated from when the flow is first triggered until its status changes from Running to Exit or Done, or until the Overrun time is reached, whichever comes first. |
Flow | Underrun | The time is calculated from when the flow is first triggered until its status changes from Running to Exit or Done. |
Subflow | Misschedule | The subflow is not run. |
Subflow | Overrun | The subflow continues to run after the exception occurs. The run time is calculated from when the subflow is first triggered until its status changes from Running to Exit or Done, or until the Overrun time is reached, whichever comes first. |
Subflow | Underrun | The time is calculated from when the subflow first starts running until its status changes from Running to Exit or Done. |
Job | Misschedule | The job is not run. |
Job | Cannot Run | The job is not run. |
Job | Start Failed | The job is still waiting. Submission of the job is retried up to the configured number of retries. If the job still cannot run, a Cannot Run exception is raised. |
Job | Overrun | The job continues to run after the exception occurs. The run time is calculated from when the job is successfully submitted until it reaches Exit or Done state, or until the Overrun time is reached, whichever comes first. |
Job | Underrun | The time is calculated from when the job is successfully submitted until it reaches Exit or Done state. |
Job array | Misschedule | The job array is not run. |
Job array | Cannot Run | The job array is not run. |
Job array | Start Failed | The job array is still waiting. Submission of the job array is retried until it runs. |
Job array | Overrun | The job array continues to run after the exception occurs. The run time is calculated from when the job array is successfully submitted until its status changes from Running to Exit or Done, or until the Overrun time is reached, whichever comes first. |
Job array | Underrun | The time is calculated from when the job array is successfully submitted until all elements in the array reach Exit or Done state. |
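
The Start Failed behavior for a job in the table above (retry the submission, then raise Cannot Run) can be sketched as a small loop. This is an illustration under stated assumptions, not Process Manager code: `submit_with_retries`, `try_start`, and `max_retries` are hypothetical names, and the sketch assumes that a Start Failed exception is recorded for every failed attempt.

```python
from typing import Callable, List, Tuple


def submit_with_retries(
    try_start: Callable[[], bool], max_retries: int
) -> Tuple[str, List[str]]:
    """Model the default Start Failed handling for a single job.

    try_start is a hypothetical callable that returns True when the
    execution environment was set up and the job started running.
    Returns the final state and the list of exceptions raised.
    """
    raised: List[str] = []

    # One initial attempt plus up to max_retries retries.
    for _ in range(1 + max_retries):
        if try_start():
            return "Running", raised
        # Assumption: each failed attempt raises Start Failed.
        raised.append("Start Failed")

    # Retries are exhausted and the job still cannot run.
    raised.append("Cannot Run")
    return "Not run", raised
```

With `max_retries=2`, a job that never starts ends in the state `"Not run"` after three Start Failed exceptions and one Cannot Run exception. A job array would differ, because the table states that its submission is retried until it runs.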
User-specified conditions
In addition to the Process Manager exceptions, you can specify and handle other conditions, depending on the type of work item you are defining. For example, when you are defining a job, you can monitor the job for a particular exit code, and automatically rerun the job if the exit code occurs. The behavior when one of these conditions occurs depends on what you specify in the flow definition.
You can monitor for the following conditions in addition to the Process Manager exceptions:
Work Item | Condition |
---|---|
Flow | An exit code of n (sum of all exit codes) |
Flow | n unsuccessful jobs |
Subflow | An exit code of n |
Subflow | n unsuccessful jobs |
Job | An exit code of n |
Job array | An exit code of n |
Job array | n unsuccessful jobs |
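
The two flow-level conditions in the table above can be illustrated with a short sketch. It is not Process Manager code: the function names are hypothetical, and the sketch assumes that a job counts as unsuccessful when its exit code is nonzero and that each condition is met when the observed value equals n. The comparison you actually specify in a flow definition may differ.

```python
from typing import Dict, List


def flow_exit_code(job_exit_codes: Dict[str, int]) -> int:
    """Exit code of a flow, taken as the sum of all job exit codes."""
    return sum(job_exit_codes.values())


def unsuccessful_jobs(job_exit_codes: Dict[str, int]) -> int:
    """Number of unsuccessful jobs (assumption: nonzero exit code)."""
    return sum(1 for code in job_exit_codes.values() if code != 0)


def matched_conditions(
    job_exit_codes: Dict[str, int], exit_code_n: int, unsuccessful_n: int
) -> List[str]:
    """Return the user-specified conditions that the flow satisfies."""
    matched: List[str] = []
    # Assumption: a condition matches when the value equals n.
    if flow_exit_code(job_exit_codes) == exit_code_n:
        matched.append(f"An exit code of {exit_code_n}")
    if unsuccessful_jobs(job_exit_codes) == unsuccessful_n:
        matched.append(f"{unsuccessful_n} unsuccessful jobs")
    return matched
```

For a flow whose jobs exit with codes 0, 3, and 2, the flow exit code under this sketch is 5 and the number of unsuccessful jobs is 2.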