About job states
The bjobs command displays the current state of the job.
Normal job states
Most jobs enter only three states:
- PEND: Waiting in a queue for scheduling and dispatch
- RUN: Dispatched to a host and running
- DONE: Finished normally with a zero exit value
Suspended job states
A suspended job is in one of three states:
- PSUSP: Suspended by its owner or the LSF administrator while in PEND state
- USUSP: Suspended by its owner or the LSF administrator after being dispatched
- SSUSP: Suspended by the LSF system after being dispatched
A job goes through a series of state transitions until it eventually completes its task, fails, or is terminated. The possible states of a job during its life cycle are shown in the diagram.
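The transitions that this topic describes in prose can be sketched as a small lookup table. The following Python sketch is illustrative only: TRANSITIONS and can_transition are hypothetical names, the table covers only transitions mentioned in this topic (resume paths out of PSUSP and USUSP are left out), and "termination can happen from any state" is read here as "from any state that has not already finished".

```python
# Partial job state table, limited to transitions described in this topic.
TRANSITIONS = {
    "PEND": {"RUN", "PSUSP"},            # dispatched, or suspended while pending
    "RUN": {"DONE", "USUSP", "SSUSP"},   # finished normally, or suspended after dispatch
    "SSUSP": {"RUN"},                    # resumed by LSF when load drops or the run window reopens
    "PSUSP": set(),                      # resume path not covered in this sketch
    "USUSP": set(),                      # resume path not covered in this sketch
    "DONE": set(),
    "EXIT": set(),
}

FINISHED = {"DONE", "EXIT"}


def can_transition(current: str, target: str) -> bool:
    """Return True if the partial table allows moving from current to target."""
    if current not in TRANSITIONS or target not in TRANSITIONS:
        raise ValueError(f"unknown state: {current!r} or {target!r}")
    # Assumption: abnormal termination (EXIT) is possible from any unfinished state.
    if target == "EXIT":
        return current not in FINISHED
    return target in TRANSITIONS[current]
```

For example, can_transition("PEND", "RUN") is allowed, while can_transition("PEND", "DONE") is not, because a job must run before it can finish normally.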
A job remains pending until all conditions for its execution are met. Some of the conditions are:
- Start time that is specified by the user when the job is submitted
- Load conditions on qualified hosts
- Dispatch windows during which the queue can dispatch and qualified hosts can accept jobs
- Run windows during which jobs from the queue can run
- Limits on the number of job slots that are configured for a queue, a host, or a user
- Relative priority to other users and jobs
- Availability of the specified resources
- Job dependency and pre-execution conditions
Maximum pending job threshold
If the user or user group that submits the job has reached the pending job threshold or pending slot threshold specified by MAX_PEND_JOBS or MAX_PEND_SLOTS (either in the User section of lsb.users, or cluster-wide in lsb.params), LSF rejects any further job submission requests from that user or user group. The system resends the job submission request at the interval specified by SUB_TRY_INTERVAL in lsb.params until the number of attempts equals the value of the LSB_NTRIES environment variable. If LSB_NTRIES is undefined, the system by default resends rejected job submission requests indefinitely.
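The retry behavior can be pictured as a simple loop. This is a minimal sketch of the logic described above, not the LSF client implementation: submit_with_retry and its submit callable are hypothetical, try_interval plays the role of SUB_TRY_INTERVAL (assumed here to be in seconds), and ntries plays the role of LSB_NTRIES.

```python
import time
from typing import Callable, Optional


def submit_with_retry(
    submit: Callable[[], bool],
    try_interval: float,
    ntries: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call submit() until it is accepted and return the number of attempts made.

    submit       -- hypothetical callable that returns True when the request is accepted
    try_interval -- delay between attempts (the role of SUB_TRY_INTERVAL)
    ntries       -- maximum number of attempts (the role of LSB_NTRIES);
                    None means retry indefinitely
    Raises RuntimeError if every allowed attempt is rejected.
    """
    attempts = 0
    while True:
        attempts += 1
        if submit():
            return attempts
        if ntries is not None and attempts >= ntries:
            raise RuntimeError(f"submission rejected after {attempts} attempts")
        sleep(try_interval)
```

The sleep parameter is injectable only so that the loop can be exercised without real delays.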
Pending job eligibility for scheduling
A job that is in an eligible pending state is a job that LSF would normally select for resource allocation, but is currently pending because its priority is lower than other jobs. It is a job that is eligible for scheduling and will be run if there are sufficient resources to run it.
An ineligible pending job remains pending even if there are enough resources to run it and is therefore ineligible for scheduling. Reasons for a job to remain pending, and therefore be in an ineligible pending state, include the following:
- The job has a start time constraint (specified with the -b option)
- The job is suspended while pending (in a PSUSP state).
- The queue of the job is made inactive by the administrator or by its time window.
- The job's dependency conditions are not satisfied.
- The job cannot fit into the run time window (RUN_WINDOW)
- Delayed scheduling is enabled for the job (NEW_JOB_SCHED_DELAY is greater than zero)
- The job's queue or application profile does not exist.
A job that is not under any of the ineligible pending state conditions is treated as an eligible pending job. In addition, for chunk jobs in WAIT status, the time spent in the WAIT status is counted as eligible pending time.
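The classification rule above amounts to "ineligible if any listed condition holds, otherwise eligible". The following Python sketch models that rule. PendingJob and its boolean fields are hypothetical stand-ins for checks that LSF performs internally; they are not LSF data structures.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PendingJob:
    # Hypothetical flags, one per ineligible condition listed in this topic.
    state: str = "PEND"                     # "PEND", "PSUSP", or "WAIT"
    start_time_in_future: bool = False      # -b start time not yet reached
    queue_inactive: bool = False            # inactivated by administrator or time window
    dependencies_unsatisfied: bool = False
    outside_run_window: bool = False        # cannot fit into RUN_WINDOW
    sched_delay_active: bool = False        # NEW_JOB_SCHED_DELAY is greater than zero
    queue_or_profile_missing: bool = False


def ineligible_reasons(job: PendingJob) -> List[str]:
    """Return every ineligible condition that applies to the job."""
    checks = [
        (job.start_time_in_future, "start time constraint (-b)"),
        (job.state == "PSUSP", "suspended while pending (PSUSP)"),
        (job.queue_inactive, "queue is inactive"),
        (job.dependencies_unsatisfied, "dependency conditions not satisfied"),
        (job.outside_run_window, "cannot fit into run window"),
        (job.sched_delay_active, "delayed scheduling is enabled"),
        (job.queue_or_profile_missing, "queue or application profile does not exist"),
    ]
    return [reason for applies, reason in checks if applies]


def pending_state(job: PendingJob) -> str:
    """Classify a pending job; chunk job members in WAIT count as eligible."""
    return "ineligible" if ineligible_reasons(job) else "eligible"
```

A job with no flags set, or a chunk job member with state="WAIT", is classified as eligible; setting any single flag makes it ineligible.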
If TRACK_ELIGIBLE_PENDINFO in lsb.params is set to Y or y, LSF determines which pending jobs are eligible or ineligible for scheduling, and uses eligible pending time instead of total pending time to determine job priority for the following time-based scheduling policies:
- Automatic job priority escalation: Increases the priority only of jobs that have spent the specified period of time in an eligible pending state, rather than in any pending state.
- Absolute priority scheduling (APS): The JPRIORITY subfactor for the APS priority calculation uses the amount of time that the job spent in an eligible pending state instead of the total pending time.
In multicluster job forwarding mode, if the MC_SORT_BY_SUBMIT_TIME parameter is enabled in lsb.params, LSF counts all pending time before the job is forwarded as eligible for a forwarded job in the execution cluster.
In addition, the following LSF commands also display the eligible or ineligible pending information of jobs if TRACK_ELIGIBLE_PENDINFO is set to Y or y:
- bjobs -l shows the total amount of time that the job is in the eligible and ineligible pending states.
- bjobs -pei shows pending jobs divided into lists of eligible and ineligible pending jobs.
- bjobs -pe only shows eligible pending jobs.
- bjobs -pi only shows ineligible pending jobs.
- bjobs -o has the pendstate, ependtime, and ipendtime fields that you can specify to display jobs' pending state, eligible pending time, and ineligible pending time, respectively.
- bacct uses total pending time to calculate the wait time, turnaround time, expansion factor (turnaround time/run time), and hog factor (cpu time/turnaround time).
- bacct -E uses eligible pending time to calculate the wait time, turnaround time, expansion factor (turnaround time/run time), and hog factor (cpu time/turnaround time).
  If TRACK_ELIGIBLE_PENDINFO is disabled and LSF did not log any eligible or ineligible pending time, the ineligible pending time is zero for bacct -E.
- bhist -l shows the total amount of time that the job spent in the eligible and ineligible pending states after the job started.
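To see how the choice of pending time changes the bacct figures, the formulas quoted in the list above can be worked through in a few lines. This sketch is an illustration, not bacct's implementation: accounting_metrics is a hypothetical helper, and it assumes that wait time equals the chosen pending time and that turnaround time is pending time plus run time.

```python
from typing import Dict


def accounting_metrics(pend_time: float, run_time: float, cpu_time: float) -> Dict[str, float]:
    """Compute wait time, turnaround time, expansion factor, and hog factor.

    pend_time is either total pending time (as with bacct) or eligible
    pending time (as with bacct -E). All times use the same unit.
    Assumption: turnaround time = pending time + run time.
    """
    if run_time <= 0:
        raise ValueError("run_time must be positive")
    turnaround = pend_time + run_time
    return {
        "wait_time": pend_time,
        "turnaround_time": turnaround,
        "expansion_factor": turnaround / run_time,   # turnaround time / run time
        "hog_factor": cpu_time / turnaround,         # cpu time / turnaround time
    }


# Same job, measured with total pending time and with eligible pending time only.
total_view = accounting_metrics(pend_time=300, run_time=200, cpu_time=150)
eligible_view = accounting_metrics(pend_time=100, run_time=200, cpu_time=150)
```

For this example job, the expansion factor drops from 2.5 to 1.5 and the hog factor rises from 0.3 to 0.5 when only eligible pending time is counted.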
mbschd saves eligible and ineligible pending job data to disk every five minutes, which allows the eligible and ineligible pending information to be recovered when mbatchd restarts. To change this time interval, specify the ELIGIBLE_PENDINFO_SNAPSHOT_INTERVAL parameter, in minutes, in lsb.params. Because the information is recovered from a snapshot file that is written only at these intervals, some ineligible pending time may be lost when mbatchd restarts. The lost time period is counted as eligible pending time.
A job can be suspended at any time. A job can be suspended by its owner, by the LSF administrator, by the root user (superuser), or by LSF.
After a job is dispatched and started on a host, it can be suspended by LSF. When a job is running, LSF periodically checks the load level on the execution host. If any load index is beyond either its per-host or its per-queue suspending conditions, the lowest priority batch job on that host is suspended.
If the load on the execution host or hosts becomes too high, batch jobs might be interfering with each other or with interactive jobs. In either case, some jobs should be suspended to maximize host performance or to guarantee interactive response time.
LSF suspends jobs according to the priority of the job’s queue. When a host is busy, LSF suspends lower priority jobs first unless the scheduling policy associated with the job dictates otherwise.
Jobs are also suspended by the system if the job queue has a run window and the current time goes outside the run window.
A system-suspended job can later be resumed by LSF if the load condition on the execution hosts falls low enough or when the closed run window of the queue opens again.
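The load-based suspension decision described in this section can be sketched as follows. This is a simplified model, not the LSF scheduler: exceeded_indices and job_to_suspend are hypothetical names, the per-host and per-queue suspending conditions are assumed to be already merged into one thresholds mapping, "beyond" is modeled as "greater than" (which fits indices where a higher value means more load), and a larger queue priority number is assumed to mean higher priority.

```python
from typing import Dict, List, Optional, Tuple


def exceeded_indices(load: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return the names of load indices whose current value is above its threshold."""
    return sorted(
        name for name, limit in thresholds.items()
        if load.get(name, 0.0) > limit
    )


def job_to_suspend(
    load: Dict[str, float],
    thresholds: Dict[str, float],
    running_jobs: List[Tuple[str, int]],
) -> Optional[str]:
    """Pick the job to suspend on an overloaded host, or None if no action is needed.

    running_jobs holds (job_id, queue_priority) pairs for jobs on the host.
    """
    if not running_jobs or not exceeded_indices(load, thresholds):
        return None
    # The lowest priority job on the host is suspended first.
    return min(running_jobs, key=lambda job: job[1])[0]
```

With load {"r1m": 3.0} against a threshold of {"r1m": 2.0}, and jobs ("101", 40) and ("102", 20) on the host, the sketch selects job "102". It does not model scheduling policies that override the priority order.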
WAIT state (chunk jobs)
If you have configured chunk job queues, members of a chunk job that are waiting to run are displayed as WAIT by bjobs. Any jobs in WAIT status are included in the count of pending jobs by bqueues and busers, even though the entire chunk job has been dispatched and occupies a job slot. The bhosts command shows the single job slot occupied by the entire chunk job in the number of jobs shown in the NJOBS column.
You can switch (bswitch) or migrate (bmig) a chunk job member in WAIT state to another queue.
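The difference between the pending count and the slot count for a dispatched chunk job can be illustrated with a short sketch. chunk_job_view is a hypothetical helper that assumes the whole chunk job has already been dispatched; it mirrors the counting rule described above rather than any LSF interface.

```python
from typing import Dict, List


def chunk_job_view(member_states: List[str]) -> Dict[str, int]:
    """Summarize how a dispatched chunk job is counted.

    member_states lists the state of each chunk job member, for example
    ["RUN", "WAIT", "WAIT"]. Members in WAIT are counted as pending,
    while the chunk job as a whole occupies a single job slot.
    """
    waiting = sum(1 for state in member_states if state == "WAIT")
    return {
        "counted_as_pending": waiting,
        "job_slots_occupied": 1 if member_states else 0,
    }
```

For a chunk job with one running member and two members in WAIT, the sketch reports two jobs counted as pending and one occupied job slot.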
EXIT state
A job in EXIT state ended with a non-zero exit status.
A job might terminate abnormally for various reasons. Job termination can happen from any state. An abnormally terminated job goes into EXIT state. The situations where a job terminates abnormally include:
- The job is canceled by its owner or the LSF administrator while pending, or after being dispatched to a host.
- The job cannot be dispatched before it reaches the termination deadline that is set by bsub -t, so LSF terminates it.
- The job fails to start successfully. For example, the user specified the wrong executable when submitting the job.
- The application exits with a non-zero exit code.
You can configure hosts so that LSF detects an abnormally high rate of job exit from a host.
Some jobs may not be considered complete until some post-job processing is performed. For example, a job may need to exit from a post-execution job script, clean up job files, or transfer job output after the job completes.
The DONE or EXIT job states do not indicate whether post-processing is complete, so jobs that depend on post-processing may start prematurely. Use the post_done and post_err keywords on the bsub -w command to specify job dependency conditions for job post-processing. The corresponding job states POST_DONE and POST_ERR indicate the state of the post-processing.
After the job completes, you cannot perform any job control on the post-processing. Post-processing exit codes are not reported to LSF.
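A dependency on post-processing is expressed with the post_done or post_err keyword in the bsub -w condition. The following Python sketch only builds the argument list for such a submission. bsub_after_post_processing is a hypothetical helper, and the job ID and script name in the example are placeholders.

```python
from typing import List


def bsub_after_post_processing(
    job_id: int,
    command: List[str],
    on_error: bool = False,
) -> List[str]:
    """Build a bsub argument list that waits for another job's post-processing.

    on_error=False waits for post_done(job_id);
    on_error=True waits for post_err(job_id).
    """
    keyword = "post_err" if on_error else "post_done"
    return ["bsub", "-w", f"{keyword}({job_id})", *command]


# Placeholder job ID and script name, for illustration only.
argv = bsub_after_post_processing(1234, ["./next_step.sh"])
```

On a host where LSF is installed, the list could be passed to subprocess.run(argv, check=True). Passing the condition as its own list element avoids the shell quoting that the parentheses would otherwise need.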