Time-based service classes
Time-based service classes configure workload based on the number of jobs running at any one time. Goals for deadline, throughput, and velocity of jobs ensure that your jobs are completed on time and reduce the risk of missed deadlines.
Time-based SLA scheduling uses other, lower-level LSF policies, such as queues and host partitions, to satisfy the service-level goal that the service class expresses. The decisions of a time-based service class are considered before any queue or host partition decisions. Limits are still enforced with respect to lower-level scheduling objects such as queues, hosts, and users.
Optimum number of running jobs
As jobs are submitted, LSF determines the optimum number of job slots (or concurrently running jobs) needed for the time-based service class to meet its goals. LSF schedules a number of jobs at least equal to the optimum number of slots that are calculated for the service class.
LSF attempts to meet time-based goals in the most efficient way, using the optimum number of job slots so that other service classes or other types of work in the cluster can still progress. For example, in a time-based service class that defines a deadline goal, LSF spreads the work over the entire time window for the goal instead of allocating as many slots as possible at the beginning to finish earlier than the deadline. This avoids blocking other work.
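The idea of spreading work over the deadline window can be illustrated with a rough calculation. The sketch below is not LSF's internal formula, which this section does not describe; it assumes a hypothetical function `slots_needed`, uniform job run times, and no scheduling overhead, and simply divides the total outstanding work by the time remaining.

```python
import math


def slots_needed(pending_jobs: int, run_time_min: float, minutes_left: float) -> int:
    """Rough estimate of concurrent slots needed to finish by a deadline.

    Illustrative assumption only: total work divided by time remaining,
    capped at one slot per pending job.
    """
    if pending_jobs <= 0:
        return 0
    if minutes_left <= 0:
        # The window has closed, so every remaining job would have to run at once.
        return pending_jobs
    total_work = pending_jobs * run_time_min
    return min(pending_jobs, math.ceil(total_work / minutes_left))


# 100 jobs of 30 minutes each with 600 minutes left in the window
print(slots_needed(100, 30, 600))  # 5
```

Under these assumptions, five slots kept busy for the whole window finish the work on time, whereas starting all 100 jobs at once would finish early but occupy 100 slots that other work could have used.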
You should submit time-based SLA jobs with a run time limit at the job level (-W option), the application level (RUNLIMIT parameter in the application definition in lsb.applications), or the queue level (RUNLIMIT parameter in the queue definition in lsb.queues). You can also submit the job with a run time estimate defined at the application level (RUNTIME parameter in lsb.applications) instead of or in conjunction with the run time limit.
| If you specify… | And… | Then… |
|---|---|---|
| A run time limit and a run time estimate | The run time estimate is less than or equal to the run time limit | LSF uses the run time estimate to compute the optimum number of running jobs. |
| A run time limit | You do not specify a run time estimate, or the estimate is greater than the limit | LSF uses the run time limit to compute the optimum number of running jobs. |
| A run time estimate | You do not specify a run time limit | LSF uses the run time estimate to compute the optimum number of running jobs. |
| Neither a run time limit nor a run time estimate | | LSF automatically adjusts the optimum number of running jobs according to the observed run time of finished jobs. |
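The rules in the table above can be read as a small decision function. The following is a minimal sketch; the function name `run_time_basis` and its return labels are hypothetical and serve only to restate the table, not to show how LSF implements it.

```python
from typing import Optional, Tuple


def run_time_basis(limit: Optional[float],
                   estimate: Optional[float]) -> Tuple[str, Optional[float]]:
    """Return which value drives the optimum-running-jobs computation.

    `limit` and `estimate` are in the same time unit; None means "not specified".
    """
    if limit is not None and estimate is not None:
        # Both given: the estimate wins only if it does not exceed the limit.
        if estimate <= limit:
            return ("estimate", estimate)
        return ("limit", limit)
    if limit is not None:
        return ("limit", limit)
    if estimate is not None:
        return ("estimate", estimate)
    # Neither given: the table says LSF adapts to observed run times instead.
    return ("observed", None)


print(run_time_basis(60, 45))      # ('estimate', 45)
print(run_time_basis(60, 90))      # ('limit', 60)
print(run_time_basis(None, None))  # ('observed', None)
```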
Time-based service class priority
A higher priority value indicates a higher priority relative to other time-based service classes. As with queue priority, time-based service classes access cluster resources in priority order.
LSF schedules jobs from one time-based service class at a time, starting with the highest-priority service class. If multiple time-based service classes have the same priority, LSF runs the jobs from these service classes in the order the service classes are configured in lsb.serviceclasses.
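The ordering rule can be sketched as a stable sort: highest priority first, with ties keeping their configuration order. The function name `scheduling_order`, the priority values, and the service class name Kyuquot are hypothetical and used only for illustration.

```python
from typing import List, Tuple


def scheduling_order(service_classes: List[Tuple[str, int]]) -> List[str]:
    """Order service classes by descending priority.

    `service_classes` lists (name, priority) pairs in configuration order.
    Python's sorted() is stable, so classes with equal priority keep
    the order in which they were configured.
    """
    ordered = sorted(service_classes, key=lambda sc: -sc[1])
    return [name for name, _ in ordered]


configured = [("Tofino", 20), ("Quadra", 30), ("Kyuquot", 20)]
print(scheduling_order(configured))  # ['Quadra', 'Tofino', 'Kyuquot']
```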
Time-based service class priority in LSF is completely independent of the UNIX scheduler’s priority system for time-sharing processes. In LSF, the NICE parameter is used to set the UNIX time-sharing priority for batch jobs.
User groups for time-based service classes
You can control access to time-based SLAs by configuring a user group for the service class. If LSF user groups are specified in lsb.users, each user in the group can submit jobs to this service class. If a group contains a subgroup, the service class policy applies to each member in the subgroup recursively. The group can define fair share among its members, and the SLA defined by the service class enforces the fair share policy among the users in the user group configured for the SLA.
By default, all users in the cluster can submit jobs to the service class.
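The access rule, including recursive subgroup membership and the default of allowing all users, can be sketched as follows. This is an illustration only: the functions `expand_group` and `may_submit`, the dictionary layout, and the group and user names are all hypothetical and do not reflect how LSF stores user groups.

```python
from typing import Dict, List, Optional, Set


def expand_group(group: str, groups: Dict[str, List[str]],
                 _seen: Optional[Set[str]] = None) -> Set[str]:
    """Return all users in `group`, descending into subgroups recursively."""
    seen = set() if _seen is None else _seen
    if group in seen:
        return set()  # guard against groups that reference each other
    seen.add(group)
    users: Set[str] = set()
    for member in groups.get(group, []):
        if member in groups:
            users |= expand_group(member, groups, seen)  # member is a subgroup
        else:
            users.add(member)
    return users


def may_submit(user: str, sla_group: Optional[str],
               groups: Dict[str, List[str]]) -> bool:
    """With no user group configured, every user may submit to the service class."""
    if sla_group is None:
        return True
    return user in expand_group(sla_group, groups)


groups = {"eng": ["alice", "qa"], "qa": ["bob", "carol"]}
print(may_submit("bob", "eng", groups))   # True
print(may_submit("dave", "eng", groups))  # False
print(may_submit("dave", None, groups))   # True
```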
Time-based SLA limitations
- Multicluster
Multicluster does not support time-based SLAs.
- Preemption
Time-based SLA jobs cannot be preempted. You should avoid running jobs belonging to an SLA in low priority queues.
- Chunk jobs
SLA jobs are not chunked. You should avoid submitting SLA jobs to a chunk job queue.
- Resizable jobs
For resizable job allocation requests, since the job itself has already started to run, LSF bypasses dispatch rate checking and continues scheduling the allocation request.
Time-based SLA statistics files
Each time-based SLA goal generates a statistics file for monitoring and analyzing the system. When the goal becomes inactive, the file is no longer updated. Files are created in the LSB_SHAREDIR/cluster_name/logdir/SLA directory. Each file name consists of the name of the service class and the goal type.
For example, the file named Quadra.deadline is created for the deadline goal of the service class named Quadra. The following file, named Tofino.velocity, refers to a velocity goal of the service class named Tofino:
cat Tofino.velocity
# service class Tofino velocity, NJOBS, NPEND (NRUN + NSSUSP + NUSUSP), (NDONE + NEXIT)
17/9 15:7:34 1063782454 2 0 0 0 0
17/9 15:8:34 1063782514 2 0 0 0 0
17/9 15:9:34 1063782574 2 0 0 0 0
# service class Tofino velocity, NJOBS, NPEND (NRUN + NSSUSP + NUSUSP), (NDONE + NEXIT)
17/9 15:10:10 1063782610 2 0 0 0 0
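For monitoring scripts, a statistics file like the one above can be parsed line by line. The sketch below assumes a layout inferred from the sample rather than from a documented format: each data line holds a date, a time, an epoch timestamp, and five integers that map, in header order, to the goal value, NJOBS, NPEND, (NRUN + NSSUSP + NUSUSP), and (NDONE + NEXIT). The function name `parse_sla_stats` and the field names are hypothetical.

```python
from typing import Dict, List

# Assumed column order for the five trailing integers (inferred from the header line).
FIELDS = ["goal_value", "njobs", "npend", "active", "finished"]


def parse_sla_stats(text: str) -> List[Dict[str, object]]:
    """Parse the data lines of an SLA statistics file, skipping '#' header lines."""
    records: List[Dict[str, object]] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) != 3 + len(FIELDS):
            continue  # ignore lines that do not match the assumed layout
        date, clock, epoch, *counts = parts
        record: Dict[str, object] = {"date": date, "time": clock, "epoch": int(epoch)}
        record.update(zip(FIELDS, map(int, counts)))
        records.append(record)
    return records


sample = """\
# service class Tofino velocity, NJOBS, NPEND (NRUN + NSSUSP + NUSUSP), (NDONE + NEXIT)
17/9 15:7:34 1063782454 2 0 0 0 0
17/9 15:8:34 1063782514 2 0 0 0 0
17/9 15:9:34 1063782574 2 0 0 0 0
# service class Tofino velocity, NJOBS, NPEND (NRUN + NSSUSP + NUSUSP), (NDONE + NEXIT)
17/9 15:10:10 1063782610 2 0 0 0 0
"""
records = parse_sla_stats(sample)
print(len(records))         # 4
print(records[0]["epoch"])  # 1063782454
```

Lines that do not match the assumed layout are skipped rather than raising an error, so the sketch tolerates header lines and partial writes.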