Managing floating software licenses in LSF
Typically, a pool of floating software licenses is represented by numeric resources in LSF. Every job that requires licenses must include the license requirement in its rusage expression to ensure that enough licenses are free for the job at the time that the job is dispatched.
About this task
There are three recommended approaches to represent a pool of licenses as LSF resources:
- Manage licenses as static numeric shared resources
- Manage licenses as dynamic numeric shared resources
- Use LSF License Scheduler Basic Edition to manage license resources
Manage licenses as static numeric shared resources
Manually encode the total number of licenses that are available to jobs in the cluster in the LSF configuration.
About this task
For each job that requires licenses, specify the requirement in its rusage expression without specifying a duration. This configures LSF to reserve licenses for the job for as long as it is running.
Do not use this approach if there are multiple clusters that share a common pool of licenses. One cluster does not know the number of licenses that are checked out or reserved for jobs in other clusters, which can lead to license checkout failures.
Procedure
What to do next
One disadvantage of using a static resource to represent a license is that you must reconfigure LSF whenever the total number of licenses changes. To avoid this issue, write an elim that polls license servers to determine the total licenses in the pool (that is, the number of licenses currently available plus the number in use).
For more information on writing an elim, refer to External load indices.
Using a static resource to represent a license only approximates the real license availability. If an application checks out a license outside of an LSF job, or an LSF job checks out more licenses than specified in its rusage, LSF might dispatch a job that requires licenses even though no licenses are available. As a result, the job either fails or occupies allocated compute resources while waiting for licenses to become available.
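The gap between LSF's static bookkeeping and the real state of the license server can be illustrated with a small Python model. This is a sketch only: the class, the function, and the pool size of two licenses are hypothetical names and numbers chosen for illustration, and are not part of LSF.

```python
from dataclasses import dataclass


@dataclass
class StaticLicenseResource:
    """Simplified, hypothetical model of LSF's view of a static license resource."""

    configured_total: int  # total encoded manually in the LSF configuration
    reserved: int = 0      # units reserved through rusage for running jobs

    def free_according_to_lsf(self) -> int:
        # LSF knows only the configured total and its own reservations.
        return self.configured_total - self.reserved

    def dispatch(self, units: int) -> bool:
        """Reserve units for a job if LSF believes that enough are free."""
        if units > self.free_according_to_lsf():
            return False
        self.reserved += units
        return True


def free_on_license_server(total: int, used_by_jobs: int, used_outside_lsf: int) -> int:
    """Licenses that are really free, as the license server would count them."""
    return total - used_by_jobs - used_outside_lsf


pool = StaticLicenseResource(configured_total=2)
pool.dispatch(1)  # an LSF job reserves and checks out one license

# Meanwhile, an application outside of LSF checks out the other license.
print("free according to LSF:", pool.free_according_to_lsf())
print("free on license server:", free_on_license_server(2, used_by_jobs=1, used_outside_lsf=1))
```

In this model, LSF still believes that one license is free, so it would dispatch a second job that requests one license even though the license server has none left.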
Manage licenses as dynamic numeric shared resources
An alternative to the static approach is to write an elim that periodically collects the availability of free licenses from license servers, and reports this number to LSF.
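As a rough illustration, the core of such an elim might look like the following Python sketch. An elim reports values by writing a line to standard output that contains the number of indices followed by name and value pairs. Everything else here is an assumption: the license server status text is assumed to contain one FlexNet-style summary line per feature, the feature name is assumed to match the LSF resource name, and the function names are hypothetical.

```python
import re
import time
from typing import Callable, Dict

# Assumption: the license server status tool prints one summary line per
# feature, for example:
#   Users of feature_a:  (Total of 2 licenses issued;  Total of 1 license in use)
SUMMARY_LINE = re.compile(
    r"Users of (?P<feature>\S+):\s+\(Total of (?P<issued>\d+) licenses? issued;"
    r"\s+Total of (?P<used>\d+) licenses? in use\)"
)


def parse_free_licenses(status_text: str) -> Dict[str, int]:
    """Return the number of free licenses (issued minus in use) per feature."""
    free = {}
    for match in SUMMARY_LINE.finditer(status_text):
        free[match.group("feature")] = int(match.group("issued")) - int(match.group("used"))
    return free


def format_elim_report(indices: Dict[str, int]) -> str:
    """Build one elim report line: the index count, then name and value pairs."""
    parts = [str(len(indices))]
    for name, value in indices.items():
        parts += [name, str(value)]
    return " ".join(parts)


def run_elim(read_status: Callable[[], str], interval_seconds: float, cycles: int) -> None:
    """Poll the license server and report to LSF on standard output.

    A production elim would loop indefinitely; `cycles` keeps this sketch finite.
    `read_status` is a hypothetical callable that returns the status text.
    """
    for _ in range(cycles):
        print(format_elim_report(parse_free_licenses(read_status())), flush=True)
        time.sleep(interval_seconds)


sample = "Users of feature_a:  (Total of 2 licenses issued;  Total of 1 license in use)"
print(format_elim_report(parse_free_licenses(sample)))  # prints "1 feature_a 1"
```

Passing the status reader in as a callable keeps the parsing logic independent of how a particular site queries its license servers.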
About this task
As with the static approach, a job that requires the license requests it in the job rusage expression. However, you should also specify a duration (generally a few minutes) for the license rusage. LSF reserves licenses for the job for this specified duration, starting from the time that the job is first dispatched. Specify a duration that is long enough for the job to check out the license from the license server before the duration expires.
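A submission wrapper could assemble the rusage string as in the following minimal Python sketch. The resource name feature_a, the job command ./my_app, and the helper names are hypothetical. The sketch assumes the bsub -R "rusage[name=value:duration=N]" form, in which the duration is interpreted in minutes by default.

```python
from typing import List, Optional


def license_rusage(resource: str, units: int, duration_minutes: Optional[int] = None) -> str:
    """Build an rusage string for a license resource.

    Omit the duration for a static resource, so that the reservation lasts
    for the whole job. Give a duration for a dynamic, elim-reported resource.
    """
    if units < 1:
        raise ValueError("units must be at least 1")
    spec = f"{resource}={units}"
    if duration_minutes is not None:
        spec += f":duration={duration_minutes}"
    return f"rusage[{spec}]"


def bsub_command(resource: str, units: int, duration_minutes: Optional[int],
                 job_command: List[str]) -> List[str]:
    """Return an argument list that is suitable for subprocess.run()."""
    return ["bsub", "-R", license_rusage(resource, units, duration_minutes), *job_command]


print(license_rusage("feature_a", 1, duration_minutes=5))  # prints "rusage[feature_a=1:duration=5]"
print(bsub_command("feature_a", 1, 5, ["./my_app"]))
```

Returning an argument list rather than a single shell string avoids quoting problems with the square brackets in the rusage expression.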
In the following example, there are two units of a license feature. You want to run a job that requires one of them. The job submission and license checkout process is as follows:
- Submit the job to LSF, specifying one unit of the license with a duration in the job’s rusage expression.
- The elim reports to LSF that there are two licenses free.
- LSF dispatches the job, and internally reserves one license for the job, to avoid dispatching another job that will contend for the license.
- The job checks out one license from the license server.
- Upon polling the license server, the elim detects that there is now only one license free.
- The rusage duration expires, and LSF stops reserving a license for the job.
At this point, the elim reports only one license available. When the job checks the license back in and the elim detects that the license is free, the license can then be used by another job.
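The sequence above can be traced with a short Python sketch. It assumes a simplified model in which the number of licenses that LSF can use for further dispatch is the value reported by the elim minus the number that LSF currently reserves. The function name and the event labels are hypothetical.

```python
from typing import List, Tuple


def dispatchable(elim_reported_free: int, lsf_reserved: int) -> int:
    """Licenses that LSF can still hand to pending jobs in this simplified model."""
    return max(0, elim_reported_free - lsf_reserved)


# (event, free licenses reported by the elim, licenses reserved by LSF)
timeline: List[Tuple[str, int, int]] = [
    ("elim reports two free licenses", 2, 0),
    ("LSF dispatches the job and reserves one", 2, 1),
    ("job checks out a license, elim not yet polled", 2, 1),
    ("elim polls and sees one free license", 1, 1),
    ("rusage duration expires", 1, 0),
    ("job checks the license in, elim polls", 2, 0),
]

for event, reported, reserved in timeline:
    print(f"{event:<48} reported={reported} reserved={reserved} "
          f"dispatchable={dispatchable(reported, reserved)}")
```

In this model, the fourth row is the interval in which the license is both checked out and reserved, so no license is left for other jobs until the duration expires.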
Procedure
What to do next
If you have multiple clusters sharing a common pool of licenses, each cluster should have its own independent elim that collects license availability from the pool. Generally, the dynamic approach results in fewer license checkout failures than the static approach. This is because once a job in one cluster has checked out a license, the checkout is reflected in the reduced resource availability that is reported by the elims in the other clusters.
You must set an appropriate rusage duration. If it is too short, then LSF stops reserving before the job has checked out a license. The result is that there might not be a license available for the job when it does try to check out.
If the duration is too long, then there is a significant period of time during which the job has a license checked out while LSF still reserves a license for the job. In effect, the job occupies two licenses (one checked out, and one reserved) even though it needs only one.
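The trade-off can be expressed as simple arithmetic. The following Python sketch assumes that the job checks out its license a fixed number of minutes after dispatch and ignores the elim polling delay. The function and key names are hypothetical.

```python
from typing import Dict


def duration_effects(checkout_delay_min: float, duration_min: float) -> Dict[str, float]:
    """Estimate the cost of a given rusage duration for one job.

    unprotected_minutes: time after the reservation ends but before the job
        has checked out its license (duration too short).
    double_counted_minutes: time in which the license is both checked out
        and still reserved by LSF (duration too long).
    """
    return {
        "unprotected_minutes": float(max(0.0, checkout_delay_min - duration_min)),
        "double_counted_minutes": float(max(0.0, duration_min - checkout_delay_min)),
    }


# A job that checks out its license 3 minutes after dispatch:
print(duration_effects(checkout_delay_min=3, duration_min=1))
print(duration_effects(checkout_delay_min=3, duration_min=30))
```

With a 1-minute duration the license is unprotected for 2 minutes. With a 30-minute duration it is counted twice for 27 minutes. A duration slightly longer than the expected checkout delay keeps both values small.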
Use LSF License Scheduler Basic Edition to manage license resources
With LSF License Scheduler you do not need to write and manage an elim.
About this task
Jobs do not need a duration for the license in the rusage expression. LSF License Scheduler can track how a job checks out licenses, which helps jobs avoid overusing licenses and occupying double the required resources. License resource availability is automatically adjusted based on the real license availability, just like an elim.
LSF License Scheduler periodically gets license usage (including total licenses, available licenses, and licenses in use) from the license server. It tracks the license use of individual jobs by matching license checkouts to jobs. From this, it determines how many licenses are used outside of LSF jobs and how many licenses can be used by the LSF cluster, which is the sum of the free licenses and the licenses that running LSF jobs use. LSF periodically communicates with LSF License Scheduler to get the number of licenses that are available to the LSF cluster, then reserves licenses for running jobs according to their rusage expressions; the remaining free licenses are used to dispatch pending jobs. Between these periodic connections to LSF License Scheduler, licenses that are released by finished jobs can be used to dispatch pending jobs immediately.
In the following example, there are three units of a license feature. One unit is used outside of LSF jobs. One unit is used by an LSF job. The job submission and license checkout process is as follows:
- LSF License Scheduler queries the license server and finds three units of a license feature.
- LSF connects to LSF License Scheduler and finds out that there are three units available.
- The user named user1 submits a job to LSF on host1, specifying one unit of the license in the rusage expression. After the job starts running, LSF informs LSF License Scheduler of this running job.
- At the same time, outside of LSF, the user named user2 checks out one unit of the license on host2.
- LSF License Scheduler gets the latest license usage from the license server. It finds two license checkouts and one free unit. The user@host of one license checkout matches the running job, so LSF License Scheduler determines that the job made this checkout. As a result, one unit is considered to be used outside of LSF jobs, one unit is considered to be used by the LSF job, and one unit is free. Therefore, two units are available to the LSF cluster.
- LSF gets the latest license resource values from LSF License Scheduler. The new available value is 2. LSF then reserves one license for the running job. The one free unit of the license can be used by other jobs.
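The accounting in this example can be reproduced with a short Python sketch. It is a simplified model rather than the LSF License Scheduler implementation: the function name and dictionary keys are hypothetical, and a checkout is attributed to LSF whenever its user and host pair matches a running job.

```python
from typing import Dict, List, Tuple

UserHost = Tuple[str, str]  # (user, host) pair


def cluster_license_view(total: int, checkouts: List[UserHost],
                         running_jobs: List[UserHost]) -> Dict[str, int]:
    """Classify license checkouts by matching user and host pairs to running jobs."""
    job_pairs = set(running_jobs)
    used_by_jobs = sum(1 for pair in checkouts if pair in job_pairs)
    used_outside_lsf = len(checkouts) - used_by_jobs
    free = total - len(checkouts)
    return {
        "used_by_jobs": used_by_jobs,
        "used_outside_lsf": used_outside_lsf,
        "free": free,
        # Free licenses plus the licenses that running LSF jobs already hold.
        "available_to_cluster": free + used_by_jobs,
    }


view = cluster_license_view(
    total=3,
    checkouts=[("user1", "host1"), ("user2", "host2")],
    running_jobs=[("user1", "host1")],
)
reserved_for_running_jobs = 1  # from the rusage expression of the job of user1
free_for_pending_jobs = view["available_to_cluster"] - reserved_for_running_jobs

print(view)
print("free for pending jobs:", free_for_pending_jobs)
```

The result matches the example: two units are available to the cluster, one is reserved for the running job, and one remains for pending jobs.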
When LSF License Scheduler determines which licenses are used by a job, it considers only whether a license checkout has the same user and host pair as the job; if so, the licenses are deemed to be consumed by the job. When multiple license checkouts share the same user@host, LSF License Scheduler cannot determine the exact relationship between checkouts and jobs because it does not know which license checkouts are associated with which job processes. This means that blusers -J does not always accurately reflect whether a job has actually checked out licenses. However, this does not affect the accuracy of LSF job dispatch.
Procedure
What to do next
LSF License Scheduler Basic Edition limits license use of jobs to cluster mode features, which are distributed to a single cluster with one service domain per license feature. It is not intended to apply policies on how licenses are shared between clusters or projects. If you have more advanced license management requirements, you can upgrade to LSF License Scheduler Standard Edition, which has full LSF License Scheduler functionality, including support for all modes (cluster mode and project mode), sharing licenses across multiple clusters, sharing licenses between project teams and departments around the globe, scheduling licenses within extremely large clusters, license preemption across departments, feature groups, project groups, multiple service domains per license, and taskman jobs. For details about the features of LSF License Scheduler Standard Edition, see IBM Spectrum LSF License Scheduler.