Clusters, jobs, and queues

IBM Spectrum LSF ("LSF", short for load sharing facility) is enterprise-class software that distributes work across existing heterogeneous IT resources. It creates a shared, scalable, and fault-tolerant infrastructure that delivers faster, more balanced, and more reliable workload performance at reduced cost.

A typical LSF environment

Cluster

A group of computers (hosts) running LSF that work together as a single unit, combining computing power and sharing workload and resources. A cluster provides a single-system image for a network of computing resources.

Hosts can be grouped into clusters in a number of ways. A cluster could contain:

  • All the hosts in a single administrative group

  • All the hosts on one file server or sub-network

  • Hosts that perform similar functions

Commands

  • lshosts — View static resource information about hosts in the cluster

  • bhosts — View resource and job information about server hosts in the cluster

  • lsid — View the cluster name

  • lsclusters — View cluster status and size

Configuration

  • Define hosts in your cluster in lsf.cluster.cluster_name

    Tip:

    The name of your cluster should be unique. It should not be the same as any host or queue.

Job

A unit of work that runs in the LSF system. A job is a command that is submitted to LSF for execution. LSF schedules, controls, and tracks the job according to configured policies.

Jobs can be complex problems, simulation scenarios, extensive calculations, or anything else that needs compute power.

Commands

  • bjobs — View jobs in the system

  • bsub — Submit jobs

Job slot

A job slot is a bucket into which a single unit of work is assigned in the LSF system. Hosts are configured with a number of available job slots, and queues dispatch jobs to fill those slots.

Commands

  • bhosts — View job slot limits for hosts and host groups

  • bqueues — View job slot limits for queues

  • busers — View job slot limits for users and user groups

Configuration

  • Define job slot limits in lsb.resources

Job states

LSF jobs have the following states:

  • PEND — Waiting in a queue for scheduling and dispatch

  • RUN — Dispatched to a host and running

  • DONE — Finished normally with zero exit value

  • EXIT — Finished with non-zero exit value

  • PSUSP — Suspended while pending

  • USUSP — Suspended by user

  • SSUSP — Suspended by the LSF system

  • POST_DONE — Post-processing is completed without errors

  • POST_ERR — Post-processing is completed with errors

  • WAIT — Members of a chunk job that are waiting to run
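When a script handles job state strings, it can help to keep the states listed above in one lookup. The following is a minimal Python sketch under stated assumptions, not part of LSF: the `JobState` enum, the `FINISHED` and `SUSPENDED` groupings, and the two helper functions are hypothetical names introduced here for illustration. The state names and descriptions are taken from the list above.

```python
from enum import Enum


class JobState(Enum):
    """LSF job states, with the descriptions from the list above."""
    PEND = "Waiting in a queue for scheduling and dispatch"
    RUN = "Dispatched to a host and running"
    DONE = "Finished normally with zero exit value"
    EXIT = "Finished with non-zero exit value"
    PSUSP = "Suspended while pending"
    USUSP = "Suspended by user"
    SSUSP = "Suspended by the LSF system"
    POST_DONE = "Post-processing is completed without errors"
    POST_ERR = "Post-processing is completed with errors"
    WAIT = "Members of a chunk job that are waiting to run"


# Hypothetical groupings for illustration; LSF does not define these names.
FINISHED = {JobState.DONE, JobState.EXIT}
SUSPENDED = {JobState.PSUSP, JobState.USUSP, JobState.SSUSP}


def is_finished(state_name: str) -> bool:
    """Return True if the named state is DONE or EXIT."""
    return JobState[state_name] in FINISHED


def is_suspended(state_name: str) -> bool:
    """Return True if the named state is one of the three suspended states."""
    return JobState[state_name] in SUSPENDED
```

For example, `is_finished("EXIT")` is true because the job has ended, even though it ended with a non-zero exit value, while `is_suspended("PEND")` is false.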

Queue

A cluster-wide container for jobs. All jobs wait in queues until they are scheduled and dispatched to hosts.

Queues do not correspond to individual hosts; each queue can use all server hosts in the cluster, or a configured subset of the server hosts.

When you submit a job to a queue, you do not need to specify an execution host. LSF dispatches the job to the best available execution host in the cluster to run that job.

Queues implement different job scheduling and control policies.

Commands

  • bqueues — View available queues

  • bsub -q — Submit a job to a specific queue

  • bparams — View default queues
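The submission commands above can also be driven from a script. The following is a minimal Python sketch, assuming LSF is installed and `bsub` is on the PATH. The helper names `build_bsub_command` and `submit`, the queue name `normal`, and the job command `sleep 60` are hypothetical and used only for illustration. Only `bsub` and its `-q` option come from the list above.

```python
import subprocess
from typing import List, Optional


def build_bsub_command(job_command: List[str], queue: Optional[str] = None) -> List[str]:
    """Build the argument list for bsub, adding -q only when a queue is named."""
    args = ["bsub"]
    if queue is not None:
        args += ["-q", queue]
    return args + job_command


def submit(job_command: List[str], queue: Optional[str] = None) -> str:
    """Run bsub and return the text it prints. Requires a working LSF installation."""
    result = subprocess.run(
        build_bsub_command(job_command, queue),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if bsub reports a failure
    )
    return result.stdout


# Hypothetical usage: submit "sleep 60" to a queue named "normal".
# build_bsub_command(["sleep", "60"], queue="normal")
# returns ["bsub", "-q", "normal", "sleep", "60"]
```

Note that no execution host appears anywhere in the command. When `queue` is omitted, the sketch passes no `-q` option, so the job goes to whichever default queue is configured.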

Configuration

  • Define queues in lsb.queues

    Tip:

    The names of your queues should be unique. They should not be the same as the cluster name or any host in the cluster.

First-come, first-served scheduling (FCFS)

The default type of scheduling in LSF. Jobs are considered for dispatch based on their order in the queue.
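To make the ordering concrete, here is a deliberately simplified Python model of first-come, first-served dispatch into job slots. It is a sketch under stated assumptions and not LSF's actual scheduler: every job occupies exactly one slot, the first host with a free slot is chosen instead of the "best" host, and the function name `dispatch_fcfs` and the host and job names are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


def dispatch_fcfs(pending: Deque[str], free_slots: Dict[str, int]) -> List[Tuple[str, str]]:
    """Dispatch jobs from the front of the queue while any host has a free slot.

    pending: job names in submission order (front is the oldest job).
    free_slots: host name -> number of free job slots; updated in place.
    Returns (job, host) pairs in dispatch order.
    """
    dispatched = []
    while pending:
        # Simplification: take the first host with a free slot.
        host = next((h for h, n in free_slots.items() if n > 0), None)
        if host is None:
            break  # no free slots left; remaining jobs stay pending, in order
        job = pending.popleft()  # the oldest job is always considered first
        free_slots[host] -= 1
        dispatched.append((job, host))
    return dispatched


# Hypothetical example: three jobs, two hosts with one slot each.
queue = deque(["job1", "job2", "job3"])
slots = {"hostA": 1, "hostB": 1}
result = dispatch_fcfs(queue, slots)
# result is [("job1", "hostA"), ("job2", "hostB")]; job3 remains in the queue
```

In this model a later job is never dispatched ahead of an earlier one: `job3` waits until a slot is freed, even though it was submitted to the same queue.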