Skip to main content

skip to main content

developerWorks  >  Autonomic computing | Tivoli | Grid computing  >

Manage grid jobs with IBM Tivoli Workload Scheduler LoadLeveler

How to use this advanced scheduling system in your grid

developerWorks
Document options

Document options requiring JavaScript are not displayed


Rate this page

Help us improve this content


Level: Introductory

Jeff Mausolf (mausolf@us.ibm.com), IT Architect, IBM Global Services, Global Technology Center, Grid Computing Initiative

18 Jul 2006

IBM® Tivoli® Workload Scheduler LoadLeveler® is an advanced scheduling system for AIX® and Linux®. This article provides an overview of the product and demonstrates how to submit, monitor, and control jobs in the LoadLeveler environment.

Introduction

The IBM Tivoli Workload Scheduler (TWS) LoadLeveler is a job scheduling system developed to maximize resource utilization and throughput for large computational jobs. It was initially developed for AIX, but is now also available for Linux. TWS LoadLeveler schedules serial or parallel jobs based on priority, resource requirements, and resource availability. It handles the management, execution, and accounting details associated with running jobs across a distributed set of resources.

This article provides an overview of the TWS LoadLeveler system, and the concepts associated with creating and running jobs in this environment. We'll develop a sample job to illustrate job creation, submission, and management.



Back to top


How does TWS LoadLeveler work?

A TWS LoadLeveler pool is a collection of machines that coordinate to provide a high-throughput computing environment. The pool can be thought of as a distributed computing cluster that consists of a scheduler, a central manager, and multiple compute nodes. When users submit jobs, the jobs are added to the job queue, which is managed by the scheduler.

The scheduler assigns a unique identifier to each job and notifies the central manager that there is a new job that needs resources.

The central manger checks for available resources that might be able to handle the job. Appropriate resources are located by inspecting the job requirements and checking the available resources to see if they can satisfy the job requirements. When the central manager locates a resource that can satisfy the job requirements, it notifies the scheduler. The scheduler then contacts the selected resource and asks it to run the job.

The compute resource runs the job by forking a child process that executes under the user ID of the job submitter. When the job is started, the compute resource notifies the central manager that the job is running. When the job completes, LoadLeveler generates accounting information about the resources consumed by the job.



Back to top


The processes that compose TWS LoadLeveler

There are many processes that coordinate to provide this high-throughput computing environment. These processes run on the scheduler, central manager, or compute nodes. Following is a summary of the processes that make up the TWS LoadLeveler environment.

Master

The master process runs on every machine in the TWS LoadLeveler pool. It reads local configuration information, then starts the other processes that should be running on the machine. The processes are different, depending on whether the machine is configured as a compute node, a scheduler, or the central manager. After starting the appropriate processes, the master monitors them. If a process fails, the master restarts it. If it has to restart a process, the master sends an e-mail to the administrator about the problem and sends information related to the failed process, such as the termination status and the log file.

Scheduler

The scheduler manages the job queue. It receives requests from users to run jobs and assigns a unique identifier to each new job. It informs the central manager that there is a job request, and when the central manager has located the appropriate resource(s) for the job, the scheduler spawns a shadow process to handle it. The shadow process monitors the job running on the compute resource and notifies the scheduler when it completes. The scheduler forwards this information on to the central manager.

Shadow

The shadow process is started by the scheduler when the central manager has selected a resource for the job. The shadow process contacts the startd process on the compute resource that was selected to run the job. The startd then spawns a starter process to manage the job execution. The shadow process sends the executable and associated job information to this starter process. When the job has completed, the shadow receives the job exit status from the starter process, returns that to the scheduler, and exits.

startd

The compute nodes run a start process called startd. When it receives requests from the scheduler to run jobs, the startd spawns a starter process and monitors the job, along with the resources on the machine. The startd forwards this information on to the central manager so it can be considered when matching other jobs. When the jobs complete, the startd gathers information about the state of the machine and the resources consumed by the jobs. This information is sent to the central manager for accounting purposes.

Starter

The starter process is spawned by the startd process to run a job. It contacts the shadow process on the scheduling machine to get the executable and other associated information related to the job. It runs the job by forking a child process that executes under the user ID of the job submitter, then notifies the central manager that the job state is now running. When the child process finishes running the job, the starter process returns the job exit status to the shadow, which forwards the information to the scheduler and exits.

Central manager

The central manager knows about all the resources in the pool. It maintains status information about jobs and machines. The machine information includes a list of attributes that describe the resources it has available. When the central manager receives notification from the scheduler that there is a new job, it examines machine resource attributes to determine which machine can satisfy the job requirements. When an appropriate resource is located, the central manager sets the job state to pending and asks the scheduler to run the job on that resource.



Back to top


TWS LoadLeveler users view

You should now have a basic understanding of how the machines and processes in a TWS LoadLeveler pool coordinate to provide a high-throughput computing environment.



Back to top


TWS LoadLeveler job classes

TWS LoadLeveler uses classes to schedule jobs on resources. A job class defines the characteristics associated with a type of job, such as the priority, run limits, and resources. Administrators create job classes and supply a meaningful name for the class so users can easily associate their jobs with the appropriate job classes. The TWS LoadLeveler default job classes include small, medium, large, very long, parallel, or interactive. Additional job classes can be created by the administrator.

Resource owners or administrators determine which job classes can run on a machine. They specify the job classes the machine will accept and the number of jobs of each class allowed to run on the resource. The following example illustrates a sample configuration that specifies that eight small jobs, five medium jobs, and two large jobs can run on the machine.

CLASS = small(8) medium(5) large(2)



Back to top


TWS LoadLeveler consumable resources

The attributes associated with each machine, such as CPU, memory, and disk space, are considered consumable resources. These resources are consumable in the sense that a job will use a specified amount of the resource and release it when it completes. Consumable resources can be specified for a specific machine or for the entire cluster.

Machine resources include things like CPU, memory, and disk space available on a machine. Floating resources are available across all the machines in the cluster -- a floating software license, for example. Administrators specify the machine and floating consumable resources in a pool. The central manager considers these resources when matching jobs to resources. Users specify the resources required by a job when they submit. The central manager looks at the consumable resources required for the job and checks available consumable resources on machines to determine where a job can run.

When a job is scheduled to run on a machine, the central manager decreases the consumable resources on that machine by the amount required for the job. When the job completes, the consumable resources are increased by the same amount.



Back to top


TWS LoadLeveler schedulers

TWS LoadLeveler has different scheduling algorithms that can be used, depending on the types of jobs that need to be run. Administrators select and specify the scheduling algorithm to use in the TWS LoadLeveler administration configuration file. A default, backfill, and gang scheduler are provided with TWS LoadLeveler.

Default scheduler

The default scheduler runs jobs by scheduling them on idle resources. The default scheduler starts, suspends, and resumes jobs based on workload. It handles serial jobs extremely well, but can schedule parallel jobs, using reservations. When scheduling parallel jobs, nodes are reserved as they become available. The reserved nodes remain idle until enough nodes are available to run the parallel job. Since this process leaves reserved nodes idle until enough are accumulated, it can result in lower overall utilization when scheduling large parallel jobs.

Backfill scheduler

The backfill scheduler schedules large parallel jobs across blocks of resources, then looks for smaller jobs to fill in the gaps. The back-filling algorithm runs shorter jobs that require a small number of resources, while larger jobs are waiting for enough nodes to run. Shorter jobs can run on the idle resources as long as they do not delay the start of the larger jobs.

The scheduler determines if a shorter job can run by comparing the projected start time of a large job with the wall_clock_limit of a smaller job. If the smaller job will complete before the larger job's predicted start time, it is allowed to run. In order to use the backfill scheduler, users must set a wall_clock_limit for jobs or the administrator must set a wall_clock_limit for each of the job classes.

The wall_clock_limit is the upper time limit a job should take to complete. If a job runs longer than the wall clock time, it is assumed to be runaway and is terminated. This may seem harsh, but requiring accurate estimates of the execute time for a job allows the scheduler to do more efficient scheduling.

Gang scheduler

A job’s start time depends on jobs ahead of it in the queue and the completion of the running jobs. Long-running jobs can potentially delay the start of other jobs. Time sharing allocates processing time to jobs in small increments. The gang scheduler works like an operating system using context-switching to support time-sharing and space-sharing. The gang is the group of tasks associated with a job. The tasks are scheduled to run at the same time across multiple machines with each task receiving a slice of time in which to execute.



Back to top


Getting started

The job command file

Users describe a job in a job command file, then submit the command file to the TWS LoadLeveler to run the job.

The following example illustrates the features of TWS LoadLeveler by creating a simple command file to submit a sample job. A command file should contain comments to describe the job, as well as track version history and changes. The pound character (#) at the beginning of a line identifies a comment. If the at sign (@) follows the pound character, it identifies a keyword. Keywords are used to specify input, output, and error files associated with a job, as well as the job class. To see which classes have been predefined by the administrator, use the llclass command.


Listing 1. llclass command
                
[loadl@blade31 loadl]$ llclass

Name                 MaxJobCPU     MaxProcCPU  Free   Max  Description
                    d+hh:mm:ss     d+hh:mm:ss Slots Slots
--------------- -------------- -------------- ----- -----  ---------------------
large                undefined      undefined     2     2  
medium               undefined      undefined     5     5  
small                undefined      undefined     8     8  

Additional information on a job class is viewed using the -l flag.


Listing 2. llclass command arguments
                
[loadl@blade31 loadl]$ llclass -l large

=============== Class large ===============
                Name: large
            Priority: 0
       Exclude_Users: adams
       Include_Users: 
      Exclude_Groups: 
      Include_Groups: staff
               Admin: 
           NQS_class: F
          NQS_submit: 
           NQS_query: 
      Max_processors: -1
             Maxjobs: -1
Resource_requirement: 
       Class_comment: large job class for grid integration center cluster
      Class_ckpt_dir: 
          Ckpt_limit: undefined, undefined
    Wall_clock_limit: 00:30:00, undefined (1800 seconds, undefined)
Def_wall_clock_limit: 00:30:00, undefined (1800 seconds, undefined)
       Job_cpu_limit: undefined, undefined
           Cpu_limit: undefined, undefined
          Data_limit: undefined, undefined
          Core_limit: undefined, undefined
          File_limit: undefined, undefined
         Stack_limit: undefined, undefined
           Rss_limit: undefined, undefined
                Nice: 0
          Free_slots: 2
       Maximum_slots: 2
    Execution_factor: 1
     Max_total_tasks: -1
 Max_proto_instances: 2
       Preempt_class: 
         Start_class: 
        User default: maxidle(-1) maxqueued(-1) maxjobs(-1) max_total_tasks(-1)

The following command file is for a small job that directs errors to the simple.err file, output to the simple.out file, and queues the job. The job will sleep, then echo the job name and hostname. The sleep command has been inserted to allow us to view the job status before it completes.


Listing 3. Job command file
                
#
# Simple sample script that will demonstrate submitting jobs to LL
#
# @ class = small
# @ error   = simple.err
# @ output  = simple.out
# @ queue
#
sleep 60
echo Job $0 is running on `hostname`

Submitting the job

Jobs are submitted using the llsubmit command, followed by the command file.

[loadl@blade31 loadl]$ llsubmit simple.cmd

llsubmit: The job "blade31.austin.ibm.com.15" has been submitted.

Job status

Once the job has been submitted, the status can be viewed using the llq command.


Listing 4. llq command
                
[loadl@blade31 loadl]$ llq

Id                       Owner      Submitted   ST PRI Class        Running On 
------------------------ ---------- ----------- -- --- ------------ -----------
blade31.15.0             loadl       5/12 14:02 R  50  small        blade31

1 job step(s) in queue, 0 waiting, 0 pending, 1 running, 0 held, 0 preempted

Cluster status

The status of all the resources in the pool can be viewed using the llstatus command.


Listing 5. llstatus command
                
[loadl@blade31 loadl]$ llstatus

blade31.austin.ibm.com    Avail     1   1 Run      1 2.00     0 i386      Linux2   

i386/Linux2                 1 machines      1  jobs      1  running
Total Machines              1 machines      1  jobs      1  running

The Central Manager is defined on blade31.austin.ibm.com

The BACKFILL scheduler is in use

All machines on the machine_list are present.

When the job completes, TWS LoadLeveler notifies the submitter and provides job summary information in an e-mail.


Listing 6. Notification
                
From loadl@blade31.austin.ibm.com  Tue May 16 10:03:53 2006
Date: Tue, 16 May 2006 10:03:53 -0500
From: loadl@blade31.austin.ibm.com
To: mausolf@blade31.austin.ibm.com
Subject: blade31.austin.ibm.com.16

From: LoadLeveler


LoadLeveler Job Step: blade31.austin.ibm.com.16.0
        Executable: /home/mausolf/simple.cmd
        Executable arguments: 
        State for machine: blade31.austin.ibm.com
        LoadL_starter: The program, simple.cmd, \
        exited normally and returned an exit code of 0.

This job step was dispatched to run 1 time(s).
This job step was rejected by Starter 0 time(s).
Submitted at: Tue May 16 09:58:52 2006
Started at: Tue May 16 09:58:53 2006
Exited at: Tue May 16 10:03:53 2006
               Real Time:   0 00:05:01
      Job Step User Time:   0 00:00:00
    Job Step System Time:   0 00:00:00
     Total Job Step Time:   0 00:00:00

       Starter User Time:   0 00:00:00
     Starter System Time:   0 00:00:00
      Total Starter Time:   0 00:00:00

The command-line interface demonstrated above is preferred by some users, but for others a graphical user interface (GUI), xloadl, is also provided.


Figure 1. xloadl -- Select a job to submit
xloadl -- Select a job to submit

Figure 2 shows job status as viewed through the xload GUI.


Figure 2. xloadl -- Status
xloadl -- Status


Back to top


Summary

In TWS LoadLeveler, the central manager knows about all the resources in the pool. Each machine sends the central manager information about the resources it has available.

When a user submits a job, it's placed in the job queue by the scheduler. The scheduler coordinates with the central manager to locate a resource that can satisfy the job requirements, then contacts the selected resource to run the job. Users can submit jobs to TWS LoadLeveler through the command line or using the xloadl GUI. TWS LoadLeveler handles all the complexities associated with determining when and where jobs should run in order to maximize job throughput and resource utilization.

TWS LoadLeveler V8.3 is available for AIX and Linux.



Resources

Learn

Get products and technologies
  • Build your next development project with IBM trial software, available for download directly from developerWorks.


Discuss


About the author

Jeff Mausolf is a certified IT architect for the IBM Grid Integration Center in Austin, Texas, which is part of the Global Technology Integration and Management Competency; he has worked as an Application Architect and Software Engineer, developing commercial and governmental portals. He is part of the Grid Computing Initiative and has worked on numerous grid projects, including authoring an IBM Redbook on developing grid services. Before coming to IBM 14 years ago, he worked for Lockheed, Loral, and Ford AeroSpace at NASA. While at the Johnson Space Center, he supported astronaut mission training in the Shuttle Engineering Simulation (SES) laboratory, building Space Station components and working on the AP101S General Purpose Computer (GPC) for the Space Shuttle.




Rate this page


Please take a moment to complete this form to help us better serve you.



YesNoDon't know
 


 


12345
Not
useful
Extremely
useful
 


Back to top