Level: Introductory
Jeff Mausolf (mausolf@us.ibm.com), IT Architect, IBM Global Services, Global Technology Center, Grid Computing Initiative
18 Jul 2006
IBM® Tivoli® Workload Scheduler LoadLeveler® is an advanced scheduling system for AIX® and Linux®. This article provides an overview of the product and demonstrates how to submit, monitor, and control jobs in the LoadLeveler environment.
Introduction
The IBM Tivoli Workload Scheduler (TWS) LoadLeveler is a job scheduling system developed to maximize resource utilization and throughput for large computational jobs. It was initially developed for AIX, but is now also available for Linux. TWS LoadLeveler schedules serial or parallel jobs based on priority, resource requirements, and resource availability. It handles the management, execution, and accounting details associated with running jobs across a distributed set of resources.
This article provides an overview of the TWS LoadLeveler system and the concepts associated with creating and running jobs in this environment. We'll develop a sample job to illustrate job creation, submission, and management.
How does TWS LoadLeveler work?
A TWS LoadLeveler pool is a collection of machines that coordinate to provide a high-throughput computing environment. The pool can be thought of as a distributed computing cluster that consists of a scheduler, a central manager, and multiple compute nodes. When users submit jobs, the jobs are added to the job queue, which is managed by the scheduler.
The scheduler assigns a unique identifier to each job and notifies the central manager that there is a new job that needs resources.
The central manager checks for available resources that might be able to handle the job. It locates appropriate resources by comparing the job requirements against the resources each machine has available. When the central manager locates a resource that can satisfy the job requirements, it notifies the scheduler. The scheduler then contacts the selected resource and asks it to run the job.
The compute resource runs the job by forking a child process that executes under the user ID of the job submitter. When the job is started, the compute resource notifies the central manager that the job is running. When the job completes, LoadLeveler generates accounting information about the resources consumed by the job.
The processes that compose TWS LoadLeveler
There are many processes that coordinate to provide this high-throughput computing environment. These processes run on the scheduler, central manager, or compute nodes. Following is a summary of the processes that make up the TWS LoadLeveler environment.
Master
The master process runs on every machine in the TWS LoadLeveler pool. It reads local configuration information, then starts the other processes that should be running on the machine. Which processes it starts depends on whether the machine is configured as a compute node, a scheduler, or the central manager. After starting the appropriate processes, the master monitors them. If a process fails, the master restarts it and sends an e-mail to the administrator about the problem, including information related to the failed process, such as the termination status and the log file.
Scheduler
The scheduler manages the job queue. It receives requests from users to run jobs and assigns a unique identifier to each new job. It informs the central manager that there is a job request, and when the central manager has located the appropriate resource(s) for the job, the scheduler spawns a shadow process to handle it. The shadow process monitors the job running on the compute resource and notifies the scheduler when it completes. The scheduler forwards this information on to the central manager.
Shadow
The shadow process is started by the scheduler when the central manager has selected a resource for the job. The shadow process contacts the startd process on the compute resource that was selected to run the job. The startd then spawns a starter process to manage the job execution. The shadow process sends the executable and associated job information to this starter process. When the job has completed, the shadow receives the job exit status from the starter process, returns that to the scheduler, and exits.
startd
The compute nodes run a start process called startd. When it receives requests from the scheduler to run jobs, the startd spawns a starter process and monitors the job, along with the resources on the machine. The startd forwards this information on to the central manager so it can be considered when matching other jobs. When the jobs complete, the startd gathers information about the state of the machine and the resources consumed by the jobs. This information is sent to the central manager for accounting purposes.
Starter
The starter process is spawned by the startd process to run a job. It contacts the shadow process on the scheduling machine to get the executable and other associated information related to the job. It runs the job by forking a child process that executes under the user ID of the job submitter, then notifies the central manager that the job state is now running. When the child process finishes running the job, the starter process returns the job exit status to the shadow, which forwards the information to the scheduler and exits.
Central manager
The central manager knows about all the resources in the pool. It maintains status information about jobs and machines. The machine information includes a list of attributes that describe the resources it has available. When the central manager receives notification from the scheduler that there is a new job, it examines machine resource attributes to determine which machine can satisfy the job requirements. When an appropriate resource is located, the central manager sets the job state to pending and asks the scheduler to run the job on that resource.
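The matching step described above can be illustrated with a minimal Python sketch. This is not LoadLeveler code: the `Machine` and `Job` classes, their fields, and the `find_machine` function are hypothetical names chosen for illustration, and the real central manager considers many more attributes than the three compared here.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    """Hypothetical view of the attributes a machine reports."""
    name: str
    arch: str
    memory_mb: int
    classes: set


@dataclass
class Job:
    """Hypothetical view of a job's requirements."""
    job_id: str
    job_class: str
    arch: str
    min_memory_mb: int


def find_machine(job, machines):
    """Return the first machine whose attributes satisfy the job, or None."""
    for machine in machines:
        if (job.job_class in machine.classes
                and machine.arch == job.arch
                and machine.memory_mb >= job.min_memory_mb):
            return machine
    return None


pool = [
    Machine("node1", "i386", 512, {"small"}),
    Machine("node2", "i386", 2048, {"small", "large"}),
]
job = Job("blade31.15.0", "large", "i386", 1024)
match = find_machine(job, pool)
print(match.name if match else "no match; job stays queued")
```

In this sketch a job with no matching machine simply stays in the queue until a suitable resource reports in.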
TWS LoadLeveler user's view
You should now have a basic understanding of how the machines and processes in a TWS LoadLeveler pool coordinate to provide a high-throughput computing environment.
TWS LoadLeveler job classes
TWS LoadLeveler uses classes to schedule jobs on resources. A job class defines the characteristics associated with a type of job, such as the priority, run limits, and resources. Administrators create job classes and supply a meaningful name for each class so users can easily associate their jobs with the appropriate job classes. The TWS LoadLeveler default job classes include small, medium, large, very long, parallel, and interactive. Additional job classes can be created by the administrator.
Resource owners or administrators determine which job classes can run on a machine. They specify the job classes the machine will accept and the number of jobs of each class allowed to run on the resource. The following example illustrates a sample configuration that specifies that eight small jobs, five medium jobs, and two large jobs can run on the machine.
CLASS = small(8) medium(5) large(2)
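To make the meaning of the CLASS statement concrete, here is a small Python sketch that parses the line above into per-class slot counts and checks whether a machine could accept another job of a given class. The function names `parse_class_slots` and `can_accept` are hypothetical; this is only an illustration of the slot-counting idea, not how LoadLeveler reads its configuration.

```python
import re


def parse_class_slots(line):
    """Parse 'CLASS = small(8) medium(5) large(2)' into a dict of slot counts."""
    key, _, value = line.partition("=")
    if key.strip().upper() != "CLASS":
        raise ValueError(f"not a CLASS statement: {line!r}")
    return {name: int(count)
            for name, count in re.findall(r"(\w+)\((\d+)\)", value)}


def can_accept(slots, running, job_class):
    """True if fewer jobs of job_class are running than the class allows."""
    return running.get(job_class, 0) < slots.get(job_class, 0)


slots = parse_class_slots("CLASS = small(8) medium(5) large(2)")
print(slots)  # {'small': 8, 'medium': 5, 'large': 2}

running = {"large": 2, "small": 3}
print(can_accept(slots, running, "large"))  # False: both large slots are in use
print(can_accept(slots, running, "small"))  # True: 3 of 8 small slots are in use
```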
TWS LoadLeveler consumable resources
The attributes associated with each machine, such as CPU, memory, and disk space, are considered consumable resources. These resources are consumable in the sense that a job will use a specified amount of the resource and release it when it completes. Consumable resources can be specified for a specific machine or for the entire cluster.
Machine resources include things like CPU, memory, and disk space available on a machine. Floating resources are available across all the machines in the cluster -- a floating software license, for example. Administrators specify the machine and floating consumable resources in a pool. The central manager considers these resources when matching jobs to resources. Users specify the resources required by a job when they submit. The central manager looks at the consumable resources required for the job and checks available consumable resources on machines to determine where a job can run.
When a job is scheduled to run on a machine, the central manager decreases the consumable resources on that machine by the amount required for the job. When the job completes, the consumable resources are increased by the same amount.
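The bookkeeping described above can be sketched in a few lines of Python. The `ConsumablePool` class and the resource names `cpus` and `memory_mb` are assumptions made for this illustration; they are not LoadLeveler keywords or APIs.

```python
class ConsumablePool:
    """Tracks the consumable resources currently available on one machine."""

    def __init__(self, **totals):
        self.available = dict(totals)

    def can_satisfy(self, request):
        """True if every requested amount is currently available."""
        return all(self.available.get(name, 0) >= amount
                   for name, amount in request.items())

    def reserve(self, request):
        """Decrease available resources when a job is scheduled."""
        if not self.can_satisfy(request):
            return False
        for name, amount in request.items():
            self.available[name] -= amount
        return True

    def release(self, request):
        """Increase available resources by the same amount when the job completes."""
        for name, amount in request.items():
            self.available[name] += amount


machine = ConsumablePool(cpus=4, memory_mb=4096)
job_a = {"cpus": 2, "memory_mb": 1024}
job_b = {"cpus": 3, "memory_mb": 512}

print(machine.reserve(job_a))  # True
print(machine.available)       # {'cpus': 2, 'memory_mb': 3072}
print(machine.reserve(job_b))  # False: only 2 CPUs remain
machine.release(job_a)
print(machine.available)       # {'cpus': 4, 'memory_mb': 4096}
```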
TWS LoadLeveler schedulers
TWS LoadLeveler has different scheduling algorithms that can be used, depending on the types of jobs that need to be run. Administrators select and specify the scheduling algorithm to use in the TWS LoadLeveler administration configuration file. A default, backfill, and gang scheduler are provided with TWS LoadLeveler.
Default scheduler
The default scheduler runs jobs by scheduling them on idle resources. It starts, suspends, and resumes jobs based on workload. It handles serial jobs extremely well and can also schedule parallel jobs by using reservations. When scheduling parallel jobs, nodes are reserved as they become available. The reserved nodes remain idle until enough nodes are available to run the parallel job. Because reserved nodes sit idle until enough have accumulated, this process can result in lower overall utilization when scheduling large parallel jobs.
Backfill scheduler
The backfill scheduler schedules large parallel jobs across blocks of resources, then looks for smaller jobs to fill in the gaps. The back-filling algorithm runs shorter jobs that require a small number of resources, while larger jobs are waiting for enough nodes to run. Shorter jobs can run on the idle resources as long as they do not delay the start of the larger jobs.
The scheduler determines if a shorter job can run by comparing the projected start time of a large job with the wall_clock_limit of a smaller job. If the smaller job will complete before the larger job's predicted start time, it is allowed to run. In order to use the backfill scheduler, users must set a wall_clock_limit for jobs or the administrator must set a wall_clock_limit for each of the job classes.
The wall_clock_limit is the upper limit on the time a job should take to complete. If a job runs longer than its wall_clock_limit, it is assumed to be a runaway job and is terminated. This may seem harsh, but requiring accurate estimates of a job's execution time allows the scheduler to schedule more efficiently.
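The backfill decision described above can be sketched as follows. This is a simplified illustration, not the actual LoadLeveler algorithm: the `QueuedJob` class and `pick_backfill_jobs` function are hypothetical, times are plain seconds, and the sketch assumes a single block of idle nodes and a single waiting large job.

```python
from dataclasses import dataclass


@dataclass
class QueuedJob:
    name: str
    nodes: int
    wall_clock_limit: int  # seconds


def pick_backfill_jobs(queue, idle_nodes, now, large_job_start):
    """Choose queued jobs that fit on idle nodes and finish before the large job starts."""
    window = large_job_start - now
    chosen = []
    for job in queue:
        fits_nodes = job.nodes <= idle_nodes
        fits_time = job.wall_clock_limit <= window
        if fits_nodes and fits_time:
            chosen.append(job)
            idle_nodes -= job.nodes
    return chosen


queue = [
    QueuedJob("a", nodes=2, wall_clock_limit=600),
    QueuedJob("b", nodes=1, wall_clock_limit=3600),
    QueuedJob("c", nodes=1, wall_clock_limit=900),
]
# Three nodes are idle and the large job is projected to start in 1800 seconds.
picked = pick_backfill_jobs(queue, idle_nodes=3, now=0, large_job_start=1800)
print([job.name for job in picked])  # ['a', 'c']
```

Job b is skipped because its wall_clock_limit exceeds the 1800-second window, which shows why the limit has to be set for backfilling to work at all.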
Gang scheduler
A job’s start time depends on jobs ahead of it in the queue and the completion of the running jobs. Long-running jobs can potentially delay the start of other jobs. Time sharing allocates processing time to jobs in small increments. The gang scheduler works like an operating system using context-switching to support time-sharing and space-sharing. The gang is the group of tasks associated with a job. The tasks are scheduled to run at the same time across multiple machines with each task receiving a slice of time in which to execute.
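As a rough illustration of the time-sharing idea, the following Python sketch rotates whole gangs through fixed time slices, so that all tasks of one job run together in each slice. It is a toy round-robin model under stated assumptions: the `gang_timeline` function and the job names are hypothetical, every job gets equal slices, and space-sharing (running different gangs side by side on disjoint machines) is not modeled.

```python
from itertools import cycle, islice


def gang_timeline(jobs, num_slices):
    """Build a round-robin timeline of (slice index, job name, tasks run together).

    jobs maps a job name to its number of tasks; all tasks of the selected
    job are scheduled in the same time slice.
    """
    timeline = []
    for slot, name in enumerate(islice(cycle(jobs), num_slices)):
        tasks = [f"{name}.task{i}" for i in range(jobs[name])]
        timeline.append((slot, name, tasks))
    return timeline


for slot, name, tasks in gang_timeline({"jobA": 2, "jobB": 3}, 4):
    print(slot, name, tasks)
```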
Getting started
The job command file
Users describe a job in a job command file, then submit the command file to TWS LoadLeveler to run the job.
The following example illustrates the features of TWS LoadLeveler by creating a simple command file to submit a sample job. A command file should contain comments to describe the job, as well as track version history and changes. The pound character (#) at the beginning of a line identifies a comment. If the at sign (@) follows the pound character, the line specifies a keyword. Keywords are used to specify input, output, and error files associated with a job, as well as the job class. To see which classes have been predefined by the administrator, use the llclass command.
Listing 1. llclass command
[loadl@blade31 loadl]$ llclass
Name MaxJobCPU MaxProcCPU Free Max Description
d+hh:mm:ss d+hh:mm:ss Slots Slots
--------------- -------------- -------------- ----- ----- ---------------------
large undefined undefined 2 2
medium undefined undefined 5 5
small undefined undefined 8 8
To view additional information about a job class, use the -l flag followed by the class name.
Listing 2. llclass command arguments
[loadl@blade31 loadl]$ llclass -l large
=============== Class large ===============
Name: large
Priority: 0
Exclude_Users: adams
Include_Users:
Exclude_Groups:
Include_Groups: staff
Admin:
NQS_class: F
NQS_submit:
NQS_query:
Max_processors: -1
Maxjobs: -1
Resource_requirement:
Class_comment: large job class for grid integration center cluster
Class_ckpt_dir:
Ckpt_limit: undefined, undefined
Wall_clock_limit: 00:30:00, undefined (1800 seconds, undefined)
Def_wall_clock_limit: 00:30:00, undefined (1800 seconds, undefined)
Job_cpu_limit: undefined, undefined
Cpu_limit: undefined, undefined
Data_limit: undefined, undefined
Core_limit: undefined, undefined
File_limit: undefined, undefined
Stack_limit: undefined, undefined
Rss_limit: undefined, undefined
Nice: 0
Free_slots: 2
Maximum_slots: 2
Execution_factor: 1
Max_total_tasks: -1
Max_proto_instances: 2
Preempt_class:
Start_class:
User default: maxidle(-1) maxqueued(-1) maxjobs(-1) max_total_tasks(-1)
The following command file is for a small job that directs errors to the simple.err file, output to the simple.out file, and queues the job. The job will sleep, then echo the job name and hostname. The sleep command has been inserted to allow us to view the job status before it completes.
Listing 3. Job command file
#
# Simple sample script that will demonstrate submitting jobs to LL
#
# @ class = small
# @ error = simple.err
# @ output = simple.out
# @ queue
#
sleep 60
echo Job $0 is running on `hostname`
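To show how the comment and keyword conventions fit together, here is a small Python sketch that separates the keyword lines from the shell commands in a command file like Listing 3. The `parse_command_file` function is a hypothetical helper written for this illustration; it is not part of TWS LoadLeveler, and it ignores details such as continuation lines.

```python
def parse_command_file(text):
    """Split a job command file into keyword statements and shell commands."""
    keywords = {}
    commands = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#"):
            body = line[1:].lstrip()
            if body.startswith("@"):
                # '# @ name = value' is a keyword; '# @ queue' has no value.
                key, sep, value = body[1:].strip().partition("=")
                keywords[key.strip()] = value.strip() if sep else None
            # Any other line starting with '#' is a plain comment.
        elif line:
            commands.append(line)
    return keywords, commands


SIMPLE_CMD = """\
#
# Simple sample script that will demonstrate submitting jobs to LL
#
# @ class = small
# @ error = simple.err
# @ output = simple.out
# @ queue
#
sleep 60
echo Job $0 is running on `hostname`
"""

keywords, commands = parse_command_file(SIMPLE_CMD)
print(keywords)
print(commands)
```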
Submitting the job
Jobs are submitted using the llsubmit command, followed by the name of the command file.
[loadl@blade31 loadl]$ llsubmit simple.cmd
llsubmit: The job "blade31.austin.ibm.com.15" has been submitted.
Job status
Once the job has been submitted, the status can be viewed using the llq command.
Listing 4. llq command
[loadl@blade31 loadl]$ llq
Id Owner Submitted ST PRI Class Running On
------------------------ ---------- ----------- -- --- ------------ -----------
blade31.15.0 loadl 5/12 14:02 R 50 small blade31
1 job step(s) in queue, 0 waiting, 0 pending, 1 running, 0 held, 0 preempted
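If you want to inspect llq output from a script, the text shown in Listing 4 can be parsed with a few lines of Python. This sketch works only on output shaped like the sample above; the function names are hypothetical, and splitting the job row on whitespace is an assumption that holds for the row shown here but may not hold for every llq output format.

```python
def parse_llq_summary(line):
    """Turn the llq summary line into a dict of label -> count."""
    counts = {}
    for part in line.split(","):
        number, label = part.strip().split(" ", 1)
        counts[label] = int(number)
    return counts


def parse_llq_row(row):
    """Split one job row into named fields; the host is absent if the job is not running."""
    fields = row.split()
    job_id, owner, date, time, state, priority, job_class = fields[:7]
    return {
        "id": job_id,
        "owner": owner,
        "submitted": f"{date} {time}",
        "state": state,
        "priority": int(priority),
        "class": job_class,
        "running_on": fields[7] if len(fields) > 7 else None,
    }


summary = parse_llq_summary(
    "1 job step(s) in queue, 0 waiting, 0 pending, 1 running, 0 held, 0 preempted"
)
row = parse_llq_row("blade31.15.0 loadl 5/12 14:02 R 50 small blade31")
print(summary["running"], row["state"], row["running_on"])  # 1 R blade31
```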
Cluster status
The status of all the resources in the pool can be viewed using the llstatus command.
Listing 5. llstatus command
[loadl@blade31 loadl]$ llstatus
blade31.austin.ibm.com Avail 1 1 Run 1 2.00 0 i386 Linux2
i386/Linux2 1 machines 1 jobs 1 running
Total Machines 1 machines 1 jobs 1 running
The Central Manager is defined on blade31.austin.ibm.com
The BACKFILL scheduler is in use
All machines on the machine_list are present.
When the job completes, TWS LoadLeveler notifies the submitter and provides job summary information in an e-mail.
Listing 6. Notification
From loadl@blade31.austin.ibm.com Tue May 16 10:03:53 2006
Date: Tue, 16 May 2006 10:03:53 -0500
From: loadl@blade31.austin.ibm.com
To: mausolf@blade31.austin.ibm.com
Subject: blade31.austin.ibm.com.16
From: LoadLeveler
LoadLeveler Job Step: blade31.austin.ibm.com.16.0
Executable: /home/mausolf/simple.cmd
Executable arguments:
State for machine: blade31.austin.ibm.com
LoadL_starter: The program, simple.cmd, \
exited normally and returned an exit code of 0.
This job step was dispatched to run 1 time(s).
This job step was rejected by Starter 0 time(s).
Submitted at: Tue May 16 09:58:52 2006
Started at: Tue May 16 09:58:53 2006
Exited at: Tue May 16 10:03:53 2006
Real Time: 0 00:05:01
Job Step User Time: 0 00:00:00
Job Step System Time: 0 00:00:00
Total Job Step Time: 0 00:00:00
Starter User Time: 0 00:00:00
Starter System Time: 0 00:00:00
Total Starter Time: 0 00:00:00
Some users prefer the command-line interface demonstrated above. For others, TWS LoadLeveler also provides a graphical user interface (GUI), xloadl.
Figure 1. xloadl -- Select a job to submit
Figure 2 shows job status as viewed through the xloadl GUI.
Figure 2. xloadl -- Status
Summary
In TWS LoadLeveler, the central manager knows about all the resources in the pool. Each machine sends the central manager information about the resources it has available.
When a user submits a job, it's placed in the job queue by the scheduler. The scheduler coordinates with the central manager to locate a resource that can satisfy the job requirements, then contacts the selected resource to run the job. Users can submit jobs to TWS LoadLeveler through the command line or using the xloadl GUI. TWS LoadLeveler handles all the complexities associated with determining when and where jobs should run in order to maximize job throughput and resource utilization.
TWS LoadLeveler V8.3 is available for AIX and Linux.
About the author
Jeff Mausolf is a certified IT architect for the IBM Grid Integration Center in Austin, Texas, which is part of the Global Technology Integration and Management Competency. He has worked as an Application Architect and Software Engineer, developing commercial and governmental portals. He is part of the Grid Computing Initiative and has worked on numerous grid projects, including authoring an IBM Redbook on developing grid services. Before coming to IBM 14 years ago, he worked for Lockheed, Loral, and Ford AeroSpace at NASA. While at the Johnson Space Center, he supported astronaut mission training in the Shuttle Engineering Simulation (SES) laboratory, building Space Station components and working on the AP101S General Purpose Computer (GPC) for the Space Shuttle.