LSF Job Step Manager
IBM® Job Step Manager (JSM) Version 10.4 is a lightweight job step scheduler that can help you manage available compute resources that are provided through an LSF® allocation.
You can use JSM to run multiple parallel applications. JSM can be helpful in a testing environment, or when an application runs in multiple phases and each phase has its own resource requirements. The granular control that JSM provides extends to the individual cores, GPUs, and memory of the nodes within the LSF allocation, including process binding and integration with OpenMP.
JSM is different from IBM® Spectrum Load Sharing Facility (LSF) because it manages a collection of resources for a single user. JSM provides the following features and benefits when compared to LSF:
- Job steps can be run in a specific order.
- Job steps allow access to the same staged files.
- Job steps can share compute resources.
- Resource Manager Jobs do not experience delays in scheduling because the scheduler is not managing requests from multiple users.
- The JSM scheduling algorithm is lightweight and does not schedule based on time estimations, but uses a simple FIFO queue for scheduling job steps.
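The FIFO behavior described in the last item can be illustrated with a toy model. The following Python sketch is not JSM's implementation; the JobStep class, the fifo_schedule function, and the use of core counts and abstract time ticks are all hypothetical names and simplifications chosen for illustration. It shows the main consequence of a strict FIFO queue without time estimates: a job step that would fit in the free resources still waits if the step ahead of it in the queue cannot start yet.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class JobStep:
    """Hypothetical job step: a name, a core request, and a run time in ticks."""
    name: str
    cores: int
    duration: int


def fifo_schedule(steps, total_cores):
    """Return {step name: start tick} under strict FIFO ordering.

    Only the step at the head of the queue may start. Later steps wait
    behind it even if they would fit (no backfill, no time estimates).
    """
    for step in steps:
        if step.cores > total_cores:
            raise ValueError(f"{step.name} requests more cores than the allocation has")

    queue = deque(steps)
    running = []          # (finish_tick, cores) for each running step
    free = total_cores
    tick = 0
    starts = {}

    while queue:
        # Release cores held by steps that have finished by this tick.
        free += sum(cores for finish, cores in running if finish <= tick)
        running = [(finish, cores) for finish, cores in running if finish > tick]

        # Start steps from the head of the queue for as long as they fit.
        while queue and queue[0].cores <= free:
            step = queue.popleft()
            free -= step.cores
            running.append((tick + step.duration, step.cores))
            starts[step.name] = tick

        # If the head is blocked, advance time to the next completion.
        if queue:
            tick = min(finish for finish, _ in running)

    return starts


steps = [JobStep("A", 3, 5), JobStep("B", 2, 1), JobStep("C", 1, 1)]
print(fifo_schedule(steps, total_cores=4))
```

In this example, step C needs only one core and one core is free at tick 0, but C does not start until step B, which is ahead of it in the queue, can start at tick 5.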
A job step is created whenever a user creates processes that run on the allocation. The jsrun command is used to create processes: it requests resources and specifies one or more tasks to be launched on those resources. Up to 100,000 job steps can be created within a single LSF allocation.
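As an illustration of how a jsrun request pairs resources with tasks, the following Python sketch assembles a jsrun argument list without running it. The helper name build_jsrun_command, the program name ./my_app, and its arguments are hypothetical. The option names --nrs, --tasks_per_rs, --cpu_per_rs, and --gpu_per_rs are assumed here to be the resource set options of jsrun; check them against the jsrun command reference for your JSM version before you rely on them.

```python
import shlex


def build_jsrun_command(program, program_args=(), nrs=1, tasks_per_rs=1,
                        cpu_per_rs=1, gpu_per_rs=0):
    """Build (but do not run) a jsrun argument list.

    The option names are assumptions to verify against the jsrun
    reference: number of resource sets, tasks per resource set, and
    cores and GPUs per resource set.
    """
    argv = [
        "jsrun",
        "--nrs", str(nrs),
        "--tasks_per_rs", str(tasks_per_rs),
        "--cpu_per_rs", str(cpu_per_rs),
        "--gpu_per_rs", str(gpu_per_rs),
        program,
    ]
    argv.extend(program_args)
    return argv


# Hypothetical application and arguments, for illustration only.
cmd = build_jsrun_command("./my_app", ["--input", "data.in"],
                          nrs=2, tasks_per_rs=4, cpu_per_rs=4, gpu_per_rs=1)
print(shlex.join(cmd))
```

Inside an LSF allocation, a list that is built this way could be passed to subprocess.run to create the job step. Keeping the arguments as a list avoids shell quoting problems.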
You can also create new processes by using the PMIx_Spawn API. The PMIx_Spawn API is invoked by calls from the MPI_Comm_spawn API that is used in most MPI implementations.
The jsrun command is designed to provide fast job startup at a large scale. It uses an efficient communication topology, and it uses the PMIx APIs to give each application the information that it needs to connect quickly to its peer applications. The jsrun command provides the following features:
- Determines how processes can be launched onto the associated resources.
- Manages processes, which includes stdio management, signal propagation, environment propagation, and establishing the current working directory for processes.
You can use the jskill command to signal or terminate launched processes. You can use the jslist command to list job steps that are running, completed, or in a queue. The jsinfo command provides information about the current JSM allocation resources.
PMIx is open source software that establishes a portable API that allows processes to interact with resource managers. The PMIx software provides processes and tasks with the following services:
- Request the identity of the task and the number of peer tasks
- Exchange data with peers (specifically, high-speed network endpoint data)
- Publish and look up key-value pairs
- Create extra processes
- Terminate processes
To access the features of PMIx, tasks must be registered as PMIx clients. Some MPI implementations, including IBM Spectrum™ MPI, use PMIx to launch and manage processes. JSM uses the PMIx version 2.0 convenience library and supports only clients that are based on PMIx version 2.0. For documentation about PMIx, see the PMIx website.
What's new in IBM Job Step Manager (JSM)
Read about new or significantly changed information for JSM.
LSF Job Step Manager PDFs
This topic contains links to PDFs of the LSF JSM documentation.
Job Step Manager configuration on a system
JSM is installed on systems that are running IBM® Cluster System Management (CSM) software that is integrated with LSF.
Job Step Manager resource sets
JSM separates the concept of resource allocation from that of resource tasks. Read about JSM resource sets if you are not familiar with launchers that use complicated methods to determine resources based on the number of tasks that you want to create.
Job Step Manager process management
The jsrun command creates processes and also manages the processes that it creates. The processes remain child processes and are monitored to determine when they terminate.
Job Step Manager customization
A system administrator can change the default values that are used by JSM.
Job Step Manager job step stages
The lifecycle of a job step has three stages that JSM tracks. The start, complete, and update events that occur in a stage are recorded by the JSM progress tracing feature.
Job Step Manager progress tracing
The JSM progress tracing feature records launch, fence, and shutdown events for each job step.
Job Step Manager fault tolerance support
JSM has support for some fail-stop failures for the jsmd daemon. The processes for the jsmd daemon can be terminated or isolated from communicating with peer jsmd daemons. The remaining jsmd daemons that are not terminated or isolated use the hardware resources that they manage to continue providing functionality for launching and managing job steps.
Troubleshooting Job Step Manager
JSM creates log files if the JSM_LOG_ENABLE environment variable is set to yes.
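As a minimal sketch of enabling logging from a Python driver script, the following hypothetical helper returns a copy of an environment with JSM_LOG_ENABLE set to yes. Only the variable name and value come from the statement above; the helper name env_with_jsm_logging is an assumption for illustration, and where the log files are written is not covered here.

```python
import os


def env_with_jsm_logging(base_env=None):
    """Return a copy of the environment with JSM logging enabled.

    The base environment is copied so that the caller's mapping (or
    os.environ) is not modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["JSM_LOG_ENABLE"] = "yes"
    return env


env = env_with_jsm_logging()
print(env["JSM_LOG_ENABLE"])
```

The returned mapping can be passed as the env argument of subprocess.run when the script launches JSM commands.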
Job Step Manager commands
You can use the JSM commands to start a parallel application, stop and start job steps, and list the running, pending, and completed job steps.