Network-aware scheduling

LSF can schedule and launch IBM Parallel Environment (PE) jobs according to the job requirements, IBM Parallel Environment requirements, network availability, and LSF scheduling policies.

Network resource collection

To schedule a PE job, LSF must know what network resources are available.

If LSF_PE_NETWORK_NUM is defined with a value greater than zero in lsf.conf, LSF collects network information for PE jobs and creates two string resources:
pe_network

A host-based string resource that contains the network ID and the number of network windows available on the network.

pnsd

Set to Y if the PE network resource daemon pnsd responds successfully, or N if there is no response. PE jobs can only run on hosts with pnsd installed and running.

Use lsload -l to view network information for PE jobs. For example, the following lsload command displays network information for hostA and hostB, both of which have 2 networks available. Each network has 256 windows, and pnsd is responsive on both hosts. In this case, LSF_PE_NETWORK_NUM=2 should be set in lsf.conf:
lsload -l
HOST_NAME   status  r15s   r1m  r15m   ut    pg    io  ls    it   tmp   swp   mem   pnsd
pe_network                                 
hostA               ok   1.0   0.1   0.2  10%   0.0     4  12     1   33G 4041M 2208M  Y
ID= 1111111,win=256;ID= 2222222,win=256
hostB               ok   1.0   0.1   0.2  10%   0.0     4  12     1   33G 4041M 2208M  Y
ID= 1111111,win=256;ID= 2222222,win=256
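The pe_network value shown above can be split into per-network window counts. The following is a minimal sketch that assumes the string always follows the ID=...,win=... layout seen in the sample output, with entries separated by semicolons; the function name parse_pe_network is hypothetical and not part of LSF.

```python
def parse_pe_network(value):
    """Parse a pe_network string into {network_id: available_windows}.

    Assumes the layout seen in the sample lsload output, for example
    'ID= 1111111,win=256;ID= 2222222,win=256'.
    """
    networks = {}
    for entry in value.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing semicolon
        fields = {}
        for pair in entry.split(","):
            key, _, val = pair.partition("=")
            fields[key.strip()] = val.strip()
        networks[fields["ID"]] = int(fields["win"])
    return networks


if __name__ == "__main__":
    sample = "ID= 1111111,win=256;ID= 2222222,win=256"
    print(parse_pe_network(sample))
```

Network IDs are kept as strings because the sample output does not indicate whether they are always numeric.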

Specifying network resource requirements

The network resource requirements for PE jobs are specified with the NETWORK_REQ parameter, either at the queue level in lsb.queues or in an application profile in lsb.applications, or with the -network option of the bsub command.

The NETWORK_REQ parameter and the -network option specify the network communication protocols, the adapter device type to use for message passing, the network communication system mode, the network usage characteristics, and the number of network windows (instances) required by the PE job.

network_res_req has the following syntax:

[type=sn_all | sn_single] [:protocol=protocol_name[(protocol_number)][,protocol_name[(protocol_number)]]] [:mode=US | IP] [:usage=shared | dedicated] [:instance=positive_integer]
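To make the syntax concrete, the following sketch splits a requirement string into its keywords. It is an illustration only, not how LSF parses the string: the function name parse_network_req is hypothetical, the protocol names mpi and lapi are example values, and a protocol_number suffix, if present, is left attached to the protocol name without interpretation.

```python
ALLOWED_KEYWORDS = {"type", "protocol", "mode", "usage", "instance"}


def parse_network_req(req):
    """Split a network requirement string into a dict of keyword -> value.

    Keywords are separated by ':' and protocols by ','.
    """
    result = {}
    for part in req.split(":"):
        part = part.strip()
        if not part:
            continue
        key, sep, value = part.partition("=")
        key = key.strip()
        if not sep or key not in ALLOWED_KEYWORDS:
            raise ValueError(f"unrecognized network requirement: {part!r}")
        result[key] = value.strip()
    if "protocol" in result:
        # Keep each protocol entry verbatim, including any '(n)' suffix.
        result["protocol"] = [p.strip() for p in result["protocol"].split(",")]
    if "instance" in result:
        result["instance"] = int(result["instance"])
    return result


if __name__ == "__main__":
    print(parse_network_req("type=sn_all:protocol=mpi,lapi:mode=US:usage=shared:instance=2"))
```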

LSF_PE_NETWORK_NUM must be defined with a non-zero value in lsf.conf for LSF to recognize the -network option. If LSF_PE_NETWORK_NUM is not defined or is set to 0, the job submission is rejected with a warning message.

The -network option overrides the value of NETWORK_REQ defined in lsb.applications, which overrides the value defined in lsb.queues.
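The precedence order can be pictured as a simple first-match lookup. This sketch assumes that a higher-level setting replaces the lower-level value as a whole rather than merging individual keywords; the function and argument names are hypothetical.

```python
def effective_network_req(bsub_network=None, app_network_req=None, queue_network_req=None):
    """Return the requirement that takes effect, highest precedence first.

    Order: bsub -network, then NETWORK_REQ in lsb.applications,
    then NETWORK_REQ in lsb.queues. Returns None if nothing is set.
    """
    for value in (bsub_network, app_network_req, queue_network_req):
        if value:
            return value
    return None


if __name__ == "__main__":
    # The application profile value wins over the queue value.
    print(effective_network_req(None, "type=sn_all", "type=sn_single"))
```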

The following IBM LoadLeveler job command file options are not supported in LSF:
  • collective_groups
  • imm_send_buffers
  • rcxtblocks

For detailed information on the supported network resource requirement options, see IBM Spectrum LSF command reference and IBM Spectrum LSF configuration reference.

Network window reservation

On hosts with IBM PE installed, LSF reserves a specified number of network windows for job tasks. For a job with type=sn_single, LSF reserves windows from one network for each task. LSF ensures that the reserved windows on different hosts are from the same network, such that:

reserved_window_per_task = num_protocols * num_instance

For jobs with type=sn_all, LSF reserves windows from all networks for each task, such that:

reserved_window_per_task_per_network = num_protocols * num_instance

where:
  • num_protocols is the number of communication protocols specified by the protocol keyword of bsub -network or NETWORK_REQ (lsb.queues and lsb.applications)

  • num_instance is the number of instances specified by the instance keyword of bsub -network or NETWORK_REQ (lsb.queues and lsb.applications)
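The two reservation formulas can be worked through in a few lines. This is a sketch of the arithmetic only: the function name and the num_networks argument are hypothetical, and the per-task total for sn_all is derived here by multiplying the per-network figure by the number of networks.

```python
def reserved_windows(net_type, num_protocols, num_instance, num_networks=1):
    """Compute window reservations for one task.

    sn_single: windows come from one network.
    sn_all:    the same count is reserved on every network.
    """
    if num_protocols < 1 or num_instance < 1 or num_networks < 1:
        raise ValueError("counts must be positive integers")
    per_network = num_protocols * num_instance
    if net_type == "sn_single":
        total = per_network
    elif net_type == "sn_all":
        total = per_network * num_networks  # derived total, see lead-in
    else:
        raise ValueError(f"unknown type: {net_type!r}")
    return {"per_task_per_network": per_network, "per_task_total": total}


if __name__ == "__main__":
    # Two protocols, two instances, two networks.
    print(reserved_windows("sn_single", 2, 2))
    print(reserved_windows("sn_all", 2, 2, num_networks=2))
```

With two protocols and two instances, each task needs 4 windows on a single network for sn_single, or 4 windows on each of the two networks for sn_all.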

Network load balancing

LSF balances network window load. LSF does not balance network load for jobs with type=sn_all because these jobs request network windows from all networks. Jobs with type=sn_single request network windows from only one network, so LSF chooses the network with the lowest load, which is typically the network with the most available windows.
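As a rough illustration of the sn_single case, the sketch below picks the network with the most available windows. It assumes that available window count is the only load measure, which the text describes as typical rather than guaranteed; the function name and the sample counts are hypothetical.

```python
def pick_network(available_windows):
    """Return the ID of the network with the most available windows.

    available_windows maps network ID -> available window count.
    Ties go to the first network encountered.
    """
    if not available_windows:
        raise ValueError("no networks available")
    return max(available_windows, key=available_windows.get)


if __name__ == "__main__":
    print(pick_network({"1111111": 120, "2222222": 256}))
```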

Network data striping

When multiple networks are configured in a cluster, a PE job can request striping over the networks by setting type=sn_all in the bsub -network option or the NETWORK_REQ parameter in lsb.queues or lsb.applications. LSF supports the IBM LoadLeveler striping with minimum networks feature, which specifies whether nodes that have more than half of their networks in the READY state are considered for sn_all jobs. This ensures that at least one network is UP and in the READY state between any two nodes assigned to the job.

Network data striping is enabled in LSF for PE jobs with the STRIPING_WITH_MINIMUM_NETWORK parameter in lsb.params, which tells LSF how to select nodes for sn_all jobs when one or more networks are unavailable. For example, if 8 networks are connected to a node and STRIPING_WITH_MINIMUM_NETWORK=n, all 8 networks must be up and in the READY state for that node to be considered for sn_all jobs. If STRIPING_WITH_MINIMUM_NETWORK=y, nodes with at least 5 networks up and in the READY state are considered for sn_all jobs.

For example, in a cluster with 8 networks, hardware failure leaves only 3 networks working on hostA and 5 networks working on hostB. If STRIPING_WITH_MINIMUM_NETWORK=n, an sn_all job cannot run on either hostA or hostB. If STRIPING_WITH_MINIMUM_NETWORK=y, an sn_all job can run on hostB, but it cannot run on hostA.

Note: LSF_PE_NETWORK_NUM must be defined with a value greater than 0 for STRIPING_WITH_MINIMUM_NETWORK to take effect.
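The node selection rule for sn_all jobs can be written as a small predicate. This sketch only mirrors the rule as described above (all networks ready when the striping parameter is n, more than half ready when it is y); the function and argument names are hypothetical.

```python
def eligible_for_sn_all(ready_networks, total_networks, striping_with_minimum=False):
    """Decide whether a node is considered for sn_all jobs.

    striping_with_minimum=False: every network must be ready.
    striping_with_minimum=True:  more than half must be ready.
    """
    if striping_with_minimum:
        return ready_networks > total_networks / 2
    return ready_networks == total_networks


if __name__ == "__main__":
    # The 8-network example: hostA has 3 ready networks, hostB has 5.
    for host, ready in (("hostA", 3), ("hostB", 5)):
        print(host,
              eligible_for_sn_all(ready, 8, striping_with_minimum=False),
              eligible_for_sn_all(ready, 8, striping_with_minimum=True))
```

For the 8-network example, hostA is rejected under both settings, and hostB is accepted only when the striping parameter is y.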

See the IBM Parallel Environment: Operation and Use guide (SC23-6781-05) and the LoadLeveler Using and Administering guide (SC23-6792-04) for more information about data striping for PE jobs.

LSF network options, PE environment variables, POE options

The following table shows the LSF network resource requirement options and their equivalent PE environment variables and POE job command file options:
LSF network option             PE environment variable   POE option
bsub -n                        MP_PROCS                  -procs
bsub -network "protocol=..."   MP_MSG_API                -msg_api
bsub -network "type=..."       MP_EUIDEVICE              -euidevice
bsub -network "mode=..."       MP_EUILIB                 -euilib
bsub -network "instance=..."   MP_INSTANCE               -instances
bsub -network "usage=..."      MP_ADAPTER_USE            -adapter_use