IBM Spectrum LSF resource connector overview

The resource connector for IBM Spectrum LSF (previously referred to as host factory) enables LSF clusters to borrow resources from supported resource providers.

LSF resource connector plug-ins support the following resource providers:
  • IBM Spectrum Conductor with Spark and IBM Spectrum Symphony through EGO, which is configured with the egoprov_config.json and egoprov_templates.json files.
  • OpenStack, which is configured with the osprov_config.json and osprov_templates.json files. The OpenStack provider requires IBM Spectrum LSF Fix Pack 1.
  • Amazon Web Services (AWS), which is configured with the awsprov_config.json and awsprov_templates.json files. The AWS provider requires IBM Spectrum LSF Fix Pack 2.
  • IBM Cloud (formerly IBM Bluemix and IBM SoftLayer), which is configured with the softlayerprov_config.json and softlayerprov_templates.json files. The IBM Cloud provider requires IBM Spectrum LSF Fix Pack 3.
  • Microsoft Azure, which is configured with the azureprov_config.json and azureprov_templates.json files. The Microsoft Azure provider requires IBM Spectrum LSF Fix Pack 3.
  • Google Compute Cloud, which is configured with the googleprov_config.json and googleprov_templates.json files. The Google Compute provider requires IBM Spectrum LSF Fix Pack 4.
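As an illustration, a provider configuration file holds connection settings such as the region and the location of credentials. The following awsprov_config.json sketch shows the general shape only; the key names and values here (LogLevel, AWS_REGION, AWS_CREDENTIAL_FILE, and the paths) are assumptions to verify against the documentation for your fix pack level:

```json
{
  "LogLevel": "INFO",
  "AWS_REGION": "us-west-2",
  "AWS_CREDENTIAL_FILE": "/opt/lsf/credentials/aws_credentials"
}
```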

LSF clusters can borrow hosts from a resource provider to satisfy pending workload. The borrowed resources join the LSF cluster as hosts. When the resources become idle, LSF resource connector returns them to the resource provider.

The resource connector generates requests for extra hosts from the resource provider and dispatches jobs to dynamic hosts that join the LSF cluster. When the resource provider reclaims the hosts, the resource connector requeues the jobs that are running on the LSF hosts, shuts down LSF daemons, and releases the hosts to the provider.

Requirements for configuring Resource Connector

The following are requirements for configuring IBM Spectrum LSF Resource Connector.
  • You must have root access to the LSF management host.
  • The LSF management host and the compute nodes (provider instances) must be able to reach each other.
  • You must be able to restart the LSF cluster.
  • You must be familiar with the provider's concepts and be able to perform administrative tasks.
  • The virtual network that the provider's virtual instances use must be configured so that the instances can communicate with the on-premises LSF hosts.
  • You must configure the provider instances to map users to the LSF cluster submission users. For example, add the submission users to the provider instance or synchronize the users on the launched provider instances.
  • Decide how you want LSF to authenticate to the provider to access their services.

Java requirements for LSF management host

The following are Java requirements for the IBM Spectrum LSF management host before configuring IBM Spectrum LSF Resource Connector.
  • Java Runtime Environment (JRE) version 8

How LSF borrows hosts from a resource provider

The following workflow summarizes how your job uses resources that are borrowed from a resource provider:
  1. A user submits a job to LSF as usual. The job generates demand, but the cluster does not have enough resources to service it, so LSF must borrow hosts from an external provider.
  2. The mbatchd daemon checks whether hosts that match the demand are already allocated, and LSF calculates how many hosts of each template type it requires to run the job. LSF sends this demand to the resource connector ebrokerd daemon.

    The administrator configures templates that represent LSF hosts. Each resource provider has its own template file that defines the mapping between LSF resource demand requests and hosts that the provider allocates to LSF. Each template in the file represents a set of hosts that share some attributes, such as the number of CPUs, the amount of available memory, the installed software stack, and the operating system image.

  3. Based on the demand from the submitted job, LSF resource connector makes an allocation request to the resource provider.

    For example, if the resource provider is EGO, resource connector makes an allocation request to EGO as the LSF_Consumer.

  4. For EGO resources, if enough resources are available in the rg_shared resource group, the allocation request succeeds.
  5. The ebrokerd daemon monitors the status of the request with the resource provider. When it detects that the request succeeded, it starts the LSF daemons on the allocated hosts and notifies LSF that the hosts joined the cluster and are ready to use.
  6. When the host joins the cluster, the job is dispatched to the host.
  7. When there is no more demand for borrowed resources, LSF notifies resource connector, and the ebrokerd daemon returns the resources to the provider.
  8. Some resource providers (for example, EGO) can also be configured to reclaim borrowed resources from LSF when they require the resources to satisfy their workload demand.
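As an illustration of the template file described in step 2, the following awsprov_templates.json sketch defines a single host type. All specific values here (the AMI ID, instance type, key name, and maximum count) are placeholders, and the field names are assumptions to check against the template file reference for your provider and LSF version:

```json
{
  "templates": [
    {
      "templateId": "aws-vm-1",
      "maxNumber": 4,
      "imageId": "ami-0123456789abcdef0",
      "vmType": "t2.medium",
      "keyName": "lsf-key",
      "attributes": {
        "type":  ["String", "X86_64"],
        "ncpus": ["Numeric", "2"],
        "mem":   ["Numeric", "4096"]
      }
    }
  ]
}
```

Each template maps an LSF resource demand (for example, a job that needs 2 CPUs and 4 GB of memory) to a concrete instance specification that the provider can allocate.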

Example of borrowing hosts from EGO

In the following example, the resource provider is EGO:
  1. bhosts -a
    HOST_NAME          STATUS                JL/U    MAX  NJOBS    RUN  SSUSP           USUSP    RSV   
    lsfmanagement      ok                       -      1      1      0      0               0      0 
  2. EGO has a host ego01 with ncpus=1.
  3. Resource connector is configured to connect to the EGO cluster.
  4. A template is created that provides a numeric attribute ncpus with range [1:1].
  5. The bsub command submits a job that requires a single slot.
  6. Eventually host ego01 joins the cluster, and the new job runs.
    bhosts -a   
    HOST_NAME          STATUS                JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
    lsfmanagement      ok                       -      1      1      1      0      0      0
    ego01              ok                       -      1      1      1      0      0      0
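The template from step 4 of this example might be sketched as follows in egoprov_templates.json. This is a minimal illustration: the field names (templateId, maxNumber) and the attribute tuple format are assumptions to verify against the EGO provider template reference for your LSF version; only the ncpus range [1:1] comes from the example above:

```json
{
  "templates": [
    {
      "templateId": "ego-host-1",
      "maxNumber": 1,
      "attributes": {
        "ncpus": ["Numeric", "[1:1]"]
      }
    }
  ]
}
```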

View the status of provisioned hosts

Use the bhosts -rc or the bhosts -rconly command to see information about resources that are provisioned by LSF resource connector.

The -rc and -rconly options make use of the third-party mosquitto message queue application. LSF resource connector publishes additional provider host information to be displayed by these bhosts options. The mosquitto binary file is included as part of the LSF distribution.

To use the -rc and -rconly options, LSF resource connector must be enabled with the LSB_RC_EXTERNAL_HOST_FLAG parameter in the lsf.conf file.

If you use the MQTT message broker that is distributed with LSF, you must configure the LSF_MQ_BROKER_HOSTS and MQTT_BROKER_HOST parameters in the lsf.conf file. The LSF_MQ_BROKER_HOSTS and MQTT_BROKER_HOST parameters must specify the same host name. The LSF_MQ_BROKER_HOSTS parameter enables LIM to start the mosquitto daemon.

If you use an existing MQTT message broker, you must configure the MQTT_BROKER_HOST parameter. You can optionally specify an MQTT broker port with the MQTT_BROKER_PORT parameter.

Use the ps command to check that the MQTT message broker daemon (mosquitto) is installed and running: ps -ef | grep mosquitto.

Configure the EBROKERD_HOST_CLEAN_DELAY to specify a delay, in minutes, after which the ebrokerd daemon removes information about relinquished or reclaimed hosts. This parameter allows the bhosts command to get LSF resource connector provider host information for some time after they are deprovisioned.
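A minimal sketch of the lsf.conf settings that the preceding paragraphs describe, assuming the MQTT broker distributed with LSF runs on a host named lsfmanagement. The values shown (the resource name aws, port 1883, and the 60-minute delay) are placeholders to adapt to your cluster:

```
# Enable resource connector host handling
LSB_RC_EXTERNAL_HOST_FLAG=aws

# Use the mosquitto broker shipped with LSF;
# both parameters must specify the same host name
LSF_MQ_BROKER_HOSTS=lsfmanagement
MQTT_BROKER_HOST=lsfmanagement

# Optional broker port (assumption: the default MQTT port)
MQTT_BROKER_PORT=1883

# Keep deprovisioned host information available to bhosts for 60 minutes
EBROKERD_HOST_CLEAN_DELAY=60
```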

Three more columns are shown in the bhosts command host list:
  • LSF resource connector status, which indicates where the host is in its life cycle:
    - Resource connector started the preprovisioning script for the new host.
    - The preprovisioning script returned an error.
    - The host is ready to join the LSF cluster.
    - A host reclaim request was received from the provider (for example, for an AWS Spot Instance).
    - LSF started to relinquish the host.
    - LSF finished relinquishing the host.
    - LSF sent a return request to the provider.
    - LSF started the postprovisioning script after the host was returned.
    - The host life cycle is complete.
  • Provider status. This status depends on the provider. For example, AWS has pending, running, shutting down, terminated, and others. Check the documentation for the provider to understand the status that is displayed.
  • Time stamp of the latest status change.

For hosts provisioned by resource connector, these columns show appropriate status values and a time stamp. A dash (-) is displayed in these columns for other hosts in the cluster.

For example,
bhosts -rc
HOST_NAME           STATUS   JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV  RC_STATUS      PROV_STATUS    UPDATED_AT
ec2-35-160-173-192  ok          -      1      0      0      0      0      0  Allocated      running        2017-04-07T12:28:46CDT
                    closed      -      1      0      0      0      0      0  -              -              -

The -l option shows more detailed information about provisioned hosts:
bhosts -rc -l
HOST  ec2-35-160-173-192
STATUS           CPUF  JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV  RC_STATUS      PROV_STATUS    UPDATED_AT                  DISPATCH_WINDOW
ok              60.00     -      1      0      0      0      0      0  Allocated      running        2017-04-07T12:28:46CDT      -

                r15s   r1m  r15m    ut    pg    io   ls    it   tmp   swp   mem  slots
 Total           1.0   0.0   0.0    1%   0.0    33    0     3 5504M    0M  385M      1
 Reserved        0.0   0.0   0.0    0%   0.0     0    0     0    0M    0M    0M      -

The -rconly option shows the status of all hosts that are provisioned by LSF resource connector, whether or not they have joined the cluster.

The following information is shown:
  • Public DNS name and IP address of the host.
  • Private DNS name and IP address of the host.
  • LSF resource connector status.
  • Resource provider status.
  • The RC_ACCOUNT value that is defined in the lsb.queues or lsb.applications files.
  • Time stamp of the latest status change.
For example,
bhosts -rconly 
  TEMPLATE : aws-vm-1 
    ec2-52-43-171-109.    Done                  terminated            default        2017-05-31T14:30:47CDT 
    ec2-35-160-157-112    Allocated             running               default        2017-05-31T14:32:00CDT 


Do not create advance reservations on AWS instances because the instances might be terminated after idle time. If an advance reservation is created on an instance, the reservation remains active even after the instance is destroyed. However, jobs cannot run on the instance because the LSF daemons are shut down on terminated instances, and the jobs become unavailable.

Hosts can be returned to their resource provider at any time by an idle-time or time-to-live policy, by EGO reclaim, or by AWS reclaim. The hosts might be closed or unavailable when the advance reservation starts.

It is also possible for resource connector to request more resources than its workload demands if a borrowed host joins the cluster but is not immediately usable by the scheduler.

If borrowed hosts cannot resolve each other's host names, commands like lsrcp do not work when used to copy files from one instance to another.

The HOSTS parameter in the lsb.queues file and the job-level -m option do not apply to borrowed hosts managed through the resource connector.

Administrators must use the RC_HOSTS parameter in the queue to specify the external resource providers from which resource connector can borrow hosts. A queue can borrow hosts only from the providers that the RC_HOSTS parameter defines. For example, if the queue defines only AWS resources (RC_HOSTS=awshost), it cannot borrow EGO or OpenStack resources.
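A hedged sketch of a queue definition in the lsb.queues file that restricts borrowing to the awshost resource and tags usage with an account value. The queue name and priority are illustrative; RC_HOSTS and RC_ACCOUNT are the parameters this document describes:

```
Begin Queue
QUEUE_NAME   = aws_queue
PRIORITY     = 30
RC_HOSTS     = awshost
RC_ACCOUNT   = default
End Queue
```

With this configuration, the RC_ACCOUNT value appears under the TAG column in the bhosts -rconly output.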

The RC_ACCOUNT parameter that is defined in an application profile in the lsb.applications file is not displayed by the bapp -l command. The bqueues -l command shows the value of the RC_ACCOUNT and RC_HOSTS parameters that are defined in queues. The bhosts -rconly option displays the RC_ACCOUNT value under the TAG column.

If you configure the LSF_MQ_BROKER_HOSTS parameter to enable the bhosts -rc and bhosts -rconly options to display resource provider host information, note that these options do not support host groups or compute units (CUs).