IBM Spectrum LSF resource connector overview

The resource connector for IBM Spectrum LSF (previously referred to as host factory) enables LSF clusters to borrow resources from supported resource providers.

LSF resource connector plug-ins support the following resource providers:
  • IBM Spectrum Conductor with Spark and IBM Spectrum Symphony through EGO, which is configured with the egoprov_config.json and egoprov_templates.json files.
  • OpenStack, which is configured with the osprov_config.json and osprov_templates.json files. The OpenStack provider requires IBM Spectrum LSF Fix Pack 1.
  • Amazon Web Services (AWS), which is configured with the awsprov_config.json and awsprov_templates.json files. The AWS provider requires IBM Spectrum LSF Fix Pack 2.
  • IBM Cloud (formerly IBM Bluemix and IBM SoftLayer), which is configured with the softlayerprov_config.json and softlayerprov_templates.json files. The IBM Cloud provider requires IBM Spectrum LSF Fix Pack 3.
  • Microsoft Azure, which is configured with the azureprov_config.json and azureprov_templates.json files. The Microsoft Azure provider requires IBM Spectrum LSF Fix Pack 3.
  • Google Compute Cloud, which is configured with the googleprov_config.json and googleprov_templates.json files. The Google Compute provider requires IBM Spectrum LSF Fix Pack 4.
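As an illustration, a provider configuration file holds connection settings such as the region and the location of credentials. The following awsprov_config.json sketch shows the general shape only; the key names and values here (LogLevel, AWS_REGION, AWS_CREDENTIAL_FILE, and the paths) are assumptions to verify against the documentation for your fix pack level:

```json
{
  "LogLevel": "INFO",
  "AWS_REGION": "us-west-2",
  "AWS_CREDENTIAL_FILE": "/opt/lsf/credentials/aws_credentials"
}
```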

LSF clusters can borrow hosts from a resource provider to satisfy pending workload. The borrowed resources join the LSF cluster as hosts. When the resources become idle, LSF resource connector returns them to the resource provider.

The resource connector generates requests for extra hosts from the resource provider and dispatches jobs to dynamic hosts that join the LSF cluster. When the resource provider reclaims the hosts, the resource connector requeues the jobs that are running on the LSF hosts, shuts down LSF daemons, and releases the hosts to the provider.

Requirements for configuring Resource Connector

The following are requirements for configuring IBM Spectrum LSF Resource Connector.
  • You must have root access to the LSF management host.
  • The LSF management host and the compute nodes (provider instances) must be able to reach each other.
  • You must be able to restart the LSF cluster.
  • You must be familiar with the provider's concepts and be able to perform administrative tasks.
  • The virtual network that the provider's virtual instances use must be configured so that the instances can communicate with the on-premises LSF hosts.
  • You must configure the provider instances to map users to the LSF cluster submission users. For example, add the submission users to the provider instance or synchronize the users on the launched provider instances.
  • Decide how you want LSF to authenticate to the provider to access their services.

Java requirements for LSF management host

The following are Java requirements for the IBM Spectrum LSF management host before configuring IBM Spectrum LSF Resource Connector.
  • Java Runtime Environment (JRE) version 8

How LSF borrows hosts from a resource provider

The following workflow summarizes how your job uses resources that are borrowed from a resource provider:
  1. A user submits a job to LSF as usual. The job generates demand, but the cluster does not have enough resources to service it, so LSF must borrow hosts from an external provider.
  2. The mbatchd daemon checks whether hosts that match the demand are already allocated, and LSF calculates how many hosts of each template type it requires to run the job. LSF sends this demand to the resource connector ebrokerd daemon.

    The administrator configures templates that represent LSF hosts. Each resource provider has its own template file that defines the mapping between LSF resource demand requests and hosts that the provider allocates to LSF. Each template in the file represents a set of hosts that share some attributes, such as the number of CPUs, the amount of available memory, the installed software stack, and the operating system image.

  3. Based on the demand from the submitted job, LSF resource connector makes an allocation request to the resource provider.

    For example, if the resource provider is EGO, resource connector makes an allocation request to EGO as the LSF_Consumer.

  4. For EGO resources, if enough resources are available in the rg_shared resource group, the allocation request succeeds.
  5. The ebrokerd daemon monitors the status of the request with the resource provider. When it detects that the request succeeded, it starts the LSF daemons on the allocated hosts and notifies LSF that the hosts joined the cluster and are ready to use.
  6. When the host joins the cluster, the job is dispatched to the host.
  7. When there is no more demand for borrowed resources, LSF notifies resource connector, and the ebrokerd daemon returns the resources to the provider.
  8. Some resource providers (for example, EGO) can also be configured to reclaim borrowed resources from LSF when they require the resources to satisfy their workload demand.
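As an illustration of the template file described in step 2, the following awsprov_templates.json sketch defines a single host type. All specific values here (the AMI ID, instance type, key name, and maximum count) are placeholders, and the field names are assumptions to check against the template file reference for your provider and LSF version:

```json
{
  "templates": [
    {
      "templateId": "aws-vm-1",
      "maxNumber": 4,
      "imageId": "ami-0123456789abcdef0",
      "vmType": "t2.medium",
      "keyName": "lsf-key",
      "attributes": {
        "type":  ["String", "X86_64"],
        "ncpus": ["Numeric", "2"],
        "mem":   ["Numeric", "4096"]
      }
    }
  ]
}
```

Each template maps an LSF resource demand (for example, a job that needs 2 CPUs and 4 GB of memory) to a concrete instance specification that the provider can allocate.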

Example of borrowing hosts from EGO

In the following example, the resource provider is EGO:
  1. bhosts -a
    HOST_NAME          STATUS                JL/U    MAX  NJOBS    RUN  SSUSP           USUSP    RSV   
    lsfmanagement      ok                       -      1      1      0      0               0      0 
  2. EGO has a host ego01 with ncpus=1.
  3. Resource connector is configured to connect to the EGO cluster.
  4. A template is created that provides a numeric attribute ncpus with range [1:1].
  5. The bsub command submits a job that requires a single slot.
  6. Eventually host ego01 joins the cluster, and the new job runs.
    bhosts -a   
    HOST_NAME          STATUS                JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
    lsfmanagement      ok                       -      1      1      1      0      0      0
    ego01              ok                       -      1      1      1      0      0      0
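The template from step 4 of this example might be sketched as follows in egoprov_templates.json. This is a minimal illustration: the field names (templateId, maxNumber) and the attribute tuple format are assumptions to verify against the EGO provider template reference for your LSF version; only the ncpus range [1:1] comes from the example above:

```json
{
  "templates": [
    {
      "templateId": "ego-host-1",
      "maxNumber": 1,
      "attributes": {
        "ncpus": ["Numeric", "[1:1]"]
      }
    }
  ]
}
```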

View the status of provisioned hosts

Use the bhosts -rc or the bhosts -rconly command to see information about resources that are provisioned by LSF resource connector.

The -rc and -rconly options make use of the third-party mosquitto message queue application. LSF resource connector publishes additional provider host information to be displayed by these bhosts options. The mosquitto binary file is included as part of the LSF distribution.

To use the -rc and -rconly options, LSF resource connector must be enabled with the LSB_RC_EXTERNAL_HOST_FLAG parameter in the lsf.conf file.

If you use the MQTT message broker that is distributed with LSF, you must configure the LSF_MQ_BROKER_HOSTS and MQTT_BROKER_HOST parameters in the lsf.conf file. The LSF_MQ_BROKER_HOSTS and MQTT_BROKER_HOST parameters must specify the same host name. The LSF_MQ_BROKER_HOSTS parameter enables LIM to start the mosquitto daemon.

If you use an existing MQTT message broker, you must configure the MQTT_BROKER_HOST parameter. You can optionally specify an MQTT broker port with the MQTT_BROKER_PORT parameter.

Use the ps command to check that the MQTT message broker daemon (mosquitto) is installed and running: ps -ef | grep mosquitto.

Configure the EBROKERD_HOST_CLEAN_DELAY to specify a delay, in minutes, after which the ebrokerd daemon removes information about relinquished or reclaimed hosts. This parameter allows the bhosts command to get LSF resource connector provider host information for some time after they are deprovisioned.
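A minimal sketch of the lsf.conf settings that the preceding paragraphs describe, assuming the MQTT broker distributed with LSF runs on a host named lsfmanagement. The values shown (the resource name aws, port 1883, and the 60-minute delay) are placeholders to adapt to your cluster:

```
# Enable resource connector host handling
LSB_RC_EXTERNAL_HOST_FLAG=aws

# Use the mosquitto broker shipped with LSF;
# both parameters must specify the same host name
LSF_MQ_BROKER_HOSTS=lsfmanagement
MQTT_BROKER_HOST=lsfmanagement

# Optional broker port (assumption: the default MQTT port)
MQTT_BROKER_PORT=1883

# Keep deprovisioned host information available to bhosts for 60 minutes
EBROKERD_HOST_CLEAN_DELAY=60
```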

Three more columns are shown in the bhosts command host list:
  • LSF resource connector status, which indicates where the host is in its life cycle:
    - Resource connector started the preprovisioning script for the new host.
    - The preprovisioning script returned an error.
    - The host is ready to join the LSF cluster.
    - A host reclaim request was received from the provider (for example, for an AWS Spot Instance).
    - LSF started to relinquish the host.
    - LSF finished relinquishing the host.
    - LSF sent a return request to the provider.
    - LSF started the postprovisioning script after the host was returned.
    - The host life cycle is complete.
  • Provider status. This status depends on the provider. For example, AWS has pending, running, shutting down, terminated, and others. Check the documentation for the provider to understand the status that is displayed.
  • Time stamp of the latest status change.

For hosts provisioned by resource connector, these columns show appropriate status values and a time stamp. A dash (-) is displayed in these columns for other hosts in the cluster.

For example,
bhosts -rc
HOST_NAME           STATUS   JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV  RC_STATUS      PROV_STATUS    UPDATED_AT
ec2-35-160-173-192  ok          -      1      0      0      0      0      0  Allocated      running        2017-04-07T12:28:46CDT
                    closed      -      1      0      0      0      0      0  -              -              -

The -l option shows more detailed information about provisioned hosts:
bhosts -rc -l
HOST  ec2-35-160-173-192
STATUS           CPUF  JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV  RC_STATUS      PROV_STATUS    UPDATED_AT                  DISPATCH_WINDOW
ok              60.00     -      1      0      0      0      0      0  Allocated      running        2017-04-07T12:28:46CDT      -

                r15s   r1m  r15m    ut    pg    io   ls    it   tmp   swp   mem  slots
 Total           1.0   0.0   0.0    1%   0.0    33    0     3 5504M    0M  385M      1
 Reserved        0.0   0.0   0.0    0%   0.0     0    0     0    0M    0M    0M      -

The -rconly option shows the status of all hosts that are provisioned by LSF resource connector, whether or not they have joined the cluster.

The following information is shown:
  • Public DNS name and IP address of the host.
  • Private DNS name and IP address of the host.
  • LSF resource connector status.
  • Resource provider status.
  • The RC_ACCOUNT value that is defined in the lsb.queues or lsb.applications files.
  • Time stamp of the latest status change.
For example,
bhosts -rconly 
  TEMPLATE : aws-vm-1 
    ec2-52-43-171-109.    Done                  terminated            default        2017-05-31T14:30:47CDT 
    ec2-35-160-157-112    Allocated             running               default        2017-05-31T14:32:00CDT 


Do not create advance reservations on AWS instances because the instances might be terminated after idle time. If an advance reservation is created on an instance, the reservation remains active even after the instance is destroyed. However, jobs cannot run on the instance because the LSF daemons are shut down on terminated instances, and the jobs become unavailable.

Hosts can be returned to their resource provider at any time by an idle-time or time-to-live policy, by EGO reclaim, or by AWS reclaim. The hosts might be closed or unavailable when the advance reservation starts.

It is also possible for resource connector to request more resources than its workload demands if a borrowed host joins the cluster but is not immediately usable by the scheduler.

If borrowed hosts cannot resolve each other's host names, commands like lsrcp do not work when used to copy files from one instance to another.

The HOSTS parameter in the lsb.queues file and the job-level -m option do not apply to borrowed hosts managed through the resource connector.

Administrators must use the RC_HOSTS parameter in the queue to specify the external resource providers from which resource connector can borrow hosts. A queue can borrow hosts only from the providers that the RC_HOSTS parameter defines. For example, if the queue defines only AWS resources (RC_HOSTS=awshost), it cannot borrow EGO or OpenStack resources.
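A hedged sketch of a queue definition in the lsb.queues file that restricts borrowing to the awshost resource and tags usage with an account value. The queue name and priority are illustrative; RC_HOSTS and RC_ACCOUNT are the parameters this document describes:

```
Begin Queue
QUEUE_NAME   = aws_queue
PRIORITY     = 30
RC_HOSTS     = awshost
RC_ACCOUNT   = default
End Queue
```

With this configuration, the RC_ACCOUNT value appears under the TAG column in the bhosts -rconly output.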

The RC_ACCOUNT parameter that is defined in an application profile in the lsb.applications file is not displayed by the bapp -l command. The bqueues -l command shows the value of the RC_ACCOUNT and RC_HOSTS parameters that are defined in queues. The bhosts -rconly option displays the RC_ACCOUNT value under the TAG column.

If you configure the LSF_MQ_BROKER_HOSTS parameter to enable the bhosts -rc and bhosts -rconly options to display resource provider host information, note that these options do not support host groups or compute units (CUs).