Resource management

The following new features affect resource management and allocation.

Request additional resources to allocate to running jobs

The new bresize request subcommand allows you to request additional tasks to be allocated to a running resizable job, which grows the job. This means that you can both grow and shrink a resizable job by using the bresize command.
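For example, the following command is a sketch that requests four additional tasks for job 123 (the exact syntax can vary by LSF version; run bresize -h to confirm the usage on your cluster):
bresize request 4 123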

Specify GPU resource requirements for your jobs

Specify all GPU resource requirements as part of job submission, or in a queue or application profile. Use the bsub -gpu option to submit jobs that require GPU resources. Specify how LSF manages GPU mode (exclusive or shared), and whether to enable the NVIDIA Multi-Process Service (MPS) for the GPUs that the job uses.

The LSB_GPU_NEW_SYNTAX parameter in the lsf.conf file enables jobs to use GPU resource requirements that are specified with the bsub -gpu option or in a queue or application profile.

Use the bsub -gpu option to specify GPU requirements for your job, or submit your job to a queue or application profile that configures GPU requirements in the GPU_REQ parameter.
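For example, the following submission is a sketch that requests two GPUs in shared mode with MPS enabled (./my_gpu_app is a placeholder application; see the bsub reference for the full requirement string syntax):
bsub -gpu "num=2:mode=shared:mps=yes" ./my_gpu_app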

Set a default GPU requirement by configuring the LSB_GPU_REQ parameter in the lsf.conf file.
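For example, the following lsf.conf sketch enables the new GPU requirement syntax and sets a cluster-wide default of one shared-mode GPU (the values shown are illustrative):
LSB_GPU_NEW_SYNTAX=Y
LSB_GPU_REQ="num=1:mode=shared:mps=no"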

Use the bjobs -l command to see the combined and effective GPU requirements that are specified for the job.

What's new in resource connector for IBM Spectrum LSF

Support for Microsoft Azure as a resource provider

LSF resource connector now supports Microsoft Azure as a resource provider. LSF clusters can launch instances from Microsoft Azure if the workload demand exceeds cluster capacity. The resource connector generates requests for additional hosts from the provider and dispatches jobs to dynamic hosts that join the LSF cluster. When demand drops, the resource connector shuts down the LSF slave daemons and cancels the allocated virtual servers.

To specify the configuration for provisioning from Microsoft Azure, use the azureprov_config.json and the azureprov_templates.json configuration files.
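For example, a minimal azureprov_templates.json file might look like the following sketch. The attribute names and values here are illustrative placeholders rather than an authoritative schema; see the resource connector reference for the supported attributes:
{
  "templates": [
    {
      "templateId": "azure-template-1",
      "maxNumber": 10,
      "imageId": "myLsfComputeNodeImage",
      "vmType": "Standard_D2_v2",
      "attributes": {
        "type": ["String", "X86_64"],
        "ncpus": ["Numeric", "2"]
      }
    }
  ]
}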

Submit jobs to use AWS Spot instances

Use Spot instances to bid on spare Amazon EC2 computing capacity. Since Spot instances are often available at a discount compared to the pricing of On-Demand instances, you can significantly reduce the cost of running your applications, grow your application’s compute capacity and throughput for the same budget, and enable new types of cloud computing applications.

Because Spot instances typically cost 50-90% less than On-Demand instances, you can significantly reduce your operating costs, or increase your compute capacity by 2-10 times within the same budget.

Spot instances are supported on any Linux x86 system that is supported by LSF.
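Spot capacity is configured per template in the awsprov_templates.json file. The following fragment is a sketch; attribute names such as spotPrice and fleetRole reflect typical resource connector Spot configuration, but verify them against the template reference for your LSF version:
{
  "templateId": "aws-spot-template",
  "maxNumber": 20,
  "imageId": "ami-0123456789abcdef0",
  "spotPrice": "0.10",
  "fleetRole": "arn:aws:iam::123456789012:role/EC2SpotFleetRole"
}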

Support federated accounts with temporary access tokens

LSF resource connector now supports federated accounts as an option, instead of requiring permanent AWS IAM account credentials. Federated users are external identities that are granted temporary credentials with secure access to resources in AWS, without requiring the creation of IAM users. Users are authenticated outside of AWS (for example, through Windows Active Directory).

Use the AWS_CREDENTIAL_SCRIPT parameter in the awsprov_config.json file to specify a path to the script that generates temporary credentials for federated accounts. For example,
AWS_CREDENTIAL_SCRIPT=/shared/dir/generateCredentials.py
LSF executes the script as the primary LSF administrator to generate temporary credentials before it creates the EC2 instance.
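The credential script itself can obtain the temporary credentials in any way that suits your site. The following Python sketch uses the AWS STS AssumeRole API through boto3; the role ARN is a placeholder, and the key=value output format is an assumption, so check the resource connector documentation for the format that your LSF version expects:
#!/usr/bin/env python
# Sketch: fetch temporary credentials with STS AssumeRole and print them
# for resource connector to consume. The output format is an assumption.
import boto3

sts = boto3.client('sts')
resp = sts.assume_role(
    RoleArn='arn:aws:iam::123456789012:role/LsfFederatedRole',  # placeholder
    RoleSessionName='lsf-resource-connector'
)
creds = resp['Credentials']
print('AWS_ACCESS_KEY_ID=%s' % creds['AccessKeyId'])
print('AWS_SECRET_ACCESS_KEY=%s' % creds['SecretAccessKey'])
print('AWS_SESSION_TOKEN=%s' % creds['SessionToken'])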

Support starting instances with an IAM role

IAM roles group AWS access control privileges together. A role can be assigned to an IAM user or an IAM instance profile. IAM instance profiles are containers for IAM roles that allow you to associate an EC2 instance with a role through the profile. The EC2 runtime environment contains temporary credentials that have the access control permissions of the profile's role.

To make the roles available for resource connector to create instances, use the instanceProfile attribute in the awsprov_templates.json file to specify an AWS IAM instance profile to assign to the requested instance. Resource connector uses that information to request EC2 compute instances with the specified instance profile. Jobs that run on those hosts use the temporary credentials that AWS provides to access the AWS resources that the profile's role has privileges for.
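For example, the following awsprov_templates.json fragment is a sketch that assigns an instance profile to a template (the ARN and the other attribute values are placeholders):
{
  "templateId": "aws-template-1",
  "maxNumber": 10,
  "imageId": "ami-0123456789abcdef0",
  "instanceProfile": "arn:aws:iam::123456789012:instance-profile/LsfHostProfile"
}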

Tag attached EBS volumes in AWS

The instanceTags attribute in the awsprov_templates.json file can tag attached EBS volumes with the same tags as the instance. EBS volumes in AWS are persistent block storage volumes that are used with an EC2 instance. EBS volumes can be expensive, so tagging them with the instance ID is useful for accounting purposes.
Note: The tags cannot start with the string aws:. This prefix is reserved for internal AWS tags. AWS gives an error if an instance or EBS volume is tagged with a keyword starting with aws:. Resource connector removes and ignores user-defined tags that start with aws:.
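For example, the following awsprov_templates.json fragment is a sketch of the instanceTags attribute (the key=value list format is typical, but confirm the exact separator syntax for your LSF version):
"instanceTags": "project=genomics;costcenter=1234"
Resource connector applies these tags to the instance and to its attached EBS volumes.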

Resource connector demand policies in queues

The RC_DEMAND_POLICY parameter in the lsb.queues file defines threshold conditions that determine whether demand is triggered to borrow resources through resource connector for the jobs in the queue. As long as the pending jobs in the queue meet at least one threshold condition, LSF expresses the demand to resource connector to trigger borrowing.

The demand policy that is defined by the RC_DEMAND_POLICY parameter can contain multiple conditions in an OR relationship. A condition is defined as [num_pend_jobs[,duration]]: the condition is met when the queue has more than the specified number of eligible pending jobs that are expected to run for at least the specified duration in minutes. The num_pend_jobs option is required; the duration is optional and defaults to 0 minutes.
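For example, the following lsb.queues sketch (the queue name and threshold values are illustrative) expresses demand when the queue has more than 5 eligible pending jobs that are expected to run for at least 10 minutes, or more than 50 such jobs of any expected duration:
Begin Queue
QUEUE_NAME       = cloudq
RC_DEMAND_POLICY = THRESHOLD[[5,10] [50]]
End Queue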

View the status of provisioned hosts with the bhosts -rc command

Use the bhosts -rc or the bhosts -rconly command to see the status of resources provisioned by LSF resource connector.

To use the -rc and -rconly options, the mosquitto binary file for the MQTT broker must be installed in LSF_SERVERDIR and running (check with the ps -ef | grep mosquitto command). The LSF_MQ_BROKER_HOSTS parameter must also be configured in the lsf.conf file.
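For example, in the lsf.conf file (the broker host name is a placeholder):
LSF_MQ_BROKER_HOSTS=hostA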

For hosts provisioned by resource connector, the RC_STATUS, PROV_STATUS, and UPDATED_AT columns show appropriate status values and a timestamp. For other hosts in the cluster, these columns are empty.

For example,
bhosts -rc
HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV RC_STATUS      PROV_STATUS    UPDATED_AT
ec2-35-160-173-192 ok              -      1      0      0      0      0      0 Allocated      running        2017-04-07T12:28:46CDT
lsf1.aws.          closed          -      1      0      0      0      0      0

The -l option shows more detailed information about provisioned hosts.
bhosts -rc -l
HOST  ec2-35-160-173-192.us-west-2.compute.amazonaws.com
STATUS           CPUF  JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV RC_STATUS      PROV_STATUS    UPDATED_AT             DISPATCH_WINDOW
ok              60.00     -      1      0      0      0      0      0 Allocated      running        2017-04-07T12:28:46CDT      -

 CURRENT LOAD USED FOR SCHEDULING:
                r15s   r1m  r15m    ut    pg    io   ls    it   tmp   swp   mem  slots
 Total           1.0   0.0   0.0    1%   0.0    33    0     3 5504M    0M  385M      1
 Reserved        0.0   0.0   0.0    0%   0.0     0    0     0    0M    0M    0M      -

The -rconly option shows the status of all hosts provisioned by LSF resource connector, regardless of whether they have joined the cluster.