Resource planning for compute-intensive workloads like EDA has always been a challenge that requires compromise.
The middle ground between the cost of compute resources and the cost of delayed decisions is hard to find. With the advent of cloud bursting, a new flexibility has arrived to break this standoff. Now, when your data center capacity is strained, your existing IBM Spectrum LSF cluster can be extended to the IBM Cloud, where virtually unlimited resources are available and you only pay for what you use.
By adding cloud bursting to your LSF Cluster, you can choose how much capacity to employ to suit your business needs. When time is money, the cloud is ready. When demand is low, the meter stops and the cloud will wait.
In this blog post, we will talk about building a proof-of-concept cloud-bursting EDA workload environment using an IBM Spectrum LSF cluster that is running in an existing on-premises data center, IBM-provided automation scripting and documentation, and, most importantly, the IBM Cloud.
Start with a pre-existing, on-premises cluster
The cluster we use is located in a Yorktown, NY lab. This cluster is minimal and consists of two nodes—a master and a worker. A production cluster or a next-stage proof of concept might contain hundreds of nodes and thousands of cores. For the purposes of this introductory proof of concept, the size of the on-premises cluster is unimportant. Our main interest is seeing how work can be shifted to cloud resources.
Use our existing automation to build the cloud cluster
The on-premises cluster can be either an existing test or production cluster or possibly a minimal cluster you have put together for the purposes of exploring cloud bursting. The next step is building the cloud portion of the multicluster. Again, size is probably not important at this point, so a minimal cluster is a good place to start. Our reference cloud cluster has three nodes—one master and two workers.
If you set out to build a cloud cluster (the cloud half of a multicluster) from scratch, there is a long list of provisioning and configuration tasks, including the following (and probably more):
- Installing the required software packages on your deployer
- Provisioning a VPC
- Provisioning the virtual instances (master(s) and workers)
- Provisioning a DNS service
- Configuring and connecting a VPN that connects the on-premises cluster to the cloud
- Installing and configuring IBM Spectrum LSF
Luckily, IBM has created and made available comprehensive automation that is built on the modern cloud workhorses—Ansible and Terraform. Not only does this automation make the initial setup of a proof-of-concept cluster straightforward, it is the underlying toolset that will deliver the fast provisioning and teardown of resources that define cloud bursting.
In addition, a tutorial was written to provide step-by-step instructions on using the automation to create the cluster.
Build the multicluster
If you have not already done so, use the tutorial and automation to build the cluster on IBM Cloud. After this cloud cluster is built, you will have a functioning multicluster environment. The next step is to extend an EDA workload across the clusters.
Bring in the workload
As a proof of concept, we have chosen two common EDA packages to run on our multicluster: Optical Proximity Correction (OPC) and Design Rule Checking (DRC). Depending on your EDA vendor and the packages you intend to run, you will likely encounter specific challenges in bringing up your workload on a multicluster that are beyond the scope of this blog entry. We hope this general discussion of how we ran our workload, and of some of the challenges we overcame, will help you in building your cloud-bursting proof of concept.
Run your workload on the cloud cluster
Before you can run your workload, you will need to prepare the cloud cluster by installing the software for your EDA workload or, alternatively, use data dependencies to bring over needed software as part of the bstage in process (or, possibly, some combination of the two).
You will need to give the cluster access to the license service. See the Certificates and License Management section for more information.
Depending on your workload characteristics, you may need to ensure that jobs are sent to a particular node for processing. This can be accomplished with the bsub -R command. This was useful for our workload because the initial deployment of a job is very resource-intensive: work is first divided into tiles (subtasks) and then distributed to workers.
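As a minimal sketch of the bsub -R usage described above, the following Python snippet assembles a submission command line without running it. The script name, input file, and the requirement string are hypothetical placeholders; a requirement that targets one specific host would go in the same position.

```python
import shlex


def build_bsub_command(job_command, resource_requirement, queue=None):
    """Build an argv list that submits job_command with a -R requirement."""
    argv = ["bsub"]
    if queue is not None:
        argv += ["-q", queue]
    # -R takes a single resource requirement string.
    argv += ["-R", resource_requirement]
    argv += shlex.split(job_command)
    return argv


# Hypothetical requirement: only hosts reporting more than 64000 units of
# available memory (the unit depends on the cluster configuration).
requirement = "select[mem>64000]"
submit_argv = build_bsub_command("./run_opc.sh layout.oas", requirement)

# Print a copy-pasteable command line instead of running it.
print(shlex.join(submit_argv))
```

Building the command as a list keeps the requirement string as one argument, so the brackets and comparison operators are not reinterpreted by a shell.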
Moving EDA tasks to the cloud requires careful attention to data management for several reasons:
- It is likely that the two clusters will not share a single filesystem.
- The connection between the on-premises and cloud clusters will, to a varying extent, be bandwidth-limited.
- Depending on your terms of service, minimizing data movement on and off the cloud can reduce cost.
The Spectrum LSF data manager should have been installed and configured on both the on-premises and cloud clusters as part of the Deployment step of the automated setup process. The following were some of the key points in configuring data management for our workload:
- There is an additional setup step that, for security reasons, requires manual intervention. Each user will need to log in to the cloud master, obtain their ssh public key, and add that key to the authorized keys of the on-premises master.
- When a job is submitted, the user will need to point out data dependencies employing the -data option to the LSF
- The user's LSF jobs will need to make the input data available for processing with the bstage in command and make the job output available for post-processing by using the bstage out command. This can be as simple as wrapping the existing run-script in bstage in/out commands to transfer all required data. This can include binaries or scripts as well, as long as they are executed after
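The data-management points above can be sketched as follows. This is an illustration under stated assumptions, not the configuration we used: the file paths, script names, and output file are hypothetical, and the wrapper assumes that staged files land in the job's working directory. It generates the text of a wrapper script and the matching submission command, and prints both rather than running them.

```python
import shlex


def build_submit_command(wrapper_script, input_files):
    """Build a bsub argv list with one -data option per required input file."""
    argv = ["bsub"]
    for path in input_files:
        argv += ["-data", path]
    argv.append(wrapper_script)
    return argv


def build_wrapper_script(run_command, output_files):
    """Return shell script text: stage in, run, then stage out each output."""
    lines = [
        "#!/bin/sh",
        "set -e  # stop at the first failing step",
        "bstage in -all",
        run_command,
    ]
    for path in output_files:
        lines.append("bstage out -src " + shlex.quote(path))
    return "\n".join(lines) + "\n"


# Hypothetical inputs: a layout file and the run script itself.
inputs = ["/proj/eda/layout.oas", "/proj/eda/run_drc.sh"]
wrapper_text = build_wrapper_script("sh ./run_drc.sh layout.oas", ["drc_results.db"])
submit_argv = build_submit_command("./drc_wrapper.sh", inputs)

print(wrapper_text)
print(shlex.join(submit_argv))
```

Note that the run script is listed as a data dependency too, so it is only invoked after the stage-in line of the wrapper.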
Certificates and license management
Licensing an EDA workload for a multicluster that spans on-premises and cloud environments is a fairly new and still-developing area of license management. For our proof-of-concept workload, we used a FlexLM floating license tied to a cloud server in the IBM Cloud London data center. By configuring the transit gateway to span availability zones, we were able to run our workload in the Dallas data center (where the cloud portion of our multicluster resides), with licensing provided by a license server in the London data center. This scenario is, of course, particular to our workload vendor and licensing terms; we describe it only to illustrate how features of the IBM Cloud VPC can be employed to assist in license management.
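For readers less familiar with floating licenses, the sketch below shows one common way a job's environment is pointed at a remote FlexLM server: the LM_LICENSE_FILE variable set to a port@host value. The host name and port are hypothetical, and some EDA vendors read a vendor-specific variable instead, so treat this as an assumption to check against your vendor's documentation.

```python
import os


def license_environment(server_host, port, base_env=None):
    """Return a copy of the environment with LM_LICENSE_FILE set to port@host."""
    env = dict(os.environ if base_env is None else base_env)
    env["LM_LICENSE_FILE"] = f"{port}@{server_host}"
    return env


# Hypothetical license server reachable from the cloud cluster's network.
job_env = license_environment("license.london.example.com", 27000, base_env={})
print(job_env["LM_LICENSE_FILE"])
```

A dictionary like this could be passed as the env argument when launching the workload from Python, or the same assignment could be exported in the job's run script.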
Monitor the work
Since the multicluster consists of two cooperating clusters, once a job is sent to the cloud cluster—and until it completes—the on-premises cluster’s job and queue monitoring commands will have limited information about the job’s status. There may be times when you would like to see detailed status information while the job is in progress. This can be accomplished by logging into the cloud cluster’s console and running monitoring commands there.
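As a rough illustration of the monitoring step, the snippet below lists a few standard LSF status commands that could be run on the cloud cluster and executes them only when LSF is on the PATH. The job ID is a hypothetical placeholder and is assumed to be the ID as known to the cloud cluster.

```python
import shutil
import subprocess


def monitoring_commands(job_id):
    """Return argv lists for a detailed job view plus host and queue status."""
    return [
        ["bjobs", "-l", str(job_id)],  # long-format details for one job
        ["bhosts"],                    # per-host status and job slots
        ["bqueues"],                   # per-queue status and job counts
    ]


def run_if_lsf_available(commands):
    """Run the commands if LSF is installed; otherwise return None."""
    if shutil.which(commands[0][0]) is None:
        return None
    return [
        subprocess.run(cmd, capture_output=True, text=True).stdout
        for cmd in commands
    ]


commands = monitoring_commands(1234)  # hypothetical job ID
outputs = run_if_lsf_available(commands)
if outputs is None:
    # No LSF here: just show what would be run on the cloud master.
    for cmd in commands:
        print(" ".join(cmd))
else:
    for text in outputs:
        print(text)
```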
Happy cloud bursting!
Beyond the EDA workload setup instructions in this blog post, most of what is needed to start cloud bursting your workload is handled by the automation scripts. Together, they should give you the foundation for setting up your own proof-of-concept EDA environment on IBM Cloud. We hope that you’ll give it a try!