December 16, 2021 By Takeshi Yoshimura 5 min read

An examination of the experimental results of genomics on Red Hat OpenShift.

TL;DR

  • Cloud reduces the cost of genome analysis.
  • OpenShift simplifies hybrid cloud management with standardized interfaces.
  • We extended Cromwell for OpenShift and multiple cloud object storage services.
  • A GATK workflow for Google Cloud can run on IBM Cloud with our extension.
  • ClusterAutoscaler further improved cost-efficiency.

Genome analysis

Understanding the structure of human DNA is essential to the medical and life sciences. A DNA sequence is represented as a string of the initials of four nucleotide bases (A, C, G and T), such as TACTTGATC. Discovering variants in these sequences (insertions, deletions, etc.) helps researchers understand diseases, for example.

This blog post focuses on GATK, a popular genome analysis toolkit that provides essential command-line utilities, and on Cromwell, a genomics workflow engine. Users write a WDL (Workflow Description Language) file that chains multiple GATK commands, so that they can reproduce or reuse a genome analysis with Cromwell.

From the perspective of computing infrastructure, the challenge of genome analysis is its data scale and the computation power it requires. A single genome is generated as a dataset of roughly 100 GB that includes errors and redundancy. Typical genomics workflows start with data preprocessing and then move on to variant discovery, parallelizing the work across the input datasets.

Genomics in the cloud: Pros and cons

These characteristics make genomics a typical data analytics workload that matches the on-demand availability of computation and storage, a well-known feature of the cloud. You pay for additional Linux machines only when you need them. Cloud object storage provides cheap, virtually unlimited storage without hardware wear. In contrast, using physical machines means purchasing machines with fixed hardware configurations before starting any workloads. GATK and Cromwell already support various cloud backends, such as AWS Batch and Google Cloud Life Sciences, which manage computing resources in their public cloud infrastructure.

A drawback of using the cloud is that it is not practical to switch to another cloud after you have placed a huge amount of data in a particular cloud object storage service. Another problematic scenario is that you may also need to run genome analysis on your “private” cloud to meet data location and privacy constraints. In both cases, you need to learn how to use different cloud systems, since many cloud functionalities are not standardized.

Genomics on Red Hat OpenShift

Red Hat OpenShift standardizes cluster management so that a computing cluster can be managed much like a single Linux machine. It abstracts the backend cloud infrastructure and provides advanced features for running applications efficiently on a cloud cluster. Users can simplify cluster management regardless of the underlying cloud infrastructure, whether public or private.

We therefore extended Cromwell to run as an OpenShift application and meet these complex hybrid cloud requirements. Our extension uses OpenShift as the job dispatcher for genomics workflows, assuming cloud object storage is mounted in the cluster. It lets us use multiple cloud object storage services and simplifies job deployment. As a result, we confirmed that IBM Cloud can run existing workflows that are written for Google Cloud and AWS. We demonstrate this in the Demonstration section below.

Another benefit of using OpenShift is its high customizability. An example is ClusterAutoscaler, which enables OpenShift to automatically manage the number of computing nodes in a cluster according to recent resource usage. OpenShift can also be customized to mount cloud object storage as a normal Linux filesystem using csi-s3. This feature is important in practice, since many genomics workflows depend on Linux command lines and scripts that operate on local files.

Demonstration

As a demonstration, we ran a best-practice workflow distributed by the Broad Institute. The workflow runs data preprocessing and variant discovery with parallelized HaplotypeCaller on public datasets in Google Cloud Storage, and we configured it to write outputs to IBM Cloud Object Storage in Tokyo.

Our demonstration ran on Red Hat OpenShift on IBM Cloud. We set up 1 to 24 bx2.32x128 worker nodes (32 Cascade Lake vCPUs and 128 GB of RAM each) of OpenShift 4.6 in a Tokyo availability zone with ClusterAutoscaler. For comparison, we also ran the workflow on a static cluster of 24 worker nodes. We used a very aggressive deletion policy: nodes were deleted after three minutes of unused time.
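To make the three-minute deletion policy concrete, here is a minimal Python sketch of the decision it implies. It is not the ClusterAutoscaler implementation, which also checks whether running pods can be rescheduled elsewhere. The `Node` class, the `nodes_to_delete` function and the idle times are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass

UNNEEDED_SECONDS = 3 * 60  # the three-minute threshold described in the post


@dataclass
class Node:
    name: str            # hypothetical node name
    idle_seconds: float  # how long the node has been unused


def nodes_to_delete(nodes, min_nodes=1, unneeded_seconds=UNNEEDED_SECONDS):
    """Pick idle nodes for deletion while keeping at least min_nodes alive."""
    candidates = sorted(
        (n for n in nodes if n.idle_seconds >= unneeded_seconds),
        key=lambda n: n.idle_seconds,
        reverse=True,  # delete the longest-idle nodes first
    )
    max_deletable = max(len(nodes) - min_nodes, 0)
    return [n.name for n in candidates[:max_deletable]]


# Hypothetical cluster state, for illustration only.
cluster = [
    Node("worker-1", 0),
    Node("worker-2", 200),
    Node("worker-3", 45),
    Node("worker-4", 600),
]
doomed = nodes_to_delete(cluster)
```

With this input, only the two nodes that have been idle for at least 180 seconds are selected, and a single-node cluster is never scaled to zero because `min_nodes` defaults to 1.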

Cost and performance

The cost and performance of the best-practice workflow are summarized in the following list:

  • Runtime hours
    • Autoscaling: 3.4
    • Static: 2.8
  • Node hours
    • Autoscaling: 26.9
    • Static: 66.4
  • Billing node hours
    • Autoscaling: 50.0
    • Static: 72.0
  • Estimated server cost
    • Autoscaling: $25
    • Static: $36
  • Estimated cost per hour
    • Autoscaling: $7.4/h
    • Static: $12.9/h

Autoscaling increased the runtime of the workflow by 21% (3.4 hours vs. 2.8 hours) but reduced the total server cost by about 31% ($25 vs. $36; put another way, the static cluster cost 44% more). “Node hours” are the total hours of node runtime during the experimental period. Many of the nodes under autoscaling stopped within an hour, so the node hours of the autoscaling cluster were much lower than those of the static cluster.

“Billing node hours” are the node hours that would be charged under an hourly billing model, which we use to estimate cost. Under this model, any usage between node start and termination, even a single second, is charged as a full hour of node usage. The autoscaling cluster should cost $0.5/hour x 50.0 = $25. The static cluster should cost $0.5/hour x 72.0 = $36.
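The hourly billing estimate can be reproduced with a few lines of Python. This is a minimal sketch: the $0.5 hourly rate and the 50.0 and 72.0 billing node hours come from the post, while the function names and the per-node runtimes in `example_runtimes` are hypothetical and only illustrate the rounding.

```python
import math

HOURLY_RATE_USD = 0.5  # per-node hourly price used in the post's estimate


def billed_hours(node_runtimes_hours):
    """Round each node's runtime up to whole hours, as hourly billing does."""
    return sum(math.ceil(h) for h in node_runtimes_hours)


def estimated_cost(billing_node_hours, rate=HOURLY_RATE_USD):
    """Estimated server cost in USD for a number of billed node hours."""
    return billing_node_hours * rate


# Hypothetical per-node runtimes in hours, for illustration only:
# they are billed as 1 + 1 + 2 + 4 = 8 node hours.
example_runtimes = [0.2, 0.9, 1.1, 3.4]
example_billed = billed_hours(example_runtimes)

# Billing node hours reported in the post.
autoscaling_cost = estimated_cost(50.0)
static_cost = estimated_cost(72.0)
```

Because every node start is rounded up to a full hour, short-lived nodes are billed for more time than they actually run, which is why the billing node hours of the autoscaling cluster (50.0) are well above its node hours (26.9).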

Number of nodes

The figure below shows our trace of the number of nodes during the genome workflow:

This GATK workflow has two big spikes and one small spike in resource consumption. These spikes come from “scatter” parallelization. After each scatter, the workflow uses far fewer resources in the “gather” phase. These workload characteristics strongly motivate the use of OpenShift's ClusterAutoscaler.
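The scatter and gather pattern behind these spikes can be sketched in Python. This is a toy illustration, not the GATK workflow itself: counting bases per shard stands in for a per-interval step such as HaplotypeCaller, and all function names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

BASES = ("A", "C", "G", "T")


def split_into_shards(sequence, shard_size):
    """Scatter step: cut the input into independent shards."""
    return [sequence[i:i + shard_size] for i in range(0, len(sequence), shard_size)]


def count_bases(shard):
    """Per-shard work; a stand-in for an expensive per-interval analysis."""
    counts = {base: 0 for base in BASES}
    for base in shard:
        counts[base] += 1
    return counts


def gather(partials):
    """Gather step: merge per-shard results into one result."""
    total = {base: 0 for base in BASES}
    for partial in partials:
        for base, n in partial.items():
            total[base] += n
    return total


def scatter_gather(sequence, shard_size=4, workers=4):
    shards = split_into_shards(sequence, shard_size)
    # Many workers are busy here (the "spike") ...
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_bases, shards))
    # ... while the merge runs on a single worker.
    return gather(partials)


result = scatter_gather("TACTTGATC")
```

The scatter phase can use as many workers as there are shards, while the gather phase needs only one. On a cluster, that asymmetry is what leaves nodes idle between spikes.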

Cluster CPU/memory utilization

The Cromwell backend requests resources according to the WDL file definition. A WDL file has a runtime section that specifies the required amounts of CPU and memory. Other backends, such as AWS Batch and the Google Cloud Life Sciences API, also use this WDL feature to decide the size of worker nodes:

The above figures show the ratio of actual resource usage to the total resources reserved in the cluster. Autoscaling increased memory utilization, especially toward the end of the workflow, by reducing the number of nodes. However, CPU utilization still peaked at only around 35%, which was lower than we expected. A major reason is that the resource requirements specified in WDL files are overestimated. Precise resource estimation is challenging in practice, but we believe other OpenShift features (like Horizontal Pod Autoscaler) could help address this issue.
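As a rough illustration of how overestimated requests lower this ratio, the sketch below computes cluster CPU utilization as used CPU divided by reserved CPU. It assumes that “reserved” means the sum of the CPU requests of running pods. The pod values are hypothetical and were chosen only so that the result lands near the 35% mentioned above; they are not measurements from the experiment.

```python
def utilization(used, reserved):
    """Ratio of actual usage to reserved resources (0.0 to 1.0 in normal cases)."""
    if reserved <= 0:
        raise ValueError("reserved must be positive")
    return used / reserved


# Hypothetical pods: CPU cores requested via the WDL runtime section
# versus CPU cores actually used. Not data from the experiment.
pods = [
    {"cpu_requested": 16.0, "cpu_used": 4.0},
    {"cpu_requested": 8.0, "cpu_used": 4.0},
    {"cpu_requested": 8.0, "cpu_used": 3.2},
]

total_requested = sum(p["cpu_requested"] for p in pods)  # 32 cores reserved
total_used = sum(p["cpu_used"] for p in pods)            # about 11.2 cores used
cluster_cpu_utilization = utilization(total_used, total_requested)
```

In this example the first pod requests four times the CPU it uses, which alone pulls cluster utilization well below 50%. Autoscaling cannot fix this, because the scheduler sizes the cluster from requests rather than from actual usage.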

Summary

In this blog post, we presented our experimental results of genomics on Red Hat OpenShift. OpenShift enables genomics on various underlying infrastructures, with customizations such as storage software and cluster autoscaling. As a demonstration, we reused an existing workflow written for Google Cloud on Red Hat OpenShift on IBM Cloud. The experimental results showed that cluster autoscaling improved resource utilization and reduced the estimated server cost.

Learn more about Red Hat OpenShift on IBM Cloud.

Learn more about IBM Cloud Object Storage.
