Genomics on Red Hat OpenShift


An examination of the experimental results of genomics on Red Hat OpenShift.

TL;DR

  • Cloud reduces the cost of genome analysis.
  • OpenShift simplifies hybrid cloud management with standardized interfaces.
  • We extended Cromwell for OpenShift and multiple cloud object storage services.
  • A GATK workflow for Google Cloud can run on IBM Cloud with our extension.
  • ClusterAutoscaler further improved cost-efficiency.

Genome analysis

Understanding human DNA is an essential element of medical and life sciences. A DNA sequence is represented as a string over the initials of its four nucleobases (A, C, G, and T), such as TACTTGATC. Discovering variants in these sequences (insertions, deletions, etc.) helps researchers understand diseases, for example.

This blog post focuses on GATK, a popular genome analysis toolkit that provides essential utilities and command-line tools, and on Cromwell, a genomics workflow engine. Users can write a WDL (Workflow Description Language) file that chains multiple GATK commands so that they can reproduce or reuse a genome analysis with Cromwell.

From the perspective of computing infrastructure, the challenge of genome analysis is its data scale and the computation power it requires. A single genome file is generated as a 100 GB dataset, including errors and redundancy. Typical genomics workflows start with data preprocessing and then proceed to parallelized variant discovery on the input datasets.

Genomics in the cloud: Pros and cons

These characteristics make genomics a typical data analytics workload that matches the on-demand availability of computation and storage, a well-known feature of the cloud. You pay for additional Linux machines only when you need them, and cloud object storage provides cheap, virtually unlimited storage with no hardware wear to manage. In contrast, using physical machines means purchasing machines with fixed hardware configurations before starting any workload. GATK and Cromwell already support various cloud backends, such as AWS Batch and Google Cloud Life Sciences, which manage computing resources in their respective public cloud infrastructure.

A drawback of using the cloud is that switching to another cloud is impractical once you have placed a huge amount of data in a particular cloud object storage service. Another problematic scenario is that you may need to run genome analysis on your private cloud to meet data location and privacy constraints. In both cases, you need to learn how to use different cloud systems, since many cloud functionalities are not standardized.

Genomics on Red Hat OpenShift

Red Hat OpenShift standardizes cluster management so that a computing cluster can be handled much like a single Linux machine. It abstracts the backend cloud infrastructure and provides advanced features for running applications efficiently on a cloud cluster. Users can thus simplify cluster management regardless of the underlying infrastructure, whether public or private.

We therefore extended Cromwell as an OpenShift application to meet these hybrid cloud requirements. Our extension uses OpenShift as the job dispatcher for genomics workflows, assuming that cloud object storage is mounted in the cluster. It supports multiple cloud object storage services and simplifies job deployment. As a result, we confirmed that IBM Cloud can run existing workflows written for Google Cloud and AWS. We demonstrate this in the Demonstration section below.

Another benefit of using OpenShift is its high customizability. One example is ClusterAutoscaler, which lets OpenShift automatically adjust the number of computing nodes in a cluster according to recent resource usage. OpenShift can also be customized to mount cloud object storage as a normal Linux filesystem using csi-s3. This feature is important in practice because many genomics workflows depend on Linux commands and scripts that operate on local files.

Demonstration

As a demonstration, we ran a best-practice workflow distributed by the Broad Institute. The workflow runs data preprocessing and variant discovery with parallelized HaplotypeCaller on public datasets in Google Cloud Storage, and we configured it to write outputs to IBM Cloud Object Storage in Tokyo.

Our demonstration ran on Red Hat OpenShift on IBM Cloud. We set up an OpenShift 4.6 cluster in a Tokyo availability zone with ClusterAutoscaler, scaling between 1 and 24 bx2.32x128 worker nodes (32 Cascade Lake vCPUs and 128 GB of RAM each). For comparison, we also ran a static cluster with 24 worker nodes. We used a very aggressive deletion policy: nodes were deleted after three minutes of being unused.
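The deletion policy can be pictured with a small Python sketch. This is a simplified model under our own assumptions, not the actual ClusterAutoscaler algorithm: the `Node` type, the `nodes_to_delete` function, and the node names are hypothetical, and only the three-minute threshold and the one-node minimum come from the setup described above.

```python
from dataclasses import dataclass
from typing import List

# Values from the setup above; the decision rule is a simplified assumption.
MIN_NODES = 1
UNNEEDED_SECONDS = 3 * 60  # a node becomes removable after three idle minutes


@dataclass
class Node:
    name: str            # hypothetical node name
    idle_seconds: float  # how long the node has had no running jobs


def nodes_to_delete(nodes: List[Node]) -> List[str]:
    """Return names of idle nodes to remove, keeping at least MIN_NODES."""
    candidates = sorted(
        (n for n in nodes if n.idle_seconds >= UNNEEDED_SECONDS),
        key=lambda n: n.idle_seconds,
        reverse=True,  # remove the longest-idle nodes first
    )
    removable = max(0, len(nodes) - MIN_NODES)
    return [n.name for n in candidates[:removable]]


cluster = [Node("worker-a", 0), Node("worker-b", 200), Node("worker-c", 500)]
print(nodes_to_delete(cluster))  # ['worker-c', 'worker-b']
```

A short threshold like this releases nodes quickly after a burst of work, at the price of having to start new nodes again when the next burst arrives.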

Cost and performance

The cost and performance of the best-practice workflow are summarized in the following list:

  • Runtime hours          
    • Autoscaling: 3.4        
    • Static: 2.8
  • Node hours        
    • Autoscaling: 26.9        
    • Static: 66.4
  • Billing node hours        
    • Autoscaling: 50.0         
    • Static: 72.0
  • Estimated server cost        
    • Autoscaling: $25        
    • Static: $36
  • Estimated cost per hour        
    • Autoscaling: $7.4/h        
    • Static: $12.9/h

Autoscaling increased the workflow's runtime by 21% (3.4 hours vs. 2.8 hours) but reduced the total server cost by about 31% ($25 vs. $36); put differently, the static cluster cost 44% more. "Node hours" are the total hours that nodes were running during the experiment. Many nodes under autoscaling stopped within an hour, so the autoscaling cluster accumulated far fewer node hours than the static cluster.

"Billing node hours" are node hours estimated under an hourly billing model, which rounds each node's usage up: even one second between node start and termination is charged as a full node hour. At $0.5 per node hour, the autoscaling cluster costs $0.5 x 50.0 = $25 and the static cluster costs $0.5 x 72.0 = $36.
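The static figure can be reproduced with a short Python sketch of this rounding rule. It assumes that all 24 static nodes were alive for the whole 2.8-hour run, which is an approximation we introduce here; the function name `billed_hours` is likewise illustrative.

```python
import math

HOURLY_RATE_USD = 0.5  # per-node hourly rate used in the estimate above


def billed_hours(node_lifetimes_hours):
    """Round each node's lifetime up to whole hours and sum the results."""
    return sum(math.ceil(hours) for hours in node_lifetimes_hours)


# Static cluster: assume 24 nodes, each alive for the full 2.8-hour run.
static_billed = billed_hours([2.8] * 24)
print(static_billed)                    # 72
print(static_billed * HOURLY_RATE_USD)  # 36.0

# Autoscaling cluster: 50.0 billed node hours, as reported in the list above.
print(50.0 * HOURLY_RATE_USD)           # 25.0
```

Because every node start is rounded up to a full hour, short-lived nodes are relatively expensive, which is why the autoscaling cluster's billed hours (50.0) are well above its actual node hours (26.9).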

Number of nodes

The following figure shows our trace of the number of nodes for the genome workflow:


This GATK workflow shows two big spikes and one small spike in resource consumption. These spikes come from "scatter" parallelization. After each scatter, the workflow uses far fewer resources in the "gather" phase. These workload characteristics strongly motivate using the OpenShift ClusterAutoscaler.
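The scatter and gather pattern behind these spikes can be sketched in Python. This is a generic illustration, not code from the GATK workflow: the functions `scatter`, `process_shard`, and `gather` are hypothetical, and counting bases stands in for real per-interval work such as HaplotypeCaller.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def scatter(sequence, n_shards):
    """Split a non-empty sequence into roughly equal, contiguous shards."""
    size = -(-len(sequence) // n_shards)  # ceiling division
    return [sequence[i:i + size] for i in range(0, len(sequence), size)]


def process_shard(shard):
    """Stand-in for a per-interval task: count the bases in one shard."""
    return Counter(shard)


def gather(partial_counts):
    """Merge the per-shard results in one cheap, sequential step."""
    total = Counter()
    for counts in partial_counts:
        total += counts
    return total


sequence = "TACTTGATC" * 4
shards = scatter(sequence, n_shards=4)

# Scatter phase: many independent tasks run at once (the resource spike).
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(process_shard, shards))

# Gather phase: a single task merges the results (low resource usage).
print(gather(partials))
```

The scatter phase needs many workers for a short time, while the gather phase needs only one, so a cluster sized for the scatter phase sits mostly idle afterwards.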

Cluster CPU/memory utilization

The Cromwell backend requests resources according to the WDL file definition. A WDL file has a section that specifies the required amounts of CPU and memory. Other backends, such as the AWS Batch and Google Life Sciences APIs, also use this WDL feature to decide the size of worker nodes.


The above figures show the ratio of actual resource usage to the total resources reserved in the cluster. Autoscaling increased memory utilization, especially toward the end of the workflow, by reducing the number of nodes. However, CPU utilization still peaked at only around 35%, which was lower than we expected. A major reason is that the resource requirements specified in WDL files are overestimated. Precise resource estimation is challenging in practice, but we believe other OpenShift features (like Horizontal Pod Autoscaler) could help address this issue.
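To illustrate how overestimated requests lower this ratio, here is a minimal Python sketch. It assumes that "reserved" means the CPU requested by running tasks; the `Task` type, the task names, and all numbers are hypothetical and are not measurements from our experiment.

```python
from typing import List, NamedTuple


class Task(NamedTuple):
    name: str
    requested_cpus: float  # CPU amount declared for the task (e.g., in WDL)
    used_cpus: float       # CPU amount the task actually consumed


def cpu_utilization(tasks: List[Task]) -> float:
    """Ratio of used CPU to reserved CPU (0.0 when nothing is reserved)."""
    reserved = sum(t.requested_cpus for t in tasks)
    used = sum(t.used_cpus for t in tasks)
    return used / reserved if reserved else 0.0


# Hypothetical tasks that request more CPU than they use.
tasks = [Task("task-a", 16, 4.0), Task("task-b", 8, 4.0)]
print(f"{cpu_utilization(tasks):.0%}")  # 33%
```

In this model, removing idle nodes does not raise the ratio, because the gap comes from tasks that reserve more CPU than they use.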

Summary

In this blog post, we presented our experimental results of genomics on Red Hat OpenShift. OpenShift enables genomics on various underlying infrastructures with customizations like storage software and cluster autoscaling. As a demonstration, we reused an existing workflow for Google Cloud on Red Hat OpenShift on IBM Cloud. The experimental results showed that cluster autoscaling improved resource utilization.

Learn more about Red Hat OpenShift on IBM Cloud.

Learn more about IBM Cloud Object Storage.
