An examination of the experimental results of genomics on Red Hat OpenShift.

TL;DR

  • The cloud reduces the cost of genome analysis.
  • OpenShift simplifies hybrid cloud management with standardized interfaces.
  • We extended Cromwell to support OpenShift and multiple cloud object storage services.
  • A GATK workflow written for Google Cloud can run on IBM Cloud with our extension.
  • ClusterAutoscaler further improved cost efficiency.

Genome analysis

Understanding the structure of human DNA is an essential element of medical and life sciences. A DNA sequence is represented as a string of the initial letters of the four nucleotide bases, such as TACTTGATC. Discovering variants in these sequences (insertions, deletions, and so on) helps us understand diseases, for example.

This blog post focuses on GATK, a popular toolkit that provides essential utilities and command-line tools for genome analysis, and on Cromwell, a genomics workflow engine. Users can write a WDL (Workflow Description Language) file consisting of multiple GATK commands so that they can reproduce or reuse a genome analysis with Cromwell, as sketched below.
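As a rough illustration, here is a minimal, hypothetical WDL sketch (not the actual Broad workflow): one task wraps a single GATK command line, and the workflow block calls it. The task name, file names, and parameters are illustrative only; a real analysis chains many such tasks, passing files between them.

version 1.0

# Hypothetical task wrapping one GATK command; file names and parameters
# are illustrative only.
task HaplotypeCallerSketch {
  input {
    File input_bam
    File reference_fasta
  }
  command <<<
    gatk HaplotypeCaller \
      -R ~{reference_fasta} \
      -I ~{input_bam} \
      -O output.g.vcf.gz \
      -ERC GVCF
  >>>
  output {
    File gvcf = "output.g.vcf.gz"
  }
}

workflow GenomeAnalysisSketch {
  input {
    File input_bam
    File reference_fasta
  }
  call HaplotypeCallerSketch {
    input:
      input_bam = input_bam,
      reference_fasta = reference_fasta
  }
}

Cromwell parses such a file, resolves the dependencies between tasks, and dispatches each command to the configured backend.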

From the perspective of computing infrastructure, a challenge of genome analysis is its data scale and the computation power it requires. A single genome is generated as a dataset of roughly 100 GB, including errors and redundancy. Typical genomics workflows start with data preprocessing and eventually move on to parallelized variant discovery over the input datasets.

Genomics in the cloud: Pros and cons

These characteristics make genomics a typical data analytics workload that matches the on-demand availability of computation and storage, a well-known feature of the cloud. You pay only when you need additional Linux machines, and cloud object storage provides virtually unlimited, inexpensive capacity without hardware wear. In contrast, using physical machines means purchasing hardware with fixed configurations before starting any workload. Specifically, GATK and Cromwell support various cloud backends, such as AWS Batch and Google Cloud Life Sciences, which manage computing resources in their respective public cloud infrastructures.

A drawback of using the cloud is that switching to another provider is impractical once you have placed a huge amount of data in a particular provider's object storage. Another problematic scenario is that you may need to run genome analysis on your “private” cloud to meet constraints on data location and privacy. In both cases, you need to learn how to use different cloud systems, since many cloud functionalities are not standardized.

Genomics on Red Hat OpenShift

Red Hat OpenShift standardizes how a computing cluster is managed, letting you treat it much like a single Linux machine. It abstracts the backend cloud infrastructure and provides advanced features to run applications efficiently on a cloud cluster. Users can simplify cluster management regardless of the underlying cloud infrastructure, public or private.

So, we extended Cromwell as an OpenShift application to meet these complex hybrid cloud requirements. Our extension leverages OpenShift as the job dispatcher for genomics workflows, assuming cloud object storage is mounted in the cluster. It lets us use multiple cloud object storage services and simplifies job deployment. Consequently, we confirmed that IBM Cloud can run existing workflows written for Google Cloud and AWS, as we demonstrate in the next section.

Another benefit of using OpenShift is its high customizability. An example is ClusterAutoscaler, which lets OpenShift automatically adjust the number of computing nodes in a cluster according to recent resource usage. OpenShift can also be customized to mount cloud object storage as a normal Linux filesystem using csi-s3. This feature is practically important, since many genomics workflows depend on Linux command lines and scripts that operate on local files.

Demonstration

As a demonstration, we ran a best-practices workflow distributed by the Broad Institute. The workflow runs data preprocessing and variant discovery with parallelized HaplotypeCaller on public datasets in Google Cloud Storage, while we configured the outputs to be written to IBM Cloud Object Storage in Tokyo.

Our demonstration runs on Red Hat OpenShift on IBM Cloud. We set up 1 to 24 bx2-32x128 worker nodes (32 vCPUs of Cascade Lake and 128 GB RAM) on OpenShift 4.6 in a Tokyo availability zone with ClusterAutoscaler. For comparison, we also ran a static cluster of 24 worker nodes. We used a very aggressive deletion policy: nodes that are unused for three minutes are deleted.

Cost and performance

The cost and performance of the best-practices workflow are summarized in the following list:

  • Runtime hours: Autoscaling 3.4, Static 2.8
  • Node hours: Autoscaling 26.9, Static 66.4
  • Billing node hours: Autoscaling 50.0, Static 72.0
  • Estimated server cost: Autoscaling $25, Static $36
  • Estimated cost per hour: Autoscaling $7.4/h, Static $12.9/h

Autoscaling increased the runtime of the workflow by 21% (3.4 hours vs. 2.8 hours) but reduced the total server cost from $36 to $25 (the static cluster cost 44% more). “Node hours” are the total hours of node runtime during the experimental period. Many of the nodes under autoscaling stop within an hour, so the node hours of the autoscaling cluster were much lower than those of the static cluster.

“Billing node hours” estimate the cost under an hourly billing model, which charges any period from node start to termination, even one second, as a full hour of node usage. At $0.5 per node hour, the autoscaling cluster should cost $0.5/hour x 50.0 = $25, and the static cluster $0.5/hour x 72 = $36.

Number of nodes

The figure below shows our trace of the number of nodes during the genome workflow:


This GATK workflow has two big spikes and one small spike in resource consumption. These spikes come from “scatter” parallelization. After each scatter, the workflow needs far fewer resources during the “gather” phase. These workload characteristics strongly motivate using OpenShift's ClusterAutoscaler.
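For readers unfamiliar with WDL, the following hypothetical sketch (not the actual workflow) shows the pattern: the scatter block fans a task out over many genomic intervals in parallel, which produces the spikes, while the single merge step afterwards needs far fewer resources. All names here are illustrative.

version 1.0

# Hypothetical scatter/gather sketch: many parallel per-interval jobs (the
# resource spikes), followed by one lightweight merge job (the gather phase).
task CallIntervalSketch {
  input {
    String interval
  }
  command <<<
    echo "variant calling on ~{interval}" > result.txt
  >>>
  output {
    File result = "result.txt"
  }
}

task MergeResultsSketch {
  input {
    Array[File] results
  }
  command <<<
    cat ~{sep=" " results} > merged.txt
  >>>
  output {
    File merged = "merged.txt"
  }
}

workflow ScatterGatherSketch {
  input {
    Array[String] intervals
  }
  scatter (itv in intervals) {
    call CallIntervalSketch { input: interval = itv }
  }
  call MergeResultsSketch { input: results = CallIntervalSketch.result }
}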

Cluster CPU/memory utilization

The Cromwell backend requests resources according to the WDL file definition. Each task in a WDL file has a runtime section that specifies the required amounts of CPU and memory. Other backends, such as AWS Batch and the Google Cloud Life Sciences API, also use this WDL feature to decide the size of worker nodes:



The above figures show the ratio of actual resource usage to the total reserved resources in the cluster. Autoscaling increased memory utilization, especially around the end of this workflow, by reducing the number of nodes. However, CPU utilization still peaked at only around 35%, which was lower than we expected. A major reason is that the required resources specified in WDL files are overestimated. Precise resource estimation is practically challenging, but we believe other OpenShift features (like the Horizontal Pod Autoscaler) can help solve this issue.
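To make this concrete, the resource request comes from the runtime section of each WDL task. The hypothetical task below shows the shape of such a request (the values are illustrative, not the Broad workflow's settings); Cromwell turns these values into per-job resource requests, so overestimating them leaves reserved CPU and memory idle.

version 1.0

# Hypothetical task showing the runtime section that backends use to size
# worker resources; the cpu/memory values here are illustrative only.
task SortBamSketch {
  input {
    File input_bam
  }
  command <<<
    samtools sort -o sorted.bam ~{input_bam}
  >>>
  output {
    File sorted_bam = "sorted.bam"
  }
  runtime {
    cpu: 4            # requested vCPUs for this job
    memory: "16 GB"   # requested RAM; often a conservative overestimate
  }
}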

Summary

In this blog post, we presented our experimental results of running genomics on Red Hat OpenShift, which enables genomics on various underlying infrastructures with customizations such as storage software and cluster autoscaling. As a demonstration, we reused an existing workflow written for Google Cloud on Red Hat OpenShift on IBM Cloud. The experimental results showed that cluster autoscaling improved resource utilization and reduced cost.

Learn more about Red Hat OpenShift on IBM Cloud.

Learn more about IBM Cloud Object Storage.
