An examination of experimental results from running genomics workloads on Red Hat OpenShift.

TL;DR

  • Cloud reduces the cost of genome analysis.
  • OpenShift simplifies hybrid cloud management with standardized interfaces.
  • We extended Cromwell for OpenShift and multiple cloud object storage.
  • A GATK workflow for Google Cloud can run on IBM Cloud with our extension.
  • ClusterAutoscaler further improved cost-efficiency.

Genome analysis

Understanding the structure of human DNA is an essential element of medical and life sciences. A DNA sequence is represented as a string over the initials of the four nucleotide bases (A, C, G, T), such as TACTTGATC. Discovering variants in such sequences (insertions, deletions, and so on) helps researchers understand diseases, for example.

This blog post focuses on GATK, a popular genome analysis toolkit that provides essential utilities and command-line tools for genome analysis. We also focus on Cromwell, a genomics workflow engine. Users can write a WDL (Workflow Description Language) file consisting of multiple GATK commands so that they can reproduce and reuse genome analyses with Cromwell.
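To give a feel for what such a file looks like, here is a minimal WDL sketch of a single task wrapping one GATK command. The file names, resource values, and container tag are illustrative, not taken from a specific production workflow:

    version 1.0

    task HaplotypeCaller {
      input {
        File bam
        File bam_index
        # Index and dictionary files are declared so the engine
        # localizes them next to the reference FASTA.
        File ref_fasta
        File ref_fasta_index
        File ref_dict
      }
      command <<<
        gatk HaplotypeCaller \
          -I ~{bam} \
          -R ~{ref_fasta} \
          -O out.g.vcf.gz \
          -ERC GVCF
      >>>
      runtime {
        cpu: 2
        memory: "8 GB"
        docker: "broadinstitute/gatk:4.2.0.0"
      }
      output {
        File gvcf = "out.g.vcf.gz"
      }
    }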

From the perspective of computing infrastructure, a challenge of genome analysis is its data scale and the computation power it requires. A single sequenced genome is generated as a dataset on the order of 100 GB, including errors and redundancy. Typical genomics workflows start with data preprocessing and eventually move into parallelized variant discovery on the input datasets.

Genomics in the cloud: Pros and cons

These characteristics make genomics a typical data-analytics workload that matches the on-demand availability of computation and storage, a well-known strength of the cloud. You pay only when you need additional Linux machines, and cloud object storage provides virtually unlimited, inexpensive capacity without hardware wear. In contrast, using physical machines means purchasing fixed hardware configurations before starting any workload. Specifically, GATK and Cromwell support various cloud backends, such as AWS Batch and Google Cloud Life Sciences, which manage computing resources in their respective public cloud infrastructures.

A drawback of using the cloud is that it is not practical to switch to another cloud after you have placed a huge amount of data in a particular provider's object storage. Another problematic scenario is that you may need to run genome analysis on a “private” cloud to meet data-location and privacy constraints. In both cases, you need to learn how to use different cloud systems, since many cloud functionalities are not standardized.

Genomics on Red Hat OpenShift

Red Hat OpenShift standardizes the management of a computing cluster much as Linux standardizes the management of a single machine. It abstracts the backend cloud infrastructure and provides advanced features for running applications efficiently on a cloud cluster. Users can simplify cluster management regardless of the underlying cloud infrastructure, public or private.

We therefore extended Cromwell to run as an OpenShift application that meets such complex hybrid cloud requirements. Our extension leverages OpenShift as the job dispatcher for genomics workflows, assuming cloud object storage is mounted in the cluster. It enables the use of multiple cloud object storage services and simplifies job deployment. Consequently, we confirmed that IBM Cloud can run existing workflows written for Google Cloud and AWS, as we demonstrate in the next section.
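To make the dispatching idea concrete, here is a minimal sketch of the kind of Kubernetes Job such a dispatcher could create for one workflow task. This is illustrative only: the names, image, command, and the mounted claim (csi-s3-data, sketched after the next paragraph) are assumptions, not the actual manifests our extension generates:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: gatk-task-example
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: gatk
            image: broadinstitute/gatk:4.2.0.0
            # Read inputs from and write outputs to the mounted object storage
            command: ["/bin/sh", "-c",
                      "gatk PrintReads -I /data/in.bam -O /data/out.bam"]
            resources:
              requests:
                cpu: "2"
                memory: 8Gi
            volumeMounts:
            - name: cos-data
              mountPath: /data
          volumes:
          - name: cos-data
            persistentVolumeClaim:
              claimName: csi-s3-data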

Another benefit of using OpenShift is its high customizability. One example is ClusterAutoscaler, which enables OpenShift to automatically adjust the number of computing nodes in a cluster according to recent resource usage. OpenShift can also be customized to mount cloud object storage as a normal Linux filesystem using csi-s3. This capability is practically important, since many genomics workflows depend on Linux command-line tools and scripts that operate on local files.
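For illustration, mounting a bucket with csi-s3 boils down to creating a PersistentVolumeClaim against the csi-s3 storage class; pods then mount the claim like a normal filesystem. A minimal sketch, assuming the csi-s3 driver and its storage class are already installed (names and sizes are illustrative):

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: csi-s3-data
    spec:
      accessModes:
      - ReadWriteMany
      resources:
        requests:
          # Advisory for object storage; capacity is effectively unlimited
          storage: 100Gi
      storageClassName: csi-s3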

Demonstration

As a demonstration, we ran a best-practices workflow distributed by the Broad Institute. The workflow performs data preprocessing and variant discovery with parallelized HaplotypeCaller on public datasets in Google Cloud Storage, while we configured the outputs to be written to IBM Cloud Object Storage in Tokyo.

Our demonstration runs on Red Hat OpenShift on IBM Cloud. We set up 1 to 24 bx2.32x128 worker nodes (32 Cascade Lake vCPUs and 128 GB RAM each) running OpenShift 4.6 in a Tokyo availability zone with ClusterAutoscaler. For comparison, we also ran a static cluster with 24 worker nodes. We used a very aggressive deletion policy: nodes are deleted after three minutes of being unneeded.
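On self-managed OpenShift, such a scale-down policy can be expressed with the ClusterAutoscaler resource; on Red Hat OpenShift on IBM Cloud, the managed autoscaler add-on exposes equivalent settings. A minimal sketch of the policy described above (illustrative, not our exact configuration):

    apiVersion: autoscaling.openshift.io/v1
    kind: ClusterAutoscaler
    metadata:
      name: default
    spec:
      resourceLimits:
        maxNodesTotal: 24    # upper bound of the 1 to 24 node range
      scaleDown:
        enabled: true
        unneededTime: 3m     # delete a node after three minutes of being unneeded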

Cost and performance

The cost and performance of the best-practices workflow are summarized in the following table:

  Metric                     Autoscaling    Static
  Runtime hours              3.4            2.8
  Node hours                 26.9           66.4
  Billing node hours         50.0           72.0
  Estimated server cost      $25            $36
  Estimated cost per hour    $7.4/h         $12.9/h

Autoscaling increased the runtime of the workflow by 21% (3.4 hours vs. 2.8 hours) but reduced the total server cost from $36 to $25; put differently, the static cluster cost 44% more. “Node hours” are the total hours of node runtime during the experimental period. Many of the nodes under autoscaling stopped within an hour, so the node hours of the autoscaling cluster were much lower than those of the static cluster.

“Billing node hours” represent an estimated cost under an hourly billing model: any usage from node start to termination, even a single second, is charged as a full hour. Under this model, the autoscaling cluster costs $0.5/hour x 50.0 = $25, and the static cluster costs $0.5/hour x 72 = $36.

Number of nodes

The figure below shows our trace of the number of nodes during the genome workflow:

This GATK workflow has two large spikes and one small spike in resource consumption. These spikes derive from “scatter” parallelization. After each scatter, the workflow uses far fewer resources during the “gather” phase. These workload characteristics strongly motivate the use of OpenShift's ClusterAutoscaler.
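The scatter/gather pattern itself is expressed directly in WDL. A minimal, self-contained sketch of the shape of such a workflow (the tasks are placeholders, not the actual GATK commands):

    version 1.0

    workflow ScatterGatherExample {
      input {
        Array[File] interval_lists
      }
      # Scatter: one shard per interval list runs in parallel,
      # producing the spikes in node demand
      scatter (intervals in interval_lists) {
        call CallShard { input: intervals = intervals }
      }
      # Gather: a single task consumes all shard outputs,
      # during which most nodes sit idle
      call MergeShards { input: shard_vcfs = CallShard.vcf }
    }

    task CallShard {
      input { File intervals }
      command <<< echo "variants for ~{intervals}" > shard.vcf >>>
      output { File vcf = "shard.vcf" }
    }

    task MergeShards {
      input { Array[File] shard_vcfs }
      command <<< cat ~{sep=" " shard_vcfs} > merged.vcf >>>
      output { File vcf = "merged.vcf" }
    }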

Cluster CPU/memory utilization

The Cromwell backend requests resources according to the WDL file definition: a WDL file has a section that specifies the required amounts of CPU and memory. Other backends, such as AWS Batch and the Google Life Sciences API, also use this WDL feature to decide the size of worker nodes:

The above figures show the ratio of actual resource usage to the total reserved resources in the cluster. Autoscaling increased memory utilization, especially around the end of this workflow, by reducing the number of nodes. However, CPU utilization still peaked at only around 35%, which was lower than we expected. A major reason is that the resource requirements specified in WDL files are overestimated. Precise resource estimation is practically challenging, but we believe other OpenShift features (like the Horizontal Pod Autoscaler) can help address this issue.
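To illustrate the over-reservation effect, consider a hypothetical WDL task whose tool is mostly single-threaded yet reserves many CPUs; the scheduler holds the full reservation, capping cluster-wide utilization. The numbers here are invented for illustration:

    version 1.0

    task OverReservedTask {
      input { File bam }
      # samtools flagstat runs single-threaded by default, yet the task
      # reserves 16 CPUs, so most of the reserved cores stay idle.
      command <<< samtools flagstat ~{bam} > flagstat.txt >>>
      runtime {
        cpu: 16
        memory: "64 GB"
      }
      output { File stats = "flagstat.txt" }
    }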

Summary

In this blog post, we presented our experimental results of running genomics on Red Hat OpenShift, which enables genomics on various underlying infrastructures with customizations like storage software and cluster autoscaling. As a demonstration, we reused an existing workflow written for Google Cloud on Red Hat OpenShift on IBM Cloud. The experimental results showed that cluster autoscaling improved resource utilization and reduced cost.

Learn more about Red Hat OpenShift on IBM Cloud.

Learn more about IBM Cloud Object Storage.
