Filling in the hole in whole genome analysis


Genome analysis is a linchpin in the development of more effective and more personalized medical treatments worldwide. The healthcare community has embraced genome research to drive new patient care and therapies. At the core of this genomics-based research are analytical workloads, including the Genome Analysis Toolkit (GATK) Best Practices pipelines, that are both compute-intensive and data-intensive. This dual nature is creating challenges for the traditional high-performance computing (HPC) architectures and technologies found in many research facilities (Figure 1).


Figure 1.  Compute and data challenges from genomics pipeline analysis

Traditional HPC ecosystems are often inflexible and cannot scale

Traditional HPC systems have technical deficiencies that are widening the chasm between the researcher’s computing needs and the IT department’s ability to deliver them.

  • HPC systems are often dedicated and siloed at many research organizations, creating artificial scalability limits even if there is sufficient overall capacity in the department.
  • Older workload and resource schedulers and even newer open-source schedulers often cannot effectively manage large-scale analytics that are both compute and data-intensive or are missing other required capabilities.

As a result, long-running jobs may take far longer than expected to complete, if they complete at all. Significant breakthroughs in genome-based therapies for cancer and other diseases could be delayed by years.

Case in point: The Icahn School of Medicine at Mt. Sinai

Unlike a “typical” HPC workload, where a handful of jobs may use all of the system’s cores and write to only a few files, the workloads at Mt. Sinai surprised researchers with their short job runtimes and the sheer number of files per submission. Mt. Sinai’s genomic workloads mixed long-running multi-core jobs, short-running jobs and single-core jobs. For highly parallel workloads such as GATK pipelines, a single submission could fan out into tens of thousands of short jobs.
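The fan-out comes from scatter-gather processing: the genome is split into intervals, and each interval becomes its own short job. A minimal sketch of the arithmetic, with illustrative numbers (the interval size is an assumption, not Mt. Sinai’s actual pipeline parameter):

```python
# Sketch: why one whole-genome submission fans out into tens of
# thousands of short jobs. Interval size is a hypothetical example.

GENOME_LENGTH_BP = 3_100_000_000   # approximate human genome size
INTERVAL_BP = 100_000              # hypothetical scatter interval

def shard_count(genome_bp: int, interval_bp: int) -> int:
    """Number of per-interval jobs a scatter step would create."""
    return -(-genome_bp // interval_bp)  # ceiling division

jobs = shard_count(GENOME_LENGTH_BP, INTERVAL_BP)
print(jobs)  # -> 31000 short jobs from one pipeline submission
```

At this scale, scheduler throughput, not raw compute, becomes the limiting factor.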

The Mt. Sinai IT administrators discovered that their TORQUE 2.5.11 scheduler could not scale to handle the volume of jobs. In many cases, scheduling a job took longer than the job itself took to run. Furthermore, TORQUE crashed with a core dump at least once a month, leaving researchers unable to start new jobs until administrators deleted job control and log files. Later versions were even buggier, resulting in up to 30 hours of wait time per month for new job submissions.
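This failure mode is visible in back-of-envelope arithmetic: when per-job scheduling overhead approaches or exceeds job runtime, the scheduler, not the cluster, bounds throughput. A hedged sketch with hypothetical numbers (not measured TORQUE behavior):

```python
# Sketch: scheduler-bound throughput for short jobs. All numbers are
# hypothetical; they illustrate the failure mode described above.

def makespan_seconds(n_jobs, job_runtime_s, sched_overhead_s, cores):
    """Wall-clock estimate when a serial scheduler dispatches jobs
    one at a time while up to `cores` jobs run in parallel."""
    dispatch_time = n_jobs * sched_overhead_s       # serial bottleneck
    compute_time = n_jobs * job_runtime_s / cores   # ideal parallel work
    return max(dispatch_time, compute_time)

# 30,000 short jobs of 10 s each on 1,000 cores:
fast = makespan_seconds(30_000, 10, 0.005, 1000)  # 300 s: compute-bound
slow = makespan_seconds(30_000, 10, 2.0, 1000)    # 60,000 s: scheduler-bound
print(fast, slow)
```

With 2 seconds of scheduling overhead per job, the same workload takes 200 times longer, and the cores sit mostly idle.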

Schedulers like SLURM offer scalability but may lack key APIs such as the Distributed Resource Management Application API (DRMAA), a specification for the submission and control of jobs. IBM Spectrum LSF has proven scalability and APIs that support DRMAA. At Mt. Sinai, IT administrators redesigned Minerva, their HPC system, by replacing TORQUE with IBM Spectrum LSF and by implementing a multi-tiered storage architecture using inodes, flash and IBM Spectrum Scale, a high-performance parallel file system. These improvements eliminated core dumps over a one-year period while increasing scalability to 500,000 jobs per queue.
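DRMAA matters to researchers because it gives pipelines a scheduler-neutral submit-and-wait interface. A minimal sketch using the third-party `drmaa` Python binding, assuming it is installed and backed by a DRMAA-capable scheduler such as IBM Spectrum LSF (the import is guarded so the snippet loads without one):

```python
# Sketch of DRMAA job submission. Assumes the third-party `drmaa`
# Python binding and a DRMAA-capable scheduler are configured;
# the guard lets this module load without either.
try:
    import drmaa
    HAVE_DRMAA = True
except ImportError:
    HAVE_DRMAA = False

def submit_and_wait(command, args):
    """Submit one job through DRMAA and block until it finishes."""
    if not HAVE_DRMAA:
        raise RuntimeError("no DRMAA binding or scheduler available")
    with drmaa.Session() as session:
        jt = session.createJobTemplate()
        jt.remoteCommand = command   # e.g. a GATK wrapper script
        jt.args = args
        job_id = session.runJob(jt)
        info = session.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)
        session.deleteJobTemplate(jt)
        return info.exitStatus
```

Because the same calls work against any DRMAA-compliant scheduler, a pipeline written this way survives a migration like Mt. Sinai’s without code changes.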


Figure 2.  IBM provides end-to-end foundation for genomics research and medicine

Evolving to a collaborative development model

Universities, pharmaceutical companies and governments are now collaborating on large-scale genomic research. Projects such as The 100,000 Genomes Project will accelerate the development of cheaper, more effective healthcare. Vendors including IBM have a role to play by collaborating with researchers on their needs, from improving technologies to developing new architectures. One example of this collaboration is the IBM Reference Architecture for Genomics, shown in Figure 2 above. For more information on how you can participate in the evolution of this architecture, please visit our community page or visit us at Edge 2016.
