Genome analysis is a linchpin in the development of more effective, more personalized medical treatments worldwide. The healthcare community has gone full bore on genome research to drive new approaches to patient care and therapy. At the core of this genomics-based research are analytical workloads, including the Genome Analysis Toolkit (GATK) Best Practices pipelines, that are both compute- and data-intensive. This dual nature of genomic workloads is creating challenges for the traditional high-performance computing (HPC) architectures and technologies found in many research facilities (Figure 1).
Figure 1. Compute and data challenges from genomics pipeline analysis
Traditional HPC ecosystems are often inflexible and cannot scale
Traditional HPC systems have technical deficiencies that are widening the gap between researchers’ computing needs and the IT department’s ability to meet them:
- HPC systems are often dedicated and siloed at many research organizations, creating artificial scalability limits even if there is sufficient overall capacity in the department.
- Older workload and resource schedulers, and even newer open-source schedulers, often cannot effectively manage large-scale analytics that are both compute- and data-intensive, or they lack other required capabilities.
As a result, long-running jobs may take far longer than expected to complete, if they complete at all, and significant breakthroughs in genome-based therapies for cancer and other diseases may be delayed by years.
Unlike the “typical” HPC workload, where a couple of jobs may use all of the system’s cores and write to only a few files, researchers at Mt. Sinai were surprised by the short runtimes and sheer number of files produced by many of their job submissions. Mt. Sinai’s genomic workloads were a mix of long-running multi-core jobs, short-running jobs and single-core jobs, and for highly parallel workloads such as GATK pipelines, a single submission could fan out into tens of thousands of short jobs, as the sketch below illustrates.
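To see where those tens of thousands of short jobs come from, consider the “scatter” step of a typical GATK-style pipeline: the genome is split into small intervals and one short variant-calling task is emitted per interval. The sketch below is illustrative only; the chromosome lengths, 1 Mb interval size, file names and GATK invocation are assumptions, not Mt. Sinai’s actual pipeline.

```python
# Illustrative sketch: how a single scatter step fans out into
# thousands of short jobs. Values below are sample figures, not
# Mt. Sinai's production configuration.
CHROM_LENGTHS = {"chr1": 248_956_422, "chr2": 242_193_529}  # sample lengths
INTERVAL = 1_000_000  # 1 Mb shards -> thousands of tasks genome-wide

def scatter_commands(bam="sample.bam", ref="ref.fasta"):
    """Yield one short variant-calling command per genomic interval."""
    for chrom, length in CHROM_LENGTHS.items():
        for start in range(1, length + 1, INTERVAL):
            end = min(start + INTERVAL - 1, length)
            region = f"{chrom}:{start}-{end}"
            # GATK4-style invocation; the paths are hypothetical.
            yield (f"gatk HaplotypeCaller -R {ref} -I {bam} "
                   f"-L {region} -O {chrom}_{start}.vcf.gz")

if __name__ == "__main__":
    cmds = list(scatter_commands())
    print(f"{len(cmds)} short jobs from just two chromosomes")
```

Even this two-chromosome example yields hundreds of tasks; scaled to a whole genome and many samples, each task is brief and writes its own output file, which is exactly the job volume and file count that strains a traditional scheduler.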
The Mt. Sinai IT administrators discovered that their TORQUE 2.5.11 scheduler could not scale to handle the volume of jobs. In many cases, scheduling a job took longer than running it. Furthermore, TORQUE core dumped at least once a month, leaving researchers unable to start new jobs until administrators deleted job control and log files. Later versions were even buggier, resulting in up to 30 hours of wait time per month for new job submissions.
Schedulers like SLURM scale well but may lack key APIs such as the Distributed Resource Management Application API (DRMAA), a specification for the submission and control of jobs. IBM Spectrum LSF offers proven scalability along with APIs that support DRMAA. At Mt. Sinai, IT administrators redesigned Minerva, their HPC system, by replacing TORQUE with IBM Spectrum LSF and by implementing a multi-tiered storage architecture using inodes, Flash and IBM Spectrum Scale, a high-performance parallel file system. These improvements reduced core dumps to zero over a one-year period while increasing scalability to 500,000 jobs per queue.
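Because DRMAA is a scheduler-neutral specification, a pipeline written against it can submit work to any DRMAA-enabled scheduler, including IBM Spectrum LSF. The sketch below uses the drmaa-python binding to submit a scatter of short jobs as a single bulk (array) submission; the wrapper script path, queue name and interval count are hypothetical placeholders.

```python
# Hedged sketch: submitting many short per-interval jobs through DRMAA,
# the scheduler-neutral API mentioned above. Assumes the drmaa-python
# binding and a DRMAA-enabled scheduler (such as IBM Spectrum LSF with
# its DRMAA library) are installed.
import drmaa

def submit_interval_jobs(num_intervals):
    """Submit one short job per genomic interval as a single bulk job."""
    with drmaa.Session() as session:
        jt = session.createJobTemplate()
        jt.remoteCommand = "/opt/pipeline/call_variants.sh"  # hypothetical wrapper
        # DRMAA expands the parametric-index placeholder to the array
        # element's index at run time, so each job handles one interval.
        jt.args = ["--interval-index", drmaa.JobTemplate.PARAMETRIC_INDEX]
        jt.nativeSpecification = "-q short"  # scheduler-specific options
        # One runBulkJobs call replaces tens of thousands of individual
        # submissions, which is far gentler on the scheduler than a
        # submit-per-task loop.
        job_ids = session.runBulkJobs(jt, 1, num_intervals, 1)
        session.deleteJobTemplate(jt)
    return job_ids

if __name__ == "__main__":
    print(submit_interval_jobs(10_000))
```

Submitting the scatter as one bulk job keeps the scheduler's bookkeeping proportional to one submission rather than tens of thousands, which is one reason array-style submission matters at this job volume.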
Figure 2. IBM provides end-to-end foundation for genomics research and medicine
Evolving to a collaborative development model
Universities, pharmaceutical companies and governments are now working in lockstep, jointly participating in large-scale genomic research. Projects such as The 100,000 Genomes Project will accelerate the development of cheaper, more effective healthcare. Vendors, including IBM, have a role to play by collaborating with researchers on their needs, from improving technologies to developing new architectures. One example of this collaboration is the IBM Reference Architecture for Genomics, shown in Figure 2 above. For more information on how you can participate in the evolution of this architecture, please visit our community page or visit us at Edge 2016.