Porting a massively parallel bioinformatics pipeline to the cloud
A case study in transferring, stabilizing, and managing massive data sets
From the developerWorks archives
Date archived: November 29, 2016 | First published: February 20, 2013
Recent breakthroughs in genomics have significantly reduced the cost of short-read genomic sequencing (determining the order of the nucleotide bases in a molecule of DNA). As a result, the task of full genomic reassembly, often referred to as secondary analysis (and familiar to those with parallel processing experience), has largely become an IT challenge: transferring massive amounts of data over WANs and LANs, managing that data in a distributed environment, ensuring the stability of massively parallel processing pipelines, and containing processing costs. In this applied science investigation, the authors describe their experiences porting a commercial, HPC-based application for genomic reassembly to a cloud environment; they outline the key architectural decisions they made and the path that took them from a purely HPC-style design to what they like to call the Big Data design.
This content is no longer being updated or maintained. The full article is provided "as is" in a PDF file. Given the rapid evolution of technology, some steps and illustrations may have changed.