Porting a massively parallel bioinformatics pipeline to the cloud

A case study in transferring, stabilizing, and managing massive data sets

From the developerWorks archives

Dima Rekesh, Thanh Pham, Jacques Labrie, Jeffrey Rodriguez, Shilpi Ahuja, Eugene Hung, and Bobbie Cochrane

Date archived: May 14, 2019 | First published: February 20, 2013

Recent breakthroughs in genomics have significantly reduced the cost of short-read genomic sequencing (determining the order of the nucleotide bases in a molecule of DNA). As a result, the task of full genomic reassembly, often referred to as secondary analysis (and familiar to those with parallel processing experience), has become to a large extent an IT challenge: transferring massive amounts of data over WANs and LANs, managing that data in a distributed environment, ensuring the stability of massively parallel processing pipelines, and containing the processing cost. In this applied science investigation, the authors describe their experiences porting a commercial, high-performance-computing-based application for genomic reassembly to a cloud environment. They outline the key architectural decisions they made and the path that took them from a purely HPC-type design to what they like to call the Big Data design.

This content is no longer being updated or maintained. The full article is provided "as is" in a PDF file. Given the rapid evolution of technology, some content, steps, or illustrations may have changed.
