Porting a massively parallel bioinformatics pipeline to the cloud

A case study in transferring, stabilizing, and managing massive data sets

Recent breakthroughs in genomics have significantly reduced the cost of short-read genomic sequencing (determining the order of the nucleotide bases in a molecule of DNA). As a result, the task of full genomic reassembly—often referred to as secondary analysis (and familiar to those with parallel processing experience)—has largely become an IT challenge in which the issues are transferring massive amounts of data over WANs and LANs, managing it in a distributed environment, ensuring the stability of massively parallel processing pipelines, and containing the processing cost. In this article, the authors describe their experiences porting a commercial, high-performance-computing-based application for genomic reassembly to a cloud environment; they outline the key architectural decisions they made and the path that took them from a purely HPC-type design to what they like to call the Big Data design.


Dima Rekesh (dima@us.ibm.com), Senior Technical Staff Member, IBM

Dima is part of the strategic IBM Cloud Computing Enterprise Initiatives team. In the past, he worked on a number of large-scale Web application platforms and emerging technologies, as well as on leadership data center and green data center strategies.



Thanh Pham (thanhp@us.ibm.com), Solution Architect, IBM

Thanh Pham is a Solution Architect in the IBM Information Management Advanced Technology group. His current focus is helping customers build applications using the IBM Mashup Center product and IBM cloud computing. Before this role, he was an architect for the ECM/FileNet Business Process Framework.



Jacques Labrie (jlabrie@us.ibm.com), Senior Software Engineer, IBM

Jacques Labrie is a senior software engineer at the IBM Silicon Valley Lab in San Jose, CA. Jacques has been a manager, team lead, and developer on multiple IBM warehousing and metadata products since 1984. Currently, Jacques is responsible for prototype development of integration scenarios with the IBM WebSphere family of products.



Jeffrey Rodriguez (jeffreyr@us.ibm.com), Software Engineer, IBM

Jeffrey Rodriguez is a software engineer at the IBM Silicon Valley Lab in San Jose, CA. Jeff is a big data security lead, J2EE developer and web applications expert. Jeff has been an open source developer at the XML Xerces project and has worked on many advanced technology projects at IBM.



Shilpi Ahuja (shilpi@us.ibm.com), Solution Architect, IBM

Shilpi Ahuja is a solution architect in the Business Support System (BSS), a part of the IBM cloud computing platform used by LotusLive. Her current focus is extending BSS into a master customer information system that receives, stores, and distributes customer data from and to other customer-centric systems. Prior to this role, she designed and developed components and features for the IBM Information Server and WebSphere Information Integrator family of products.



Eugene Hung (eyhung@us.ibm.com), Senior Software Engineer, IBM

Eugene Hung is a senior software engineer at the IBM Silicon Valley Lab in San Jose, CA. His current focus is working on text analytics tools in IBM BigInsights. Before this role, he was the technical lead for the IBM/Google Cloud Computing Academic Initiative.



Bobbie Cochrane (bobbiec@almaden.ibm.com), Technology Architect, IBM

Dr. Bobbie Cochrane leads strategic initiatives for IBM's Information Management portfolio, championing new innovations, such as the impact of NoSQL, cloud, and Web 2.0, for information access and delivery. Throughout her career, she has been a leader in delivering new innovations from IBM research and industry to product, including PureXML, materialized views, triggers, and constraints. Dr. Cochrane has authored several journal articles and many papers in leading database conferences, and played a major role in the definition of the SQL3 standard for triggers, constraints, and cubes.



20 February 2013


The goal of this project was to prepare a commercial genome analysis application for massive scalability while containing the associated costs. The application had been designed to run on an internal high-performance computing (HPC)-type infrastructure, and the capacity of that infrastructure was approaching its limit. Meanwhile, the volume of analysis was projected to rapidly increase. Therefore, there was a desire to experiment with porting the application to a cloud environment.

In addition, time constraints prevented us from redesigning the original application, allowing us to make only superficial changes to the way the application was orchestrated. Let's start with a brief overview of the problem that computational genomics poses from an IT perspective.

One conventional approach to an organism's genomic sequencing involves the following steps:

  1. Shred multiple copies (typically 30-60) of the original genome chemically into a large number of random overlapping fragments, each with a fixed small length (for example, 30-200 base pairs).
  2. Read the sequence of each short fragment, which produces a large number of small files.
  3. Use a previously known genome of the organism (the "reference genome") to make the best guess as to the location of each sequenced fragment on the reference sequence. This is a plausible approach, because genomes of individuals within a species typically do not differ much.
  4. Use statistical means to determine the most likely base pair in each position for the assembled genome. For data-compression purposes, this approach can be expressed in terms of deltas: single-nucleotide polymorphisms, representing a mutation in a given location, or insertions or deletions (indels), representing a change in the overall genomic length.

Because many genomes are quite large—the human genome, for instance, is 3.3 billion base pairs long—secondary analysis represents a significant computational and data challenge. When the read-quality data is factored in, each incoming base pair is encoded by 1 byte of information. Therefore, an incoming data set containing 60 shredded copies of a full human DNA molecule would contain about 3.3 × 10^9 × 60 bytes, or around 200GB of data, and would require on the order of 500-2,500 core-hours of computation, assuming considerable aggregate I/O connectivity between the CPUs and the storage medium.
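
The back-of-the-envelope arithmetic behind these figures can be written out explicitly. The minimal Python sketch below reproduces the data-volume estimate from the numbers above (3.3 billion base pairs, 1 byte per base pair, 60-fold coverage); the core-hour figure is the empirical range quoted in the text, not something the sketch derives.

Listing 1. Data-volume estimate for one input data set

# Rough data-volume estimate for one secondary-analysis input set,
# using the figures quoted above (assumptions, not measured values).
GENOME_LENGTH_BP = 3.3e9   # human genome, base pairs
BYTES_PER_BP = 1           # base call plus read-quality data, ~1 byte each
COVERAGE = 60              # number of shredded genome copies sequenced

input_bytes = GENOME_LENGTH_BP * BYTES_PER_BP * COVERAGE
print(f"Input data set: ~{input_bytes / 1e9:.0f} GB")  # ~198 GB, i.e. about 200GB

# The computation itself is an empirical range cited in the article:
# on the order of 500-2,500 core-hours per data set.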

In the past, dealing with this range of problems was the realm of supercomputers or HPC. A fast and large centralized file system would be required; the input data sets would be placed in it; and the participating, largely stateless server farm would perform the computation.

Although processing one such data set appears manageable, processing thousands of such data sets within a reasonable wall-clock time frame presents a challenge. One of the gating factors is the capital investment required to build and operate such a substantial processing farm to keep up with the increasing demand as the cost of genomic sequencing continues to fall.

In this context, cloud computing represents an appealing paradigm. It has the potential to offer large amounts of computing capacity with variable pricing, allowing you to rent servers when you need them and return them when you do not. To fully leverage the cloud, however, you must overcome several challenges:

  • The data needs to be efficiently transferred to and from the cloud over a WAN, necessitating the appropriate set of tools.
  • The right combination of cloud storage offering types needs to be chosen, because the fast and expensive HPC-type storage is unlikely to be available in the cloud.
  • Job orchestration needs to factor in storage structure and be augmented accordingly.
  • The fundamentally horizontal cloud scaling model must be reflected in the architecture.
  • If possible, an optimal combination of cloud hardware, software, and virtualization must be chosen.

The rest of this article is structured as follows:

  • Introduction: Provides an overview of related work and covers additional background.
  • Section 2: Introduces the basics of IBM® SmartCloud™ Enterprise.
  • Section 3: Presents the system architecture that we selected for the ported system.
  • Section 4: Presents results and includes comparisons to the status quo architecture as well as some alternatives.
  • Section 5: Discusses results and lessons learned.
  • Conclusion: Describes the direction for potential future work and summarizes the key points of the article.

Today, several cloud providers are able to offer large amounts of computing capacity on demand under the pay-as-you-go model. In some cases, the customer can choose the underlying hardware so as to ensure congruency with the workload.

Consequently, there have also been successful examples of genomic workflows designed to run in the cloud in recent years. The approach that these researchers chose was to select a cloud system (such as Amazon Elastic Compute Cloud [Amazon EC2]), treat it "as is," and make deep changes to the application to take advantage of the power of the cloud.

Our work essentially differs in the following ways:

  • We had a limited ability to alter the original application in the time frame of our work. Although design imperfections certainly resulted from this constraint, we feel that it is a common problem to have.
  • Because of the strong focus on overall costs, we approached the design holistically by tracing the bytes of the data and attempting to ensure that they travel the shortest possible distance and by focusing on performance, identifying and eliminating any bottlenecks.
  • We conducted all of the work presented in this article in IBM SmartCloud Enterprise. Because of our close connection with the teams supporting and developing IBM SmartCloud Enterprise, we had a unique opportunity to treat the infrastructure of the cloud as a white box, using this knowledge as a guide.
  • We had some ability to experiment with tuning the configuration of the cloud so that it would better support such data-intensive workloads in the future.

All in all, many data-intensive applications had previously been written with custom and expensive supercomputers or HPC clusters in mind. We hope that our experiences and decision-making process can help those seeking to increase the scalability and decrease processing costs of tasks that these applications enable.

Introducing the cloud environment

IBM SmartCloud Enterprise is a virtualized Infrastructure as a Service (IaaS) offering that enables its users to borrow resources under the pay-as-you-go model, with hourly pricing for most resources. Its compute nodes use Kernel Virtual Machine (KVM) as the hypervisor and direct-attached hard drives to provide ephemeral storage to its virtual machines (VMs) at an optimal cost-to-performance ratio. In addition, IBM SmartCloud Enterprise provides network-attached block storage that can be attached to one VM at a time. VMs are connected to block storage as well as to each other at 1Gbps.

IBM SmartCloud Enterprise consists of multiple pods of capacity located around the world. In many cases, it is desirable to assemble a deployment topology in just one pod—that is, the one closest to the source of data. However, multi-pod topologies are not uncommon, because sometimes, data resides in multiple locations around the world and available capacity may differ among the pods.

IBM SmartCloud Enterprise offers several types of VMs, ranging from 32-bit Copper to 64-bit Platinum, each with a different allocation of ephemeral storage for an hourly charge. Persistent (block) cloud storage is also available in multiple increments and is billed by capacity as well as by the number of I/O operations per second. Data transfers in and out of IBM SmartCloud Enterprise also incur charges based on the volume of traffic.

At the time this work was conducted, IBM SmartCloud Enterprise offered multiple versions of SUSE Linux® and Red Hat Enterprise Linux (RHEL). We predominantly used RHEL version 5.5. For more information about the IBM SmartCloud Enterprise offering, its resources, and pricing, see Resources.


Searching for opportunities in the cloud stack

To minimize processing cost, we had to look for opportunities throughout the stack. The key areas were:

  • Data transfer in and out of the cloud. These transfers occur over WANs and are expensive. Minimizing the volume of data actually transferred is important, as is dealing with network errors, live data sets, network latency (when transferring data over long distances), data compression, and security.
  • Data transfer within the cloud. Depending on the combination of cloud resources and storage offerings in use, these costs can also be nontrivial. For instance, you ought to consider the life cycle of the computing nodes, ensuring that they do not idle too long waiting for data load to finish before computation can commence.
  • Structure of the various tiers of cloud storage. When dealing with big data, it makes sense to segregate it into multiple independent tiers, each with a potentially different service level. Figure 1 describes the data flow at a high level: All cloud storage tiers are elastic, and layers with a longer life span are shown in a darker color. The data is loaded into the cloud over a WAN, and then moved between the layers over either the LAN or the WAN (if the tiers are located in geographically separated cloud pods.)
    Figure 1. High-level data flow
    Image showing high-level data flow
  • Workflow orchestration. It had to be cloud-aware, incorporating the structure of all of the storage tiers, data transfers, speed, and recovery from failure.
  • Optimization of the cloud infrastructure stack. As we mentioned previously, the cloud needs to be optimized for big data.

These issues are covered in greater detail in the following subsections.

Elastic storage tiers in the cloud

The main requirement imposed on the storage system appears simple: Storage needs to be cheap when storing data for a long period of time, it should be able to support an elastic compute tier (and be able to supply it with data rapidly after its creation, minimizing the initialization time), it should be fast during computation, and it needs to be secure.

Clearly, these requirements compel one to select a multitiered architecture, as no one storage tier is likely to satisfy all of the above requirements (see Figure 1).

This is the three-tiered approach that we used:

  • Long-term storage tier. The long-term storage was based on a set of persistent storage volumes (2TB each) that archives genomic input data and results. The goal here was to minimize cost. The relatively large persistent storage volumes, which offer modest data throughput rates yet the lowest overall storage cost, are mounted through inexpensive 32-bit Copper instances. Depending on the use case, this tier could support data transfers over a WAN. Alternatively, it could be used to back up the transient data cache, because it is impervious to the loss of any of the VMs. Figure 2 illustrates the latter scenario, where the tier consists of several 32-bit Copper VMs, each attached to a 2TB storage block. At the time, only one storage block could be attached to one VM. We experimented with aggregating the storage blocks into one contiguous space but generally avoided that, favoring simplicity.
    Figure 2. Long-term storage tier
    Image showing the long-term storage tier
  • Transient data cache. Using the distributed file system General Parallel File System-Shared Nothing Cluster (GPFS-SNC) and the ephemeral storage of mid-sized instances, this tier is both performant and cost-effective. It was typically used to accept data transfers over the WAN and to supply data, through the use of parallel data transfers, at high throughput rates to the compute tier. In addition, we were able to scale the tier horizontally to accommodate a need either for more storage space or more performance. We typically used 64-bit Bronze VMs, each with 0.85TB of ephemeral disk and a replication factor of 2 to ensure that we were protected against a failure in any of the VMs that made up this tier.
    Figure 3. Data cache and run time cluster storage
    Image showing data cache and run time cluster storage
  • Compute cluster (run time) storage. During computation, we altered the application to use (almost exclusively) local, direct attached storage in each compute node for temporary data. We usually used 64-bit Gold VMs in this tier, each with 1TB of ephemeral storage. Because the original application required contiguous addressable storage space, we used GPFS-SNC to emulate it. This tier was optimized for performance and therefore not replicated. If any VM failed (a relatively rare event), we simply deleted the cluster and restarted computation from the beginning. Note that this storage tier is the most expensive and, therefore, the most elastic, because it flickers in and out of existence (see Figure 3).

Data transfer in and out of the cloud

At the time we executed this work, a shared object storage service was not yet available in the IBM cloud. However, after we determined that the destination of the data coming through the WAN would be the transient data cache (refer to Figure 1), we realized that we could engineer a simple solution and base it on any of the tools conventionally available for one of the Linux platforms.

We still needed to overcome the key challenges, which were minimizing the volume of data actually transferred (data compression, de-duplication, and so on), the ability to maximize the throughput regardless of latency, recovery from errors, the ability to support live data sets, and security.

It is interesting to observe that Amazon Simple Storage Service (Amazon S3) uses HTTP to transfer the data—hardly an optimal mechanism—while one of the most popular interfaces to it emulates rsync. Among the other solutions are those that use protocols other than TCP—for instance, Hybrid SCP (HSCP), an open source product. A range of commercial solutions are also available.

The raw input data is essentially a large collection of short, random read sequences (fragments of the organism's DNA), and because the number of possible distinct fragments approaches 4^fragment_length, we assumed that such data would not lend itself easily to de-duplication.

Based on the tests performed using a network simulator, we found that simply compressing data using gzip was sufficient to reduce the size of the data transmitted. We also found that parallel FTP was able to saturate the network bandwidth given a sufficient number of parallel pipes.

We eventually chose to use rsync over Secure Shell (SSH) as the base for our Data Transfer Utility (DTU), augmenting it as follows:

  1. We created a layer that spawned multiple, configurable numbers of rsync transfers in parallel to saturate the available bandwidth.
  2. We implemented a scheduling mechanism, using the UNIX® cron utility, that would restart failed transfers even if the source node were rebooted. This also helped us deal with live data sets: rsync would know that new files appeared and transfer only those files.
  3. We created an "exit hook" for the DTU to execute a specified command on one or more of the data target nodes so as to make possible end-to-end orchestration.

Because of our choice of rsync as the underlying engine, the DTU was simple and intuitive to use. We were able, for instance, to enable most of the rsync flags useful in filtering the data.
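
The DTU itself is not published with this article, so the fragment below is only a minimal sketch of the behavior just described: a configurable number of parallel rsync processes over SSH, retried by the cron schedule on failure, with an exit hook fired at the target once everything has arrived. The host names, paths, and hook script are hypothetical.

Listing 2. Sketch of a DTU-like parallel rsync wrapper

"""Minimal sketch of a DTU-like wrapper: launch several rsync transfers in
parallel over SSH, then fire an "exit hook" command on the target when all of
them succeed. Paths, host names, and the hook command are hypothetical."""
import subprocess
import sys

SOURCE_DIRS = ["/data/incoming/part0", "/data/incoming/part1",
               "/data/incoming/part2", "/data/incoming/part3"]
TARGET = "gateway.cloud.example.com:/gpfs/cache/incoming/"
EXIT_HOOK = ["ssh", "gateway.cloud.example.com", "/opt/ehpcs/notify_new_dataset.sh"]

def main():
    # One rsync per source directory; -az compresses in flight and preserves
    # attributes, --partial lets an interrupted transfer resume when cron
    # re-runs this script after a failure or reboot.
    procs = [subprocess.Popen(
                 ["rsync", "-az", "--partial", "-e", "ssh", src, TARGET])
             for src in SOURCE_DIRS]
    exit_codes = [p.wait() for p in procs]
    if any(code != 0 for code in exit_codes):
        # Leave retries to the cron schedule; rsync resends only the deltas.
        sys.exit(1)
    # End-to-end orchestration: tell the cloud-side orchestrator the data is ready.
    subprocess.check_call(EXIT_HOOK)

if __name__ == "__main__":
    main()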

Data transfer within the cloud

Optimal data transfer of large data sets within a cloud also poses unique challenges.

First, the data transfer must be fundamentally parallel and aligned with the "many-to-many" concept. This is similar to the problem discussed in the previous section but driven in this case by the fact that the data is likely to be spread out across a plurality of source and target nodes (see Figure 3), each with a capped maximum bandwidth throughput (for example, 1Gbps, as was the case in IBM SmartCloud Enterprise). By orchestrating the data transfer between N source and M target VMs, we are able to increase the maximum aggregate throughput by a factor of up to min(N,M).

Second, to be truly parallel, the data-transfer system needs to understand and enforce affinity rules between the data and the VMs. Put differently, the system needs to be smart enough to calculate the source and target VMs for any data subset and orchestrate its transfer.

Third, the transfer mechanism must be robust and able to recover from failure within a reasonable time frame. Any cloud environment involves a certain non-zero failure rate of its VMs and network.

Fourth, the system needs to be flexible enough to support transfers over short distances (LANs) as well as long distances (WANs) in case the storage and compute farms need to be geographically separated. This could happen, for instance, when one of the cloud pods is undergoing maintenance and a compute farm cannot be provisioned there.

Figure 4 illustrates a case when data is transferred between source and target clusters of four VMs each. Storage is direct-attached to each VM (compare with Figure 3). Data affinity rules govern the distribution of data in the source and target clusters, precluding any intracluster network use. Each of the four data transfers runs in parallel with no gateways.

Figure 4. 4x4 parallel file transfer
Image showing 4x4 parallel file transfer

When the Apache Hadoop Distributed File System is used in both the source and target data clusters, distcp is commonly leveraged to accelerate the data transfer. Because of our dependency on a fully Portable Operating System Interface (POSIX)-compliant file system on both ends of the transfer, we eventually chose to implement a custom data-transfer subsystem on the basis of rsync with SSH. rsync took care of relatively speedy recovery from temporary network outages while providing an advanced filtering capability that enabled us to enforce data-locality rules. SSH provided end-to-end security. Because this mechanism was TCP based, it did not suffer a catastrophic performance loss if it had to be executed over a WAN—when the source and target clusters had to be geographically separated.
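
A minimal sketch of the affinity-aware, many-to-many transfer this describes follows. It deterministically maps each data subset to one source and one target node and launches one point-to-point rsync stream per subset; the host names and paths are hypothetical, and failure handling and WAN-specific tuning are left out.

Listing 3. Sketch of an NxM affinity-aware parallel transfer

"""Sketch of an NxM (here 4x4) affinity-aware parallel transfer: each data
subset is deterministically mapped to one source and one target node, so every
rsync stream is point-to-point and no gateway is involved. Host names and
paths are hypothetical."""
import subprocess

SOURCE_NODES = [f"cache{i:02d}" for i in range(4)]    # data cache tier nodes
TARGET_NODES = [f"compute{i:02d}" for i in range(4)]  # compute cluster nodes
DATA_SUBSETS = [f"sample_A/fragments_{i:03d}" for i in range(16)]

def placement(index, nodes):
    # Affinity rule: round-robin by subset index keeps the mapping
    # deterministic, so both ends agree on where each subset lives.
    return nodes[index % len(nodes)]

procs = []
for i, subset in enumerate(DATA_SUBSETS):
    src = placement(i, SOURCE_NODES)
    dst = placement(i, TARGET_NODES)
    # Pull-style transfer executed on the target node over SSH.
    remote_cmd = (f"rsync -az --partial "
                  f"{src}:/gpfs/cache/{subset}/ /local/data/{subset}/")
    procs.append(subprocess.Popen(["ssh", dst, remote_cmd]))

# Wait for all point-to-point streams; a failed one can simply be re-run,
# because rsync resumes where it left off.
exit_codes = [p.wait() for p in procs]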

Workflow orchestration

The overall processing consisted of two steps that are quite different in nature: data transfer into and out of the cloud, and data processing within the cloud.

Data exchange with the cloud

As described earlier, we designed and developed a stand-alone and "componentized" rsync-based DTU to handle data transfers from and to the cloud. It had to be installed on each client server that mounted the file systems containing raw input data and scheduled through the UNIX cron utility to watch over these file systems for any incoming files.

Upon discovering a new data set, the DTU initiates its transfer through several parallel rsync processes (executing a data push). When the transfer is finished, the DTU calls an orchestrator module, the Elastic HPC Service (eHPCs), indicating that it can now start processing the data.

Figure 5 describes the flow. All of the nodes shown need to exist at the time of the transfer. The DTU connects to one of the redundant gateways in the cloud, which also act as a client for the cloud data cache storage cluster. The active gateway writes the data into the GPFS-SNC cluster as it receives it. As the transfer is under way, the data is being striped between the nodes of the storage cache. It should be noted that the DTU did not incorporate the intelligence of enforcing data-to-node affinity in a way that was compatible with the rest of the orchestration. It therefore relied on GPFS-SNC to perform striping. Consequently, an extra step was required to enforce this affinity. We experimented with including or excluding this step. In addition, we experimented with using GPFS-SNC itself to facilitate this type of data transfer.

Figure 5. Data load and preparation
Image showing data load and preparation

After the data has been processed and the results deposited into the data cache tier, the DTU picks them up and pulls them back to the client nodes, cleaning up space in the data cache file system (not shown in Figure 5). The number of nodes in the data cache tier was ultimately a function of the data processing throughput, and we scaled it manually. If there were a need to process data coming from a different geographic region, we simply replicated this "unit of processing" topology in the IBM SmartCloud Enterprise pod closest to the data, as the units of processing are independent of one another.

Data processing in the cloud

Although the data cache tier needs to exist before the transfer can commence, the more expensive compute tier does not: It can be created as needed. Because we knew our workload to be quite variable, this design point of an elastic compute tier was critical to our goal of cost containment.

Figure 6 outlines the processing workflow in the cloud:

  1. The eHPCs orchestrator provisions a new compute cluster. If the policy allows cluster reuse, it will use an idle cluster instead or schedule the workflow on a busy one, unless its queue is full.
  2. Input data is loaded into the compute cluster.
  3. Data is processed in the compute cluster. Because of data partitioning, there's little network traffic during this step, except to shuffle the data once as described below. Just like the inputs, the results are automatically partitioned between the compute cluster nodes.
  4. The results are moved to the data cache tier through a parallel transfer in accordance with the affinity rules.
  5. eHPCs either collapses the cluster or allows it to remain idle (that is, caches it), depending on the policy.
Figure 6. Processing workflow in the cloud
Image showing processing workflow in the cloud

Figure 7 focuses on the data fragments as they make their way through the processing steps in the cloud. Nodes are shown in a darker color. For simplicity, the number of compute nodes is shown to be equal to the number of the nodes in the data cache cluster. Notice the clean enforcement of data-to-node affinity rules to minimize network traffic.

After the alignment (by chromosome and region) results are available for all fragments, we need to compute new affinity rules so as to redistribute the chromosome regions equally between the nodes again (a sketch of this computation follows Figure 7). When this data shuffle step is complete, the nodes are nearly autonomous again—each processes only the fragments assigned to it. This workflow is conceptually similar to that of Langmead et al., although not orchestrated with Hadoop or MapReduce.

Figure 7. Data flow in the cloud
Image showing data flow in the cloud
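
As promised above, here is a minimal sketch of the shuffle-step affinity computation: assign the aligned chromosome regions to compute nodes so that each node receives roughly the same amount of data. The region names, sizes, and the greedy balancing rule are illustrative assumptions, not the pipeline's actual algorithm.

Listing 4. Sketch of recomputing affinity rules for the shuffle step

"""Sketch of recomputing affinity rules for the shuffle step: assign aligned
chromosome regions to compute nodes so the per-node data volume is balanced.
Region names and sizes are made up for illustration."""

def balance_regions(region_sizes, nodes):
    """Greedy bin packing: always give the next-largest region to the
    currently lightest node. Returns {region: node}."""
    load = {node: 0 for node in nodes}
    assignment = {}
    for region, size in sorted(region_sizes.items(), key=lambda kv: -kv[1]):
        node = min(load, key=load.get)
        assignment[region] = node
        load[node] += size
    return assignment

if __name__ == "__main__":
    # Hypothetical aligned-data volume (in MB) per chromosome region.
    sizes = {"chr1:0-50M": 4200, "chr1:50M-100M": 3900, "chr2:0-50M": 4100,
             "chr2:50M-100M": 3700, "chrX:0-50M": 2600, "chrY:0-25M": 800}
    nodes = [f"compute{i:02d}" for i in range(4)]
    for region, node in balance_regions(sizes, nodes).items():
        print(f"{region} -> {node}")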

The Elastic HPC Service

Earlier, we introduced the fundamental problem of having to run a workflow in the cloud with the understanding that the required compute cluster may or may not exist at the time of workload submission. As this problem was not specific to the application we were porting or to bioinformatics, we set out to define and create a separate component that could accomplish it. We called the component Elastic HPC Service to differentiate it from Amazon Elastic MapReduce (Amazon EMR), which is based on Hadoop.

Our basic requirements and architecture principles were:

  • The component should be implemented as a service with a neutral interface.
  • It should be able to support multistep workflows.
  • The service should be able to provision multinode clusters and run workflows on them.
  • The service should be able to support multiple cluster types (for example, Hadoop, Platform LSF, Open Grid Scheduler).
  • The service should be extensible to support multiple types of clouds.
  • The service should be lightweight, intuitive, and easy to use.

Figure 8 outlines the object model for eHPCs. There are only four main objects. The Job object is the unit of work. It has an associated executable as well as input and output parameters. A Workflow consists of one or more Jobs plus the logic connecting them. Each Workflow must specify a Cluster on which it needs to run. A Cluster is a collection of Nodes in a cloud. A Cluster also has a type—for instance, Hadoop Cluster, GPFS-SNC Cluster, and so forth. A Node is a unit of compute capacity. For simplicity, eHPCs does not have its own authentication system: It leverages the authentication system of the underlying cloud.

Figure 8. The eHPCs object model
Image showing the eHPCs object model
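
Because the object model is small, it can be restated directly as code. The data classes below are a reconstruction of the four objects in Figure 8; any field beyond the relationships named in the text (executable, parameters, cluster type, and so on) is an illustrative guess rather than the actual eHPCs schema.

Listing 5. Reconstruction of the eHPCs object model

"""Reconstruction of the eHPCs object model from Figure 8 as data classes.
Only the relationships described in the text are modeled; field names are
illustrative, not the real eHPCs schema."""
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """A unit of compute capacity in the cloud (one VM)."""
    instance_id: str
    size: str            # e.g. "64-bit Gold"

@dataclass
class Cluster:
    """A collection of Nodes with a type (Hadoop, GPFS-SNC, and so on)."""
    cluster_type: str
    nodes: List[Node] = field(default_factory=list)

@dataclass
class Job:
    """The unit of work: an executable plus input and output parameters."""
    executable: str
    inputs: dict = field(default_factory=dict)
    outputs: dict = field(default_factory=dict)

@dataclass
class Workflow:
    """One or more Jobs plus the logic connecting them; runs on a Cluster."""
    jobs: List[Job]
    cluster: Cluster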

When designing eHPCs, we tried to incorporate the best features of other workflow engines in this space—particularly Amazon EMR, Jaql, and Oozie—without incurring a feature overload. Amazon Cloud Formation, a recent feature-rich topology automation framework, is also able to provision groups of resources, although it ultimately targets a different set of goals.

Figure 9 illustrates how eHPCs provisions a Hadoop cluster. Similar to Amazon EMR, eHPCs does not hide the powerful feature set of the compute engine that it provisions (Hadoop, in this example). Users can still access the engine directly. However, they don't need to, necessarily: eHPCs is able to orchestrate workflows in a way that is not tied to Hadoop, IBM Tivoli® Workload Scheduler LoadLeveler®, or Open Grid Scheduler. We did our best, however, to leverage the capability of the underlying engine to manage its cluster to increase eHPCs scalability.

Figure 9. eHPCs provisioning a Hadoop cluster
Image showing eHPCs provisioning a Hadoop cluster

Figure 10 shows a sample workflow submission request. This particular workflow consists of two steps that need to be executed sequentially. The first step is to issue the sleep 600 command; the second is to issue the sleep 1200 command. The workflow stipulates a dependency on a cluster of type 1 (64-bit RHEL) consisting of two Copper nodes located in Germany. The cluster_placement=3 parameter implies that eHPCs has the freedom to manage the life cycle of the cluster: If such a cluster already exists and is available, eHPCs reuses it; otherwise, it creates it.

Figure 10. eHPCs workflow submission
Image showing eHPCs workflow submission
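
The exact request syntax of Figure 10 is not reproduced in the text, so the fragment below is only a hypothetical rendering of the parameters it describes: two sequential sleep steps, a type-1 (64-bit RHEL) cluster of two Copper nodes in Germany, and cluster_placement=3. The field names are guesses; only the values come from the description above.

Listing 6. Hypothetical rendering of the Figure 10 submission request

"""Hypothetical rendering of the Figure 10 workflow submission as a request
payload. Field names are guesses; only the parameter values come from the
text."""
import json

workflow_request = {
    "name": "sample-sequential-workflow",
    "steps": [
        {"name": "step1", "command": "sleep 600"},
        {"name": "step2", "command": "sleep 1200", "depends_on": "step1"},
    ],
    "cluster": {
        "type": 1,                # 64-bit RHEL
        "node_size": "Copper",
        "node_count": 2,
        "location": "Germany",
        "cluster_placement": 3,   # eHPCs may reuse an existing idle cluster
    },
}

print(json.dumps(workflow_request, indent=2))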

eHPCs was quite useful, as it helped us manage multitudes of compute clusters across several clouds in the process of development, testing, and production work.


The real-world implementation

The combination of the techniques described earlier was quite effective in getting the application accurately ported to the cloud. The enforcement of data-to-compute affinity was the critical ingredient that helped bring down unnecessary network traffic. eHPCs ensured the resource elasticity required to deal with spiky traffic. In this section, we briefly cover specific quantifiable outcomes, such as data transfer, storage, and cloud VM performance.

Application performance

Figure 11 provides a Ganglia Monitoring chart of the genomic sequencing pipeline running on an eight-node 64-bit Gold cluster. The run started with data being loaded from the data cache tier into the compute cluster (step 1; the green peak shows the input network traffic).

Figure 11. Application performance profile
Image showing the application performance profile

Then (step 2), processing of the genomic data occurs. In this case, job execution took about 30 hours: the loading of input data (step 1) was about 30 minutes, while processing (step 2) took 29 hours.

During processing, the network was largely quiet except during the data load, shuffle, and results transfer (some intracluster network traffic sometimes took place because of space constraints on some nodes). CPUs were fully loaded during the alignment phase of step 2, yet less so during chromosome assembly. Note that there is extensive I/O against the direct-attached ephemeral disk (not shown). We had to increase the GPFS-SNC cache size to 4GB on each node for optimal performance.

WAN data transfer performance

We performed two groups of tests to identify the optimal data-transfer mechanism. First, using a Simena network emulator, we created a 10Mbps WAN with a 90ms round-trip time (RTT) to simulate a transatlantic data transfer. We then selected a data folder containing 64 files and totaling 128MB. We transferred it over the emulated WAN both uncompressed and tar+gzip-compressed (whereby its size dropped to 66MB). Figure 12 shows the results.

Figure 12. Emulated WAN data transfer speed tests
Image showing emulated WAN data transfer speed tests

We used the Riverbed Steelhead network optimizer, Filezilla, HSCP, and regular SCP. The combination of gzip and FTP (two parallel transfers were used in the tests) had the best overall time and saturated the link. As we mentioned, de-duplication is not likely to work when dealing with massive amounts of genomic data.

The second group of tests was conducted in IBM SmartCloud Enterprise and involved live VMs located in the Durham, North Carolina, and Ehningen data centers. Transfer rates varied depending on the data center load at a given time, yet Figure 13 represents a typical outcome, where throughput (MB/s) is plotted against the number of parallel transfers. FTP is slightly faster than rsync because of the overhead of the latter (start-up, encryption), yet the peak throughput was the same in both tests, at around 85MB/s. Because the physical nodes have 1GE adapters, the theoretical maximum ("speed of light") was around 125MB/s. We used 64-bit Bronze nodes running RHEL 5.5, and the RTT was measured at 103ms; a folder containing five 1GB files was transferred.

Figure 13. rsync throughput vs. the number of threads
Image showing rsync throughput vs. the number of threads

In general, the DTU, based on parallel rsync, offered good performance—at times approaching the 125MB/s (per node) limit. We also compressed the inputs and reduced the number of files to offset the rsync start-up time.

Storage performance

We generally followed the NxM parallel model when transferring data between the data cache and compute tiers, both running GPFS-SNC. This step essentially combined data transfer with the enforcement of the data affinity required for the computation; hence, results were somewhat suboptimal. Figure 14 illustrates this point in the case of an 8x8 transfer. Each of the eight compute nodes initiates a data pull from the corresponding storage node. Because the standard GPFS-SNC striping method was used when placing the data in the storage cluster, internal network traffic ensued (green line, area 1).

Figure 14. Data load into the compute cluster
Image showing data load into the compute cluster

To further improve performance, there were two alternatives:

  • Knowing the application characteristics, we could enforce the data-affinity rules, followed by parallel rsync/SSH to perform the NxM data transfer. As can be seen in Figure 14, the rsync/SSH process incurs a higher system and user CPU usage.
  • By mounting the file system of the storage cluster directly into the compute cluster, GPFS-SNC will take care of locating each block and perform the data movement in an optimal way. Because the storage file system is mounted on all nodes, the data-transfer method is a simple copy command. Compared to the first alternative, the data-transfer performance is faster, and the CPU usage is lower (area 2 of Figure 14).

We also validated that GPFS-SNC performs on par with the local file system when the data is local. Figure 15 compares GPFS-SNC to the third extended file system (ext3) for both reads and writes. This test was done with IOzone on a single 64-bit Gold VM. The file size was 1GB, and the record size was 1MB. The transfer mode was O_DIRECT, and the GPFS-SNC replication factor was set to 1, 2, and 3, respectively, on an eight-node cluster. The results show that GPFS-SNC is just as fast as ext3 with local write affinity. When data replication is enabled, write throughput is limited by the 1GE adapter in the node. This ability of GPFS-SNC to create the illusion of a contiguous file system with local write affinity was paramount to our porting effort.

Figure 15. GPFS-SNC vs. ext3 performance
Image showing GPFS-SNC vs. ext3 performance
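
The article does not list the exact IOzone invocation; under the parameters given above (1GB file, 1MB records, O_DIRECT, sequential write and read), a run along the lines of the sketch below would reproduce the setup. The mount points are hypothetical, and the authors' actual command line may have differed.

Listing 7. Sketch of an IOzone run matching the described parameters

"""Sketch of an IOzone run matching the parameters described above:
1GB file, 1MB record size, O_DIRECT, sequential write (-i 0) and read (-i 1).
Mount points are hypothetical; the authors' exact invocation is not given."""
import subprocess

for label, path in [("ext3", "/mnt/ext3/iozone.tmp"),
                    ("gpfs-snc", "/gpfs/runtime/iozone.tmp")]:
    cmd = ["iozone", "-s", "1g", "-r", "1m", "-I",  # -I requests O_DIRECT
           "-i", "0", "-i", "1",                    # write test, then read test
           "-f", path]
    print(f"--- {label} ---")
    subprocess.run(cmd, check=True)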

Experimental Iridium VM size

Because IBM SmartCloud Enterprise VM resources (CPU, ephemeral disk I/O) were uncapped at the time, we experienced performance variability when running the workload in shared cloud pods. This is widely known as the "noisy neighbor" effect. We therefore experimented with the "dedicated" VM size—the one that occupies an entire physical node (dubbed "Iridium"). Figure 16 and Figure 17 show that the Iridium VMs exhibited substantially reduced performance variability across several runs. In each case, a cluster of eight nodes was used, and the processing times are reported in wall-clock hours. We confirmed that runs 1 and 2 in Figure 16 collided on at least one physical node. Run 4 had one obvious laggard VM, but it did not share a physical node with any of the other VMs in our clusters.

Figure 16. Processing times on 64-bit Gold VMs
Image showing processing times on 64-bit Gold VMs
Figure 17. Processing times on 64-bit Iridium VMs
Image showing processing times on 64-bit Iridium VMs

Even though the plotted run times are for different samples, the variability in data size and quality alone did not exceed 15% and therefore cannot account for the variation in processing times.

Processing costs

The overall cost is made up of several main categories: data/results transfer in and out over the WAN, temporary storage in the data cache tier, and the computation itself. Figure 18 illustrates the components of the processing cost of one sample data set using our architecture in IBM SmartCloud Enterprise. In this example, we were using 64-bit SUSE Bronze VMs for the data cache storage tier and 64-bit SUSE Gold VMs for the compute tier. The listed VM pricing is for reference only and may vary in actuality. For simplicity, we are leaving out the long-term cloud storage tier; we assume that all of the data had been stored in the cache tier for 48 hours.

Figure 18. Sample processing cost breakdown
Image showing a sample processing cost breakdown
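
Figure 18 carries the actual numbers; the shape of the calculation behind it can nonetheless be sketched as below. Every rate in the fragment is a placeholder (the article notes that the listed VM pricing is for reference only), and the results size and node counts are assumptions, so only the structure of the formula should be taken from it.

Listing 8. Sketch of the cost model behind Figure 18

"""Sketch of the cost model behind Figure 18: WAN transfer + data cache
storage time + compute time. All hourly and per-GB rates below are
placeholders, not actual IBM SmartCloud Enterprise prices."""

# --- placeholder rates (illustrative only) ---
WAN_RATE_PER_GB = 0.10          # data in/out of the cloud
BRONZE_VM_PER_HOUR = 0.20       # 64-bit SUSE Bronze (data cache tier)
GOLD_VM_PER_HOUR = 0.50         # 64-bit SUSE Gold (compute tier)

# --- one sample data set; figures from the article where available ---
input_gb, results_gb = 200, 20          # ~200GB in; results size assumed
cache_nodes, cache_hours = 8, 48        # data held in the cache tier for 48 hours
compute_nodes, compute_hours = 8, 30    # ~30 wall-clock hours on 8 Gold VMs

wan_cost = (input_gb + results_gb) * WAN_RATE_PER_GB
cache_cost = cache_nodes * cache_hours * BRONZE_VM_PER_HOUR
compute_cost = compute_nodes * compute_hours * GOLD_VM_PER_HOUR

for name, value in [("WAN transfer", wan_cost), ("data cache", cache_cost),
                    ("compute", compute_cost),
                    ("total", wan_cost + cache_cost + compute_cost)]:
    print(f"{name:>12}: ${value:,.2f}")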

What we learned

As you can glean from Figure 11, the custom orchestration did a reasonable job of using system resources during processing in the cloud. CPU is the bottleneck most of the time, local disk I/O accounts for most of the rest, while the network I/O is perhaps the most controlled of all and comes in a distant third.

The data presented hints at opportunities for further improvement in multiple areas. On the CPU side (the clock cycles accessible to a VM), a careful study of the hypervisor's effects on a disk I/O-intensive load is required. In addition to optimizing a virtualized cloud to deal with big data, it is almost certainly advisable to devise a big data cloud that does not use virtualization.

Similarly, a study comparing disk I/O with and without virtualization and optimization of the delta is warranted. Ultimately, a big data environment is likely to need a higher spindle-to-core ratio than what IBM SmartCloud Enterprise was able to offer at the time.

Additional work is indicated on the data transfer within the cloud, as well. Figure 14 is a strong case in favor of using GPFS-SNC for the cluster data load, yet it would not fare well when the storage and compute clusters are geographically separated.

As we show in Figure 16, the impact of VM performance variability on the processing times can be significant. If it is not possible to control VM performance, you must resort to the Hadoop-like techniques of replicating data across multiple nodes to be able to reschedule jobs away from the laggard nodes.

In Figure 18, we show that most of the processing cost in the cloud comes from the computation itself. However, we feel that this component will eventually be dealt with via the choice of the proper hardware and algorithmic improvements, while the WAN transfer component is a lot more difficult to control and is likely going to dominate the overall genomic processing cost in the cloud.

The big data problem

The current definition of big data is tied to datasets so large that conventional tools fail to treat them adequately. It is important to point out that the typical size of one individual dataset was just around 200GB. As genomes are mutually independent, we observed early on that with proper automation enabled by the cloud, it was possible to run processing pipelines independently, rather than create and maintain tools that can accommodate aggregate data and compute volumes 10x or 100x larger.

The optimal size of the compute and data storage unit for us, therefore, was one sufficient for processing just one genome. The compute unit could quickly be requested from the cloud and returned to the cloud without incurring much idle time. The economies of scale given up in this case are limited to the cloud's elasticity overhead (for example, provisioning/de-provisioning time) as well as to the potentially more efficient packing of jobs onto the compute nodes (for example, there are times when the CPUs are not 100% utilized, as you can see in Figure 11). Pursuing them would come at the expense of increased complexity and security concerns, as different data sets would likely need to be isolated from one another.

MapReduce similarities

It is important to note that the processing pipeline architecture we ended up adopting was not altogether dissimilar to the Hadoop-enabled work of Langmead, Schatz, et al. Fundamentally, moving the data for this task costs more than moving the computation; therefore, a big data approach is more suitable here than the stateless-node HPC approach.

As we described in the introduction, time constraints prevented us from modifying the application. Therefore, we were not able to port it to the Hadoop framework. However, we were able to address with custom orchestration the tasks that Hadoop addresses generically (for example, task rescheduling on storage failure) and tune these settings to the IBM cloud. Indeed, storage failures were not common, and we were, therefore, justified in not replicating run time storage.

Acknowledgments

Our thanks to Brian Snitzer, Kristi Schultz, Curtis Hrischuk, Reshu Jain, and Kevin Pare. You made this work possible.

Future work

It is reasonably clear that the cloud is a good fit for bioinformatics workloads in general. It also is clear that the cloud hardware needs to be carefully chosen to properly support large, data-intensive workloads. One would envision, for instance, a nonvirtualized environment with a larger number of direct-attached spindles per core. Node interconnects may or may not need to be upgraded from the 1Gbps level. More testing would be required to determine the optimal range of hardware configurations.

As the price of compute and storage continues its decline, we expect the data transfer costs over the network to play an ever-greater role in the overall cost case. We ought to, therefore, consider a tiered cloud option with smaller, modular pods closer to the sources of data and larger overflow pods on the cloud provider premises.

Another area to explore is a detailed study on the frequency of appearance of different DNA fragments for different genomes for the purposes of data compression. This problem in general is tough, but opportunities may present themselves in the case of specific organisms.

Resources

Learn

Get products and technologies

Discuss

  • Get involved in the developerWorks community. Connect with other developerWorks users while exploring the developer-driven blogs, forums, groups, and wikis.
