Metabolomics is the emerging field of science and technology concerned with revealing the roles of small molecules or metabolites in health and disease.
It looks at how they fuel cells, pass signals from one cell to another, or turn healthy cells into cancer cells. Metabolomics is a field of growing importance and has already enabled breakthrough discoveries in biology and medicine by providing diagnostics, aiding drug discovery, and supporting therapies to fight cancer, diabetes, autoimmune diseases, and infections.
Spatial metabolomics, the detection of metabolites in cells, tissues, and organs in their spatial context, is the current frontier in the field. Put simply, spatial metabolomics provides a microscope that shows not only what cells look like but also what they contain. Long before a cell changes its appearance (for example, while turning into a cancer cell), its metabolic content can already be used to predict its fate. Spatial metabolomics provides information about thousands of metabolites (as opposed to regular images containing just a few channels), but with this power comes the burden of handling gigabytes to terabytes of data generated from a tiny tissue section (1×1 mm). This places unprecedented demands on the algorithms used for data analysis.
To address the big data challenge in spatial metabolomics, we show how IBM Cloud Functions can dramatically simplify the data processing involved.
Understanding the unknown
The Alexandrov team at the international organization European Molecular Biology Laboratory (EMBL) unites biologists, chemists, computer scientists, and software engineers who develop bioinformatics methods and cloud software to address key challenges of spatial metabolomics. We have recently developed an algorithm that enables finding molecules hidden in gigabytes of spatial metabolomics data and implemented it into an open source cloud software called METASPACE.
In METASPACE, the general flow starts when a scientist uploads their medical images or datasets to the cloud application, provides metadata, and selects key parameters, including the choice of the molecular database and the specific classes of metabolites on which to focus (e.g., small molecules, lipids, or drugs).
Next, the METASPACE engine comes into action. First, the dataset and molecular database need to be preprocessed for the next stage of high-performance computing (HPC) analytics. In particular, the engine sorts the dataset and the database and chunks the data into smaller segments. Once all segments are generated, the real “magic” happens. The METASPACE annotation engine executes dedicated algorithms that scan through the segments to identify which molecules are encoded in the data and which spatial regions they correspond to. Once all molecules are found, the engine generates easily explainable images showing their location. These images can be overlaid with histological or medical images to support biological discoveries, diagnosis, or drug development.
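The sort-and-chunk preprocessing described above can be illustrated with a minimal sketch. This is not the METASPACE implementation. It assumes each peak is a hypothetical (mz, pixel_index, intensity) tuple, and the function names segment_bounds and chunk_peaks are invented for illustration. The idea shown is to sort peaks by mass-to-charge value and cut them into segments of roughly equal size, so that each segment can be handled by a separate task.

```python
from bisect import bisect_left


def segment_bounds(sorted_mz, n_segments):
    """Pick m/z cut points so each segment holds roughly the same number of peaks."""
    if n_segments < 1:
        raise ValueError("n_segments must be at least 1")
    if not sorted_mz:
        return []
    step = len(sorted_mz) / n_segments
    return [sorted_mz[int(i * step)] for i in range(1, n_segments)]


def chunk_peaks(peaks, n_segments):
    """Sort hypothetical (mz, pixel_index, intensity) tuples and split them by m/z range."""
    ordered = sorted(peaks)  # tuples sort by their first field, the m/z value
    mzs = [p[0] for p in ordered]
    segments = []
    start = 0
    for bound in segment_bounds(mzs, n_segments):
        end = bisect_left(mzs, bound, lo=start)  # first peak at or above the cut point
        segments.append(ordered[start:end])
        start = end
    segments.append(ordered[start:])  # the remaining peaks form the last segment
    return segments


# Illustrative data only, not taken from a real dataset.
peaks = [
    (500.2, 0, 1.0), (100.1, 1, 2.0), (300.3, 0, 3.0),
    (200.0, 2, 1.5), (400.4, 1, 0.5), (600.6, 2, 2.5),
]
segments = chunk_peaks(peaks, 3)
```

Cutting on m/z values rather than on list positions means the same cut points could also be applied to a sorted molecular database, so that a dataset segment only needs to be compared against the matching database segment. Whether METASPACE does exactly this is not stated here and is an assumption of the sketch.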
Below is just one example from millions of images generated by METASPACE that shows the localization of a metabolite within an animal:
From data to data
METASPACE is a big data cloud platform used by scientists from all over the globe to submit multi-gigabyte datasets generated in various experiments by various instruments, often with high uncertainty about the quality or content of the data. The critical aspect of its operation is the sheer size of the data. Every pixel in a spatial metabolomics image can be considered a data point containing thousands of molecules, and the number of pixels can reach a million. This generates huge amounts of data (sometimes larger than 1 TB) that can’t be loaded entirely into memory but still need to be sorted, chunked, annotated, and collected together. The challenge, a typical example of big data analysis, is how to process the data in concurrent tasks running on different machines and how to assemble the results.
Obviously, we don’t have the luxury of having a super machine with unlimited memory and computing power to load all data into memory and process it there. But even if we did, it would still be challenging to efficiently utilize the computing resources of this super machine since developers would need to take care of running multiple threads, processes, etc. in coordination.
To address the big data challenge, METASPACE is currently powered by the Apache Spark technology. However, even this implementation is not scalable enough to deal with the big data of ever-growing and diverse spatial metabolomics datasets, in particular due to the rising popularity of our platform. The main challenge of using Apache Spark is the need to hard-code or predict in advance the resources needed at every point in time of processing the data. This is exactly the deficiency that can be addressed by the serverless processing paradigm. Thus, we partnered with IBM Research, which recently developed novel serverless approaches integrated with IBM Cloud Functions and IBM Cloud Object Storage to process massive spatial metabolomics datasets.
The easy step into serverless
IBM Cloud Functions is IBM’s Functions-as-a-Service (FaaS) platform. Using IBM Cloud Functions, we can almost instantaneously allocate large amounts of computing resources and pay only for the resources actually used. However, challenges remain: how to effectively scale the METASPACE engine for serverless processing of a dataset stored in IBM Cloud Object Storage, how to monitor the executions, and how to execute all tasks as a single “logical” job in IBM Cloud Functions.
In the context of the European Horizon2020 project CloudButton and together with IBM Research, we have developed our solution based on the open source PyWren-IBM framework. PyWren-IBM introduces serverless computing with minimal effort and brings automated scalability to massive data processing. The goal of PyWren-IBM is to enable an easy move to serverless by providing a “push to the cloud” experience: Users can focus on their code, while PyWren-IBM takes care of executing that code in the cloud. PyWren-IBM is a Python library and can be used to execute any Python function with its dependencies. Read more about PyWren-IBM-Cloud here.
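To give a feel for the “push to the cloud” programming model, here is a minimal local sketch of mapping a plain Python function over data segments. It deliberately does not use PyWren-IBM: the standard-library ThreadPoolExecutor stands in for the serverless backend so the example runs anywhere, and the names count_peaks_above and run_map are hypothetical. For the real executor calls, consult the PyWren-IBM documentation.

```python
from concurrent.futures import ThreadPoolExecutor


def count_peaks_above(segment, threshold=1.0):
    """Plain Python function applied to one segment (a list of intensities)."""
    return sum(1 for intensity in segment if intensity > threshold)


def run_map(func, segments, max_workers=4):
    """Local stand-in for a serverless map: one task per segment, results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(func, segments))


# Illustrative segments only; in the serverless setting each would be a chunk in object storage.
segments = [[0.5, 1.2, 3.0], [0.1, 0.2], [2.0, 2.5, 0.9, 4.1]]
results = run_map(count_peaks_above, segments)
total = sum(results)  # assemble the partial results into one answer
```

The user code is only the function and the list of inputs. In this sketch the pool decides where each call runs, and in a serverless setting that role would fall to the framework.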
Serverless computing for spatial metabolomics
Now, let’s bring the pieces together and see how it works. The figure below shows a high-level view of our design. Our core approach is decentralized and completely serverless, and we let the PyWren-IBM framework determine the appropriate scale of parallelism needed to process input datasets. To achieve this, our code evaluates the input datasets and then decides on the number of serverless actions required, with the aim of maximizing performance and minimizing the cost of the processing. This approach allows us to dynamically adjust the amount of compute resources while the data is being processed. This is in contrast to a Spark-based approach, where the amount of compute resources is determined before starting the data processing and can’t be adjusted as the processing progresses.
Our initial benchmarks showed that we managed to process certain datasets in less than an hour, while the same processing took at least four hours on an Apache Spark cluster deployed over four VMs with 32 GB each. We also observed that a larger Apache Spark cluster would not improve the performance, as we are still bottlenecked by the METASPACE code running in the Spark driver.
The serverless solution developed together with IBM allowed us to process datasets that were previously out of reach, without additional effort for infrastructure maintenance, configuration, and deployment. Even more importantly, it allowed us to start developing novel scientific approaches to find molecules in the so-called “dark molecular matter,” that is, the 95% of data that was previously not annotated.
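The step of evaluating an input dataset and deciding how many serverless actions to launch can be sketched as follows. This is a minimal illustration under stated assumptions, not the logic used in METASPACE or PyWren-IBM: the function name plan_actions, the 64 MiB target chunk size, and the cap of 1,000 concurrent actions are all hypothetical values chosen for the example.

```python
import math

MIB = 1024 ** 2


def plan_actions(dataset_bytes, chunk_bytes=64 * MIB, max_actions=1000):
    """Return (number_of_actions, bytes_per_action) for a dataset of the given size.

    chunk_bytes and max_actions are assumed limits for illustration only.
    """
    if dataset_bytes <= 0:
        return 0, 0
    # One action per target-sized chunk, but never more than the concurrency cap.
    n_actions = min(max_actions, math.ceil(dataset_bytes / chunk_bytes))
    # Spread the data evenly over the actions that will actually run.
    bytes_per_action = math.ceil(dataset_bytes / n_actions)
    return n_actions, bytes_per_action


small_plan = plan_actions(1024 ** 3)  # a 1 GiB dataset
large_plan = plan_actions(1024 ** 4)  # a 1 TiB dataset hits the assumed cap
```

Because the plan is computed per dataset at run time, a small upload uses a handful of actions while a very large one scales out to the cap, and nothing has to be provisioned in advance.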
Try it yourself
Whether you are a molecular biology expert, technology guru, or computer scientist eager to learn new skills, you are welcome to try our solution yourself. We created a public open source repository outlining all steps of how to execute spatial metabolomics flows with PyWren-IBM on the IBM Cloud. Please check out the repository and execute the demo notebook.
What’s next?
Using serverless solutions allowed us to address key challenges in spatial metabolomics and to annotate dark molecular matter. This will have an impact on our research and on the research of hundreds of scientists across the globe who use METASPACE in various biomedical applications. This is ongoing work, and we will continue improving our solution.
In this age of big data generation, the biotechnology and biomedical industries and academia face the challenge of big data processing. The serverless computing paradigm has emerged as an attractive alternative and has been shown to make flows more efficient, yet many scientists and technologists are unsure of what is required to make the move and what it involves. By sharing how we used serverless to address a key challenge in spatial metabolomics, we hope to provide an example of how existing code can be moved to serverless without the painful process of starting from scratch and learning new skills.
Acknowledgments
Special thanks to Omer Belhasin (IBM Research) and Lachlan Stuart (EMBL) who worked hard to make this happen and to Josep Sampe (URV) for his contributions to PyWren-IBM.
This work was co-funded by the EU Horizon2020 project, CloudButton.