Today my American friends in the United States and across the global diaspora will enjoy Thanksgiving. Part of this festivity is the annual feast with family members, which traditionally includes turkey, vegetables, and many wheat-based products including buttermilk biscuits, pies and my favourite, stuffing.
While we enjoy the variety of delicious food available to us, scientists worry about the future supply of food, especially staple foods like wheat, and about the quality of agricultural soil.
Food security is a major challenge of our time, due to factors such as a rising population, climate change, drought, and other natural and man-made disasters. Scientists are seeking to address these concerns through the science of genomics, where computers are the core engine driving this scientific revolution.
Genomics has its roots in microbiology, from the time when people first observed life and living beings in various forms and came to regard cells as the fundamental unit of life. It took several decades to establish that cells contain hereditary information in the form of DNA, and that it is DNA that provides the blueprint for understanding the genetic makeup of an organism.
Today, as we enter the genomic era, much of biology has turned into a giant interdisciplinary endeavour, producing a mountain of data that must be analysed on powerful machines using sophisticated algorithms. Biology is now a Big Data enterprise.
Genomics Big Data Advances as part of the IBM Research and Hartree Centre Collaboration
As part of the collaborative programme between IBM Research and the Science and Technology Facilities Council’s (STFC) Hartree Centre, we have been tackling the big data challenge presented by genomics and related disciplines.
Powered by the supercomputing facilities at the Hartree Centre, a team of IBM researchers, along with colleagues from STFC, is engaged in a variety of industrial and academic collaborations to develop novel solutions that meet the demands of Big Data in genomics.
Our approach is centred around the principles of Data-Centric Cognitive Systems – a paradigm that provides a new and powerful approach to computing. The Data-Centric part provides a flexible compute infrastructure for dealing with the structured and unstructured data found in biology, and minimises data movement by co-locating computation with data.
The Cognitive aspect builds on the benefits of the data-centric architecture and applies machine learning techniques at scale, exploiting both CPUs and GPUs, to enable a multi-layered understanding of the organisms under study. Both approaches are essential in the context of biological analysis, not only because of the nature of the data, but also because of our limited understanding of complex living systems. Computer architecture and inference algorithms must be combined seamlessly for speed and efficiency.
Our programme also focuses on aspects of data handling that are specifically tailored to biological datasets, such as portability of multi-stage workflows, efficient metadata usage, knowledge representation and exploitation, and skills transfer to our colleagues and collaborators.
Food security is a major focus area of our genomics engagement at the Hartree Centre. We are working to understand the wheat genome and the metagenomics of soil – both are highly complex problems in their own right and well suited to supercomputers.
Soil is the most biodiverse environment on earth, where up to 10 billion bacterial cells are estimated to reside in each gram of soil. We have very little understanding of the micro-organisms that are responsible for soil fertility, nutrient recycling and the other major functions needed to maintain soil health. Identifying the resident microbes and their collective behaviour is the first step towards understanding soil.
The wheat genome, on the other hand, is highly complex: it is almost five times larger than the human genome and has six copies of each chromosome, compared to our two. It also contains more than 100,000 genes. Understanding the molecular mechanisms that drive regulation in wheat requires data processing and inference at a very large scale.
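To give a feel for that scale, here is a rough back-of-envelope sketch in Python. Only the "almost five times" ratio comes from the text above; the human genome size of about 3.1 billion base pairs, the 30x sequencing depth and the two bytes per sequenced base are illustrative assumptions of mine, and the function names are hypothetical rather than part of any Hartree pipeline.

```python
# Back-of-envelope sizing for a single wheat sequencing dataset.
# Only WHEAT_TO_HUMAN_RATIO comes from the text; the rest are assumptions.

HUMAN_GENOME_BP = 3_100_000_000  # assumed approximate human genome size in base pairs
WHEAT_TO_HUMAN_RATIO = 5         # "almost five times larger", so this is an upper bound
COVERAGE = 30                    # hypothetical sequencing depth (reads per position)
BYTES_PER_BASE = 2               # rough: one base character plus one quality character


def estimate_wheat_genome_bp(human_bp=HUMAN_GENOME_BP, ratio=WHEAT_TO_HUMAN_RATIO):
    """Approximate wheat genome size in base pairs from the human genome size."""
    return human_bp * ratio


def estimate_raw_data_bytes(genome_bp, coverage=COVERAGE, bytes_per_base=BYTES_PER_BASE):
    """Approximate uncompressed size of the raw reads for one sequenced sample."""
    return genome_bp * coverage * bytes_per_base


if __name__ == "__main__":
    wheat_bp = estimate_wheat_genome_bp()
    raw_bytes = estimate_raw_data_bytes(wheat_bp)
    print(f"Wheat genome: about {wheat_bp / 1e9:.1f} billion base pairs")
    print(f"Raw reads at {COVERAGE}x depth: about {raw_bytes / 1e12:.2f} TB")
```

With these assumed inputs, the estimate comes to roughly 15.5 billion base pairs and a little under one terabyte of uncompressed reads per sample, before any assembly, annotation or inference has been done.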
At the Hartree Centre, we are setting up large HPC-enabled computational pipelines to process large genomics and metagenomics datasets in considerably less time. At the same time, we are innovating through our Data-Centric systems to enable a system-wide understanding of soil as a living system, and to identify key regulatory gene networks in wheat that may play an important role in designing the crop varieties of the future.
If you happen to be at the 15th Rocky Mountain Bioinformatics Conference next month, you can hear my IBM colleagues talk about assembly-free metagenomics on a distributed computing platform, and about the wider genomics efforts at Hartree.
I hope that, while you enjoy the Thanksgiving feast, you also give a thought or two to the effort that goes into securing food for the future. With this, I wish you all a Happy Thanksgiving!
Authored by Dr. Ritesh Krishna, Computational Life Sciences Researcher at IBM Research.