It’s becoming increasingly clear that our health is influenced by our personal complement of microbes – our microbiome. Awareness of the microbiome has grown in leaps and bounds thanks to the massive capacity of scientific instruments that read the DNA of microbes. But many fundamental questions about microbes remain unanswered, even questions that seem like they should be easy. “Has anyone seen this microbe before?” “Where?” “When?” After today, answering these questions will be a lot easier. The problem was not a lack of data, but that each microbiome dataset was an island unto itself and not easily compared to the others. Working as a team of microbial ecologists, computational scientists, bioinformaticians, and statisticians, we analyzed the largest collection of microbiome data yet assembled, by a factor of 100. In the current issue of Nature, we report a first-of-a-kind microbiome database that lets researchers track microbes across the planet, even if the microbes don’t have a name (as is usually the case).
In 2010, many scientists were sequencing DNA from microbes in all sorts of environments, yet individual studies were not easily compared. Rob Knight (U.C. San Diego), Jack Gilbert (University of Chicago), Janet Jansson (Pacific Northwest National Laboratory), and 26 other leading microbiome scientists put the “crowd” to work generating exactly the right data to answer basic questions about where each microbe lives. The Earth Microbiome Project (EMP) was the audacious name they gave to the endeavor to catalog all microbial life. The EMP dataset described in Nature this week continues to grow as researchers share more studies. The dataset already contains 100 times more samples and 100,000 times more DNA sequences than a previous analysis. Open data and advanced bioinformatics algorithms are a powerful combination.
Using a microbe “barcoding” technique called 16S rRNA sequencing, we tracked 308,000 unique 16S gene fragments. Because the 16S gene is essential for bacterial life and mutates at a known rate, it is commonly used as a DNA barcode for identifying bacteria. These barcodes were found in 27,000 samples (of soil, water, leaves, stool, etc.) from 97 studies in 43 countries on seven continents. Each of the 308,000 unique DNA barcodes references a particular microbe. The paper reports in detail how the microbes are organized in different environments, including soil, animal gut, non-saline sediment, and many others.
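To make the idea of tracking a microbe by its barcode concrete, here is a minimal sketch in Python. It is an illustration only, not the EMP's actual software: the sample names, environments, barcode strings, and the helper functions `build_index` and `where_seen` are all hypothetical, and real 16S fragments are far longer than these six-letter stand-ins.

```python
from collections import defaultdict

# Hypothetical toy data: sample name -> (environment, barcodes observed there).
# Real 16S barcodes are much longer; these short strings are placeholders.
samples = {
    "soil_01": ("soil", ["ACGTAC", "TTGACA"]),
    "gut_07": ("animal gut", ["ACGTAC", "GGCATT"]),
    "sediment_03": ("non-saline sediment", ["TTGACA"]),
}


def build_index(samples):
    """Map each barcode to the set of (sample, environment) pairs containing it."""
    index = defaultdict(set)
    for name, (environment, barcodes) in samples.items():
        for barcode in barcodes:
            index[barcode].add((name, environment))
    return index


def where_seen(index, barcode):
    """Return a sorted list of (sample, environment) pairs for a barcode."""
    return sorted(index.get(barcode, set()))


index = build_index(samples)
print(where_seen(index, "ACGTAC"))  # samples in which this barcode was observed
print(where_seen(index, "NNNNNN"))  # a barcode nobody has recorded: empty list
```

Because the lookup key is the DNA sequence itself rather than a species name, the same question (“Has anyone seen this microbe before?” “Where?”) can be asked of a microbe that has never been named.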
This study throws our current state of knowledge about microbes into sharp relief. Forty-six percent of the DNA barcodes in a public database had an exact match in our dataset, indicating that we found almost half of known microbes in just under 100 studies. Meanwhile, the barcodes matching the database made up only 12 percent of the barcodes in our dataset, indicating that a large majority of detected microbes are not recorded in 16S gene databases. More than 88 percent of the microbes that we found are unnamed. Using barcodes, we tracked these microbes, too.
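The two percentages in this paragraph can both be true because they measure the same overlap against different totals: one asks how much of the reference database turned up in the samples, the other asks how much of what turned up in the samples is in the reference database. The sketch below shows the arithmetic in Python with made-up barcode sets; the sets and the function name `overlap_fractions` are hypothetical, and the resulting numbers are illustrative, not the study's.

```python
# Hypothetical toy sets of barcodes (short placeholder strings).
reference_db = {"AAAA", "CCCC", "GGGG", "TTTT"}
dataset = {
    "AAAA", "CCCC",                 # barcodes also present in the reference
    "ACAC", "AGAG", "ATAT", "CGCG",
    "CTCT", "GTGT", "GAGA", "TCTC",  # barcodes absent from the reference
}


def overlap_fractions(dataset, reference_db):
    """Return (share of the reference found in the dataset,
    share of the dataset found in the reference)."""
    shared = dataset & reference_db  # barcodes with an exact match
    return len(shared) / len(reference_db), len(shared) / len(dataset)


of_reference, of_dataset = overlap_fractions(dataset, reference_db)
print(f"Share of reference barcodes found in the samples: {of_reference:.0%}")
print(f"Share of sample barcodes found in the reference: {of_dataset:.0%}")
```

In this toy case the samples recover half of the reference database, yet matched barcodes are only a fifth of everything sampled. The same asymmetry, at much larger scale, is what leaves most detected microbes without a database entry.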
Large-scale microbiome data collection was used to spot global trends in the microbial colonization of the planet. Importantly, large-scale microbiome data collection also enables researchers to use artificial intelligence (AI) and machine learning to more deeply comprehend the outsized role that microbes play in the health of people and our planet.
IBM and UC San Diego recently announced a collaboration to explore AI for Healthy Living (AIHL). We are working together to develop AI and machine learning for the human microbiome. Large datasets like the EMP provide a critical mass of microbiome data that, augmented with additional data, allows us to apply cutting-edge AI and machine learning methods.