For most of us, chemistry is a distant childhood memory that takes us back to our school days, when we got to experiment with chemical reactions. I mean, who didn’t love the school science fair? It was the one occasion we were allowed to make a mess in the kitchen by mixing baking soda, vinegar, water and red dye to make a volcano explode.
Chemistry is everywhere. From the vital ingredients in consumer products such as Aspirin to the raw materials of products such as Nylon, it plays an essential role in the products and technologies we most likely can’t imagine living without. However, what most of us perhaps don’t realize is that, on average, it takes at least 10 years to discover a new material and bring it to market, and that the estimated production costs are around 10 million USD. Take Nylon as an example: research began in 1927, and it was first used in a toothbrush in 1938. Or take vitamin B12, whose synthesis required 12 years and a workforce of more than 100 people, including postdoctoral researchers and PhD students.
Synthetic chemistry, or the art of making materials, remains one of the most traditional disciplines when it comes to digitization and the adoption of new technologies. Chemists still rely on many of the same protocols, and little progress has been made in modernizing the age-old practice of trial and error to enable a new era of accelerated discovery.
A dynamic group of scientists at IBM Research Europe set out to change this using modern tools such as artificial intelligence (AI), cloud technology and robotics.
IBM Scientists Change the Game
It all started three years ago when we began to develop machine learning models to predict chemical reactions. After a few months of internal development, we launched the service for free via the IBM Cloud in August 2018 and the response was incredible. We called it RXN for Chemistry.
The magic behind RXN for Chemistry is a state-of-the-art neural machine translation method that predicts the most likely outcome of a chemical reaction. Similar to translating Italian to English, our method translates the language of chemistry, converting reactants and reagents to products, using the SMILES representation to describe chemical entities.
Using SMILES, this molecule is translated into BrCCOC1OCCCC1
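To make the translation analogy a little more concrete, the sketch below splits the SMILES string from the caption above into tokens, the "words" that a sequence-to-sequence model would read. The regular expression is a deliberately simplified assumption that covers common organic symbols only; it is not the tokenizer used inside RXN for Chemistry, and tokenize_smiles is a hypothetical helper name.

```python
import re

# Simplified, assumed token pattern: bracket atoms, two-letter halogens,
# single-letter atoms (aliphatic and aromatic), ring-closure digits, bonds,
# branches, and the '.' and '>' separators used in reaction SMILES.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|%\d{2}|\d|[=#\-+\\/:~@()\.>]"
)


def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into tokens, failing loudly on unknown symbols."""
    tokens = SMILES_TOKEN.findall(smiles)
    # findall silently skips characters it cannot match, so check round-trip.
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


tokens = tokenize_smiles("BrCCOC1OCCCC1")
print(tokens)
```

Note that bromine is kept as the single token Br rather than being split into two letters, and that the two 1 tokens mark where the ring opens and closes.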
Since the launch we have been refining the training of the architecture and today, after two years, RXN for Chemistry is still the best performing data-driven AI method for forward reaction prediction, with more than 90% top-1 accuracy. But don’t take our word for it – just ask the 15,000 users who in total have generated more than 760,000 machine-learning predictions of chemical reactions in the past two years.
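For readers unfamiliar with the metric, top-1 accuracy is the fraction of test reactions for which the model's first-ranked prediction is the true product. The sketch below computes it on made-up toy data; it assumes products are compared as exact strings, whereas a real evaluation would first bring every SMILES string into a canonical form. The function name and the data are illustrative, not part of RXN for Chemistry.

```python
def top_k_accuracy(ranked_predictions, targets, k=1):
    """Fraction of reactions whose true product is among the first k predictions."""
    if len(ranked_predictions) != len(targets):
        raise ValueError("need one ranked prediction list per target")
    if not targets:
        raise ValueError("cannot score an empty test set")
    hits = sum(
        target in predictions[:k]
        for predictions, target in zip(ranked_predictions, targets)
    )
    return hits / len(targets)


# Toy, made-up predictions (best guess first) and true products.
predictions = [
    ["CCO", "CC=O"],
    ["CC(=O)O", "CCO"],
    ["CCN", "CCBr"],
]
truth = ["CCO", "CCO", "CCCl"]

top1 = top_k_accuracy(predictions, truth, k=1)
top2 = top_k_accuracy(predictions, truth, k=2)
print(f"top-1: {top1:.2f}, top-2: {top2:.2f}")
```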
More recently, in 2019, we began collaborating with a group of synthetic organic chemists at the University of Pisa, Italy, to integrate a retrosynthetic architecture into the RXN tool. To explain this, think about making a pizza. The retrosynthetic architecture tells you the ingredients for the pizza and provides high-level guidelines for combining them in the right order. Working with the team in Pisa, we added this feature to RXN for Chemistry last October.
The Research Behind the Autonomous Lab
Going back to making pizza, the general guidelines given by retrosynthetic analysis may not always be to your taste. There are always a few secret ingredients or detailed techniques that distinguish a gourmet pizza from a regular one – like pre-mixing part of the ingredients to form a poolish preferment, then mixing in the rest of the ingredients at a second stage. These are the kind of tips you pick up directly from cooks with more experience or from reading your favorite cookbooks. A chemist is no different when it comes to collecting tips.
And then you might wonder why it’s necessary to knead the pizza dough. This is probably the most tedious task, but it is also the most important for developing the texture. Still, mixing it all together and tossing it around may be fun once or twice, but doing it 50-60 times a day is tiresome and time-consuming. That time and energy could be better spent elsewhere. The same is true for a chemist synthesizing molecules.
So, how can we make chemistry fun again? We did it by reinventing the way chemistry is done altogether. All it took was a combination of AI, cloud technology and chemistry automation. This mixture led to the creation of RoboRXN: machine learning algorithms autonomously designing (AI) and executing (automation) the production of molecules in a laboratory remotely accessible (Cloud) with as little human intervention as possible.
So, do you remember the secrets of making pizza? The main challenge in chemistry is that lots of operational details on how to “cook” the chemical ingredients are reported in prose or in the form of unstructured data, which thwarts a straightforward analysis and interpretation. To be able to construct an AI model with the ability to learn the right steps of the chemical procedures, we first had to address the following challenge: designing an algorithm that specifically extracts the synthesis information for organic chemistry and converts it into a structured and automation-friendly format.
As with the entire RXN framework, we opted for a purely data-driven scheme. This means that once the machine learning algorithm has seen enough examples, it will be able to figure out on its own which words to pay attention to in order to extract the right production steps. To provide the training data for the machine-learning model, we set up an annotation framework that enabled us to generate examples of sentences related to synthesis procedures and their corresponding operations. The major advantage of such a data-driven approach is that it relies on data only: to improve it, one simply needs more examples.
In contrast to other approaches, our deep-learning model converts experimental procedures as a whole into a structured, automation-friendly format, instead of scanning texts in search of relevant pieces of information. Moreover, it does not rely on the identification of individual entities in sentences, nor does it require specifying which words or word groups the synthesis actions correspond to, which makes the model more flexible and reliable.
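As a rough illustration of what a "structured, automation-friendly format" could look like, the sketch below pairs one sentence of a procedure with a list of actions, which is also the shape an annotated training example might take. The sentence, the action names and the parameter keys are all hypothetical; they are not the actual RoboRXN schema, and the pairing here is written by hand rather than produced by a model.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    """One synthesis step: an operation name plus its parameters."""
    name: str
    params: dict = field(default_factory=dict)

    def render(self) -> str:
        # Flatten the step into a compact, machine-readable instruction.
        args = ", ".join(f"{key}={value}" for key, value in self.params.items())
        return f"{self.name.upper()}({args})"


# Hypothetical annotated example: free-text prose and its structured form.
sentence = "The mixture was stirred at room temperature for 2 h, then filtered."
actions = [
    Action("stir", {"temperature": "room temperature", "duration": "2 h"}),
    Action("filter"),
]

program = "; ".join(action.render() for action in actions)
print(program)
```

A sequence like this can be handed to automation hardware step by step, which the original prose sentence cannot.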
The construction of a ground truth dataset for chemical procedures allowed us to build the core of the RoboRXN technology, namely an AI model that, trained on a large number of chemical recipes, learns the specifics of chemicals to be able to recommend the correct sequence of operations to “cook” a specific target molecule.
Coming back to the pizza analogy: imagine an AI model that can not only retrieve your favorite recipes upon request, but can also automatically draw from its embedded knowledge to deliver an optimal list of instructions to make that gourmet pizza that will surely impress your dinner guests.
In IT terms, this is equivalent to having an AI architecture writing programs to make molecules (or cook food). Our aim in building RoboRXN was to use this AI model to eliminate the tedious human task of programming commercial automation hardware. And to make the RoboRXN system even more convenient and user-friendly, we implemented the entire set of services on the IBM Cloud to make it accessible anywhere there is an internet connection.
Revolutionizing Industrial Chemistry
The result is a reliable and autonomous infrastructure, integrating technologies such as cloud, AI and automation to assist chemists in not only predicting chemical reactions, but also in executing the production of a molecule or substance from anywhere in the world – which is particularly critical as we remain working from home.
What are the implications of this, you might ask? Just imagine if an automated system like RoboRXN could help chemists cut the discovery period of a new treatment for COVID-19 or any other virus in half.
Or what if RoboRXN could help accelerate the development of a fertilizer that doesn’t require consuming 1-2% of the world’s annual energy supply for its production?
The possibilities are endless when it’s humans + machines.
Laino et al. Predicting retrosynthetic pathways using transformer-based models and a hyper-graph exploration strategy. Chem. Sci., 2020, 11, 3316-3325. https://doi.org/10.1039/C9SC05704H