Novel AI tools to accelerate cancer research

Cancer is the second leading cause of death worldwide[i], with an estimated 18.1 million new cases and 9.6 million deaths attributed to it in 2018[ii]. The search for more effective anti-cancer drugs is a global effort involving academia and industry. In our Computational Systems Biology group at the IBM Research lab in Zurich, we are building machine learning approaches that can potentially help to accelerate our understanding of the leading drivers and molecular mechanisms of these complex diseases, as well as the differences in tumor composition occurring across various cancer types. Our goal is to deepen our understanding of cancer to equip industries and academia with the knowledge that could potentially one day help fuel new treatments and therapies.

At the 18th European Conference on Computational Biology (ECCB) and the 27th Conference on Intelligent Systems for Molecular Biology (ISMB), to be held July 21-25 in Basel, Switzerland, IBM will present novel research that led to the implementation of three machine learning solutions aimed at accelerating and guiding cancer research. Below is a brief discussion of each of these three advancements.

PaccMann: How deep learning can help predict and explain the efficacy of drugs

Developing a single anticancer drug and getting it approved can cost hundreds of millions of US dollars[iii], which speaks to the potential for greater efficiency along the development pipeline. One way to cut costs in the search for new and better drugs would be to identify at an early stage which candidate compounds are most likely to prove effective against the targeted disease. For this purpose, we developed PaccMann (Prediction of anticancer compound sensitivity with Multi-modal attention-based neural networks), a multimodal deep learning solution that uses data from disparate sources to help predict how cells in diseased tissue will respond to a given drug. In our research, we applied PaccMann to predict the sensitivity of cancer cell lines to known drugs, achieving higher predictive performance than existing algorithms[iv].

PaccMann exploits data on gene expression in the cell line under study, as well as information on the molecular structure of an anticancer compound. Additionally, PaccMann takes into account prior knowledge on protein interactions that are relevant to understanding how a cell line responds to a drug. This prior knowledge is independent of the cell line-drug pair under consideration. In the underlying study, we used PaccMann to predict drug sensitivity for more than 200,000 drug-cell line pairs contained in the GDSC database. On these data, PaccMann not only predicted sensitivity more accurately than alternative tools, it also offered explainability, highlighting which specific genes and which portions of the compound’s molecular structure it paid the most attention to while making its predictions. Researchers can use this information as a guide to help them improve or repurpose existing drugs, as well as to develop new ones.
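The idea of reading explanations off attention weights can be illustrated with a minimal sketch. This is not PaccMann's implementation: the gene names, expression values and attention scores below are made up for illustration, and in a real model the scores would be produced by learned layers rather than written by hand. The sketch only shows how a softmax turns scores into weights that sum to one, and how ranking those weights points to the inputs the model attended to most.

```python
import math


def softmax(scores):
    """Convert raw scores into non-negative weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(features, scores):
    """Return attention weights and the attention-weighted features."""
    weights = softmax(scores)
    weighted = [w * f for w, f in zip(weights, features)]
    return weights, weighted


def top_attended(names, weights, k=2):
    """Rank inputs by attention weight, highest first."""
    ranked = sorted(zip(names, weights), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical inputs: placeholder gene names and values, not real data.
genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
expression = [0.2, 1.5, -0.7, 0.9]
scores = [0.1, 2.0, -1.0, 1.0]  # assumption: would be learned in a real model

weights, weighted = attend(expression, scores)
for name, weight in top_attended(genes, weights):
    print(f"{name}: attention weight {weight:.3f}")
```

The same ranking step could be applied to the tokens of a compound's structural representation, which is how an attention-based model can point to portions of a molecule as well as to genes.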

Researchers can try it out now and predict anti-cancer drug sensitivity with our web service. More details about PaccMann are available on the project web site, along with the open source implementation.

Using PaccMann, the sensitivity of a cell line to a candidate drug can be predicted with high accuracy.

INtERAcT: A tool to automate knowledge extraction from scientific publications

INtERAcT (Interaction Network infErence from vectoR representATions of words) is a novel approach to extract information on protein-protein interactions from scientific papers in a completely unsupervised way.

Normally, the way proteins interact is well-regulated. In diseases like cancer, however, disturbed biological processes are reflected in, and possibly caused by, altered protein-protein interactions. A huge amount of knowledge on these crucial protein interactions is buried in the unstructured text, images and charts of scientific publications. While comprehensive knowledge of protein interactions is fundamental in biomedical research, it would be overwhelming for any scientist to try to access that information by actually reading through all existing publications. According to a recent bibliometric study, around 17,000 scientific articles are published on average every year in the field of cancer research alone, and the output keeps growing exponentially.

INtERAcT leverages the concept of word embeddings[v] to process text from a large body of biomedical publications and defines a new metric to quantify the interactions between proteins. The most remarkable feature of INtERAcT is that no annotation or manual curation of the text is needed. In a recent study, we compared INtERAcT with a number of standard metrics, and it significantly outperformed those metrics on a collection of 10 cancer types using the database STRING as a reference[vi]. Moreover, when benchmarked against the text-mined interactions contained in STRING, INtERAcT showed a stronger correlation with interactions validated using different evidence levels, including experiments.
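To make the idea of scoring protein pairs from word embeddings concrete, here is a minimal sketch. It uses plain cosine similarity, a standard baseline, and not the metric that INtERAcT defines. The protein names and three-dimensional vectors are hypothetical; real embeddings would be learned from a large text corpus and have many more dimensions.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def score_pairs(embeddings):
    """Score every unordered pair of names by embedding similarity."""
    return {
        (a, b): cosine(embeddings[a], embeddings[b])
        for a, b in combinations(sorted(embeddings), 2)
    }


# Hypothetical embeddings: placeholder names and toy vectors.
embeddings = {
    "PROT_A": [0.9, 0.1, 0.0],
    "PROT_B": [0.8, 0.2, 0.1],
    "PROT_C": [0.0, 0.1, 0.9],
}

pair_scores = score_pairs(embeddings)
best_pair = max(pair_scores, key=pair_scores.get)
print("Highest-scoring candidate interaction:", best_pair)
```

No labels are used anywhere in this sketch, which mirrors the unsupervised setting described above: the scores come only from how the words are represented, not from annotated examples of interactions.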

A particular strength of INtERAcT is its capability to infer interactions in the context of a specific disease. The comparison with the normal interactions in healthy tissue may potentially help to obtain insight into the disease mechanisms.

INtERAcT is open source and available as a web service. Learn all about it on the project web site.

By sifting through unstructured text data in scientific publications, INtERAcT is able to extract valuable information on protein-protein interactions much faster than any human could.

PIMKL: Towards interpretable phenotype prediction using pathway information

Predicting disease progression based on molecular data obtained from diseased tissue samples and stratifying, or classifying, patients accordingly is a crucial step to help clinicians better personalize and design effective treatments. Although a number of machine learning algorithms have been proposed to solve this task, many of them have failed to provide interpretable predictions, a key requirement for the adoption of such methodologies in the health care industry.

IBM has created PIMKL (pathway-induced multiple kernel learning), a new machine learning algorithm that combines high predictive performance with interpretability when predicting phenotypes from molecular data. PIMKL achieves this by exploiting prior knowledge on molecular interactions. Specifically, we feed PIMKL information on molecular networks and pathways describing well-defined biomolecular processes. Using a machine learning technique known as multiple kernel learning, PIMKL identifies molecular pathways that are important for the classification of patient groups. Thanks to the interpretability of the model, the insights it provides on differences between patient groups could therefore lead to a better understanding of cancer progression.
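The combination of one kernel per pathway into a single weighted kernel can be sketched as follows. This is an illustration under stated assumptions, not the PIMKL implementation: each pathway is assumed to be a small graph over gene indices, its kernel is assumed to be the bilinear form given by the unnormalized graph Laplacian L = D - A, and the pathway weights are fixed by hand, whereas a multiple kernel learning algorithm would learn them from labeled data. Pathway names, edges and sample values are hypothetical.

```python
def pathway_kernel(x, y, edges):
    """Kernel x^T L y for the graph Laplacian L = D - A of one pathway.

    For an undirected graph this equals the sum over edges (i, j) of
    (x[i] - x[j]) * (y[i] - y[j]), which avoids building L explicitly.
    """
    return sum((x[i] - x[j]) * (y[i] - y[j]) for i, j in edges)


def combined_kernel(x, y, pathways, weights):
    """Weighted sum of the per-pathway kernels."""
    return sum(
        weights[name] * pathway_kernel(x, y, edges)
        for name, edges in pathways.items()
    )


# Hypothetical pathways: edges between gene indices 0..3.
pathways = {
    "pathway_1": [(0, 1), (1, 2)],
    "pathway_2": [(2, 3)],
}
# Assumption: hand-picked weights; in multiple kernel learning these are learned.
weights = {"pathway_1": 0.7, "pathway_2": 0.3}

# Toy molecular profiles for three samples (four genes each).
samples = [
    [1.0, 0.5, 0.0, 2.0],
    [0.8, 0.6, 0.1, 1.9],
    [0.0, 1.0, 2.0, 0.0],
]

# Gram matrix that a kernel classifier could consume.
gram = [[combined_kernel(a, b, pathways, weights) for b in samples] for a in samples]

# The weights double as an interpretable ranking of pathways.
ranking = sorted(weights, key=weights.get, reverse=True)
print("Pathways by weight:", ranking)
```

Because each kernel corresponds to a named pathway, the learned weights can be read directly as a statement about which pathways matter for separating the patient groups, which is where the interpretability comes from.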

In our study, we specifically addressed the task of predicting whether a breast cancer patient will suffer a relapse within five years after first treatment. In order to benchmark PIMKL, we compared it to 14 other similar algorithms that have been previously applied to six breast cancer cohorts. PIMKL consistently outperformed its counterparts or ranked among the top-performing algorithms on each single cohort.[vii]

The molecular signatures identified by PIMKL also turned out to be highly relevant when we applied the algorithm to unseen data from a cohort different from the one used for training. This indicates that the knowledge obtained on disease progression can be transferred across datasets and cohorts. A particular strength of PIMKL showed up when we trained it with noisy data from different omic layers, i.e. genetic, epigenetic or protein data. Even then, PIMKL proved able to discard the noise and select the most informative data without reducing its performance.

We deployed PIMKL on the IBM Cloud. Researchers can access the web service or use the open source code to run experiments on their own data and obtain stable molecular signatures. Learn more about PIMKL on the project web site.

PIMKL exploits a machine learning technique called multiple kernel learning and prior knowledge on molecular interactions to predict phenotypes from molecular data of diverse origins.

The three algorithms presented here demonstrate how machine learning approaches can be exploited to advance biomedical research on complex diseases such as cancer. Our work also shows that it is possible to incorporate explainability into the algorithms, thereby reinforcing trust while also guiding the search for the underlying disease mechanisms. We continue working to further refine these solutions. By making them publicly available, we hope to maximize their positive impact in the scientific community.


[i] Global, regional, and national life expectancy, all-cause mortality, and cause-specific mortality for 249 causes of death, 1980–2015: a systematic analysis for the Global Burden of Disease Study 2015, Lancet. 2016 Oct 8; 388(10053): 1459–1544.

[ii] Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries, CA Cancer J Clin. 2018 Nov;68(6):394-424.

[iii] Research and Development Spending to Bring a Single Cancer Drug to Market and Revenues After Approval, JAMA Intern Med. 2017 Nov; 177(11): 1569–1575.

[iv] Matteo Manica, Ali Oskooei, Jannis Born, Vigneshwari Subramanian, Julio Sáez-Rodríguez, María Rodríguez Martínez, “Towards Explainable Anticancer Compound Sensitivity Prediction via Multimodal Attention-based Convolutional Encoders”, Workshop on Computational Biology ICML, 2019.

[v] Mikolov, T., Sutskever, I., Chen, K., Corrado, G. & Dean, J. Distributed representations of words and phrases and their compositionality. In Proc. 26th International Conference on Neural Information Processing Systems, NIPS’13 3111–3119

[vi] Matteo Manica, Roland Mathis, Joris Cadow, María Rodríguez Martínez, “Context-specific interaction networks from vector representation of words”, Nature Machine Intelligence 1(4), 181–190, 2019.

[vii] Matteo Manica, Joris Cadow, Roland Mathis, María Rodríguez Martínez, “PIMKL: Pathway-Induced Multiple Kernel Learning”, npj Systems Biology and Applications 5(1), 8, 2019.


Research Staff Member in Cognitive Health Care and Life Sciences, IBM Research-Zurich

Joris Cadow

Data Scientist, IBM Research-Zurich
