DREAM Challenge results: Can machine learning help improve accuracy in breast cancer screening?

Breast cancer is the most common cancer in women. It is estimated that one out of eight women will be diagnosed with breast cancer in their lifetime. The good news is that 99 percent of women whose breast cancer was detected early (stage 1 or 0) survive beyond five years after diagnosis [1], leading countries around the world to implement breast cancer screening programs for early detection.

Mammography screening, however, is not a perfect procedure. Of the 40 million women who undergo annual mammography screening for breast cancer in the United States, an estimated 4 million are unnecessarily called back for further testing. This high rate of false-positive exams causes added anxiety and unnecessary biopsies, and may lead to overtreatment such as surgical excision. There is also an estimated false-negative rate of 13 percent, that is, cases of breast cancer that radiologists fail to detect [2].

At IBM Research, we’ve made strides in harnessing cognitive computing and machine learning to examine medical images for insights that help health professionals treat patients. We therefore wondered whether these technologies could be used to increase the accuracy of mammography screening. To find out, we co-organized a coalition of oncology and technology partners to pose this challenge to the science community.

The Digital Mammography DREAM Challenge is an open-science, crowdsourced challenge that invited the machine learning community to contribute their algorithms to address this important medical problem. Funded in part by the Laura and John Arnold Foundation (LJAF), the Challenge will award up to $1.2 million in cash prizes to the teams that develop predictive algorithms achieving milestones related to reducing the recall rate of mammography screening. The Challenge is one of the activities supporting the White House’s Cancer Moonshot Initiative.

The Challenge started with a competitive phase that gave a worldwide community of machine learning practitioners access to more than 640,000 de-identified digital mammography images, corresponding to 146,000 mammography exams of 86,000 women, along with demographic, clinical and longitudinal data contributed by Kaiser Permanente Washington. An independent dataset of 15,000 images from 3,200 exams of 1,400 women was provided by the Icahn School of Medicine at Mount Sinai. These data were used to challenge the more than 1,150 coders, developers and scientists who registered to participate. The Challenge sought methods to determine:

  1. The cancer status of each breast of a subject, given only a screening digital mammography exam (without access to previous exams or clinical/demographic information).
  2. The cancer status of each breast of a subject, given a screening exam, a panel of clinical/demographic information, and if available, previous screening exam(s).

The models submitted in response to these questions were objectively scored using independent validation datasets, with scores posted on the Challenge leaderboard throughout the competitive phase. The results of the first three rounds of submissions are very encouraging (see Figure 1). Notably, the first, second and third place teams gained predictive accuracy as the Challenge progressed.

On March 28, the organizers provided further data for training, and the final phase of the competitive Challenge started, in which models were tested against a completely new independent validation dataset (see Figure 1). The results of this validation were very close to the best results in Round 3, showing that the predictive ability of the algorithms, which had been trained on a different dataset, generalizes to new cases.

Figure 1. The results of submissions for the teams placing first, second and third in the three rounds that finished on March 28, 2017, and in an independent validation set. The performance metric was the area under the ROC curve (AUC), a common metric for evaluating classification performance in machine learning. An AUC of 0.5 means the classification is no better than random, whereas a perfect classification would get an AUC of 1.0. The best team got an AUC of 0.9 in the third round, which indicates excellent performance, and an AUC of 0.87 in the independent validation set, which indicates good generalization ability.
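
To make the AUC metric in Figure 1 concrete, here is a minimal Python sketch. It is not the Challenge's scoring code. It uses the standard rank-based reading of AUC: the probability that a randomly chosen cancer case receives a higher score than a randomly chosen non-cancer case. The function name `roc_auc` and the toy labels and scores are illustrative assumptions.

```python
def roc_auc(labels, scores):
    """Return the AUC for binary labels (1 = cancer, 0 = no cancer).

    Computed as the fraction of (positive, negative) pairs in which the
    positive case gets the higher score; ties count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data (hypothetical): two non-cancer and two cancer cases.
labels = [0, 0, 1, 1]
print(roc_auc(labels, [0.1, 0.2, 0.8, 0.9]))   # perfect ranking
print(roc_auc(labels, [0.5, 0.5, 0.5, 0.5]))   # uninformative scores
print(roc_auc(labels, [0.1, 0.4, 0.35, 0.8]))  # one mis-ordered pair out of four
```

The pairwise loop is quadratic in the number of cases, which is fine for illustration; at the scale of the Challenge data a sort-based implementation would be the practical choice.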

The leaders of the best-performing teams in the final validation round were Yaroslav Nikulin and Yuanfang Guan. Yaroslav Nikulin, of the French imaging company Therapixel, led the team that received top honors for their work on the first task and tied for first place in the second task. In the first task, they developed an algorithm with a predictive accuracy of 80.3 percent, which is 5 percent more accurate than the runner-up. In the second task, Nikulin and his team developed an algorithm that was 80.4 percent accurate. Tied for first place in the second task was Yuanfang Guan, Assistant Professor in the Department of Computational Medicine and Bioinformatics at the University of Michigan, Ann Arbor. Dr. Guan developed an algorithm with a predictive accuracy of 77.5 percent and outperformed the runner-up by more than 2 percent. Though the difference in accuracy between Guan’s and Nikulin’s teams was 2.9 percent, their performance was indistinguishable in the other metrics used to score the algorithms. Both winning teams used deep learning, one of the most advanced artificial intelligence techniques for analyzing and interpreting images.

The Challenge brought together researchers with different backgrounds. For example, Yaroslav Nikulin is an engineer at Therapixel, a company that aims to improve mammography reading by providing radiologists with AI-based assistive algorithms. Yuanfang Guan, on the other hand, is a veteran DREAM participant who has won ten previous Challenges but was more of a novice in this field. In Guan’s words: “In the past 4 years, I have gained tremendously through these challenges. I had wanted to learn computer vision for many years now, and the Digital Mammography Challenge finally set me off to do so, and gave an objective evaluation of whether I had learned properly.”

Other teams achieved good performance during the leaderboard and final validation runs. The first, second and third teams during the leaderboard and validation phases will receive monetary awards that add up to $200,000.

Now that the competitive phase of the Digital Mammography DREAM Challenge has ended, we are starting the collaborative phase, in which the best performing teams are invited to work together to achieve improved results. These participants will have a chance to receive a cash prize of $1M to be divided amongst the teams if they are able to demonstrate significant improvement over the results of the competitive phase.

Following the completion of the collaborative phase, a scientific paper will be written outlining the results of the Challenge in comparison with radiologist performance. The next phase, exploring whether these algorithms can be of clinical utility, will then begin. As is customary in DREAM Challenges, the top-performing algorithms will become public, open-source material that any party can potentially develop into clinically usable products.

The best performing teams will also present their work at the DREAM conference in November 2017, where the community will engage in a dialogue of the lessons learned and the path forward to improve algorithms to interpret the images produced in mammography screening.

Another notable component of the Digital Mammography DREAM Challenge is the IBM SoftLayer Cloud/Watson Health Cloud solution developed by IBM researchers in collaboration with co-organizers from Sage Bionetworks. The solution was implemented using Docker container technology, which allowed Challenge participants to submit their machine learning code, along with the environment needed to run it, through Sage Bionetworks’ collaborative Synapse challenge platform. Participants could thus train their algorithms without downloading the data, so data that could not be fully shared could still be used by a wide community to solve data-intensive problems.

In addition to setting up the cloud infrastructure, teams from IBM Research in Yorktown, Haifa and Australia created the data splits that produced separate datasets for training, validation and testing, and determined whether the data contained enough signal to proceed with the Challenge.
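
The post does not describe how the splits were built. As a hedged illustration only, the Python sketch below shows one common approach for data in which a woman can have several exams: assign whole subjects, rather than individual exams, to training, validation and test sets, so that no subject appears in more than one set. The function `split_by_subject`, the split fractions and the identifiers are all hypothetical.

```python
import random


def split_by_subject(exam_to_subject, fractions=(0.7, 0.15, 0.15), seed=0):
    """Assign exams to train/validation/test so each subject is in one split.

    exam_to_subject maps an exam ID to the ID of the woman it belongs to.
    The fractions are illustrative defaults, not the Challenge's values.
    """
    subjects = sorted(set(exam_to_subject.values()))
    random.Random(seed).shuffle(subjects)  # reproducible shuffle

    n_train = int(fractions[0] * len(subjects))
    n_val = int(fractions[1] * len(subjects))
    groups = {
        "train": set(subjects[:n_train]),
        "validation": set(subjects[n_train:n_train + n_val]),
        "test": set(subjects[n_train + n_val:]),  # remainder
    }

    splits = {name: [] for name in groups}
    for exam, subject in exam_to_subject.items():
        for name, members in groups.items():
            if subject in members:
                splits[name].append(exam)
                break
    return splits


# Hypothetical data: 20 subjects with two exams each.
exams = {f"exam{s}_{k}": f"subject{s}" for s in range(20) for k in range(2)}
splits = split_by_subject(exams)
print({name: len(ids) for name, ids in splits.items()})
```

Splitting at the subject level keeps earlier and later exams of the same woman from landing on both sides of the train/test boundary, which would otherwise inflate measured performance.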

About DREAM Challenges

Challenge leadership was provided by IBM Research through the DREAM Challenges and by Sage Bionetworks. The DREAM (Dialogue for Reverse Engineering Assessment and Methods) Challenges, founded in 2006 by IBM Researcher Gustavo Stolovitzky, are a community-driven, open-science, crowdsourcing effort that has hosted about 45 Challenges so far, addressing questions that range from predictive models for disease progression to models of cell signaling networks. DREAM Challenges are organized and implemented by groups of scientists from academia, industry, government and non-profit organizations working together. In 2013 DREAM partnered with Sage Bionetworks, a non-profit research organization that seeks to accelerate health research through the creation of platforms that empower patients and researchers to share and interpret data on a massive scale. One of the technologies developed by Sage Bionetworks is Synapse, a collaborative platform where all DREAM Challenges take place. The partnership between the DREAM Challenges and Sage Bionetworks has allowed this open-science community effort to take on tougher and more complicated challenges like the Digital Mammography DREAM Challenge.



[1] National Cancer Institute, Cancer Stat Facts – Female Breast Cancer
[2] Breast Cancer Surveillance Consortium, Sensitivity, Specificity, and False negative rate for 1,682,504 Screening Mammography Examinations from 2007 – 2013 – Based on BCSC data through 2013

